[e2e] Add nightly e2e test for submitting examples to flink standalone cluster by matrixsparse · Pull Request #708 · apache/flink-agents

matrixsparse · 2026-05-26T17:08:41Z

Purpose of change

Add automated e2e test for submitting Java/Python quickstart examples to a Flink standalone cluster, replacing the current manual verification process before each release.

Closes #642

Changes

e2e-test/test-scripts/test_submit_examples_to_flink.sh: Test script that installs Flink via install.sh, starts a standalone cluster, submits all 6 examples (3 Java + 3 Python), verifies submission success, and cleans up.
.github/workflows/nightly-e2e.yml: Nightly GitHub Actions workflow that runs the test daily at UTC 00:00, with manual trigger support.

Key design decisions

Uses tools/install.sh --non-interactive (from [tools]Import Wizard for Installation Setup #599) for Flink installation
Validates job submission success (not full execution), since examples depend on LLM APIs
Each example tested independently; one failure doesn't block others
Flink logs archived as artifacts on failure for debugging

matrixsparse · 2026-05-26T17:22:12Z

Hi @wenjin272, this PR implements the CI pipeline for #642 as discussed. Could you PTAL when you have time?

weiqingy

Thanks for taking this on — script reads cleanly. A few questions inline.

weiqingy · 2026-05-28T04:30:42Z

+on:
+  schedule:
+    - cron: '0 0 * * *'
+  workflow_dispatch:


Nightly + manual dispatch means a regression in examples/**, python/flink_agents/examples/**, or tools/install.sh can sit undetected for up to 24h. Would a path-filtered pull_request: trigger for those paths make sense here, with the cron staying as the safety net for transitive-dep changes? The Flink download + full build is non-trivial wall time per PR, so the nightly-only choice is defensible too — curious which trade-off you prefer.

Agreed. added a path-filtered pull_request trigger for those paths. The cron stays as the safety net for transitive-dep changes. The path filter is narrow enough that most PRs won't trigger it, so wall-time cost is acceptable.

weiqingy · 2026-05-28T04:30:42Z

+            failed=$((failed + 1))
+        fi
+    done
+    printf "\nTotal: %d   Passed: %d   Failed: %d\n" "$total" "$passed" "$failed"


If install_flink, build_project, stage_dist_jars, or start_cluster dies under set -e, no result is ever recorded, so print_summary walks an empty RESULT_NAMES and prints Total: 0 Passed: 0 Failed: 0 before cleanup propagates the original non-zero exit code. The CI job still fails on the exit code, but a person scanning the log sees a "zero failures" summary right before the red X, which is misleading when triaging a 45-minute nightly run.

One way it could read, if useful:

if (( total == 0 )); then log_error "Test setup failed before any example was submitted" return fi

right above the existing if (( failed > 0 )) check.

Fixed exactly as you suggested.

xintongsong

Thanks for working on this, @matrixsparse . It's a good idea to test with the example jobs nightly.

I'm not sure about only validates the job submission success. I think currently all example jobs can run with local LLMs in Ollama. That shouldn't be a problem against verifying the full execution. Did I miss anything?

xintongsong · 2026-05-31T08:01:34Z

+    log_ok "Staged: $(basename "$flink_jar")"
+}
+
+package_examples() {


I think build_project should have already built the examples. We should not need to re-build them.

Good catch. Removed the redundant package_examples().

You're right. I've updated the script to install Ollama, pull qwen3:8b, and wait for each job to reach FINISHED status instead of just verifying submission.

In addition to verifying the jobs reaching the FINISH status, I think we can also check for the error logs to identify if the job is running properly. Flink's e2e test already have it and we may copy / reuse those approaches. See test-scripts/common.sh.

Thanks for the suggestion! I've added a check_logs_for_errors() function that scans TaskManager/JobManager logs for exceptions after job completion, inspired by Flink's test-scripts/common.sh.

xintongsong · 2026-05-31T08:21:58Z

+    log_section "Step 6: submit Java examples"
+    submit_java_example "org.apache.flink.agents.examples.ReActAgentExample"
+    submit_java_example "org.apache.flink.agents.examples.WorkflowSingleAgentExample"
+    submit_java_example "org.apache.flink.agents.examples.WorkflowMultipleAgentExample"
+
+    log_section "Step 7: submit Python examples"
+    submit_python_example "$ROOT_DIR/python/flink_agents/examples/quickstart/react_agent_example.py"
+    submit_python_example "$ROOT_DIR/python/flink_agents/examples/quickstart/workflow_single_agent_example.py"
+    submit_python_example "$ROOT_DIR/python/flink_agents/examples/quickstart/workflow_multiple_agent_example.py"


Is it intended to not cover all the example jobs?

Yes, intentional. All 6 quickstart examples (3 Java + 3 Python) are covered. The RAG examples (python/flink_agents/examples/rag/) are excluded because they require a vector store and an embedding model that aren't provisioned in this CI setup. Added a comment in the script explaining this. We can add RAG coverage in a follow-up once vector store infrastructure is available in CI.

I think there are 5 examples in Java, and 6 examples in Python (5 in quickstart/ and 1 in rag/). And there could be more in future. Is is possible to iterate over the example directory and submit everything it finds? (Might need to reorganize the example directory to follow certain pattern.)

As for the rag example, it uses a local ollama embedding model and a local chroma vector store, so there should be no problem running it locally in CI.

Done. Examples are now auto-discovered (*Example.class / *_example.py), including RAG. CI uses qwen3:0.6b aliased to expected model names — configurable via OLLAMA_CHAT_MODEL env var.

wenjin272 · 2026-06-09T07:33:06Z

The nightly e2e tests threw the following exception:

Caused by: java.lang.annotation.IncompleteAnnotationException: org.apache.flink.agents.api.annotation.Action missing element listenEvents
    at java.base/sun.reflect.annotation.AnnotationInvocationHandler.invoke(AnnotationInvocationHandler.java:83)
    at com.sun.proxy.$Proxy21.listenEvents(Unknown Source)
    at org.apache.flink.agents.plan.AgentPlan.extractActionsFromAgent(AgentPlan.java:349)
    at org.apache.flink.agents.plan.AgentPlan.<init>(AgentPlan.java:138)
    at 
org.apache.flink.agents.runtime.env.RemoteExecutionEnvironment$RemoteAgentBuilder.apply(RemoteExecutionEnvironment.java:177)

In 0.3 we renamed the Action decorator/annotation parameter from listenEvents to listenEventTypes. However, the flink-agents that install.sh pulls into the Flink cluster is 0.2.1, so submitting a 0.3-based example to a standalone cluster causes a version mismatch.

I think we should replace the flink-agents in the cluster with the jar built from the current codebase. Concretely, we can first use install.sh to install Flink and flink-agents, then build flink-agents from the current branch and replace the flink-agents jar in FLINK_HOME/lib.

Since this issue is primarily aimed at testing the submission of the quickstart example, and version 0.3 has already entered code freeze, I am changing the fixVersion of this issue from 0.3.0 to 0.4.0. In version 0.4, we will focus on refining the maturity and stability of existing capabilities.

…e cluster

matrixsparse · 2026-06-09T16:22:28Z

@wenjin272 Good catch, thanks for the analysis! Fixed — the script now removes any pre-existing flink-agents-dist-*.jar from FLINK_HOME/lib/ before copying the freshly built one, ensuring the cluster always uses the jar from the current branch. Also adjusted the test to verify successful submission (job ID obtained) as the primary pass criterion, given the lightweight CI model limitations.

matrixsparse force-pushed the feature/e2e-test-flink-standalone branch from 8189bc8 to 704e45c Compare May 26, 2026 17:23

github-actions Bot added doc-label-missing The Bot applies this label either because none or multiple labels were provided. fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue. labels May 26, 2026

weiqingy reviewed May 28, 2026

View reviewed changes

xintongsong added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing The Bot applies this label either because none or multiple labels were provided. labels May 31, 2026

xintongsong reviewed May 31, 2026

View reviewed changes

matrixsparse force-pushed the feature/e2e-test-flink-standalone branch 4 times, most recently from b1de619 to 8e705b9 Compare June 7, 2026 03:40

wenjin272 added fixVersion/0.4.0 and removed fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. labels Jun 9, 2026

matrixsparse force-pushed the feature/e2e-test-flink-standalone branch 2 times, most recently from f18b52d to e1182cd Compare June 9, 2026 15:18

[e2e] Add nightly e2e test for submitting examples to flink standalon…

75ac60d

…e cluster

matrixsparse force-pushed the feature/e2e-test-flink-standalone branch from e1182cd to 75ac60d Compare June 9, 2026 15:40

wenjin272 mentioned this pull request Jun 10, 2026

[python] Normalize built-in tool-call context for checkpoint safety #828

Merged

3 tasks

weiqingy mentioned this pull request Jun 11, 2026

[Tech Debt] Add end-to-end checkpoint-recovery test for built-in tool-context #836

Open

Conversation

matrixsparse commented May 26, 2026

Purpose of change

Changes

Key design decisions

Uh oh!

matrixsparse commented May 26, 2026

Uh oh!

weiqingy left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

matrixsparse May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xintongsong left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wenjin272 commented Jun 9, 2026

Uh oh!

matrixsparse commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

weiqingy left a comment •

edited

Loading

matrixsparse May 31, 2026 •

edited

Loading