Skip to content

[e2e] Add nightly e2e test for submitting examples to flink standalone cluster#708

Open
matrixsparse wants to merge 1 commit into
apache:mainfrom
matrixsparse:feature/e2e-test-flink-standalone
Open

[e2e] Add nightly e2e test for submitting examples to flink standalone cluster#708
matrixsparse wants to merge 1 commit into
apache:mainfrom
matrixsparse:feature/e2e-test-flink-standalone

Conversation

@matrixsparse

Copy link
Copy Markdown
Contributor

Purpose of change

Add automated e2e test for submitting Java/Python quickstart examples to a Flink standalone cluster, replacing the current manual verification process before each release.

Closes #642

Changes

  • e2e-test/test-scripts/test_submit_examples_to_flink.sh: Test script that installs Flink via install.sh, starts a standalone cluster, submits all 6 examples (3 Java + 3 Python), verifies submission success, and cleans up.
  • .github/workflows/nightly-e2e.yml: Nightly GitHub Actions workflow that runs the test daily at UTC 00:00, with manual trigger support.

Key design decisions

  • Uses tools/install.sh --non-interactive (from [tools]Import Wizard for Installation Setup #599) for Flink installation
  • Validates job submission success (not full execution), since examples depend on LLM APIs
  • Each example tested independently; one failure doesn't block others
  • Flink logs archived as artifacts on failure for debugging

@matrixsparse

Copy link
Copy Markdown
Contributor Author

Hi @wenjin272, this PR implements the CI pipeline for #642 as discussed. Could you PTAL when you have time?

@matrixsparse matrixsparse force-pushed the feature/e2e-test-flink-standalone branch from 8189bc8 to 704e45c Compare May 26, 2026 17:23
@github-actions github-actions Bot added doc-label-missing The Bot applies this label either because none or multiple labels were provided. fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue. labels May 26, 2026

@weiqingy weiqingy left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on — script reads cleanly. A few questions inline.

on:
schedule:
- cron: '0 0 * * *'
workflow_dispatch:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nightly + manual dispatch means a regression in examples/**, python/flink_agents/examples/**, or tools/install.sh can sit undetected for up to 24h. Would a path-filtered pull_request: trigger for those paths make sense here, with the cron staying as the safety net for transitive-dep changes? The Flink download + full build is non-trivial wall time per PR, so the nightly-only choice is defensible too — curious which trade-off you prefer.

@matrixsparse matrixsparse May 31, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. added a path-filtered pull_request trigger for those paths. The cron stays as the safety net for transitive-dep changes. The path filter is narrow enough that most PRs won't trigger it, so wall-time cost is acceptable.

failed=$((failed + 1))
fi
done
printf "\nTotal: %d Passed: %d Failed: %d\n" "$total" "$passed" "$failed"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If install_flink, build_project, stage_dist_jars, or start_cluster dies under set -e, no result is ever recorded, so print_summary walks an empty RESULT_NAMES and prints Total: 0 Passed: 0 Failed: 0 before cleanup propagates the original non-zero exit code. The CI job still fails on the exit code, but a person scanning the log sees a "zero failures" summary right before the red X, which is misleading when triaging a 45-minute nightly run.

One way it could read, if useful:

if (( total == 0 )); then
    log_error "Test setup failed before any example was submitted"
    return
fi

right above the existing if (( failed > 0 )) check.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed exactly as you suggested.

@xintongsong xintongsong added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing The Bot applies this label either because none or multiple labels were provided. labels May 31, 2026

@xintongsong xintongsong left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, @matrixsparse . It's a good idea to test with the example jobs nightly.

I'm not sure about only validates the job submission success. I think currently all example jobs can run with local LLMs in Ollama. That shouldn't be a problem against verifying the full execution. Did I miss anything?

log_ok "Staged: $(basename "$flink_jar")"
}

package_examples() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think build_project should have already built the examples. We should not need to re-build them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Removed the redundant package_examples().

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I've updated the script to install Ollama, pull qwen3:8b, and wait for each job to reach FINISHED status instead of just verifying submission.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to verifying the jobs reaching the FINISH status, I think we can also check for the error logs to identify if the job is running properly. Flink's e2e test already have it and we may copy / reuse those approaches. See test-scripts/common.sh.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I've added a check_logs_for_errors() function that scans TaskManager/JobManager logs for exceptions after job completion, inspired by Flink's test-scripts/common.sh.

Comment on lines +328 to +336
log_section "Step 6: submit Java examples"
submit_java_example "org.apache.flink.agents.examples.ReActAgentExample"
submit_java_example "org.apache.flink.agents.examples.WorkflowSingleAgentExample"
submit_java_example "org.apache.flink.agents.examples.WorkflowMultipleAgentExample"

log_section "Step 7: submit Python examples"
submit_python_example "$ROOT_DIR/python/flink_agents/examples/quickstart/react_agent_example.py"
submit_python_example "$ROOT_DIR/python/flink_agents/examples/quickstart/workflow_single_agent_example.py"
submit_python_example "$ROOT_DIR/python/flink_agents/examples/quickstart/workflow_multiple_agent_example.py"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intended to not cover all the example jobs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, intentional. All 6 quickstart examples (3 Java + 3 Python) are covered. The RAG examples (python/flink_agents/examples/rag/) are excluded because they require a vector store and an embedding model that aren't provisioned in this CI setup. Added a comment in the script explaining this. We can add RAG coverage in a follow-up once vector store infrastructure is available in CI.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are 5 examples in Java, and 6 examples in Python (5 in quickstart/ and 1 in rag/). And there could be more in future. Is is possible to iterate over the example directory and submit everything it finds? (Might need to reorganize the example directory to follow certain pattern.)

As for the rag example, it uses a local ollama embedding model and a local chroma vector store, so there should be no problem running it locally in CI.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Examples are now auto-discovered (*Example.class / *_example.py), including RAG. CI uses qwen3:0.6b aliased to expected model names — configurable via OLLAMA_CHAT_MODEL env var.

@matrixsparse matrixsparse force-pushed the feature/e2e-test-flink-standalone branch 4 times, most recently from b1de619 to 8e705b9 Compare June 7, 2026 03:40
@wenjin272

Copy link
Copy Markdown
Contributor

The nightly e2e tests threw the following exception:

Caused by: java.lang.annotation.IncompleteAnnotationException: org.apache.flink.agents.api.annotation.Action missing element listenEvents
    at java.base/sun.reflect.annotation.AnnotationInvocationHandler.invoke(AnnotationInvocationHandler.java:83)
    at com.sun.proxy.$Proxy21.listenEvents(Unknown Source)
    at org.apache.flink.agents.plan.AgentPlan.extractActionsFromAgent(AgentPlan.java:349)
    at org.apache.flink.agents.plan.AgentPlan.<init>(AgentPlan.java:138)
    at 
org.apache.flink.agents.runtime.env.RemoteExecutionEnvironment$RemoteAgentBuilder.apply(RemoteExecutionEnvironment.java:177)

In 0.3 we renamed the Action decorator/annotation parameter from listenEvents to listenEventTypes. However, the flink-agents that install.sh pulls into the Flink cluster is 0.2.1, so submitting a 0.3-based example to a standalone cluster causes a version mismatch.

I think we should replace the flink-agents in the cluster with the jar built from the current codebase. Concretely, we can first use install.sh to install Flink and flink-agents, then build flink-agents from the current branch and replace the flink-agents jar in FLINK_HOME/lib.

Since this issue is primarily aimed at testing the submission of the quickstart example, and version 0.3 has already entered code freeze, I am changing the fixVersion of this issue from 0.3.0 to 0.4.0. In version 0.4, we will focus on refining the maturity and stability of existing capabilities.

@wenjin272 wenjin272 added fixVersion/0.4.0 and removed fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. labels Jun 9, 2026
@matrixsparse matrixsparse force-pushed the feature/e2e-test-flink-standalone branch 2 times, most recently from f18b52d to e1182cd Compare June 9, 2026 15:18
@matrixsparse matrixsparse force-pushed the feature/e2e-test-flink-standalone branch from e1182cd to 75ac60d Compare June 9, 2026 15:40
@matrixsparse

Copy link
Copy Markdown
Contributor Author

@wenjin272 Good catch, thanks for the analysis! Fixed — the script now removes any pre-existing flink-agents-dist-*.jar from FLINK_HOME/lib/ before copying the freshly built one, ensuring the cluster always uses the jar from the current branch. Also adjusted the test to verify successful submission (job ID obtained) as the primary pass criterion, given the lightweight CI model limitations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-not-needed Your PR changes do not impact docs fixVersion/0.4.0 priority/major Default priority of the PR or issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Tech Debt] Add e2e test for submitting example to flink standalone cluster.

4 participants