Skip to content

[api][routing] Pluggable in-chat LLM routing (ChatModelRouter + RoutingStrategy)#852

Open
purushah wants to merge 2 commits into
apache:mainfrom
purushah:routing-pr
Open

[api][routing] Pluggable in-chat LLM routing (ChatModelRouter + RoutingStrategy)#852
purushah wants to merge 2 commits into
apache:mainfrom
purushah:routing-pr

Conversation

@purushah

@purushah purushah commented Jun 15, 2026

Copy link
Copy Markdown

What is the purpose of the change

Adds a drop-in chat model that routes each request to the best underlying model, then delegates to it. The router is a CHAT_MODEL resource, so an agent points at it by name with no change to the runtime, events, or agent definition.

This is the in-chat selector (which LLM serves a single chat() call). A DataStream-level content-based agent router (branching records across agent operators) is a separate, follow-up concern.

Brief change log

  • RoutingStrategy — pluggable selection SPI (request -> candidate name). Selection is a pure concern; returning null means "abstain / no opinion".
  • ChatModelRouter — orchestrates select → (optional cache) → validate → delegate. A strategy that abstains (null) or names a non-candidate is a routing miss and degrades to the configured default candidate (validated at construction; defaults to the first candidate) rather than failing the request.
  • FallbackPolicy — optional: try remaining candidates on error.
  • CachingStrategy — optional bounded-LRU memoization of the decision per conversation, so an expensive strategy (e.g. an LLM judge) runs once per conversation, not once per tool-call round. Abstentions (null) are never cached.
  • Built-in strategies:
    • RuleBasedRoutingStrategy — deterministic keyword/regex rules + default.
    • LlmRoutingStrategy — a small "judge" model picks the candidate from each candidate's name/description (RouteLLM-style). Distinguishes a transient judge failure (abstain → retried next round, uncached) from an unparseable reply (deterministic default). Parses by whole-token match (no substring mis-routing, e.g. gpt-4o-mini won't match a gpt-4 candidate).
  • Bring-your-own strategies are first-class: implement RoutingStrategy and reference it by fully-qualified class name; loaded via the thread context classloader (cluster-safe). ML/learned routing is supported the same way.
  • Adds LlmRoutingAgentExample and unit tests.

Verifying this change

This change adds tests and can be verified as follows:

  • Unit tests under api/.../chat/model/routing/ covering rule selection, judge parsing (whole-token match), stickiness across tool-call rounds, fallback, caching (incl. abstain-not-cached), routing-miss degrade-to-default, and bring-your-own loading. All pass; spotless:check clean (JDK 17).

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API: yes — adds the org.apache.flink.agents.api.chat.model.routing package (additive; no existing API changed).
  • The serializers: no
  • The runtime per-record code paths: no (router is a CHAT_MODEL resource resolved by name)
  • Anything that affects deployment or recovery: no — preserves exactly-once / keyed-state / checkpoint semantics (no new operator, no nested invocation).

Security note

An LLM/ML routing decision is a hint, not an authority — the user's message is sent to the judge model, so a routing decision is susceptible to prompt injection. Cost/privilege/safety controls must not be gated solely on it. This is documented on LlmRoutingStrategy.

Documentation

  • New public package is documented via javadoc on each type. Built-in strategies, the abstain/routing-miss contract, and the bring-your-own extension point are described on the SPI.

Documentation

  • doc-needed
  • doc-not-needed
  • doc-included

@github-actions github-actions Bot added doc-label-missing The Bot applies this label either because none or multiple labels were provided. fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue. labels Jun 15, 2026
…ngStrategy)

Add a drop-in chat model that selects which underlying model serves each
request, then delegates to it. The router is a CHAT_MODEL resource, so an
agent points at it by name with no runtime, event, or agent-definition change.

Selection is a pluggable SPI (`RoutingStrategy`), decomposed into orthogonal
concerns:

- RoutingStrategy — pure selection (request -> candidate name). Returning null
  means "abstain / no opinion".
- FallbackPolicy — optional: try remaining candidates on error.
- CachingStrategy — optional bounded-LRU memoization of the decision per
  conversation, so an expensive strategy (e.g. an LLM judge) runs once per
  conversation rather than once per tool-call round.

Built-in strategies:

- RuleBasedRoutingStrategy — deterministic keyword/regex rules + default.
- LlmRoutingStrategy — a small "judge" model picks the candidate from each
  candidate's name/description (RouteLLM-style).

Bring-your-own strategies are first-class: implement RoutingStrategy and
reference it by fully-qualified class name; loaded via the thread context
classloader (cluster-safe). ML/learned routing is supported the same way.

Routing-miss semantics: a strategy that abstains (null) or names a
non-candidate degrades to the configured `default` candidate (validated at
construction; defaults to the first candidate) rather than failing the
request. The LLM judge distinguishes a transient failure (abstain -> not
cached, retried next round) from an unparseable reply (deterministic default).

Security: an LLM/ML routing decision is a hint, not an authority — the user's
message is sent to the judge, so cost/privilege/safety must not be gated
solely on it (prompt-injection risk). This is documented on the strategy.

Includes an example (LlmRoutingAgentExample) and unit tests covering rule
selection, judge parsing (whole-token match, no substring mis-routing),
stickiness, fallback, caching (incl. abstain-not-cached), and bring-your-own.

Also mirror the RULE_BASED/LLM ResourceName constants on the Python side
(ResourceName.RoutingStrategy.Java) and register RoutingStrategy in the
cross-language ResourceName parity check.
@github-actions github-actions Bot added doc-needed Your PR changes impact docs. and removed doc-label-missing The Bot applies this label either because none or multiple labels were provided. labels Jun 15, 2026

@weiqingy weiqingy left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on, @purushah. A few questions inline.

* which resolves its own). Override to skip the base class's connection resolution.
*/
@Override
public void open() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth calling out the design here: the three-way split into selection / resilience / caching (RoutingStrategy / FallbackPolicy / CachingStrategy) lands the right seams, and the drop-in CHAT_MODEL story holds up against the real ChatModelAction + ResourceCache path. One non-obvious correctness point stood out at this open(): the no-op is safe precisely because a routed candidate is lazily open()-ed by ResourceCache.getResource() when it's resolved, so its connection is non-null before chat() runs. Documenting that invariant in the Javadoc rather than leaving it implicit is exactly right — it's the kind of thing that bites a future maintainer if it only lives in someone's head.

// decision. Recompute each time instead.
return delegate.route(context);
}
String cached = cache.get(key);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check-then-act here is non-atomic: cache.get(key)delegate.route(context)cache.put(key, chosen). Two async-pool threads handling the same conversation key on its first touch can both miss the cache and both invoke delegate.route() — i.e. two judge-model calls for the same conversation. The map itself is synchronizedMap, so there's no corruption, and the redundant compute is benign (last-writer-wins on the same key), so I wouldn't add locking for it — the cost isn't worth it. My only question is about the wording: the class Javadoc here and the LlmRoutingStrategy Javadoc ("the judge runs once per conversation") read as a hard guarantee, where this is really best-effort memoization. Would it be worth softening to "typically once per conversation" (or a one-line note that a concurrent first-touch can double-compute), so the contract matches the behavior? Or is the stronger phrasing intentional?


/** A {@link ResourceContext} backed by a fixed name → resource map. */
static ResourceContext context(Map<String, Resource> byName) {
return ResourceContext.fromGetResource(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fakes resolve through ResourceContext.fromGetResource(...), which returns the registered resource directly — unlike the runtime ResourceCache.getResource(), which calls open() on the resolved resource before handing it back (ResourceCache.java:125). The upshot is that the load-bearing production invariant — a routed candidate gets open()-ed, so its connection is non-null before chat() — isn't exercised by any unit test; it rests on the ResourceCache path plus the e2e example. That's a structural property of resolving fakes without a real cache, not a defect in this PR. Does the e2e LlmRoutingAgentExample exercise the real-candidate (open()-then-chat()) path in CI, so the invariant is covered somewhere? If it doesn't, would a single router test backed by a real ResourceCache and a candidate that actually needs open() be worth adding, to pin the one fact the whole drop-in design depends on?

# Milvus
MILVUS_VECTOR_STORE = "org.apache.flink.agents.integrations.vectorstores.milvus.MilvusVectorStore"

class RoutingStrategy:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only Python surface the feature gets — the mirrored RoutingStrategy.Java.{RULE_BASED, LLM} FQCN constants, with no Python ChatModelRouter, no Python RoutingStrategy SPI, and no Python tests. CLAUDE.md asks for Java/Python parity, though that's scoped to "when changing shared logic," and this PR adds a new Java resource resolved by name rather than touching shared logic. The mirrored constants read as a deliberate bridge — a Python agent points a CHAT_MODEL at the Java router by name. Is that the intended Python story for v1 ("reference the Java router by name," with a native Python SPI as a possible follow-up), or is Python routing out of scope for now? Asking mainly so the intent is on the record — the constants imply Python users are expected to reach this, so it'd help to state how far that's meant to go.

private final Map<String, Object> metadata;

public RoutingCandidate(String name, String description, Map<String, Object> metadata) {
this.name = Objects.requireNonNull(name, "candidate name must not be null");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ctor null-checks name but lets an empty string through, and in LlmRoutingStrategy.parseChoice the whole-token regex for an empty quoted name over-matches almost any boundary → mis-route. It takes a pathological config (an empty candidate name) to hit, so low priority — a one-line name.isEmpty() guard alongside the requireNonNull closes it cleanly.

…at test

Review follow-ups from @weiqingy on the routing PR:

- ChatModelRouter.open(): document the load-bearing invariant the no-op relies
  on — a routed candidate is lazily open()-ed by ResourceCache.getResource() on
  first resolution, so its connection is non-null before chat() runs.
- CachingStrategy / LlmRoutingStrategy: soften "runs once per conversation" to
  "typically once" and document that memoization is best-effort (a concurrent
  first-touch on the same key can double-compute; synchronized map, last-writer-
  wins, benign — so no locking).
- RoutingCandidate: reject an empty name (not just null) — an empty name has no
  resolvable resource and would make LlmRoutingStrategy.parseChoice's whole-token
  match over-match arbitrary boundaries (mis-route).
- Tests: add ChatModelRouterTest cases pinning the open-before-chat invariant
  (candidate resolved through an opening ResourceContext, mirroring ResourceCache;
  plus the negative case proving it is load-bearing), and RoutingCandidateTest
  for the null/empty name guards.

39 routing tests pass; spotless:check clean under JDK 17.
@wenjin272 wenjin272 added fixVersion/0.4.0 and removed fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. labels Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-needed Your PR changes impact docs. fixVersion/0.4.0 priority/major Default priority of the PR or issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants