[api][routing] Pluggable in-chat LLM routing (ChatModelRouter + RoutingStrategy)#852
[api][routing] Pluggable in-chat LLM routing (ChatModelRouter + RoutingStrategy)#852purushah wants to merge 2 commits into
Conversation
…ngStrategy) Add a drop-in chat model that selects which underlying model serves each request, then delegates to it. The router is a CHAT_MODEL resource, so an agent points at it by name with no runtime, event, or agent-definition change. Selection is a pluggable SPI (`RoutingStrategy`), decomposed into orthogonal concerns: - RoutingStrategy — pure selection (request -> candidate name). Returning null means "abstain / no opinion". - FallbackPolicy — optional: try remaining candidates on error. - CachingStrategy — optional bounded-LRU memoization of the decision per conversation, so an expensive strategy (e.g. an LLM judge) runs once per conversation rather than once per tool-call round. Built-in strategies: - RuleBasedRoutingStrategy — deterministic keyword/regex rules + default. - LlmRoutingStrategy — a small "judge" model picks the candidate from each candidate's name/description (RouteLLM-style). Bring-your-own strategies are first-class: implement RoutingStrategy and reference it by fully-qualified class name; loaded via the thread context classloader (cluster-safe). ML/learned routing is supported the same way. Routing-miss semantics: a strategy that abstains (null) or names a non-candidate degrades to the configured `default` candidate (validated at construction; defaults to the first candidate) rather than failing the request. The LLM judge distinguishes a transient failure (abstain -> not cached, retried next round) from an unparseable reply (deterministic default). Security: an LLM/ML routing decision is a hint, not an authority — the user's message is sent to the judge, so cost/privilege/safety must not be gated solely on it (prompt-injection risk). This is documented on the strategy. Includes an example (LlmRoutingAgentExample) and unit tests covering rule selection, judge parsing (whole-token match, no substring mis-routing), stickiness, fallback, caching (incl. abstain-not-cached), and bring-your-own. Also mirror the RULE_BASED/LLM ResourceName constants on the Python side (ResourceName.RoutingStrategy.Java) and register RoutingStrategy in the cross-language ResourceName parity check.
| * which resolves its own). Override to skip the base class's connection resolution. | ||
| */ | ||
| @Override | ||
| public void open() { |
There was a problem hiding this comment.
Worth calling out the design here: the three-way split into selection / resilience / caching (RoutingStrategy / FallbackPolicy / CachingStrategy) lands the right seams, and the drop-in CHAT_MODEL story holds up against the real ChatModelAction + ResourceCache path. One non-obvious correctness point stood out at this open(): the no-op is safe precisely because a routed candidate is lazily open()-ed by ResourceCache.getResource() when it's resolved, so its connection is non-null before chat() runs. Documenting that invariant in the Javadoc rather than leaving it implicit is exactly right — it's the kind of thing that bites a future maintainer if it only lives in someone's head.
| // decision. Recompute each time instead. | ||
| return delegate.route(context); | ||
| } | ||
| String cached = cache.get(key); |
There was a problem hiding this comment.
The check-then-act here is non-atomic: cache.get(key) … delegate.route(context) … cache.put(key, chosen). Two async-pool threads handling the same conversation key on its first touch can both miss the cache and both invoke delegate.route() — i.e. two judge-model calls for the same conversation. The map itself is synchronizedMap, so there's no corruption, and the redundant compute is benign (last-writer-wins on the same key), so I wouldn't add locking for it — the cost isn't worth it. My only question is about the wording: the class Javadoc here and the LlmRoutingStrategy Javadoc ("the judge runs once per conversation") read as a hard guarantee, where this is really best-effort memoization. Would it be worth softening to "typically once per conversation" (or a one-line note that a concurrent first-touch can double-compute), so the contract matches the behavior? Or is the stronger phrasing intentional?
|
|
||
| /** A {@link ResourceContext} backed by a fixed name → resource map. */ | ||
| static ResourceContext context(Map<String, Resource> byName) { | ||
| return ResourceContext.fromGetResource( |
There was a problem hiding this comment.
The fakes resolve through ResourceContext.fromGetResource(...), which returns the registered resource directly — unlike the runtime ResourceCache.getResource(), which calls open() on the resolved resource before handing it back (ResourceCache.java:125). The upshot is that the load-bearing production invariant — a routed candidate gets open()-ed, so its connection is non-null before chat() — isn't exercised by any unit test; it rests on the ResourceCache path plus the e2e example. That's a structural property of resolving fakes without a real cache, not a defect in this PR. Does the e2e LlmRoutingAgentExample exercise the real-candidate (open()-then-chat()) path in CI, so the invariant is covered somewhere? If it doesn't, would a single router test backed by a real ResourceCache and a candidate that actually needs open() be worth adding, to pin the one fact the whole drop-in design depends on?
| # Milvus | ||
| MILVUS_VECTOR_STORE = "org.apache.flink.agents.integrations.vectorstores.milvus.MilvusVectorStore" | ||
|
|
||
| class RoutingStrategy: |
There was a problem hiding this comment.
This is the only Python surface the feature gets — the mirrored RoutingStrategy.Java.{RULE_BASED, LLM} FQCN constants, with no Python ChatModelRouter, no Python RoutingStrategy SPI, and no Python tests. CLAUDE.md asks for Java/Python parity, though that's scoped to "when changing shared logic," and this PR adds a new Java resource resolved by name rather than touching shared logic. The mirrored constants read as a deliberate bridge — a Python agent points a CHAT_MODEL at the Java router by name. Is that the intended Python story for v1 ("reference the Java router by name," with a native Python SPI as a possible follow-up), or is Python routing out of scope for now? Asking mainly so the intent is on the record — the constants imply Python users are expected to reach this, so it'd help to state how far that's meant to go.
| private final Map<String, Object> metadata; | ||
|
|
||
| public RoutingCandidate(String name, String description, Map<String, Object> metadata) { | ||
| this.name = Objects.requireNonNull(name, "candidate name must not be null"); |
There was a problem hiding this comment.
The ctor null-checks name but lets an empty string through, and in LlmRoutingStrategy.parseChoice the whole-token regex for an empty quoted name over-matches almost any boundary → mis-route. It takes a pathological config (an empty candidate name) to hit, so low priority — a one-line name.isEmpty() guard alongside the requireNonNull closes it cleanly.
…at test Review follow-ups from @weiqingy on the routing PR: - ChatModelRouter.open(): document the load-bearing invariant the no-op relies on — a routed candidate is lazily open()-ed by ResourceCache.getResource() on first resolution, so its connection is non-null before chat() runs. - CachingStrategy / LlmRoutingStrategy: soften "runs once per conversation" to "typically once" and document that memoization is best-effort (a concurrent first-touch on the same key can double-compute; synchronized map, last-writer- wins, benign — so no locking). - RoutingCandidate: reject an empty name (not just null) — an empty name has no resolvable resource and would make LlmRoutingStrategy.parseChoice's whole-token match over-match arbitrary boundaries (mis-route). - Tests: add ChatModelRouterTest cases pinning the open-before-chat invariant (candidate resolved through an opening ResourceContext, mirroring ResourceCache; plus the negative case proving it is load-bearing), and RoutingCandidateTest for the null/empty name guards. 39 routing tests pass; spotless:check clean under JDK 17.
What is the purpose of the change
Adds a drop-in chat model that routes each request to the best underlying model, then delegates to it. The router is a
CHAT_MODELresource, so an agent points at it by name with no change to the runtime, events, or agent definition.This is the in-chat selector (which LLM serves a single
chat()call). A DataStream-level content-based agent router (branching records across agent operators) is a separate, follow-up concern.Brief change log
RoutingStrategy— pluggable selection SPI (request -> candidate name). Selection is a pure concern; returningnullmeans "abstain / no opinion".ChatModelRouter— orchestrates select → (optional cache) → validate → delegate. A strategy that abstains (null) or names a non-candidate is a routing miss and degrades to the configureddefaultcandidate (validated at construction; defaults to the first candidate) rather than failing the request.FallbackPolicy— optional: try remaining candidates on error.CachingStrategy— optional bounded-LRU memoization of the decision per conversation, so an expensive strategy (e.g. an LLM judge) runs once per conversation, not once per tool-call round. Abstentions (null) are never cached.RuleBasedRoutingStrategy— deterministic keyword/regex rules + default.LlmRoutingStrategy— a small "judge" model picks the candidate from each candidate's name/description (RouteLLM-style). Distinguishes a transient judge failure (abstain → retried next round, uncached) from an unparseable reply (deterministic default). Parses by whole-token match (no substring mis-routing, e.g.gpt-4o-miniwon't match agpt-4candidate).RoutingStrategyand reference it by fully-qualified class name; loaded via the thread context classloader (cluster-safe). ML/learned routing is supported the same way.LlmRoutingAgentExampleand unit tests.Verifying this change
This change adds tests and can be verified as follows:
api/.../chat/model/routing/covering rule selection, judge parsing (whole-token match), stickiness across tool-call rounds, fallback, caching (incl. abstain-not-cached), routing-miss degrade-to-default, and bring-your-own loading. All pass;spotless:checkclean (JDK 17).Does this pull request potentially affect one of the following parts:
org.apache.flink.agents.api.chat.model.routingpackage (additive; no existing API changed).CHAT_MODELresource resolved by name)Security note
An LLM/ML routing decision is a hint, not an authority — the user's message is sent to the judge model, so a routing decision is susceptible to prompt injection. Cost/privilege/safety controls must not be gated solely on it. This is documented on
LlmRoutingStrategy.Documentation
Documentation
doc-neededdoc-not-neededdoc-included