feat: add manual judge evaluation (Judge, Evaluator, createJudge) (AIC-2665)#175
ctawiah wants to merge 2 commits into
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 83a342f.
    private JudgeResult awaitResult(Judge judge, Future<JudgeResult> future) {
        String key = judge.getAIConfig().getKey();
        try {
            return future.get(perJudgeTimeout.toMillis(), TimeUnit.MILLISECONDS);
Sequential waits break judge timeouts
High Severity
Evaluator starts all judges concurrently but awaits each Future in list order with a full perJudgeTimeout on every get. That timeout is measured from each get call, not from when the judge task started, so later judges can run far longer than the configured cap and evaluate can take up to judges.size() × perJudgeTimeout when multiple judges hang.
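One possible way to address this is to compute a single deadline when the judges are submitted and give each `get` only the time remaining until that deadline. The following is a minimal sketch of that idea, not the PR's implementation: `DeadlineAwait`, `awaitAll`, and the `onFailure` placeholder are hypothetical names, and the sketch assumes `awaitAll` is called immediately after all tasks have been submitted.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: not part of the SDK under review.
final class DeadlineAwait {
    /**
     * Awaits futures in list order against ONE shared deadline, so the total
     * wait is bounded by perTaskTimeout rather than futures.size() * perTaskTimeout.
     */
    static <T> List<T> awaitAll(List<Future<T>> futures, Duration perTaskTimeout, T onFailure) {
        // Assumption: every task started just before this call, so a single
        // deadline approximates "timeout measured from task start".
        long deadlineNanos = System.nanoTime() + perTaskTimeout.toNanos();
        List<T> results = new ArrayList<>();
        for (Future<T> future : futures) {
            // Only the time left until the shared deadline; never negative.
            long remaining = Math.max(0L, deadlineNanos - System.nanoTime());
            try {
                // A future that has already completed returns its value even when remaining is 0.
                results.add(future.get(remaining, TimeUnit.NANOSECONDS));
            } catch (TimeoutException | ExecutionException e) {
                future.cancel(true);      // interrupt a hung or failed task
                results.add(onFailure);   // keep list order with a failure placeholder
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                future.cancel(true);
                results.add(onFailure);
            }
        }
        return results;
    }
}
```

With this shape, three hung judges and a 300 ms timeout cost roughly 300 ms in total instead of roughly 900 ms, while results that finished in time are still returned in list order.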
    String evaluationInput = buildEvaluationInput(input, output);
    RunnerResult response = tracker.trackMetricsOf(RunnerResult::getMetrics,
        () -> runner.run(evaluationInput));
This may come later but we typically use structured outputs and would need to define the output shape for the run.
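As a rough illustration of what a defined output shape could look like, here is a sketch built on the `{score, reasoning}` pair named in the PR description. `JudgeOutput` and its `of` factory are hypothetical names, not types from this PR, and treating blank reasoning as invalid is an assumption.

```java
import java.util.Optional;

// Hypothetical structured-output shape for a judge run; not an SDK type.
final class JudgeOutput {
    final double score;      // expected range: 0.0 to 1.0 inclusive
    final String reasoning;  // free-text explanation from the judge model

    private JudgeOutput(double score, String reasoning) {
        this.score = score;
        this.reasoning = reasoning;
    }

    /**
     * Returns empty when the score is missing, NaN, or outside 0.0..1.0,
     * or (by assumption) when the reasoning is missing or blank.
     */
    static Optional<JudgeOutput> of(Double score, String reasoning) {
        if (score == null || score.isNaN() || score < 0.0 || score > 1.0) {
            return Optional.empty();
        }
        if (reasoning == null || reasoning.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new JudgeOutput(score, reasoning));
    }
}
```

A runner that requests structured output could map the provider's response into a shape like this, so that validation lives in one place rather than in each caller.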
    LDContext context,
    AIJudgeConfigDefault defaultValue,
    Map<String, Object> variables,
    Runner runner,
It might be worth discussing the plans for the SDK's experimental features, if you haven't already. The createJudge method should be marked experimental. In Node and Python we don't accept a runner but build it internally, so this is a change in the SDK. I'm not sure whether this is temporary and will be addressed later.
19d0f4f to 2ca9fc8
…C-2665)

Implements the AIEVALS manual-only evaluation path:
- Runner SPI and RunnerResult for caller-supplied model invocation
- Judge: sampling decided before invocation, well-known input format, score/reasoning parsing with range validation, invocation tracked via trackMetricsOf (does not emit trackJudgeResult; caller's responsibility)
- Evaluator: per-judge fault isolation and per-judge timeout, order-preserving results, noop() returns an empty list; sampling-rate normalization on Judge
- LDAIClient.createJudge: fires only $ld:ai:usage:create-judge, resolves the judge config via the internal evaluate path, returns null when disabled or when no runner is supplied

Automatic judge auto-attachment and provider runners are deferred past v1.0. README documents the manual-only flow and the auto-attach descope.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
83a342f to f6d4a4c


Requirements
Related issues
Describe the solution you've provided
Implements the manual-only evaluation path for AI Configs. v1.0 does not auto-invoke judges on completion/agent calls; the caller drives evaluation:
`createJudge()` → `judge.evaluate(...)` → track the result yourself.

- `Runner` SPI + `RunnerResult`: caller-supplied model invocation. `RunnerResult` carries `content`, the run `Metrics`, and `parsed` structured output (`{score, reasoning}`). Provider-specific runners ship post-1.0.
- `Judge`: sampling is decided before invoking the model; input is formatted as `MESSAGE HISTORY:\n{input}\n\nRESPONSE TO EVALUATE:\n{output}`; the runner is invoked via the tracker's `trackMetricsOf` so invocation metrics are recorded; `score` (0.0–1.0, out-of-range → failure) and `reasoning` are parsed. The judge returns a `JudgeResult` but does not call `trackJudgeResult`; recording is the caller's responsibility. `evaluateMessages` renders `<role>: <content>` history and delegates to `evaluate`. Sampling rate is normalized (NaN/Infinity → 1.0, negative → 0.0, >1 → 1.0).
- `Evaluator`: runs a set of judges with per-judge fault isolation (a failing or timing-out judge yields a failed `JudgeResult`; others are preserved in order) and a per-judge timeout so a hung judge can't stall the chain. `noop()` returns an empty list with no warnings. Thread-safe; uses a short-lived executor per `evaluate` call.
- `LDAIClient.createJudge`: fires only `$ld:ai:usage:create-judge`, resolves the judge config through the internal evaluate path (so no `$ld:ai:usage:judge-config` event), and returns `null` if the config is disabled or no runner is supplied.

The public API surface is synchronous, consistent with the rest of this server SDK; the concurrency needed for the per-judge timeout is internal to `Evaluator`.

Tests
- `JudgeTest`: scoring/metric key, input formatting, zero-sampling skip (runner not invoked), missing metric key, out-of-range score, missing reasoning, runner throw, runner failure metrics, `evaluateMessages` rendering, sample-rate normalization.
- `EvaluatorTest`: `noop()` empty, order preservation, fault isolation, timeout isolation, completion-order independence.
- `LDAIClientImplTest`: `createJudge` fires only create-judge (not judge-config), returns a `Judge` when enabled, null when disabled, null when no runner.

Describe alternatives you've considered
A `CompletableFuture`-based async API was considered but rejected for consistency with the synchronous server SDK surface. Automatic sample-rate-driven judge auto-attachment and provider runners are intentionally deferred past v1.0 (aligns with the .NET descope).

Additional context
`JudgeResult` was added in #174 (AIC-2664) and is reused here.

Note
Medium Risk
New public API and usage telemetry paths; judge runs depend on caller-supplied runners and correct manual tracking, but no changes to core flag evaluation or auth.
Overview
Adds manual-only AI response evaluation to the server AI SDK: callers use `LDAIClient.createJudge()` with a custom `Runner`, run `Judge.evaluate()` (or `evaluateMessages()`), and record scores via `trackJudgeResult` themselves; there is no auto-attachment on completion/agent calls in v1.0.

New public `Runner`/`RunnerResult` SPI for model calls; `Judge` applies sampling, formats judge input, invokes the runner through the config tracker for invocation metrics, and parses `{score, reasoning}`. Internal `Evaluator` runs multiple judges concurrently with per-judge timeouts and fault isolation. `createJudge` emits only `$ld:ai:usage:create-judge` (not `judge-config`) and returns `null` when disabled or no runner. README documents tracking and the manual judge flow.

Reviewed by Cursor Bugbot for commit f6d4a4c. Bugbot is set up for automated code reviews on this repo.
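The sampling-rate normalization and judge input format stated in the PR description can be sketched as two small pure functions. This is an illustration under stated assumptions, not the code in this PR: `JudgeInputs` and `normalizeSamplingRate` are hypothetical names (`buildEvaluationInput` mirrors the name visible in the reviewed diff), and mapping negative infinity to 1.0 is one reading of the "NaN/Infinity → 1.0" rule.

```java
// Hypothetical helper class; the real logic lives inside Judge in the PR.
final class JudgeInputs {
    /**
     * Normalizes a sampling rate into [0.0, 1.0]:
     * NaN or infinite -> 1.0, negative -> 0.0, greater than 1 -> 1.0.
     * Assumption: both +Infinity and -Infinity map to 1.0.
     */
    static double normalizeSamplingRate(double rate) {
        if (Double.isNaN(rate) || Double.isInfinite(rate)) {
            return 1.0;
        }
        if (rate < 0.0) {
            return 0.0;
        }
        return Math.min(rate, 1.0);
    }

    /** Formats the judge prompt using the layout given in the PR description. */
    static String buildEvaluationInput(String input, String output) {
        return "MESSAGE HISTORY:\n" + input + "\n\nRESPONSE TO EVALUATE:\n" + output;
    }
}
```

Keeping both steps free of side effects makes them easy to cover with the kind of unit tests listed for `JudgeTest`, independent of any runner or tracker.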