feat: add manual judge evaluation (Judge, Evaluator, createJudge) (AIC-2665)#175

Open
ctawiah wants to merge 2 commits into feat/ai-sdk-tracker from feat/ai-sdk-evals

@ctawiah ctawiah commented Jun 11, 2026

Requirements

  • I have added test coverage for new or changed functionality
  • I have followed the repository's pull request submission guidelines
  • I have validated my changes against all supported platform versions

Related issues

Stacked on #174 (AIC-2664). Review/merge that first; the diff here is against feat/ai-sdk-tracker.

Describe the solution you've provided

Implements the manual-only evaluation path for AI Configs. v1.0 does not auto-invoke judges on completion/agent calls; the caller drives evaluation: createJudge() → judge.evaluate(...) → track the result yourself.

  • Runner SPI + RunnerResult — caller-supplied model invocation. RunnerResult carries content, the run Metrics, and parsed structured output ({score, reasoning}). Provider-specific runners ship post-1.0.
  • Judge — sampling is decided before invoking the model; input is formatted as MESSAGE HISTORY:\n{input}\n\nRESPONSE TO EVALUATE:\n{output}; the runner is invoked via the tracker's trackMetricsOf so invocation metrics are recorded; score (0.0–1.0, out-of-range → failure) and reasoning are parsed. The judge returns a JudgeResult but does not call trackJudgeResult — recording is the caller's responsibility. evaluateMessages renders <role>: <content> history and delegates to evaluate. Sampling rate is normalized (NaN/Infinity → 1.0, negative → 0.0, >1 → 1.0).
  • Evaluator — runs a set of judges with per-judge fault isolation (a failing/timing-out judge yields a failed JudgeResult; others are preserved in order) and a per-judge timeout so a hung judge can't stall the chain. noop() returns an empty list with no warnings. Thread-safe; uses a short-lived executor per evaluate call.
  • LDAIClient.createJudge — fires only $ld:ai:usage:create-judge, resolves the judge config through the internal evaluate path (so no $ld:ai:usage:judge-config event), and returns null if the config is disabled or no runner is supplied.
  • README documents the manual-only flow and the auto-attach descope.
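The sampling and input-formatting rules in the Judge bullet can be sketched as follows. This is an illustration only, not the SDK's implementation: the class name `JudgeInputSketch`, the `Message` stand-in, and the method names other than `buildEvaluationInput` are hypothetical. Joining rendered history lines with a newline and mapping negative infinity to 1.0 are assumptions; the description only says "NaN/Infinity → 1.0".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not the SDK's Judge class.
final class JudgeInputSketch {

    // Stand-in for a chat message; the SDK's own message type may differ.
    static final class Message {
        final String role;
        final String content;

        Message(String role, String content) {
            this.role = role;
            this.content = content;
        }
    }

    // Normalization as described in the PR: NaN/Infinity -> 1.0, negative -> 0.0, >1 -> 1.0.
    // Assumption: negative infinity is treated like any other infinity (1.0).
    static double normalizeSamplingRate(double rate) {
        if (Double.isNaN(rate) || Double.isInfinite(rate)) {
            return 1.0;
        }
        if (rate < 0.0) {
            return 0.0;
        }
        return Math.min(rate, 1.0);
    }

    // Sampling is decided before the model is invoked. nextDouble() is in [0, 1),
    // so a rate of 0.0 never samples and a rate of 1.0 always samples.
    static boolean shouldSample(double rate, Random random) {
        return random.nextDouble() < normalizeSamplingRate(rate);
    }

    // The input format quoted in the PR description.
    static String buildEvaluationInput(String input, String output) {
        return "MESSAGE HISTORY:\n" + input + "\n\nRESPONSE TO EVALUATE:\n" + output;
    }

    // Renders "<role>: <content>" per message. The newline separator is an assumption.
    static String renderHistory(List<Message> messages) {
        return messages.stream()
                .map(m -> m.role + ": " + m.content)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String history = renderHistory(Arrays.asList(new Message("user", "What is 2 + 2?")));
        if (shouldSample(1.0, new Random())) {
            System.out.println(buildEvaluationInput(history, "4"));
        }
    }
}
```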

The API surface is synchronous, consistent with the rest of this server SDK; the concurrency needed for per-judge timeouts is internal to Evaluator.

Tests

  • JudgeTest: scoring/metric key, input formatting, zero-sampling skip (runner not invoked), missing metric key, out-of-range score, missing reasoning, runner throw, runner failure metrics, evaluateMessages rendering, sample-rate normalization.
  • EvaluatorTest: noop() empty, order preservation, fault isolation, timeout isolation, completion-order independence.
  • LDAIClientImplTest: createJudge fires only create-judge (not judge-config), returns a Judge when enabled, null when disabled, null when no runner.

Describe alternatives you've considered

A CompletableFuture-based async API was considered but rejected for consistency with the synchronous server SDK surface. Automatic sample-rate-driven judge auto-attachment and provider runners are intentionally deferred past v1.0 (aligns with the .NET descope).
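For callers who still want a future-based call site, one option is to wrap the synchronous call on their own executor. The sketch below is a caller-side pattern, not part of this PR; `AsyncWrapSketch` and `evaluateAsync` are hypothetical names, and the `Supplier` stands in for a call such as `judge.evaluate(input, output)`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical caller-side wrapper; the SDK itself stays synchronous.
final class AsyncWrapSketch {

    // Runs any synchronous evaluation on the supplied executor and exposes it as a future.
    static <T> CompletableFuture<T> evaluateAsync(Supplier<T> syncEvaluate, Executor executor) {
        return CompletableFuture.supplyAsync(syncEvaluate, executor);
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // The lambda stands in for a synchronous judge evaluation.
            CompletableFuture<String> future = evaluateAsync(() -> "judge result", executor);
            System.out.println(future.join());
        } finally {
            executor.shutdown();
        }
    }
}
```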

Additional context

JudgeResult was added in #174 (AIC-2664) and is reused here.


Note

Medium Risk
New public API and usage telemetry paths; judge runs depend on caller-supplied runners and correct manual tracking, but no changes to core flag evaluation or auth.

Overview
Adds manual-only AI response evaluation to the server AI SDK: callers use LDAIClient.createJudge() with a custom Runner, run Judge.evaluate() (or evaluateMessages()), and record scores via trackJudgeResult themselves—no auto-attachment on completion/agent calls in v1.0.

New public Runner / RunnerResult SPI for model calls; Judge applies sampling, formats judge input, invokes the runner through the config tracker for invocation metrics, and parses {score, reasoning}. Internal Evaluator runs multiple judges concurrently with per-judge timeouts and fault isolation. createJudge emits only $ld:ai:usage:create-judge (not judge-config) and returns null when disabled or no runner. README documents tracking and the manual judge flow.
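The manual flow summarized above (create a judge, evaluate, record the result yourself) might look roughly like this from the caller's side. Every type here is a simplified stand-in defined for the sketch, not the SDK's API: the real `Runner` returns a `RunnerResult`, `createJudge` takes a context, default config, and variables, and the real tracker is more than a single-method interface. Having the runner return a bare score string is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration; signatures differ from the SDK's.
final class ManualJudgeFlowSketch {

    // Stand-in for the Runner SPI: the caller supplies the model call.
    interface Runner {
        String run(String evaluationInput);
    }

    // Stand-in for JudgeResult with only the fields this sketch needs.
    static final class Result {
        final boolean success;
        final double score;

        Result(boolean success, double score) {
            this.success = success;
            this.score = score;
        }
    }

    // Stand-in for the tracker method the caller invokes to record a result.
    interface ResultSink {
        void trackJudgeResult(Result result);
    }

    static final class Judge {
        private final Runner runner;

        Judge(Runner runner) {
            this.runner = runner;
        }

        // Returns a result but never records it; recording is left to the caller.
        Result evaluate(String input, String output) {
            try {
                String raw = runner.run(
                        "MESSAGE HISTORY:\n" + input + "\n\nRESPONSE TO EVALUATE:\n" + output);
                double score = Double.parseDouble(raw.trim());
                boolean inRange = score >= 0.0 && score <= 1.0; // false for NaN as well
                return new Result(inRange, inRange ? score : Double.NaN);
            } catch (RuntimeException e) {
                // Runner failures and unparseable output become a failed result.
                return new Result(false, Double.NaN);
            }
        }
    }

    // Mirrors the described contract: null when disabled or when no runner is supplied.
    static Judge createJudge(boolean enabled, Runner runner) {
        return (!enabled || runner == null) ? null : new Judge(runner);
    }

    public static void main(String[] args) {
        List<Result> recorded = new ArrayList<>();
        ResultSink sink = recorded::add;

        Judge judge = createJudge(true, evaluationInput -> "0.8");
        if (judge != null) {
            Result result = judge.evaluate("user: hi", "hello");
            sink.trackJudgeResult(result); // the caller's responsibility
        }
        System.out.println("recorded results: " + recorded.size());
    }
}
```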

Reviewed by Cursor Bugbot for commit f6d4a4c. Bugbot is set up for automated code reviews on this repo. Configure here.

@ctawiah ctawiah marked this pull request as ready for review June 11, 2026 03:05
@ctawiah ctawiah requested a review from a team as a code owner June 11, 2026 03:05

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 83a342f. Configure here.

private JudgeResult awaitResult(Judge judge, Future<JudgeResult> future) {
    String key = judge.getAIConfig().getKey();
    try {
        return future.get(perJudgeTimeout.toMillis(), TimeUnit.MILLISECONDS);

Sequential waits break judge timeouts

High Severity

Evaluator starts all judges concurrently but awaits each Future in list order with a full perJudgeTimeout on every get. That timeout is measured from each get call, not from when the judge task started, so later judges can run far longer than the configured cap and evaluate can take up to judges.size() × perJudgeTimeout when multiple judges hang.
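One possible way to address this is to compute a single deadline when the tasks are submitted and wait only for the time remaining on each `get`. The sketch below is not the PR's Evaluator: `DeadlineAwaitSketch` and `evaluateAll` are hypothetical names, `String` stands in for `JudgeResult`, and the shared deadline only approximates a per-judge timeout because an unbounded cached pool starts every task immediately. With a bounded pool, tasks can queue, and each task would need its own start time.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; String stands in for JudgeResult.
final class DeadlineAwaitSketch {

    static List<String> evaluateAll(List<Callable<String>> judges, Duration perJudgeTimeout) {
        // Short-lived executor per call, as the PR describes for Evaluator.
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            // One deadline shared by all tasks, fixed at submission time.
            long deadlineNanos = System.nanoTime() + perJudgeTimeout.toNanos();
            List<Future<String>> futures = new ArrayList<>();
            for (Callable<String> judge : judges) {
                futures.add(executor.submit(judge));
            }

            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                // Wait only for what is left of the budget, never a fresh full timeout.
                long remaining = Math.max(0L, deadlineNanos - System.nanoTime());
                try {
                    results.add(future.get(remaining, TimeUnit.NANOSECONDS));
                } catch (TimeoutException e) {
                    future.cancel(true);
                    results.add("FAILED: timeout");
                } catch (ExecutionException e) {
                    // Fault isolation: one failing judge does not affect the others.
                    results.add("FAILED: " + e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.add("FAILED: interrupted");
                }
            }
            return results; // same order as the input list
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Callable<String> quick = () -> "ok";
        Callable<String> hung = () -> {
            Thread.sleep(5_000);
            return "late";
        };
        System.out.println(evaluateAll(Arrays.asList(quick, hung, hung), Duration.ofMillis(200)));
    }
}
```

The JDK's `ExecutorService.invokeAll(tasks, timeout, unit)` is an alternative that applies one timeout across all tasks and cancels those that have not finished.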


String evaluationInput = buildEvaluationInput(input, output);
RunnerResult response = tracker.trackMetricsOf(RunnerResult::getMetrics,
        () -> runner.run(evaluationInput));

This may come later but we typically use structured outputs and would need to define the output shape for the run.
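As a starting point for that discussion, the output shape could be pinned down as a small schema plus a validator. This is a sketch under assumptions, not something the PR defines: the JSON Schema string, the `JudgeOutputSketch` class, and its `validate` method are hypothetical, and treating missing or blank reasoning as a failure is a guess, since the PR only states that an out-of-range score is a failure.

```java
import java.util.Optional;

// Hypothetical definition of the judge's structured output: {score, reasoning}.
final class JudgeOutputSketch {

    // A JSON Schema a runner could pass to a provider that supports structured outputs.
    static final String OUTPUT_SCHEMA = "{"
            + "\"type\":\"object\","
            + "\"properties\":{"
            + "\"score\":{\"type\":\"number\",\"minimum\":0,\"maximum\":1},"
            + "\"reasoning\":{\"type\":\"string\"}"
            + "},"
            + "\"required\":[\"score\",\"reasoning\"],"
            + "\"additionalProperties\":false"
            + "}";

    static final class JudgeOutput {
        final double score;
        final String reasoning;

        JudgeOutput(double score, String reasoning) {
            this.score = score;
            this.reasoning = reasoning;
        }
    }

    // Returns empty when the parsed fields do not satisfy the shape above.
    // Assumption: missing or blank reasoning is rejected.
    static Optional<JudgeOutput> validate(Double score, String reasoning) {
        if (score == null || score.isNaN() || score < 0.0 || score > 1.0) {
            return Optional.empty();
        }
        if (reasoning == null || reasoning.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new JudgeOutput(score, reasoning));
    }

    public static void main(String[] args) {
        System.out.println(OUTPUT_SCHEMA);
        System.out.println(validate(0.9, "Accurate and concise.").isPresent());
    }
}
```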

LDContext context,
AIJudgeConfigDefault defaultValue,
Map<String, Object> variables,
Runner runner,

It might be worth discussing the plans for the SDK's experimental features if you haven't already. The createJudge method should be marked experimental. In the Node and Python SDKs we don't accept a runner but build it internally, so this is a change in the SDK. I'm not sure whether this is temporary and will be addressed later.

@ctawiah ctawiah force-pushed the feat/ai-sdk-tracker branch from 19d0f4f to 2ca9fc8 Compare June 11, 2026 21:29
ctawiah and others added 2 commits June 11, 2026 17:30
…C-2665)

Implements the AIEVALS manual-only evaluation path:

- Runner SPI and RunnerResult for caller-supplied model invocation
- Judge: sampling decided before invocation, well-known input format,
  score/reasoning parsing with range validation, invocation tracked via
  trackMetricsOf (does not emit trackJudgeResult; caller's responsibility)
- Evaluator: per-judge fault isolation and per-judge timeout, order-preserving
  results, noop() returns an empty list; sampling-rate normalization on Judge
- LDAIClient.createJudge: fires only $ld:ai:usage:create-judge, resolves the
  judge config via the internal evaluate path, returns null when disabled or
  when no runner is supplied

Automatic judge auto-attachment and provider runners are deferred past v1.0.
README documents the manual-only flow and the auto-attach descope.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ctawiah ctawiah force-pushed the feat/ai-sdk-evals branch from 83a342f to f6d4a4c Compare June 11, 2026 21:30