feat(contract): evalReportingSuite — one call from runs (or a run dir) to analysis.json #272

Merged: drewstone merged 1 commit into main from feat/eval-reporting-suite on Jun 22, 2026

@drewstone

What

Adds evalReportingSuite(input, opts) to the /contract public surface: the one-call path from a set of runs — RunRecord[] in memory, or a .json / .jsonl file, or a directory of them — to a single analysis.json.

It is a thin wrapper, not a new analysis engine. All distributions, paired stats/lift, and the findings rollup come from the existing analyzeRuns primitive verbatim; the suite only resolves the input into validated records, calls analyzeRuns with the options you'd pass it directly, wraps the result in a small provenance envelope, and optionally writes the artifact.

// From a directory of run files, write ./runs/analysis.json:
const fromDir = await evalReportingSuite('./runs', { write: true })
// From records already in memory:
const fromMemory = await evalReportingSuite(records, { analyze: { decisionThreshold: 0.03 } })
fromMemory.report // the InsightReport: distributions, paired lift, findings rollup
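
To make the "thin wrapper" claim concrete, here is a minimal sketch of the composition, not the actual implementation. `Run`, `Report`, `analyze`, `reportingSuite`, and the `provenance` fields are all hypothetical stand-ins; the real types and the real shape of the envelope are defined in the package and are not shown in this PR description.

```typescript
// Hypothetical stand-ins for RunRecord and InsightReport.
type Run = { id: string; score: number };
type Report = { count: number; meanScore: number };
// Hypothetical envelope: the real provenance fields are not listed in the PR.
type Suite = { report: Report; provenance: { source: string; recordCount: number } };

// Stand-in for the existing analysis primitive (analyzeRuns in the PR).
function analyze(runs: Run[]): Report {
  const total = runs.reduce((sum, r) => sum + r.score, 0);
  return { count: runs.length, meanScore: runs.length > 0 ? total / runs.length : 0 };
}

// The wrapper adds no analysis of its own: it calls the primitive and wraps the result.
function reportingSuite(runs: Run[], source: string = "memory"): Suite {
  return { report: analyze(runs), provenance: { source, recordCount: runs.length } };
}

const sampleRuns: Run[] = [
  { id: "a", score: 0.4 },
  { id: "b", score: 0.8 },
];
const sampleSuite = reportingSuite(sampleRuns);

// The kind of check the PR describes: the wrapped report serializes identically
// to what the primitive returns when called directly.
const identical = JSON.stringify(sampleSuite.report) === JSON.stringify(analyze(sampleRuns));
```

Comparing serialized output rather than individual fields means any analysis logic that crept into the wrapper would show up as a mismatch.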

New surface (all additive, /contract)

  • evalReportingSuite + EvalReportingSuiteInput / EvalReportingSuiteOptions / EvalReportingSuiteResult.
  • fromRunRecordDir (new intake adapter, alongside the existing from* adapters) + FromRunRecordDirOptions / FromRunRecordDirResult / RunRecordRejection. Loads a .json (array) / .jsonl (one record per line) file or a directory of them, validating each record at the boundary via parseRunRecordSafe. Fails loud on an invalid record by default; onInvalid: 'collect' keeps the valid ones and returns the rejects.
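
The intake behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the adapter's code: `Rec`, `parseSafe`, `rawValues`, and `loadRecords` are hypothetical names, the two-field record stands in for the real RunRecord schema, and `'throw'` as the name of the default mode is an assumption (the PR only names `'collect'`).

```typescript
// Hypothetical stand-in for RunRecord.
type Rec = { id: string; score: number };
type Rejection = { source: string; index: number; reason: string };
type ParseResult = { kind: "ok"; record: Rec } | { kind: "invalid"; reason: string };
// 'collect' comes from the PR; 'throw' is an assumed name for the fail-loud default.
type OnInvalid = "throw" | "collect";

// Stand-in for parseRunRecordSafe: reports problems as a value instead of throwing.
function parseSafe(value: unknown): ParseResult {
  if (typeof value !== "object" || value === null) return { kind: "invalid", reason: "not an object" };
  const v = value as Record<string, unknown>;
  if (typeof v.id !== "string") return { kind: "invalid", reason: "id must be a string" };
  if (typeof v.score !== "number") return { kind: "invalid", reason: "score must be a number" };
  return { kind: "ok", record: { id: v.id, score: v.score } };
}

// A .jsonl file holds one value per non-blank line; a .json file holds one array.
function rawValues(name: string, text: string): unknown[] {
  if (name.endsWith(".jsonl")) {
    return text
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as unknown);
  }
  const parsed: unknown = JSON.parse(text);
  if (!Array.isArray(parsed)) throw new Error(`${name}: expected a JSON array`);
  return parsed;
}

// Validate every record at the boundary; either fail on the first bad one or collect rejects.
function loadRecords(name: string, text: string, onInvalid: OnInvalid = "throw") {
  const records: Rec[] = [];
  const rejected: Rejection[] = [];
  rawValues(name, text).forEach((value, index) => {
    const result = parseSafe(value);
    if (result.kind === "ok") {
      records.push(result.record);
      return;
    }
    if (onInvalid === "throw") throw new Error(`${name}[${index}]: ${result.reason}`);
    rejected.push({ source: name, index, reason: result.reason });
  });
  return { records, rejected };
}
```

With a two-line `.jsonl` input whose second line lacks `score`, the default mode throws, while `"collect"` returns one valid record and one rejection at index 1.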

Design notes

  • Reuses analyzeRuns (composite/cost distributions, per-dimension stats, paired-bootstrap lift, failure-mode + cluster rollup, recommendations) — the test asserts the wrapped report is byte-identical to calling analyzeRuns directly, so the wrapper can't drift into reimplementing analysis.
  • The JSON/JSONL parsing mirrors the proven loadRunRecords path already used by the analyze_runs eval tool, promoted into the public intake family where it belongs.
  • A re-run over a directory ignores its own analysis.json output (never ingests its own artifact).
  • write: true on in-memory records fails loud because there is no directory to anchor the artifact to; pass an explicit path instead.
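
The last two notes can be sketched as a pair of small helpers. This is only one way the rules might be expressed: `selectRunFiles` and `resolveOutputPath` are hypothetical names, the sketch assumes `write` may be either `true` or an explicit path string (the PR says only "pass an explicit path"), and it treats a string input as a directory.

```typescript
const ARTIFACT = "analysis.json";

// Last path segment, using "/" as the separator (enough for this sketch).
function baseName(p: string): string {
  return p.slice(p.lastIndexOf("/") + 1);
}

// Keep .json/.jsonl inputs, but never ingest the suite's own output file.
function selectRunFiles(entries: string[]): string[] {
  return entries
    .filter((name) => name.endsWith(".json") || name.endsWith(".jsonl"))
    .filter((name) => baseName(name) !== ARTIFACT)
    .sort();
}

// Assumption: write is false/undefined (no artifact), true (anchor to the
// input directory), or an explicit output path.
function resolveOutputPath(input: string | unknown[], write?: boolean | string): string | undefined {
  if (write === undefined || write === false) return undefined;
  if (typeof write === "string") return write;
  if (typeof input !== "string") {
    // In-memory records give no directory to anchor the artifact to.
    throw new Error("write: true requires a directory input; pass an explicit path instead");
  }
  return `${input.replace(/\/+$/, "")}/${ARTIFACT}`;
}
```

For example, `selectRunFiles(["b.jsonl", "analysis.json", "a.json", "notes.txt"])` returns `["a.json", "b.jsonl"]`, and `resolveOutputPath("./runs", true)` returns `"./runs/analysis.json"`.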

Verification

  • pnpm run lint — 0 errors (pre-existing warnings/infos in unrelated files untouched)
  • pnpm run typecheck — clean
  • pnpm run build — clean (incl. OpenAPI spec emit)
  • pnpm test — 247 files / 2512 tests pass (2 pre-existing skips), incl. 11 new suite tests

Version trio bumped together: 0.95.1 → 0.96.0 (package.json + clients/python/pyproject.toml + __init__.py).

…) to analysis.json

Thin wrapper over analyzeRuns + a new fromRunRecordDir intake adapter.
Resolves a RunRecord[] or a .json/.jsonl file or directory into validated
records, runs analyzeRuns (distributions, paired lift, findings rollup),
and optionally writes a single analysis.json. No analysis logic of its
own — pure composition + I/O over the existing reporting primitives.

@tangletools tangletools left a comment

✅ Auto-approved PR — 18b015b2

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T13:35:02Z

@drewstone drewstone merged commit 646ad9e into main Jun 22, 2026
1 check passed