Skip to content

feat: add retrieval event logging and evaluation harness#5

Open
chrishonson wants to merge 1 commit into
mainfrom
feat/retrieval-evaluation
Open

feat: add retrieval event logging and evaluation harness#5
chrishonson wants to merge 1 commit into
mainfrom
feat/retrieval-evaluation

Conversation

@chrishonson

Copy link
Copy Markdown
Owner

Summary

Opt-in retrieval observability + a full offline + production retrieval evaluation harness.

What

  • When RETRIEVAL_EVENT_LOGGING_ENABLED=true, search_context and fetch_context now write rich events (full query, filters, ranked result ids + derived scores, fetch ids) to a dedicated retrieval_query_events collection (90 d TTL, deny-all rules, linked to the regular tool events).
  • New retrievalEvaluation module + CLI:
    • eval:generate – creates a deterministic synthetic corpus in the isolated memory_vectors_eval collection and the corresponding golden retrieval_eval_cases.
    • eval:import – converts real production search→fetch observations (via correlateImplicitFetches) into labeled cases.
    • eval:run – executes the cases (either directly against the eval partition or against a live MCP endpoint) and computes standard metrics (hit@k, recall@k, MRR, zero-result rate, p50/p95 latency).
  • Separate eval collection + indexes so benchmark work never touches production memory.
  • Updated wiring in config, MCP server, observer, tests, smoke test, docs, and Firestore rules/indexes/TTL script.

Motivation

Gives us repeatable, automatable measurement of retrieval quality using either controlled fixtures or actual usage signals (successful fetches after searches are treated as positives).

New scripts

npm --prefix functions run eval:generate
npm --prefix functions run eval:import -- --lookback-hours 168
npm --prefix functions run eval:run -- --mode isolated   # or production

Verification

  • npm --prefix functions test → 49/49 passing
  • New unit + integration coverage for the eval paths and event emission
  • Smoke test updated with --fetch-first helper

See the "Retrieval evaluation" section added to the README for usage.

- Add RETRIEVAL_EVENT_LOGGING_ENABLED (default false) for detailed
  search/fetch telemetry in separate retrieval_query_events collection
  (90-day TTL via Firestore, linked to parent tool_call events)
- New retrievalEvaluation.ts implementing:
  - FirestoreRetrievalEvalCaseStore + replace/list helpers
  - buildSyntheticEvalCases + correlateImplicitFetches (search+fetch
    pairs within time window become positive labels)
  - computeRetrievalEvalMetrics (hit@k, recall@k, MRR, zero-result rate,
    p50/p95 latency)
- New functions/scripts/retrieval-eval.ts + package.json scripts:
    eval:generate  (plant isolated corpus + cases in memory_vectors_eval)
    eval:import    (turn recent production retrieval events into cases)
    eval:run       (--mode isolated|production, --cutoffs, reports JSON metrics)
- Wire new config (retrievalEventLoggingEnabled, topK, memoryCollection)
  through app.ts into createMetaCortexMcpServer and observer
- mcpServer: extend execute wrapper to accept optional retrieval
  descriptors; search_context and fetch_context now emit rich events
  when enabled (with scores converted from cosine distance)
- Test updates: InMemoryToolCallObserver.listRetrievalEvents, integration
  assertions, config boolean parser tests, fakes
- Smoke test: --fetch-first / MCP_FETCH_FIRST to drive fetch after search
- Infra: firestore indexes for memory_vectors_eval, rules deny for new
  collections (retrieval_query_events, retrieval_eval_cases, *_eval*),
  deploy-firestore-ttl.sh updated
- Docs + .env.example updated

All tests green (49 passing).
@chrishonson

Copy link
Copy Markdown
Owner Author

@claude review and how long do you think i'll need to run this in prod before we start looking at the analytics?

@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Claude finished @chrishonson's task in 2m 26s —— View job


Code Review + Analytics Timeline

Tasks

  • Gather context
  • Read key new files (retrievalEvaluation.ts, observability.ts, config.ts, mcpServer.ts)
  • Read supporting files (firestore.indexes.json, firestore.rules, tests)
  • Post review findings
  • Answer analytics timeline question

Review

Overall this is clean, well-structured work. The architecture — opt-in flag, separate collection, dedicated eval partition, implicit-fetch correlation for free labels — is solid. A few things worth looking at:

Correctness / Subtle Bugs

recall@k averaging includes zero-positive observations in denominator (retrievalEvaluation.ts:259–276)

When positiveIds.size === 0, the function returns total unchanged (contributes 0), but safeAverage still divides by observations.length including those cases. So an eval case with no positive IDs is treated as 0 recall and silently drags the metric down. In practice positive_ids should never be empty (the correlator only produces cases with ≥1 positive), but if even one slips through it'll corrupt recall@k without a warning. A quick guard would close this:

// at the top of the reduce callback for recall
if (positiveIds.size === 0) return total; // ← today this silently penalizes; instead skip denominator too

Fix this →


deleteCollection has no guard against infinite looping (retrieval-eval.ts:333–349)

The while (true) loop has no iteration cap or catch. If Firestore returns a persistent error mid-batch the loop hangs indefinitely. Since this only runs locally it's low-risk, but a maxIterations guard or a catch that breaks out would make failures debuggable.


Fetch error path emits no found field (mcpServer.ts:204–211)

On a 404 from fetch_context, the error path records retrieval: retrieval.request which has no found key. correlateImplicitFetches correctly filters to event.found === true so not-found fetches are excluded from labels. This is intentional and works, but it's non-obvious — a comment would help the next reader.


Minor / Nits

Home-rolled .env parser doesn't handle quoted values (retrieval-eval.ts:356–386)

loadEnvironment splits on the first = and takes the raw RHS. A value like GEMINI_API_KEY="abc def" would include the quotes in the env var. dotenv is probably already in the transitive dep graph — worth using it here rather than growing the parser.

MEMORY_EVENT_TTL_MS is reused for retrieval events (observability.ts:9,155)

Both collections end up with the same 90-day TTL, but the constant's name implies it belongs to memory_events. Either extract a shared AUDIT_EVENT_TTL_MS constant or add a comment explaining the shared TTL. Minor readability.

retrieval_eval_cases has no TTL (firestore.indexes.json, deploy-firestore-ttl.sh)

Probably intentional — eval cases persist across runs by design. But the TTL script covers three of the four new collections and is silent on why retrieval_eval_cases is excluded. A comment in the script would clarify intent.

await main() at top level has no error formatter (retrieval-eval.ts:492)

Node will print the stack trace on rejection, which is fine. But wrapping it in main().catch(err => { console.error(err.message); process.exitCode = 1; }) gives cleaner CLI output without the noise.


What's Good

  • Opt-in flag (RETRIEVAL_EVENT_LOGGING_ENABLED=false default) means existing deployments see no behavior change.
  • retrieval_query_events is completely server-side (deny-all rules) — no accidental client leakage.
  • correlateImplicitFetches scoping by client_id + memory_collection + containment-in-results is exactly right; it avoids cross-client label contamination.
  • Isolated eval collection (memory_vectors_eval) with its own indexes means benchmark runs can't corrupt or co-mingle with production vectors.
  • schema_version: 1 field on RetrievalEvalCase gives a forward migration path.
  • Tests are lean and cover the key behaviors (synthetic case building, correlation logic, metrics, partition replacement).
  • The label_source: "implicit_fetch" field is a nice explicit note that these labels are derived, not human-judged.

How Long to Run in Production Before Looking at Analytics

tl;dr: ~1 week for a first signal, 3–4 weeks for a stable baseline you can regression-test against.

The import-production command (--lookback-hours 168 = 7 days) constructs one eval case per unique search→fetch pair. The quality of the metrics depends entirely on how many of those pairs you accumulate:

Volume Cases What you can trust
< 1 week < 20 Only the synthetic isolated eval (eval:generate) is meaningful. Don't draw conclusions from production data yet.
1–2 weeks 20–60 Hit@k and zero-result rate are useful directionally. MRR has wide confidence intervals — take it loosely.
3–4 weeks 60–200+ Hit@k and MRR stabilize to ±5% with typical usage patterns. p95 latency is trustworthy. This is your baseline.
Ongoing cumulative Run eval:import + eval:run before and after any embedding/index change to detect regressions.

A few things that affect how fast you get there:

  • Query diversity matters more than volume. Twenty distinct queries with one fetch each are more useful than 200 repetitions of the same query. The correlator groups by event_id so duplicates don't inflate case counts.
  • The 30-minute maxDelayMs window is conservative — real "I found what I searched for" fetches tend to happen within seconds. If you see suspiciously low correlation rates you can tighten that window in the import call.
  • Start with eval:generate now. The synthetic corpus doesn't need any production data and can catch embedding/index issues immediately. Run it today and confirm hit@5 ≈ 1.0 on the isolated cases before you flip the logging flag.
  • Once you enable the flag, the first eval:import after a week will tell you what your real-world zero-result rate is — that's often the most actionable early signal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants