feat: add retrieval event logging and evaluation harness by chrishonson · Pull Request #5 · chrishonson/Metacortex

chrishonson · 2026-06-29T20:52:28Z

Summary

Opt-in retrieval observability + a full offline + production retrieval evaluation harness.

What

When RETRIEVAL_EVENT_LOGGING_ENABLED=true, search_context and fetch_context now write rich events (full query, filters, ranked result ids + derived scores, fetch ids) to a dedicated retrieval_query_events collection (90 d TTL, deny-all rules, linked to the regular tool events).
New retrievalEvaluation module + CLI:
- eval:generate – creates a deterministic synthetic corpus in the isolated memory_vectors_eval collection and the corresponding golden retrieval_eval_cases.
- eval:import – converts real production search→fetch observations (via correlateImplicitFetches) into labeled cases.
- eval:run – executes the cases (either directly against the eval partition or against a live MCP endpoint) and computes standard metrics (hit@k, recall@k, MRR, zero-result rate, p50/p95 latency).
Separate eval collection + indexes so benchmark work never touches production memory.
Updated wiring in config, MCP server, observer, tests, smoke test, docs, and Firestore rules/indexes/TTL script.

Motivation

Gives us repeatable, automatable measurement of retrieval quality using either controlled fixtures or actual usage signals (successful fetches after searches are treated as positives).

New scripts

npm --prefix functions run eval:generate
npm --prefix functions run eval:import -- --lookback-hours 168
npm --prefix functions run eval:run -- --mode isolated   # or production

Verification

npm --prefix functions test → 49/49 passing
New unit + integration coverage for the eval paths and event emission
Smoke test updated with --fetch-first helper

See the "Retrieval evaluation" section added to the README for usage.

- Add RETRIEVAL_EVENT_LOGGING_ENABLED (default false) for detailed search/fetch telemetry in separate retrieval_query_events collection (90-day TTL via Firestore, linked to parent tool_call events) - New retrievalEvaluation.ts implementing: - FirestoreRetrievalEvalCaseStore + replace/list helpers - buildSyntheticEvalCases + correlateImplicitFetches (search+fetch pairs within time window become positive labels) - computeRetrievalEvalMetrics (hit@k, recall@k, MRR, zero-result rate, p50/p95 latency) - New functions/scripts/retrieval-eval.ts + package.json scripts: eval:generate (plant isolated corpus + cases in memory_vectors_eval) eval:import (turn recent production retrieval events into cases) eval:run (--mode isolated|production, --cutoffs, reports JSON metrics) - Wire new config (retrievalEventLoggingEnabled, topK, memoryCollection) through app.ts into createMetaCortexMcpServer and observer - mcpServer: extend execute wrapper to accept optional retrieval descriptors; search_context and fetch_context now emit rich events when enabled (with scores converted from cosine distance) - Test updates: InMemoryToolCallObserver.listRetrievalEvents, integration assertions, config boolean parser tests, fakes - Smoke test: --fetch-first / MCP_FETCH_FIRST to drive fetch after search - Infra: firestore indexes for memory_vectors_eval, rules deny for new collections (retrieval_query_events, retrieval_eval_cases, *_eval*), deploy-firestore-ttl.sh updated - Docs + .env.example updated All tests green (49 passing).

chrishonson · 2026-06-29T21:54:20Z

@claude review and how long do you think i'll need to run this in prod before we start looking at the analytics?

claude · 2026-06-29T21:54:36Z

Claude finished @chrishonson's task in 2m 26s —— View job

Code Review + Analytics Timeline

Tasks

Gather context
Read key new files (retrievalEvaluation.ts, observability.ts, config.ts, mcpServer.ts)
Read supporting files (firestore.indexes.json, firestore.rules, tests)
Post review findings
Answer analytics timeline question

Review

Overall this is clean, well-structured work. The architecture — opt-in flag, separate collection, dedicated eval partition, implicit-fetch correlation for free labels — is solid. A few things worth looking at:

Correctness / Subtle Bugs

recall@k averaging includes zero-positive observations in denominator (retrievalEvaluation.ts:259–276)

When positiveIds.size === 0, the function returns total unchanged (contributes 0), but safeAverage still divides by observations.length including those cases. So an eval case with no positive IDs is treated as 0 recall and silently drags the metric down. In practice positive_ids should never be empty (the correlator only produces cases with ≥1 positive), but if even one slips through it'll corrupt recall@k without a warning. A quick guard would close this:

// at the top of the reduce callback for recall
if (positiveIds.size === 0) return total; // ← today this silently penalizes; instead skip denominator too

Fix this →

deleteCollection has no guard against infinite looping (retrieval-eval.ts:333–349)

The while (true) loop has no iteration cap or catch. If Firestore returns a persistent error mid-batch the loop hangs indefinitely. Since this only runs locally it's low-risk, but a maxIterations guard or a catch that breaks out would make failures debuggable.

Fetch error path emits no found field (mcpServer.ts:204–211)

On a 404 from fetch_context, the error path records retrieval: retrieval.request which has no found key. correlateImplicitFetches correctly filters to event.found === true so not-found fetches are excluded from labels. This is intentional and works, but it's non-obvious — a comment would help the next reader.

Minor / Nits

Home-rolled .env parser doesn't handle quoted values (retrieval-eval.ts:356–386)

loadEnvironment splits on the first = and takes the raw RHS. A value like GEMINI_API_KEY="abc def" would include the quotes in the env var. dotenv is probably already in the transitive dep graph — worth using it here rather than growing the parser.

MEMORY_EVENT_TTL_MS is reused for retrieval events (observability.ts:9,155)

Both collections end up with the same 90-day TTL, but the constant's name implies it belongs to memory_events. Either extract a shared AUDIT_EVENT_TTL_MS constant or add a comment explaining the shared TTL. Minor readability.

retrieval_eval_cases has no TTL (firestore.indexes.json, deploy-firestore-ttl.sh)

Probably intentional — eval cases persist across runs by design. But the TTL script covers three of the four new collections and is silent on why retrieval_eval_cases is excluded. A comment in the script would clarify intent.

await main() at top level has no error formatter (retrieval-eval.ts:492)

Node will print the stack trace on rejection, which is fine. But wrapping it in main().catch(err => { console.error(err.message); process.exitCode = 1; }) gives cleaner CLI output without the noise.

What's Good

Opt-in flag (RETRIEVAL_EVENT_LOGGING_ENABLED=false default) means existing deployments see no behavior change.
retrieval_query_events is completely server-side (deny-all rules) — no accidental client leakage.
correlateImplicitFetches scoping by client_id + memory_collection + containment-in-results is exactly right; it avoids cross-client label contamination.
Isolated eval collection (memory_vectors_eval) with its own indexes means benchmark runs can't corrupt or co-mingle with production vectors.
schema_version: 1 field on RetrievalEvalCase gives a forward migration path.
Tests are lean and cover the key behaviors (synthetic case building, correlation logic, metrics, partition replacement).
The label_source: "implicit_fetch" field is a nice explicit note that these labels are derived, not human-judged.

How Long to Run in Production Before Looking at Analytics

tl;dr: ~1 week for a first signal, 3–4 weeks for a stable baseline you can regression-test against.

The import-production command (--lookback-hours 168 = 7 days) constructs one eval case per unique search→fetch pair. The quality of the metrics depends entirely on how many of those pairs you accumulate:

Volume	Cases	What you can trust
< 1 week	< 20	Only the synthetic isolated eval (`eval:generate`) is meaningful. Don't draw conclusions from production data yet.
1–2 weeks	20–60	Hit@k and zero-result rate are useful directionally. MRR has wide confidence intervals — take it loosely.
3–4 weeks	60–200+	Hit@k and MRR stabilize to ±5% with typical usage patterns. p95 latency is trustworthy. This is your baseline.
Ongoing	cumulative	Run `eval:import` + `eval:run` before and after any embedding/index change to detect regressions.

A few things that affect how fast you get there:

Query diversity matters more than volume. Twenty distinct queries with one fetch each are more useful than 200 repetitions of the same query. The correlator groups by event_id so duplicates don't inflate case counts.
The 30-minute maxDelayMs window is conservative — real "I found what I searched for" fetches tend to happen within seconds. If you see suspiciously low correlation rates you can tighten that window in the import call.
Start with eval:generate now. The synthetic corpus doesn't need any production data and can catch embedding/index issues immediately. Run it today and confirm hit@5 ≈ 1.0 on the isolated cases before you flip the logging flag.
Once you enable the flag, the first eval:import after a week will tell you what your real-world zero-result rate is — that's often the most actionable early signal.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add retrieval event logging and evaluation harness#5

feat: add retrieval event logging and evaluation harness#5
chrishonson wants to merge 1 commit into
mainfrom
feat/retrieval-evaluation

chrishonson commented Jun 29, 2026

Uh oh!

chrishonson commented Jun 29, 2026

Uh oh!

claude Bot commented Jun 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chrishonson commented Jun 29, 2026

Summary

What

Motivation

New scripts

Verification

Uh oh!

chrishonson commented Jun 29, 2026

Uh oh!

claude Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review + Analytics Timeline

Review

Correctness / Subtle Bugs

Minor / Nits

What's Good

How Long to Run in Production Before Looking at Analytics

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Jun 29, 2026 •

edited

Loading