Pathrule Benchmarks

Path-scoped knowledge delivery vs. a monolithic instruction dump

Same repository. Same synthetic knowledge. Same prompts. Different delivery.

[Figure: Benchmark result overview]

The question

Coding agents need project knowledge: decisions, rules, procedures, and local conventions. The simplest delivery mechanism is one large root instruction file. Pathrule instead compiles the same knowledge into native, path-scoped instruction files so that the client loads only the relevant slice of project knowledge.

This benchmark asks:

What happens to answer quality, token usage, and duration when identical project knowledge is delivered as one monolithic file or as native path-scoped instructions?

This first public snapshot is deliberately narrow: hard tier, English, three runs per cell. Testing is ongoing. New tiers, languages, models, and completed cells will be added without rewriting or hiding this snapshot.

Results

All four published cells completed 3/3 runs. Values below are medians across the three full ten-prompt sessions.

The primary efficiency metric is total footprint: every token the model processes per turn. It is the provider-neutral measure of how much context each delivery method puts in front of the model. Non-cached tokens, shown alongside, are the billable subset after prompt caching. Because a static dump caches heavily, the non-cached figure understates its footprint.
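As a rough illustration of how the two token metrics relate, the sketch below sums per-turn usage under an assumed accounting: total footprint counts cached input, non-cached input, and output tokens, while the non-cached figure drops the cache-served input. The `TurnUsage` fields, the exact split, and the sample numbers are all assumptions made for illustration; METHODOLOGY.md remains the reference for the protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnUsage:
    """Hypothetical per-turn token usage; field names are illustrative."""
    cached_input: int      # input tokens served from the prompt cache
    non_cached_input: int  # input tokens processed without a cache hit
    output: int            # tokens generated by the model


def total_footprint(turns: list[TurnUsage]) -> int:
    """Every token the model processes, summed over the session (assumed split)."""
    return sum(t.cached_input + t.non_cached_input + t.output for t in turns)


def non_cached_tokens(turns: list[TurnUsage]) -> int:
    """Assumed billable subset: everything except cache-served input."""
    return sum(t.non_cached_input + t.output for t in turns)


# Made-up numbers, not benchmark data: a static root file that is re-sent
# every turn is mostly served from the cache after the first turn.
session = [
    TurnUsage(cached_input=0, non_cached_input=40_000, output=500),
    TurnUsage(cached_input=40_000, non_cached_input=800, output=500),
    TurnUsage(cached_input=40_800, non_cached_input=900, output=500),
]

print(total_footprint(session))    # 124000
print(non_cached_tokens(session))  # 43200
```

In this toy session the non-cached figure is about a third of the total, which is why the heavily cached monolithic variant looks cheaper on non-cached tokens than its footprint suggests.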

Claude Opus 4.8

| Metric | Monolithic | Pathrule | Change |
| --- | --- | --- | --- |
| Fact accuracy | 100.0% | 100.0% | 0.0 pp |
| Action accuracy | 100.0% | 100.0% | 0.0 pp |
| Total token footprint | 417,167 | 198,069 | -52.5% |
| Non-cached tokens | 30,918 | 16,084 | -48.0% |
| Duration | 69.2 s | 69.4 s | +0.2% |

Pathrule preserved every measured fact and required action while cutting the median total footprint by 52.5% (billable non-cached tokens by 48.0%). Duration was effectively flat.

OpenAI Codex GPT-5.5

| Metric | Monolithic | Pathrule | Change |
| --- | --- | --- | --- |
| Fact accuracy | 95.2% | 93.7% | -1.6 pp |
| Action accuracy | 50.0% | 83.3% | +33.3 pp |
| Total token footprint | 412,433 | 241,849 | -41.4% |
| Non-cached tokens | 30,287 | 27,682 | -8.6% |
| Duration | 129.9 s | 105.7 s | -18.6% |

The Codex result is mixed and is reported as such: Pathrule used fewer tokens, completed faster, and followed more required actions, while fact accuracy fell by 1.6 percentage points.

Combined view

| Client | Quality result | Total footprint | Non-cached | Duration |
| --- | --- | --- | --- | --- |
| Claude Opus 4.8 | Facts and actions unchanged | -52.5% | -48.0% | +0.2% |
| OpenAI Codex GPT-5.5 | Facts -1.6 pp; actions +33.3 pp | -41.4% | -8.6% | -18.6% |
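The token percentages can be rechecked from the published medians with a few lines of arithmetic. The sketch below is a reader-side check, not part of the benchmark tooling, and `relative_change` is a name chosen here. It covers only the token rows, because the duration and accuracy values are published already rounded, so recomputing those from the table can differ in the last digit.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Percent change from baseline to candidate, rounded to one decimal."""
    return round((candidate - baseline) / baseline * 100, 1)


# Median token counts copied from the tables above: (monolithic, pathrule).
published = {
    ("Claude Opus 4.8", "total footprint"): (417_167, 198_069),
    ("Claude Opus 4.8", "non-cached"): (30_918, 16_084),
    ("OpenAI Codex GPT-5.5", "total footprint"): (412_433, 241_849),
    ("OpenAI Codex GPT-5.5", "non-cached"): (30_287, 27_682),
}

for (client, metric), (mono, path) in published.items():
    print(f"{client:22} {metric:16} {relative_change(mono, path):+.1f}%")
```

The four printed values are -52.5%, -48.0%, -41.4%, and -8.6%, matching the table.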

No forbidden fact or action hit occurred. Every unknown-fact prompt was answered with the expected abstention, and every response used the requested language.

What was tested

| Property | Published scope |
| --- | --- |
| Repository | Fastify v5.8.5 at 3983cce8124714242099e8756a7a9a80a0ba0aea |
| Fixture | Synthetic hard tier: 168 knowledge records |
| Knowledge mix | 7 relevant, 4 hard negatives, 157 unrelated |
| Session | 10 ordered English prompts, conversation state retained |
| Clients | Claude Opus 4.8 and OpenAI Codex GPT-5.5 |
| Repetitions | N=3 per client and variant |
| Scoring | Mechanical expected facts, actions, forbidden hits, and abstention |
| Isolation | Fresh worktree and isolated client/runtime state per run |
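To make "mechanical" scoring concrete, here is a minimal sketch that assumes scoring is case-insensitive substring matching against per-prompt expectations. The `PromptSpec` shape, the abstention markers, and the sample facts are hypothetical; the benchmark's own scorer may work differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptSpec:
    """Hypothetical per-prompt scoring spec; the real schema may differ."""
    expected_facts: list[str] = field(default_factory=list)
    forbidden_facts: list[str] = field(default_factory=list)
    expect_abstention: bool = False


# Assumed abstention phrasing, for illustration only.
ABSTENTION_MARKERS = ("i don't know", "not documented")


def score(response: str, spec: PromptSpec) -> dict:
    """Score one response by case-insensitive substring matching."""
    text = response.casefold()
    hits = [f for f in spec.expected_facts if f.casefold() in text]
    forbidden = [f for f in spec.forbidden_facts if f.casefold() in text]
    abstained = any(marker in text for marker in ABSTENTION_MARKERS)
    accuracy: Optional[float] = (
        len(hits) / len(spec.expected_facts) if spec.expected_facts else None
    )
    return {
        "fact_accuracy": accuracy,
        "forbidden_hits": len(forbidden),
        "abstention_ok": abstained == spec.expect_abstention,
    }


# Made-up facts, not taken from the fixture.
spec = PromptSpec(
    expected_facts=["port 4317", "retry budget is 3"],
    forbidden_facts=["port 9999"],
)
result = score("The collector listens on port 4317.", spec)
print(result)
```

Here one of two expected facts is matched, so `fact_accuracy` is 0.5 with no forbidden hits. A scorer of this kind involves no model judgment, which keeps repeated runs comparable.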

The fixture uses fabricated project knowledge layered over a pinned public Fastify snapshot. The contamination audit verifies that hidden expected knowledge does not appear in the repository or prompt text.
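A contamination check of this kind can be sketched as a plain text search. The version below assumes the hidden knowledge is available as a list of strings and that a case-insensitive substring scan is sufficient; `find_leaks` and the sample values are hypothetical, not the audit's actual implementation.

```python
import tempfile
from pathlib import Path


def find_leaks(secrets: list[str], repo_root: str, prompts: list[str]) -> list[tuple[str, str]]:
    """Return (secret, location) pairs wherever a hidden string appears."""
    lowered = [(s, s.casefold()) for s in secrets]
    leaks = []
    # Scan the prompt text first.
    for index, prompt in enumerate(prompts):
        text = prompt.casefold()
        leaks += [(s, f"prompt {index}") for s, low in lowered if low in text]
    # Then scan every file in the repository snapshot, skipping Git internals.
    for path in sorted(Path(repo_root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore").casefold()
        except OSError:
            continue  # unreadable files are skipped in this sketch
        leaks += [(s, str(path)) for s, low in lowered if low in text]
    return leaks


# Made-up secret and repository content, for illustration only.
with tempfile.TemporaryDirectory() as root:
    Path(root, "README.md").write_text("Fastify is a web framework.", encoding="utf-8")
    leaks = find_leaks(["zebra-quota-17"], root, ["Which quota applies to uploads?"])

print(leaks)  # []
```

An empty result means the hidden string appears in neither the prompts nor the repository files, so a correct answer can only come from the delivered instructions.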

Variants

monolithic renders the complete 168-record corpus into one root native instruction file: CLAUDE.md for Claude or AGENTS.md for Codex.

pathrule-current compiles the same canonical records into native path-scoped instructions and navigation metadata.

Scope note: pathrule-current covers native path-scoped compilation plus navigation. Semantic embedding ranking (BYO key / Cloud) is an additive layer that these cells do not exercise.

No Pathrule read MCP server was configured for the published runs. The comparison is between two native instruction-delivery layouts, generated from the same knowledge.
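The difference between the two layouts can be pictured with a toy renderer. The sketch below assumes each record carries a directory scope and that path-scoped output is one instruction file per scoped directory. The record shape, the function names, and the sample records are hypothetical, and Pathrule's real compiler output and navigation metadata are not reproduced here.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def render_monolithic(records: list[dict], filename: str = "CLAUDE.md") -> dict[str, str]:
    """One root file containing every record."""
    return {filename: "\n".join(f"- {r['text']}" for r in records)}


def render_path_scoped(records: list[dict], filename: str = "CLAUDE.md") -> dict[str, str]:
    """One file per scope directory, each holding only that scope's records."""
    by_scope: dict[str, list[str]] = defaultdict(list)
    for r in records:
        by_scope[r["scope"]].append(r["text"])
    return {
        str(PurePosixPath(scope) / filename): "\n".join(f"- {t}" for t in texts)
        for scope, texts in by_scope.items()
    }


# Made-up records, not taken from the 168-record fixture.
records = [
    {"scope": "lib", "text": "Hooks must not mutate the shared request context."},
    {"scope": "test", "text": "Prefer injected requests over real sockets."},
    {"scope": "lib", "text": "Log schema changes need a migration note."},
]

print(sorted(render_monolithic(records)))   # ['CLAUDE.md']
print(sorted(render_path_scoped(records)))  # ['lib/CLAUDE.md', 'test/CLAUDE.md']
```

Both renderers consume the same records, so any difference in what the model sees comes from the layout alone, which is the comparison the benchmark isolates.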

Evidence

Reproduce

Requirements:

  • Node.js >=20.11.1
  • an authenticated claude and/or codex CLI
  • a local Pathrule source checkout

npm ci
npm run fetch:repository
npm run build:fixtures
npm test

Inspect the exact graph of paid model executions without starting a model:

npm run bench -- --dry-run \
  --tiers hard \
  --clients claude,codex \
  --variants monolithic,pathrule-current \
  --runs 3 \
  --pathrule-repo ../pathrule

Execute the matrix:

npm run bench -- \
  --tiers hard \
  --clients claude,codex \
  --variants monolithic,pathrule-current \
  --runs 3 \
  --pathrule-repo ../pathrule \
  --resume

npm run sanitize:results
npm run report

Model calls may incur provider charges. Run with --dry-run first.

Honesty policy

  • Quality is shown before efficiency.
  • Missing metrics remain missing; they are not estimated.
  • Failed, timed-out, and interrupted cells remain in the run log.
  • Pathrule losses are published beside wins.
  • A cell needs at least three completed runs before supporting a public claim.
  • Observations and architectural explanations are kept separate.
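Two of these rules translate directly into a reporting guard: a median is only reported when at least three runs completed, and a missing value is never filled in. The sketch below assumes each run is a dictionary with a `status` field and optional metric values; those names and the sample numbers are illustrative, not the benchmark's run-log schema.

```python
from statistics import median
from typing import Optional

MIN_COMPLETED_RUNS = 3  # threshold stated in the policy above


def summarize_cell(runs: list[dict], metric: str) -> Optional[float]:
    """Median of `metric` over completed runs, or None when no claim is supported."""
    completed = [r for r in runs if r.get("status") == "completed"]
    if len(completed) < MIN_COMPLETED_RUNS:
        return None  # too few completed runs to support a public claim
    values = [r.get(metric) for r in completed]
    if any(v is None for v in values):
        return None  # a missing metric stays missing; nothing is estimated
    return median(values)


# Made-up runs, not benchmark data. The timed-out run stays in the log
# but does not count toward the completed total.
runs = [
    {"status": "completed", "duration_s": 10.0},
    {"status": "completed", "duration_s": 12.0},
    {"status": "timed_out"},
    {"status": "completed", "duration_s": 11.0},
]

print(summarize_cell(runs, "duration_s"))      # 11.0
print(summarize_cell(runs[:3], "duration_s"))  # None
```

Returning `None` rather than a partial median keeps an incomplete cell visibly incomplete in any downstream report.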

See METHODOLOGY.md for the complete protocol.

License

Apache-2.0.

About

Open, reproducible benchmark for Pathrule. Measures token, tool, and latency deltas of path-scoped context delivery across bare, OSS-core, and cloud editions.
