bun-server-bench

Can a coding agent build a correct backend service — not just code that looks right, but code that holds up under the edge cases production throws at it?

bun-server-bench is a benchmark and trajectory dataset that answers that question for one narrow, high-signal domain: real Bun server engineering. The tasks cover HTTP semantics, authentication, SQLite transactions, idempotency, concurrency, rate limiting, queues, observability, WebSockets, and uploads. There are fifty versioned tasks, each engineered so that a plausible-but-wrong implementation passes the public tests and fails the hidden ones.

It is not a framework. It is not a throughput benchmark. A fast server that returns the wrong status code scores zero.

The 30-second proof

Take idempotency.payment-capture.v1. The agent must build a payment endpoint that never double-charges on retries. The public tests check the obvious path: same key + same body replays the original response; same key + different body conflicts.

A reasonable implementation passes all of that. Then the hidden suite fires five identical requests at the same key simultaneously:

const responses = await Promise.all(
  Array.from({ length: 5 }, () =>
    capture(key, { amount: 777, currency: "USD" })
  )
);
const uniqueIds = new Set(
  (await Promise.all(responses.map((r) => r.json()))).map((b) => b.id)
);
expect(uniqueIds.size).toBe(1); // exactly one payment, not five

Any solution that checks the key map and then creates the payment across an await boundary creates five payments and fails. Passing requires a per-key single-flight lock. That gap — between code that looks correct and code that is correct — is what every task in this benchmark is built to measure.
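A minimal sketch of the single-flight idea, under stated assumptions: it is not the benchmark's reference solution, and `entries`, `captureOnce`, and `createPayment` are hypothetical names with an in-memory map standing in for real storage. The point it illustrates is that the key lookup and the insert of the pending promise happen in one synchronous step, so concurrent retries share a single payment instead of each creating their own.

```typescript
type CaptureBody = { amount: number; currency: string };
type Payment = { id: string; amount: number; currency: string };
type Entry = { fingerprint: string; result: Promise<Payment> };

// Hypothetical in-memory store: idempotency key -> body fingerprint + pending/settled result.
const entries = new Map<string, Entry>();
let nextId = 0;

// Stand-in for real payment creation; the timer simulates async I/O.
async function createPayment(body: CaptureBody): Promise<Payment> {
  await new Promise((resolve) => setTimeout(resolve, 10));
  nextId += 1;
  return { id: `pay_${nextId}`, amount: body.amount, currency: body.currency };
}

// Deliberately not async: there is no await between the lookup and the set,
// so no other request can run between the check and the insert.
function captureOnce(key: string, body: CaptureBody): Promise<Payment> {
  const fingerprint = `${body.amount}:${body.currency}`;
  const existing = entries.get(key);
  if (existing) {
    if (existing.fingerprint !== fingerprint) {
      // Same key, different body: report a conflict instead of replaying.
      return Promise.reject(new Error(`idempotency key ${key} reused with a different body`));
    }
    return existing.result; // replay the in-flight or completed capture
  }
  const result = createPayment(body);
  entries.set(key, { fingerprint, result });
  return result;
}
```

A real service would also need to decide what to do when `createPayment` rejects (this sketch keeps the rejected promise cached) and would persist entries rather than hold them in memory.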

Why this measures something real → docs/thesis.md
How the traps are engineered → docs/task-anatomy.md

Why it exists

Frontier models now saturate general coding suites. Near-perfect scores stop telling you where agents still fail. bun-server-bench narrows the domain to production-shaped Bun services — where small contract mistakes (validation order, a missing transaction, an off-by-one cursor) are the whole game — and engineers discrimination into every task so the score keeps carrying signal.

It was built during TinyComputer's research into whether small specialized models can match frontier behavior on a narrow engineering domain. The benchmark stands on its own: the tasks, scoring, integrity guarantees, Harbor packages, and trajectory exports are useful to anyone evaluating or training coding agents.

At a glance


Authored tasks	50
Exported Harbor packages	50
Public / hidden test suites	50 / 50
Reference solutions	50
Runtime dependencies allowed per task	0

Tasks per difficulty level (1 easiest, 5 hardest): level 1: 7 · level 2: 3 · level 3: 2 · level 4: 20 · level 5: 18. Tasks per split: train 4 · dev 44 · public_eval 0 · private_eval 2. See docs/splits-and-leakage.md.

Where to go next

If you are…	Read
Deciding whether this is worth your time	docs/thesis.md → docs/results.md
Evaluating an agent	docs/guides/evaluate-your-agent.md
Training a model on the trajectories	docs/guides/train-on-trajectories.md
Contributing a task	docs/guides/contribute-a-task.md
Wondering whether agents can cheat it	docs/integrity.md
Looking for the normative spec	docs/reference/

Full documentation map: docs/README.md.

Install and run one task

bun install
bun run validate          # all 50 tasks structurally valid

Run a task's reference solution end-to-end (start the service, run public + hidden tests, score it):

bun run run:reference tasks/http-apis.todo-health.v1

Run an agent against a task:

bun run run:agent --task tasks/authentication.jwt-verify.v1 --agent claude-code

Run a published package through Harbor, the canonical execution engine:

harbor run -p harbor/databases-optimistic-version-v1 --agent oracle -e docker -y

Each run writes artifacts (prompt, patch, logs, score) under runs/<timestamp>-<task-id>/. The full quickstart — suites, concurrency, resume, exports — is in docs/quickstart.md.

How scoring works

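Scoring is a gate with three fixed outcomes. There is no partial credit for almost-correct: a run that nearly passes the hidden suite scores the same as one that fails it outright.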

Outcome	Score
Public and hidden tests pass	100
Public pass, hidden fail	25
Public fail, or install / startup / timeout failure	0

Harbor packages emit the same contract via reward.txt (1.0 / 0.25 / 0.0). Details, including the forward-looking weighted-scoring schema that the runner does not yet enforce, are in docs/reference/scoring.md.
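The gate in the table can be sketched as a small function. This is an illustration of the mapping only, not the runner's code: `RunOutcome`, `score`, and `reward` are hypothetical names, and the exact text written to reward.txt is not specified here.

```typescript
// Hypothetical summary of one run; field names are assumptions for illustration.
type RunOutcome = {
  started: boolean;      // install and startup succeeded and the run did not time out
  publicPassed: boolean; // every public test passed
  hiddenPassed: boolean; // every hidden test passed
};

// Gate scoring: 100 for public + hidden, 25 for public only, otherwise 0.
function score(outcome: RunOutcome): number {
  if (!outcome.started || !outcome.publicPassed) return 0;
  return outcome.hiddenPassed ? 100 : 25;
}

// The same contract on the 0..1 scale used by Harbor rewards (1.0 / 0.25 / 0.0).
function reward(outcome: RunOutcome): number {
  return score(outcome) / 100;
}
```

Note that a public failure scores 0 even if the hidden suite happens to pass, which matches the last row of the table.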

Why Bun

Bun compresses a modern server stack into a small surface area: Bun.serve gives HTTP and WebSocket primitives without framework ceremony, bun:sqlite enables real persistence and transactions with no external service, and native TypeScript keeps the task loop tight. The runtime is young enough that memorization pressure is low, and the domain is narrow enough to train a specialist yet rich enough to demand real engineering judgment. Every task ships with zero runtime dependencies and network disabled — the agent must implement the capability, not import it. See docs/integrity.md.

License

Tasks declare Apache-2.0. Preserve license metadata when redistributing tasks, Harbor packages, or dataset exports.
