feat(lifecycle): artifact-lifecycle loop generate→measure→promote→compose (closes #267) #364
…pose
Close the artifact-lifecycle loop on top of the phase-1 foundation
(ArtifactRegistry + measureMarginalLift + applyArtifact):
- runLifecycle: the one surface-agnostic orchestrator. Generate (per-surface
CandidateGenerator) → measure each via measureMarginalLift on the held-back
split → promote via a pluggable PromotionGate → store with provenance + lift.
- CandidateGenerator: the thin per-surface seam (the only per-surface code);
generator.ts is the interface the next stages implement.
- PromotionGate: thresholdPromotionGate (scalar) + heldOutPromotionGate
(delegates to agent-eval HeldOutGate, paired-bootstrap on per-task holdout
records; fails loud without them, so no fabricated significance).
- Registry invariant: promoteWithLift stamps the measured lift; an artifact is
active IFF liftOf returns a finite number. composeProfile folds the top-k
active artifacts (ranked by lift) back into a profile.
- skillGenerator: distill (create a skill from traces, the step skillOpt cannot
do) then refine (optimize it). This is the answer to "empty profile has no
skills". Both steps are injected seams.
- lifecycles field on defineAgent: declarative per-surface config the loop reads.
- closed-loop.test.ts: the deterministic end-to-end proof. Empty profile →
distill → measure → promote → compose beats the empty profile on a held-back
exam.
The loop is closed end-to-end. Closes #267.
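The registry invariant and top-k composition described in this commit can be illustrated with a minimal sketch. Only the names promoteWithLift, liftOf, and composeProfile come from the commit message; the `SketchRegistry` class, the `Artifact` and `Profile` shapes, and every method body are assumptions made for illustration and are not the shipped ArtifactRegistry API.

```typescript
// Hypothetical shapes for illustration; the real registry types may differ.
type Artifact = { id: string; instruction: string };
type Profile = { instructions: string[] };

class SketchRegistry {
  private artifacts: Artifact[] = [];
  private lifts: Record<string, number | undefined> = {};

  register(artifact: Artifact): void {
    this.artifacts.push(artifact);
  }

  // Promotion stamps the measured lift; a non-finite lift is rejected outright.
  promoteWithLift(id: string, lift: number): void {
    if (!this.artifacts.some((a) => a.id === id)) {
      throw new Error(`unknown artifact: ${id}`);
    }
    if (!Number.isFinite(lift)) {
      throw new Error(`lift must be finite, got ${lift}`);
    }
    this.lifts[id] = lift;
  }

  // Returns the stamped lift, or undefined if the artifact was never promoted.
  liftOf(id: string): number | undefined {
    return this.lifts[id];
  }

  // The invariant: an artifact is active IFF liftOf returns a finite number.
  active(): Artifact[] {
    return this.artifacts.filter((a) => Number.isFinite(this.liftOf(a.id)));
  }
}

// Fold the top-k active artifacts, ranked by lift (highest first), into a profile.
function composeProfile(base: Profile, registry: SketchRegistry, k: number): Profile {
  const ranked = registry
    .active()
    .sort((a, b) => (registry.liftOf(b.id) ?? 0) - (registry.liftOf(a.id) ?? 0))
    .slice(0, k);
  return { instructions: [...base.instructions, ...ranked.map((a) => a.instruction)] };
}
```

Because activity is derived from the stored lift rather than from a separate status flag, an artifact that was registered but never measured cannot appear in a composed profile in this sketch.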
tangletools
left a comment
✅ Auto-approved PR — f0a5fb1b
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T17:32:02Z
…+ diverse-seed basin escape
Adds promptGenerator — the per-surface adapter for the PROMPT lever of the
artifact-lifecycle loop, mirroring skillGenerator for skills. Each candidate is
a `prompt` artifact ({ instruction }) that applyArtifact appends to
profile.prompt.instructions, measured/gated/composed by the existing
surface-agnostic runLifecycle.
Two candidate sources per generation:
- REFINE — wraps agent-eval's gepaProposer (reflective incumbent-grounded
rewrites). The exploit arm.
- SEED — authors N genuinely diverse fresh instruction lines from the task
spec, each forced to a distinct framing, NOT mutations of the incumbent.
The explore arm: lets the search jump basins instead of polishing one local
minimum. gepaProposer structurally cannot do this — it only perturbs the
current surface.
Both seams are injected (pure stubs in tests, router-backed in
productionPromptGenerator); the diverse author is one callLlmJson call at high
temperature. Focused test proves the loop closes through the prompt surface AND
that a refine-only run stays stuck while a seed escapes to a promoted +1.0 lift.
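The two candidate sources above can be sketched as a generator that takes both arms as injected functions. This is an illustration under stated assumptions: `proposePrompts`, the `Refine` and `SeedAuthor` signatures, and the de-duplication step are hypothetical and do not reproduce promptGenerator, gepaProposer, or callLlmJson.

```typescript
// Hypothetical candidate shape; the commit only says a candidate is { instruction }.
type PromptCandidate = { source: "refine" | "seed"; instruction: string };

// Injected seams (assumed signatures): pure stubs in tests, model-backed in production.
type Refine = (incumbent: string) => Promise<string[]>;
type SeedAuthor = (taskSpec: string, n: number) => Promise<string[]>;

async function proposePrompts(
  incumbent: string,
  taskSpec: string,
  refine: Refine,
  seed: SeedAuthor,
  seedCount: number,
): Promise<PromptCandidate[]> {
  // Run the exploit arm (refine) and the explore arm (seed) concurrently.
  const [refined, seeded] = await Promise.all([
    refine(incumbent),
    seed(taskSpec, seedCount),
  ]);

  // Drop blanks, the incumbent itself, and exact duplicates so that each
  // distinct instruction is measured once.
  const seen = new Set<string>([incumbent.trim()]);
  const candidates: PromptCandidate[] = [];
  const add = (source: PromptCandidate["source"], lines: string[]): void => {
    for (const line of lines) {
      const instruction = line.trim();
      if (instruction === "" || seen.has(instruction)) continue;
      seen.add(instruction);
      candidates.push({ source, instruction });
    }
  };
  add("refine", refined);
  add("seed", seeded);
  return candidates;
}
```

Tagging each candidate with its source keeps the provenance needed to tell, after measurement, whether a promoted instruction came from the refine arm or the seed arm.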
tangletools
left a comment
✅ Auto-approved PR — 9b9b0768
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T17:39:41Z
…dispatch
The buildable surfaces (tool, mcp) differ from prompt/skill: a candidate is
CODE that must compile, pass its tests, and — for an MCP — actually boot and
serve, which you cannot one-shot. So `buildableGenerator` is a supervisor
dispatch, not a single author:
FAN OUT N parallel candidate builds, each in its own git worktree, built by
a real coding harness (research → implement → test → prove it
compiles / serves). The per-candidate build is the `buildCandidate`
seam — injectable (pure stub in tests, real harness in prod).
FILTER keep only VERIFIED builds (the verifier is the gate; an unverified
worktree is never a candidate — same valid-only discipline as
worktreeFanout's selectValidWinner).
RANK score each survivor by measureMarginalLift against the baseline (the
held-back ablation — same selector the lifecycle uses; never a judge).
EMIT put forward the single best survivor as one tool/mcp artifact, with
the measured lift + winning worktree ref as provenance.
The dispatch is surface-agnostic and lives in tool-generator.ts; production
wiring (tool-build.ts `worktreeBuildCandidate`) composes only shipped engines —
gitWorktreeAdapter + agenticGenerator + toolBuildPrompt/mcpBuildPrompt +
commandVerifier/mcpServeVerifier — NO new execution model. The emitted winner
still flows through runLifecycle's own measure → gate → promote → compose loop.
Focused test stubs the per-candidate build (no harness/process, cheap CI) but
exercises the real dispatch: fan-out width, verified-only filter, best-of-N
lift rank, parallel execution, and end-to-end flow into runLifecycle.
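The FAN OUT → FILTER → RANK → EMIT dispatch described above can be sketched as follows. This is a simplified illustration, not the code in tool-generator.ts: `dispatchBuildable`, the `Build` and `Winner` shapes, and the `measureLift` callback are hypothetical names, and the sketch assumes a failed build is reported as `verified: false` rather than as a thrown error.

```typescript
// Hypothetical result of one per-candidate build in its own worktree.
type Build = { worktreeRef: string; verified: boolean };
// Hypothetical emitted winner: the measured lift plus the winning worktree ref.
type Winner = { worktreeRef: string; lift: number };

async function dispatchBuildable(
  width: number,
  buildCandidate: (index: number) => Promise<Build>, // injected seam
  measureLift: (build: Build) => Promise<number>,    // stands in for measureMarginalLift
): Promise<Winner | undefined> {
  // FAN OUT: start all builds before awaiting any, so they run in parallel.
  const builds = await Promise.all(
    Array.from({ length: width }, (_, i) => buildCandidate(i)),
  );

  // FILTER: an unverified worktree is never a candidate.
  const verified = builds.filter((b) => b.verified);

  // RANK: score each survivor by its measured lift against the baseline.
  const scored = await Promise.all(
    verified.map(async (b) => ({ worktreeRef: b.worktreeRef, lift: await measureLift(b) })),
  );

  // EMIT: put forward the single best survivor, or nothing if none survived.
  let best: Winner | undefined;
  for (const s of scored) {
    if (best === undefined || s.lift > best.lift) best = s;
  }
  return best;
}
```

Returning `undefined` when no build verifies keeps the valid-only discipline: the caller gets no candidate at all rather than a best-effort unverified one.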
tangletools
left a comment
✅ Auto-approved PR — c09f8dce
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T17:46:49Z
…t#267 stage 4)
Complete the artifact state machine candidate→active→{decayed,retired} and add
the two maintenance stages that keep the active set honest after promotion:
- driftWatch: a scheduled re-measure of every active artifact that re-runs the
same measureMarginalLift ablation over the current baseline and DEMOTES
(active→decayed) any whose held-back lift fell below the keep-bar (absolute
minLift and/or relative maxRelativeDecay). Reversible: a recovered lift
re-promotes. Shares one baseline arm across the set.
- dedupeArtifacts: a measurement judge over active artifact PAIRS that retires
(→retired, terminal) the weaker member of any pair whose lifts do not stack
(combined < sum − tolerance). Caches per-artifact lift across pairs; collapses
a mutually-redundant cluster to its strongest member.
Both are surface-agnostic and reuse the existing EvalRunner + measureMarginalLift
+ applyArtifacts bridge — no new execution model. Registry gains demote()/retire()
transitions (and lifecycleReasonKey for the audit trail); status 'promoted' is
renamed to the canonical 'active', and composeProfile/liftOf count only active
artifacts so decayed/retired drop out of composed profiles automatically.
Regenerated docs/api (TypeDoc) to keep the freshness gate green.
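The two maintenance checks described in this commit reduce to small predicates, sketched below. The parameter names minLift, maxRelativeDecay, and tolerance come from the commit message, but their exact semantics here are assumptions: in particular, relative decay is taken to mean the fraction of the promoted lift that has been lost, and the helper names `keepsActive` and `redundantLoser` are hypothetical.

```typescript
// Keep-bar for driftWatch: either bound may be omitted (assumed semantics).
type KeepBar = { minLift?: number; maxRelativeDecay?: number };

// True if a re-measured artifact stays active; false means demote to decayed.
function keepsActive(promotedLift: number, currentLift: number, bar: KeepBar): boolean {
  // Absolute floor on the re-measured held-back lift.
  if (bar.minLift !== undefined && currentLift < bar.minLift) return false;
  // Relative bound: fraction of the originally promoted lift that was lost.
  if (bar.maxRelativeDecay !== undefined && promotedLift > 0) {
    const decay = (promotedLift - currentLift) / promotedLift;
    if (decay > bar.maxRelativeDecay) return false;
  }
  return true;
}

type Measured = { id: string; lift: number };

// Pair check for dedupeArtifacts: if the combined lift falls short of the sum
// of the individual lifts by more than the tolerance, the lifts do not stack
// and the weaker member is the one to retire. Returns its id, or undefined.
function redundantLoser(
  a: Measured,
  b: Measured,
  combinedLift: number,
  tolerance: number,
): string | undefined {
  const stacks = combinedLift >= a.lift + b.lift - tolerance;
  if (stacks) return undefined;
  return a.lift >= b.lift ? b.id : a.id;
}
```

For example, with individual lifts of 0.4 and 0.3 and a tolerance of 0.05, a combined lift of 0.45 is well below 0.65, so the artifact with lift 0.3 would be retired in this sketch, while a combined lift of 0.7 would leave both active.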
tangletools
left a comment
✅ Auto-approved PR — d45e4e30
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T17:56:52Z
…pi row + regenerated api docs
Bump to 0.74.0. Add the artifact-lifecycle decision-table row (runLifecycle →
composeProfile → driftWatch/dedupeArtifacts; thin per-surface
CandidateGenerator). Regenerate docs/api after drift-watch.ts/dedupe.ts became
git-tracked so their source citations link instead of falling back to bare paths.
tangletools
left a comment
✅ Auto-approved PR — 2795bb12
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T18:00:36Z
What this closes
The artifact-lifecycle loop, end to end, on top of the phase-1 foundation (ArtifactRegistry + measureMarginalLift + applyArtifact). The binding problem: an empty profile has no skills, and nothing creates one, measures its value, and folds it back in a gated, provenance-tracked way. This wires that loop and proves it closes.
Plain-language frame
We make an agent self-improve a piece of its profile. The loop creates a candidate piece, measures how many extra problems it solves on a held-back exam (fresh problems it never tuned on), promotes it only if it clears that exam, stores it with the score as a receipt, and folds the winners back into the agent's profile.
What's new (wiring existing engines, not rebuilding them)
- runLifecycle: the ONE surface-agnostic orchestrator. GENERATE (per-surface CandidateGenerator) → MEASURE each via measureMarginalLift on the held-back split → PROMOTE via a pluggable PromotionGate → STORE in ArtifactRegistry with provenance (domain, generation, generator kind, gate verdict) + the lift score.
- CandidateGenerator (generator.ts): the thin per-surface seam, the ONLY per-surface code. The interface the next stages implement.
- PromotionGate (gate.ts): thresholdPromotionGate (scalar lift) and heldOutPromotionGate, which delegates to agent-eval's HeldOutGate (paired-bootstrap CI on per-task holdout records). The held-out gate fails loud if the eval produced no per-task records; a significance claim with no data behind it is forbidden.
- promoteWithLift records the measured lift; liftOf returns it. An artifact is active IFF it carries a finite lift. composeProfile folds the top-k active artifacts ranked by lift back into a profile; a status flag without a lift receipt is invisible.
- skillGenerator (skill-generator.ts): DISTILL (create a skill from traces, the step skillOpt cannot do) then REFINE (optimize it). Both are injected seams (§1.5: author the profile, don't embed a loop). This is the literal answer to "empty profile has no skills".
- lifecycles field on defineAgent: declarative per-surface config the loop reads (surface + generator + gate + compose-k).
The end-to-end proof (the keystone)
src/lifecycle/closed-loop.test.ts (deterministic, no live model): empty profile → distill → measure → promote → compose beats the empty profile on a held-back exam.
A second case proves a worthless distilled skill earns zero lift, fails the gate, and never composes in.
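The GENERATE → MEASURE → PROMOTE → STORE flow listed under "What's new" can be sketched as a single loop over injected parts. This is an illustrative reduction, not the real runLifecycle: the interface shapes, the `runLifecycleSketch` and `thresholdGate` helpers, and the strict "greater than" comparison in the gate are all assumptions.

```typescript
// Hypothetical shapes; the real CandidateGenerator / PromotionGate types may differ.
type Candidate = { id: string };
type GateVerdict = { promote: boolean; reason: string };

interface CandidateGenerator {
  kind: string;
  // Proposes UNMEASURED candidates for one surface.
  propose(generation: number): Promise<Candidate[]>;
}

type PromotionGate = (lift: number) => GateVerdict;

type StoredArtifact = Candidate & {
  lift: number;
  provenance: { generation: number; generatorKind: string; verdict: GateVerdict };
};

// Scalar gate: promote only when the measured lift exceeds the bar (assumed strict).
const thresholdGate = (minLift: number): PromotionGate => (lift) =>
  lift > minLift
    ? { promote: true, reason: `lift ${lift} > ${minLift}` }
    : { promote: false, reason: `lift ${lift} <= ${minLift}` };

async function runLifecycleSketch(
  generator: CandidateGenerator,
  measureLift: (candidate: Candidate) => Promise<number>, // held-back ablation stand-in
  gate: PromotionGate,
  generation: number,
): Promise<StoredArtifact[]> {
  const stored: StoredArtifact[] = [];
  for (const candidate of await generator.propose(generation)) {
    const lift = await measureLift(candidate);  // MEASURE
    const verdict = gate(lift);                 // PROMOTE?
    if (!verdict.promote) continue;             // failed candidates are never stored as active
    stored.push({                               // STORE with provenance + lift
      ...candidate,
      lift,
      provenance: { generation, generatorKind: generator.kind, verdict },
    });
  }
  return stored;
}
```

With a stub generator that proposes one useful and one worthless candidate, only the useful one clears the gate, which mirrors the two cases the closed-loop test is described as covering.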
Verification
- pnpm run build: clean; new exports present in dist/lifecycle.d.ts + dist/agent.d.ts.
- pnpm run typecheck (incl. examples): clean.
- pnpm test: 110 files / 1085 pass, 1 pre-existing skip. Lifecycle suite: 32 pass (21 phase-1 + 2 closed-loop proof + 9 gate/compose).
- pnpm run lint: clean.
- origin/main.
CandidateGenerator interface the next stages implement
A generator proposes UNMEASURED candidate artifacts for one surface; runLifecycle owns register → measure → gate → store. skillGenerator is the reference implementation. The next stages add toolGenerator, promptGenerator, mcpGenerator against this same interface.
🤖 Generated with Claude Code