
feat(lifecycle): artifact-lifecycle loop generate→measure→promote→compose — closes #267 #364

Merged
drewstone merged 5 commits into main from feat/artifact-lifecycle-267
Jun 22, 2026

Conversation

@drewstone


What this closes

The artifact-lifecycle loop, end to end, on top of the phase-1 foundation (ArtifactRegistry + measureMarginalLift + applyArtifact). The binding problem: an empty profile has no skills, and nothing creates one, measures its value, and folds it back in a gated, provenance-tracked way. This wires that loop and proves it closes.

Plain-language frame

We make an agent self-improve a piece of its profile. The loop creates a candidate piece, measures how many extra problems it solves on a held-back exam (fresh problems it never tuned on), promotes it only if it clears that exam, stores it with the score as a receipt, and folds the winners back into the agent's profile.

What's new (wiring existing engines, not rebuilding them)

  • runLifecycle — the ONE surface-agnostic orchestrator: GENERATE (per-surface CandidateGenerator) → MEASURE each via measureMarginalLift on the held-back split → PROMOTE via a pluggable PromotionGate → STORE in ArtifactRegistry with provenance (domain, generation, generator kind, gate verdict) + the lift score.
  • CandidateGenerator (generator.ts) — the thin per-surface seam, the ONLY per-surface code. The interface the next stages implement.
  • PromotionGate (gate.ts) — thresholdPromotionGate (scalar lift) and heldOutPromotionGate, which delegates to agent-eval's HeldOutGate (paired-bootstrap CI on per-task holdout records). The held-out gate fails loud if the eval produced no per-task records — a significance claim with no data behind it is forbidden.
  • Registry invariant: promoteWithLift records the measured lift; liftOf returns it. An artifact is active IFF it carries a finite lift. composeProfile folds the top-k active artifacts ranked by lift back into a profile; a status flag without a lift receipt is invisible.
  • skillGenerator (skill-generator.ts) — DISTILL (create a skill from traces — the step skillOpt cannot do) then REFINE (optimize it). Both are injected seams (§1.5: author the profile, don't embed a loop). This is the literal answer to "empty profile has no skills".
  • lifecycles field on defineAgent — declarative per-surface config the loop reads (surface + generator + gate + compose-k).
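
The generate → measure → promote → store sequence in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's code: `Candidate`, `Verdict`, `Stored`, `thresholdGate` and `runLifecycleSketch` are hypothetical stand-ins, and the real signatures of runLifecycle, measureMarginalLift and ArtifactRegistry are not shown in this description. Leaving a rejected candidate in the `candidate` state is also an assumption.

```typescript
// Hypothetical stand-in types; the real ones live in the lifecycle module.
interface Candidate { id: string; body: string }
interface Verdict { promote: boolean; reason: string }
interface Stored {
  id: string
  status: "candidate" | "active"
  lift?: number // only present once the gate has promoted the artifact
  verdict: Verdict
}

type Generate = () => Promise<Candidate[]>
type Measure = (c: Candidate) => Promise<number> // held-back lift vs. the baseline
type Gate = (lift: number) => Verdict

// Scalar gate: promote only when the lift is finite and clears a fixed bar.
const thresholdGate = (minLift: number): Gate => (lift) => ({
  promote: Number.isFinite(lift) && lift >= minLift,
  reason: `lift=${lift} minLift=${minLift}`,
})

async function runLifecycleSketch(
  generate: Generate,
  measure: Measure,
  gate: Gate,
): Promise<Stored[]> {
  const stored: Stored[] = []
  for (const c of await generate()) {
    const lift = await measure(c) // MEASURE
    const verdict = gate(lift) // PROMOTE decision
    // STORE: the lift is recorded only for promoted artifacts.
    stored.push(
      verdict.promote
        ? { id: c.id, status: "active", lift, verdict }
        : { id: c.id, status: "candidate", verdict },
    )
  }
  return stored
}
```

The generator, the measurement and the gate are all passed in, which mirrors the description's point that only the generator is per-surface code.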

The end-to-end proof (the keystone)

src/lifecycle/closed-loop.test.ts — deterministic, no live model:

EMPTY profile (no skills) → runLifecycle distills a skill from seeded traces → measures its held-back lift (0 → 1) → the gate promotes it → composeProfile folds it back → the composed profile beats the empty one on the same held-back exam.

A second case proves a worthless distilled skill earns zero lift, fails the gate, and never composes in.
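
The registry invariant and the compose step that this test exercises can be sketched as below. The `Artifact` and `Profile` shapes and the names `isActive` and `composeTopK` are assumptions made for illustration; the real composeProfile signature is not given here.

```typescript
// Hypothetical shapes, for illustration only.
interface Artifact {
  id: string
  status: "candidate" | "active"
  lift?: number
  instruction: string
}
interface Profile { instructions: string[] }

// Invariant from the description: a status flag without a finite lift is invisible.
function isActive(a: Artifact): boolean {
  return a.status === "active" && typeof a.lift === "number" && Number.isFinite(a.lift)
}

// Fold the top-k active artifacts, ranked by lift, into a copy of the base profile.
function composeTopK(base: Profile, artifacts: Artifact[], k: number): Profile {
  const winners = artifacts
    .filter(isActive)
    .sort((a, b) => (b.lift as number) - (a.lift as number))
    .slice(0, k)
  return { instructions: [...base.instructions, ...winners.map((w) => w.instruction)] }
}
```

An artifact marked active but missing its lift receipt is filtered out before ranking, so it can never compose in.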

Verification

  • pnpm run build — clean; new exports present in dist/lifecycle.d.ts + dist/agent.d.ts.
  • pnpm run typecheck (incl. examples) — clean.
  • pnpm test — 110 files / 1085 pass, 1 pre-existing skip. Lifecycle suite: 32 pass (21 phase-1 + 2 closed-loop proof + 9 gate/compose).
  • pnpm run lint — clean.
  • Merges clean into origin/main.

CandidateGenerator interface the next stages implement

export interface CandidateGenerator<K extends ArtifactKind = ArtifactKind> {
  kind: K
  generate(ctx: GenerateContext): Promise<ArtifactInput<K>[]>
}

export interface GenerateContext {
  baseline: AgentProfile
  domain: string
  findings: ReadonlyArray<AnalystFinding>
  traces?: unknown
  signal?: AbortSignal
}

A generator proposes UNMEASURED candidate artifacts for one surface; runLifecycle owns register → measure → gate → store. skillGenerator is the reference implementation. The next stages add toolGenerator, promptGenerator, mcpGenerator against this same interface.
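
A minimal sketch of what an implementation against this interface might look like. The two interfaces are copied from above; everything else is an assumption: the bodies of `ArtifactKind`, `ArtifactInput`, `AgentProfile` and `AnalystFinding` are placeholder shapes (the real ones are not shown in this PR description), and `findingEchoGenerator` is a toy generator invented for the example.

```typescript
// Placeholder shapes (assumptions); the real types are defined elsewhere in the repo.
type ArtifactKind = "skill" | "prompt" | "tool" | "mcp"
interface ArtifactInput<K extends ArtifactKind = ArtifactKind> {
  kind: K
  payload: Record<string, string>
}
interface AgentProfile { prompt?: { instructions?: string[] } }
interface AnalystFinding { summary: string }

// Copied from the PR description.
interface GenerateContext {
  baseline: AgentProfile
  domain: string
  findings: ReadonlyArray<AnalystFinding>
  traces?: unknown
  signal?: AbortSignal
}
interface CandidateGenerator<K extends ArtifactKind = ArtifactKind> {
  kind: K
  generate(ctx: GenerateContext): Promise<ArtifactInput<K>[]>
}

// Toy generator: proposes one UNMEASURED prompt candidate per analyst finding.
// It does not measure, gate or store anything; that stays with runLifecycle.
const findingEchoGenerator: CandidateGenerator<"prompt"> = {
  kind: "prompt",
  async generate(ctx) {
    if (ctx.signal?.aborted) return [] // honour cancellation before doing work
    return ctx.findings.map((f) => ({
      kind: "prompt" as const,
      payload: { instruction: `In ${ctx.domain}: ${f.summary}` },
    }))
  },
}
```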

🤖 Generated with Claude Code

…pose

Close the artifact-lifecycle loop on top of the phase-1 foundation
(ArtifactRegistry + measureMarginalLift + applyArtifact):

- runLifecycle: the one surface-agnostic orchestrator — generate (per-surface
  CandidateGenerator) → measure each via measureMarginalLift on the held-back
  split → promote via a pluggable PromotionGate → store with provenance + lift.
- CandidateGenerator: the thin per-surface seam (the only per-surface code);
  generator.ts is the interface the next stages implement.
- PromotionGate: thresholdPromotionGate (scalar) + heldOutPromotionGate
  (delegates to agent-eval HeldOutGate, paired-bootstrap on per-task holdout
  records; fails loud without them — no fabricated significance).
- Registry invariant: promoteWithLift stamps the measured lift; an artifact is
  active IFF liftOf returns a finite number. composeProfile folds the top-k
  active artifacts (ranked by lift) back into a profile.
- skillGenerator: distill (create a skill from traces — the step skillOpt
  cannot do) then refine (optimize it) — the answer to "empty profile has no
  skills". Both steps are injected seams.
- lifecycles field on defineAgent: declarative per-surface config the loop reads.
- closed-loop.test.ts: the deterministic end-to-end proof — empty profile →
  distill → measure → promote → compose beats the empty profile on a held-back
  exam. The loop is closed end-to-end.
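
For readers unfamiliar with the paired-bootstrap idea behind heldOutPromotionGate, here is a generic sketch. It is not agent-eval's HeldOutGate: `TaskRecord`, `pairedBootstrapGate`, the percentile interval, the 95% level and the "lower bound above zero" promotion rule are all assumptions chosen for illustration. The one behaviour taken from the description is that an empty record set throws instead of returning a verdict.

```typescript
// Hypothetical per-task holdout record: the same task scored under both arms.
interface TaskRecord { taskId: string; baseline: number; candidate: number }

// Small deterministic PRNG so the sketch is reproducible (illustrative only).
function seededRandom(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

function pairedBootstrapGate(records: TaskRecord[], resamples = 2000, seed = 1) {
  // Fail loud: no per-task data means no significance claim.
  if (records.length === 0) {
    throw new Error("no per-task holdout records: refusing to claim significance")
  }
  const diffs = records.map((r) => r.candidate - r.baseline) // paired differences
  const rand = seededRandom(seed)
  const means: number[] = []
  for (let i = 0; i < resamples; i++) {
    let sum = 0
    for (let j = 0; j < diffs.length; j++) {
      sum += diffs[Math.floor(rand() * diffs.length)] as number // resample with replacement
    }
    means.push(sum / diffs.length)
  }
  means.sort((x, y) => x - y)
  const lo = means[Math.floor(0.025 * (resamples - 1))] as number
  const hi = means[Math.ceil(0.975 * (resamples - 1))] as number
  return { promote: lo > 0, lo, hi } // promote only if the whole interval is above zero
}
```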

Closes #267.
tangletools previously approved these changes Jun 22, 2026

@tangletools tangletools left a comment


✅ Auto-approved PR — f0a5fb1b

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T17:32:02Z

…+ diverse-seed basin escape

Adds promptGenerator — the per-surface adapter for the PROMPT lever of the
artifact-lifecycle loop, mirroring skillGenerator for skills. Each candidate is
a `prompt` artifact ({ instruction }) that applyArtifact appends to
profile.prompt.instructions, measured/gated/composed by the existing
surface-agnostic runLifecycle.

Two candidate sources per generation:
  - REFINE — wraps agent-eval's gepaProposer (reflective incumbent-grounded
    rewrites). The exploit arm.
  - SEED — authors N genuinely diverse fresh instruction lines from the task
    spec, each forced to a distinct framing, NOT mutations of the incumbent.
    The explore arm: lets the search jump basins instead of polishing one local
    minimum. gepaProposer structurally cannot do this — it only perturbs the
    current surface.

Both seams are injected (pure stubs in tests, router-backed in
productionPromptGenerator); the diverse author is one callLlmJson call at high
temperature. Focused test proves the loop closes through the prompt surface AND
that a refine-only run stays stuck while a seed escapes to a promoted +1.0 lift.
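
The two candidate sources can be pictured as two injected functions whose outputs are pooled before measurement. A minimal sketch under stated assumptions: `PromptCandidate`, `Refine`, `Seed` and `proposePrompts` are hypothetical names, the real seams wrap gepaProposer and a callLlmJson call whose signatures are not shown here, and dropping duplicate lines is an added assumption.

```typescript
// Hypothetical shapes for illustration.
interface PromptCandidate { instruction: string; source: "refine" | "seed" }
type Refine = (incumbent: string) => Promise<string[]> // exploit: rewrites of the incumbent
type Seed = (taskSpec: string, n: number) => Promise<string[]> // explore: fresh framings

async function proposePrompts(
  incumbent: string,
  taskSpec: string,
  nSeeds: number,
  refine: Refine,
  seed: Seed,
): Promise<PromptCandidate[]> {
  // Both arms run concurrently; neither sees the other's output.
  const [refined, seeded] = await Promise.all([refine(incumbent), seed(taskSpec, nSeeds)])
  const seen = new Set<string>([incumbent]) // never re-propose the incumbent itself
  const out: PromptCandidate[] = []
  const add = (source: PromptCandidate["source"], lines: string[]): void => {
    for (const instruction of lines) {
      if (seen.has(instruction)) continue // assumption: exact duplicates are dropped
      seen.add(instruction)
      out.push({ instruction, source })
    }
  }
  add("refine", refined)
  add("seed", seeded)
  return out
}
```

Because the seed arm is authored from the task spec and not from the incumbent, it can contribute a candidate that the refine arm would not reach by perturbation.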
tangletools previously approved these changes Jun 22, 2026

@tangletools tangletools left a comment


✅ Auto-approved PR — 9b9b0768

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T17:39:41Z

…dispatch

The buildable surfaces (tool, mcp) differ from prompt/skill: a candidate is
CODE that must compile, pass its tests, and — for an MCP — actually boot and
serve, which you cannot one-shot. So `buildableGenerator` is a supervisor
dispatch, not a single author:

  FAN OUT  N parallel candidate builds, each in its own git worktree, built by
           a real coding harness (research → implement → test → prove it
           compiles / serves). The per-candidate build is the `buildCandidate`
           seam — injectable (pure stub in tests, real harness in prod).
  FILTER   keep only VERIFIED builds (the verifier is the gate; an unverified
           worktree is never a candidate — same valid-only discipline as
           worktreeFanout's selectValidWinner).
  RANK     score each survivor by measureMarginalLift against the baseline (the
           held-back ablation — same selector the lifecycle uses; never a judge).
  EMIT     put forward the single best survivor as one tool/mcp artifact, with
           the measured lift + winning worktree ref as provenance.

The dispatch is surface-agnostic and lives in tool-generator.ts; production
wiring (tool-build.ts `worktreeBuildCandidate`) composes only shipped engines —
gitWorktreeAdapter + agenticGenerator + toolBuildPrompt/mcpBuildPrompt +
commandVerifier/mcpServeVerifier — NO new execution model. The emitted winner
still flows through runLifecycle's own measure → gate → promote → compose loop.

Focused test stubs the per-candidate build (no harness/process, cheap CI) but
exercises the real dispatch: fan-out width, verified-only filter, best-of-N
lift rank, parallel execution, and end-to-end flow into runLifecycle.
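
The four dispatch steps can be sketched with the build and the measurement stubbed as injected functions. This is an illustration only: `Build`, `Winner`, `dispatchBuilds` and the callback signatures are hypothetical, and treating a build that throws as unverified is an assumption.

```typescript
// Hypothetical shapes for illustration.
interface Build { worktree: string; verified: boolean }
interface Winner { worktree: string; lift: number }
type BuildCandidate = (index: number) => Promise<Build>
type MeasureLift = (b: Build) => Promise<number>

async function dispatchBuilds(
  n: number,
  buildCandidate: BuildCandidate,
  measureLift: MeasureLift,
): Promise<Winner | undefined> {
  // FAN OUT: start all N builds at once; a crashed build becomes undefined.
  const builds = await Promise.all(
    Array.from({ length: n }, (_, i) => buildCandidate(i).catch(() => undefined)),
  )
  // FILTER: only verified builds are candidates.
  const verified = builds.filter((b): b is Build => b !== undefined && b.verified)
  // RANK: score each survivor by its measured lift, highest first.
  const scored = await Promise.all(
    verified.map(async (b) => ({ worktree: b.worktree, lift: await measureLift(b) })),
  )
  scored.sort((a, b) => b.lift - a.lift)
  // EMIT: the single best survivor, or undefined when nothing verified.
  return scored[0]
}
```

An unverified build is dropped before it is ever measured, so a high lift from a worktree that failed its verifier cannot win.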
tangletools previously approved these changes Jun 22, 2026

@tangletools tangletools left a comment


✅ Auto-approved PR — c09f8dce

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T17:46:49Z

…t#267 stage 4)

Complete the artifact state machine candidate→active→{decayed,retired} and add
the two maintenance stages that keep the active set honest after promotion:

- driftWatch: a scheduled re-measure of every active artifact that re-runs the
  same measureMarginalLift ablation over the current baseline and DEMOTES
  (active→decayed) any whose held-back lift fell below the keep-bar (absolute
  minLift and/or relative maxRelativeDecay). Reversible: a recovered lift
  re-promotes. Shares one baseline arm across the set.
- dedupeArtifacts: a measurement judge over active artifact PAIRS that retires
  (→retired, terminal) the weaker member of any pair whose lifts do not stack
  (combined < sum − tolerance). Caches per-artifact lift across pairs; collapses
  a mutually-redundant cluster to its strongest member.
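
The two decisions described above reduce to small predicates once the lifts are measured. A sketch under stated assumptions: `KeepBar`, `shouldDemote` and `weakerIfRedundant` are hypothetical names, relative decay is assumed to mean (lift at promotion − current lift) / lift at promotion, and ties are assumed to retire the first artifact. Only the `combined < sum − tolerance` rule is taken directly from the text.

```typescript
// Hypothetical keep-bar: either bound may be set; an unset bound is not checked.
interface KeepBar { minLift?: number; maxRelativeDecay?: number }

// driftWatch decision: true means demote active -> decayed.
function shouldDemote(promotedLift: number, currentLift: number, bar: KeepBar): boolean {
  if (!Number.isFinite(currentLift)) return true // no usable receipt, no active status
  if (bar.minLift !== undefined && currentLift < bar.minLift) return true
  if (bar.maxRelativeDecay !== undefined && promotedLift > 0) {
    const decay = (promotedLift - currentLift) / promotedLift // assumed definition
    if (decay > bar.maxRelativeDecay) return true
  }
  return false
}

interface Scored { id: string; lift: number }

// dedupeArtifacts decision for one pair: returns the id to retire, or undefined
// when the two lifts stack (combined >= sum - tolerance).
function weakerIfRedundant(
  a: Scored,
  b: Scored,
  combinedLift: number,
  tolerance: number,
): string | undefined {
  if (combinedLift >= a.lift + b.lift - tolerance) return undefined
  return a.lift <= b.lift ? a.id : b.id // assumption: ties retire the first argument
}
```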

Both are surface-agnostic and reuse the existing EvalRunner + measureMarginalLift
+ applyArtifacts bridge — no new execution model. Registry gains demote()/retire()
transitions (and lifecycleReasonKey for the audit trail); status 'promoted' is
renamed to the canonical 'active', and composeProfile/liftOf count only active
artifacts so decayed/retired drop out of composed profiles automatically.
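
The state machine described in this commit can be written down as a transition table. A minimal sketch: `Status`, `TRANSITIONS` and `transition` are hypothetical names, and the table encodes only the edges the text states (promotion, demotion, re-promotion after recovery, terminal retirement). Whether a decayed artifact can be retired directly is not stated, so that edge is left out here as an assumption.

```typescript
type Status = "candidate" | "active" | "decayed" | "retired"

// Allowed edges, taken from the commit message.
const TRANSITIONS: Record<Status, readonly Status[]> = {
  candidate: ["active"], // promote
  active: ["decayed", "retired"], // demote (driftWatch) or retire (dedupeArtifacts)
  decayed: ["active"], // reversible: a recovered lift re-promotes
  retired: [], // terminal
}

// Returns the new status, or throws on an edge the table does not allow.
function transition(from: Status, to: Status): Status {
  if (TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`illegal transition ${from} -> ${to}`)
  }
  return to
}

// Only active artifacts count toward composed profiles.
const countsForCompose = (s: Status): boolean => s === "active"
```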

Regenerated docs/api (TypeDoc) to keep the freshness gate green.
tangletools previously approved these changes Jun 22, 2026

@tangletools tangletools left a comment


✅ Auto-approved PR — d45e4e30

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T17:56:52Z

…pi row + regenerated api docs

Bump to 0.74.0. Add the artifact-lifecycle decision-table row (runLifecycle →
composeProfile → driftWatch/dedupeArtifacts; thin per-surface CandidateGenerator).
Regenerate docs/api after drift-watch.ts/dedupe.ts became git-tracked so their
source citations link instead of falling back to bare paths.

@tangletools tangletools left a comment


✅ Auto-approved PR — 2795bb12

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-22T18:00:36Z


2 participants