feat: context compaction strategies for the react loop#996
Conversation
|
please add @ramon-astudillo as an observer |
|
(original PR body) SummaryAdds an optional Empirically on the BCP benchmark with Granite 4.1-8b, Backwards compatible: MotivationLong agentic loops — especially retrieval-heavy ones — pile up tool responses. Each react iteration re-sends the full history to the model, so prompt-token cost grows quadratically and the loop can exhaust the model's context window before reaching a final answer. Compaction trims that history, lowering both the dollar cost of inference and the likelihood of hitting the context / timeout wall. On the BrowseCompPlus (BCP) benchmark with Granite 4.1-8b (131K context, 830 questions,
Hardware / infrastructure for the table above:
Measurement was done with a separate BCP eval harness (forthcoming PR); the harness data is included here as motivating evidence — the compaction feature itself has no external runtime dependency. DesignProtocol —
|
|
The PR description has been updated. Please fill out the template for your PR to be reviewed. |
|
@yelkurdi I updated the PR body so that the update-pr-body check would pass (and copied your original body into a comment). Now that the check is passing, feel free to re-edit to include your original comments in the appropriate section. |
|
relevant related discussion #1099 |
|
Will update according to #1099 |
Component PR ChecklistUse this checklist when adding or modifying components in Protocol Compliance
Content Blocks
Integration
|
planetf1
left a comment
There was a problem hiding this comment.
Thanks for this — the Compactor architecture is really well thought through, and the test coverage in test_compactor.py is thorough. A few things to sort out before this lands, mostly around LLMSummarizeCompactor edge cases and a couple of quick doc fixes.
|
@planetf1 I have incorporated your feedback, I've also ran the tests and the full BCP deep search task tests (3 runs of each compaction method), all checks out. |
planetf1
left a comment
There was a problem hiding this comment.
Second pass after the latest fixes — most of the earlier thread was addressed cleanly, thanks. A few new things from this pass, plus one CI-blocker in the docs example.
Heads up — PR is currently in DIRTY merge state (conflicts against main). Worth a rebase before final review.
Findings inline; one separate top-level note on _run_coro_blocking as a follow-up suggestion (not reopening the resolved thread — your docstring update there is solid).
|
Following up on your docstring update on Have you considered routing the coroutine through The benefit is mostly for the sync-from-outside-loop case (someone calling Totally fine as a follow-up issue if you'd rather not expand scope here. |
Adds CompactionStrategy abstraction and KeepLastN implementation to mellea/stdlib/compaction.py, wires an optional compaction parameter into the react() loop, and adds full test coverage in test/stdlib/test_compaction.py. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Switches `CompactionStrategy.threshold` from a component-count trigger to a token-count trigger, read from the most recent `ModelOutputThunk.usage` populated by the backend. This aligns compaction with the real constraint (context size) and sidesteps per-backend tokenizer dependencies by using provider-reported usage; the trade-off is a one-turn lag since usage is recorded at the end of each model call. Also reorders the react loop so compaction runs after the final-answer check, skipping wasted work (and a wasted LLM call for LLMSummarize) on terminal turns. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Signed-off-by: ramon-astudillo <ramon@astudillo.com>
Move the compaction strategies alongside the react framework they serve: - mellea/stdlib/compaction.py -> mellea/stdlib/frameworks/react_compaction.py - test/stdlib/test_compaction.py -> test/stdlib/frameworks/test_react_compaction.py Imports and module docstrings updated accordingly. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
The docstring quality gate (tooling/docs-autogen/audit_coverage.py --quality --threshold 100) requires each documented symbol to have its own Args/Returns sections — inheritance from the abstract parent is not consulted. Six issues were reported against the compact() overrides on ClearAll, KeepLastN, and LLMSummarize. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Replaces the original async ``react_compaction`` strategies (ClearAll,
KeepLastN, LLMSummarize) with a generic, sync ``Compactor`` protocol
that operates on any ``Context``. ``ReACT`` and ``ChatContext`` are
rewired around the new protocol; sample callers, tests, and docs are
updated.
Squash of 29 Mellea-side commits from
context_compaction_for_react_2; the BCP eval harness commits in that
branch are intentionally excluded.
mellea/stdlib/context/ becomes a package
- Compactor protocol: sync ``compact(ctx, *, backend=None) -> Context``
- WindowCompactor(size, pin_predicate)
keep last-N body components; ``size=0`` clears the body and
retains only the pinned prefix
- ThresholdCompactor(inner, threshold)
token-gated wrapper that reads cumulative context size from the
most recent ModelOutputThunk's ``generation.usage`` and forwards
to ``inner.compact`` only above the gate
- LLMSummarizeCompactor(keep_n, pin_predicate, prompt_template)
summarizes old body components via the backend; the (async)
backend call is hidden behind a sync ``compact()`` via
``_run_coro_blocking`` so the protocol stays sync
- PinPredicate API: ``pin_nothing``, ``pin_system``,
``pin_system_and_initial_user``; chat compactors compose freely
mellea/stdlib/frameworks/react.py
- ``react()`` gains a ``compactor: Compactor | None = None`` per-turn
hook; invoked once after each tool observation
- The old ``react_compaction`` module is removed
mellea/stdlib/components/react.py
- ``pin_react_initiator``: a PinPredicate that pins everything up to
and including the first ``ReactInitiator``
- ``react_summary_prompt(goal=None, max_tokens_hint=None)``: factory
that returns a research-flavoured summary prompt template (with the
{conversation} placeholder LLMSummarizeCompactor expects). Optional
``GOAL: <goal>`` line and optional ``- Be at most ~N tokens`` bullet
when callers want goal anchoring or length-cap hints.
mellea/stdlib/context/chat.py
- ``ChatContext()`` defaults to no compactor (full history); pass
``compactor=`` or ``window_size=`` for opt-in compaction. Matches
upstream main's window_size=None unbounded semantics.
Test coverage
- test/stdlib/test_compactor.py (~500 LOC): protocol semantics;
Window / Threshold / LLMSummarize behaviours; pin-predicate edge
cases; ``size=0`` collapse; threshold gate edge cases
- test/stdlib/frameworks/test_react_framework.py (~210 LOC):
react() per-turn hook integration + react_summary_prompt
(default, goal interpolation, brace escaping, max_tokens_hint
bullet ordering, LLMSummarizeCompactor template-validation)
- test/stdlib/test_base_context.py: pin-non-compacting ChatContext
in the session-copy operations test (matches new opt-in default)
Net diff: 17 files, +381 / -896 lines (drops the old
react_compaction.py and its dedicated test file).
Backwards-compatible default behaviour preserved: bare ``ChatContext()``
retains full history; ``react()`` without ``compactor=`` behaves
identically to today; ``LLMSummarizeCompactor`` defaults to a generic
conversation-summary prompt unless callers opt in to the
research-flavoured variant via ``react_summary_prompt``.
Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
…ples - Correct comment in test_base_context.py: ChatContext defaults to compactor=None, not WindowCompactor(5). - Point compactor.py module docstring to docs/examples/context/ instead of the nonexistent docs/rewrite/. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
…n LLMSummarize Restrict ChatContext(compactor=...) to compactors that inherit InlineCompactor. Wiring a backend-requiring compactor (e.g. LLMSummarizeCompactor) directly would invoke the backend on every add(); the new isinstance guard rejects that with a TypeError pointing at react(compactor=...), ThresholdCompactor, or manual compact() as alternatives. ThresholdCompactor remains accepted regardless of its inner -- it gates by token usage, so backend calls are sparse rather than per-add. LLMSummarizeCompactor now takes a required default_backend at construction and falls back to it when compact() is invoked without an explicit backend. A backend kwarg passed to compact() still overrides the default for that call. This makes ThresholdCompactor(LLMSummarizeCompactor(...)) work end-to-end when attached to ChatContext: at trip time, the inner uses its stored default_backend. InlineCompactor carries the compact() signature (raising NotImplementedError) so it's a usable static type without cast() workarounds. Specialized to ChatContext rather than parameterized over Context -- ThresholdCompactor's prior generic-T signature was unexercised, so the simpler shape applies. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
…antic shift
Reviewer flagged that ChatContext(window_size=N) used to keep full history
in as_list() and only window view_for_generation(); the per-turn compactor
work made as_list() itself reflect the post-compaction state, which silently
undercounted in MelleaSession.cleanup() (interaction_count = len(as_list())).
- ChatContext docstring: add a Note describing the semantic shift and
pointing callers at out-of-band turn tracking when full counts matter.
- MelleaSession: turn ctx into a property; the setter increments
_interaction_count, and reset() / __init__ bypass via _ctx so lifecycle
events don't pollute the count. cleanup() now publishes
self._interaction_count, stable under any compaction strategy.
- SessionCleanupPayload: rewrite the interaction_count field doc to
match the new semantics ("turns committed" rather than "items in
context").
Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer flagged that an exception from the summarisation backend call (rate limit, network error, timeout) propagates through _run_coro_blocking and kills the entire react loop. For long-running research tasks that's quite painful, especially since compaction is best-effort by nature. LLMSummarizeCompactor.compact() now wraps _run_coro_blocking in a try/except Exception. On failure it logs a WARNING via MelleaLogger with the exception type and message, then returns ctx unchanged. The next compact() invocation retries; the conversation keeps growing in the meantime. BaseException (KeyboardInterrupt, SystemExit) still propagates so users can interrupt a stuck loop. Added a regression test with a backend that raises RuntimeError on every call: compact() returns the same ctx, original history is intact, and a warning naming the exception type is logged. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer flagged that _run_coro_blocking, when invoked from inside an async caller like react(), blocks the entire event loop for the full duration of the wrapped coroutine — not just the calling task. The previous "fine for a serial ReACT loop" wording undersold the implications. Beefed-up Warning: block now spells out: - The loop, not just the thread, is stalled — callbacks, telemetry, cancellation signals, other sessions sharing the loop, keepalives are all blocked. - Backends with per-loop resources (notably httpx.AsyncClient) may behave unexpectedly because the coroutine runs on a fresh loop in a worker thread; documents the typical failure signatures. - Long-term direction is an async variant on the Compactor protocol so callers can await natively. Docs-only; no behavior change. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer flagged silent drops when rendering the slice fed to the
summariser:
- Message: append "[N image(s) attached]" / "[M document(s) attached]"
markers; bytes not reproduced.
- ModelOutputThunk: render "assistant called tools: name({args}), ..."
for tool-call-only thunks (value=None) and "assistant: <empty>" for
empty thunks. Eliminates "assistant: None".
- Catch-all else: "<TypeName: content>" or "<TypeName>" instead of the
default object repr (e.g. ReactInitiator when not pinned).
Docstring gains a Note: that summaries are text-only and lossy for
multimodal / heavy-tool sessions. Tests cover each branch.
Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
…ng "<empty>" A "<empty>" marker for thunks with neither value nor tool_calls tended to leak into the resulting summary verbatim. These turns carry no information worth summarising, so drop them from the rendered slice entirely. Test updated to assert no line is emitted for an empty thunk while neighbouring turns still come through. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
…call Reviewer flagged that aact's context-type warning could be noisy under ThresholdCompactor-driven repeated compaction. Match react.py's pattern of setting silence_context_type_warning=True on internal framework calls so the warning stays quiet if the context argument is later changed to a non-SimpleContext, and to self-document the intent. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer flagged that the summary call uses the backend's default max_tokens (often 256-512 on local backends), silently truncating long summaries. react_summary_prompt(max_tokens_hint=N) is only a soft prompt- side nudge, not a real API parameter. Add model_options: dict | None to the constructor and forward it to mfuncs.aact so callers can set a hard token budget (or any other backend option). Default None preserves existing behaviour. Tests cover both forwarded and default paths. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
The Compactor protocol is generic in T bound to Context, so a custom compactor returning a non-chat Context (e.g. SimpleContext) type-checks at the react() call site but breaks the ChatContext assumptions the next aact call relies on. Document the constraint on the compactor= parameter so callers don't get surprised at runtime. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer pointed out that items[1:] drops index 0 unconditionally, so a leading system message gets evicted on first compaction. Add a Note to the class docstring flagging the caveat and pointing readers at WindowCompactor + pin_system for the keep-the-system-prompt variant. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer pointed out there's no class-level T = declaration — T is a per-call TypeVar on compact(). Reword to describe the override annotation directly: chat-only compactors override compact() with a ChatContext signature. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Reviewer flagged the docstring gap: compact() raises NotImplementedError unconditionally on the marker base class, but the docstring had no Raises: section. Add one noting the always-raises behavior and that subclasses are expected to override. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
The build-and-validate CI job's --fail-on-quality threshold flagged 8 issues across pin_nothing, pin_system, pin_system_and_initial_user, and InlineCompactor.compact (missing Args + Returns). Add the sections. Also collapse a multi-line _run_coro_blocking call back to one line so ruff format passes. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
|
Thanks for addressing or replying to everything. I'll accept your explanation for the event loop blocking concern, but I think we need a follow-up issue and need Nathan's approval. |
|
Fix looks good, but can we rebase to current code and ensure no conflict issues. |
There was a problem hiding this comment.
Comparing against the summary of our previous discussion in #1099, in my opinion the PR conforms to that design. There's a couple of things we need to do before we can merge:
- rebase / resolve the merge conflicts
- update the PR description to match the new design
Resolve conflicts in session cleanup payload: keep out-of-band interaction_count tracking (turns committed) from this branch, and adopt session_id + exception fields added upstream. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
|
DCO check is failing on this commit: 1e28704
@ramon-astudillo I think this is yours. |
|
This commit will get squashed anyway, but let me check with @yelkurdi |
|
@psschwei I resolved the conflicts and now I'm in the process of updating the PR description. |
5b01e1e to
6daa892
Compare
|
PR description updated. |
planetf1
left a comment
There was a problem hiding this comment.
Third pass — the design work here is solid and the earlier findings are all cleaned up. One thing that needs fixing before merge: six fake Backend subclasses (five in test_compactor.py, one in react_compaction.py) override generate_from_raw instead of the abstract _generate_from_raw that #1264 introduced. The whole TestLLMSummarizeCompactor class errors at fixture setup as a result — details in the inline comment. A rename in each fake unblocks it.
CI is now running and the failing quality check is the active gate — no need for a separate review block on top of it.
The Backend ABC's abstract method was renamed to _generate_from_raw (leading underscore) in generative-computing#1264; the public generate_from_raw is now a concrete hook-firing wrapper. The fakes in test_compactor.py overrode the wrapper but left the abstract method unimplemented, so each backend raised TypeError on instantiation. Rename generate_from_raw -> _generate_from_raw in FakeBackend, OtherBackend, BrokenBackend, BuggyBackend, and InterruptingBackend. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
Mirror the guard on the aact next_context returns: the LLMSummarizeCompactor docstring notes a non-ChatContext return would break the loop, but nothing enforced it. Add an isinstance assert after compact() to fail fast. Assisted-by: Claude Code Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
|
looks like we have another merge conflict (and the CI seems to have gotten stuck, but fixing the conflicts should give it a kick) |

Component PR
Use this template when adding or modifying components in
mellea/stdlib/components/.Description
Implementation Checklist
Protocol Compliance
parts()returns list of constituent parts (Components or CBlocks)format_for_llm()returns TemplateRepresentation or string_parse(computed: ModelOutputThunk)parses model output correctly into the specified Component return typeContent Blocks
Integration
mellea/stdlib/components/__init__.pyor, if you are adding a library of components, from your sub-moduleTesting
tests/components/Attribution
Summary
Adds a generic, sync
Compactorprotocol tomellea.stdlib.contextfor shrinking a runningContext, with three implementations —WindowCompactor,ThresholdCompactor,LLMSummarizeCompactor— plus composable pin predicates that decide which leading components survive. ReACT gains an optionalcompactor=parameter;ChatContextgainscompactor=/window_size=for per-append compaction.Empirically on the BCP benchmark with Granite 4.1-8b, LLM-summary compaction cuts inference cost by 23.7% and raises accuracy by 3.5 pp — a dual win, not a cost/quality trade-off. Backwards compatible: the defaults preserve existing behavior exactly.
Motivation
Long agentic loops pile up tool responses. Each react iteration re-sends the full history, so prompt-token cost grows quadratically and the loop can exhaust the context window before finishing. Compaction trims that history, cutting both dollar cost and the chance of hitting the context / timeout wall.
On BrowseCompPlus (BCP) with Granite 4.1-8b (131K context, 830 questions,
loop_budget=400,per_question_timeout_s=1800, mean of 3 runs/strategy):noneclear_allllm_summarizellm_summarizeis the clear winner — cost −23.7% and accuracy +3.5 pp (+23% relative) at similar wall-clock, since the smaller prompt makes each iteration faster.clear_allalso helps (−16.8% cost, +2.9 pp) but its +35% wall-clock is less attractive when the budget allows the extra summarization call.Benchmark labels are the external eval-harness strategy names:
clear_all→WindowCompactor(size=0, pin_predicate=pin_react_initiator),llm_summarize→LLMSummarizeCompactor(pin_predicate=pin_react_initiator), both gated viaThresholdCompactor.Hardware / infrastructure
ibm-granite/granite-4.1-8b(bf16, native 131K context)--enable-prefix-cachingon. GPU 0 reserved for the BCP local search service.OpenAIBackend→ local vLLM,concurrency=56,loop_budget=400,per_question_timeout_s=1800,--compaction_threshold 50000(~38% of 131K),--compaction_keep_n 5.openai/gpt-oss-120bon the same node after agent teardown.Measured with a separate BCP eval harness running on Mellea's react framework; the feature itself has no external runtime dependency.
Design
A
Compactorreturns a fresh, compacted copy of a context (never mutates the input). Sync and generic inT(aContextsubtype), so a chat compactor narrows toChatContext -> ChatContext. Backend-calling compactors hide the async work behind the sync method. Two usage patterns: per-append (ChatContext(compactor=...), runs on everyadd()) and per-turn / manual (react(compactor=...)or a directcompact(ctx, backend=...)between turns).InlineCompactormarker —ChatContext.add()callscompact()without a backend on every append, soChatContext(compactor=...)accepts onlyInlineCompactorinstances (structuralCompactorsatisfaction isn't enough), steering backend-calling compactors toreact(compactor=...),ThresholdCompactor, or manualcompact().Pin predicates — a
PinPredicatereturns the index after the protected prefix (components[:idx]preserved,components[idx:]is the body):pin_nothing,pin_system,pin_system_and_initial_user, andpin_react_initiator(keeps the react goal + tool registration).Compactors:
WindowCompactor(size, pin_predicate=pin_system)(inline) — keep prefix + lastsizebody components;size=0drops the body ("clear all").ThresholdCompactor(inner, threshold)(inline) — forwards toinner.compactonly when measured token size exceedsthreshold; the composable token-gate.LLMSummarizeCompactor(default_backend, keep_n=5, pin_predicate=pin_nothing, ...)— summarize older body into one[CONTEXT SUMMARY]message, keep lastkeep_nverbatim. Not inline (calls the backend each time); use viareact/ThresholdCompactor/manual.Token size (
_last_usage_tokens) is read from the most recentModelOutputThunk.generation.usagetotal_tokens. Helperreact_summary_prompt(goal, max_tokens_hint)builds a research-flavoured template forLLMSummarizeCompactor.Integration — in
react, compaction runs once per turn after theis_finalcheck so terminal turns skip it (saves a wasted summary call). Custom compactors used withreactmust return aChatContext.Non-goals
No
mellea/core/changes. The single-modulemellea/stdlib/context.pyis refactored into amellea/stdlib/context/package (chat.py,simple.py,compactor.py); public imports frommellea.stdlib.contextare unchanged.Test plan
uv run pytest test/stdlib/test_compactor.py— 62 tests, ~12 s, no external servicesuv run pytest test/stdlib/frameworks/test_react_framework.py—compactor=integrationCovers each compactor's output/prefix preservation, all pin predicates, threshold gating (below/at/above,
<=0disables, no-thunk-yet), window sizing incl.size=0and thewindow_size=sugar, theInlineCompactormarker check, andLLMSummarizeCompactorrendering + best-effort error handling (backend failure → unchanged; programming errors propagate). Runnable examples indocs/examples/context/.Pitfalls
generation.usage→ token-gated compaction silently no-ops. All mainline backends populate it (AGENTS.md §5); a new backend without it disablesThresholdCompactor.LLMSummarizeCompactor(#1258).compact()is sync but must call an async backend, so it bridges through_run_coro_blocking(a worker-threadasyncio.run). Two known failure modes: (a) inside a running loop (e.g.react()) the worker blocks the calling thread, stalling the outer loop's other tasks — callbacks, telemetry flushers, cancellation — until the summary finishes; (b) a backend whose async client (e.g.httpx.AsyncClient) is bound to mellea's shared loop can't be used from the fresh worker loop, surfacing asRuntimeError: This event loop is already runningor a hung request. Both are flagged in the docstrings but can't be fixed without a protocol change. The resolution tracked in AsyncCompactorprotocol to replace sync↔async bridge inLLMSummarizeCompactor#1258 is an asyncacompact()on theCompactorprotocol thatreact()awaits directly — no bridge, no loop blocking, no cross-loop client hazard.is_finalcheck — easy to swap in a refactor; guarded by a comment, ideally a placement test later.Files