feat(runtime): opt-in self-compaction for the agentic tool-loop #401
Adds ToolLoopCompaction to runBrainLoop: when the running conversation exceeds thresholdTokens,
distill the accumulated middle to a compact progress note and reset to [system, task, digest] —
bounding a long agentic loop's own context window (the chapter-close a fresh-respawn loop gets for
free). Threaded through driverAgent -> supervisorAgent -> supervise({ compaction }). Default off,
zero behavior change. Distiller defaults to a brain self-summary + the settled-worker roster;
overridable. Fires at a clean turn boundary so it never orphans an assistant tool_calls from its
tool replies; preserveHead is clamped to default on invalid input.
Covered by tests/loops/tool-loop-compaction.test.ts (bounded window, head preservation,
below-threshold no-op, full-conversation distill). Adversarially reviewed.
Note: the supervisor-cost benchmark this was built to address is coordination-round-bound, not
context-bound, so compaction does not reduce that cost (see supervisor-lab
docs/results/supervisor-context-rescue.md). The primitive remains the right tool for genuinely
context-bound loops (long histories).
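A minimal sketch of the mechanism described above, not the code in this PR. The option names `thresholdTokens`, `preserveHead`, and `distill` come from the description; the `Message` shape, the helper names (`estimateTokens`, `resolveHead`, `shouldCompact`, `applyDigest`, `compactIfNeeded`), and the chars/4 estimate are assumptions made for illustration.

```typescript
// Hypothetical message shape; the real types live in the runtime.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

interface CompactionSketch {
  thresholdTokens: number;
  preserveHead?: number; // messages kept verbatim at the front (assumed default: 2)
  distill: (messages: ReadonlyArray<Message>) => Promise<string>;
}

// Rough growth proxy: about four characters per token.
function estimateTokens(messages: ReadonlyArray<Message>): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Invalid or missing head sizes fall back to the default of 2 (system + task).
function resolveHead(c: { preserveHead?: number }): number {
  return c.preserveHead !== undefined && c.preserveHead >= 1 ? c.preserveHead : 2;
}

// True only when there is something after the head and the estimate exceeds the threshold.
function shouldCompact(
  messages: ReadonlyArray<Message>,
  c: { thresholdTokens: number; preserveHead?: number },
): boolean {
  if (messages.length <= resolveHead(c)) return false;
  return estimateTokens(messages) > c.thresholdTokens;
}

// Replace everything after the head with a single user digest, in place.
function applyDigest(messages: Message[], head: number, digest: string): void {
  messages.splice(head, messages.length - head, { role: "user", content: digest });
}

// Intended to run at a clean turn boundary, before the next chat call.
async function compactIfNeeded(messages: Message[], c: CompactionSketch): Promise<boolean> {
  if (!shouldCompact(messages, c)) return false;
  const digest = await c.distill([...messages]); // pass a copy, not the live array
  applyDigest(messages, resolveHead(c), digest);
  return true;
}
```

Splitting the decision (`shouldCompact`) from the mutation (`applyDigest`) keeps both halves synchronous and easy to test; only the distill call needs to be awaited.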
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 2 (1 medium-concern, 1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 180.2s (2 bridge agents) |
| Total | 180.2s |
💰 Value — sound
Adds an opt-in, off-by-default self-compaction primitive to the runtime canonical tool-loop so long-running brain loops can bound their own context window by distilling accumulated middle turns into a digest; cleanly threaded through driver → supervisor → supervise() and aligned with the repo's exis
- What it does: When enabled, `runBrainLoop` (src/runtime/tool-loop.ts:120) estimates conversation size each turn and, once it exceeds `thresholdTokens`, replaces everything after the preserved head (system + task by default) with a single user digest produced by a configurable `distill` function. The default distiller in `driverAgent` (src/runtime/supervise/coordination-driver.ts:240) pairs a brain-authored prog
- Goals it achieves: 1) Provide a reusable primitive for genuinely context-bound tool loops, preventing unbounded transcript growth and O(history) per-turn re-billing. 2) Give the LLM-brain supervisor/driver its own chapter-close lifecycle analogous to a fresh-respawn loop, using the live `Scope` roster as durable state. 3) Keep the capability off by default so nothing changes unless explicitly opted in.
- Assessment: Good change. It is coherent, safe, and in the grain of the codebase: default-off, uses the existing injected `ToolLoopChat` seam, debits the distill cost through the metered `chat`, preserves the OpenAI tool-history invariant by firing at a clean turn boundary, and combines brain narrative with ground-truth roster rather than trusting a summary. Tests (tests/loops/tool-loop-compaction.test.ts) cov
- Better / existing approach: none; this is the right approach. I searched for existing compaction/summarization primitives (grep compact|compaction|summarize|truncate|prune across src) and found no equivalent in-loop context-window binder. `compactTrajectory` in src/runtime/strategy.ts:215 is a read-only trajectory renderer for the analyst firewall, not a live message mutator. The repo's own research doc (docs/research/smart
- Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound-with-nits
A clean, off-by-default self-compaction primitive on the shared tool-loop skeleton — coherent and well-built, but wired only to the supervisor-driver path where the PR itself documents the benefit was falsified, while the coding-shot loop where it's claimed genuinely useful does not receive it.
- Integration: Reachable and correctly threaded. The primitive lands on runBrainLoop (tool-loop.ts:120,145), the shared skeleton, so every loop consumer CAN use it. It's plumbed driverAgent (coordination-driver.ts:240-258) → supervisorAgent (supervisor-agent.ts:133, router arm only) → supervise() (supervise.ts:150). Tests pass (5/5). Two gaps in reachability: (1) routerToolLoop (router-client.ts:227) — the OTHER
- Fit with existing patterns: Fits the codebase grain. No existing in-loop self-compaction primitive exists — compactTrajectory (strategy.ts:215) is read-side compaction for an external analyst observer (truncates to 7000 chars for steering input), NOT loop-side compaction of the agent's own conversation, so this is genuinely new, not a duplicate. The opt-in optional-hook shape matches the existing pattern (mirrors hooks/maxTu
- Real-world viability: Solid on the happy and edge paths. Negative preserveHead is clamped to default 2 (tool-loop.ts:92, tested). Length guard (tool-loop.ts:94) makes a too-large preserveHead a no-op. Token estimator is an explicit chars/4 growth proxy (tool-loop.ts:64-77), only needs to track growth to trip the threshold — fine. One robustness seam: if a caller-supplied distill throws, the throw propagates and kills t
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🎯 Usefulness Audit
🟠 Wired to the loop where the hypothesis was falsified, not the loop where it's claimed valuable [problem-fit]
The PR body states the motivating hypothesis (supervisor loses on cost by re-billing its transcript) was tested and FALSIFIED: supervisor cost is round-count, not transcript size. The capability is nonetheless wired only into driverAgent→supervisorAgent→supervise (coordination-driver.ts:240). Meanwhile routerToolLoop (router-client.ts:227), the consumer that runs the actual coding-agent shots (strategy.ts:runShot:175, which carries `messages` across depth-continuation shots), does NOT accept co
🟡 Supervisor-side type narrows away preserveHead/estimateTokens; delegate() doesn't forward compaction [ergonomics]
ToolLoopCompaction (tool-loop.ts:50-62) exposes preserveHead and estimateTokens, but the passthrough types on DriverAgentOptions (coordination-driver.ts:88-94), SupervisorAgentDeps (supervisor-agent.ts:101-107), and SuperviseOptions (supervise.ts:91-97) drop both — a supervisor caller cannot tune them without dropping to runBrainLoop directly. The loss is small (the driver's seed is always [system, task], so preserveHead=2 default is exactly right), but the narrowing is silent. Separately, deleg
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
❌ Needs Work —
| | opencode-kimi | glm | deepseek | aggregate |
|---|---|---|---|---|
| Readiness | 41 | 61 | 72 | 41 |
| Confidence | 75 | 75 | 75 | 75 |
| Correctness | 41 | 61 | 72 | 41 |
| Security | 41 | 61 | 72 | 41 |
| Testing | 41 | 61 | 72 | 41 |
| Architecture | 41 | 61 | 72 | 41 |
Full multi-shot audit completed 3/3 planned shots over 8 changed files. Global verifier still owns final merge decision.
Blocking
🔴 HIGH Default distiller omits live/in-flight workers from compaction roster — src/runtime/supervise/coordination-driver.ts
The default distiller at line 246 calls `coord.settled()`, which returns only done/down workers (coordination.ts:126, 231-244). It does not include acquiring/running workers from `scope.view.nodes`. The design comment at line 103 claims the roster is 'factual ground truth from the live scope' and the prompt tells the model to list 'every worker you spawned and its current status/result', but live workers are dropped. After compaction the brain may not know workers are in flight, leading to duplicate s
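One way to picture a roster that keeps in-flight workers visible, sketched under assumptions: the `WorkerNode` shape and the `summarizeFullRoster` helper are hypothetical and do not mirror the repo's `summarizeRoster` or `scope.view.nodes` types. Only the four status names are taken from the finding above.

```typescript
// Hypothetical worker shape; the real roster types live in the coordination module.
type WorkerStatus = "acquiring" | "running" | "done" | "down";

interface WorkerNode {
  id: string;
  status: WorkerStatus;
  result?: string;
}

// Render every worker, settled or in flight, one line each, so a digest
// built from this text cannot silently drop a live worker.
function summarizeFullRoster(nodes: ReadonlyArray<WorkerNode>): string {
  if (nodes.length === 0) return "(no workers spawned)";
  return nodes
    .map((n) => {
      const settled = n.status === "done" || n.status === "down";
      const tail = settled
        ? n.result !== undefined ? `: ${n.result}` : ""
        : " (in flight, do not respawn)";
      return `- ${n.id} [${n.status}]${tail}`;
    })
    .join("\n");
}
```

Tagging live workers explicitly gives the brain a direct cue after compaction, which is the failure mode the finding is concerned with.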
Other
🟠 MEDIUM Default distillation failure crashes the whole driver run — src/runtime/supervise/coordination-driver.ts
The default distiller at lines 244-250 awaits `chat(...)` without try/catch, and `maybeCompact` at tool-loop.ts:98 awaits `c.distill(messages)` without guarding. If the router returns a non-transient error during the summary call, the exception propagates out of `runBrainLoop` and aborts the entire `driverAgent.act`. Fix: wrap the default distiller's chat call in try/catch and fall back to the roster-only digest on failure; optionally catch in `maybeCompact` and skip compaction.
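The suggested fallback could look roughly like the sketch below. The `Chat` signature, the `makeSafeDistiller` factory, and the digest layout are assumptions for illustration; the real `chat` wrapper takes additional arguments and is metered.

```typescript
// Hypothetical, simplified chat seam.
type ChatMessage = { role: "system" | "user" | "assistant" | "tool"; content: string };
type Chat = (messages: ChatMessage[]) => Promise<{ content: string }>;

// Build a distiller that never throws: if the summary call fails,
// it degrades to a roster-only digest instead of aborting the loop.
function makeSafeDistiller(chat: Chat, rosterText: () => string, instruction: string) {
  return async (msgs: ReadonlyArray<ChatMessage>): Promise<string> => {
    const roster = rosterText(); // read at compact time, not at construction time
    try {
      const res = await chat([...msgs, { role: "user", content: instruction }]);
      return `${res.content}\n\nRoster:\n${roster}`;
    } catch {
      // Summary inference failed: keep the factual roster, drop the narrative.
      return `Roster:\n${roster}`;
    }
  };
}
```

Swallowing the error is a deliberate trade-off in this sketch; a real implementation would probably also log or report the failure.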
🟠 MEDIUM Default distiller integration path is untested — src/runtime/supervise/coordination-driver.ts
Lines 240-253: the default distiller, `async (msgs) => { const roster = summarizeRoster(coord.settled()); const res = await chat([...msgs, {role:'user', content: distillInstruction}], []); ... }`, has no test. tests/loops/tool-loop-compaction.test.ts covers only runBrainLoop's primitive with a synthetic `distill: () => 'DIGEST'`; tests/loops/{coordination-driver,supervisor-agent,supervise-convenience}.test.ts have ZERO references to `compaction`/`thresholdTokens` (verified via grep). The untested behaviors are the ones that matter in production: (a) the roster is read from the live `coord.settled()` closure at compact time; (b) the distil
🟠 MEDIUM No integration test for coordination-driver compaction path — src/runtime/supervise/coordination-driver.ts
The 5 unit tests in `tests/loops/tool-loop-compaction.test.ts` cover `runBrainLoop` compaction with a synthetic distill. No test exercises the coordination-driver's default distiller (lines 240-253), the path that pairs the factual `summarizeRoster(coord.settled())` output with a brain-authored narrative through a real (or mocked) `chat` call. The default distiller also debits the conserved pool via `scope.meter` (line 225), which is unverified. A test with `driverAgent({ ..., compaction: { thre
🟠 MEDIUM preserveHead:0 is accepted and drops the system message — src/runtime/tool-loop.ts
At line 92 the clamp is `c.preserveHead >= 0`, so `preserveHead: 0` is accepted. The comment at lines 88-91 explicitly says zero would splice the system message away and that invalid values should fall back to the default. With `preserveHead: 0`, `messages.splice(0, messages.length - head, digest)` at line 99 removes everything including the system message. Fix: require `preserveHead >= 1` (or `> 0`) in the clamp.
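A tightened clamp might look like the following sketch. `resolvePreserveHead` and `DEFAULT_PRESERVE_HEAD` are hypothetical names; rejecting non-integers is an extra assumption beyond what the finding asks for.

```typescript
// Default head size named in the review: system + task.
const DEFAULT_PRESERVE_HEAD = 2;

// Accept only whole numbers >= 1; everything else falls back to the default,
// so a head of 0 can never splice the system message away.
function resolvePreserveHead(value: number | undefined): number {
  if (value === undefined) return DEFAULT_PRESERVE_HEAD;
  if (!Number.isInteger(value) || value < 1) return DEFAULT_PRESERVE_HEAD;
  return value;
}
```

With this guard, `resolvePreserveHead(0)` and `resolvePreserveHead(-3)` both yield 2, while `resolvePreserveHead(3)` is passed through unchanged.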
🟠 MEDIUM runBrainLoop result.usage undercounts when compaction fires — src/runtime/tool-loop.ts
runBrainLoop accumulates `usage` only from its own `opts.chat()` calls at lines 148-150. When `compaction.distill` makes an LLM call (as the coordination-driver default does via its metered `chat` wrapper, line 247), that usage is NOT included in the returned `ToolLoopResult.usage`. The `hooks.onUsage` callback also does not fire for the distill inference. The coordination-driver discards the return value ([line 255](
🟡 LOW Driver/supervisor compaction types duplicate ToolLoopCompaction and hide fields — src/runtime/supervise/coordination-driver.ts
`DriverAgentOptions.compaction` (lines 88-94), `SuperviseOptions.compaction` (supervise.ts:91-97), and `SupervisorAgentDeps.compaction` (supervisor-agent.ts:101-107) repeat the same inline shape and do not expose `preserveHead` or `estimateTokens` from `ToolLoopCompaction` (tool-loop.ts:50-62), even though the underlying `runBrainLoop` supports them. Fix: reuse `ToolLoopCompaction` and forward the fields that are safe for callers to customize.
🟡 LOW Metered driver-inference turn numbers are shifted by compaction calls — src/runtime/supervise/coordination-driver.ts
The metered `chat` wrapper at lines 216-234 increments a shared `turn` counter on every invocation, including the distillation call at line 247. The actual driver inference turns therefore get offset/non-sequential numbers in `scope.meter` metadata. Fix: keep a separate counter for driver inference turns and do not count the compaction summary call as a driver turn.
🟡 LOW Turn counter double-counts on compact turns (metering metadata) — src/runtime/supervise/coordination-driver.ts
Lines 216-234: the wrapped `chat` does `turn += 1` after every call. The default distiller (line 247) calls `chat` once for the digest BEFORE the regular turn's `chat` call (tool-loop.ts:145 then :146). So a single driver turn that compacts reports TWO `turn` increments in the `scope.meter(..., { kind:'driver-inference', driver, turn, toolCalls })` metadata. Observability consequence: turn numbers in traces will skip by 2 on compact turns, and the `toolCalls` field for the distill entry will list
🟡 LOW preserveHead/estimateTokens overrides unreachable from supervise() — src/runtime/supervise/supervise.ts
ToolLoopCompaction (tool-loop.ts:50-62) exposes `preserveHead?` and `estimateTokens?`, but SuperviseOptions.compaction (supervise.ts:91-97), SupervisorAgentDeps.compaction (supervisor-agent.ts:101-107), and DriverAgentOptions.compaction (coordination-driver.ts:88-94) all only expose `thresholdTokens`, `distill?`, `onCompact?`. A caller who wants a different head size (e.g. 3 messages: system + task + a fixed briefing) or a tighter estimator (a real tokenizer) cannot reach those knobs without bypassing supervise() and calling runBrainLoop directly. The driver always seeds `[system, user]` so the default head=2 is correct for the in-tree caller, but the exported surface advertises less than the primitive can do. Fix: spread the full ToolLoopCompaction shape through, or document the constrain
🟡 LOW compaction silently no-ops on the sandbox/harness arm — src/runtime/supervise/supervisor-agent.ts
Lines 144-163 (the harness arm) never reference `deps.compaction`. SuperviseOptions.compaction doc (supervise.ts:86-90) says 'router arm only', but nothing enforces it: a caller who passes `{ profile: { harness: 'claude-code' }, compaction: { thresholdTokens: 10000 } }` gets NO compaction and NO error. Either throw at construction (`if (harness !== null && deps.compaction) throw new ValidationError(...)`) so the silent no-op becomes a loud misconfiguration, or drop the option from the harness-arm type. The current state violates the repo's 'silent zero the house rules forbid' norm stated at coordination-driver.ts:153-154.
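The construction-time check suggested above, as a self-contained sketch. The `ValidationError` class here is a local stand-in for the repo's own error type, and `SupervisorDepsSketch` and `assertCompactionSupported` are hypothetical names.

```typescript
// Local stand-in; the repo is assumed to export its own ValidationError.
class ValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ValidationError";
  }
}

// Hypothetical, reduced view of the supervisor dependencies.
interface SupervisorDepsSketch {
  harness: string | null; // null selects the router arm
  compaction?: { thresholdTokens: number };
}

// Fail loudly when compaction is requested on an arm that would ignore it.
function assertCompactionSupported(deps: SupervisorDepsSketch): void {
  if (deps.harness !== null && deps.compaction !== undefined) {
    throw new ValidationError(
      `compaction is router-arm only; harness "${deps.harness}" would silently ignore it`,
    );
  }
}
```

Calling this once when the supervisor is constructed turns the silent no-op into an immediate, descriptive error.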
🟡 LOW Circular content in messages can crash token estimator — src/runtime/tool-loop.ts
`estimateConversationTokens` at line 72 uses `JSON.stringify(content).length` for non-string content without guarding against circular structures. If a caller passes circular `initialMessages` and enables compaction, `JSON.stringify` throws and kills the loop. Fix: use a safe stringify or wrap in try/catch.
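A safe-stringify variant of the estimator could be sketched as follows. `safeStringify` and `estimateTokensSafe` are hypothetical names, and the chars/4 ratio simply mirrors the proxy described in the review.

```typescript
// Stringify without throwing on circular references.
// Note: any object seen a second time is replaced by a marker, so shared
// (non-circular) references are slightly undercounted. That is acceptable
// for a growth estimate.
function safeStringify(value: unknown): string {
  const seen = new WeakSet<object>();
  try {
    const json: string | undefined = JSON.stringify(value, (_key, v) => {
      if (typeof v === "object" && v !== null) {
        if (seen.has(v)) return "[Circular]";
        seen.add(v);
      }
      return v;
    });
    return json === undefined ? "" : json; // JSON.stringify(undefined) yields undefined
  } catch {
    return ""; // e.g. BigInt values; count as zero rather than kill the loop
  }
}

// chars/4 token proxy over message content of any shape.
function estimateTokensSafe(messages: ReadonlyArray<{ content: unknown }>): number {
  const chars = messages.reduce((sum, m) => {
    const text = typeof m.content === "string" ? m.content : safeStringify(m.content);
    return sum + text.length;
  }, 0);
  return Math.ceil(chars / 4);
}
```

The estimator only has to track growth well enough to trip the threshold, so degrading gracefully is preferable to throwing.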
🟡 LOW Distiller receives the live mutable messages array — src/runtime/tool-loop.ts
Line 98: `const digest = await c.distill(messages)` passes the live loop array (typed ReadonlyArray but runtime-mutable). The splice at line 99 runs AFTER the distill returns, so a synchronous distiller is safe, but an async distiller that stores the reference and reads it later would observe the post-splice (collapsed) state. The default distiller in coordination-driver.ts:247 is safe (it spreads `[...msgs, ...]`). A defensive copy, `c.distill([...messages])`, would make the contract honest at runtime. Not exploitable through any in-tree c
🟡 LOW Estimator undershoots real token count by ~10-20% — src/runtime/tool-loop.ts
estimateConversationTokens (lines 67-77) counts content length + tool_call function.arguments length, but ignores: assistant tool_call id/type/function.name overhead, tool message tool_call_id + role overhead, JSON structural punctuation for non-string content (JSON.stringify adds quotes/braces but not the per-message wrapper). For OpenAI's tokenizer the per-message overhead is ~3-4 tokens each; in a 20-message loop that's ~60-80 tokens uncounted. Net effect: threshold trips later than the user expects. Not a correctness bug — compaction still fires — but if a user sets thresholdTokens to model_context - 8192 expecting a safety margin, they may overshoot. Fix: a
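To make the arithmetic concrete, here is a sketch of an estimator that adds a flat per-message allowance. The constant of 4 tokens is an assumption taken from the upper end of the ~3-4 figure quoted above, and `estimateWithOverhead` is a hypothetical name.

```typescript
// Assumed flat allowance per message for role and wrapper overhead.
const PER_MESSAGE_OVERHEAD_TOKENS = 4;

// chars/4 content estimate plus a fixed per-message overhead.
function estimateWithOverhead(messages: ReadonlyArray<{ content: string }>): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4) + messages.length * PER_MESSAGE_OVERHEAD_TOKENS;
}
```

For a 20-message loop the allowance alone contributes 20 × 4 = 80 tokens, which matches the upper bound of the uncounted range cited in the finding.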
🟡 LOW preserveHead: 0 deletes the system message (validation only guards negative) — src/runtime/tool-loop.ts
The guard at line 92, `c.preserveHead !== undefined && c.preserveHead >= 0 ? c.preserveHead : 2`, allows `0` as a valid value, which would cause `messages.splice(0, ...)` to delete everything including the system prompt. The comment says 'Callers never pass preserveHead; this guards the exported primitive', but the interface IS exported. A value of 0 is arguably nonsensical (no head to preserve is equivalent to no compaction meaningfulness per line 94), but the guard could be tightened to `>= 1` for defense-in-depth, or 0 could be explicitly
🟡 LOW Clean-boundary orphan-prevention invariant is not directly asserted — tests/loops/tool-loop-compaction.test.ts
The source comment at tool-loop.ts:48-49 calls out that compaction fires 'at a CLEAN turn boundary so it never orphans an assistant tool_calls from its tool replies' — the marquee correctness property. No test constructs a scenario that would FAIL if compaction were moved to a mid-round boundary (e.g. asserting that after compaction no assistant message with tool_calls lacks a following tool reply). The bounded test passes regardless of where compaction fires. Add a test that fails if splice happens between an assistant tool_calls and its tool replies. (Coverage gap, not a bug.)
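A test for this invariant needs a way to detect orphaned tool calls. The helper below is one possible sketch; the `LoopMessage` shape follows the general OpenAI tool-message layout (`tool_calls` on the assistant message, `tool_call_id` on each reply), and `findOrphanedToolCalls` is a hypothetical name.

```typescript
// Hypothetical, reduced message shape in OpenAI tool format.
interface LoopMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  tool_calls?: { id: string }[];
  tool_call_id?: string;
}

// Return the ids of assistant tool calls that have no matching tool reply
// in the run of tool messages immediately following the assistant message.
function findOrphanedToolCalls(messages: ReadonlyArray<LoopMessage>): string[] {
  const orphans: string[] = [];
  messages.forEach((m, i) => {
    if (m.role !== "assistant" || m.tool_calls === undefined) return;
    const answered = new Set<string>();
    for (const reply of messages.slice(i + 1)) {
      if (reply.role !== "tool") break; // replies must directly follow the call
      if (reply.tool_call_id !== undefined) answered.add(reply.tool_call_id);
    }
    for (const call of m.tool_calls) {
      if (!answered.has(call.id)) orphans.push(call.id);
    }
  });
  return orphans;
}
```

A regression test could capture the message array the brain sees after each compaction and assert that this helper returns an empty list. The sketch checks only one direction; tool replies without a preceding assistant call would need a separate check.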
🟡 LOW Comment in 'with compaction' test mis-describes the trace — tests/loops/tool-loop-compaction.test.ts
At line ~73 the comment says 'After each compaction the conversation resets to [system, task, digest] (3) and re-grows by one tool round (5) before the next compaction.' Hand-tracing maybeCompact (tool-loop.ts:145 fires BEFORE chat each turn) with threshold=200 vs ~1000 tokens/result: the brain sees sizes [2,3,3,3,3,3,3]; it NEVER sees 5, because compaction trips every turn. The assertion `Math.max(...sizes) <= 6` is satisfied (actual max is 3) so the test is valid, but the comment is misleading about the loop's internal shape. Fix: either raise threshold so the brain genuinely observes the 3→5→3 cycle, or correct the comment to 'compaction fires every turn, so the brain only ever sees the 3-message compacted state.'
🟡 LOW Custom estimateTokens override is not exercised — tests/loops/tool-loop-compaction.test.ts
ToolLoopCompaction.estimateTokens (tool-loop.ts:59) is an exported override seam; only the default chars/4 estimator is tested. A trivial test passing a constant-returning estimator (e.g. () => 1_000_000) would prove the override is wired in and that onCompact reports the custom estimator's numbers, not the default's.
tangletools · 2026-06-27T18:12:02Z · trace
tangletools
left a comment
❌ 1 Blocking Finding — 6ddff7fd
Full multi-shot audit completed 3/3 planned shots over 8 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-27T18:12:02Z · immutable trace
What
Opt-in self-compaction for the canonical agentic tool-loop (`runBrainLoop`). When a long-running brain's conversation exceeds `thresholdTokens`, the accumulated middle is distilled to a compact progress note and the conversation resets to `[system, task, digest]`, bounding the loop's own context window (the chapter-close a fresh-respawn loop gets for free). Threaded through `driverAgent` → `supervisorAgent` → `supervise({ compaction })`.
Why, and honest scoping
Built to test the standing hypothesis that the LLM-brain supervisor loses on cost because it re-bills its whole transcript every turn. That hypothesis was tested and falsified (see supervisor-lab `docs/results/supervisor-context-rescue.md`): the supervisor's cost is its coordination round count (spawn / await / observe), not the size of what it reads, so compaction does not reduce it on that benchmark. The primitive is kept because it is the right tool for genuinely context-bound loops (long histories that would otherwise overflow), the same job harness compaction does for a coding agent. It is shipped OFF by default; nothing adopts it yet.
Safety / correctness
- Default off: zero behavior change while `compaction` is unset.
- Fires at a clean turn boundary so it never orphans an assistant `tool_calls` from its `tool` replies. Verified empirically that the resulting `[system, user, user]` shape is accepted by the router (DeepSeek), and the whole loop is OpenAI-tool-format by construction.
- `preserveHead` clamped to the default on invalid input.
Tests / gates
- `tests/loops/tool-loop-compaction.test.ts`: bounded-window, head-preservation, below-threshold no-op, full-conversation distill, negative-`preserveHead` clamp (5 tests).
- Adversarially reviewed (`preserveHead` hardening taken).
- `docs:check` all green. Rebased on latest `main`.