feat(runtime): opt-in self-compaction for the agentic tool-loop #401

Merged: drewstone merged 2 commits into main from feat/supervisor-self-compaction on Jun 27, 2026
@drewstone

What

Opt-in self-compaction for the canonical agentic tool-loop (runBrainLoop). When a long-running brain's conversation exceeds thresholdTokens, the accumulated middle is distilled to a compact progress note and the conversation resets to [system, task, digest], which bounds the loop's own context window (the chapter-close a fresh-respawn loop gets for free). Threaded through driverAgent → supervisorAgent → supervise({ compaction }).
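As a rough illustration of the reset described above, the following sketch collapses everything after a two-message head into a single user digest once a chars/4 estimate crosses the threshold. The names `Msg`, `CompactionSketch`, `estimateTokensSketch`, and `maybeCompactSketch` are hypothetical and are not the runtime's actual types; the distiller is synchronous here only to keep the sketch short, whereas the real one is awaited.

```typescript
// Hypothetical message shape; the real runtime types are not shown in this PR text.
type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

interface CompactionSketch {
  thresholdTokens: number;
  // Simplified to a synchronous function for illustration.
  distill: (messages: ReadonlyArray<Msg>) => string;
}

// Rough chars/4 growth proxy, as the review describes the default estimator.
function estimateTokensSketch(messages: ReadonlyArray<Msg>): number {
  let chars = 0;
  for (const m of messages) chars += m.content.length;
  return Math.ceil(chars / 4);
}

// Returns true when compaction fired and the array was rewritten in place.
function maybeCompactSketch(messages: Msg[], c: CompactionSketch): boolean {
  const head = 2; // [system, task]
  if (messages.length <= head) return false; // nothing to distill
  if (estimateTokensSketch(messages) <= c.thresholdTokens) return false;
  const digest = c.distill([...messages]); // distill sees the full conversation
  messages.splice(head, messages.length - head, { role: "user", content: digest });
  return true;
}
```

After a compaction the array has the [system, user, user] shape that the Safety section below says the router accepts.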

Why — and honest scoping

Built to test the standing hypothesis that the LLM-brain supervisor loses on cost because it re-bills its whole transcript every turn. That hypothesis was tested and falsified (see supervisor-lab docs/results/supervisor-context-rescue.md): the supervisor's cost is its coordination round count (spawn / await / observe), not the size of what it reads — so compaction does not reduce it on that benchmark. The primitive is kept because it is the right tool for genuinely context-bound loops (long histories that would otherwise overflow) — the same job harness compaction does for a coding agent. It is shipped OFF by default; nothing adopts it yet.

Safety / correctness

  • Default OFF — zero behavior change when compaction is unset.
  • Fires at a clean turn boundary (after a turn's tool results are folded in) so it never orphans an assistant tool_calls from its tool replies. Verified empirically that the resulting [system, user, user] shape is accepted by the router (DeepSeek), and the whole loop is OpenAI-tool-format by construction.
  • preserveHead clamped to the default on invalid input.
  • Distiller defaults to a brain self-summary + the settled-worker roster; fully overridable.

Tests / gates

  • tests/loops/tool-loop-compaction.test.ts — bounded-window, head-preservation, below-threshold no-op, full-conversation distill, negative-preserveHead clamp (5 tests).
  • Adversarially reviewed (one empirically-refuted false positive on message structure; one preserveHead hardening taken).
  • typecheck / lint / full suite (1146 pass) / docs:check all green. Rebased on latest main.

Adds ToolLoopCompaction to runBrainLoop: when the running conversation exceeds thresholdTokens,
distill the accumulated middle to a compact progress note and reset to [system, task, digest] —
bounding a long agentic loop's own context window (the chapter-close a fresh-respawn loop gets for
free). Threaded through driverAgent -> supervisorAgent -> supervise({ compaction }). Default off,
zero behavior change. Distiller defaults to a brain self-summary + the settled-worker roster;
overridable. Fires at a clean turn boundary so it never orphans an assistant tool_calls from its
tool replies; preserveHead is clamped to default on invalid input.

Covered by tests/loops/tool-loop-compaction.test.ts (bounded window, head preservation,
below-threshold no-op, full-conversation distill). Adversarially reviewed.

Note: the supervisor-cost benchmark this was built to address is coordination-round-bound, not
context-bound, so compaction does not reduce that cost (see supervisor-lab
docs/results/supervisor-context-rescue.md). The primitive remains the right tool for genuinely
context-bound loops (long histories).

@tangletools tangletools left a comment

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 2 (1 medium-concern, 1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 180.2s (2 bridge agents)
Total 180.2s

💰 Value — sound

Adds an opt-in, off-by-default self-compaction primitive to the runtime canonical tool-loop so long-running brain loops can bound their own context window by distilling accumulated middle turns into a digest; cleanly threaded through driver → supervisor → supervise() and aligned with the repo's exis

  • What it does: When enabled, runBrainLoop (src/runtime/tool-loop.ts:120) estimates conversation size each turn and, once it exceeds thresholdTokens, replaces everything after the preserved head (system + task by default) with a single user digest produced by a configurable distill function. The default distiller in driverAgent (src/runtime/supervise/coordination-driver.ts:240) pairs a brain-authored prog
  • Goals it achieves: 1) Provide a reusable primitive for genuinely context-bound tool loops, preventing unbounded transcript growth and O(history) per-turn re-billing. 2) Give the LLM-brain supervisor/driver its own chapter-close lifecycle analogous to a fresh-respawn loop, using the live Scope roster as durable state. 3) Keep the capability off by default so nothing changes unless explicitly opted in.
  • Assessment: Good change. It is coherent, safe, and in the grain of the codebase: default-off, uses the existing injected ToolLoopChat seam, debits the distill cost through the metered chat, preserves the OpenAI tool-history invariant by firing at a clean turn boundary, and combines brain narrative with ground-truth roster rather than trusting a summary. Tests (tests/loops/tool-loop-compaction.test.ts) cov
  • Better / existing approach: none — this is the right approach. I searched for existing compaction/summarization primitives (grep compact|compaction|summarize|truncate|prune across src) and found no equivalent in-loop context-window binder. compactTrajectory in src/runtime/strategy.ts:215 is a read-only trajectory renderer for the analyst firewall, not a live message mutator. The repo's own research doc (docs/research/smart
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound-with-nits

A clean, off-by-default self-compaction primitive on the shared tool-loop skeleton — coherent and well-built, but wired only to the supervisor-driver path where the PR itself documents the benefit was falsified, while the coding-shot loop where it's claimed genuinely useful does not receive it.

  • Integration: Reachable and correctly threaded. The primitive lands on runBrainLoop (tool-loop.ts:120,145), the shared skeleton, so every loop consumer CAN use it. It's plumbed driverAgent (coordination-driver.ts:240-258) → supervisorAgent (supervisor-agent.ts:133, router arm only) → supervise() (supervise.ts:150). Tests pass (5/5). Two gaps in reachability: (1) routerToolLoop (router-client.ts:227) — the OTHER
  • Fit with existing patterns: Fits the codebase grain. No existing in-loop self-compaction primitive exists — compactTrajectory (strategy.ts:215) is read-side compaction for an external analyst observer (truncates to 7000 chars for steering input), NOT loop-side compaction of the agent's own conversation, so this is genuinely new, not a duplicate. The opt-in optional-hook shape matches the existing pattern (mirrors hooks/maxTu
  • Real-world viability: Solid on the happy and edge paths. Negative preserveHead is clamped to default 2 (tool-loop.ts:92, tested). Length guard (tool-loop.ts:94) makes a too-large preserveHead a no-op. Token estimator is an explicit chars/4 growth proxy (tool-loop.ts:64-77), only needs to track growth to trip the threshold — fine. One robustness seam: if a caller-supplied distill throws, the throw propagates and kills t
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🎯 Usefulness Audit

🟠 Wired to the loop where the hypothesis was falsified, not the loop where it's claimed valuable [problem-fit]

The PR body states the motivating hypothesis (supervisor loses on cost by re-billing its transcript) was tested and FALSIFIED — supervisor cost is round-count, not transcript size. The capability is nonetheless wired only into driverAgent→supervisorAgent→supervise (coordination-driver.ts:240). Meanwhile routerToolLoop (router-client.ts:227), the consumer that runs the actual coding-agent shots (strategy.ts:runShot:175, which carries messages across depth-continuation shots), does NOT accept co

🟡 Supervisor-side type narrows away preserveHead/estimateTokens; delegate() doesn't forward compaction [ergonomics]

ToolLoopCompaction (tool-loop.ts:50-62) exposes preserveHead and estimateTokens, but the passthrough types on DriverAgentOptions (coordination-driver.ts:88-94), SupervisorAgentDeps (supervisor-agent.ts:101-107), and SuperviseOptions (supervise.ts:91-97) drop both — a supervisor caller cannot tune them without dropping to runBrainLoop directly. The loss is small (the driver's seed is always [system, task], so preserveHead=2 default is exactly right), but the narrowing is silent. Separately, deleg


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass: what it asks
Heuristic: Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication: Do added function/class names already exist elsewhere in the repo?
Value Audit: What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit: Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260627T180504Z

@tangletools

❌ Needs Work — 6ddff7fd

Readiness 41/100 · Confidence 75/100 · 18 findings (1 high, 5 medium, 12 low)

Metric        opencode-kimi  glm  deepseek  aggregate
Readiness     41             61   72        41
Confidence    75             75   75        75
Correctness   41             61   72        41
Security      41             61   72        41
Testing       41             61   72        41
Architecture  41             61   72        41

Full multi-shot audit completed 3/3 planned shots over 8 changed files. Global verifier still owns final merge decision.

Blocking

🔴 HIGH Default distiller omits live/in-flight workers from compaction roster — src/runtime/supervise/coordination-driver.ts

The default distiller at line 246 calls coord.settled(), which returns only done/down workers (coordination.ts:126, 231-244). It does not include acquiring/running workers from scope.view.nodes. The design comment at line 103 claims the roster is 'factual ground truth from the live scope' and the prompt tells the model to list 'every worker you spawned and its current status/result', but live workers are dropped. After compaction the brain may not know workers are in flight, leading to duplicate s
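One way to address this finding is to build the roster from every node rather than only the settled ones. The sketch below is an assumption-heavy illustration: `WorkerNode`, `WorkerStatus`, and `rosterLines` are hypothetical names, and only the four status values come from the finding itself.

```typescript
// Status values as named in the finding; the node shape itself is hypothetical.
type WorkerStatus = "acquiring" | "running" | "done" | "down";

interface WorkerNode {
  id: string;
  status: WorkerStatus;
  result?: string;
}

// Render one line per worker, including workers that have not settled yet,
// so the post-compaction digest still tells the brain what is in flight.
function rosterLines(nodes: ReadonlyArray<WorkerNode>): string {
  return nodes
    .map((n) => {
      const settled = n.status === "done" || n.status === "down";
      const detail = settled ? (n.result ?? "no result recorded") : "in flight";
      return `- ${n.id}: ${n.status} (${detail})`;
    })
    .join("\n");
}
```

Listing in-flight workers explicitly is what would let the brain avoid respawning work it has already started.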

Other

🟠 MEDIUM Default distillation failure crashes the whole driver run — src/runtime/supervise/coordination-driver.ts

The default distiller at lines 244-250 awaits chat(...) without try/catch, and maybeCompact at tool-loop.ts:98 awaits c.distill(messages) without guarding. If the router returns a non-transient error during the summary call, the exception propagates out of runBrainLoop and aborts the entire driverAgent.act. Fix: wrap the default distiller's chat call in try/catch and fall back to the roster-only digest on failure; optionally catch in maybeCompact and skip compaction.
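The suggested fix could look roughly like the sketch below, which falls back to a roster-only digest when the summary call fails. `ChatFn` and `distillWithFallback` are hypothetical names, and the single-string chat signature is a simplification, not the runtime's real chat seam.

```typescript
// Hypothetical, simplified chat signature: prompt in, summary text out.
type ChatFn = (prompt: string) => Promise<string>;

// Try the brain-authored summary first; if the call throws, keep the loop
// alive by returning the factual roster on its own.
async function distillWithFallback(
  chat: ChatFn,
  instruction: string,
  roster: string,
): Promise<string> {
  try {
    const narrative = await chat(instruction);
    return `${narrative}\n\nWorker roster:\n${roster}`;
  } catch {
    // Assumption: a roster-only digest is an acceptable degraded result.
    return `Worker roster:\n${roster}`;
  }
}
```

Swallowing the error keeps the driver run alive, at the cost of a thinner digest; a real implementation would probably also want to log or meter the failure.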

🟠 MEDIUM Default distiller integration path is untested — src/runtime/supervise/coordination-driver.ts

Lines 240-253: the default distiller — async (msgs) => { const roster = summarizeRoster(coord.settled()); const res = await chat([...msgs, {role:'user', content: distillInstruction}], []); ... } — has no test. tests/loops/tool-loop-compaction.test.ts covers only runBrainLoop's primitive with a synthetic distill: () => 'DIGEST'; tests/loops/{coordination-driver,supervisor-agent,supervise-convenience}.test.ts have ZERO references to compaction/thresholdTokens (verified via grep). The untested behaviors are the ones that matter in production: (a) the roster is read from the live coord.settled() closure at compact time; (b) the distil

🟠 MEDIUM No integration test for coordination-driver compaction path — src/runtime/supervise/coordination-driver.ts

The 5 unit tests in tests/loops/tool-loop-compaction.test.ts cover runBrainLoop compaction with a synthetic distill. No test exercises the coordination-driver's default distiller (lines 240-253) — the path that pairs the factual summarizeRoster(coord.settled()) output with a brain-authored narrative through a real (or mocked) chat call. The default distiller also debits the conserved pool via scope.meter (line 225), which is unverified. A test with `driverAgent({ ..., compaction: { thre

🟠 MEDIUM preserveHead:0 is accepted and drops the system message — src/runtime/tool-loop.ts

At line 92 the clamp is c.preserveHead >= 0, so preserveHead: 0 is accepted. The comment at lines 88-91 explicitly says zero would splice the system message away and that invalid values should fall back to the default. With preserveHead: 0, messages.splice(0, messages.length - head, digest) at line 99 removes everything including the system message. Fix: require preserveHead >= 1 (or > 0) in the clamp.
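A tightened clamp along the lines of the suggested fix might look like this. `resolvePreserveHead` is a hypothetical helper name, and the integer check is an extra assumption beyond what the finding asks for.

```typescript
// Fall back to the default head size unless the caller passes a whole number
// of at least 1, so the system message can never be spliced away.
function resolvePreserveHead(preserveHead: number | undefined, fallback = 2): number {
  const valid =
    preserveHead !== undefined && Number.isInteger(preserveHead) && preserveHead >= 1;
  return valid ? preserveHead : fallback;
}
```

With this guard, `preserveHead: 0` behaves like any other invalid input and resolves to the default of 2.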

🟠 MEDIUM runBrainLoop result.usage undercounts when compaction fires — src/runtime/tool-loop.ts

runBrainLoop accumulates usage only from its own opts.chat() calls at line 148-150. When compaction.distill makes an LLM call (as the coordination-driver default does via its metered chat wrapper, line 247), that usage is NOT included in the returned ToolLoopResult.usage. The hooks.onUsage callback also does not fire for the distill inference. The coordination-driver discards the return value (line 255)

🟡 LOW Driver/supervisor compaction types duplicate ToolLoopCompaction and hide fields — src/runtime/supervise/coordination-driver.ts

DriverAgentOptions.compaction (lines 88-94), SuperviseOptions.compaction (supervise.ts:91-97), and SupervisorAgentDeps.compaction (supervisor-agent.ts:101-107) repeat the same inline shape and do not expose preserveHead or estimateTokens from ToolLoopCompaction (tool-loop.ts:50-62), even though the underlying runBrainLoop supports them. Fix: reuse ToolLoopCompaction and forward the fields that are safe for callers to customize.

🟡 LOW Metered driver-inference turn numbers are shifted by compaction calls — src/runtime/supervise/coordination-driver.ts

The metered chat wrapper at lines 216-234 increments a shared turn counter on every invocation, including the distillation call at line 247. The actual driver inference turns therefore get offset/non-sequential numbers in scope.meter metadata. Fix: keep a separate counter for driver inference turns and do not count the compaction summary call as a driver turn.

🟡 LOW Turn counter double-counts on compact turns (metering metadata) — src/runtime/supervise/coordination-driver.ts

Lines 216-234: the wrapped chat does turn += 1 after every call. The default distiller (line 247) calls chat once for the digest BEFORE the regular turn's chat call (tool-loop.ts:145 then :146). So a single driver turn that compacts reports TWO turn increments in the scope.meter(..., { kind:'driver-inference', driver, turn, toolCalls }) metadata. Observability consequence: turn numbers in traces will skip by 2 on compact turns, and the toolCalls field for the distill entry will list

🟡 LOW preserveHead/estimateTokens overrides unreachable from supervise() — src/runtime/supervise/supervise.ts

ToolLoopCompaction (tool-loop.ts:50-62) exposes preserveHead? and estimateTokens?, but SuperviseOptions.compaction (supervise.ts:91-97), SupervisorAgentDeps.compaction (supervisor-agent.ts:101-107), and DriverAgentOptions.compaction (coordination-driver.ts:88-94) all only expose thresholdTokens, distill?, onCompact?. A caller who wants a different head size (e.g. 3 messages: system + task + a fixed briefing) or a tighter estimator (a real tokenizer) cannot reach those knobs without bypassing supervise() and calling runBrainLoop directly. The driver always seeds [system, user] so the default head=2 is correct for the in-tree caller, but the exported surface advertises less than the primitive can do. Fix: spread the full ToolLoopCompaction shape through, or document the constrain

🟡 LOW compaction silently no-ops on the sandbox/harness arm — src/runtime/supervise/supervisor-agent.ts

Lines 144-163 (the harness arm) never reference deps.compaction. SuperviseOptions.compaction doc (supervise.ts:86-90) says 'router arm only', but nothing enforces it: a caller who passes { profile: { harness: 'claude-code' }, compaction: { thresholdTokens: 10000 } } gets NO compaction and NO error. Either throw at construction (if (harness !== null && deps.compaction) throw new ValidationError(...)) so the silent no-op becomes a loud misconfiguration, or drop the option from the harness-arm type. The current state violates the repo's 'silent zero the house rules forbid' norm stated at coordination-driver.ts:153-154.
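The "throw at construction" option could be sketched as follows. `ValidationErrorSketch`, `DepsSketch`, and `assertCompactionSupported` are hypothetical stand-ins; the repo's own ValidationError and dependency types are not shown in this PR text.

```typescript
// Hypothetical stand-in for the repo's ValidationError.
class ValidationErrorSketch extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ValidationErrorSketch";
  }
}

// Simplified dependency shape: harness is null on the router arm.
interface DepsSketch {
  harness: string | null;
  compaction?: { thresholdTokens: number };
}

// Turn the silent no-op into a loud misconfiguration.
function assertCompactionSupported(deps: DepsSketch): void {
  if (deps.harness !== null && deps.compaction !== undefined) {
    throw new ValidationErrorSketch(
      `compaction is supported on the router arm only (harness: ${deps.harness})`,
    );
  }
}
```

Calling this once when the supervisor agent is constructed would surface the misconfiguration before any turn runs.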

🟡 LOW Circular content in messages can crash token estimator — src/runtime/tool-loop.ts

estimateConversationTokens at line 72 uses JSON.stringify(content).length for non-string content without guarding against circular structures. If a caller passes circular initialMessages and enables compaction, JSON.stringify throws and kills the loop. Fix: use a safe stringify or wrap in try/catch.
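A guarded length helper in the spirit of this fix might look like the sketch below. `safeContentLength` is a hypothetical name, and counting unserializable content as zero characters is an assumption; a real estimator might prefer a conservative non-zero fallback.

```typescript
// Measure content size without letting JSON.stringify kill the loop.
function safeContentLength(content: unknown): number {
  if (typeof content === "string") return content.length;
  try {
    // JSON.stringify returns undefined for undefined and functions.
    const serialized: string | undefined = JSON.stringify(content);
    return serialized === undefined ? 0 : serialized.length;
  } catch {
    // Circular structures (and BigInt values) throw; count them as zero.
    return 0;
  }
}
```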

🟡 LOW Distiller receives the live mutable messages array — src/runtime/tool-loop.ts

Line 98: const digest = await c.distill(messages) passes the live loop array (typed ReadonlyArray but runtime-mutable). The splice at line 99 runs AFTER the distill returns, so a synchronous distiller is safe — but an async distiller that stores the reference and reads it later would observe the post-splice (collapsed) state. The default distiller in coordination-driver.ts:247 is safe (it spreads [...msgs, ...]). Defensive copy c.distill([...messages]) would make the contract honest at runtime. Not exploitable through any in-tree c
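The effect of the defensive copy can be shown with a small sketch. Messages are plain strings here for brevity, and `retainingDistill` and `compactWithCopy` are hypothetical names for a distiller that keeps its argument and a compaction step that snapshots before calling it.

```typescript
// Hypothetical distiller that keeps the array it was handed for later inspection.
let retained: ReadonlyArray<string> = [];

function retainingDistill(messages: ReadonlyArray<string>): string {
  retained = messages;
  return "DIGEST";
}

// Pass a snapshot, not the live array, so the later splice cannot be observed
// through the reference the distiller kept.
function compactWithCopy(messages: string[], head: number): void {
  const digest = retainingDistill([...messages]);
  messages.splice(head, messages.length - head, digest);
}
```

After `compactWithCopy` runs, the live array is collapsed while `retained` still holds the pre-compaction conversation.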

🟡 LOW Estimator undershoots real token count by ~10-20% — src/runtime/tool-loop.ts

estimateConversationTokens (lines 67-77) counts content length + tool_call function.arguments length, but ignores: assistant tool_call id/type/function.name overhead, tool message tool_call_id + role overhead, JSON structural punctuation for non-string content (JSON.stringify adds quotes/braces but not the per-message wrapper). For OpenAI's tokenizer the per-message overhead is ~3-4 tokens each; in a 20-message loop that's ~60-80 tokens uncounted. Net effect: threshold trips later than the user expects. Not a correctness bug — compaction still fires — but if a user sets thresholdTokens to model_context - 8192 expecting a safety margin, they may overshoot. Fix: a
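A per-message overhead term is one way to narrow the gap described here. In the sketch below, `SizedMessage`, `PER_MESSAGE_OVERHEAD_TOKENS`, and `estimateWithOverhead` are hypothetical, and the constant of 4 is simply the upper end of the "~3-4 tokens" figure quoted in the finding, not a measured value.

```typescript
// Minimal message shape for the estimate; only content size matters here.
interface SizedMessage {
  content: string;
}

// Assumed per-message wrapper cost, taken from the finding's ~3-4 token range.
const PER_MESSAGE_OVERHEAD_TOKENS = 4;

// chars/4 growth proxy plus a flat per-message overhead.
function estimateWithOverhead(messages: ReadonlyArray<SizedMessage>): number {
  let chars = 0;
  for (const m of messages) chars += m.content.length;
  return Math.ceil(chars / 4) + messages.length * PER_MESSAGE_OVERHEAD_TOKENS;
}
```

For a 20-message loop this adds 80 tokens to the estimate, matching the upper bound of the uncounted overhead the finding describes.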

🟡 LOW preserveHead: 0 deletes the system message (validation only guards negative) — src/runtime/tool-loop.ts

The guard at line 92: c.preserveHead !== undefined && c.preserveHead >= 0 ? c.preserveHead : 2 allows 0 as a valid value, which would cause messages.splice(0, ...) to delete everything including the system prompt. The comment says 'Callers never pass preserveHead; this guards the exported primitive' — but the interface IS exported. A value of 0 is arguably nonsensical (no head to preserve is equivalent to no compaction meaningfulness per line 94), but the guard could be tightened to >= 1 for defense-in-depth, or 0 could be explicitly

🟡 LOW Clean-boundary orphan-prevention invariant is not directly asserted — tests/loops/tool-loop-compaction.test.ts

The source comment at tool-loop.ts:48-49 calls out that compaction fires 'at a CLEAN turn boundary so it never orphans an assistant tool_calls from its tool replies' — the marquee correctness property. No test constructs a scenario that would FAIL if compaction were moved to a mid-round boundary (e.g. asserting that after compaction no assistant message with tool_calls lacks a following tool reply). The bounded test passes regardless of where compaction fires. Add a test that fails if splice happens between an assistant tool_calls and its tool replies. (Coverage gap, not a bug.)
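The missing assertion could be built on a helper like the one sketched here, which reports assistant tool_calls that are not answered by an immediately following tool reply. `LoopMessage` and `findOrphanedToolCalls` are hypothetical; the message shape is a minimal approximation of the OpenAI tool format and checks only this one direction of the invariant.

```typescript
// Minimal approximation of OpenAI-style tool-format messages (hypothetical types).
type ToolCall = { id: string };
type LoopMessage =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content: string; tool_calls?: ToolCall[] }
  | { role: "tool"; content: string; tool_call_id: string };

// Return the ids of tool calls that have no reply in the run of tool messages
// directly after the assistant message that issued them.
function findOrphanedToolCalls(messages: ReadonlyArray<LoopMessage>): string[] {
  const orphans: string[] = [];
  messages.forEach((m, i) => {
    if (m.role !== "assistant" || m.tool_calls === undefined) return;
    const replied = new Set<string>();
    for (const next of messages.slice(i + 1)) {
      if (next.role !== "tool") break; // replies must follow immediately
      replied.add(next.tool_call_id);
    }
    for (const call of m.tool_calls) {
      if (!replied.has(call.id)) orphans.push(call.id);
    }
  });
  return orphans;
}
```

A test could run the loop with compaction enabled, capture the messages passed to each chat call, and assert that this helper returns an empty array every time.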

🟡 LOW Comment in 'with compaction' test mis-describes the trace — tests/loops/tool-loop-compaction.test.ts

At line ~73 the comment says 'After each compaction the conversation resets to [system, task, digest] (3) and re-grows by one tool round (5) before the next compaction.' Hand-tracing maybeCompact (tool-loop.ts:145 fires BEFORE chat each turn) with threshold=200 vs ~1000 tokens/result: the brain sees sizes [2,3,3,3,3,3,3] — it NEVER sees 5, because compaction trips every turn. The assertion Math.max(...sizes) <= 6 is satisfied (actual max is 3) so the test is valid, but the comment is misleading about the loop's internal shape. Fix: either raise threshold so the brain genuinely observes the 3→5→3 cycle, or correct the comment to 'compaction fires every turn, so the brain only ever sees the 3-message compacted state.'

🟡 LOW Custom estimateTokens override is not exercised — tests/loops/tool-loop-compaction.test.ts

ToolLoopCompaction.estimateTokens (tool-loop.ts:59) is an exported override seam; only the default chars/4 estimator is tested. A trivial test passing a constant-returning estimator (e.g. () => 1_000_000) would prove the override is wired in and that onCompact reports the custom estimator's numbers, not the default's.


tangletools · 2026-06-27T18:12:02Z · trace

@tangletools tangletools left a comment

❌ 1 Blocking Finding — 6ddff7fd

Full multi-shot audit completed 3/3 planned shots over 8 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-27T18:12:02Z · immutable trace

@drewstone drewstone merged commit 3137913 into main Jun 27, 2026
1 check passed
@drewstone drewstone deleted the feat/supervisor-self-compaction branch June 27, 2026 18:27