feat(research): a research-driving loop, measured on held-out deep questions not dedup #39
Merged
… source filtering)

Add createResearchDrivingDriver, a ResearchDriver for runTwoAgentResearchLoop whose job is to drive the research DEEPER each round rather than filter sources (the opposite of the relevance/dedup/claim-grounding drivers). Each round it:

- extracts the key claims from the worker's new sources (LLM via RouterClient, with a deterministic sentence-pull fallback so a round never extracts nothing);
- tracks each claim's INDEPENDENT-source support (by canonical host) and detects contradictions between a new claim and one on the ledger;
- generates the next round's deep sub-questions in four kinds (comparative, mechanism, gap, contradiction) from the accumulated claims;
- flags weakly-supported (one independent source) or contradicted claims as invalidation targets and demands corroborating/refuting evidence;
- folds the questions + challenges into the worker's next prompt via the loop's foldGaps -> steer channel.

Completion (isComplete) gates on claim support, NOT source count: done only when every deep sub-question is addressed AND every claim has >= 2 independent sources or is explicitly contested.

Reuses runTwoAgentResearchLoop, sha256, canonicalizeUrl, and the RouterClient chat surface; reinvents none of them.

Tests: 12 unit (extraction, independent-host support, contradiction/contested, deep-question kinds, completion-vs-count) + 2 offline scripted e2e through the real loop (deeper questions across rounds; an unsupported claim flagged and only corroborated once a second independent host is steered in). All offline, no creds.
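The support-tracking and completion rules above can be sketched as a small ledger. This is an illustration only, not the driver's code: `ClaimLedger`, `canonicalHost`, and the exact-text claim key are hypothetical stand-ins (the real driver reuses the repo's canonicalizeUrl, whose behavior is not shown here).

```typescript
// Hypothetical shapes: the real driver's types are not shown in this PR.
type ClaimEntry = { text: string; hosts: Set<string>; contested: boolean };

// Assumed stand-in for canonicalizeUrl: reduce a URL to a comparable host.
function canonicalHost(url: string): string {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^/:?#]+)/i.exec(url);
  const host = match ? match[1] : url;
  return host.toLowerCase().replace(/^www\./, "");
}

class ClaimLedger {
  private claims = new Map<string, ClaimEntry>();

  // Record that `url` supports `claim`. A repeat host adds no new support.
  record(claim: string, url: string): void {
    const key = claim.trim().toLowerCase();
    const entry: ClaimEntry =
      this.claims.get(key) ?? { text: claim, hosts: new Set<string>(), contested: false };
    entry.hosts.add(canonicalHost(url));
    this.claims.set(key, entry);
  }

  // Mark a claim as explicitly contested (a contradiction was found).
  markContested(claim: string): void {
    const entry = this.claims.get(claim.trim().toLowerCase());
    if (entry) entry.contested = true;
  }

  // Claims with fewer than 2 independent hosts that are not yet contested.
  unresolved(): string[] {
    return Array.from(this.claims.values())
      .filter((c) => c.hosts.size < 2 && !c.contested)
      .map((c) => c.text);
  }

  // Completion gates on claim support, not on how many sources were fetched.
  isComplete(openQuestions: number): boolean {
    return openQuestions === 0 && this.unresolved().length === 0;
  }
}
```

Two pages from the same host count once, so a claim stays open until a second host (or a contradiction) arrives, however many sources the worker has fetched.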
…A/B; harden router on transient 503

Add the research-QUALITY measurement the driving driver was built for. The prior A/B (research-loop-equal-compute) measured source CLEANLINESS (how few sources the verifier admits), which is the wrong metric for a driver whose thesis is depth+validation, not hygiene. This changes the metric to held-out deep-question answering.

- tests/loops/held-out-exam.ts: a FIREWALLED exam of 5 ML topics x 4 deep questions (comparative / mechanism / contradiction), each with a checkable expected answer as keyword groups. The exam is NEVER shown to any loop; a $0 deterministic substring grader scores the KB AFTER it is built, so it can't leak into a model the loop observes. Calibrated: a one-line topic snippet answers 0/20; a deep mechanism-rich page answers 20/20, so the gap is real depth, not grader noise.
- tests/loops/research-driving-ab.test.ts: a 3-arm live A/B at equal compute over the SAME real web worker: (A) single-agent collection, (B) the verify/dedup two-agent loop, (C) the research-DRIVING loop. Each KB is scored on held-out-questions-answered + depth-components-covered + cost from RouterClient.usage(). Offline wiring tests prove the grader + harness so a live zero is a real null. The live test reports an honest SUPPORTED / NOT SUPPORTED verdict; it is a measurement, not a pass/fail gate.
- src/web-research-worker.ts: bounded exponential backoff on transient upstream statuses (502/503/504/429) in the router client. glm-5.2 capacity flapped mid-run and a single 503 voided a whole multi-topic burn; this survives the blip and still fails loud after the retry budget, keeping the fail-closed contract.
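The bounded backoff described in the last bullet could look roughly like the sketch below. It is an assumption-laden illustration, not the code in src/web-research-worker.ts: `withBackoff`, its parameters, and the default retry budget and base delay are all hypothetical.

```typescript
// Transient upstream statuses named in the commit message.
const TRANSIENT_STATUSES = new Set([429, 502, 503, 504]);

type StatusResponse = { status: number };

// Retry `call` on transient statuses with exponentially growing delays.
// `maxRetries`, `baseMs`, and the injectable `sleep` are assumed knobs.
async function withBackoff<T extends StatusResponse>(
  call: () => Promise<T>,
  maxRetries = 3,
  baseMs = 250,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const res = await call();
    // Non-transient responses (success or hard failure) go back to the caller.
    if (!TRANSIENT_STATUSES.has(res.status)) return res;
    // Retry budget spent: fail loud rather than return a partial result.
    if (attempt >= maxRetries) {
      throw new Error(`upstream still ${res.status} after ${maxRetries} retries`);
    }
    await sleep(baseMs * 2 ** attempt); // baseMs, 2*baseMs, 4*baseMs, ...
  }
}
```

Injecting `sleep` keeps the retry schedule testable offline; the throw after the last retry is what preserves a fail-closed contract in this sketch.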
…beat collection (honest null)

Live 3-arm A/B (single / verify-dedup / DRIVING) on the firewalled 20-question exam, at two compute budgets, plus a controlled multi-round probe:

- Equal compute: driving answered 16/20 at B=4 but 13/20 at B=6, while single swings 15->13. The verdict flips with budget, so the ±1-3 question gap is web variance, not topology. Driving cost 12-16x more (~$0.084-0.089 vs ~$0.005-0.007 for single over 5 topics) for no reliable gain.
- Autopsy: passes=2 on every topic/budget. The generic one-source readiness gate closes the loop after round 1, so the driving driver never steers a 2nd round.
- Controlled probe (force 3 rounds, driving steers vs blind re-search): driving ties blind 8/12 vs 8/12 at ~9x the cost. The null survives its fairest test, so it is not a gate artifact. Steering changes WHAT is fetched (helped RLHF 3v1, hurt speculative decoding 2v4) but not how many held-out questions are answered.

Adds the controlled multi-round probe to the A/B test (gated AGENT_KNOWLEDGE_LIVE + RQ_PROBE). Every figure in the doc is a measured per-arm delta from RouterClient.usage(), cross-checked against the raw run logs.
…hygiene vs depth

The prior A/B measured source HYGIENE (how few sources a filter admits at equal coverage); a filter can only make the KB carry less, never answer more. This adds §9, which reframes that as the ceiling of an admit-or-reject step and reports the companion research-DRIVING result honestly: a driver that chases depth + corroboration does NOT reliably beat plain collection on a firewalled 20-question deep-question exam, and costs 12-16x more. The verdict flips with the compute budget (web variance), and the driver ties a blind worker even when forced to run its full multi-round mechanism. Numbers match docs/results/research-driving.md.
tangletools
approved these changes
Jun 25, 2026
tangletools
left a comment
Contributor
✅ Auto-approved PR — 70e6bd55
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T09:52:30Z
What
A research-DRIVING loop, measured on held-out deep questions — not on dedup.
The prior A/B in this repo (`docs/two-agent-research-ab.md`) measured source hygiene: how few sources a verifier admits at equal coverage. That win turned out to be de-duplication, which is the ceiling of what an admit-or-reject step can do: a filter can only make the knowledge base carry less, never answer more.

This branch builds the opposite agent and gives it the opposite metric:

- The agent (`src/research-driving-driver.ts`) does not filter. It extracts each fetched source's claims, demands a second independent source per claim, generates comparative / mechanism / contradiction sub-questions, and steers the worker to chase them in the next round.
- The metric is a firewalled held-out exam (`tests/loops/held-out-exam.ts`), graded with a $0 deterministic substring grader the loop never sees. The exam discriminates depth (0/20 on a one-line definition, 20/20 on a mechanism-rich paragraph).

Result: an honest null
Driving does not reliably beat plain collection at answering hard held-out questions, and costs 12–16× more. The winner flips with the compute budget, which is the signature of web variance, not a topology effect: driving answered 16/20 at B=4 but 13/20 at B=6, while single-agent collection swung from 15/20 to 13/20.

The within-arm swing (single 15→13, driving 16→13) is as large as the between-arm gap, which is what a null looks like at n=5.
Autopsy: every arm finished in one effective round (`passes=2` everywhere) because the generic readiness gate is met by the first fetch, so the multi-round driving mechanism never got a round 2. We then gave it its fairest test: a controlled probe that forces 3 rounds so the driver actually steers. It still ties a blind worker, 8/12 vs 8/12 at ~9× the cost (better on RLHF 3 vs 1, worse on speculative decoding 2 vs 4; they cancel). The null survives the fix; it is not a gate artifact.

The durable output is the apparatus: a firewalled deep-question exam with a $0 grader that can tell depth from surface, reusable for any future research-quality claim.
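A keyword-group substring grader of the kind described here can be sketched in a few lines. The shapes and the matching rule below are assumptions for illustration (a question counts as answered when every keyword group has at least one case-insensitive substring hit in the KB text); the actual format in `tests/loops/held-out-exam.ts` may differ, and the sample question is made up.

```typescript
// Hypothetical exam shape: each question carries groups of acceptable keywords.
type ExamQuestion = { id: string; groups: string[][] };

// Assumed rule: answered only if EVERY group has at least one substring hit.
function isAnswered(kbText: string, question: ExamQuestion): boolean {
  const haystack = kbText.toLowerCase();
  return question.groups.every((group) =>
    group.some((keyword) => haystack.includes(keyword.toLowerCase())),
  );
}

// Score a finished KB: no model calls, so grading costs $0 and is repeatable.
function gradeExam(kbText: string, exam: ExamQuestion[]): number {
  return exam.filter((q) => isAnswered(kbText, q)).length;
}

// Illustrative question, not taken from the real exam.
const sampleExam: ExamQuestion[] = [
  { id: "spec-decoding-mechanism", groups: [["draft model"], ["verif", "accept"]] },
];

const shallowKb = "Speculative decoding speeds up inference.";
const deepKb =
  "A small draft model proposes tokens and the target model verifies them in parallel.";

console.log(gradeExam(shallowKb, sampleExam), gradeExam(deepKb, sampleExam));
```

Because the grader only reads the finished KB text, it can run after the loop ends and never needs to be visible to any model inside the loop.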
Changes
- `src/research-driving-driver.ts`: the driving driver (claim extraction, corroboration tracking, depth sub-question steering)
- `tests/loops/held-out-exam.ts`: 20-question firewalled deep-question exam + $0 deterministic grader
- `tests/loops/research-driving-ab.test.ts`: the 3-arm live A/B + forced multi-round probe (offline wiring runs with no creds)
- `src/web-research-worker.ts`: `withCitedClaim` plumbing reused by the driving path
- `docs/results/research-driving.md`: the full result, per-topic tables, and the probe
- `docs/two-agent-research-ab.md`: §9 reframes the prior finding (hygiene vs depth) and folds the null into the main paper

Verification

- `pnpm run typecheck`: clean
- `pnpm run build`: success
- `pnpm run lint`: 2 pre-existing warnings in `src/wikilinks.ts`, none added by this branch
- `pnpm test`: 166 offline tests pass, 11 skipped (the live-only arms, gated behind `AGENT_KNOWLEDGE_LIVE`)
- `main` (no conflicts)

Live A/B reproduce commands are in `docs/two-agent-research-ab.md` §10 and `docs/results/research-driving.md` §8.