feat(research): a research-driving loop, measured on held-out deep questions not dedup #39

Merged
drewstone merged 4 commits into main from feat/research-driving-loop
Jun 25, 2026

@drewstone

What

A research-DRIVING loop, measured on held-out deep questions — not on dedup.

The prior A/B in this repo (docs/two-agent-research-ab.md) measured source hygiene: how few sources a verifier admits at equal coverage. That win turned out to be de-duplication, which is the ceiling of what an admit-or-reject step can do — a filter can only make the knowledge base carry less, never answer more.

This branch builds the opposite agent and gives it the opposite metric:

  • The driving driver (src/research-driving-driver.ts) does not filter. It extracts each fetched source's claims, demands a second independent source per claim, generates comparative / mechanism / contradiction sub-questions, and steers the worker to chase them in the next round.
  • The metric is research quality, not hygiene: a firewalled exam of 20 deep questions across 5 ML topics (tests/loops/held-out-exam.ts), graded with a $0 deterministic substring grader the loop never sees. The exam discriminates depth (0/20 on a one-line definition, 20/20 on a mechanism-rich paragraph).
  • Three arms over the same real web worker, differing only in the driver: collect / verify-dedup / drive.
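
A minimal sketch of how a deterministic substring grader over keyword groups could work, assuming each question carries groups of acceptable keywords and counts as answered only when every group is hit. The names `ExamQuestion`, `keywordGroups`, and `gradeKnowledgeBase` are hypothetical and are not taken from tests/loops/held-out-exam.ts; the sample question is illustrative, not one of the 20.

```typescript
// Hypothetical shapes: the real exam in tests/loops/held-out-exam.ts may differ.
interface ExamQuestion {
  id: string;
  // Every group must be hit; within a group, any one keyword counts.
  keywordGroups: string[][];
}

interface GradeResult {
  answered: number; // questions with all groups hit
  total: number;
  componentsCovered: number; // groups hit across all questions
}

// Lowercase and collapse whitespace so matching ignores case and line wrapping.
const normalize = (text: string): string =>
  text.toLowerCase().replace(/\s+/g, " ");

function gradeKnowledgeBase(kbText: string, exam: ExamQuestion[]): GradeResult {
  const haystack = normalize(kbText);
  let answered = 0;
  let componentsCovered = 0;
  for (const question of exam) {
    const hits = question.keywordGroups.filter((group) =>
      group.some((keyword) => haystack.includes(normalize(keyword))),
    ).length;
    componentsCovered += hits;
    if (hits === question.keywordGroups.length) answered += 1;
  }
  return { answered, total: exam.length, componentsCovered };
}

// Illustrative question only.
const sampleExam: ExamQuestion[] = [
  { id: "rlhf-1", keywordGroups: [["reward model"], ["kl penalty", "kl divergence"]] },
];
const shallow = gradeKnowledgeBase(
  "RLHF fine-tunes a model with human feedback.",
  sampleExam,
);
const deep = gradeKnowledgeBase(
  "A reward model scores samples; a KL penalty keeps the policy close to the reference.",
  sampleExam,
);
```

Because the grader is pure string matching, it makes no model calls, and it runs only on the finished knowledge base, so no loop can observe the questions.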

Result — an honest null

Driving does not reliably beat plain collection at answering hard held-out questions, and costs 12–16× more. The winner flips with the compute budget — the signature of web variance, not a topology effect:

| arm | answered @ B=4 | answered @ B=6 | cost (5 topics) | tokens |
|---|---|---|---|---|
| single-agent (collect) | 13/20 | 15/20 | $0.005–0.007 | ~2.4–3.0k |
| verify/dedup | 15/20 | 15/20 | $0.031–0.027 | ~21k |
| driving (deepen) | 16/20 | 13/20 | $0.089–0.084 | ~69–71k |

The within-arm swing (single 13→15, driving 16→13) is as large as the between-arm gap, which is the signature of a null at n=5 topics.

Autopsy: every arm finished in one effective round (passes=2 everywhere) because the generic readiness gate is met by the first fetch — so the multi-round driving mechanism never got a round 2. We then gave it its fairest test: a controlled probe that forces 3 rounds so the driver actually steers. It still ties a blind worker, 8/12 vs 8/12 at ~9× the cost (better on RLHF 3 vs 1, worse on speculative decoding 2 vs 4 — they cancel). The null survives the fix; it is not a gate artifact.

The durable output is the apparatus: a firewalled deep-question exam with a $0 grader that can tell depth from surface, reusable for any future research-quality claim.

Changes

  • src/research-driving-driver.ts — the driving driver (claim extraction, corroboration tracking, depth sub-question steering)
  • tests/loops/held-out-exam.ts — 20-question firewalled deep-question exam + $0 deterministic grader
  • tests/loops/research-driving-ab.test.ts — the 3-arm live A/B + forced multi-round probe (offline wiring runs with no creds)
  • src/web-research-worker.ts: withCitedClaim plumbing reused by the driving path
  • docs/results/research-driving.md — the full result, per-topic tables, and the probe
  • docs/two-agent-research-ab.md — §9 reframes the prior finding (hygiene vs depth) and folds the null into the main paper

Verification

  • pnpm run typecheck — clean
  • pnpm run build — success
  • pnpm run lint — 2 pre-existing warnings in src/wikilinks.ts, none added by this branch
  • pnpm test — 166 offline tests pass, 11 skipped (the live-only arms, gated behind AGENT_KNOWLEDGE_LIVE)
  • Branch merges cleanly into main (no conflicts)

Live A/B reproduce commands are in docs/two-agent-research-ab.md §10 and docs/results/research-driving.md §8.

… source filtering)

Add createResearchDrivingDriver, a ResearchDriver for runTwoAgentResearchLoop
whose job is to drive the research DEEPER each round rather than filter sources
(the opposite of the relevance/dedup/claim-grounding drivers). Each round it:

- extracts the key claims from the worker's new sources (LLM via RouterClient,
  with a deterministic sentence-pull fallback so a round never extracts nothing);
- tracks each claim's INDEPENDENT-source support (by canonical host) and detects
  contradictions between a new claim and one on the ledger;
- generates the next round's deep sub-questions in four kinds — comparative,
  mechanism, gap, contradiction — from the accumulated claims;
- flags weakly-supported (one independent source) or contradicted claims as
  invalidation targets and demands corroborating/refuting evidence;
- folds the questions + challenges into the worker's next prompt via the loop's
  foldGaps -> steer channel.
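
A rough sketch of the independent-source tracking step described above, assuming support is counted as the number of distinct canonical hosts that back a claim. `ClaimSupport`, `canonicalHost`, `recordSupport`, and `weaklySupported` are hypothetical names; in particular `canonicalHost` is a simplified stand-in, not the repo's `canonicalizeUrl`.

```typescript
// Hypothetical ledger entry; the real driver's types are not shown here.
interface ClaimSupport {
  claim: string;
  hosts: Set<string>; // distinct canonical hosts that back the claim
}

// Simplified stand-in for host canonicalization: hostname without "www.".
function canonicalHost(url: string): string {
  return new URL(url).hostname.toLowerCase().replace(/^www\./, "");
}

// Record that sourceUrl supports the claim; repeat hosts do not add support.
function recordSupport(
  ledger: Map<string, ClaimSupport>,
  claim: string,
  sourceUrl: string,
): void {
  const key = claim.trim().toLowerCase();
  const entry = ledger.get(key) ?? { claim, hosts: new Set<string>() };
  entry.hosts.add(canonicalHost(sourceUrl));
  ledger.set(key, entry);
}

// Claims backed by fewer than minHosts independent hosts are invalidation targets.
function weaklySupported(
  ledger: Map<string, ClaimSupport>,
  minHosts = 2,
): string[] {
  return Array.from(ledger.values())
    .filter((entry) => entry.hosts.size < minHosts)
    .map((entry) => entry.claim);
}

const ledger = new Map<string, ClaimSupport>();
const claim = "Speculative decoding preserves the target distribution";
recordSupport(ledger, claim, "https://www.example.com/post-a");
recordSupport(ledger, claim, "https://example.com/post-b"); // same host, no new support
const weakBefore = weaklySupported(ledger);
recordSupport(ledger, claim, "https://other.example.org/paper");
const weakAfter = weaklySupported(ledger);
```

Keying support on the host rather than the URL means two pages from one site count once, which is what makes the second source "independent" in this sketch.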

Completion (isComplete) gates on claim support, NOT source count: done only when
every deep sub-question is addressed AND every claim has >= 2 independent sources
or is explicitly contested. Reuses runTwoAgentResearchLoop, sha256,
canonicalizeUrl, and the RouterClient chat surface; reinvents none of them.
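
The completion rule above can be sketched as a pure predicate, assuming the driver keeps a per-claim count of independent hosts and a contested flag. `TrackedClaim`, `DeepQuestion`, and `isResearchComplete` are hypothetical names chosen for illustration, not the signature of the real `isComplete`.

```typescript
// Hypothetical state shapes for illustration only.
interface TrackedClaim {
  text: string;
  independentHosts: number;
  contested: boolean; // explicitly marked as contradicted by another source
}

interface DeepQuestion {
  text: string;
  kind: "comparative" | "mechanism" | "gap" | "contradiction";
  addressed: boolean;
}

// Done only when every deep sub-question is addressed AND every claim is
// either corroborated by >= 2 independent hosts or explicitly contested.
// The number of fetched sources plays no part in the decision.
function isResearchComplete(
  claims: TrackedClaim[],
  questions: DeepQuestion[],
): boolean {
  const allAddressed = questions.every((question) => question.addressed);
  const allSettled = claims.every(
    (item) => item.independentHosts >= 2 || item.contested,
  );
  return allAddressed && allSettled;
}

const questions: DeepQuestion[] = [
  { text: "How does X differ from Y?", kind: "comparative", addressed: true },
];
const singleSourced = isResearchComplete(
  [{ text: "claim A", independentHosts: 1, contested: false }],
  questions,
);
const corroborated = isResearchComplete(
  [
    { text: "claim A", independentHosts: 2, contested: false },
    { text: "claim B", independentHosts: 1, contested: true },
  ],
  questions,
);
```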

Tests: 12 unit (extraction, independent-host support, contradiction/contested,
deep-question kinds, completion-vs-count) + 2 offline scripted e2e through the
real loop (deeper questions across rounds; an unsupported claim flagged and only
corroborated once a second independent host is steered in). All offline, no creds.
…A/B; harden router on transient 503

Add the research-QUALITY measurement the driving driver was built for. The
prior A/B (research-loop-equal-compute) measured source CLEANLINESS — how few
sources the verifier admits — which is the wrong metric for a driver whose
thesis is depth+validation, not hygiene. This changes the metric to held-out
deep-question answering.

- tests/loops/held-out-exam.ts: a FIREWALLED exam — 5 ML topics x 4 deep
  questions (comparative / mechanism / contradiction), each with a checkable
  expected answer as keyword groups. The exam is NEVER shown to any loop; a $0
  deterministic substring grader scores the KB AFTER it is built, so it can't
  leak into a model the loop observes. Calibrated: a one-line topic snippet
  answers 0/20; a deep mechanism-rich page answers 20/20 — the gap is real
  depth, not grader noise.
- tests/loops/research-driving-ab.test.ts: a 3-arm live A/B at equal compute
  over the SAME real web worker — (A) single-agent collection, (B) the
  verify/dedup two-agent loop, (C) the research-DRIVING loop — scoring each KB
  on held-out-questions-answered + depth-components-covered + cost from
  RouterClient.usage(). Offline wiring tests prove the grader + harness so a
  live zero is a real null. The live test reports an honest SUPPORTED /
  NOT SUPPORTED verdict; it is a measurement, not a pass/fail gate.
- src/web-research-worker.ts: bounded exponential backoff on transient upstream
  statuses (502/503/504/429) in the router client. glm-5.2 capacity flapped
  mid-run and a single 503 voided a whole multi-topic burn; this survives the
  blip and still fails loud after the retry budget, keeping the fail-closed
  contract.
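
A minimal sketch of bounded exponential backoff on the transient statuses listed above, assuming the caller supplies a function that performs one request and returns an object with a `status`. `requestWithBackoff`, `RetryOptions`, and the default retry budget and delays are hypothetical; the values actually used in src/web-research-worker.ts are not stated here.

```typescript
// Statuses treated as transient, per the commit message.
const TRANSIENT_STATUSES = new Set([429, 502, 503, 504]);

// Hypothetical options; defaults are illustrative, not the repo's values.
interface RetryOptions {
  maxRetries?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function requestWithBackoff<T extends { status: number }>(
  send: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const {
    maxRetries = 3,
    baseDelayMs = 500,
    maxDelayMs = 8000,
    sleep = defaultSleep,
  } = options;
  for (let attempt = 0; ; attempt += 1) {
    const response = await send();
    // Success and non-transient errors go straight back to the caller.
    if (!TRANSIENT_STATUSES.has(response.status)) return response;
    // Retry budget spent: fail loud instead of returning a partial result.
    if (attempt >= maxRetries) {
      throw new Error(
        `upstream still returning ${response.status} after ${maxRetries} retries`,
      );
    }
    // Delay doubles each attempt and is capped at maxDelayMs.
    await sleep(Math.min(maxDelayMs, baseDelayMs * 2 ** attempt));
  }
}
```

Throwing once the budget is exhausted keeps the fail-closed behaviour: a short blip is absorbed, but a sustained outage still aborts the run.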
…beat collection (honest null)

Live 3-arm A/B (single / verify-dedup / DRIVING) on the firewalled 20-question
exam, at two compute budgets, plus a controlled multi-round probe:

- Equal compute: driving answered 16/20 at B=4 but 13/20 at B=6, while single
  swings 13->15, so the verdict flips with budget and the ±1-3 question gap is
  web variance, not topology. Driving cost 12-16x more (~$0.084-0.089 vs
  ~$0.005-0.007 for single over 5 topics) for no reliable gain.
- Autopsy: passes=2 on every topic/budget — the generic one-source readiness gate
  closes the loop after round 1, so the driving driver never steers a 2nd round.
- Controlled probe (force 3 rounds, driving steers vs blind re-search): driving
  ties blind 8/12 vs 8/12 at ~9x the cost. The null survives its fairest test, so
  it is not a gate artifact. Steering changes WHAT is fetched (helped RLHF 3v1,
  hurt speculative decoding 2v4) but not how many held-out questions are answered.

Adds the controlled multi-round probe to the A/B test (gated AGENT_KNOWLEDGE_LIVE
+ RQ_PROBE). Every figure in the doc is a measured per-arm delta from
RouterClient.usage(), cross-checked against the raw run logs.
…hygiene vs depth

The prior A/B measured source HYGIENE (how few sources a filter admits at
equal coverage); a filter can only make the KB carry less, never answer more.
This adds §9 reframing that as the ceiling of an admit-or-reject step and
reports the companion research-DRIVING result honestly: a driver that chases
depth + corroboration does NOT reliably beat plain collection on a firewalled
20-question deep-question exam, and costs 12-16x more — the verdict flips with
the compute budget (web variance), and ties a blind worker even when forced to
run its full multi-round mechanism. Numbers match docs/results/research-driving.md.

@tangletools tangletools left a comment

✅ Auto-approved PR — 70e6bd55

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T09:52:30Z

@drewstone drewstone merged commit 939610a into main Jun 25, 2026
1 check passed