feat(research): a research-driving loop, measured on held-out deep questions not dedup by drewstone · Pull Request #39 · tangle-network/agent-knowledge

drewstone · 2026-06-25T09:52:23Z

What

A research-DRIVING loop, measured on held-out deep questions — not on dedup.

The prior A/B in this repo (docs/two-agent-research-ab.md) measured source hygiene: how few sources a verifier admits at equal coverage. That win turned out to be de-duplication, which is the ceiling of what an admit-or-reject step can do — a filter can only make the knowledge base carry less, never answer more.

This branch builds the opposite agent and gives it the opposite metric:

- The driving driver (src/research-driving-driver.ts) does not filter. It extracts each fetched source's claims, demands a second independent source per claim, generates comparative / mechanism / contradiction sub-questions, and steers the worker to chase them in the next round.
- The metric is research quality, not hygiene: a firewalled exam of 20 deep questions across 5 ML topics (tests/loops/held-out-exam.ts), graded with a $0 deterministic substring grader the loop never sees. The exam discriminates depth (0/20 on a one-line definition, 20/20 on a mechanism-rich paragraph).
- Three arms over the same real web worker, differing only in the driver: collect / verify-dedup / drive.
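
The grading step can be sketched as below. This is a minimal illustration, not the code in tests/loops/held-out-exam.ts: it assumes each question stores its expected answer as keyword groups where every group must be hit by at least one of its substrings. The names `ExamQuestion`, `isAnswered`, `gradeKnowledgeBase`, and the sample question are hypothetical.

```typescript
// Hypothetical shape of one held-out question: every keyword group must be
// satisfied, and a group is satisfied when any one of its substrings appears.
interface ExamQuestion {
  id: string;
  keywordGroups: string[][];
}

// Deterministic, $0 check: case-insensitive substring match, no model calls.
function isAnswered(kbText: string, question: ExamQuestion): boolean {
  const haystack = kbText.toLowerCase();
  return question.keywordGroups.every((group) =>
    group.some((keyword) => haystack.includes(keyword.toLowerCase())),
  );
}

// Score a finished knowledge base; the exam is consulted only after the build.
function gradeKnowledgeBase(kbText: string, exam: ExamQuestion[]): number {
  return exam.filter((question) => isAnswered(kbText, question)).length;
}

// Illustrative question (invented for this sketch, not from the real exam).
const sampleExam: ExamQuestion[] = [
  {
    id: "spec-decoding-mechanism",
    keywordGroups: [["draft model", "drafter"], ["verif"], ["accept", "reject"]],
  },
];

const shallow = "Speculative decoding speeds up LLM inference.";
const deep =
  "A small draft model proposes tokens; the target model verifies them in " +
  "one pass and accepts or rejects each proposal.";

console.log(gradeKnowledgeBase(shallow, sampleExam)); // 0
console.log(gradeKnowledgeBase(deep, sampleExam)); // 1
```

Because the check is pure string matching, it can run after the knowledge base is built without any text from the exam reaching a model the loop observes.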

Result — an honest null

Driving does not reliably beat plain collection at answering hard held-out questions, and costs 12–16× more. The winner flips with the compute budget — the signature of web variance, not a topology effect:

arm	answered @ B=4	answered @ B=6	cost (5 topics)	tokens
single-agent (collect)	13/20	15/20	$0.005–0.007	~2.4–3.0k
verify/dedup	15/20	15/20	$0.031–0.027	~21k
driving (deepen)	16/20	13/20	$0.089–0.084	~69–71k

The within-arm swing (single 13→15, driving 16→13) is as large as the between-arm gap: the signature of a null at n=5.
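
The swing-versus-gap comparison can be checked directly from the table. A small sketch: the `scores` object only transcribes the answered counts above, and the helper names are illustrative.

```typescript
// Answered counts (out of 20) transcribed from the table above.
const scores = {
  single: { b4: 13, b6: 15 },
  verifyDedup: { b4: 15, b6: 15 },
  driving: { b4: 16, b6: 13 },
};

// Within-arm swing: how far one arm moves when only the budget changes.
const swing = (arm: { b4: number; b6: number }): number =>
  Math.abs(arm.b4 - arm.b6);

// Between-arm gap: driving minus single at a fixed budget.
const gapAtB4 = scores.driving.b4 - scores.single.b4; // 3
const gapAtB6 = scores.driving.b6 - scores.single.b6; // -2

const summary = {
  singleSwing: swing(scores.single), // 2
  drivingSwing: swing(scores.driving), // 3
  gapAtB4,
  gapAtB6,
  signFlips: Math.sign(gapAtB4) !== Math.sign(gapAtB6), // true
};

console.log(summary);
```

The gap between arms changes sign across budgets and is no larger than the movement inside a single arm, which is why the table does not support a topology effect.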

Autopsy: every arm finished in one effective round (passes=2 everywhere) because the generic readiness gate is met by the first fetch — so the multi-round driving mechanism never got a round 2. We then gave it its fairest test: a controlled probe that forces 3 rounds so the driver actually steers. It still ties a blind worker, 8/12 vs 8/12 at ~9× the cost (better on RLHF 3 vs 1, worse on speculative decoding 2 vs 4 — they cancel). The null survives the fix; it is not a gate artifact.

The durable output is the apparatus: a firewalled deep-question exam with a $0 grader that can tell depth from surface, reusable for any future research-quality claim.

Changes

src/research-driving-driver.ts — the driving driver (claim extraction, corroboration tracking, depth sub-question steering)
tests/loops/held-out-exam.ts — 20-question firewalled deep-question exam + $0 deterministic grader
tests/loops/research-driving-ab.test.ts — the 3-arm live A/B + forced multi-round probe (offline wiring runs with no creds)
src/web-research-worker.ts — withCitedClaim plumbing reused by the driving path
docs/results/research-driving.md — the full result, per-topic tables, and the probe
docs/two-agent-research-ab.md — §9 reframes the prior finding (hygiene vs depth) and folds the null into the main paper

Verification

pnpm run typecheck — clean
pnpm run build — success
pnpm run lint — 2 pre-existing warnings in src/wikilinks.ts, none added by this branch
pnpm test — 166 offline tests pass, 11 skipped (the live-only arms, gated behind AGENT_KNOWLEDGE_LIVE)
Branch merges cleanly into main (no conflicts)

Live A/B reproduce commands are in docs/two-agent-research-ab.md §10 and docs/results/research-driving.md §8.

… source filtering)

Add createResearchDrivingDriver, a ResearchDriver for runTwoAgentResearchLoop whose job is to drive the research DEEPER each round rather than filter sources (the opposite of the relevance/dedup/claim-grounding drivers). Each round it:

- extracts the key claims from the worker's new sources (LLM via RouterClient, with a deterministic sentence-pull fallback so a round never extracts nothing);
- tracks each claim's INDEPENDENT-source support (by canonical host) and detects contradictions between a new claim and one on the ledger;
- generates the next round's deep sub-questions in four kinds (comparative, mechanism, gap, contradiction) from the accumulated claims;
- flags weakly-supported (one independent source) or contradicted claims as invalidation targets and demands corroborating/refuting evidence;
- folds the questions + challenges into the worker's next prompt via the loop's foldGaps -> steer channel.

Completion (isComplete) gates on claim support, NOT source count: done only when every deep sub-question is addressed AND every claim has >= 2 independent sources or is explicitly contested.

Reuses runTwoAgentResearchLoop, sha256, canonicalizeUrl, and the RouterClient chat surface; reinvents none of them.

Tests: 12 unit (extraction, independent-host support, contradiction/contested, deep-question kinds, completion-vs-count) + 2 offline scripted e2e through the real loop (deeper questions across rounds; an unsupported claim flagged and only corroborated once a second independent host is steered in). All offline, no creds.
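
The independent-host support rule and the completion gate described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the code in src/research-driving-driver.ts: `ClaimLedger`, `LedgerClaim`, and `canonicalHost` are hypothetical names, and `canonicalHost` is only a stand-in for the repo's canonicalizeUrl, whose exact behaviour is not shown in this PR.

```typescript
// Hypothetical ledger entry: a claim plus the hosts that independently back it.
interface LedgerClaim {
  text: string;
  hosts: Set<string>;
  contested: boolean;
}

// Stand-in for the repo's canonicalizeUrl: reduce a URL to a comparable host.
function canonicalHost(url: string): string {
  return new URL(url).hostname.toLowerCase().replace(/^www\./, "");
}

class ClaimLedger {
  private claims = new Map<string, LedgerClaim>();

  // Record that `url` supports `text`; same-host repeats add no new support.
  addSupport(text: string, url: string): void {
    const key = text.trim().toLowerCase();
    const claim: LedgerClaim =
      this.claims.get(key) ?? { text, hosts: new Set<string>(), contested: false };
    claim.hosts.add(canonicalHost(url));
    this.claims.set(key, claim);
  }

  markContested(text: string): void {
    const claim = this.claims.get(text.trim().toLowerCase());
    if (claim) claim.contested = true;
  }

  // Claims still needing corroboration: fewer than 2 hosts and not contested.
  weaklySupported(): LedgerClaim[] {
    return Array.from(this.claims.values()).filter(
      (claim) => claim.hosts.size < 2 && !claim.contested,
    );
  }

  // Completion gates on claim support, not on how many sources were fetched.
  isComplete(openQuestions: number): boolean {
    return openQuestions === 0 && this.weaklySupported().length === 0;
  }
}

const ledger = new ClaimLedger();
ledger.addSupport("RLHF uses a reward model", "https://www.example.org/a");
ledger.addSupport("RLHF uses a reward model", "https://example.org/b"); // same host
const beforeSecondHost = ledger.isComplete(0); // false: one independent host
ledger.addSupport("RLHF uses a reward model", "https://example.com/c");
const afterSecondHost = ledger.isComplete(0); // true: two independent hosts
console.log(beforeSecondHost, afterSecondHost);
```

Counting hosts rather than URLs is what keeps two pages from the same site from passing as corroboration.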

…A/B; harden router on transient 503

Add the research-QUALITY measurement the driving driver was built for. The prior A/B (research-loop-equal-compute) measured source CLEANLINESS (how few sources the verifier admits), which is the wrong metric for a driver whose thesis is depth+validation, not hygiene. This changes the metric to held-out deep-question answering.

- tests/loops/held-out-exam.ts: a FIREWALLED exam, 5 ML topics x 4 deep questions (comparative / mechanism / contradiction), each with a checkable expected answer as keyword groups. The exam is NEVER shown to any loop; a $0 deterministic substring grader scores the KB AFTER it is built, so it can't leak into a model the loop observes. Calibrated: a one-line topic snippet answers 0/20; a deep mechanism-rich page answers 20/20, so the gap is real depth, not grader noise.
- tests/loops/research-driving-ab.test.ts: a 3-arm live A/B at equal compute over the SAME real web worker: (A) single-agent collection, (B) the verify/dedup two-agent loop, (C) the research-DRIVING loop. Each KB is scored on held-out-questions-answered + depth-components-covered + cost from RouterClient.usage(). Offline wiring tests prove the grader + harness so a live zero is a real null. The live test reports an honest SUPPORTED / NOT SUPPORTED verdict; it is a measurement, not a pass/fail gate.
- src/web-research-worker.ts: bounded exponential backoff on transient upstream statuses (502/503/504/429) in the router client. glm-5.2 capacity flapped mid-run and a single 503 voided a whole multi-topic burn; this survives the blip and still fails loud after the retry budget, keeping the fail-closed contract.
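
Bounded exponential backoff of the kind this commit describes might look like the sketch below. It is an assumption-laden illustration, not the router client's code: `withBackoff`, `RetryOptions`, and the delay values are hypothetical, and the upstream call is injected so the example runs without a network.

```typescript
// Statuses the commit treats as transient; anything else is returned as-is.
const TRANSIENT_STATUSES = new Set([429, 502, 503, 504]);

interface StatusResponse {
  status: number;
}

interface RetryOptions {
  maxRetries: number; // retry budget after the first attempt
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number; // cap that keeps the backoff bounded
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry `call` on transient statuses with capped exponential delays, then
// fail loudly once the budget is spent (the fail-closed part).
async function withBackoff<T extends StatusResponse>(
  call: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 0; ; attempt++) {
    const response = await call();
    if (!TRANSIENT_STATUSES.has(response.status)) return response;
    if (attempt >= opts.maxRetries) {
      throw new Error(
        `upstream still returning ${response.status} after ${attempt + 1} attempts`,
      );
    }
    await sleep(Math.min(opts.baseDelayMs * 2 ** attempt, opts.maxDelayMs));
  }
}

// Usage with a scripted upstream: two 503s, then success. Sleep is stubbed,
// so the recorded delays are 100 ms and then 150 ms (capped from 200 ms).
let calls = 0;
const delays: number[] = [];
withBackoff(async () => ({ status: calls++ < 2 ? 503 : 200 }), {
  maxRetries: 3,
  baseDelayMs: 100,
  maxDelayMs: 150,
  sleep: async (ms) => {
    delays.push(ms);
  },
}).then((response) => console.log(response.status, delays));
```

Throwing after the budget is exhausted, rather than returning the last transient response, is what preserves the fail-closed behaviour: a persistent outage still stops the run instead of silently producing an empty result.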

…beat collection (honest null)

Live 3-arm A/B (single / verify-dedup / DRIVING) on the firewalled 20-question exam, at two compute budgets, plus a controlled multi-round probe:

- Equal compute: driving answered 16/20 at B=4 but 13/20 at B=6, while single swings 13->15; the verdict flips with budget, so the ±1-3 question gap is web variance, not topology. Driving cost 12-16x more (~$0.084-0.089 vs ~$0.005-0.007 for single over 5 topics) for no reliable gain.
- Autopsy: passes=2 on every topic/budget, because the generic one-source readiness gate closes the loop after round 1, so the driving driver never steers a 2nd round.
- Controlled probe (force 3 rounds, driving steers vs blind re-search): driving ties blind 8/12 vs 8/12 at ~9x the cost. The null survives its fairest test, so it is not a gate artifact. Steering changes WHAT is fetched (helped RLHF 3v1, hurt speculative decoding 2v4) but not how many held-out questions are answered.

Adds the controlled multi-round probe to the A/B test (gated AGENT_KNOWLEDGE_LIVE + RQ_PROBE). Every figure in the doc is a measured per-arm delta from RouterClient.usage(), cross-checked against the raw run logs.

…hygiene vs depth

The prior A/B measured source HYGIENE (how few sources a filter admits at equal coverage); a filter can only make the KB carry less, never answer more. This adds §9 reframing that as the ceiling of an admit-or-reject step and reports the companion research-DRIVING result honestly: a driver that chases depth + corroboration does NOT reliably beat plain collection on a firewalled 20-question deep-question exam, and costs 12-16x more. The verdict flips with the compute budget (web variance), and the driver ties a blind worker even when forced to run its full multi-round mechanism. Numbers match docs/results/research-driving.md.

tangletools

✅ Auto-approved PR — `70e6bd55`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T09:52:30Z

drewstone added 4 commits June 25, 2026 02:45

tangletools approved these changes Jun 25, 2026

drewstone merged commit 939610a into main Jun 25, 2026
1 check passed

drewstone merged 4 commits into main from feat/research-driving-loop
