feat(autodata): non-extractive causal challenger — open the strong/weak gap by drewstone · Pull Request #43 · tangle-network/agent-knowledge

drewstone · 2026-06-26T08:15:15Z

What

Make the Autodata discriminative data-creation loop actually open the strong/weak gap on real models. The loop previously nulled (0 accepted): an 8B solver scored as well as a frontier one, so no manufactured question separated the tiers.

Why it nulled (two compounding causes, both fixed)

Recall / leakage — the challenger wrote lookup questions whose answer sat in the provided context, so reading beat reasoning.
Memorized doc — the default grounding doc was "Attention Is All You Need", which an 8B has memorized, so even reasoning questions were answerable from pretraining.

Changes

Non-extractive causal challenger: authors CAUSAL / COMPARATIVE / MECHANISM / THESIS-CONSISTENCY questions; the context holds premises but withholds the conclusion (the answer must be derived).
Reasoning rubric / negative criterion: the judge now sees the context and scores a reasoning dimension LOW when an answer merely restates it; the solver no longer sees the rubric (the mark scheme).
Targeted fold: each reject reason steers the next draft toward the "just right" band ("too easy" → go non-extractive/harder, "too hard" → ease, "not discriminative" → sharpen).
Per-attempt autopsy trail (AUTODATA_ATTEMPTS): every candidate (accepted or rejected) is dumped to JSONL with both solvers' answer text + scores, so a null is diagnosable.
Recall-vs-causal calibration (calibrate.ts): A/B the two challenger styles on the same doc.
Default grounding doc → Mixtral (2401.04088), non-memorized for an 8B; proven models as defaults; routerChat retries transient 503/429/timeout with bounded backoff.
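The retry behaviour in the last item can be sketched as follows. This is a minimal illustration, not the PR's routerChat code: `isTransient`, `backoffDelayMs`, `withRetry`, and the 500 ms base, 8 s cap, and 4-retry defaults are all hypothetical. Only the set of transient conditions (503, 429, timeout) comes from the description above.

```typescript
// Hypothetical helper names; not the PR's routerChat implementation.

// Treat HTTP 503, HTTP 429, and timeouts as transient, per the PR description.
function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { status?: number; name?: string };
  return e.status === 503 || e.status === 429 || e.name === "TimeoutError";
}

// Exponential delay, capped at capMs so every wait is bounded (assumed values).
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry fn on transient errors only, at most maxRetries times after the first try.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 4,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Non-transient errors and exhausted budgets propagate to the caller.
      if (!isTransient(err) || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Capping both the delay and the retry count keeps a persistent upstream outage from stalling the loop indefinitely; the caller still sees the original error once the budget is spent.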

Live result

Models: weak groq/llama-3.1-8b-instant, strong gemini-2.5-pro, challenger+judge deepseek-v4-flash. The brief's glm-5.2 was returning upstream-capacity 503s, so deepseek is the live, neutral substitute: it is a different family from both solvers, which avoids same-family judge bias. Judge reliability was verified first (a controlled good-vs-bad pair → consistent 0.82 separation).

| condition | accepted | accepted-example gap | weak / strong |
| --- | --- | --- | --- |
| memorized Transformer paper, recall challenger | 0 | mean gap 0.117 | weak saturates 0.68–0.78 |
| Mixtral (non-memorized), causal, target=3 | 1 / 3 | 0.62 | 0.24 / 0.86 (fold turned a "too easy" draft into an accept) |
| Mixtral, causal, target=1 maxRetries=4 | 1 / 1 | 0.76 | 0.24 / 1.00 |

Fold widens the gap: plain first-draft 0.306 → refined 0.508 (Δ +0.202). Two accepted examples, both with the weak model genuinely failing the reasoning and the strong model deriving it (read the answers — not a judge or leakage artifact). Total live spend ≈ $0.15. Full writeup + autopsied examples in docs/results/autodata-live.md.
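The reported numbers are consistent with the gap being the strong score minus the weak score. The sketch below re-derives them under that assumption. The `SolverScores` type and the `gap` and `round3` helpers are hypothetical names for illustration, and the scores are copied from the table and paragraph above.

```typescript
// Assumption: gap = strong score - weak score (matches the table's figures).
interface SolverScores {
  weak: number;
  strong: number;
}

// Round to three decimals so floating-point noise does not leak into the result.
function round3(x: number): number {
  return Math.round(x * 1000) / 1000;
}

function gap(s: SolverScores): number {
  return round3(s.strong - s.weak);
}

// The two accepted examples from the live-result table.
const acceptedExamples: SolverScores[] = [
  { weak: 0.24, strong: 0.86 },
  { weak: 0.24, strong: 1.0 },
];
const gaps = acceptedExamples.map(gap); // [0.62, 0.76]

// Fold effect: refined mean gap minus plain first-draft mean gap.
const foldDelta = round3(0.508 - 0.306); // 0.202
```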

Verification

pnpm typecheck, pnpm lint, pnpm build, pnpm test (198 pass) all green.

…ak gap

The discriminative data-creation loop nulled because questions leaked their answer (recall) and the default doc was memorized by the weak solver, so an 8B read/recalled as well as a frontier model and no example separated the tiers.

- Challenger now authors NON-EXTRACTIVE causal/comparative/mechanism/thesis-consistency questions; the context holds premises but withholds the conclusion.
- Judge sees the context and scores a reasoning dimension LOW for restatement (the paper's negative criterion); solver no longer sees the rubric.
- Fold steers per reject reason toward the 'just right' band; recall vs causal challenger is selectable for calibration (calibrate.ts).
- Per-attempt JSONL autopsy trail (both solvers' answers + scores) for every candidate, accepted or rejected, so a null is diagnosable.
- Default grounding doc -> Mixtral (2401.04088, non-memorized for an 8B); proven models as defaults; routerChat retries transient 503/429/timeout with backoff.

Live result (deepseek challenger+judge, groq-llama weak, gemini-2.5-pro strong): gap opens on the non-memorized doc, with 2 accepted examples (gaps 0.62, 0.76; weak ~0.24 vs strong 0.86-1.00, real autopsied reasoning failures); fold widens the gap +0.202; the memorized Transformer paper still nulls. ~$0.15 live.

tangletools

✅ Auto-approved PR — `16182c65`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-26T08:15:22Z}

…s, clearing the accept bar does NOT (0/3 on independent re-run)

My independent re-run got 0/3 accepted (vs the original 1-2/3); reading the answers, the weak 8B scored 0.75 on a competent, correct answer. It didn't struggle, so nothing cleared weak<0.5. What reproduces is the +0.20 gap-widening from the fold; what doesn't is the accepted count. Reframed from 'it works' to 'directionally confirmed but noisy/under-powered at n=3': the small-n mirage, flagged not buried.
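To make the accept bar concrete, here is a minimal sketch of how a candidate might be classified and how each reject reason could steer the next draft. Only the `weak<0.5` bar and the three reject reasons with their steering hints come from this PR. The `strongMin` and `minGap` thresholds, the order of the checks, every identifier, and the strong score of 0.9 in the usage line are assumptions for illustration.

```typescript
type RejectReason = "too easy" | "too hard" | "not discriminative";
type Verdict = { accepted: true } | { accepted: false; reason: RejectReason };

interface Thresholds {
  weakMax: number;   // weak solver must score below this (0.5, from the commit message)
  strongMin: number; // ASSUMED: strong solver must score at least this
  minGap: number;    // ASSUMED: minimum strong - weak separation
}

const DEFAULTS: Thresholds = { weakMax: 0.5, strongMin: 0.7, minGap: 0.3 };

// Classify one candidate question from the two solver scores.
function classify(weak: number, strong: number, t: Thresholds = DEFAULTS): Verdict {
  if (strong < t.strongMin) return { accepted: false, reason: "too hard" };
  if (weak >= t.weakMax) return { accepted: false, reason: "too easy" };
  if (strong - weak < t.minGap) return { accepted: false, reason: "not discriminative" };
  return { accepted: true };
}

// Steering hints per reject reason, as listed in the PR description.
const STEER: Record<RejectReason, string> = {
  "too easy": "go non-extractive/harder",
  "too hard": "ease",
  "not discriminative": "sharpen",
};

// Hint for the next draft, or null when the candidate was accepted.
function nextHint(v: Verdict): string | null {
  return v.accepted ? null : STEER[v.reason];
}

// Re-run case: weak scored 0.75; the strong score 0.9 is a placeholder.
const rerun = classify(0.75, 0.9);
```

Under these assumed thresholds the re-run case is rejected as "too easy" because the weak solver does not fall below 0.5, while the two originally accepted examples (0.24 / 0.86 and 0.24 / 1.00) pass.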

tangletools

✅ Auto-approved PR — `3121d5e7`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-26T08:28:43Z}

tangletools previously approved these changes Jun 26, 2026

drewstone dismissed tangletools’s stale review via 3121d5e June 26, 2026 08:28

drewstone merged commit 0230286 into main Jun 26, 2026

tangletools approved these changes Jun 26, 2026

drewstone merged 2 commits into main from autodata/causal-challenger
