feat(autodata): non-extractive causal challenger — open the strong/weak gap #43
Merged
…ak gap

The discriminative data-creation loop nulled because questions leaked their answer (recall) and the default doc was memorized by the weak solver, so an 8B read/recalled as well as a frontier model and no example separated the tiers.

- Challenger now authors NON-EXTRACTIVE causal/comparative/mechanism/thesis-consistency questions; the context holds premises but withholds the conclusion.
- Judge sees the context and scores a reasoning dimension LOW for restatement (the paper's negative criterion); solver no longer sees the rubric.
- Fold steers per reject reason toward the 'just right' band; recall vs causal challenger is selectable for calibration (calibrate.ts).
- Per-attempt JSONL autopsy trail (both solvers' answers + scores) for every candidate, accepted or rejected, so a null is diagnosable.
- Default grounding doc -> Mixtral (2401.04088, non-memorized for an 8B); proven models as defaults; routerChat retries transient 503/429/timeout with backoff.

Live result (deepseek challenger+judge, groq-llama weak, gemini-2.5-pro strong): the gap opens on the non-memorized doc, with 2 accepted examples (gaps 0.62, 0.76; weak ~0.24 vs strong 0.86-1.00, real autopsied reasoning failures); fold widens the gap +0.202; the memorized Transformer paper still nulls. ~$0.15 live.
tangletools
previously approved these changes
Jun 26, 2026
tangletools
left a comment
Contributor
✅ Auto-approved PR — 16182c65
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-26T08:15:22Z
…s, clearing the accept bar does NOT (0/3 on independent re-run)

My independent re-run got 0/3 accepted (vs the original 1-2/3); reading the answers, the weak 8B scored 0.75 on a competent, correct answer. It didn't struggle, so nothing cleared weak<0.5. What reproduces is the +0.20 gap-widening from the fold; what doesn't is the accepted count. Reframed from 'it works' to 'directionally confirmed but noisy/under-powered at n=3': the small-n mirage, flagged not buried.
tangletools
approved these changes
Jun 26, 2026
tangletools
left a comment
Contributor
✅ Auto-approved PR — 3121d5e7
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-26T08:28:43Z
What
Make the Autodata discriminative data-creation loop actually open the strong/weak gap on real models. The loop previously nulled (0 accepted): an 8B solver scored as well as a frontier one, so no manufactured question separated the tiers.
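To make "separating the tiers" concrete, here is a minimal sketch of an accept check, assuming each candidate carries one judge score per solver in [0, 1]. The `weak < 0.5` bar comes from the follow-up commit note in this PR; the `Attempt` and `AcceptBar` shapes, the `minGap` threshold, and the example scores are illustrative assumptions, not the repo's actual types or settings.

```typescript
// Hypothetical shapes for illustration only; not the repo's real types.
interface Attempt {
  weakScore: number;   // judge score for the weak solver's answer, in [0, 1]
  strongScore: number; // judge score for the strong solver's answer, in [0, 1]
}

interface AcceptBar {
  weakMax: number; // weak solver must score strictly below this
  minGap: number;  // assumed minimum strong-minus-weak gap (not stated in the PR)
}

// Gap between the tiers for one candidate question.
function gap(a: Attempt): number {
  return a.strongScore - a.weakScore;
}

// A candidate discriminates only if the weak solver genuinely struggles
// and the strong solver clearly outperforms it.
function accepts(a: Attempt, bar: AcceptBar): boolean {
  return a.weakScore < bar.weakMax && gap(a) >= bar.minGap;
}

// weakMax = 0.5 mirrors the "weak<0.5" bar; minGap = 0.5 is a placeholder.
const bar: AcceptBar = { weakMax: 0.5, minGap: 0.5 };

// Scores in the range reported for the accepted examples: accepted.
const opened = accepts({ weakScore: 0.24, strongScore: 0.86 }, bar); // true

// A competent weak answer (0.75) fails the bar however well the strong
// solver does; the strong score here is made up for the example.
const nulled = accepts({ weakScore: 0.75, strongScore: 1.0 }, bar); // false
```

Under this reading, a run nulls whenever the weak solver answers competently, regardless of how high the strong solver scores.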
Why it nulled (two compounding causes, both fixed)
Changes
- … reasoning dimension LOW when an answer merely restates it; the solver no longer sees the rubric (the mark scheme).
- … (AUTODATA_ATTEMPTS): every candidate (accepted or rejected) is dumped to JSONL with both solvers' answer text + scores, so a null is diagnosable.
- … (calibrate.ts): A/B the two challenger styles on the same doc.
- routerChat retries transient 503/429/timeout with bounded backoff.

Live result
Models: weak groq/llama-3.1-8b-instant, strong gemini-2.5-pro, challenger+judge deepseek-v4-flash (the brief's glm-5.2 was returning upstream-capacity 503s; deepseek is the live, neutral substitute: a different family from both solvers, so no judge bias). Judge reliability verified first (a controlled good-vs-bad pair → consistent 0.82 separation).

Fold widens the gap: plain first-draft 0.306 → refined 0.508 (Δ +0.202). Two accepted examples, both with the weak model genuinely failing the reasoning and the strong model deriving it (read the answers; this is not a judge or leakage artifact). Total live spend ≈ $0.15. Full writeup + autopsied examples in docs/results/autodata-live.md.

Verification

pnpm typecheck, pnpm lint, pnpm build, pnpm test (198 pass) all green.
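For readers unfamiliar with the retry behaviour listed under Changes (transient 503/429/timeout retried with bounded backoff), the following is a rough sketch of the general technique. It is not the actual routerChat code: `withRetry`, `isTransient`, `RetryOptions`, and the error shape (a numeric `status` field, or `name === "TimeoutError"`) are all assumptions made for illustration.

```typescript
// Hypothetical options; the real routerChat settings are not shown in this PR.
interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseMs: number;      // delay before the first retry
  maxMs: number;       // upper bound on any single delay
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

// Assumed error shape: an object with a numeric `status`, or a TimeoutError name.
function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { status?: unknown; name?: unknown };
  return e.status === 503 || e.status === 429 || e.name === "TimeoutError";
}

// Retry `fn` on transient failures with exponential backoff capped at maxMs.
// Non-transient errors, and the last transient one, are rethrown unchanged.
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isTransient(err) || attempt >= opts.maxAttempts) throw err;
      // Delays grow as baseMs, 2*baseMs, 4*baseMs, ... but never exceed maxMs.
      const delay = Math.min(opts.maxMs, opts.baseMs * Math.pow(2, attempt - 1));
      await sleep(delay);
    }
  }
}
```

Capping both the attempt count and the per-attempt delay keeps a capacity outage (such as the upstream 503s mentioned above) from stalling a run indefinitely, while still riding out short blips.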