
feat(autodata): non-extractive causal challenger — open the strong/weak gap#43

Merged
drewstone merged 2 commits into main from autodata/causal-challenger on Jun 26, 2026


@drewstone

Contributor

What

Make the Autodata discriminative data-creation loop actually open the strong/weak gap on real models. The loop previously nulled (0 accepted): an 8B solver scored as well as a frontier one, so no manufactured question separated the tiers.

Why it nulled (two compounding causes, both fixed)

  1. Recall / leakage — the challenger wrote lookup questions whose answer sat in the provided context, so reading beat reasoning.
  2. Memorized doc — the default grounding doc was "Attention Is All You Need", which an 8B has memorized, so even reasoning questions were answerable from pretraining.

Changes

  • Non-extractive causal challenger: authors CAUSAL / COMPARATIVE / MECHANISM / THESIS-CONSISTENCY questions; the context holds premises but withholds the conclusion (the answer must be derived).
  • Reasoning rubric / negative criterion: the judge now sees the context and scores a reasoning dimension LOW when an answer merely restates it; the solver no longer sees the rubric (the mark scheme).
  • Targeted fold: each reject reason steers the next draft toward the "just right" band ("too easy" → go non-extractive/harder, "too hard" → ease, "not discriminative" → sharpen).
  • Per-attempt autopsy trail (AUTODATA_ATTEMPTS): every candidate (accepted or rejected) is dumped to JSONL with both solvers' answer text + scores, so a null is diagnosable.
  • Recall-vs-causal calibration (calibrate.ts): A/B the two challenger styles on the same doc.
  • Default grounding doc → Mixtral (2401.04088), non-memorized for an 8B; proven models as defaults; routerChat retries transient 503/429/timeout with bounded backoff.
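The retry behaviour in the last bullet could look roughly like the sketch below. The PR does not show routerChat's signature or its actual retry policy, so `withRetry`, `isTransient`, `backoffDelayMs`, the error shape, and the delay constants are all hypothetical; only the set of retried conditions (503, 429, timeout) and the idea of a bounded backoff come from the description.

```typescript
// Hypothetical sketch: routerChat's real signature and policy are not shown in this PR.
type Transient = { status?: number; timeout?: boolean };

// Treat 503, 429, and timeouts as retryable, matching the cases the PR names.
function isTransient(err: unknown): boolean {
  const e = err as Transient | null;
  return !!e && (e.status === 503 || e.status === 429 || e.timeout === true);
}

// Exponential delay capped at maxDelayMs, so the backoff stays bounded.
// baseMs and maxDelayMs are assumed values, not taken from the PR.
function backoffDelayMs(attempt: number, baseMs = 500, maxDelayMs = 8000): number {
  return Math.min(maxDelayMs, baseMs * 2 ** attempt);
}

async function withRetry<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Give up on non-transient errors or once the retry budget is spent.
      if (!isTransient(err) || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Capping both the delay and the number of attempts keeps a flaky upstream from stalling the loop indefinitely, while non-transient errors surface immediately.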

Live result

Models: weak groq/llama-3.1-8b-instant, strong gemini-2.5-pro, challenger+judge deepseek-v4-flash. The brief's glm-5.2 was returning upstream-capacity 503s, so deepseek is the live substitute; it comes from a different model family than both solvers, which should keep the judge from favoring either one. Judge reliability was verified first: a controlled good-vs-bad answer pair gave a consistent 0.82 separation.

| condition | accepted | accepted-example gap | weak / strong |
| --- | --- | --- | --- |
| memorized Transformer paper, recall challenger | 0 | mean gap 0.117 | weak saturates 0.68–0.78 |
| Mixtral (non-memorized), causal, target=3 | 1 / 3 | 0.62 | 0.24 / 0.86 (fold turned a "too easy" draft into an accept) |
| Mixtral, causal, target=1 maxRetries=4 | 1 / 1 | 0.76 | 0.24 / 1.00 |

Fold widens the gap: plain first-draft 0.306 → refined 0.508 (Δ +0.202). There are two accepted examples; in both, the weak model genuinely fails the reasoning and the strong model derives it (confirmed by reading the answers, so this is not a judge or leakage artifact). Total live spend ≈ $0.15. Full writeup and autopsied examples are in docs/results/autodata-live.md.
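As a quick arithmetic check on the numbers reported above, here is a small sketch that assumes the gap is simply the strong solver's score minus the weak solver's score. The PR's figures are consistent with that reading, but the PR does not state the definition. The names `SolverScores`, `gap`, and `round3` are illustrative only.

```typescript
// Assumption: "gap" = strong score - weak score (consistent with the reported figures).
interface SolverScores {
  weak: number;
  strong: number;
}

const gap = (s: SolverScores): number => s.strong - s.weak;

// Round to 3 decimals to hide floating-point noise in the subtraction.
const round3 = (x: number): number => Math.round(x * 1000) / 1000;

// The two accepted examples reported in the live result.
const acceptedExamples: SolverScores[] = [
  { weak: 0.24, strong: 0.86 },
  { weak: 0.24, strong: 1.0 },
];
const gaps = acceptedExamples.map((s) => round3(gap(s))); // [0.62, 0.76]

// Fold effect: refined-draft gap minus plain first-draft gap.
const foldDelta = round3(0.508 - 0.306); // 0.202

console.log(gaps, foldDelta);
```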

Verification

pnpm typecheck, pnpm lint, pnpm build, pnpm test (198 pass) all green.

…ak gap

The discriminative data-creation loop nulled because questions leaked their
answer (recall) and the default doc was memorized by the weak solver, so an 8B
read/recalled as well as a frontier model and no example separated the tiers.

- Challenger now authors NON-EXTRACTIVE causal/comparative/mechanism/thesis-
  consistency questions; the context holds premises but withholds the conclusion.
- Judge sees the context and scores a reasoning dimension LOW for restatement
  (the paper's negative criterion); solver no longer sees the rubric.
- Fold steers per reject reason toward the 'just right' band; recall vs causal
  challenger is selectable for calibration (calibrate.ts).
- Per-attempt JSONL autopsy trail (both solvers' answers + scores) for every
  candidate, accepted or rejected, so a null is diagnosable.
- Default grounding doc -> Mixtral (2401.04088, non-memorized for an 8B); proven
  models as defaults; routerChat retries transient 503/429/timeout with backoff.
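The per-attempt autopsy trail described in the list above might be serialized along these lines. This is a sketch only: the PR does not show the record schema, so `AttemptRecord` and its field names are hypothetical, chosen to carry what the description says is dumped (both solvers' answer text and scores, plus the accept/reject outcome).

```typescript
// Hypothetical record shape: field names are illustrative, not the PR's actual schema.
interface AttemptRecord {
  question: string;
  accepted: boolean;
  rejectReason?: "too easy" | "too hard" | "not discriminative";
  weak: { answer: string; score: number };
  strong: { answer: string; score: number };
}

// One JSON object per line. JSON.stringify escapes embedded newlines,
// so a multi-line answer still occupies a single JSONL line.
function toJsonlLine(rec: AttemptRecord): string {
  return JSON.stringify(rec) + "\n";
}

// Parse a JSONL dump back into records, skipping blank lines.
function parseJsonl(text: string): AttemptRecord[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as AttemptRecord);
}
```

In Node, each line could be appended with `appendFileSync` from `node:fs`. Whether AUTODATA_ATTEMPTS names the output path or merely enables the dump is not stated in the PR, so that wiring is left out here.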

Live result (deepseek challenger+judge, groq-llama weak, gemini-2.5-pro strong):
gap opens on the non-memorized doc — 2 accepted examples (gaps 0.62, 0.76; weak
~0.24 vs strong 0.86-1.00, real autopsied reasoning failures), fold widens the
gap +0.202; the memorized Transformer paper still nulls. ~$0.15 live.
tangletools previously approved these changes Jun 26, 2026

@tangletools tangletools left a comment

Contributor


✅ Auto-approved PR — 16182c65

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-26T08:15:22Z

…s, clearing the accept bar does NOT (0/3 on independent re-run)

My independent re-run got 0/3 accepted (vs the original 1-2/3); reading the answers,
the weak 8B scored 0.75 on a competent, correct answer — it didn't struggle, so nothing
cleared weak<0.5. What reproduces is the +0.20 gap-widening from the fold; what doesn't
is the accepted count. Reframed from 'it works' to 'directionally confirmed but noisy/
under-powered at n=3' — the small-n mirage, flagged not buried.
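To make the accept bar and the reject reasons concrete, here is a hedged sketch of how a candidate might be classified and how each reject reason could map to a fold hint. Only the `weak < 0.5` condition comes from the commit message above; `strongMin`, `minGap`, and the names `classify`, `BAR`, and `FOLD_HINT` are assumptions made for illustration, and the actual loop may combine its criteria differently.

```typescript
type RejectReason = "too easy" | "too hard" | "not discriminative";
type Verdict = { accepted: true } | { accepted: false; reason: RejectReason };

// Only weakMax = 0.5 comes from the commit message ("weak<0.5").
// strongMin and minGap are assumed values for this sketch.
const BAR = { weakMax: 0.5, strongMin: 0.7, minGap: 0.3 };

function classify(weak: number, strong: number): Verdict {
  // The weak solver did not struggle: the question is too easy.
  if (weak >= BAR.weakMax) return { accepted: false, reason: "too easy" };
  // Even the strong solver failed: the question is too hard.
  if (strong < BAR.strongMin) return { accepted: false, reason: "too hard" };
  // Both land in range but too close together to separate the tiers.
  if (strong - weak < BAR.minGap) return { accepted: false, reason: "not discriminative" };
  return { accepted: true };
}

// Steering hints paraphrase the PR's description of the targeted fold.
const FOLD_HINT: Record<RejectReason, string> = {
  "too easy": "go non-extractive/harder",
  "too hard": "ease",
  "not discriminative": "sharpen",
};
```

Under these assumed thresholds, the re-run's weak score of 0.75 is rejected as "too easy" whatever the strong solver scores, while the earlier 0.24 / 0.86 example is accepted.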

@tangletools tangletools left a comment

Contributor


✅ Auto-approved PR — 3121d5e7

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-26T08:28:43Z


2 participants