fix(autodata): fail-loud on empty solver content + tier knobs + the honest live result #42

Merged
drewstone merged 1 commit into main from autodata/wide-tier
Jun 25, 2026
@drewstone

Contributor

Autopsy of the Autodata live null: at maxTokens=1024 the strong reasoning-model solver returned EMPTY content (hidden reasoning consumed the whole budget) and was scored 0, which produced a spurious negative strong/weak gap in every run. Fix: maxTokens=8000, fail loud on empty content (no silent zeros), solver tier as env knobs, and a price table for the wide tier. With the fix the gap is ~0 rather than negative: on extractive doc-grounded QA an 8B model (llama-3.1-8b) scores as well as a frontier model (gemini-2.5-pro), so 0 examples discriminate between tiers, and the real lever is CHALLENGER difficulty (non-extractive questions), not model tier. Full autopsy and numbers are in docs/results/autodata-live.md.
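
For readers who want to see the fail-loud idea concretely, here is a minimal sketch of the check, not the code from this PR. `SolverResponse`, `EmptySolverContentError`, and `requireContent` are hypothetical names invented for illustration. The assumption is that the solver client exposes the visible answer as a string that may come back empty when reasoning tokens exhaust the budget.

```typescript
// Hypothetical response shape; the project's real type may differ.
interface SolverResponse {
  content: string | null | undefined; // visible answer text, if any
  finishReason?: string; // e.g. "length" when the token budget ran out
}

// Dedicated error so an empty answer can never be mistaken for a wrong answer.
class EmptySolverContentError extends Error {
  constructor(model: string, maxTokens: number, finishReason?: string) {
    super(
      `solver ${model} returned empty content ` +
        `(maxTokens=${maxTokens}, finishReason=${finishReason ?? "unknown"})`,
    );
    this.name = "EmptySolverContentError";
  }
}

// Return the trimmed visible answer, or throw instead of letting the
// caller score an empty string as 0.
function requireContent(
  model: string,
  maxTokens: number,
  res: SolverResponse,
): string {
  const text = (res.content ?? "").trim();
  if (text.length === 0) {
    throw new EmptySolverContentError(model, maxTokens, res.finishReason);
  }
  return text;
}
```

Throwing here means a starved reasoning budget surfaces as a run failure rather than as a zero score that drags the strong model's average down.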

…Tokens + tier env knobs

The strong reasoning-model solver returned empty visible content at maxTokens=1024
(budget spent on hidden reasoning) and was silently scored 0, manufacturing a spurious
negative strong/weak gap across every live run. Fix: maxTokens=8000 + throw on empty
content (no silent zeros). Make the solver tier an env knob (AUTODATA_WEAK_MODEL/
STRONG_MODEL/CHALLENGER_MODEL/JUDGE_MODEL) + price the wide tier. With the fix the gap
is ~0 (not negative): on extractive doc-grounded QA an 8B scores as well as a frontier
model, so the lever is challenger difficulty, not model tier. See docs/results/autodata-live.md.
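
The env-knob idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `Role`, `ENV_KEYS`, and `resolveModels` are hypothetical names, and the three variable names after `AUTODATA_WEAK_MODEL` assume the same `AUTODATA_` prefix, which the commit message abbreviates rather than spells out.

```typescript
type Role = "weak" | "strong" | "challenger" | "judge";

// Assumed variable names: only AUTODATA_WEAK_MODEL is written in full in the
// commit message; the shared AUTODATA_ prefix on the others is a guess.
const ENV_KEYS: Record<Role, string> = {
  weak: "AUTODATA_WEAK_MODEL",
  strong: "AUTODATA_STRONG_MODEL",
  challenger: "AUTODATA_CHALLENGER_MODEL",
  judge: "AUTODATA_JUDGE_MODEL",
};

// Overlay any non-blank environment values on top of the defaults.
// The defaults object is copied, not mutated.
function resolveModels(
  env: Record<string, string | undefined>,
  defaults: Record<Role, string>,
): Record<Role, string> {
  const out: Record<Role, string> = { ...defaults };
  for (const role of Object.keys(ENV_KEYS) as Role[]) {
    const value = env[ENV_KEYS[role]]?.trim();
    if (value) {
      out[role] = value;
    }
  }
  return out;
}
```

In a Node process this would typically be called as `resolveModels(process.env, defaults)`, with `defaults` holding whatever model identifiers the project ships with.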

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — 2953db28

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T22:29:59Z

@drewstone drewstone merged commit 2db1feb into main Jun 25, 2026
1 check passed
2 participants