feat(research): investment-thesis research eval (material-facts-surfaced, a metric that needs investigation, not retrieval) #40
…terial facts)
Add tests/eval/investment-thesis-set.ts: a firewalled eval set for the research loop, mirroring tests/loops/held-out-exam.ts. Give a loop a company + ticker + an as-of cutoff and grade the thesis it writes BLIND against held-out material facts it never saw, so a high score reflects thesis quality (surfacing buried, non-obvious drivers a one-shot ticker search misses), not teaching-to-the-test. Grading is a $0, model-free substring check, so the answer key never reaches a model the loop could observe.
Five public companies, cutoff >= 18 months ago (outcome known but never a checklist item): SVB Financial (SIVB), Bed Bath & Beyond (BBBY), Carvana (CVNA), Peloton (PTON), Silvergate (SI). 27 facts across 8 analyst lenses (concentration, leverage, margin-trend, liquidity, capital-return, governance, off-balance-sheet, regulatory).
Every fact is grounded in the company's primary SEC EDGAR 10-K filed on or before the cutoff, with the source URL (CIK-verified) and the literal value read from the filing, e.g.:
- SIVB: ~$15.1B held-to-maturity unrealized loss ($91,321M amortized vs $76,169M fair value) sitting only in the footnotes against $16,004M total equity
- BBBY: $574.9M FY2021 buyback in a year it lost $560M and generated $18M of operating cash
- CVNA: interest expense up 2.8x to $486M, plus Garcia-family related-party leases
- PTON: negative (11)% hardware gross margin
- SI: 99.5% noninterest-bearing deposits, ~58% from crypto exchanges
docs/eval/investment-material-facts.md is the answer key + provenance ledger: per-item filing URL and value, an honest curation-bias disclosure (downside/outcome-selection skew, the lens distribution, single-source provenance), and a drop log of candidate items that could not be independently sourced at the cutoff (incl. First Republic, dropped because its 10-K is FDIC-filed, not on EDGAR). Such items were dropped rather than guessed.
Offline tests (10) assert structure, provenance integrity (sourceUrl CIK matches the company), the >= 18-month cutoff floor, and that the deterministic grader surfaces a fact from its evidence, misses on filler, and scores a surface-only thesis below the held-out bar.
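A minimal sketch of what a model-free substring grader of this kind could look like. The `MaterialFact` shape, the "every group must fire" rule, and the example groups below are assumptions made for illustration; the actual checklist types and matching rules live in the repo and may differ.

```typescript
// Hypothetical shapes: the real checklist types in the repo may differ.
interface MaterialFact {
  id: string;
  lens: string;
  // Each group lists alternative substrings; a group fires if any alternative appears.
  groups: string[][];
}

interface GradeResult {
  surfaced: string[];
  missed: string[];
  score: number; // fraction of facts surfaced, in [0, 1]
}

// Normalize so matching ignores case and runs of whitespace.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Assumption: a fact counts as surfaced only when every one of its groups fires.
function factSurfaced(thesis: string, fact: MaterialFact): boolean {
  const haystack = normalize(thesis);
  return fact.groups.every((group) =>
    group.some((needle) => haystack.includes(normalize(needle))),
  );
}

// Deterministic, model-free grading: no network, no LLM, no cost.
function gradeThesis(thesis: string, checklist: MaterialFact[]): GradeResult {
  const surfaced: string[] = [];
  const missed: string[] = [];
  for (const fact of checklist) {
    (factSurfaced(thesis, fact) ? surfaced : missed).push(fact.id);
  }
  const score = checklist.length === 0 ? 0 : surfaced.length / checklist.length;
  return { surfaced, missed, score };
}

// Illustrative checklist entry (groups invented for this sketch, not the real answer key).
const exampleChecklist: MaterialFact[] = [
  {
    id: "sivb-htm-unrealized-loss",
    lens: "off-balance-sheet",
    groups: [["held-to-maturity"], ["unrealized loss"], ["15.1", "76,169"]],
  },
];
```

Requiring every group to fire is one way to keep a surface-level summary from scoring on a single generic keyword: the thesis has to mention the instrument, the nature of the loss, and a concrete figure together.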
tangletools
left a comment
✅ Auto-approved PR — 2bfc0f72
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T13:11:27Z
…librated as a gate
Build the held-out investment-research loop into a runnable task and a KB-reading
metric, then CALIBRATE the metric as a gate before any A/B.
- materialFactsSurfaced(kb, checklist): reads a KB (root dir or index) and grades
its pages + source text against a company's held-out material-fact checklist
with the existing $0, model-free substring grader. Pure core
(materialFactsSurfacedInText) so calibration scores a thesis string directly.
- runInvestmentThesisTask({company,ticker,cik,cutoff}): drives the existing
two-agent research loop (web worker + driver) over web + SEC EDGAR with
analyst-lens readiness specs, then synthesizes a thesis page into the KB. Builds
nothing new for the loop — composes worker/driver/loop + a synthesis pass.
- Move the checklist + grader from tests/ into src/ (shippable eval primitive) so
the metric can import it without crossing the src rootDir boundary; tests/ keeps
a thin re-export shim. Export the metric + task + checklist from the index.
CALIBRATION GATE (offline, $0 — the binding result): shallow ticker-summary
theses surface 1/27 (4%), deep filings-grounded theses surface 27/27 (100%), a
+96pt gap; every company clears shallow<30% / deep>70%. The metric discriminates
research depth, not collection — so the A/B may proceed. An anti-circularity test
asserts the deep theses do not verbatim-embed the checklist evidence.
Calibration caught + fixed three over-loose SI grader groups (fired on bare
crypto / grew rapidly / proprietary) that let a surface summary score 60%;
tightened to require the buried signal (concentration / $14.3B / SEN).
Live path cost-gated: glm-5.2 smoke OK; web search returns the real CVNA 10-K on
sec.gov; CVNA live pilot surfaced 2/5 facts for $0.027 from the fetched filing.
Docs: docs/results/investment-calibration.md. lint/typecheck/build clean,
184 offline tests pass.
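The calibration gate and the anti-circularity guard described above can be sketched as two small pure checks. This is an illustration only: the `CompanyCalibration` shape and both function names are hypothetical, and the only values taken from the commit message are the shallow < 30% and deep > 70% bars.

```typescript
// Hypothetical per-company calibration row; the repo's own types may differ.
interface CompanyCalibration {
  ticker: string;
  shallowScore: number; // fraction of held-out facts a ticker-summary thesis surfaces
  deepScore: number; // fraction a filings-grounded thesis surfaces
}

// Bars quoted in the commit message: shallow < 30%, deep > 70%.
const SHALLOW_MAX = 0.3;
const DEEP_MIN = 0.7;

// The gate passes only if every company clears both bars.
function passesCalibrationGate(rows: CompanyCalibration[]): boolean {
  return (
    rows.length > 0 &&
    rows.every((r) => r.shallowScore < SHALLOW_MAX && r.deepScore > DEEP_MIN)
  );
}

// Anti-circularity: a deep calibration thesis must not contain any checklist
// evidence string verbatim, otherwise a 100% score would be trivially circular.
function embedsEvidenceVerbatim(thesis: string, evidence: string[]): boolean {
  const haystack = thesis.toLowerCase();
  return evidence.some((e) => {
    const needle = e.trim().toLowerCase();
    return needle.length > 0 && haystack.includes(needle);
  });
}
```

Checking the bars per company, not only on the pooled 27 facts, means a single name with a loose grader (as happened with the three Silvergate groups) fails the gate instead of being averaged away.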
tangletools
left a comment
✅ Auto-approved PR — d3cbcb47
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T13:37:02Z
Add the single-agent collection baseline (createCollectionResearchDriver) and extend the investment-thesis A/B to run all three topologies — collection / verify / driving — over the five held-out companies at matched compute, scoring each KB with the firewalled materialFactsSurfaced metric and pricing each arm from RouterClient.usage(). Live result (5 companies x 3 arms x 3 rounds, glm-5.2, $0.36): driving surfaces the most buried facts (16/27, 59%) vs blind collection (11/27, 41%) for ~1.9x the cost, but at n=5 the paired-bootstrap CI crosses zero (P(delta<=0)=0.08) — promising, under-powered, not a clean win. Verify does not beat collection (10/27) and on two companies its strict primary-source gate zeroes the KB (autopsied: it correctly rejects aggregators the worker fetched in place of the EDGAR filing). Full writeup + per-company matrix + significance in docs/results/investment-thesis.md.
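For readers unfamiliar with the significance test named above, here is a minimal sketch of a paired bootstrap over per-company score differences. It is not the repo's implementation: the function names, the seeded PRNG, and the percentile interval are assumptions, and the per-company counts in the usage note are placeholders, so the output will not reproduce the reported P(delta<=0)=0.08.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapResult {
  meanDelta: number; // observed mean of (treatment - baseline)
  ciLow: number; // 2.5th percentile of resampled means
  ciHigh: number; // 97.5th percentile of resampled means
  pDeltaLeqZero: number; // share of resampled means that are <= 0
}

// Paired bootstrap: resample companies (pairs) with replacement, never arms
// independently, so each company's difficulty cancels out within its own pair.
function pairedBootstrap(
  treatment: number[],
  baseline: number[],
  resamples = 10000,
  seed = 1,
): BootstrapResult {
  if (treatment.length !== baseline.length || treatment.length === 0) {
    throw new Error("arms must be non-empty and the same length");
  }
  const deltas = treatment.map((t, i) => t - baseline[i]);
  const n = deltas.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  let leqZero = 0;
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    const mean = sum / n;
    means.push(mean);
    if (mean <= 0) leqZero++;
  }
  means.sort((x, y) => x - y);
  const at = (q: number) => means[Math.min(resamples - 1, Math.floor(q * resamples))];
  return {
    meanDelta: deltas.reduce((s, d) => s + d, 0) / n,
    ciLow: at(0.025),
    ciHigh: at(0.975),
    pDeltaLeqZero: leqZero / resamples,
  };
}
```

A call such as `pairedBootstrap([4, 3, 1, 2, 2], [1, 1, 2, 1, 2])` (placeholder facts-surfaced counts for five companies) returns the mean per-company delta, a percentile interval, and the share of resamples at or below zero. With only five pairs, a single company where the treatment arm loses is enough to pull the lower bound to zero or below, which is the sense in which a result at n=5 is under-powered.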
tangletools
left a comment
✅ Auto-approved PR — 338bc547
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T14:35:12Z
…ibration + 3-arm A/B in one section
Rewrite docs/results/investment-thesis.md as one coherent paper section covering all
three parts of the result, and fold the standalone calibration doc into it (the two
files said the same thing; one result, one doc, matching the repo's docs/results/*.md
convention):
- §1 the domain reframe: we left ML-paper retrieval — where a single search already
collects the answer, so the metric could only measure collection and the §9 topology
A/B came back a structural null — for investment research, where the material facts
are buried in 10-K footnotes and a single fetch provably cannot surface them, so a
topology A/B is finally well-posed.
- §2 the calibration (the gate the ML exam passed only weakly): the $0 model-free
metric discriminates depth on the harder domain — shallow ticker summary 1/27 (4%),
deep filings-grounded thesis 27/27 (100%), +96pp — with the anti-circularity guard
and the autopsied Silvergate grader-tightening kept inline. Reproducible offline.
- §3-§5 the live 3-arm verdict, unchanged numbers: driving surfaces the most buried
facts (16/27, 59%) vs blind collection (11/27, 41%) for ~1.9x the cost, but at n=5
the paired-bootstrap CI crosses zero (P(delta<=0)=0.08); verify does not beat
collection (10/27) and on two companies its strict primary-only gate zeroes the KB.
Honest: no topology significantly beats collection — driving only leans positive.
Reference the unified result from the main paper (docs/two-agent-research-ab.md): a new
§9.1 connecting the ML null ("the domain was too easy") to the reframe, plus the
per-result link list at the tail.
A/B numbers transcribed verbatim from commit 338bc54's recorded live run; calibration
numbers verified green offline (pnpm test investment-calibration). Docs-only change.
tangletools
left a comment
✅ Auto-approved PR — 64a9a706
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T14:41:43Z
What
Investment-thesis research eval: a domain reframe, a calibrated metric, and a controlled 3-arm topology A/B — the full result, on the branch and written up.
We moved the research eval off ML-paper retrieval — where a single web search already returns the answer, so the metric could only measure collection and the prior topology A/B came back a structural null — onto investment research, where the material facts are buried in 10-K footnotes and genuinely require investigation to surface. That is the domain where a smarter coordinator finally has room to beat blind collection, so the A/B is well-posed.
The three parts of the result
1. The held-out set (tests/eval/investment-thesis-set.ts): 5 public companies, each cutoff >= 18 months ago, 27 material facts across 8 analyst lenses, every fact read live out of the primary SEC EDGAR 10-K, every value knowable at the cutoff (the eventual collapse is recorded for the reader only and is never graded). Grading is a $0, model-free substring check; the checklist is firewalled.
2. Calibration: does the metric discriminate depth? (pnpm test investment-calibration, $0, offline, the binding gate): a deliberately shallow ticker summary scores 1/27 (4%), a deep filings-grounded thesis scores 27/27 (100%), a +96-point gap, with every company clearing the bars (shallow < 30%, deep > 70%). An anti-circularity guard asserts no deep thesis verbatim-embeds the checklist evidence, and calibration itself caught + closed three over-loose Silvergate grader groups before any A/B depended on them. This is the gate the ML deep-question exam passed only in spirit: same discrimination, materially harder domain.
3. The live 3-arm A/B (5 companies × 3 arms × 3 rounds, glm-5.2, $0.36, matched compute):
Honest verdict: no topology significantly beats collection. Driving surfaces the most buried facts (+5 over collection, +18pp) for ~1.9× the cost, but at n=5 the paired-bootstrap CI crosses zero (P(Δ≤0)=0.08): promising, under-powered, consistent with the project's prior topology nulls. Verify does not beat collection, and on two companies its strict primary-source gate zeroes the KB (autopsied: it correctly rejects the aggregator pages the worker fetched in place of the EDGAR filing). Every dollar is real per-call provenance from RouterClient.usage().
This PR's doc change
Unifies the writeup into one coherent section, docs/results/investment-thesis.md (§1 reframe, §2 calibration, §3-§5 the A/B verdict + autopsy), folding in the standalone calibration doc so there is one result, one doc, matching the repo's docs/results/*.md convention. References it from the main paper (docs/two-agent-research-ab.md §9.1 + the per-result link list).
Provenance + verification
- Calibration numbers: pnpm test investment-calibration, green.
- A/B numbers: recorded live run in commit 338bc54.
- Answer key + provenance ledger: docs/eval/investment-material-facts.md.
- pnpm typecheck clean, pnpm build success, pnpm lint exit 0 (2 pre-existing warnings in unrelated src/lint.ts / src/wikilinks.ts), pnpm test 184 passed / 12 live-gated skipped.
Limitations
Does not prove driving significantly beats collection (n=5, CI crosses zero). The 27-fact checklist carries a documented curation bias (downside-risk, distressed-name skew). Next rung: expand the held-out set past 5 companies and re-run at n>=24 so a +1-fact/company effect can clear a paired bootstrap; driving is the arm worth funding that test on, verify is not.