feat(research): investment-thesis research eval — material-facts-surfaced, a metric that needs investigation not retrieval #40

Merged: drewstone merged 4 commits into main from feat/investment-thesis-eval on Jun 25, 2026


@drewstone commented Jun 25, 2026

What

Investment-thesis research eval: a domain reframe, a calibrated metric, and a controlled 3-arm topology A/B — the full result, on the branch and written up.

We moved the research eval off ML-paper retrieval — where a single web search already returns the answer, so the metric could only measure collection and the prior topology A/B came back a structural null — onto investment research, where the material facts are buried in 10-K footnotes and genuinely require investigation to surface. That is the domain where a smarter coordinator finally has room to beat blind collection, so the A/B is well-posed.

The three parts of the result

1. The held-out set (tests/eval/investment-thesis-set.ts): 5 public companies, each cutoff >= 18 months ago, 27 material facts across 8 analyst lenses, every fact read live out of the primary SEC EDGAR 10-K, every value knowable at the cutoff (the eventual collapse is recorded for the reader only and is never graded). Grading is a $0, model-free substring check; the checklist is firewalled.

2. Calibration — does the metric discriminate depth? (pnpm test investment-calibration, $0, offline, the binding gate): a deliberately shallow ticker summary scores 1/27 (4%), a deep filings-grounded thesis scores 27/27 (100%) — a +96-point gap, every company clearing the bars (shallow < 30%, deep > 70%). An anti-circularity guard asserts no deep thesis verbatim-embeds the checklist evidence, and calibration itself caught + closed three over-loose Silvergate grader groups before any A/B depended on them. This is the gate the ML deep-question exam passed only in spirit — same discrimination, materially harder domain.

3. The live 3-arm A/B (5 companies × 3 arms × 3 rounds, glm-5.2, $0.36, matched compute):

| Arm | Material facts surfaced | Cost (5 companies) |
| --- | --- | --- |
| collection (blind) | 11/27 (41%) | $0.082 |
| verify/dedup | 10/27 (37%) | $0.125 |
| driving (deepen) | 16/27 (59%) | $0.157 |

Honest verdict: no topology significantly beats collection. Driving surfaces the most buried facts (+5 over collection, +18pp) for ~1.9× the cost, but at n=5 the paired-bootstrap CI crosses zero (P(Δ≤0)=0.08) — promising, under-powered, consistent with the project's prior topology nulls. Verify does not beat collection, and on two companies its strict primary-source gate zeroes the KB (autopsied: it correctly rejects the aggregator pages the worker fetched in place of the EDGAR filing). Every dollar is real per-call provenance from RouterClient.usage().
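
The significance claim above rests on a paired bootstrap over per-company differences (driving minus collection). The PR does not show that code, so the following is only a minimal sketch under stated assumptions: a percentile interval, resampling companies with replacement, and a seeded PRNG for reproducibility. The names `pairedBootstrap` and `mulberry32`, the iteration count, and the example inputs are all hypothetical and are not the run's per-company data.

```typescript
// Small deterministic PRNG so repeated runs give the same resamples.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapResult {
  meanDelta: number; // observed mean paired difference
  ciLow: number; // 2.5th percentile of resampled means
  ciHigh: number; // 97.5th percentile of resampled means
  pDeltaLeqZero: number; // share of resampled means that are <= 0
}

// deltas[i] = (facts surfaced by arm B) - (facts surfaced by arm A) for company i.
function pairedBootstrap(deltas: number[], iterations = 10000, seed = 1): BootstrapResult {
  if (deltas.length === 0) throw new Error("need at least one paired difference");
  const rand = mulberry32(seed);
  const n = deltas.length;
  const means: number[] = [];
  for (let i = 0; i < iterations; i++) {
    let sum = 0;
    // Resample companies with replacement, keeping each pair intact.
    for (let j = 0; j < n; j++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const at = (q: number) => means[Math.min(iterations - 1, Math.floor(q * iterations))];
  return {
    meanDelta: deltas.reduce((x, y) => x + y, 0) / n,
    ciLow: at(0.025),
    ciHigh: at(0.975),
    pDeltaLeqZero: means.filter((m) => m <= 0).length / iterations,
  };
}

// Illustrative differences only, NOT the recorded per-company results.
console.log(pairedBootstrap([2, 1, -1, 0, 3]));
```

With only five pairs, a single company with a negative difference is enough to pull the lower percentile to or below zero, which is why a larger held-out set is needed before a small per-company effect can clear the interval.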

This PR's doc change

Unifies the writeup into one coherent section — docs/results/investment-thesis.md (§1 reframe, §2 calibration, §3-§5 the A/B verdict + autopsy) — folding in the standalone calibration doc so there is one result, one doc, matching the repo's docs/results/*.md convention. References it from the main paper (docs/two-agent-research-ab.md §9.1 + the per-result link list).

Provenance + verification

  • Calibration numbers reproducible offline ($0): pnpm test investment-calibration — green.
  • A/B numbers transcribed verbatim from the live run recorded in commit 338bc54.
  • Per-fact provenance, curation-bias disclosure, and the drop log: docs/eval/investment-material-facts.md.
  • Gates on the branch: pnpm typecheck clean, pnpm build success, pnpm lint exit 0 (2 pre-existing warnings in unrelated src/lint.ts/src/wikilinks.ts), pnpm test 184 passed / 12 live-gated skipped.

Limitations

Does not prove driving significantly beats collection (n=5, CI crosses zero). The 27-fact checklist carries a documented curation bias (downside-risk, distressed-name skew). Next rung: expand the held-out set past 5 companies and re-run at n>=24 so a +1-fact/company effect can clear a paired bootstrap; driving is the arm worth funding that test on, verify is not.

…terial facts)

Add tests/eval/investment-thesis-set.ts: a firewalled eval set for the
research loop, mirroring tests/loops/held-out-exam.ts. Give a loop a company
+ ticker + an as-of cutoff and grade the thesis it writes BLIND against
held-out material facts it never saw — so a high score is thesis quality
(surfacing buried, non-obvious drivers a one-shot ticker search misses), not
teaching-to-the-test. Grading is a $0, model-free substring check, so the
answer key never reaches a model the loop could observe.
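
A rough sketch of what a $0, model-free substring grader of this kind could look like. The shapes below (`MaterialFact`, `evidenceGroups`, `gradeThesis`) are assumptions made for illustration and are not the repo's actual types; the one idea carried over from the text is that grading is plain string matching with no model call.

```typescript
// Hypothetical checklist item: every group must match, and within a group
// any single term is enough (an AND of ORs).
interface MaterialFact {
  id: string;
  evidenceGroups: string[][];
}

interface GradeResult {
  surfaced: string[]; // ids of facts found in the thesis
  missed: string[]; // ids of facts not found
  score: number; // surfaced / total, in [0, 1]
}

// Lowercase and collapse whitespace so line wraps do not hide a match.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

function gradeThesis(thesis: string, checklist: MaterialFact[]): GradeResult {
  const haystack = normalize(thesis);
  const surfaced: string[] = [];
  const missed: string[] = [];
  for (const fact of checklist) {
    // A fact with no groups can never count as surfaced.
    const hit =
      fact.evidenceGroups.length > 0 &&
      fact.evidenceGroups.every((group) =>
        group.some((term) => haystack.includes(normalize(term))),
      );
    (hit ? surfaced : missed).push(fact.id);
  }
  const score = checklist.length === 0 ? 0 : surfaced.length / checklist.length;
  return { surfaced, missed, score };
}
```

Because the grader is deterministic and runs locally, the checklist never has to be sent to a model, which is what keeps the answer key out of anything the loop could observe.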

Five public companies, cutoff >= 18 months ago (outcome known but never a
checklist item): SVB Financial (SIVB), Bed Bath & Beyond (BBBY), Carvana
(CVNA), Peloton (PTON), Silvergate (SI). 27 facts across 8 analyst lenses
(concentration, leverage, margin-trend, liquidity, capital-return,
governance, off-balance-sheet, regulatory).

Every fact is grounded in the company's primary SEC EDGAR 10-K filed on or
before the cutoff, with the source URL (CIK-verified) and the literal value
read from the filing — e.g. SIVB's ~$15.1B held-to-maturity unrealized loss
($91,321M amortized vs $76,169M fair value) sitting only in the footnotes
against $16,004M total equity; BBBY's $574.9M FY2021 buyback in a year it
lost $560M and generated $18M of operating cash; CVNA's interest expense up
2.8x to $486M plus Garcia-family related-party leases; PTON's negative (11)%
hardware gross margin; SI's 99.5% noninterest-bearing deposits, ~58% from
crypto exchanges.

docs/eval/investment-material-facts.md is the answer key + provenance ledger:
per-item filing URL and value, an honest curation-bias disclosure (downside/
outcome-selection skew, the lens distribution, single-source provenance), and
a drop log of candidate items that could not be independently sourced at the
cutoff (incl. First Republic, dropped because its 10-K is FDIC-filed not on
EDGAR) — dropped rather than guessed.

Offline tests (10) assert structure, provenance integrity (sourceUrl CIK
matches the company), the >= 18-month cutoff floor, and that the deterministic
grader surfaces a fact from its evidence, misses on filler, and scores a
surface-only thesis below the held-out bar.
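
The two provenance checks named above (the source URL's CIK matches the company, and the cutoff is at least 18 months old) could be sketched as follows. This is an illustration, not the repo's test code: the helper names are hypothetical, the CIK in the usage note is a made-up placeholder, and the URL pattern assumes the standard `https://www.sec.gov/Archives/edgar/data/<CIK>/...` layout of EDGAR filing links.

```typescript
// Extract the numeric CIK from an EDGAR Archives URL, or null if the URL
// does not point at sec.gov's filing archive (e.g. an aggregator page).
function cikFromEdgarUrl(url: string): number | null {
  const match = /^https:\/\/www\.sec\.gov\/Archives\/edgar\/data\/(\d+)\//i.exec(url);
  return match ? Number.parseInt(match[1], 10) : null;
}

// Compare numerically so a zero-padded CIK ("0001234567") matches the
// unpadded form that appears in the URL path.
function sourceMatchesCompany(sourceUrl: string, companyCik: string): boolean {
  const fromUrl = cikFromEdgarUrl(sourceUrl);
  return fromUrl !== null && fromUrl === Number.parseInt(companyCik, 10);
}

// Whole calendar months elapsed between two dates (UTC).
function wholeMonthsBetween(earlier: Date, later: Date): number {
  let months =
    (later.getUTCFullYear() - earlier.getUTCFullYear()) * 12 +
    (later.getUTCMonth() - earlier.getUTCMonth());
  if (later.getUTCDate() < earlier.getUTCDate()) months -= 1;
  return months;
}

// The 18-month floor from the text; `asOf` is passed in so the check is testable.
function cutoffIsOldEnough(cutoff: Date, asOf: Date, minMonths = 18): boolean {
  return wholeMonthsBetween(cutoff, asOf) >= minMonths;
}
```

For example, `sourceMatchesCompany("https://www.sec.gov/Archives/edgar/data/1234567/000123456723000001/form10k.htm", "0001234567")` returns `true`, while any non-EDGAR URL returns `false`.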

tangletools previously approved these changes Jun 25, 2026

✅ Auto-approved PR — 2bfc0f72

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T13:11:27Z

…librated as a gate

Build the held-out investment-research loop into a runnable task and a KB-reading
metric, then CALIBRATE the metric as a gate before any A/B.

- materialFactsSurfaced(kb, checklist): reads a KB (root dir or index) and grades
  its pages + source text against a company's held-out material-fact checklist
  with the existing $0, model-free substring grader. Pure core
  (materialFactsSurfacedInText) so calibration scores a thesis string directly.
- runInvestmentThesisTask({company,ticker,cik,cutoff}): drives the existing
  two-agent research loop (web worker + driver) over web + SEC EDGAR with
  analyst-lens readiness specs, then synthesizes a thesis page into the KB. Builds
  nothing new for the loop — composes worker/driver/loop + a synthesis pass.
- Move the checklist + grader from tests/ into src/ (shippable eval primitive) so
  the metric can import it without crossing the src rootDir boundary; tests/ keeps
  a thin re-export shim. Export the metric + task + checklist from the index.

CALIBRATION GATE (offline, $0 — the binding result): shallow ticker-summary
theses surface 1/27 (4%), deep filings-grounded theses surface 27/27 (100%), a
+96pt gap; every company clears shallow<30% / deep>70%. The metric discriminates
research depth, not collection — so the A/B may proceed. An anti-circularity test
asserts the deep theses do not verbatim-embed the checklist evidence.
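
The gate described above reduces to two per-company threshold checks plus the anti-circularity guard. A minimal sketch, assuming scores are fractions in [0, 1] and that "checklist evidence" means the longer source passages rather than the grader's short match terms; the thresholds come from the text, while `CalibrationRow`, `calibrationFailures`, and `embedsEvidenceVerbatim` are hypothetical names.

```typescript
interface CalibrationRow {
  company: string;
  shallowScore: number; // fraction of facts a shallow ticker summary surfaces
  deepScore: number; // fraction a deep filings-grounded thesis surfaces
}

const SHALLOW_MAX = 0.3; // shallow must score strictly below 30%
const DEEP_MIN = 0.7; // deep must score strictly above 70%

// Returns one message per violated bar; an empty array means the gate passes.
function calibrationFailures(rows: CalibrationRow[]): string[] {
  const failures: string[] = [];
  for (const row of rows) {
    if (!(row.shallowScore < SHALLOW_MAX)) {
      failures.push(`${row.company}: shallow score ${row.shallowScore} is not below ${SHALLOW_MAX}`);
    }
    if (!(row.deepScore > DEEP_MIN)) {
      failures.push(`${row.company}: deep score ${row.deepScore} is not above ${DEEP_MIN}`);
    }
  }
  return failures;
}

// Anti-circularity guard: a deep thesis must not simply paste an evidence
// passage from the answer key. Comparison ignores case and whitespace runs.
function embedsEvidenceVerbatim(thesis: string, evidencePassages: string[]): boolean {
  const norm = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();
  const haystack = norm(thesis);
  return evidencePassages.some((passage) => {
    const needle = norm(passage);
    return needle.length > 0 && haystack.includes(needle);
  });
}
```

Checking every company separately, rather than only the pooled 1/27 and 27/27 totals, prevents one easy company from masking a loose grader on another.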

Calibration caught + fixed three over-loose SI grader groups (fired on bare
crypto / grew rapidly / proprietary) that let a surface summary score 60%;
tightened to require the buried signal (concentration / $14.3B / SEN).

Live path cost-gated: glm-5.2 smoke OK; web search returns the real CVNA 10-K on
sec.gov; CVNA live pilot surfaced 2/5 facts for $0.027 from the fetched filing.

Docs: docs/results/investment-calibration.md. lint/typecheck/build clean,
184 offline tests pass.

tangletools previously approved these changes Jun 25, 2026

✅ Auto-approved PR — d3cbcb47

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T13:37:02Z

Add the single-agent collection baseline (createCollectionResearchDriver) and
extend the investment-thesis A/B to run all three topologies — collection /
verify / driving — over the five held-out companies at matched compute, scoring
each KB with the firewalled materialFactsSurfaced metric and pricing each arm
from RouterClient.usage().

Live result (5 companies x 3 arms x 3 rounds, glm-5.2, $0.36): driving surfaces
the most buried facts (16/27, 59%) vs blind collection (11/27, 41%) for ~1.9x
the cost, but at n=5 the paired-bootstrap CI crosses zero (P(delta<=0)=0.08) —
promising, under-powered, not a clean win. Verify does not beat collection
(10/27) and on two companies its strict primary-source gate zeroes the KB
(autopsied: it correctly rejects aggregators the worker fetched in place of the
EDGAR filing). Full writeup + per-company matrix + significance in
docs/results/investment-thesis.md.

tangletools previously approved these changes Jun 25, 2026

✅ Auto-approved PR — 338bc547

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T14:35:12Z

…ibration + 3-arm A/B in one section

Rewrite docs/results/investment-thesis.md as one coherent paper section covering all
three parts of the result, and fold the standalone calibration doc into it (the two
files said the same thing; one result, one doc, matching the repo's docs/results/*.md
convention):

- §1 the domain reframe: we left ML-paper retrieval — where a single search already
  collects the answer, so the metric could only measure collection and the §9 topology
  A/B came back a structural null — for investment research, where the material facts
  are buried in 10-K footnotes and a single fetch provably cannot surface them, so a
  topology A/B is finally well-posed.
- §2 the calibration (the gate the ML exam passed only weakly): the $0 model-free
  metric discriminates depth on the harder domain — shallow ticker summary 1/27 (4%),
  deep filings-grounded thesis 27/27 (100%), +96pp — with the anti-circularity guard
  and the autopsied Silvergate grader-tightening kept inline. Reproducible offline.
- §3-§5 the live 3-arm verdict, unchanged numbers: driving surfaces the most buried
  facts (16/27, 59%) vs blind collection (11/27, 41%) for ~1.9x the cost, but at n=5
  the paired-bootstrap CI crosses zero (P(delta<=0)=0.08); verify does not beat
  collection (10/27) and on two companies its strict primary-only gate zeroes the KB.
  Honest: no topology significantly beats collection — driving only leans positive.

Reference the unified result from the main paper (docs/two-agent-research-ab.md): a new
§9.1 connecting the ML null ("the domain was too easy") to the reframe, plus the
per-result link list at the tail.

A/B numbers transcribed verbatim from commit 338bc54's recorded live run; calibration
numbers verified green offline (pnpm test investment-calibration). Docs-only change.

@tangletools left a comment

✅ Auto-approved PR — 64a9a706

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T14:41:43Z

@drewstone changed the title from "test(eval): held-out investment-research eval set (5 companies, 27 material facts)" to "feat(research): investment-thesis research eval — material-facts-surfaced, a metric that needs investigation not retrieval" Jun 25, 2026
@drewstone merged commit 5e9adb0 into main Jun 25, 2026
1 check passed