feat(research): investment-thesis research eval — material-facts-surfaced, a metric that needs investigation not retrieval by drewstone · Pull Request #40 · tangle-network/agent-knowledge

drewstone · 2026-06-25T13:11:20Z

What

Investment-thesis research eval: a domain reframe, a calibrated metric, and a controlled 3-arm topology A/B — the full result, on the branch and written up.

We moved the research eval off ML-paper retrieval — where a single web search already returns the answer, so the metric could only measure collection and the prior topology A/B came back a structural null — onto investment research, where the material facts are buried in 10-K footnotes and genuinely require investigation to surface. That is the domain where a smarter coordinator finally has room to beat blind collection, so the A/B is well-posed.

The three parts of the result

1. The held-out set — tests/eval/investment-thesis-set.ts: 5 public companies, each cutoff >= 18 months ago, 27 material facts across 8 analyst lenses, every fact read live out of the primary SEC EDGAR 10-K, every value knowable at the cutoff (the eventual collapse is recorded for the reader only and is never graded). Grading is a $0, model-free substring check; the checklist is firewalled.

2. Calibration — does the metric discriminate depth? (pnpm test investment-calibration, $0, offline, the binding gate): a deliberately shallow ticker summary scores 1/27 (4%), a deep filings-grounded thesis scores 27/27 (100%) — a +96-point gap, every company clearing the bars (shallow < 30%, deep > 70%). An anti-circularity guard asserts no deep thesis verbatim-embeds the checklist evidence, and calibration itself caught + closed three over-loose Silvergate grader groups before any A/B depended on them. This is the gate the ML deep-question exam passed only in spirit — same discrimination, materially harder domain.

3. The live 3-arm A/B (5 companies × 3 arms × 3 rounds, glm-5.2, $0.36, matched compute):

| arm | material facts | cost (5 cos.) |
| --- | --- | --- |
| collection (blind) | 11/27 (41%) | $0.082 |
| verify/dedup | 10/27 (37%) | $0.125 |
| driving (deepen) | 16/27 (59%) | $0.157 |

Honest verdict: no topology significantly beats collection. Driving surfaces the most buried facts (16 vs 11, i.e. +5 facts or +18pp over collection) for ~1.9× the cost, but at n=5 the paired-bootstrap CI crosses zero (P(Δ≤0)=0.08): promising, under-powered, and consistent with the project's prior topology nulls. Verify does not beat collection, and on two companies its strict primary-source gate zeroes the KB (autopsied: it correctly rejects the aggregator pages the worker fetched in place of the EDGAR filing). Every cost figure is priced from per-call usage reported by RouterClient.usage().
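
For readers who want to re-derive the headline deltas, here is a minimal sketch of the arithmetic. The figures are transcribed from the table in this description; the type and variable names (`Arm`, `pct`, `byName`) are illustrative and not part of the repo.

```typescript
// Figures copied from the 3-arm table; all identifiers here are illustrative.
type Arm = { arm: string; surfaced: number; costUsd: number };

const TOTAL_FACTS = 27;
const arms: Arm[] = [
  { arm: "collection", surfaced: 11, costUsd: 0.082 },
  { arm: "verify", surfaced: 10, costUsd: 0.125 },
  { arm: "driving", surfaced: 16, costUsd: 0.157 },
];

// Percentage of the 27-fact checklist, rounded to a whole percent.
const pct = (surfaced: number): number =>
  Math.round((surfaced / TOTAL_FACTS) * 100);

function byName(armName: string): Arm {
  const found = arms.find((a) => a.arm === armName);
  if (!found) throw new Error(`unknown arm: ${armName}`);
  return found;
}

const collection = byName("collection");
const driving = byName("driving");

const extraFacts = driving.surfaced - collection.surfaced; // 16 - 11 = 5
const gapPp = pct(driving.surfaced) - pct(collection.surfaced); // 59 - 41 = 18
const costRatio = driving.costUsd / collection.costUsd; // about 1.9
const totalCostUsd = arms.reduce((sum, a) => sum + a.costUsd, 0); // about 0.364
```

The three arm costs sum to roughly $0.364, which matches the $0.36 total quoted for the run.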

This PR's doc change

Unifies the writeup into one coherent section — docs/results/investment-thesis.md (§1 reframe, §2 calibration, §3-§5 the A/B verdict + autopsy) — folding in the standalone calibration doc so there is one result, one doc, matching the repo's docs/results/*.md convention. References it from the main paper (docs/two-agent-research-ab.md §9.1 + the per-result link list).

Provenance + verification

Calibration numbers reproducible offline ($0): pnpm test investment-calibration — green.
A/B numbers transcribed verbatim from the live run recorded in commit 338bc54.
Per-fact provenance, curation-bias disclosure, and the drop log: docs/eval/investment-material-facts.md.
Gates on the branch: pnpm typecheck clean, pnpm build success, pnpm lint exit 0 (2 pre-existing warnings in unrelated src/lint.ts/src/wikilinks.ts), pnpm test 184 passed / 12 live-gated skipped.

Limitations

Does not prove driving significantly beats collection (n=5, CI crosses zero). The 27-fact checklist carries a documented curation bias (downside-risk, distressed-name skew). Next rung: expand the held-out set past 5 companies and re-run at n>=24 so a +1-fact/company effect can clear a paired bootstrap; driving is the arm worth funding that test on, verify is not.
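
To make the significance language concrete, below is a minimal sketch of a paired bootstrap over per-company differences (driving minus collection). It is an assumption about the general technique, not the repo's implementation: the function names, the seeded PRNG, the resample count, and the percentile CI are all choices made for this illustration, and the example deltas are made up.

```typescript
// Small seeded PRNG (mulberry32-style) so repeated runs give the same answer.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Paired bootstrap: resample the per-company deltas with replacement and
// look at the distribution of the resampled mean.
function pairedBootstrap(
  deltas: number[],
  resamples = 10000,
  seed = 1,
): { pLeZero: number; ci95: [number, number] } {
  if (deltas.length === 0) throw new Error("need at least one paired difference");
  const rand = mulberry32(seed);
  const n = deltas.length;
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += deltas[Math.floor(rand() * n)];
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  // Share of resampled means at or below zero, i.e. an estimate of P(delta <= 0).
  const pLeZero = means.filter((m) => m <= 0).length / resamples;
  // Percentile 95% interval on the mean delta.
  const lo = means[Math.floor(0.025 * (resamples - 1))];
  const hi = means[Math.floor(0.975 * (resamples - 1))];
  return { pLeZero, ci95: [lo, hi] };
}

// Made-up deltas for illustration only; NOT the recorded per-company results.
const hypotheticalDeltas = [2, -1, 3, 0, 1];
const bootstrapResult = pairedBootstrap(hypotheticalDeltas, 10000, 42);
```

With only five pairs, a single company where driving trails collection is enough to pull a visible share of resampled means to zero or below, which is why a larger held-out set is the natural next step.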

…terial facts)

Add tests/eval/investment-thesis-set.ts: a firewalled eval set for the research loop, mirroring tests/loops/held-out-exam.ts. Give a loop a company + ticker + an as-of cutoff and grade the thesis it writes BLIND against held-out material facts it never saw, so a high score is thesis quality (surfacing buried, non-obvious drivers a one-shot ticker search misses), not teaching-to-the-test. Grading is a $0, model-free substring check, so the answer key never reaches a model the loop could observe.

Five public companies, cutoff >= 18 months ago (outcome known but never a checklist item): SVB Financial (SIVB), Bed Bath & Beyond (BBBY), Carvana (CVNA), Peloton (PTON), Silvergate (SI). 27 facts across 8 analyst lenses (concentration, leverage, margin-trend, liquidity, capital-return, governance, off-balance-sheet, regulatory).

Every fact is grounded in the company's primary SEC EDGAR 10-K filed on or before the cutoff, with the source URL (CIK-verified) and the literal value read from the filing, e.g.:

- SIVB: ~$15.1B held-to-maturity unrealized loss ($91,321M amortized vs $76,169M fair value) sitting only in the footnotes against $16,004M total equity
- BBBY: $574.9M FY2021 buyback in a year it lost $560M and generated $18M of operating cash
- CVNA: interest expense up 2.8x to $486M plus Garcia-family related-party leases
- PTON: negative (11)% hardware gross margin
- SI: 99.5% noninterest-bearing deposits, ~58% from crypto exchanges

docs/eval/investment-material-facts.md is the answer key + provenance ledger: per-item filing URL and value, an honest curation-bias disclosure (downside/outcome-selection skew, the lens distribution, single-source provenance), and a drop log of candidate items that could not be independently sourced at the cutoff (incl. First Republic, dropped because its 10-K is FDIC-filed not on EDGAR), dropped rather than guessed.

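
As a rough illustration of the two structural checks named here (CIK-verified source URLs and the 18-month cutoff floor), the sketch below shows one way they could be written. It assumes source URLs follow the EDGAR archive layout `/Archives/edgar/data/<CIK>/...` with the CIK written without leading zeros; the function names and the placeholder CIK are hypothetical and not taken from the repo.

```typescript
// Extract the CIK from an EDGAR archive URL, or null if the URL is not one.
// Assumes the path shape /Archives/edgar/data/<CIK>/... (no leading zeros).
function cikFromEdgarUrl(url: string): string | null {
  const match = /\/Archives\/edgar\/data\/(\d+)\//.exec(url);
  return match ? String(Number(match[1])) : null;
}

// True when the URL's CIK equals the company's CIK (leading zeros ignored).
function cikMatches(companyCik: string, sourceUrl: string): boolean {
  const fromUrl = cikFromEdgarUrl(sourceUrl);
  return fromUrl !== null && fromUrl === String(Number(companyCik));
}

// Whole calendar months elapsed between two dates, computed in UTC.
function wholeMonthsBetween(from: Date, to: Date): number {
  const months =
    (to.getUTCFullYear() - from.getUTCFullYear()) * 12 +
    (to.getUTCMonth() - from.getUTCMonth());
  return to.getUTCDate() < from.getUTCDate() ? months - 1 : months;
}

// True when the cutoff lies at least `floorMonths` whole months before `now`.
function meetsCutoffFloor(cutoffIso: string, now: Date, floorMonths = 18): boolean {
  return wholeMonthsBetween(new Date(cutoffIso), now) >= floorMonths;
}

// Placeholder CIK and accession number, for illustration only.
const placeholderUrl =
  "https://www.sec.gov/Archives/edgar/data/1234567/000123456723000001/form10k.htm";
const placeholderOk = cikMatches("0001234567", placeholderUrl);
```

A check like this also rejects aggregator pages outright, since they carry no EDGAR archive path to extract a CIK from.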
Offline tests (10) assert structure, provenance integrity (sourceUrl CIK matches the company), the >= 18-month cutoff floor, and that the deterministic grader surfaces a fact from its evidence, misses on filler, and scores a surface-only thesis below the held-out bar.
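
The grader behavior these tests describe can be sketched as a plain substring match. The shape below is an assumption for illustration: the `MaterialFact` type, the evidence layout (every group must be hit, and a group is hit when any of its alternatives appears), and the function names are hypothetical rather than the repo's actual types. The example fact and theses are invented.

```typescript
// Hypothetical shape: a fact is surfaced when EVERY evidence group is hit,
// and a group is hit when ANY of its alternatives occurs as a substring.
interface MaterialFact {
  id: string;
  lens: string;
  evidence: string[][];
}

// Case-insensitive, whitespace-collapsed comparison keeps the check model-free.
const normalize = (s: string): string =>
  s.toLowerCase().replace(/\s+/g, " ").trim();

function factSurfaced(fact: MaterialFact, text: string): boolean {
  const haystack = normalize(text);
  return (
    fact.evidence.length > 0 &&
    fact.evidence.every((group) =>
      group.some((alt) => haystack.includes(normalize(alt))),
    )
  );
}

function gradeThesis(
  checklist: MaterialFact[],
  thesis: string,
): { surfaced: string[]; total: number; score: number } {
  const surfaced = checklist
    .filter((fact) => factSurfaced(fact, thesis))
    .map((fact) => fact.id);
  const total = checklist.length;
  return { surfaced, total, score: total === 0 ? 0 : surfaced.length / total };
}

// Invented checklist item and theses, for illustration only.
const exampleChecklist: MaterialFact[] = [
  {
    id: "example-htm-loss",
    lens: "off-balance-sheet",
    evidence: [["held-to-maturity", "held to maturity"], ["unrealized loss"]],
  },
];
const deepThesis =
  "The footnotes disclose a large unrealized loss on the held-to-maturity portfolio.";
const fillerThesis = "The company is a well-known bank with a strong brand.";

const deepGrade = gradeThesis(exampleChecklist, deepThesis);
const fillerGrade = gradeThesis(exampleChecklist, fillerThesis);
```

Because the check is deterministic string matching, the answer key never has to be shown to a model, which is the point of the firewall.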

tangletools

✅ Auto-approved PR — `2bfc0f72`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T13:11:27Z}

…librated as a gate

Build the held-out investment-research loop into a runnable task and a KB-reading metric, then CALIBRATE the metric as a gate before any A/B.

- materialFactsSurfaced(kb, checklist): reads a KB (root dir or index) and grades its pages + source text against a company's held-out material-fact checklist with the existing $0, model-free substring grader. Pure core (materialFactsSurfacedInText) so calibration scores a thesis string directly.
- runInvestmentThesisTask({company,ticker,cik,cutoff}): drives the existing two-agent research loop (web worker + driver) over web + SEC EDGAR with analyst-lens readiness specs, then synthesizes a thesis page into the KB. Builds nothing new for the loop: composes worker/driver/loop + a synthesis pass.
- Move the checklist + grader from tests/ into src/ (shippable eval primitive) so the metric can import it without crossing the src rootDir boundary; tests/ keeps a thin re-export shim. Export the metric + task + checklist from the index.

CALIBRATION GATE (offline, $0; the binding result): shallow ticker-summary theses surface 1/27 (4%), deep filings-grounded theses surface 27/27 (100%), a +96pt gap; every company clears shallow<30% / deep>70%. The metric discriminates research depth, not collection, so the A/B may proceed. An anti-circularity test asserts the deep theses do not verbatim-embed the checklist evidence. Calibration caught + fixed three over-loose SI grader groups (fired on bare crypto / grew rapidly / proprietary) that let a surface summary score 60%; tightened to require the buried signal (concentration / $14.3B / SEN).

Live path cost-gated: glm-5.2 smoke OK; web search returns the real CVNA 10-K on sec.gov; CVNA live pilot surfaced 2/5 facts for $0.027 from the fetched filing.

Docs: docs/results/investment-calibration.md. lint/typecheck/build clean, 184 offline tests pass.
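
The gate logic in this commit message can be sketched as follows. The thresholds (shallow < 30%, deep > 70%) and the scores (1/27, 27/27, and the 60% surface-summary score that exposed the loose groups) come from the message above; the function names, the sample sentence, and the exact way a group is tightened are assumptions for illustration only.

```typescript
// Thresholds quoted in the commit message.
const SHALLOW_MAX = 0.3;
const DEEP_MIN = 0.7;

// The gate passes only if the shallow thesis stays low AND the deep one is high.
function passesCalibration(shallowScore: number, deepScore: number): boolean {
  return shallowScore < SHALLOW_MAX && deepScore > DEEP_MIN;
}

const shallowScore = 1 / 27; // about 4%
const deepScore = 27 / 27; // 100%
const gapPoints = Math.round(deepScore * 100) - Math.round(shallowScore * 100); // 96
const gateOk = passesCalibration(shallowScore, deepScore);

// Before tightening, a surface summary scored 60% on one company: the gate fails.
const gateBeforeTightening = passesCalibration(0.6, 1.0);

// Illustrative loose vs tightened group: requiring the buried signal as well
// stops a bare keyword from firing. The term lists here are hypothetical.
const hitsAll = (text: string, terms: string[]): boolean =>
  terms.every((term) => text.toLowerCase().includes(term.toLowerCase()));

const surfaceSummary = "The bank serves crypto clients and grew rapidly."; // invented text
const looseGroupFires = hitsAll(surfaceSummary, ["crypto"]);
const tightGroupFires = hitsAll(surfaceSummary, ["crypto", "concentration"]);
```

Treating calibration as a hard gate means a grader that is too easy to satisfy is caught offline, before any paid A/B run depends on it.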

tangletools

✅ Auto-approved PR — `d3cbcb47`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T13:37:02Z}

Add the single-agent collection baseline (createCollectionResearchDriver) and extend the investment-thesis A/B to run all three topologies — collection / verify / driving — over the five held-out companies at matched compute, scoring each KB with the firewalled materialFactsSurfaced metric and pricing each arm from RouterClient.usage(). Live result (5 companies x 3 arms x 3 rounds, glm-5.2, $0.36): driving surfaces the most buried facts (16/27, 59%) vs blind collection (11/27, 41%) for ~1.9x the cost, but at n=5 the paired-bootstrap CI crosses zero (P(delta<=0)=0.08) — promising, under-powered, not a clean win. Verify does not beat collection (10/27) and on two companies its strict primary-source gate zeroes the KB (autopsied: it correctly rejects aggregators the worker fetched in place of the EDGAR filing). Full writeup + per-company matrix + significance in docs/results/investment-thesis.md.

tangletools

✅ Auto-approved PR — `338bc547`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T14:35:12Z}

…ibration + 3-arm A/B in one section

Rewrite docs/results/investment-thesis.md as one coherent paper section covering all three parts of the result, and fold the standalone calibration doc into it (the two files said the same thing; one result, one doc, matching the repo's docs/results/*.md convention):

- §1 the domain reframe: we left ML-paper retrieval (where a single search already collects the answer, so the metric could only measure collection and the §9 topology A/B came back a structural null) for investment research, where the material facts are buried in 10-K footnotes and a single fetch provably cannot surface them, so a topology A/B is finally well-posed.
- §2 the calibration (the gate the ML exam passed only weakly): the $0 model-free metric discriminates depth on the harder domain (shallow ticker summary 1/27 (4%), deep filings-grounded thesis 27/27 (100%), +96pp), with the anti-circularity guard and the autopsied Silvergate grader-tightening kept inline. Reproducible offline.
- §3-§5 the live 3-arm verdict, unchanged numbers: driving surfaces the most buried facts (16/27, 59%) vs blind collection (11/27, 41%) for ~1.9x the cost, but at n=5 the paired-bootstrap CI crosses zero (P(delta<=0)=0.08); verify does not beat collection (10/27) and on two companies its strict primary-only gate zeroes the KB. Honest: no topology significantly beats collection; driving only leans positive.

Reference the unified result from the main paper (docs/two-agent-research-ab.md): a new §9.1 connecting the ML null ("the domain was too easy") to the reframe, plus the per-result link list at the tail.

A/B numbers transcribed verbatim from commit 338bc54's recorded live run; calibration numbers verified green offline (pnpm test investment-calibration). Docs-only change.

tangletools

✅ Auto-approved PR — `64a9a706`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T14:41:43Z}

tangletools previously approved these changes Jun 25, 2026

drewstone dismissed tangletools’s stale review via d3cbcb4 June 25, 2026 13:36

tangletools previously approved these changes Jun 25, 2026

drewstone dismissed tangletools’s stale review via 338bc54 June 25, 2026 14:35

tangletools previously approved these changes Jun 25, 2026

drewstone dismissed tangletools’s stale review via 64a9a70 June 25, 2026 14:41

tangletools approved these changes Jun 25, 2026

drewstone changed the title ~~test(eval): held-out investment-research eval set (5 companies, 27 material facts)~~ feat(research): investment-thesis research eval — material-facts-surfaced, a metric that needs investigation not retrieval Jun 25, 2026

drewstone merged commit 5e9adb0 into main Jun 25, 2026
1 check passed

drewstone merged 4 commits into main from feat/investment-thesis-eval
