Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
252 changes: 252 additions & 0 deletions docs/eval/investment-material-facts.md

Large diffs are not rendered by default.

306 changes: 306 additions & 0 deletions docs/results/investment-thesis.md

Large diffs are not rendered by default.

29 changes: 28 additions & 1 deletion docs/two-agent-research-ab.md
Original file line number Diff line number Diff line change
Expand Up @@ -449,6 +449,32 @@ both earn a narrow, cost-stratified one — the verifier on misattribution and t
off-scope tail (§5), the driver only where a richer worker makes "go corroborate this"
reach a page collection can't.

### 9.1 The domain was too easy — re-running the A/B where one search is not enough

The §9 null has a structural cause, not a measurement one: **on an ML topic a single
good search already collects the answer.** Every arm finished in one effective round
because the first fetch met the readiness gate, so the driving driver — whose mechanism
is steering a *second* round — never acted. When one search suffices, there is no
investigation for a smarter coordinator to do, and the metric can only reward
collection. To ask whether topology *can ever* beat blind collection, you have to move
to a domain where the answer is buried and a single fetch provably cannot surface it.

So we did. We ported the whole apparatus — firewalled checklist, `$0` model-free
grader, matched-compute 3-arm A/B — onto **investment research**: give a loop a company
+ ticker + an as-of cutoff and grade the thesis on the buried, material 10-K-footnote
facts a ticker search misses (an HTM mark the size of a bank's equity, a 97%-uninsured
deposit base, a negative per-unit margin). First we *calibrated* the new metric and
proved it discriminates depth on this harder domain — a shallow ticker summary scores
**1/27 (4%)**, a deep filings-grounded thesis **27/27 (100%)**, a **+96-point** gap.
Then the live 3-arm A/B over 5 held-out companies: **driving surfaced the most buried
facts (16/27, 59%) vs blind collection (11/27, 41%)** for ~1.9× the cost — the lift is
real and points the right way, but at n=5 the paired-bootstrap CI still crosses zero
(P(Δ≤0)=0.08), and verify did not beat collection (10/27). So the verdict survives the
domain change: **no topology *significantly* beats collection — but on a domain where
the answer must be investigated for, driving is the only arm that even leans positive,
and it does so suggestively, not significantly.** Full reframe, calibration, and
per-company A/B: [`docs/results/investment-thesis.md`](results/investment-thesis.md).

## 10. Reproduce

The loop, the worker, the verifier, the claim-grounding mode, the adaptive driver, the
Expand Down Expand Up @@ -509,6 +535,7 @@ the A/B harnesses — [`tests/loops/`](../tests/loops/).
Per-result detail: [`docs/results/cost-quality.md`](results/cost-quality.md),
[`docs/results/claim-grounding.md`](results/claim-grounding.md),
[`docs/results/adaptive.md`](results/adaptive.md),
[`docs/results/research-driving.md`](results/research-driving.md).
[`docs/results/research-driving.md`](results/research-driving.md),
[`docs/results/investment-thesis.md`](results/investment-thesis.md) (§9.1 — the domain reframe + calibration + 3-arm A/B).
</content>
</invoke>
46 changes: 46 additions & 0 deletions src/collection-research-driver.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
/**
* The SINGLE-AGENT COLLECTION driver — the blind-collection baseline (Arm A).
*
* This is the honest null the depth A/B is measured against. The other drivers
* spend extra inference to do something differentiated:
* - `createVerifyingResearchDriver` runs an LLM gate per source (Arm B),
* - `createResearchDrivingDriver` extracts claims, tracks corroboration, and
* synthesizes deep follow-up questions to drive depth (Arm C).
*
* This driver does NONE of that. It is a pass-through: it accepts every source
* the worker proposes and contributes no research, no gating, and no steering of
* its own. The loop still dedups exact-uri duplicates before calling
* `verifySource` (that is the loop's job, not the driver's), and the default
* `foldGaps` (a plain bulleted list of the still-open readiness gaps) still folds
* the gaps into the worker's next prompt — so the worker keeps researching, but
* NOTHING intelligent sits between the worker and the knowledge base.
*
* In other words: ONE agent (the worker) collects sources round after round, and
* the "driver" is an inert rubber stamp. That is exactly what "single-agent
* collection" means — the topology with zero coordinator intelligence — so its
* material-facts score is the floor every other arm must beat to justify its
* extra inference cost.
*
* It adds NO router calls of its own: `verifySource` is a synchronous accept and
* `foldGaps` is omitted so the loop uses its built-in gap list. So Arm A's cost
* is the worker's cost alone — the cleanest possible blind-collection baseline.
*/

import type {
ResearchDriver,
ResearchSourceProposal,
SourceVerdict,
} from './two-agent-research-loop'

/**
* Build the single-agent collection driver. Accepts every source; never gates,
* never researches, never steers beyond the loop's default open-gap list. The
* worker is the only agent that thinks.
*/
export function createCollectionResearchDriver(): ResearchDriver {
return {
verifySource(_source: ResearchSourceProposal): SourceVerdict {
return { accept: true }
},
}
}
4 changes: 4 additions & 0 deletions src/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ export * from './adaptive-driver'
export * from './changes'
export * from './chunking'
export * from './claim-grounding'
export * from './collection-research-driver'
export * from './discovery'
export * from './eval-readiness'
export * from './events'
Expand All @@ -12,8 +13,11 @@ export * from './graph'
export * from './ids'
export * from './indexer'
export * from './inspect'
export * from './investment-thesis-set'
export * from './investment-thesis-task'
export * from './kb-store'
export * from './lint'
export * from './material-facts-metric'
export * from './memory/index'
export * from './proposals'
export * from './propose-from-finding'
Expand Down
Loading
Loading