Merged
10 changes: 10 additions & 0 deletions CONCEPTS.md
@@ -10,6 +10,16 @@ Shared domain vocabulary for this project — entities, named processes, and sta

**Provider runtime boundary** — the process boundary between AgentV's evaluation orchestrator and the agent runtime a provider invokes. CLI-backed providers place the agent runtime outside the orchestrator; in-process SDK providers share the orchestrator process and need either a targeted transport fix or subprocess-style isolation when runtime teardown can threaten run artifact finalization.

## Evaluation Model

**Eval** — The frozen task and grading definition: prompts, datasets, input files, fixtures, assertions, and judge criteria. An eval defines what is being tested, not which agent, model, setup variant, or run policy executes it.

**Experiment** — A committed run variant that selects how evals are executed: target or target matrix, setup, scripts, eval filters, repeat counts, timeouts, workers, budgets, and related run knobs. Experiments make A/B setup differences explicit while pointing at stable eval tasks.

**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `artifact_dir`, `task_dir`, `summary_path`, and `grading_path`.

**Artifact sidecar** — A file beside or below a test-case artifact directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.

## Evaluation Reliability

**Repeat run** — A configured request to execute the same eval case and target more than once in the same timestamped run bundle. Repeat runs measure stochastic reliability, verifier stability, and drift; they are not the default CI path.
@@ -0,0 +1,135 @@
---
title: "Separate eval tasks from experiment runtime"
date: 2026-06-24
category: architecture-patterns
module: evaluation model
problem_type: architecture_pattern
component: tooling
severity: medium
applies_when:
- Designing eval, experiment, or artifact contracts for AgentV
- Deciding whether setup, target selection, repeat counts, or scripts belong in eval YAML
- Aligning AgentV with external eval conventions without copying their whole product model
tags:
- experiments
- evals
- artifacts
- agent-eval
- dashboard
- repeat-runs
---

# Separate eval tasks from experiment runtime

## Context

AgentV originally treated an experiment as a string label on a run while `eval.yaml` carried both the task definition and runtime setup. That made simple runs easy, but it blurred the boundary between what is being tested and how it is being tested. It also made A/B tests awkward: setup differences such as adding skill files, installing dependencies, or changing run counts had to be pushed into eval YAML or hidden behind environment variables.

The experiment-separation work aligned AgentV with the useful part of Vercel `agent-eval`: an eval is the frozen task and assertion contract, while an experiment is the committed runtime variant that chooses targets, setup, scripts, repeat behavior, and filters. AgentV kept its own YAML-first authoring, target model, LLM graders, and dashboard artifact contracts instead of taking a hard runtime dependency on Vercel's package.

## Guidance

Keep eval definitions focused on task evidence:

- prompts, datasets, and input files
- assertions and LLM-grader criteria
- task fixtures that represent the work being evaluated

Put runtime variation in experiments:

- target or target matrix selection
- model and provider selection through existing AgentV targets
- setup steps such as installing dependencies or injecting skill files
- post-agent scripts
- timeout, workers, budgets, repeat counts, and early-exit behavior
- eval/test filters for a suite or A/B variant

This keeps A/B experiments honest. A baseline and a "with skill" variant should point at the same eval task and differ only in experiment setup. If the task itself changes, the result is not an A/B comparison.
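
The rule above can be made checkable by comparing two experiment definitions. This is a minimal sketch, assuming each experiment has already been parsed from YAML into a plain dict with an `evals` list shaped like the examples in this document. The helper names `selects_same_evals` and `differing_keys` and the `copilot-baseline` experiment are hypothetical, not part of AgentV. The check compares eval filters only; it cannot tell whether the eval files themselves changed.

```python
from typing import Any


# Hypothetical helpers for illustration; AgentV does not ship these.
def selects_same_evals(baseline: dict[str, Any], variant: dict[str, Any]) -> bool:
    """True when both experiments list the same eval filters, in any order."""
    return sorted(baseline.get("evals", [])) == sorted(variant.get("evals", []))


def differing_keys(baseline: dict[str, Any], variant: dict[str, Any]) -> set[str]:
    """Return the top-level experiment keys whose values differ."""
    keys = baseline.keys() | variant.keys()
    return {key for key in keys if baseline.get(key) != variant.get(key)}


# Assumed shapes: a baseline and a "with skill" variant of the same suite.
baseline = {"name": "copilot-baseline", "target": "copilot", "evals": ["bug-fix-*"]}
with_skill = {
    "name": "copilot-with-skill",
    "target": "copilot",
    "evals": ["bug-fix-*"],
    "setup": [{"script": "cp skills/repo-debugging/AGENTS.md ./"}],
}

print(selects_same_evals(baseline, with_skill))
print(sorted(differing_keys(baseline, with_skill)))
```

A reviewer would expect the differing keys to be limited to the experiment name and runtime knobs such as `setup`; a difference in `evals` means the two variants are no longer comparing the same task.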

Use external conventions as a lowest-common-denominator contract, not as a product takeover. The Vercel structure is useful for naming and layout:

```text
eval = what is tested
experiment = how it is run
run-N = one attempt inside a repeated case
```

AgentV should still preserve repo-native constraints that make it useful:

- wire formats stay `snake_case`
- YAML remains the canonical authoring path
- existing target definitions are reused instead of introducing a parallel provider schema
- dashboard and CI discovery stay anchored on root run manifests
- LLM-judge assertions remain part of evals, not experiments

## Why This Matters

The split prevents configuration drift from becoming hidden test drift. When setup lives in an experiment, reviewers can see that two variants are testing the same task. When setup lives inside eval YAML, changing the setup can silently change the meaning of the eval suite.

It also reduces future migration cost. A run can support Vercel-style experiment files, AgentV YAML experiments, repeat attempts, and dashboard browsing without forcing every consumer to understand every nested artifact. Root manifests remain the loading contract; nested files are evidence.

## When to Apply

- Adding a new run-level knob such as repeat count, timeout, workers, budget, sandbox, setup, or post-run scripts.
- Designing an example that compares one agent/model/setup against another.
- Moving a field out of eval YAML and deciding where backward compatibility should live.
- Changing artifact layout for repeat runs or dashboard browsing.
- Mapping an external eval convention into AgentV.

## Examples

**Prefer experiment setup for A/B variants:**

```yaml
name: copilot-with-skill
target: copilot
evals:
  - bug-fix-*
setup:
  - script: cp skills/repo-debugging/AGENTS.md ./
repeat:
  count: 4
  strategy: pass_at_k
  early_exit: false
```
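
To illustrate how the `repeat` knobs could interact, here is a rough sketch of one plausible reading of `pass_at_k`: the case passes if any attempt passes, and `early_exit` stops after the first passing attempt. That reading is an assumption, not confirmed AgentV behavior, and `run_with_repeats` and the `attempt` callback are hypothetical names.

```python
from typing import Callable


# Hypothetical sketch; AgentV's actual pass_at_k aggregation may differ.
def run_with_repeats(
    attempt: Callable[[int], bool], count: int, early_exit: bool
) -> tuple[bool, list[bool]]:
    """Run up to `count` attempts (run-1, run-2, ...) and aggregate them.

    Returns (case_passed, per_attempt_outcomes). The case passes when at
    least one attempt passes. With early_exit, stop after the first pass.
    """
    outcomes: list[bool] = []
    for n in range(1, count + 1):
        outcomes.append(attempt(n))
        if early_exit and outcomes[-1]:
            break
    return any(outcomes), outcomes


def passes_on_third(n: int) -> bool:
    """Stand-in for a real agent run: only the third attempt passes."""
    return n == 3


print(run_with_repeats(passes_on_third, count=4, early_exit=False))
print(run_with_repeats(passes_on_third, count=4, early_exit=True))
```

With `early_exit: false`, all four attempts run, so the per-attempt outcomes can also be used to estimate reliability rather than only the pass/fail verdict.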

**Keep the eval task independent of the runtime variant:**

```yaml
name: bug-fix-suite
tests:
  - id: bug-fix-001
    input_files:
      - PROMPT.md
    assertions:
      - type: llm-grader
        target: grader
```

**Use root manifests for discovery and nested files for evidence:**

```text
.agentv/results/<experiment>/<timestamp>/
  index.jsonl
  benchmark.json
  timing.json
  <suite>/<case>/
    task/PROMPT.md
    summary.json
    grading.json
    run-1/
      result.json
      grading.json
      transcript.json
      transcript-raw.jsonl
      outputs/answer.md
```

The dashboard should discover runs from root manifests and learn case locations from `index.jsonl` fields such as `artifact_dir`, `task_dir`, `summary_path`, and `grading_path`. It should not depend on optional per-attempt sidecars for discovery.
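
A consumer following this contract might look like the sketch below. It assumes `index.jsonl` holds one JSON object per line and that the location fields are paths relative to the run bundle root; both points are assumptions about the format, not a documented guarantee. The function names and the sample row values are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterator

# Field names taken from the manifest description above.
LOCATION_FIELDS = ("artifact_dir", "task_dir", "summary_path", "grading_path")


def iter_manifest_rows(run_dir: Path) -> Iterator[dict[str, Any]]:
    """Yield one dict per non-empty line of the run's root index.jsonl."""
    with (run_dir / "index.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def resolve_case_paths(run_dir: Path, row: dict[str, Any]) -> dict[str, Path]:
    """Resolve the location fields present in a row against the run dir.

    Assumption: the fields are relative to the run bundle root.
    """
    return {name: run_dir / row[name] for name in LOCATION_FIELDS if row.get(name)}


# Demo against a throwaway run bundle with one illustrative row.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    row = {
        "artifact_dir": "bug-fix-suite/bug-fix-001",
        "summary_path": "bug-fix-suite/bug-fix-001/summary.json",
    }
    (run_dir / "index.jsonl").write_text(json.dumps(row) + "\n", encoding="utf-8")
    rows = list(iter_manifest_rows(run_dir))
    paths = resolve_case_paths(run_dir, rows[0])
    print(sorted(paths))
```

The loader never walks the directory tree, so adding or removing per-attempt sidecars under `run-N/` does not change what it discovers.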

## Related

- `docs/adr/2026-06-23-experiments-vs-eval-separation.md` - architecture decision for the split
- `docs/plans/2026-06-23-002-experiments-separation-plan.md` - phased implementation plan
- `docs/plans/2026-06-23-001-feat-repeat-runs-flaky-evals-plan.md` - repeat-run placement reconciled to experiments
- `docs/solutions/best-practices/prefer-isolated-runtime-boundaries-for-agent-sdk-providers.md` - adjacent guidance on keeping provider runtime instability outside artifact finalization