
Commit 3bc37da

feat: introducing 3 episodes per benchmark (#5)

* removing complex benchmark output
* aggregate the 3 runs in parallel
* use opencode dev version
* exclude opencode from the bundle
* more information for the error
* fix discord-sample.ts file

1 parent 4e479b1, commit 3bc37da

9 files changed: 592 additions & 683 deletions

.github/workflows/publish-benchmark.yml

Lines changed: 2 additions & 2 deletions

```diff
@@ -45,7 +45,7 @@ jobs:
       - id: matrix
         name: Build benchmark matrix
         run: |
-          bun add -g opencode-ai @openai/codex-sdk
+          bun add -g opencode-ai@dev @openai/codex-sdk
           set -euo pipefail
           MATRIX_JSON="$(bun run scripts/generate-benchmark-matrix.ts)"
           printf 'matrix=%s\n' "${MATRIX_JSON}" >> "$GITHUB_OUTPUT"
@@ -76,7 +76,7 @@ jobs:
         run: bun install --frozen-lockfile
 
       - name: Install OpenCode CLI
-        run: bun add -g opencode-ai @openai/codex-sdk
+        run: bun add -g opencode-ai@dev @openai/codex-sdk
 
       - name: Determine benchmark job URL
         id: job_url
```

README.md

Lines changed: 4 additions & 5 deletions

````diff
@@ -3,12 +3,11 @@
 A benchmarking framework for evaluating opencode's AI coding agents across real-world GitHub repositories. The framework runs agents against target repositories and scores their outputs using multiple LLM judges, measuring code quality across dimensions like readability, functionality, adherence to best practices, and efficiency.
 
 ```bash
-orvl opencode  # run opencode on all models x evals x scores
-orvl opencode --model opencode/qwen3-coder  # filter by model across all evals x scores
-orvl opencode --eval noworneverev/graphrag-visualizer  # filter by eval across models x scores
+orvl opencode --model opencode/gpt-5-codex --eval noworneverev/graphrag-visualizer
+orvl opencode --model opencode/claude-sonnet-4-5 --eval prismicio-community/course-fizzi-next --output results.json
 ```
 
-Filters use CLI options like `--model`, `--eval`, and `--score`.
+Both `--model` and `--eval` are required; the CLI now runs a single agent/model/eval pairing at a time. Each invocation executes three isolated `[episode X/3]` runs (fresh clones) and aggregates the judge scores before exporting results.
 
 ## Setup
 ```bash
@@ -19,7 +18,7 @@ bun run build
 During development the CLI can be executed directly with Bun:
 
 ```bash
-bun run dev -- <agent> [--model <model>] [--eval <owner/name>] [--score <score>]
+bun run dev -- <agent> --model <model> --eval <owner/name>
 ```
 
 ## Continuous Releases
````
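The three-episode behavior described in the README change can be sketched roughly as follows. This is an illustrative assumption, not the framework's actual API: `runEpisode`, `EpisodeResult`, and the averaging step are hypothetical names, and the real episode body (fresh clone, agent run, LLM judging) is stubbed out with a placeholder score.

```typescript
// Hypothetical sketch of a single agent/model/eval invocation running
// three isolated episodes in parallel and aggregating their scores.
const EPISODES = 3;

interface EpisodeResult {
  score: number; // aggregate judge score for one episode
}

async function runEpisode(episode: number): Promise<EpisodeResult> {
  // The real framework clones the target repo fresh, runs the agent,
  // and collects scores from the LLM judges; here we just log the
  // episode marker and return a placeholder score.
  console.log(`[episode ${episode}/${EPISODES}]`);
  return { score: episode };
}

async function runBenchmark(): Promise<number> {
  // The commit message notes the 3 runs are aggregated in parallel.
  const results = await Promise.all(
    Array.from({ length: EPISODES }, (_, i) => runEpisode(i + 1)),
  );
  // Aggregate by averaging the per-episode scores.
  return results.reduce((sum, r) => sum + r.score, 0) / EPISODES;
}
```

With the placeholder scores 1, 2, and 3 the aggregate is simply their mean; the real framework would substitute actual judge scores per episode.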

agents/opencode.ts

Lines changed: 4 additions & 0 deletions

```diff
@@ -110,8 +110,12 @@ function serializeError(error: unknown): Record<string, unknown> {
       name: error.name,
       message: error.message,
       stack: error.stack,
+      cause: error.cause ? serializeError(error.cause) : undefined,
     };
   }
+  if (typeof error === "object" && error !== null) {
+    return { ...error };
+  }
   return { value: String(error) };
 }
 
```

0 commit comments
