Live Benchmark

Same agent. Same task.
Score: 0.00

Most AI agent leaderboards report a single number. We don't think that's honest.
Re-run it and the number changes. Same prompt, same Docker sandbox.

Finding 1
"Tied overall" hides 7× per-axis gaps.
Claude Code 0.63 vs Gemini CLI 0.52 looks close. But on Tool Use, Claude is 7× better. The average lies.
Finding 2
Code tasks are commodity now.
Both agents score 1.00 on every code task. "Which AI writes code better" is the wrong question in 2026.
Finding 3
Same agent. Same task. 70-point swing.
Claude Code on tool-001 scored 0.0 in trial 1 and 0.7 in trial 2. This is why we run ≥3 trials.

Run-to-run variance (the part nobody publishes)

v0.2 baseline · n = 2-3 runs per agent · v0.3 will use n ≥ 5

We re-ran each agent's full benchmark multiple times. Claude is high-variance. Gemini is low-variance. Both deserve to be reported.

Agent        Run 1  Run 2  Run 3  Spread  Reading
Claude Code  0.604  0.656  -      ±5%     High variance: single trial unreliable
Gemini CLI   0.516  0.516  0.518  ±0.4%   Low variance: output very consistent

Why this matters: if SWE-bench / PinchBench / ClawProBench publish single-trial leaderboards on high-variance agents, the rankings can flip with a re-run. We report variance up front so you can judge for yourself.
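
As a rough illustration of how a spread figure can be derived from per-run scores, the sketch below uses the run scores reported in the table. The page does not state the formula behind its Spread column, so `relative_half_range` is an assumed definition and its output need not match the published percentages. `ranges_overlap` is likewise a hypothetical helper, not part of any published tooling.

```python
from statistics import mean

# Overall scores per full benchmark run, copied from the variance table.
runs = {
    "Claude Code": [0.604, 0.656],
    "Gemini CLI": [0.516, 0.516, 0.518],
}


def relative_half_range(scores):
    """Half of (max - min), as a percentage of the mean score.

    Assumed definition of "spread"; the benchmark may compute it differently.
    """
    return 100 * (max(scores) - min(scores)) / 2 / mean(scores)


def ranges_overlap(a, b):
    """True if the observed [min, max] intervals of two agents intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


for agent, scores in runs.items():
    print(f"{agent}: mean={mean(scores):.3f}, "
          f"spread=±{relative_half_range(scores):.1f}%")

# If the observed ranges overlap, a single re-run could reorder the two agents.
print("observed ranges overlap:",
      ranges_overlap(runs["Claude Code"], runs["Gemini CLI"]))
```

With only two or three runs per agent, any such statistic is a coarse estimate. Reporting the raw per-run scores next to it, as the table does, lets readers apply their own definition.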

Rankings

Agents ranked by overall weighted score across all tasks.

First benchmark coming soon

We're running the initial benchmark against Claude Code, Gemini CLI, and more.

Results will appear here automatically.

How We Test

5 domains — Code, Research, Data Analysis, Tool Use, Multi-step Workflows
10 benchmark tasks with Easy, Medium, and Hard difficulty levels
pass@k reliability — each task run multiple times; we report consistency, not best-of
Sandboxed execution — every task runs in an isolated Docker environment
Dual scoring — automated tests + LLM-as-Judge for subjective tasks
CLI agents first — Claude Code, Gemini CLI, Codex CLI, Aider, and more
Read full methodology →
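
To make the difference between "best-of" and "consistency" concrete, here is a minimal sketch of two standard estimators over n trials with c passes: the usual pass@k (at least one of k sampled trials passes) and an all-k-pass variant (every one of k sampled trials passes). Which of these, if either, the benchmark reports is not stated on this page, so treat the function names and the pass/fail framing as assumptions. Trials scored on a 0-1 scale would first need a pass threshold.

```python
from math import comb


def pass_at_k(n, c, k):
    """Chance that at least one of k trials, drawn without replacement
    from n trials of which c passed, is a pass (best-of-k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_all_k(n, c, k):
    """Chance that all k trials, drawn without replacement from n trials
    of which c passed, are passes (a consistency measure)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)  # comb(c, k) is 0 when k > c


# Hypothetical example: 3 trials of one task, 2 of them passing.
print(pass_at_k(3, 2, 2))   # best-of-2
print(pass_all_k(3, 2, 2))  # both-of-2
```

An agent that passes two trials in three looks perfect under best-of-2 but succeeds only a third of the time when both sampled trials must pass, which is the gap a consistency-oriented metric is meant to expose.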

Benchmark your own agent

Fully open-source and reproducible. Add your agent adapter and run the full suite in minutes.

$ pip install agentbench-live && agentbench run --agent <name>