Most AI agent leaderboards report a single number. We don't think that's honest.
Click. The number changes. Same prompt, same Docker sandbox.
tool-001: trial 1 = 0.0, trial 2 = 0.7.
Why we run ≥3 trials.
We re-ran each agent's full benchmark multiple times. Claude Code is high-variance. Gemini CLI is low-variance. Both deserve to be reported.
| Agent | Run 1 | Run 2 | Run 3 | Spread | Reading |
|---|---|---|---|---|---|
| Claude Code | 0.604 | 0.656 | — | ±5% | High variance — single trial unreliable |
| Gemini CLI | 0.516 | 0.516 | 0.518 | ±0.4% | Low variance — output very consistent |
Why this matters: if SWE-bench / PinchBench / ClawProBench publish single-trial leaderboards on high-variance agents, the rankings can flip with a re-run. We report variance up front so you can judge for yourself.
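For readers who want to check run-to-run variance themselves, here is a minimal Python sketch that summarizes the scores from the table above. The `summarize` helper and the `HIGH_VARIANCE_RANGE` threshold are hypothetical names chosen for illustration, and the spread measure used here (max minus min across runs) is an assumption; it may not match how the Spread column is calculated.

```python
from statistics import mean, stdev

# Hypothetical cutoff: a run-to-run range above this is flagged as high variance.
HIGH_VARIANCE_RANGE = 0.02


def summarize(scores):
    """Summarize one agent's overall scores across repeated benchmark runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to measure variance")
    score_range = max(scores) - min(scores)
    return {
        "mean": mean(scores),
        "range": score_range,          # assumed spread measure: max - min
        "stdev": stdev(scores),        # sample standard deviation
        "high_variance": score_range > HIGH_VARIANCE_RANGE,
    }


# Run scores copied from the table; the missing third Claude Code run is omitted.
runs = {
    "Claude Code": [0.604, 0.656],
    "Gemini CLI": [0.516, 0.516, 0.518],
}

for agent, scores in runs.items():
    s = summarize(scores)
    label = "high variance" if s["high_variance"] else "low variance"
    print(f"{agent}: mean={s['mean']:.3f} range={s['range']:.3f} ({label})")
```

Under these assumptions, a leaderboard gap between two agents that is smaller than either agent's range is one that a re-run could plausibly reverse.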
Agents ranked by overall weighted score across all tasks.
We're running the initial benchmark against Claude Code, Gemini CLI, and more.
Results will appear here automatically.