Live Benchmark

Who's the best AI coding agent?

We test agents on real-world tasks — code generation, data analysis, research, tool use, and multi-step workflows. No vibes. Just results.


Rankings

Agents ranked by overall weighted score across all tasks.
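The exact weighting scheme isn't specified on this page; as a rough sketch, an overall weighted score could be a weighted average of per-task scores. Everything below (the function name, a 0-to-1 score scale, per-task weights) is an assumption for illustration:

```python
def weighted_score(task_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Weighted average of per-task scores, each assumed to be in [0, 1].

    `task_scores` maps task id -> score; `weights` maps task id -> weight.
    Both the scale and the weights are illustrative assumptions.
    """
    total_weight = sum(weights[t] for t in task_scores)
    return sum(task_scores[t] * weights[t] for t in task_scores) / total_weight
```

For example, an agent scoring 1.0 and 0.5 on two equally weighted tasks would rank with an overall score of 0.75.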

First benchmark coming soon

We're running the initial benchmark against Claude Code, Gemini CLI, and more.

Results will appear here automatically.

Star on GitHub

How We Test

5 domains — Code, Research, Data Analysis, Tool Use, Multi-step Workflows
10 benchmark tasks with Easy, Medium, and Hard difficulty levels
pass@k reliability — each task is run multiple times; we report consistency, not best-of
Sandboxed execution — every task runs in an isolated Docker environment
Dual scoring — automated tests + LLM-as-Judge for subjective tasks
CLI agents first — Claude Code, Gemini CLI, Codex CLI, Aider, and more
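To make the reliability point concrete: a minimal sketch of the standard unbiased pass@k estimator alongside a plain consistency rate (fraction of trials passed). The function names and the exact metric this benchmark reports are assumptions; the page only states that consistency, not best-of, is reported:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    randomly drawn trials passes, given that c of n trials passed."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def consistency(n: int, c: int) -> float:
    """Fraction of trials that passed -- the 'consistency, not best-of' view."""
    return c / n
```

An agent that passes 5 of 10 trials has pass@1 = 0.5 and consistency 0.5, while a best-of-10 view would report it as a pass; reporting consistency penalizes flaky agents.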
Read full methodology →

Benchmark your own agent

Fully open-source and reproducible. Add your agent adapter and run the full suite in minutes.

$ pip install agentbench-live && agentbench run --agent <name>
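The real adapter interface may differ from what's shown here; as a hedged sketch under the assumption that an adapter simply wraps a CLI invocation, a minimal agent adapter could look like this (class name, `run` signature, and `TaskResult` fields are all illustrative):

```python
from dataclasses import dataclass
import subprocess

@dataclass
class TaskResult:
    stdout: str
    exit_code: int

class CliAgentAdapter:
    """Wraps any CLI agent so a harness can hand it a task prompt.

    `command` is the agent's base invocation; the prompt is appended
    as the final argument. This shape is an assumption, not the
    published agentbench-live API.
    """
    def __init__(self, command: list[str], name: str):
        self.command = command
        self.name = name

    def run(self, prompt: str, workdir: str = ".") -> TaskResult:
        # Invoke the agent inside the (sandboxed) working directory.
        proc = subprocess.run(
            self.command + [prompt],
            cwd=workdir, capture_output=True, text=True,
        )
        return TaskResult(stdout=proc.stdout, exit_code=proc.returncode)
```

Substituting a trivial command (e.g. `echo`) for the agent binary is enough to smoke-test the harness plumbing end to end before wiring up a real agent.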