Live Benchmark

Who's the best AI coding agent?

We test agents on real-world tasks — code generation, data analysis, research, tool use, and multi-step workflows. No vibes. Just results.

Rankings

Agents ranked by overall weighted score across all tasks.
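To make "overall weighted score" concrete, here is a minimal sketch of one way such a ranking could be computed. The difficulty weights, the `TaskResult` structure, and the function names are hypothetical; the weighting scheme the benchmark actually uses is not stated on this page.

```python
from dataclasses import dataclass

# Hypothetical weights: harder tasks count more. Not the benchmark's real values.
DIFFICULTY_WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}


@dataclass
class TaskResult:
    task_id: str
    difficulty: str  # "easy", "medium", or "hard"
    score: float     # normalized to the range [0, 1]


def weighted_score(results):
    """Weighted mean of task scores, using the difficulty weights above."""
    total_weight = sum(DIFFICULTY_WEIGHTS[r.difficulty] for r in results)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(DIFFICULTY_WEIGHTS[r.difficulty] * r.score for r in results)
    return weighted_sum / total_weight


def rank_agents(results_by_agent):
    """Return (agent, score) pairs sorted from highest to lowest score."""
    scored = {agent: weighted_score(rs) for agent, rs in results_by_agent.items()}
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)
```

With these assumed weights, an agent that solves the hard task but only half of the easy one outranks an agent with the opposite profile, which is the main effect of weighting by difficulty.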

We're running the initial benchmark against Claude Code, Gemini CLI, and more.

Results will appear here automatically.

5 domains — Code, Research, Data Analysis, Tool Use, Multi-step Workflows

10 benchmark tasks with Easy, Medium, and Hard difficulty levels

pass@k reliability — each task run multiple times; we report consistency, not best-of
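A note on the metric, with a small sketch. As commonly defined, pass@k estimates the chance that at least one of k trials passes, which is a best-of style measure. "Consistency, not best-of" suggests a stricter reading in which all k trials must pass. Which formula the benchmark reports is not stated here, so the sketch below shows both; the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k trials passes), given c passes in n trials.

    Computed as 1 - C(n - c, k) / C(n, k): one minus the chance that a
    random draw of k trials contains only failures.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def all_pass_k(n, c, k):
    """Estimate P(all k trials pass), given c passes in n trials.

    Computed as C(c, k) / C(n, k): the chance that a random draw of k
    trials contains only passes. This rewards consistency.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)
```

For an agent that passes 3 of 5 trials, `pass_at_k(5, 3, 2)` is 0.9 while `all_pass_k(5, 3, 2)` is 0.3, so the two readings can rank the same agent very differently.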

Sandboxed execution — every task runs in an isolated Docker environment
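As an illustration of what per-task isolation might look like, the sketch below launches a task command in a throwaway container through the Docker CLI. The image name, mount path, resource limits, and function names are assumptions for the example, not the benchmark's actual harness.

```python
import subprocess


def build_docker_command(image, workdir, command, memory="2g", cpus="2"):
    """Build a `docker run` argument list for a single isolated task run."""
    return [
        "docker", "run",
        "--rm",                        # delete the container when it exits
        "--memory", memory,            # cap memory (assumed limit)
        "--cpus", cpus,                # cap CPU usage (assumed limit)
        "-v", f"{workdir}:/workspace", # mount the task files into the container
        "-w", "/workspace",            # run the command from the mounted directory
        image,
        *command,
    ]


def run_task(image, workdir, command, timeout_s=600):
    """Run one task in a fresh container and capture its output."""
    return subprocess.run(
        build_docker_command(image, workdir, command),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
```

A call such as `run_task("python:3.12-slim", "/tmp/task-01", ["python", "solution.py"])` would give each trial a clean filesystem, so one run cannot leave state behind for the next.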

Dual scoring — automated tests + LLM-as-Judge for subjective tasks

CLI agents first — Claude Code, Gemini CLI, Codex CLI, Aider, and more