AgentBench-Live — Real-Time Agent Leaderboard

Methodology

5 domains: Code, Research, Data Analysis, Tool Use, Multi-step Workflows
10 benchmark tasks with Easy / Medium / Hard difficulty levels
pass@k reliability: Each task is run multiple times; we report consistency across runs, not best-of
Sandboxed execution: Every task runs in an isolated environment
Dual scoring: Automated tests + LLM-as-Judge for subjective tasks
CLI agents first: Claude Code, Gemini CLI, Codex CLI, and more
Fully reproducible: pip install agentbench-live && agentbench run --agent <name>
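
The list above does not give the exact formula behind the reliability score, so the following is only a sketch of how repeated runs might be summarized. It contrasts the standard pass@k estimator (at least one of k sampled runs passes, i.e. best-of) with a stricter all-k-pass variant that rewards consistency. The function names and the all-k-pass definition are assumptions for illustration, not AgentBench-Live's confirmed implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Best-of metric: chance that at least one of k runs, sampled without
    replacement from n recorded runs (c of which passed), is a pass."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few failures to fill a sample of k, so some pass is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_all_k(n: int, c: int, k: int) -> float:
    """Consistency metric (hypothetical name): chance that all k sampled
    runs pass. Assumed here as one way to report consistency, not best-of."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if c < k:
        # Fewer than k passing runs exist, so k passes in a row is impossible.
        return 0.0
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only: 5 runs of one task, 3 of them passed.
    n_runs, n_passed, k = 5, 3, 2
    print("pass@k (best-of):    ", pass_at_k(n_runs, n_passed, k))
    print("all-k-pass (strict): ", pass_all_k(n_runs, n_passed, k))
```

With 3 passes out of 5 runs and k = 2, the best-of figure is 0.9 while the all-k-pass figure is 0.3, which shows why a consistency-oriented score can rank agents very differently from a best-of score.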