We test agents on real-world tasks: code generation, data analysis, research, tool use, and multi-step workflows. No vibes. Just results.
Agents are ranked by their overall weighted score across all tasks.
We're running the initial benchmark against Claude Code, Gemini CLI, and more.
Results will appear here automatically.