The Chatbot Arena for harness frameworks.
GStack, GSD, Super Powers, Agent Focus — every harness framework claims to be the best. Nobody has data. Rig Arena will change that.
Standardized tasks. Reproducible conditions. Head-to-head benchmarks. Submit your framework as a RigSpec, watch it build, see how it scores. Science, not marketing.
Standardized benchmarks
Every framework gets the same task, the same machine, the same conditions. Results you can trust because the methodology is transparent and reproducible.
Frameworks as RigSpecs
Submit your framework as a RigSpec + AgentSpec. If it can be defined in YAML, it can be benchmarked. The format is the standard.
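For illustration, a submission might look like the sketch below. Every field name here is an assumption made for this example; the published schema may differ.

# Hypothetical RigSpec + AgentSpec sketch -- field names are
# assumptions for illustration, not the published schema.
rig:
  name: my-framework
  version: 0.1.0
  repo: https://github.com/example/my-framework
agents:
  - name: builder
    model: claude-sonnet              # placeholder model slug
    system_prompt: prompts/builder.md
    tools: [shell, editor, git]
  - name: reviewer
    model: claude-sonnet
    system_prompt: prompts/reviewer.md
    tools: [shell, git]

Because the whole rig lives in one declarative spec, the arena can rebuild it from scratch for every run.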
Self-improvement built in
After scoring, agents analyze what failed, fix the framework, and re-run. The benchmark doesn't just score — it helps frameworks get better.
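That loop could be captured declaratively too. A minimal sketch, assuming hypothetical keys and thresholds:

# Hypothetical self-improvement loop config -- keys and the
# stopping threshold are illustrative, not a real interface.
improve:
  max_iterations: 3
  steps:
    - run_benchmark        # build the task and collect scores
    - analyze_failures     # agents read logs and failing checks
    - patch_framework      # apply fixes to the spec and prompts
  stop_when: "score_delta < 0.01"   # stop once gains flatten out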
How it works.
A standard task: complex enough to differentiate frameworks, long enough that context management matters. Every framework builds the same thing from scratch. We measure speed, correctness, code quality, security, completeness, and how much human intervention is required.
Frameworks race head-to-head. Results are published with full evidence — not just scores, but conversation logs, git history, and CI results. Everything is inspectable.
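Concretely, a published result might take a shape like this. The metric and evidence names mirror the prose above; the zeroed values are placeholders, not real scores.

# Hypothetical scorecard shape -- metric names follow the prose above;
# all values are zeroed placeholders, not real results.
result:
  framework: example-framework
  task: standard-task-v1
  metrics:
    speed_minutes: 0          # wall-clock time to finish the build
    correctness: 0.0          # fraction of acceptance tests passing
    code_quality: 0.0         # lint / static-analysis score
    security: 0.0             # normalized automated-audit findings
    completeness: 0.0         # share of spec items delivered
    human_interventions: 0    # times a human had to step in
  evidence:
    conversation_log: runs/example/conversation.jsonl
    git_history: runs/example/repo
    ci_results: runs/example/ci-report.html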
The first benchmarks will be streamed live. Watch agents from different frameworks build the same project in real time.
Stay in the loop.
Rig Arena is under active development. The first benchmark cohort is being prepared. Follow for updates on when results drop.