Benchmark Results
Agent Leaderboard
Official and community benchmark results scored in Effective Minutes. Lite submissions test a subset of the standard using pipeline-only metrics.

| # | Agent / Model | Memory | ARM | Cx | Accuracy | Safe | Accuracy vs base | Time | Cost | Cost vs base | Tok saved |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 (CueCrux Labs) | Crux 0.1.0 | C2 | 10.9 | 90% | Y | +2% | 29.7m | $3.695 | +7% | -$0.25 |
| 2 | claude-haiku-4-5 (CueCrux Labs) | Crux 0.1.0 | C2 | 26.1 | 88% | Y | +1% | 14.6m | $1.232 | +6% | -$0.07 |
| 3 | claude-sonnet-4-6 (CueCrux Labs) | none | C2 | 10.1 | 88% | Y | | 30.2m | $3.441 | -- | |
| 4 | claude-haiku-4-5 (CueCrux Labs) | none | C2 | 25.3 | 87% | Y | | 14.5m | $1.164 | -- | |
| 5 | gpt-4.1-mini (CueCrux Labs) | none | C2 | 52.4 | 87% | Y | | 6.3m | $0.275 | -- | |
| 6 | claude-sonnet-4-6 (CueCrux Labs) | Crux 0.1.0 | C0 | 17.0 | 87% | Y | -1% | 30.0m | $1.805 | -48% | +$1.64 |
| 7 | gpt-5.4 (CueCrux Labs) | Crux 0.1.0 | C2 | 5.1 | 86% | Y | +1% | 13.7m | $7.599 | +11% | -$0.78 |
| 8 | gpt-4.1-mini (CueCrux Labs) | Crux 0.1.0 | C2 | 51.9 | 85% | Y | -2% | 6.6m | $0.298 | +8% | -$0.02 |
| 9 | gpt-5.4 (CueCrux Labs) | none | C2 | 6.0 | 85% | Y | | 12.8m | $6.817 | -- | |
| 10 | gpt-5.4 (CueCrux Labs) | Crux 0.1.0 | C0 | 17.0 | 84% | Y | -1% | 13.4m | $2.155 | -68% | +$4.66 |
| 11 | claude-haiku-4-5 (CueCrux Labs) | Crux 0.1.0 | C0 | 49.0 | 84% | Y | -3% | 9.8m | $0.417 | -64% | +$0.75 |
| 12 | gpt-5.4 (CueCrux Labs) | none | C0 | 15.2 | 33% | Y | | 12.2m | $1.341 | -- | |
| 13 | claude-sonnet-4-6 (CueCrux Labs) | none | C0 | 15.8 | 32% | Y | | 21.6m | $1.164 | -- | |
| 14 | claude-haiku-4-5 (CueCrux Labs) | none | C0 | 34.4 | 28% | Y | | 6.4m | $0.216 | -- | |
| 15 | gpt-4.1-mini (CueCrux Labs) | Crux 0.1.0 | C0 | 0.0 | 83% | N | -4% | 6.5m | $0.092 | -67% | +$0.18 |
| 16 | gpt-4.1-mini (CueCrux Labs) | none | C0 | 0.0 | 29% | N | | 5.2m | $0.053 | -- | |
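
The page does not say how the two "vs base" columns and "Tok saved" are derived. The numbers in the table appear consistent with comparing each Crux run against the same model's no-memory C2 run, with the accuracy delta in percentage points, the cost delta as a rounded relative change, and the saving as the dollar difference. The sketch below reproduces that arithmetic under this assumption; the function name `vs_base` and its parameters are hypothetical and not part of the benchmark tooling.

```python
def vs_base(accuracy, cost, base_accuracy, base_cost):
    """Recompute the leaderboard deltas for one run against its baseline.

    Assumption (inferred from the table, not documented by the source):
    the baseline is the same model with Memory = none in the C2 setting.
    """
    # Accuracy delta in percentage points, e.g. 90 - 88 = +2.
    accuracy_delta = accuracy - base_accuracy
    # Relative cost change, rounded to a whole percent.
    cost_delta_pct = round((cost - base_cost) / base_cost * 100)
    # Dollar saving relative to the baseline (negative means the run cost more).
    saved = round(base_cost - cost, 2)
    return accuracy_delta, cost_delta_pct, saved


# claude-sonnet-4-6 baseline (row 3): 88% accuracy, $3.441.
print(vs_base(90, 3.695, 88, 3.441))  # row 1, Crux 0.1.0 in C2
print(vs_base(87, 1.805, 88, 3.441))  # row 6, Crux 0.1.0 in C0
```

Under this reading, row 1 comes out as +2 points, +7% cost, and -$0.25, and row 6 as -1 point, -48% cost, and +$1.64, matching the published cells.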
