Benchmark Results

Agent Leaderboard

Official and community benchmark results scored in Effective Minutes. Lite submissions test a subset of the standard using pipeline-only metrics.

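| # | Agent / Model | Memory | ARM | Cx | Accuracy | Safe | Accuracy vs base | Time | Cost | Cost vs base | Tok saved |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 (CueCrux Labs) | Crux 0.1.0 | C2 | 10.9 | 90% | Y | +2% | 29.7m | $3.695 | +7% | -$0.25 |
| 2 | claude-haiku-4-5 (CueCrux Labs) | Crux 0.1.0 | C2 | 26.1 | 88% | Y | +1% | 14.6m | $1.232 | +6% | -$0.07 |
| 3 | claude-sonnet-4-6 (CueCrux Labs) | none | C2 | 10.1 | 88% | Y | | 30.2m | $3.441 | - | - |
| 4 | claude-haiku-4-5 (CueCrux Labs) | none | C2 | 25.3 | 87% | Y | | 14.5m | $1.164 | - | - |
| 5 | gpt-4.1-mini (CueCrux Labs) | none | C2 | 52.4 | 87% | Y | | 6.3m | $0.275 | - | - |
| 6 | claude-sonnet-4-6 (CueCrux Labs) | Crux 0.1.0 | C0 | 17.0 | 87% | Y | -1% | 30.0m | $1.805 | -48% | +$1.64 |
| 7 | gpt-5.4 (CueCrux Labs) | Crux 0.1.0 | C2 | 5.1 | 86% | Y | +1% | 13.7m | $7.599 | +11% | -$0.78 |
| 8 | gpt-4.1-mini (CueCrux Labs) | Crux 0.1.0 | C2 | 51.9 | 85% | Y | -2% | 6.6m | $0.298 | +8% | -$0.02 |
| 9 | gpt-5.4 (CueCrux Labs) | none | C2 | 6.0 | 85% | Y | | 12.8m | $6.817 | - | - |
| 10 | gpt-5.4 (CueCrux Labs) | Crux 0.1.0 | C0 | 17.0 | 84% | Y | -1% | 13.4m | $2.155 | -68% | +$4.66 |
| 11 | claude-haiku-4-5 (CueCrux Labs) | Crux 0.1.0 | C0 | 49.0 | 84% | Y | -3% | 9.8m | $0.417 | -64% | +$0.75 |
| 12 | gpt-5.4 (CueCrux Labs) | none | C0 | 15.2 | 33% | Y | | 12.2m | $1.341 | - | - |
| 13 | claude-sonnet-4-6 (CueCrux Labs) | none | C0 | 15.8 | 32% | Y | | 21.6m | $1.164 | - | - |
| 14 | claude-haiku-4-5 (CueCrux Labs) | none | C0 | 34.4 | 28% | Y | | 6.4m | $0.216 | - | - |
| 15 | gpt-4.1-mini (CueCrux Labs) | Crux 0.1.0 | C0 | 0.0 | 83% | N | -4% | 6.5m | $0.092 | -67% | +$0.18 |
| 16 | gpt-4.1-mini (CueCrux Labs) | none | C0 | 0.0 | 29% | N | | 5.2m | $0.053 | - | - |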
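
The leaderboard does not say how the cost "vs base" and "Tok saved" figures are derived. The numbers are consistent with each Crux run being compared against the same model's C2 run with Memory set to none, with "Tok saved" expressed in dollars. The sketch below reproduces the published figures under that assumption. The baseline pairing, the rounding, and the helper name `cost_vs_base` are all assumptions made for illustration, not part of the benchmark.

```python
# Costs in USD copied from the table. ASSUMPTION: the baseline for each
# model is its C2 run with Memory = none (rows 3, 4, 5 and 9).
BASELINE_COST = {
    "claude-sonnet-4-6": 3.441,
    "claude-haiku-4-5": 1.164,
    "gpt-4.1-mini": 0.275,
    "gpt-5.4": 6.817,
}


def cost_vs_base(model: str, cost: float) -> tuple[int, float]:
    """Hypothetical helper: return (percent change, dollars saved) vs the assumed baseline."""
    base = BASELINE_COST[model]
    pct_change = round((cost - base) / base * 100)  # whole percent, as displayed
    saved = round(base - cost, 2)                   # positive means cheaper than baseline
    return pct_change, saved


print(cost_vs_base("claude-sonnet-4-6", 3.695))  # row 1  -> (7, -0.25)
print(cost_vs_base("gpt-5.4", 2.155))            # row 10 -> (-68, 4.66)
```

The same calculation matches the other Crux rows as well (rows 2, 6, 7, 8, 11 and 15), which supports the assumed pairing but does not confirm it.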