Insights

Benchmark Summary

Cross-run analysis of all benchmark submissions. Updated live as new results arrive.

| Metric | Value |
| --- | --- |
| Total submissions | 18 |
| Models tested | 6 |
| Memory systems | 1 |
| Best Cx (Em) | 52.4 |
| Avg accuracy | 64% |
| Safety pass rate | 16/18 |

By Model

Best score per model across all submissions

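| Model | Runs | Best Cx | Avg Accuracy | Avg Judge | Cost range |
| --- | --- | --- | --- | --- | --- |
| gpt-4.1-mini | 4 | 52.4 | 71% | -- | $0.053 – $0.298 |
| claude-haiku-4-5 | 4 | 49.0 | 72% | -- | $0.216 – $1.232 |
| claude-sonnet-4-6 | 4 | 17.0 | 74% | -- | $1.164 – $3.695 |
| gpt-5.4 | 4 | 17.0 | 72% | -- | $1.341 – $7.599 |
| other:synthetic-v16-check | 1 | 4.6 | 0% | -- | $0.100 – $0.100 |
| other:synthetic-tool-attribution-check | 1 | 4.5 | 0% | -- | $0.120 – $0.120 |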

By Memory System

Performance comparison across memory systems

| Memory System | Runs | Best Cx | Avg Accuracy | Avg Judge |
| --- | --- | --- | --- | --- |
| (no memory) | 10 | 52.4 | 47% | -- |
| other:Crux | 8 | 51.9 | 86% | -- |
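The headline "Avg accuracy" of 64% can be cross-checked against the two tables above. The sketch below assumes the headline is a run-weighted mean of the per-group averages; the page does not state how it is computed, and the table values are already rounded, so the result is only approximate. The names `by_model`, `by_memory`, and `weighted_accuracy` are illustrative.

```python
# (runs, average accuracy in percent) copied from the By Model and
# By Memory System tables on this page.
by_model = {
    "gpt-4.1-mini": (4, 71),
    "claude-haiku-4-5": (4, 72),
    "claude-sonnet-4-6": (4, 74),
    "gpt-5.4": (4, 72),
    "other:synthetic-v16-check": (1, 0),
    "other:synthetic-tool-attribution-check": (1, 0),
}
by_memory = {
    "(no memory)": (10, 47),
    "other:Crux": (8, 86),
}


def weighted_accuracy(groups):
    """Return (total runs, run-weighted mean accuracy in percent)."""
    total_runs = sum(runs for runs, _ in groups.values())
    weighted_sum = sum(runs * acc for runs, acc in groups.values())
    return total_runs, weighted_sum / total_runs


model_runs, model_avg = weighted_accuracy(by_model)
memory_runs, memory_avg = weighted_accuracy(by_memory)
print(model_runs, round(model_avg, 1))    # 18 64.2
print(memory_runs, round(memory_avg, 1))  # 18 64.3
```

Both groupings account for all 18 submissions and round to the 64% shown in the summary, which is consistent with the run-weighted assumption.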

Recent Submissions

| Submission | Model | Cx (Em) | Accuracy | Date |
| --- | --- | --- | --- | --- |
| tool-attribution-verify | other:synthetic-v16-check | 4.6 | 0% | 2026-04-16 |
| tool-attribution-verify | other:synthetic-tool-attribution-check | 4.5 | 0% | 2026-04-16 |
| scorecrux-bench/claude-sonnet-4-6 | claude-sonnet-4-6 | 10.9 | 90% | 2026-04-13 |
| scorecrux-bench/claude-haiku-4-5 | claude-haiku-4-5 | 26.1 | 88% | 2026-04-13 |
| scorecrux-bench/claude-sonnet-4-6 | claude-sonnet-4-6 | 10.1 | 88% | 2026-04-13 |
| scorecrux-bench/claude-haiku-4-5 | claude-haiku-4-5 | 25.3 | 87% | 2026-04-13 |
| scorecrux-bench/claude-haiku-4-5 | claude-haiku-4-5 | 49.0 | 84% | 2026-04-13 |
| scorecrux-bench/claude-haiku-4-5 | claude-haiku-4-5 | 34.4 | 28% | 2026-04-13 |
| scorecrux-bench/gpt-5.4 | gpt-5.4 | 5.1 | 86% | 2026-04-13 |
| scorecrux-bench/claude-sonnet-4-6 | claude-sonnet-4-6 | 17.0 | 87% | 2026-04-13 |
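The per-model aggregation ("best score per model across all submissions") can be sketched over the ten rows listed above. This is a minimal illustration, assuming "best" means the highest Cx and "Avg Accuracy" is an unweighted mean over a model's runs; the dashboard's actual implementation is not shown on this page, and `recent` and `summarize` are hypothetical names. Only 10 of the 18 submissions appear in this list, so totals for some models will differ from the By Model table.

```python
from collections import defaultdict

# (model, Cx in Em, accuracy in percent) for the ten rows listed
# under Recent Submissions.
recent = [
    ("other:synthetic-v16-check", 4.6, 0),
    ("other:synthetic-tool-attribution-check", 4.5, 0),
    ("claude-sonnet-4-6", 10.9, 90),
    ("claude-haiku-4-5", 26.1, 88),
    ("claude-sonnet-4-6", 10.1, 88),
    ("claude-haiku-4-5", 25.3, 87),
    ("claude-haiku-4-5", 49.0, 84),
    ("claude-haiku-4-5", 34.4, 28),
    ("gpt-5.4", 5.1, 86),
    ("claude-sonnet-4-6", 17.0, 87),
]


def summarize(rows):
    """Group rows by model and compute run count, best Cx, mean accuracy."""
    grouped = defaultdict(list)
    for model, cx, acc in rows:
        grouped[model].append((cx, acc))
    summary = {}
    for model, runs in grouped.items():
        summary[model] = {
            "runs": len(runs),
            "best_cx": max(cx for cx, _ in runs),  # assumes higher Cx is better
            "avg_accuracy": sum(acc for _, acc in runs) / len(runs),
        }
    return summary


summary = summarize(recent)
# Print models ordered by best Cx, highest first.
for model, stats in sorted(summary.items(), key=lambda kv: -kv[1]["best_cx"]):
    print(f"{model}: runs={stats['runs']} best_cx={stats['best_cx']} "
          f"avg_accuracy={stats['avg_accuracy']:.1f}%")
```

All four claude-haiku-4-5 runs appear in the list, so its result here (4 runs, best Cx 49.0, mean accuracy 71.75%) agrees with the By Model row of 49.0 and 72%.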