Insights
Benchmark Summary
Cross-run analysis of all benchmark submissions. Updated live as new results arrive.
| Metric | Value |
|---|---|
| Total submissions | 18 |
| Models tested | 6 |
| Memory systems | 1 |
| Best Cx (Em) | 52.4 |
| Avg accuracy | 64% |
| Safety pass rate | 16/18 |
By Model
Best score per model across all submissions
| Model | Runs | Best Cx | Avg Accuracy | Avg Judge | Cost range |
|---|---|---|---|---|---|
| gpt-4.1-mini | 4 | 52.4 | 71% | -- | $0.053 – $0.298 |
| claude-haiku-4-5 | 4 | 49.0 | 72% | -- | $0.216 – $1.232 |
| claude-sonnet-4-6 | 4 | 17.0 | 74% | -- | $1.164 – $3.695 |
| gpt-5.4 | 4 | 17.0 | 72% | -- | $1.341 – $7.599 |
| other:synthetic-v16-check | 1 | 4.6 | 0% | -- | $0.100 – $0.100 |
| other:synthetic-tool-attribution-check | 1 | 4.5 | 0% | -- | $0.120 – $0.120 |
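As a sanity check, the headline figures in the summary can be reproduced from the table above. The sketch below assumes the overall "Avg accuracy" is a run-weighted mean of the per-model averages; the page does not document how it is actually computed, and the per-model values are already rounded, so the result is only approximate. The names `by_model` and `run_weighted_mean` are illustrative, not part of any benchmark tooling.

```python
# (runs, avg accuracy in %) copied from the By Model table above.
by_model = {
    "gpt-4.1-mini": (4, 71),
    "claude-haiku-4-5": (4, 72),
    "claude-sonnet-4-6": (4, 74),
    "gpt-5.4": (4, 72),
    "other:synthetic-v16-check": (1, 0),
    "other:synthetic-tool-attribution-check": (1, 0),
}


def run_weighted_mean(rows):
    """Average per-group accuracies, weighting each group by its run count.

    Assumption: this is how the dashboard aggregates; it is not confirmed.
    """
    rows = list(rows)
    total_runs = sum(runs for runs, _ in rows)
    weighted_sum = sum(runs * acc for runs, acc in rows)
    return total_runs, weighted_sum / total_runs


total_runs, overall = run_weighted_mean(by_model.values())
print(total_runs, round(overall))  # prints: 18 64
```

Under that assumption the two synthetic check runs, which score 0%, pull the overall average well below the 71% to 74% range of the four real models.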
By Memory System
Performance comparison across memory systems
| Memory System | Runs | Best Cx | Avg Accuracy | Avg Judge |
|---|---|---|---|---|
| (no memory) | 10 | 52.4 | 47% | -- |
| other:Crux | 8 | 51.9 | 86% | -- |
Recent Submissions
| Submission | Model | Cx (Em) | Accuracy | Date |
|---|---|---|---|---|
| tool-attribution-verify | other:synthetic-v16-check | 4.6 | 0% | 2026-04-16 |
| tool-attribution-verify | other:synthetic-tool-attribution-check | 4.5 | 0% | 2026-04-16 |
| scorecrux-bench/claude-sonnet-4-6 | claude-sonnet-4-6 | 10.9 | 90% | 2026-04-13 |
| scorecrux-bench/claude-haiku-4-5 | claude-haiku-4-5 | 26.1 | 88% | 2026-04-13 |
| scorecrux-bench/claude-sonnet-4-6 | claude-sonnet-4-6 | 10.1 | 88% | 2026-04-13 |
| scorecrux-bench/claude-haiku-4-5 | claude-haiku-4-5 | 25.3 | 87% | 2026-04-13 |
| scorecrux-bench/claude-haiku-4-5 | claude-haiku-4-5 | 49.0 | 84% | 2026-04-13 |
| scorecrux-bench/claude-haiku-4-5 | claude-haiku-4-5 | 34.4 | 28% | 2026-04-13 |
| scorecrux-bench/gpt-5.4 | gpt-5.4 | 5.1 | 86% | 2026-04-13 |
| scorecrux-bench/claude-sonnet-4-6 | claude-sonnet-4-6 | 17.0 | 87% | 2026-04-13 |
