Composite Ranking
Frontier Models
One ranking across everything ScoreCrux measures: Scale (memory-backed agents), Intelligence (18-item psychometric IQ), and Coding, plus Em per dollar (Em/$) so you can see who delivers the most capability for the money.

| # | Model | Composite Em | Scale best | IQ mean | Code best | Safe | T Safe | Spend | Em/$ | Runs |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-haiku-4-5 (Anthropic) | 95.4 | 132.44 (27 runs) | 102.0 (5 runs) | 80% (1 run) | 95% (20 runs) | 100% (7 runs) | $7.20 | 13 | 37 |
| 2 | gpt-5.4-mini (OpenAI) | 77.5 | 91.18 (36 runs) | 106.0 (1 run) | 82% (1 run) | 100% (23 runs) | 100% (13 runs) | $1.69 | 46 | 38 |
| 3 | claude-opus-4-6 (Anthropic) | 61.8 | 53.77 (22 runs) | 111.0 (1 run) | 82% (1 run) | 84% (19 runs) | 100% (3 runs) | $141.47 | 0 | 24 |
| 4 | gpt-5.4 (OpenAI) | 60.5 | 56.92 (40 runs) | 106.3 (4 runs) | 82% (1 run) | 96% (23 runs) | 100% (17 runs) | $42.48 | 1 | 49 |
| 5 | claude-opus-4-7 (Anthropic) | 57.9 | 48.14 (10 runs) | 108.3 (6 runs) | 84% (1 run) | 100% (6 runs) | 100% (4 runs) | $34.48 | 2 | 17 |
| 6 | gpt-4.1-mini (OpenAI) | 55.3 | 56.81 (29 runs) | 98.0 (1 run) | 80% (1 run) | 100% (12 runs) | 100% (17 runs) | $3.97 | 14 | 35 |
| 7 | claude-sonnet-4-6 (Anthropic) | 53.2 | 47.59 (44 runs) | 101.8 (6 runs) | 81% (1 run) | 85% (20 runs) | 96% (24 runs) | $81.79 | 1 | 55 |
Methodology
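- Composite Em: weighted blend across the three measurement axes ScoreCrux currently runs. Weights are 50% scale (best Cx Em), 30% intelligence (IQ mean normalised so that 80→0 and 130→1, then scaled ×100), and 20% coding (compositeScore × 100). The nominal maximum is 100, assuming each component tops out at 100; Scale best can exceed that (132.44 for claude-haiku-4-5), so the composite is not strictly bounded.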
- Scale best: highest Cx Em across all projects (alpha / beta / gamma / delta) and arms (C0 / C2 / F1 / T1 / T2 / T3).
- IQ mean: arithmetic mean of Full-Scale IQ across every intelligence run. The 95% confidence interval narrows as the number of runs (N) grows, so means based on a single run are the least certain.
- Safe: S_gate pass rate on bare / context-only runs (C-arms + F1). Beta is the strictest test — safety is only measured on benchmarks that include destructive-action traps.
- T Safe: S_gate pass rate on tool-armed runs (T1 / T2 / T3) where the agent has memory + constraint-check tools. Expect this to be ≥ Safe; when it isn't, the tools aren't doing their job.
- Em/$: Composite Em ÷ sum of all run costs (the Spend column), rounded to a whole number in the table. Higher is better. Today's spread is about two orders of magnitude, from under 1 to 46.
- Source data: every run visible on the scale / intelligence / coding / community leaderboards. Re-verify any row via the "copy cmd" button on the per-benchmark pages.
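
The Composite Em and Em/$ definitions above can be checked against the table with a few lines of Python. This is a minimal sketch, not ScoreCrux's own code: the function names are hypothetical, and it assumes that the IQ normalisation is linear and unclamped and that the coding score is passed as a fraction (0.80 for 80%).

```python
def normalise_iq(iq_mean: float) -> float:
    """Linear map so that IQ 80 -> 0 and IQ 130 -> 1 (assumed unclamped)."""
    return (iq_mean - 80.0) / 50.0


def composite_em(scale_best: float, iq_mean: float, code_score: float) -> float:
    """Weighted blend: 50% scale, 30% intelligence, 20% coding.

    scale_best  -- best Cx Em, used as-is
    iq_mean     -- mean Full-Scale IQ
    code_score  -- coding compositeScore as a fraction (assumption: 0..1)
    """
    return (
        0.5 * scale_best
        + 0.3 * normalise_iq(iq_mean) * 100.0
        + 0.2 * code_score * 100.0
    )


def em_per_dollar(composite: float, spend_usd: float) -> float:
    """Composite Em divided by total spend across all runs."""
    return composite / spend_usd


# Row 1 of the table: claude-haiku-4-5
haiku = composite_em(scale_best=132.44, iq_mean=102.0, code_score=0.80)
print(round(haiku, 1))                      # 95.4
print(round(em_per_dollar(95.4, 7.20)))     # 13
```

Other rows may differ from the table by about 0.1 because the displayed coding percentages are themselves rounded.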
