Composite Ranking

Frontier Models

One ranking across everything ScoreCrux measures: Scale (memory-backed agents), Intelligence (18-item psychometric IQ), and Coding, plus Em per dollar (Em/$) so you can see who delivers the most capability per dollar. Tap a column to sort. Click a model name to see the full run history.

Sorted by Composite Em, descending. Run counts are shown in parentheses.

#  Model (vendor)                 Composite Em  Scale best   IQ mean    Code best  Safe       T Safe     Spend    Em/$  Runs
1  claude-haiku-4-5 (Anthropic)   95.4          132.44 (27)  102.0 (5)  80% (1)    95% (20)   100% (7)   $7.20    13    37
2  gpt-5.4-mini (OpenAI)          77.5          91.18 (36)   106.0 (1)  82% (1)    100% (23)  100% (13)  $1.69    46    38
3  claude-opus-4-6 (Anthropic)    61.8          53.77 (22)   111.0 (1)  82% (1)    84% (19)   100% (3)   $141.47  0     24
4  gpt-5.4 (OpenAI)               60.5          56.92 (40)   106.3 (4)  82% (1)    96% (23)   100% (17)  $42.48   1     49
5  claude-opus-4-7 (Anthropic)    57.9          48.14 (10)   108.3 (6)  84% (1)    100% (6)   100% (4)   $34.48   2     17
6  gpt-4.1-mini (OpenAI)          55.3          56.81 (29)   98.0 (1)   80% (1)    100% (12)  100% (17)  $3.97    14    35
7  claude-sonnet-4-6 (Anthropic)  53.2          47.59 (44)   101.8 (6)  81% (1)    85% (20)   96% (24)   $81.79   1     55

Methodology

  • Composite Em: weighted blend across the three measurement axes ScoreCrux currently runs. Weights are 50% scale (best Cx Em), 30% intelligence (IQ mean normalised so 80→0, 130→1 then scaled ×100), 20% coding (compositeScore × 100). Max possible = 100.
  • Scale best: highest Cx Em across all projects (alpha / beta / gamma / delta) and arms (C0 / C2 / F1 / T1 / T2 / T3).
  • IQ mean: arithmetic mean of Full-Scale IQ across every intelligence run. 95% CI narrows as N grows.
  • Safe: S_gate pass rate on bare / context-only runs (C-arms + F1). Safety is measured only on benchmarks that include destructive-action traps; Beta is the strictest test.
  • T Safe: S_gate pass rate on tool-armed runs (T1 / T2 / T3) where the agent has memory + constraint-check tools. Expect this to be ≥ Safe; when it isn't, the tools aren't doing their job.
  • Em/$: composite_em ÷ sum of all run costs. Higher is better. Today's spread is two orders of magnitude.
  • Source data: every run visible on the scale / intelligence / coding / community leaderboards. Re-verify any row via the "copy cmd" button on the per-benchmark pages.
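
The Composite Em and Em/$ definitions above can be sketched in a few lines of Python. This is a minimal illustration built only from the stated weights and IQ anchors, not ScoreCrux's actual implementation. The function names `composite_em` and `em_per_dollar` are hypothetical, the coding score is assumed to be a fraction (0.80 for 80%), and no clamping or capping is applied because the methodology does not say whether any is used.

```python
# Weights and IQ anchors as stated in the Methodology list.
W_SCALE, W_IQ, W_CODE = 0.50, 0.30, 0.20
IQ_FLOOR, IQ_CEIL = 80.0, 130.0  # IQ 80 maps to 0, IQ 130 maps to 1


def composite_em(scale_best: float, iq_mean: float, code_best: float) -> float:
    """Blend the three axes into one score.

    scale_best: best Cx Em across projects and arms
    iq_mean:    mean Full-Scale IQ
    code_best:  coding compositeScore as a fraction (assumption: 0.80 == 80%)
    """
    iq_component = (iq_mean - IQ_FLOOR) / (IQ_CEIL - IQ_FLOOR) * 100
    code_component = code_best * 100
    return W_SCALE * scale_best + W_IQ * iq_component + W_CODE * code_component


def em_per_dollar(composite: float, run_costs: list[float]) -> float:
    """Composite Em divided by the summed cost of all runs, in dollars."""
    return composite / sum(run_costs)


# Inputs taken from the claude-haiku-4-5 row of the table.
score = composite_em(scale_best=132.44, iq_mean=102.0, code_best=0.80)
print(round(score, 2))                            # 0.5*132.44 + 0.3*44 + 0.2*80
print(round(em_per_dollar(95.4, [7.20]), 2))      # table shows this rounded to 13
```

For the claude-haiku-4-5 row this gives about 95.42, in line with the published 95.4. Other rows may differ by a few tenths because the table shows rounded inputs.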