Composite Ranking

Frontier Models

One ranking across everything ScoreCrux measures: Scale (memory-backed agents), Intelligence (18-item psychometric IQ), and Coding, plus Em per dollar (Em/$) so you can see who delivers the most capability for the money.

#	Model	Composite Em	Scale best	IQ mean	Code best	Safe	T Safe	Spend	Em/$	Runs
1	claude-haiku-4-5 (Anthropic)	95.4	132.44 (27 runs)	102.0 (5 runs)	80% (1 run)	95% (20 runs)	100% (7 runs)	$7.20	13	37
2	gpt-5.4-mini (OpenAI)	77.5	91.18 (36 runs)	106.0 (1 run)	82% (1 run)	100% (23 runs)	100% (13 runs)	$1.69	46	38
3	claude-opus-4-6 (Anthropic)	61.8	53.77 (22 runs)	111.0 (1 run)	82% (1 run)	84% (19 runs)	100% (3 runs)	$141.47	0	24
4	gpt-5.4 (OpenAI)	60.5	56.92 (40 runs)	106.3 (4 runs)	82% (1 run)	96% (23 runs)	100% (17 runs)	$42.48	1	49
5	claude-opus-4-7 (Anthropic)	57.9	48.14 (10 runs)	108.3 (6 runs)	84% (1 run)	100% (6 runs)	100% (4 runs)	$34.48	2	17
6	gpt-4.1-mini (OpenAI)	55.3	56.81 (29 runs)	98.0 (1 run)	80% (1 run)	100% (12 runs)	100% (17 runs)	$3.97	14	35
7	claude-sonnet-4-6 (Anthropic)	53.5	47.59 (44 runs)	102.3 (9 runs)	81% (1 run)	85% (20 runs)	96% (24 runs)	$81.79	1	58
8	gpt-5.5 (OpenAI)	30.7	-- (0 runs)	104.7 (3 runs)	80% (1 run)	-- (0 runs)	-- (0 runs)	--	--	4
9	pending:claude-fable-5 (Other)	18.6	-- (0 runs)	111.0 (1 run)	-- (0 runs)	-- (0 runs)	-- (0 runs)	$1.07	17	1

Methodology

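Composite Em: weighted blend of the three measurement axes ScoreCrux currently runs. Weights are 50% Scale (best Cx Em), 30% Intelligence (IQ mean normalised linearly so that 80 maps to 0 and 130 maps to 1, then multiplied by 100), and 20% Coding (compositeScore × 100). The nominal maximum is 100 when every component is at most 100. Scale best can exceed 100 (claude-haiku-4-5 reaches 132.44), so the composite can in principle go higher.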
Scale best: highest Cx Em across all projects (alpha / beta / gamma / delta) and arms (C0 / C2 / F1 / T1 / T2 / T3).
IQ mean: arithmetic mean of Full-Scale IQ across every intelligence run. The 95% CI narrows as the number of runs (N) grows.
Safe: S_gate pass rate on bare / context-only runs (C-arms + F1). Safety is only measured on benchmarks that include destructive-action traps, and Beta is the strictest test.
T Safe: S_gate pass rate on tool-armed runs (T1 / T2 / T3) where the agent has memory + constraint-check tools. Expect this to be ≥ Safe; when it isn't, the tools aren't doing their job.
Em/$: composite_em ÷ sum of all run costs. Higher is better. The current spread is roughly two orders of magnitude, from about 0.4 (claude-opus-4-6, shown rounded as 0) to 46 (gpt-5.4-mini).
Source data: every run visible on the scale / intelligence / coding / community leaderboards. Re-verify any row via the "copy cmd" button on the per-benchmark pages.
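
The Composite Em and Em/$ definitions above can be sketched in a few lines of Python. This is a minimal illustration, not ScoreCrux's actual implementation: the function and constant names are hypothetical, the IQ normalisation is assumed to be unclamped, and a missing axis is assumed to contribute 0 (which is consistent with the gpt-5.5 and pending:claude-fable-5 rows, but not confirmed by the source).

```python
from typing import Optional

# Weights and IQ bounds as stated in the Methodology section.
W_SCALE, W_IQ, W_CODE = 0.5, 0.3, 0.2
IQ_FLOOR, IQ_CEIL = 80.0, 130.0


def composite_em(
    scale_best: Optional[float],
    iq_mean: Optional[float],
    code_score: Optional[float],
) -> float:
    """Blend the three axes into one score.

    scale_best: best Cx Em (already on a roughly 0-100 scale).
    iq_mean:    mean Full-Scale IQ; mapped linearly so 80 -> 0 and 130 -> 100.
    code_score: coding compositeScore as a fraction (0.80 for 80%).
    Assumption: an axis with no runs (None) contributes 0.
    """
    scale_part = W_SCALE * scale_best if scale_best is not None else 0.0
    iq_part = 0.0
    if iq_mean is not None:
        iq_part = W_IQ * (iq_mean - IQ_FLOOR) / (IQ_CEIL - IQ_FLOOR) * 100.0
    code_part = W_CODE * code_score * 100.0 if code_score is not None else 0.0
    return scale_part + iq_part + code_part


def em_per_dollar(composite: float, spend: Optional[float]) -> Optional[float]:
    """Composite Em divided by total run cost; None when no spend is recorded."""
    if not spend:
        return None
    return composite / spend


if __name__ == "__main__":
    # Row 1 (claude-haiku-4-5): Scale 132.44, IQ 102.0, Code 80%, Spend $7.20.
    haiku = composite_em(132.44, 102.0, 0.80)
    print(round(haiku, 1))                        # 95.4
    print(round(em_per_dollar(95.4, 7.20), 2))    # 13.25

    # Row 9 (pending:claude-fable-5): only an IQ run is available.
    print(round(composite_em(None, 111.0, None), 1))  # 18.6
```

Plugging in the published Scale, IQ, and Code values reproduces the first and last table rows to one decimal place. Other rows can differ by about 0.1 because the Code column is shown as a rounded percentage.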