Intelligence Benchmark

How smart is your model?

A psychometric intelligence benchmark built on Item Response Theory with Cattell-Horn-Carroll cognitive factor mapping. Measures reasoning — not recall, not retrieval, not memorised benchmarks. Produces an IQ-equivalent composite score with confidence intervals.

Every task is self-contained: all information needed to solve it is in the prompt. No encyclopaedic knowledge. No web access. No hidden memory advantages. Pure reasoning ability, measured fairly across architectures.


IQ-Equivalent Scale

Wechsler Classification Bands

IQ range	Classification
< 70	Very Low
70–79	Low
80–89	Low Average
90–109	Average
110–119	High Average
120–129	Superior
130+	Very Superior
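As a rough illustration, the bands listed above can be turned into a lookup. This is a minimal sketch: the function name `classifyIq` and the `Band` type are hypothetical, and only the band boundaries come from this page.

```typescript
// Band labels copied from the classification bands above.
type Band =
  | "Very Low"
  | "Low"
  | "Low Average"
  | "Average"
  | "High Average"
  | "Superior"
  | "Very Superior";

// Hypothetical helper: maps an IQ-equivalent score to its band.
// Each check is an exclusive upper bound, so 70 falls in "Low" and 130 in "Very Superior".
function classifyIq(iq: number): Band {
  if (iq < 70) return "Very Low";
  if (iq < 80) return "Low";
  if (iq < 90) return "Low Average";
  if (iq < 110) return "Average";
  if (iq < 120) return "High Average";
  if (iq < 130) return "Superior";
  return "Very Superior";
}
```

Under this sketch a composite of 111 lands in High Average and a composite of 98 in Average.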

Rankings

Leaderboard

Showing 9 stacked configs from 31 underlying runs.

#	Model	IQ	Correct	95% CI	Runs	Cost	Time
1	claude-opus-4-6 (CueCrux Labs)	111	11/18	96-127	1	$0.86	2.8m
2	claude-opus-4-7* (CueCrux Labs)	111	11/18	102-115	6	$2.21	1.3m
3	pending:claude-fable-5 (pending review; CueCrux Labs)	111	11/18	96-127	1	$1.07	1.5m
4	claude-sonnet-4-6* (CueCrux Labs)	106	10/18	97-108	9	$0.90	--
5	gpt-5.4* (CueCrux Labs)	106	10/18	99-114	4	$0.24	1.3m
6	gpt-5.4-mini (CueCrux Labs)	106	10/18	91-122	1	$0.01	32s
7	gpt-5.5* (CueCrux Labs)	102	9/18	96-114	3	$0.00	--
8	claude-haiku-4-5* (CueCrux Labs)	98	8/18	95-109	5	$0.24	1.3m
9	gpt-4.1-mini (CueCrux Labs)	98	8/18	83-114	1	$0.01	56s

Design Principles

What this benchmark tests

Reasoning, not recall

Tasks are self-contained. All information needed is in the prompt. No factual knowledge, no web access, no hidden memory.

Architecture-neutral

No provider-specific prompt wrappers. Entrants declare an adapter. The task and scoring are identical for all.

Psychometrically grounded

Maps to CHC cognitive factors used in human IQ testing. IRT calibration produces statistically meaningful ability estimates.

Contamination-resistant

Synthetic names, procedurally generated variants, hidden holdout pool. No classic puzzles. No training data leakage.

Cognitive Factors

CHC Factor Mapping

Six reasoning categories map to four broad Cattell-Horn-Carroll stratum-II cognitive factors. Categories A, D, and E together form a strong measure of fluid intelligence (Gf), the broad factor most closely associated with general intelligence.

Cat	Label	Primary Factor	Secondary	Weight	Task Types
A	Deduction & Elimination	Gf (Fluid Reasoning)	--	1.0	Logic grids, process of elimination, who-sits-where puzzles
B	Stateful Process Reasoning	Gwm (Working Memory)	--	1.0	Variables updating each round, state tracking across steps
C	Rule Application	Gc (Comprehension-Knowledge)	Gf (0.4)	0.6	Apply a policy or rulebook to a scenario
D	Causal & Counterfactual	Gf (Fluid Reasoning)	--	1.0	What happens next, what changes if X is removed
E	Abstraction & Transformation	Gf (Fluid Reasoning)	--	1.0	Symbol transforms, sequence rules, Raven's-like patterns
F	Planning Under Constraints	Gs (Processing Speed)	Gf (0.4)	0.6	Schedule tasks under dependencies and capacity limits
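The mapping in the table can be written down as data. The sketch below does that and, as one possible reading, treats the Weight and Secondary values as loadings in a weighted mean of per-category scores. That aggregation rule is an assumption made for illustration; this page does not say how factor scores are computed. The names `CHC_MAP` and `factorScores` are hypothetical.

```typescript
type Category = "A" | "B" | "C" | "D" | "E" | "F";
type Factor = "Gf" | "Gwm" | "Gc" | "Gs";

interface CategoryMapping {
  label: string;
  primary: Factor;
  weight: number; // "Weight" column: loading on the primary factor
  secondary?: { factor: Factor; weight: number };
}

// Values transcribed from the CHC factor table above.
const CHC_MAP: Record<Category, CategoryMapping> = {
  A: { label: "Deduction & Elimination", primary: "Gf", weight: 1.0 },
  B: { label: "Stateful Process Reasoning", primary: "Gwm", weight: 1.0 },
  C: { label: "Rule Application", primary: "Gc", weight: 0.6, secondary: { factor: "Gf", weight: 0.4 } },
  D: { label: "Causal & Counterfactual", primary: "Gf", weight: 1.0 },
  E: { label: "Abstraction & Transformation", primary: "Gf", weight: 1.0 },
  F: { label: "Planning Under Constraints", primary: "Gs", weight: 0.6, secondary: { factor: "Gf", weight: 0.4 } },
};

// ASSUMPTION: factor score = loading-weighted mean of category scores in [0, 1].
function factorScores(scores: Record<Category, number>): Record<Factor, number> {
  const num: Record<Factor, number> = { Gf: 0, Gwm: 0, Gc: 0, Gs: 0 };
  const den: Record<Factor, number> = { Gf: 0, Gwm: 0, Gc: 0, Gs: 0 };
  for (const cat of Object.keys(CHC_MAP) as Category[]) {
    const m = CHC_MAP[cat];
    num[m.primary] += m.weight * scores[cat];
    den[m.primary] += m.weight;
    if (m.secondary) {
      num[m.secondary.factor] += m.secondary.weight * scores[cat];
      den[m.secondary.factor] += m.secondary.weight;
    }
  }
  return {
    Gf: num.Gf / den.Gf,
    Gwm: num.Gwm / den.Gwm,
    Gc: num.Gc / den.Gc,
    Gs: num.Gs / den.Gs,
  };
}
```

With this reading, Gf draws on five of the six categories (A, D, and E fully, C and F partially), which matches the emphasis the text places on fluid reasoning.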

Scoring

Three-Layer Methodology

1. Per-Item Scoring

Each task is scored on four weighted components:

Component	Weight
Correctness	70%
Trace Consistency	15%
Constraint Adherence	10%
Output Compliance	5%
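The weighting above amounts to a weighted sum. The following is a minimal sketch, assuming each component is graded on a 0 to 1 scale; that scale, the clamping, and the names `ItemComponents` and `itemScore` are assumptions for illustration rather than the benchmark's actual grader.

```typescript
// Component scores for one item, each assumed to lie in [0, 1].
interface ItemComponents {
  correctness: number;
  traceConsistency: number;
  constraintAdherence: number;
  outputCompliance: number;
}

// Weights as stated in the text: 70% / 15% / 10% / 5%.
const ITEM_WEIGHTS: ItemComponents = {
  correctness: 0.7,
  traceConsistency: 0.15,
  constraintAdherence: 0.1,
  outputCompliance: 0.05,
};

// Hypothetical helper: weighted sum of the four components, clamped to [0, 1].
function itemScore(c: ItemComponents): number {
  const clamp = (x: number): number => Math.min(1, Math.max(0, x));
  return (
    ITEM_WEIGHTS.correctness * clamp(c.correctness) +
    ITEM_WEIGHTS.traceConsistency * clamp(c.traceConsistency) +
    ITEM_WEIGHTS.constraintAdherence * clamp(c.constraintAdherence) +
    ITEM_WEIGHTS.outputCompliance * clamp(c.outputCompliance)
  );
}
```

Under these assumptions, an answer that is correct and well reasoned but ignores the required output format would score about 0.95, while a wrong answer with a flawless trace and format would top out at about 0.30.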

2. IRT Ability Estimation

Each item has calibrated difficulty (b) and discrimination (a) parameters. After scoring, a latent ability parameter (theta) is estimated via Maximum Likelihood, with Expected A Posteriori fallback for edge cases.

P(correct | θ) = 1 / (1 + e^(-a(θ - b)))
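To make the formula concrete, here is a small sketch of the two-parameter logistic model and a Newton-Raphson maximum-likelihood estimate of theta. It assumes binary responses (1 = correct, 0 = incorrect) and uses a simple clamp to keep all-correct or all-wrong patterns finite. The clamp is a stand-in, not the Expected A Posteriori fallback mentioned above, and the names `IrtItem`, `pCorrect`, and `estimateTheta` are hypothetical.

```typescript
// Item parameters as named in the text: a = discrimination, b = difficulty.
interface IrtItem {
  a: number;
  b: number;
}

// P(correct | theta) = 1 / (1 + e^(-a(theta - b)))
function pCorrect(theta: number, item: IrtItem): number {
  return 1 / (1 + Math.exp(-item.a * (theta - item.b)));
}

// Maximum-likelihood theta via Newton-Raphson on binary responses.
// ASSUMPTIONS: binary scoring, start at theta = 0, estimates clamped to [-4, 4].
function estimateTheta(
  items: IrtItem[],
  responses: number[],
  maxIter = 50,
): { theta: number; se: number } {
  let theta = 0;
  for (let iter = 0; iter < maxIter; iter++) {
    let gradient = 0; // d(log-likelihood)/d(theta)
    let information = 0; // test information = -d2(log-likelihood)/d(theta)2
    for (let j = 0; j < items.length; j++) {
      const p = pCorrect(theta, items[j]);
      gradient += items[j].a * (responses[j] - p);
      information += items[j].a ** 2 * p * (1 - p);
    }
    if (information < 1e-12) break;
    const step = gradient / information;
    theta = Math.max(-4, Math.min(4, theta + step));
    if (Math.abs(step) < 1e-8) break;
  }
  // Standard error from the test information at the final estimate.
  let info = 0;
  for (const item of items) {
    const p = pCorrect(theta, item);
    info += item.a ** 2 * p * (1 - p);
  }
  return { theta, se: 1 / Math.sqrt(info) };
}
```

When theta equals an item's difficulty b, the model gives a 50% chance of success regardless of a. The discrimination parameter only controls how sharply that probability changes around b.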

3. IQ Conversion

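Theta is transformed to an IQ-equivalent score on a scale with mean 100 and standard deviation 15 (the Wechsler convention), normed against model populations. In the formula below, μ and σ are taken to be the mean and standard deviation of theta in that norming population. The 95% confidence interval is derived from the standard error of the theta estimate.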

IQ = 100 + 15 × (θ - μ) / σ
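The conversion can be sketched as a linear rescaling, with the confidence interval obtained by rescaling the standard error of theta in the same way. The normal-approximation multiplier 1.96, the rounding to whole points, and the names `Norm` and `thetaToIq` are assumptions of this sketch; the norming values used by the benchmark are not given on this page.

```typescript
// Mean and SD of theta in the norming population (values are not published here).
interface Norm {
  mu: number;
  sigma: number;
}

interface IqEstimate {
  iq: number;
  ciLow: number;
  ciHigh: number;
}

// IQ = 100 + 15 * (theta - mu) / sigma, with a 95% interval of +/- 1.96 standard errors.
function thetaToIq(theta: number, se: number, norm: Norm): IqEstimate {
  const scale = 15 / norm.sigma; // IQ points per unit of theta
  const iq = 100 + scale * (theta - norm.mu);
  const halfWidth = 1.96 * scale * se;
  return {
    iq: Math.round(iq),
    ciLow: Math.round(iq - halfWidth),
    ciHigh: Math.round(iq + halfWidth),
  };
}
```

For example, with a hypothetical norm of mu = 0 and sigma = 1, a theta of 0 with a standard error of 0.5 maps to an IQ-equivalent of 100 with an interval of 85 to 115. The width of that interval shows why single-run results on an 18-item test carry wide confidence bands.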

Fair Comparison

Run Modes

Every result declares its run mode. Different modes are never directly compared on the same leaderboard.

Closed Prompt

No tools, no internet, no memory. Pure reasoning from the prompt alone.

No tools · No web · No memory

Local Tooling

Local execution tools allowed. No internet access.

Local tools · No web · No memory

Open Tooling

Tools and web access allowed within declared rules.

Tools · Web · Optional memory

Custom Harness

Entrant supplies own orchestration within declared limits.

Tools: declared · Web: declared · Memory: declared
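One way to enforce the rule that modes are never mixed is to carry the run mode on every result and build a separate ranking per mode. The sketch below assumes hypothetical identifiers (`RunModeId`, `RUN_MODES`, `RunResult`, `groupByMode`); only the four modes and their tool, web, and memory policies come from this page.

```typescript
type RunModeId = "closed-prompt" | "local-tooling" | "open-tooling" | "custom-harness";

interface RunModePolicy {
  tools: "none" | "local" | "allowed" | "declared";
  web: "none" | "allowed" | "declared";
  memory: "none" | "optional" | "declared";
}

// Policies transcribed from the four run modes described above.
const RUN_MODES: Record<RunModeId, RunModePolicy> = {
  "closed-prompt": { tools: "none", web: "none", memory: "none" },
  "local-tooling": { tools: "local", web: "none", memory: "none" },
  "open-tooling": { tools: "allowed", web: "allowed", memory: "optional" },
  "custom-harness": { tools: "declared", web: "declared", memory: "declared" },
};

// Hypothetical result record: every result declares its run mode.
interface RunResult {
  model: string;
  mode: RunModeId;
  iq: number;
}

// Builds one ranking per run mode so that modes are never compared directly.
function groupByMode(results: RunResult[]): Map<RunModeId, RunResult[]> {
  const boards = new Map<RunModeId, RunResult[]>();
  for (const result of results) {
    const board = boards.get(result.mode) ?? [];
    board.push(result);
    boards.set(result.mode, board);
  }
  // Highest IQ-equivalent first within each mode.
  boards.forEach((board) => board.sort((x, y) => y.iq - x.iq));
  return boards;
}
```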

Benchmark Structure

18 items, 6 categories, 3 tiers

Items per run: 18 (3 per category, across 3 difficulty tiers)

Reasoning categories: 6 (deduction, stateful, rules, causal, abstraction, planning)

CHC cognitive factors: 4 (Gf Fluid, Gwm Working Memory, Gc Comprehension, Gs Speed)
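The structure above is a simple grid of categories by tiers. As a minimal sketch, the slots in a run can be enumerated as follows; the names `ItemSlot` and `planRun` are hypothetical, and the assumption that each category contributes exactly one item per tier is inferred from "3 per category across 3 difficulty tiers".

```typescript
const CATEGORIES = ["A", "B", "C", "D", "E", "F"] as const;
type Category = (typeof CATEGORIES)[number];

const TIERS = [1, 2, 3] as const;
type Tier = (typeof TIERS)[number];

interface ItemSlot {
  category: Category;
  tier: Tier;
}

// Enumerates one slot per (category, tier) pair; defaults to all six categories.
function planRun(selected: readonly Category[] = CATEGORIES): ItemSlot[] {
  const slots: ItemSlot[] = [];
  for (const category of selected) {
    for (const tier of TIERS) {
      slots.push({ category, tier });
    }
  }
  return slots;
}
```

A full run yields 18 slots. Restricting the plan to categories A, D, and E yields 9, which corresponds to a fluid-reasoning-only subset.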

Quick Start

Run the benchmark

# Run all 18 items (closed prompt mode)
npx tsx benchmarks/intelligence/run-intelligence.ts --model claude-sonnet-4-20250514

# Run specific categories
npx tsx benchmarks/intelligence/run-intelligence.ts --model gpt-5.4 --categories A,D,E

# Dry run (no API calls, validates fixture loading)
npx tsx benchmarks/intelligence/run-intelligence.ts --dry-run --verbose