Intelligence Benchmark

How smart is your model?

A psychometric intelligence benchmark built on Item Response Theory with Cattell-Horn-Carroll cognitive factor mapping. Measures reasoning — not recall, not retrieval, not memorised benchmarks. Produces an IQ-equivalent composite score with confidence intervals.

Every task is self-contained: all information needed to solve it is in the prompt. No encyclopaedic knowledge. No web access. No hidden memory advantages. Pure reasoning ability, measured fairly across architectures.

IQ-Equivalent Scale

Wechsler Classification Bands

IQ range | Classification
-------- | --------------
< 70     | Very Low
70–79    | Low
80–89    | Low Average
90–109   | Average
110–119  | High Average
120–129  | Superior
130+     | Very Superior
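
As a rough illustration, the band lookup can be sketched in a few lines of TypeScript. The cut-offs and labels are copied from the scale above; the names `BANDS` and `classifyIq`, and the choice to round a fractional score to the nearest integer before lookup, are assumptions made for this sketch and not part of the benchmark's published code.

```typescript
// Band labels as listed in the scale above.
type Band =
  | "Very Low"
  | "Low"
  | "Low Average"
  | "Average"
  | "High Average"
  | "Superior"
  | "Very Superior";

// Lower bounds, highest band first. Anything below 70 falls through to "Very Low".
const BANDS: ReadonlyArray<{ min: number; label: Band }> = [
  { min: 130, label: "Very Superior" },
  { min: 120, label: "Superior" },
  { min: 110, label: "High Average" },
  { min: 90, label: "Average" },
  { min: 80, label: "Low Average" },
  { min: 70, label: "Low" },
];

// Assumption: fractional scores are rounded to the nearest integer first.
function classifyIq(iq: number): Band {
  const score = Math.round(iq);
  for (const band of BANDS) {
    if (score >= band.min) return band.label;
  }
  return "Very Low";
}

console.log(classifyIq(111)); // High Average
console.log(classifyIq(98)); // Average
```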

Rankings

Leaderboard

Showing 7 stacked configs from 24 underlying runs.

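# | Model                            | IQ  | Correct | 95% CI  | Runs | Cost  | Time
- | -------------------------------- | --- | ------- | ------- | ---- | ----- | ----
1 | claude-opus-4-6 (CueCrux Labs)   | 111 | 11/18   | 96–127  | 1    | $0.86 | 2.8m
2 | claude-opus-4-7* (CueCrux Labs)  | 111 | 11/18   | 102–115 | 6    | $2.21 | 1.3m
3 | claude-sonnet-4-6* (CueCrux Labs)| 106 | 10/18   | 96–108  | 6    | $0.90 | 2.1m
4 | gpt-5.4* (CueCrux Labs)          | 106 | 10/18   | 99–114  | 4    | $0.24 | 1.3m
5 | gpt-5.4-mini (CueCrux Labs)      | 106 | 10/18   | 91–122  | 1    | $0.01 | 32s
6 | claude-haiku-4-5* (CueCrux Labs) | 98  | 8/18    | 95–109  | 5    | $0.24 | 1.3m
7 | gpt-4.1-mini (CueCrux Labs)      | 98  | 8/18    | 83–114  | 1    | $0.01 | 56s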

Design Principles

What this benchmark tests

Reasoning, not recall

Tasks are self-contained. All information needed is in the prompt. No factual knowledge, no web access, no hidden memory.

Architecture-neutral

No provider-specific prompt wrappers. Entrants declare an adapter. The task and scoring are identical for all.

Psychometrically grounded

Maps to CHC cognitive factors used in human IQ testing. IRT calibration produces statistically meaningful ability estimates.

Contamination-resistant

Synthetic names, procedurally generated variants, hidden holdout pool. No classic puzzles. No training data leakage.

Cognitive Factors

CHC Factor Mapping

Six reasoning categories map to four broad Cattell-Horn-Carroll stratum-II cognitive factors. Categories A, D, and E together constitute a strong measure of fluid intelligence (Gf), the factor that is the single strongest predictor of general intelligence.

Cat | Label                        | Primary Factor                | Secondary | Weight | Task Types
--- | ---------------------------- | ----------------------------- | --------- | ------ | ----------
A   | Deduction & Elimination      | Gf (Fluid Reasoning)          |           | 1.0    | Logic grids, process of elimination, who-sits-where puzzles
B   | Stateful Process Reasoning   | Gwm (Working Memory)          |           | 1.0    | Variables updating each round, state tracking across steps
C   | Rule Application             | Gc (Comprehension-Knowledge)  | Gf (0.4)  | 0.6    | Apply a policy or rulebook to a scenario
D   | Causal & Counterfactual      | Gf (Fluid Reasoning)          |           | 1.0    | What happens next, what changes if X is removed
E   | Abstraction & Transformation | Gf (Fluid Reasoning)          |           | 1.0    | Symbol transforms, sequence rules, Raven's-like patterns
F   | Planning Under Constraints   | Gs (Processing Speed)         | Gf (0.4)  | 0.6    | Schedule tasks under dependencies and capacity limits
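
One way to read the table is as a set of factor loadings: each category contributes to its primary factor with the listed weight and, for C and F, to Gf with the secondary weight of 0.4. The sketch below turns per-category scores into per-factor scores under that reading. The aggregation rule (a loading-weighted mean), the 0-to-1 score range, and the names `LOADINGS` and `factorScores` are assumptions for illustration; the page does not state how the benchmark combines categories.

```typescript
type Category = "A" | "B" | "C" | "D" | "E" | "F";
type Factor = "Gf" | "Gwm" | "Gc" | "Gs";

// Loadings transcribed from the table above (primary weight, then secondary Gf weight).
const LOADINGS: ReadonlyArray<{ category: Category; factor: Factor; weight: number }> = [
  { category: "A", factor: "Gf", weight: 1.0 },
  { category: "B", factor: "Gwm", weight: 1.0 },
  { category: "C", factor: "Gc", weight: 0.6 },
  { category: "C", factor: "Gf", weight: 0.4 },
  { category: "D", factor: "Gf", weight: 1.0 },
  { category: "E", factor: "Gf", weight: 1.0 },
  { category: "F", factor: "Gs", weight: 0.6 },
  { category: "F", factor: "Gf", weight: 0.4 },
];

// Assumed rule: each factor score is the loading-weighted mean of the
// category scores (each in [0, 1]) that load on it.
function factorScores(scores: Record<Category, number>): Record<Factor, number> {
  const sum: Record<Factor, number> = { Gf: 0, Gwm: 0, Gc: 0, Gs: 0 };
  const total: Record<Factor, number> = { Gf: 0, Gwm: 0, Gc: 0, Gs: 0 };
  for (const { category, factor, weight } of LOADINGS) {
    sum[factor] += weight * scores[category];
    total[factor] += weight;
  }
  return {
    Gf: sum.Gf / total.Gf,
    Gwm: sum.Gwm / total.Gwm,
    Gc: sum.Gc / total.Gc,
    Gs: sum.Gs / total.Gs,
  };
}

// Hypothetical category scores, for illustration only.
console.log(factorScores({ A: 1, B: 0, C: 0, D: 1, E: 1, F: 0 }));
```

Under this reading Gf draws on five of the six categories, which is consistent with the page's description of A, D, and E as the core fluid-reasoning measure.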

Scoring

Three-Layer Methodology

1. Per-Item Scoring

Each task is scored on correctness (70%), trace consistency (15%), constraint adherence (10%), and output compliance (5%).
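
The weighting can be sketched as a simple weighted sum. The four weights come from the text; the assumption that every component is graded on a 0-to-1 scale, and the names `ItemComponents`, `WEIGHTS`, and `itemScore`, are illustrative only.

```typescript
// Component grades for one task, each assumed to lie in [0, 1].
interface ItemComponents {
  correctness: number;
  traceConsistency: number;
  constraintAdherence: number;
  outputCompliance: number;
}

// Weights as stated in the scoring description.
const WEIGHTS: ItemComponents = {
  correctness: 0.7,
  traceConsistency: 0.15,
  constraintAdherence: 0.1,
  outputCompliance: 0.05,
};

// Weighted sum of the four components; the result is also in [0, 1].
function itemScore(c: ItemComponents): number {
  return (
    WEIGHTS.correctness * c.correctness +
    WEIGHTS.traceConsistency * c.traceConsistency +
    WEIGHTS.constraintAdherence * c.constraintAdherence +
    WEIGHTS.outputCompliance * c.outputCompliance
  );
}

// A wrong final answer with an otherwise flawless response scores about 0.30.
console.log(
  itemScore({ correctness: 0, traceConsistency: 1, constraintAdherence: 1, outputCompliance: 1 })
);
```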


2. IRT Ability Estimation

Each item has calibrated difficulty (b) and discrimination (a) parameters. After scoring, a latent ability parameter (theta) is estimated via Maximum Likelihood, with Expected A Posteriori fallback for edge cases.

P(correct | θ) = 1 / (1 + e^(-a(θ - b)))
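
The estimation step can be sketched as follows. This is a minimal illustration under several assumptions that the page does not confirm: responses are treated as dichotomous (correct or incorrect), the maximum-likelihood estimate is found by a plain grid search over θ in [-4, 4], the "edge cases" that trigger the EAP fallback are all-correct and all-incorrect response patterns (where the likelihood has no finite maximum), and the EAP prior is standard normal. All function and type names are hypothetical.

```typescript
interface Item {
  a: number; // discrimination
  b: number; // difficulty
}

// Two-parameter logistic model, as in the formula above.
function pCorrect(theta: number, item: Item): number {
  return 1 / (1 + Math.exp(-item.a * (theta - item.b)));
}

function logLikelihood(theta: number, items: Item[], responses: boolean[]): number {
  let ll = 0;
  items.forEach((item, i) => {
    const p = pCorrect(theta, item);
    ll += responses[i] ? Math.log(p) : Math.log(1 - p);
  });
  return ll;
}

interface ThetaEstimate {
  theta: number;
  se: number;
  method: "MLE" | "EAP";
}

function estimateTheta(items: Item[], responses: boolean[]): ThetaEstimate {
  const nCorrect = responses.filter(Boolean).length;
  const steps = 800; // grid of 801 points from -4 to 4
  const at = (i: number): number => -4 + i * 0.01;

  if (nCorrect === 0 || nCorrect === responses.length) {
    // EAP fallback: posterior mean and SD under an assumed N(0, 1) prior.
    let w0 = 0;
    let w1 = 0;
    let w2 = 0;
    for (let i = 0; i <= steps; i++) {
      const t = at(i);
      const w = Math.exp(logLikelihood(t, items, responses) - 0.5 * t * t);
      w0 += w;
      w1 += t * w;
      w2 += t * t * w;
    }
    const mean = w1 / w0;
    return { theta: mean, se: Math.sqrt(w2 / w0 - mean * mean), method: "EAP" };
  }

  // MLE: pick the grid point with the highest log-likelihood.
  let best = at(0);
  let bestLl = -Infinity;
  for (let i = 0; i <= steps; i++) {
    const ll = logLikelihood(at(i), items, responses);
    if (ll > bestLl) {
      bestLl = ll;
      best = at(i);
    }
  }
  // Standard error from the 2PL test information: I(theta) = sum of a^2 * p * (1 - p).
  const info = items.reduce((acc, item) => {
    const p = pCorrect(best, item);
    return acc + item.a * item.a * p * (1 - p);
  }, 0);
  return { theta: best, se: 1 / Math.sqrt(info), method: "MLE" };
}

// Hypothetical item parameters, for illustration only.
const demoItems: Item[] = [{ a: 1, b: -1 }, { a: 1, b: 1 }];
console.log(estimateTheta(demoItems, [true, false]));
console.log(estimateTheta(demoItems, [true, true]));
```

A production implementation would more likely use Newton-Raphson or a library optimiser than a grid, but the grid keeps the sketch short and avoids convergence handling.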

3. IQ Conversion

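Theta is transformed to an IQ-equivalent scale with mean 100 and standard deviation 15 (the Wechsler convention). The scale is normed against a population of models: μ and σ are the mean and standard deviation of theta in that population. The 95% confidence interval is derived from the standard error of the theta estimate.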

IQ = 100 + 15 × (θ - μ) / σ
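
A minimal sketch of the conversion, including the confidence interval, might look like this. It assumes a symmetric normal-approximation interval (θ ± 1.96 standard errors, rescaled to IQ points); the page does not say how intervals for stacked multi-run configurations are computed, so this may differ from the leaderboard's method. The names `Norms` and `toIq` and the example values are hypothetical.

```typescript
// Mean and SD of theta in the norming population (mu and sigma in the formula above).
interface Norms {
  mu: number;
  sigma: number;
}

function toIq(theta: number, se: number, norms: Norms): { iq: number; ci95: [number, number] } {
  const scale = 15 / norms.sigma; // IQ points per unit of theta
  const iq = 100 + scale * (theta - norms.mu);
  const halfWidth = 1.96 * scale * se; // assumed normal-approximation 95% interval
  return { iq, ci95: [iq - halfWidth, iq + halfWidth] };
}

// Illustrative values only: theta = 0.5, SE = 0.4, norms mu = 0, sigma = 1.
// iq = 107.5, ci95 ≈ [95.74, 119.26]
console.log(toIq(0.5, 0.4, { mu: 0, sigma: 1 }));
```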

Fair Comparison

Run Modes

Every result declares its run mode. Different modes are never directly compared on the same leaderboard.

Closed Prompt

No tools, no internet, no memory. Pure reasoning from the prompt alone.

Tools: none; Web: none; Memory: none

Local Tooling

Local execution tools allowed. No internet access.

Tools: local only; Web: none; Memory: none

Open Tooling

Tools and web access allowed within declared rules.

Tools: allowed; Web: allowed; Memory: optional

Custom Harness

Entrant supplies own orchestration within declared limits.

Tools: declared; Web: declared; Memory: declared
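
The rule that different modes never share a leaderboard can be sketched as a grouping step. The mode identifiers, the `Permission` values, and the `RunResult` shape below are hypothetical names chosen for this illustration; only the four modes and their tool, web, and memory rules come from the descriptions above.

```typescript
type RunMode = "closed-prompt" | "local-tooling" | "open-tooling" | "custom-harness";
type Permission = "none" | "local" | "allowed" | "optional" | "declared";

// What each run mode permits, transcribed from the mode descriptions.
const RUN_MODES: Record<RunMode, { tools: Permission; web: Permission; memory: Permission }> = {
  "closed-prompt": { tools: "none", web: "none", memory: "none" },
  "local-tooling": { tools: "local", web: "none", memory: "none" },
  "open-tooling": { tools: "allowed", web: "allowed", memory: "optional" },
  "custom-harness": { tools: "declared", web: "declared", memory: "declared" },
};

interface RunResult {
  model: string;
  mode: RunMode;
  iq: number;
}

// Build one ranking per run mode so results from different modes are never mixed.
function leaderboards(results: RunResult[]): Map<RunMode, RunResult[]> {
  const boards = new Map<RunMode, RunResult[]>();
  for (const result of results) {
    const board = boards.get(result.mode) ?? [];
    board.push(result);
    boards.set(result.mode, board);
  }
  for (const board of boards.values()) {
    board.sort((x, y) => y.iq - x.iq); // highest IQ first
  }
  return boards;
}

// Made-up entries, for illustration only.
const boards = leaderboards([
  { model: "model-x", mode: "closed-prompt", iq: 101 },
  { model: "model-y", mode: "open-tooling", iq: 118 },
  { model: "model-z", mode: "closed-prompt", iq: 109 },
]);
console.log(RUN_MODES["closed-prompt"], boards.get("closed-prompt"));
```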

Benchmark Structure

18 items, 6 categories, 3 tiers

18 items per run: 3 per category across 3 difficulty tiers

6 reasoning categories: deduction, stateful, rules, causal, abstraction, planning

4 CHC cognitive factors: Gf (Fluid), Gwm (Working Memory), Gc (Comprehension), Gs (Speed)

Quick Start

Run the benchmark

CLI

# Run all 18 items (closed prompt mode)
npx tsx benchmarks/intelligence/run-intelligence.ts --model claude-sonnet-4-20250514

# Run specific categories
npx tsx benchmarks/intelligence/run-intelligence.ts --model gpt-5.4 --categories A,D,E

# Dry run (no API calls, validates fixture loading)
npx tsx benchmarks/intelligence/run-intelligence.ts --dry-run --verbose