Intelligence Benchmark
How smart is your model?
A psychometric intelligence benchmark built on Item Response Theory with Cattell-Horn-Carroll cognitive factor mapping. Measures reasoning — not recall, not retrieval, not memorised benchmarks. Produces an IQ-equivalent composite score with confidence intervals.
Every task is self-contained: all information needed to solve it is in the prompt. No encyclopaedic knowledge. No web access. No hidden memory advantages. Pure reasoning ability, measured fairly across architectures.

IQ-Equivalent Scale
Wechsler Classification Bands
| IQ range | Classification |
|---|---|
| < 70 | Very Low |
| 70–79 | Low |
| 80–89 | Low Average |
| 90–109 | Average |
| 110–119 | High Average |
| 120–129 | Superior |
| 130+ | Very Superior |
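The bands above amount to a simple threshold lookup. A minimal sketch follows; the function name `classifyIq` is illustrative and not part of the benchmark's code, and treating fractional scores by their lower bound (79.5 falls in "Low") is an assumption.

```typescript
// Classification bands as listed in the table above.
type Band =
  | "Very Low"
  | "Low"
  | "Low Average"
  | "Average"
  | "High Average"
  | "Superior"
  | "Very Superior";

// Hypothetical helper: maps an IQ-equivalent score to its band.
function classifyIq(iq: number): Band {
  if (iq < 70) return "Very Low";
  if (iq < 80) return "Low";
  if (iq < 90) return "Low Average";
  if (iq < 110) return "Average";
  if (iq < 120) return "High Average";
  if (iq < 130) return "Superior";
  return "Very Superior";
}

console.log(classifyIq(111)); // "High Average"
console.log(classifyIq(98)); // "Average"
```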
Rankings
Leaderboard
Showing 7 stacked configs from 24 underlying runs.
| # | Model | IQ | Correct | 95% CI | Runs | Cost | Time |
|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-6 (CueCrux Labs) | 111 | 11/18 | 96-127 | 1 | $0.86 | 2.8m |
| 2 | claude-opus-4-7* (CueCrux Labs) | 111 | 11/18 | 102-115 | 6 | $2.21 | 1.3m |
| 3 | claude-sonnet-4-6* (CueCrux Labs) | 106 | 10/18 | 96-108 | 6 | $0.90 | 2.1m |
| 4 | gpt-5.4* (CueCrux Labs) | 106 | 10/18 | 99-114 | 4 | $0.24 | 1.3m |
| 5 | gpt-5.4-mini (CueCrux Labs) | 106 | 10/18 | 91-122 | 1 | $0.01 | 32s |
| 6 | claude-haiku-4-5* (CueCrux Labs) | 98 | 8/18 | 95-109 | 5 | $0.24 | 1.3m |
| 7 | gpt-4.1-mini (CueCrux Labs) | 98 | 8/18 | 83-114 | 1 | $0.01 | 56s |
Design Principles
What this benchmark tests
Reasoning, not recall
Tasks are self-contained. All information needed is in the prompt. No factual knowledge, no web access, no hidden memory.
Architecture-neutral
No provider-specific prompt wrappers. Entrants declare an adapter. The task and scoring are identical for all.
Psychometrically grounded
Maps to CHC cognitive factors used in human IQ testing. IRT calibration produces statistically meaningful ability estimates.
Contamination-resistant
Synthetic names, procedurally generated variants, hidden holdout pool. No classic puzzles. No training data leakage.
Cognitive Factors
CHC Factor Mapping
Six reasoning categories map to four broad Cattell-Horn-Carroll stratum-II cognitive factors. Categories A, D, and E together constitute a strong fluid intelligence (Gf) measure — the single strongest predictor of general intelligence.
| Cat | Label | Primary Factor | Secondary | Weight | Task Types |
|---|---|---|---|---|---|
| A | Deduction & Elimination | Gf (Fluid Reasoning) | none | 1.0 | Logic grids, process of elimination, who-sits-where puzzles |
| B | Stateful Process Reasoning | Gwm (Working Memory) | none | 1.0 | Variables updating each round, state tracking across steps |
| C | Rule Application | Gc (Comprehension-Knowledge) | Gf (0.4) | 0.6 | Apply a policy or rulebook to a scenario |
| D | Causal & Counterfactual | Gf (Fluid Reasoning) | none | 1.0 | What happens next, what changes if X is removed |
| E | Abstraction & Transformation | Gf (Fluid Reasoning) | none | 1.0 | Symbol transforms, sequence rules, Raven's-like patterns |
| F | Planning Under Constraints | Gs (Processing Speed) | Gf (0.4) | 0.6 | Schedule tasks under dependencies and capacity limits |
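One way to read the table is as a set of loadings: each category contributes to its primary factor with the listed weight and to its secondary factor (if any) with the bracketed weight. The sketch below computes factor scores as loading-weighted averages of category scores. This aggregation rule, the `factorScores` name, and the 0-to-1 category score scale are all assumptions for illustration; the source does not state how factor scores are actually derived.

```typescript
type Factor = "Gf" | "Gwm" | "Gc" | "Gs";
type Category = "A" | "B" | "C" | "D" | "E" | "F";

const CATEGORIES: Category[] = ["A", "B", "C", "D", "E", "F"];

// Loadings copied from the mapping table (primary weight, then secondary if present).
const LOADINGS: Record<Category, Array<[Factor, number]>> = {
  A: [["Gf", 1.0]],
  B: [["Gwm", 1.0]],
  C: [["Gc", 0.6], ["Gf", 0.4]],
  D: [["Gf", 1.0]],
  E: [["Gf", 1.0]],
  F: [["Gs", 0.6], ["Gf", 0.4]],
};

// Assumed aggregation: weighted mean of category scores per factor.
function factorScores(categoryScores: Record<Category, number>): Record<Factor, number> {
  const sum: Record<Factor, number> = { Gf: 0, Gwm: 0, Gc: 0, Gs: 0 };
  const weight: Record<Factor, number> = { Gf: 0, Gwm: 0, Gc: 0, Gs: 0 };
  for (const cat of CATEGORIES) {
    for (const [factor, w] of LOADINGS[cat]) {
      sum[factor] += w * categoryScores[cat];
      weight[factor] += w;
    }
  }
  return {
    Gf: sum.Gf / weight.Gf,
    Gwm: sum.Gwm / weight.Gwm,
    Gc: sum.Gc / weight.Gc,
    Gs: sum.Gs / weight.Gs,
  };
}

// Example with made-up category scores (fraction of items correct).
console.log(factorScores({ A: 1, B: 0.5, C: 0.5, D: 1, E: 0.5, F: 0 }));
```

Under this reading, Gf draws on five of the six categories, which is consistent with the table giving it the broadest coverage.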
Scoring
Three-Layer Methodology
1. Per-Item Scoring
Each task is scored on four weighted components: correctness (70%), trace consistency (15%), constraint adherence (10%), and output compliance (5%).
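As a rough illustration, the per-item score can be written as a weighted sum of the four components. The interface and function names below are hypothetical, and the assumption that each component is graded on a 0-to-1 scale is not stated in the source.

```typescript
// Component scores for one item, each assumed to lie in [0, 1].
interface ItemComponents {
  correctness: number;
  traceConsistency: number;
  constraintAdherence: number;
  outputCompliance: number;
}

// Weights as stated in the scoring description.
const WEIGHTS: ItemComponents = {
  correctness: 0.7,
  traceConsistency: 0.15,
  constraintAdherence: 0.1,
  outputCompliance: 0.05,
};

// Hypothetical helper: weighted sum of the four components.
function itemScore(c: ItemComponents): number {
  return (
    WEIGHTS.correctness * c.correctness +
    WEIGHTS.traceConsistency * c.traceConsistency +
    WEIGHTS.constraintAdherence * c.constraintAdherence +
    WEIGHTS.outputCompliance * c.outputCompliance
  );
}

// A correct answer with a consistent trace, half the constraints met, and valid output.
const example = itemScore({
  correctness: 1,
  traceConsistency: 1,
  constraintAdherence: 0.5,
  outputCompliance: 1,
});
console.log(example.toFixed(2)); // "0.95"
```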
2. IRT Ability Estimation
Each item has calibrated difficulty (b) and discrimination (a) parameters. After scoring, a latent ability parameter (theta) is estimated via Maximum Likelihood, with Expected A Posteriori fallback for edge cases.
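The estimation step can be sketched with a two-parameter logistic (2PL) model and a grid search over theta. Everything here is an assumption made for illustration: the grid range of -4 to 4, the standard normal prior used for the Expected A Posteriori fallback, the choice to treat all-correct and all-incorrect response patterns as the "edge cases", and all function names. The benchmark's own estimator may differ.

```typescript
// Item parameters: a = discrimination, b = difficulty.
interface Item {
  a: number;
  b: number;
}

// 2PL probability of a correct response at ability theta.
function pCorrect(theta: number, item: Item): number {
  return 1 / (1 + Math.exp(-item.a * (theta - item.b)));
}

// Log-likelihood of a response pattern (true = correct) at ability theta.
function logLikelihood(theta: number, items: Item[], responses: boolean[]): number {
  let ll = 0;
  items.forEach((item, i) => {
    const p = pCorrect(theta, item);
    ll += responses[i] ? Math.log(p) : Math.log(1 - p);
  });
  return ll;
}

// Assumed search grid: theta from -4 to 4 in steps of 0.01.
function thetaGrid(): number[] {
  const grid: number[] = [];
  for (let i = 0; i <= 800; i++) grid.push(-4 + i * 0.01);
  return grid;
}

function estimateTheta(
  items: Item[],
  responses: boolean[],
): { theta: number; method: "ML" | "EAP" } {
  const grid = thetaGrid();
  const nCorrect = responses.filter(Boolean).length;

  // With all-correct or all-incorrect patterns the likelihood has no finite
  // maximum, so fall back to EAP with an assumed standard normal prior.
  if (nCorrect === 0 || nCorrect === responses.length) {
    let num = 0;
    let den = 0;
    for (const t of grid) {
      const w = Math.exp(logLikelihood(t, items, responses) - 0.5 * t * t);
      num += t * w;
      den += w;
    }
    return { theta: num / den, method: "EAP" };
  }

  // Otherwise take the grid point with the highest likelihood.
  let best = grid[0];
  let bestLl = -Infinity;
  for (const t of grid) {
    const ll = logLikelihood(t, items, responses);
    if (ll > bestLl) {
      bestLl = ll;
      best = t;
    }
  }
  return { theta: best, method: "ML" };
}

// Standard error from test information: I(theta) = sum of a^2 * p * (1 - p).
function standardError(theta: number, items: Item[]): number {
  const info = items.reduce((acc, item) => {
    const p = pCorrect(theta, item);
    return acc + item.a * item.a * p * (1 - p);
  }, 0);
  return 1 / Math.sqrt(info);
}

// Made-up item parameters for illustration only.
const demoItems: Item[] = [{ a: 1, b: -1 }, { a: 1, b: 0 }, { a: 1, b: 1 }];
console.log(estimateTheta(demoItems, [true, true, false]));
```

A grid search is slower than Newton-Raphson but cannot diverge, which keeps the sketch short and robust for an 18-item test.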
P(correct | θ) = 1 / (1 + e^(-a(θ - b)))
3. IQ Conversion
Theta is transformed to an IQ-equivalent score on a scale with mean 100 and SD 15 (the Wechsler convention), normed against model populations. The 95% confidence interval is derived from the standard error of the theta estimate.
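A minimal sketch of the conversion, taking μ and σ to be the mean and standard deviation of theta in the norming population. The symmetric normal-approximation interval (plus or minus 1.96 standard errors) is an assumption; the intervals on the leaderboard are not all symmetric, so the benchmark may compute them differently. The names `Norm`, `thetaToIq`, and `iqInterval` are illustrative.

```typescript
// Norming parameters: mean (mu) and SD (sigma) of theta in the reference population.
interface Norm {
  mean: number;
  sd: number;
}

// Linear rescaling of theta onto the mean-100, SD-15 scale.
function thetaToIq(theta: number, norm: Norm): number {
  return 100 + (15 * (theta - norm.mean)) / norm.sd;
}

// Assumed 95% interval: rescale the standard error of theta and apply +/- 1.96 SE.
function iqInterval(
  theta: number,
  se: number,
  norm: Norm,
): { iq: number; lower: number; upper: number } {
  const iq = thetaToIq(theta, norm);
  const half = (1.96 * 15 * se) / norm.sd;
  return { iq, lower: iq - half, upper: iq + half };
}

// Made-up inputs: theta = 0.4 with SE 0.4 against a norm of mean 0, SD 1.
const result = iqInterval(0.4, 0.4, { mean: 0, sd: 1 });
console.log(Math.round(result.iq), Math.round(result.lower), Math.round(result.upper));
// 106 94 118
```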
IQ = 100 + 15 × (θ - μ) / σ
Fair Comparison
Run Modes
Every result declares its run mode. Different modes are never directly compared on the same leaderboard.
Closed Prompt
No tools, no internet, no memory. Pure reasoning from the prompt alone.
Local Tooling
Local execution tools allowed. No internet access.
Open Tooling
Tools and web access allowed within declared rules.
Custom Harness
Entrant supplies own orchestration within declared limits.
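The rule that modes are never compared directly can be enforced by keeping one ranking per run mode. The sketch below assumes a hypothetical result shape and mode identifiers; neither is taken from the benchmark's code.

```typescript
// Hypothetical identifiers for the four run modes described above.
type RunMode = "closed-prompt" | "local-tooling" | "open-tooling" | "custom-harness";

// Hypothetical result record; every result declares its run mode.
interface BenchmarkResult {
  model: string;
  runMode: RunMode;
  iq: number;
}

// Builds a separate leaderboard per run mode, each sorted by IQ descending.
function leaderboardsByMode(results: BenchmarkResult[]): Map<RunMode, BenchmarkResult[]> {
  const boards = new Map<RunMode, BenchmarkResult[]>();
  for (const r of results) {
    const board = boards.get(r.runMode) ?? [];
    board.push(r);
    boards.set(r.runMode, board);
  }
  boards.forEach((board) => board.sort((x, y) => y.iq - x.iq));
  return boards;
}

// Made-up entries for illustration.
const boards = leaderboardsByMode([
  { model: "model-x", runMode: "closed-prompt", iq: 104 },
  { model: "model-y", runMode: "open-tooling", iq: 118 },
  { model: "model-z", runMode: "closed-prompt", iq: 110 },
]);
console.log(boards.get("closed-prompt")?.map((r) => r.model)); // [ 'model-z', 'model-x' ]
```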
Benchmark Structure
18 items, 6 categories, 3 tiers
| Count | Measure | Detail |
|---|---|---|
| 18 | Items per run | 3 per category across 3 difficulty tiers |
| 6 | Reasoning categories | Deduction, stateful, rules, causal, abstraction, planning |
| 4 | CHC cognitive factors | Gf (Fluid), Gwm (Working Memory), Gc (Comprehension), Gs (Speed) |
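The 18-item layout can be checked mechanically. This sketch assumes one item per category per tier (the natural reading of "3 per category across 3 difficulty tiers") and a hypothetical fixture shape with `category` and `tier` fields; the real fixture format is not shown in the source.

```typescript
// Hypothetical fixture shape.
interface Fixture {
  id: string;
  category: string;
  tier: number;
}

const CATEGORY_IDS = ["A", "B", "C", "D", "E", "F"];
const TIERS = [1, 2, 3];

// Returns a list of problems; an empty list means the layout is as expected.
function validateFixtures(items: Fixture[]): string[] {
  const errors: string[] = [];
  const expected = CATEGORY_IDS.length * TIERS.length;
  if (items.length !== expected) {
    errors.push(`expected ${expected} items, got ${items.length}`);
  }
  for (const category of CATEGORY_IDS) {
    for (const tier of TIERS) {
      const n = items.filter((it) => it.category === category && it.tier === tier).length;
      if (n !== 1) {
        errors.push(`category ${category}, tier ${tier}: expected 1 item, got ${n}`);
      }
    }
  }
  return errors;
}

// Build a complete synthetic set: 6 categories x 3 tiers.
const fixtures: Fixture[] = [];
for (const category of CATEGORY_IDS) {
  for (const tier of TIERS) {
    fixtures.push({ id: `${category}${tier}`, category, tier });
  }
}
console.log(validateFixtures(fixtures).length); // 0
```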
Quick Start
Run the benchmark
CLI
```shell
# Run all 18 items (closed prompt mode)
npx tsx benchmarks/intelligence/run-intelligence.ts --model claude-sonnet-4-20250514

# Run specific categories
npx tsx benchmarks/intelligence/run-intelligence.ts --model gpt-5.4 --categories A,D,E

# Dry run (no API calls, validates fixture loading)
npx tsx benchmarks/intelligence/run-intelligence.ts --dry-run --verbose
```