Production Scale

Scale Benchmarks

These benchmarks measure memory system effectiveness at production-realistic scale. The Delta benchmark tests retrieval across a 3,346-document enterprise SaaS corpus of more than 2M tokens. The Beta benchmark tests whether models take destructive actions during a database migration.

Enterprise SaaS Design Review

A simulated enterprise SaaS platform with auth, payments, data pipeline, infrastructure, and compliance documentation. 3,346 documents totalling 2M+ tokens. The agent must find 25 core architectural decisions and 5 needle facts buried in noise. Five treatment arms compare bare model, context-stuffing, and tool-mediated retrieval.

Corpus: 3,346 docs / 2,002,046 tokens | Keys: 25 core + 5 needle | Expert time: 150 min
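The Recall and Needle columns can be read as the share of the 25 core keys and the 5 needle keys that a run recovered. The sketch below shows that arithmetic under stated assumptions: the key identifiers (`core-0`, `needle-0`, and so on), the `recall_pct` helper, and exact-match scoring are all hypothetical, since the page does not describe how the benchmark decides that a key was found.

```python
# Hypothetical scoring sketch: percentage of expected keys that a run recovered.
# Key names and exact-match comparison are assumptions, not the benchmark's method.

def recall_pct(found, expected):
    """Return the share of expected keys present in found, as a whole percent."""
    expected = set(expected)
    if not expected:
        raise ValueError("expected must contain at least one key")
    hits = len(set(found) & expected)
    return round(100 * hits / len(expected))


# 25 core keys and 5 needle keys, matching the counts stated for this corpus.
CORE_KEYS = {f"core-{i}" for i in range(25)}
NEEDLE_KEYS = {f"needle-{i}" for i in range(5)}

# Example run: the agent recovered 24 core decisions and 1 needle fact.
found_core = {f"core-{i}" for i in range(24)}
found_needle = {"needle-3"}

print("core recall:", recall_pct(found_core, CORE_KEYS), "%")
print("needle recall:", recall_pct(found_needle, NEEDLE_KEYS), "%")
```

Under this reading, one missed core key moves Recall by 4 points and one needle fact moves Needle by 20 points.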

24 / 24 runs

#	Model	Arm	Memory	Safe	Recall	Needle	Cx	Cost	Em/$	Time	Speed
1	gpt-5.4 (unverified, CueCrux Labs)	T3	VaultCrux 0.1.0	Y	100%	20%	42.6	$1.26	33.8	13.2m	11x
2	gpt-5.4 (unverified, CueCrux Labs)	F1	VaultCrux 0.1.0	Y	100%	20%	33.2	$1.76	18.9	3.1m	48x
3	claude-sonnet-4-6 (unverified, CueCrux Labs)	T2	VaultCrux 0.1.0	Y	100%	20%	24.3	$2.59	9.4	17.5m	9x
4	claude-sonnet-4-6 (unverified, CueCrux Labs)	F1	VaultCrux 0.1.0	Y	100%	40%	11.1	$6.28	1.8	7.5m	20x
5	gpt-5.4-mini (unverified, CueCrux Labs)	F1	VaultCrux 0.1.0	Y	96%	0%	57.0	$0.33	175.3	1.0m	145x
6	claude-sonnet-4-6 (unverified, CueCrux Labs)	T3	VaultCrux 0.1.0	Y	96%	20%	24.7	$2.42	10.2	29.5m	5x
7	gpt-5.4 (unverified, CueCrux Labs)	T2	VaultCrux 0.1.0	Y	80%	20%	29.6	$1.53	19.3	13.0m	12x
8	gpt-4.1-mini (CueCrux Labs)	F1	none	Y	77%	--	56.8	$0.37	155.1	2.0m	82x
9	claude-opus-4-6 (unverified, CueCrux Labs)	C0	none	Y	76%	0%	53.8	$0.56	96.0	3.3m	46x
10	gpt-5.4-mini (unverified, CueCrux Labs)	T2	VaultCrux 0.1.0	Y	72%	20%	44.9	$0.07	651.1	1.3m	115x
11	claude-haiku-4-5 (unverified, CueCrux Labs)	C0	none	Y	60%	0%	79.0	$0.07	1127.9	1.1m	139x
12	claude-sonnet-4-6 (unverified, CueCrux Labs)	C0	none	Y	44%	0%	27.5	$0.70	39.3	6.8m	22x
13	claude-opus-4-6 (unverified, CueCrux Labs)	C2	none	Y	40%	0%	0.4	$66.96	0.0	7.6m	20x
14	gpt-5.4-mini (unverified, CueCrux Labs)	C2	none	Y	40%	0%	33.0	$0.42	78.9	47s	192x
15	gpt-5.4 (unverified, CueCrux Labs)	C0	none	Y	28%	0%	21.9	$0.46	47.6	3.2m	46x
16	claude-sonnet-4-6 (unverified, CueCrux Labs)	C2	none	Y	28%	20%	1.5	$13.43	0.1	45.4m	3x
17	gpt-4.1-mini (CueCrux Labs)	T3	VaultCrux 0.2.0	Y	20%	--	14.4	$0.11	125.5	11.4m	15x
18	gpt-5.4-mini (unverified, CueCrux Labs)	C0	none	Y	20%	0%	16.5	$0.06	258.2	35s	257x
19	gpt-4.1-mini (CueCrux Labs)	C0	none	Y	17%	--	16.5	$0.06	278.1	35s	285x
20	gpt-4.1-mini (CueCrux Labs)	T2	VaultCrux 0.2.0	Y	17%	--	12.1	$0.16	76.8	15.7m	11x
21	gpt-4.1-mini (CueCrux Labs)	C2	none	Y	13%	--	13.2	$1.60	8.3	15.9m	10x
22	gpt-5.4-mini (unverified, CueCrux Labs)	T3	VaultCrux 0.1.0	Y	8%	0%	4.8	$0.05	92.1	10.3m	120x
23	claude-opus-4-6 (unverified, CueCrux Labs)	F1	VaultCrux 0.1.0	Y	8%	20%	0.1	$41.38	0.0	7.7m	19x
24	gpt-5.4 (unverified, CueCrux Labs)	C2	none	Y	8%	20%	0.6	$10.04	0.1	3.4m	45x

Key Finding

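At 2M+ token scale, context-stuffing (C2) scores below the bare model (C0) on Sonnet and GPT-5.4: Sonnet drops from 44% to 28% core recall, and GPT-5.4 from 28% to 8%. The models drown in noise. On the same two models, tool-mediated retrieval (F1, T2, T3) reaches 80-100% core recall. Context-stuffing is also the most expensive arm for both: $10.04-13.43 per run, against $1.26-6.28 for the tool-mediated arms (roughly 2-8x more) and $0.46-0.70 for bare (roughly 20x more). Results on the other models are mixed: gpt-5.4-mini improves from 20% to 40% under C2, and some tool-mediated runs score as low as 8% (claude-opus-4-6 F1, gpt-5.4-mini T3).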
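The comparison between bare, context-stuffed, and tool-mediated runs can be recomputed directly from the table. The following is a minimal sketch: the recall and cost figures are copied from the GPT-5.4 and Sonnet rows above, while the `RUNS` layout and the `summarize` helper are illustrative names, not part of the benchmark tooling.

```python
# (model, arm) -> (core recall %, cost per run in USD), copied from the results table.
RUNS = {
    ("gpt-5.4", "C0"): (28, 0.46),
    ("gpt-5.4", "C2"): (8, 10.04),
    ("gpt-5.4", "F1"): (100, 1.76),
    ("gpt-5.4", "T2"): (80, 1.53),
    ("gpt-5.4", "T3"): (100, 1.26),
    ("claude-sonnet-4-6", "C0"): (44, 0.70),
    ("claude-sonnet-4-6", "C2"): (28, 13.43),
    ("claude-sonnet-4-6", "F1"): (100, 6.28),
    ("claude-sonnet-4-6", "T2"): (100, 2.59),
    ("claude-sonnet-4-6", "T3"): (96, 2.42),
}

TOOL_ARMS = ("F1", "T2", "T3")


def summarize(model):
    """Compare the context-stuffing arm (C2) with bare (C0) and the tool arms."""
    bare_recall, bare_cost = RUNS[(model, "C0")]
    stuffed_recall, stuffed_cost = RUNS[(model, "C2")]
    tool = [RUNS[(model, arm)] for arm in TOOL_ARMS]
    tool_recalls = [recall for recall, _ in tool]
    tool_costs = [cost for _, cost in tool]
    return {
        # Negative means context-stuffing recalled fewer core keys than bare.
        "recall_change_pts": stuffed_recall - bare_recall,
        "tool_recall_range": (min(tool_recalls), max(tool_recalls)),
        # How many times more a C2 run cost than a bare run.
        "cost_vs_bare": stuffed_cost / bare_cost,
        # C2 cost relative to the most and least expensive tool arm.
        "cost_vs_tool_range": (
            stuffed_cost / max(tool_costs),
            stuffed_cost / min(tool_costs),
        ),
    }


for model in ("gpt-5.4", "claude-sonnet-4-6"):
    print(model, summarize(model))
```

Each figure comes from a single run per model and arm, so the ratios are best read as indicative rather than as stable estimates.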