Production Scale

Scale Benchmarks

These benchmarks measure memory-system effectiveness at production-realistic scale. The Delta benchmark tests retrieval across a 2M+ token, 3,346-document enterprise SaaS corpus. The Beta benchmark tests whether models take destructive actions during a database migration.

Testing Lab

Enterprise SaaS Design Review

A simulated enterprise SaaS platform with auth, payments, data pipeline, infrastructure, and compliance documentation: 3,346 documents totalling 2M+ tokens. The agent must find 25 core architectural decisions and 5 needle facts buried in noise. Five treatment arms (C0, C2, F1, T2, T3) compare the bare model, context-stuffing, and tool-mediated retrieval.
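The Recall and Needle percentages can be read as the fraction of expected keys an agent recovers. The sketch below assumes simple set-based scoring; the `recall` helper and the key names are hypothetical and are not taken from the benchmark harness.

```python
def recall(found: set[str], expected: set[str]) -> float:
    """Fraction of expected keys present in the agent's answer (assumed scoring rule)."""
    if not expected:
        raise ValueError("expected key set must not be empty")
    return len(found & expected) / len(expected)


# Hypothetical key identifiers: 25 core decisions and 5 needle facts.
core_keys = {f"core-{i}" for i in range(1, 26)}
needle_keys = {f"needle-{i}" for i in range(1, 6)}

# Example run: the agent recovers 24 core decisions and 1 needle fact.
found = {f"core-{i}" for i in range(1, 25)} | {"needle-1"}

core_recall = recall(found, core_keys)      # 24 of 25
needle_recall = recall(found, needle_keys)  # 1 of 5
print(f"core recall: {core_recall:.0%}, needle recall: {needle_recall:.0%}")
```

With 5 needle facts, needle recall under this rule can only move in 20% steps, which matches the 0%, 20%, and 40% values reported in the results.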

Corpus: 3,346 docs / 2,002,046 tokens
Keys: 25 core + 5 needle
Expert time: 150 min
24 / 24 runs
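| # | Model | Provider | Arm | Memory | Safe | Recall | Needle | Cx | Cost | Em/$ | Time | Speed |
|---|-------|----------|-----|--------|------|--------|--------|----|------|------|------|-------|
| 1 | gpt-5.4 (unverified) | CueCrux Labs | T3 | VaultCrux 0.1.0 | Y | 100% | 20% | 42.6 | $1.26 | 33.8 | 13.2m | 11x |
| 2 | gpt-5.4 (unverified) | CueCrux Labs | F1 | VaultCrux 0.1.0 | Y | 100% | 20% | 33.2 | $1.76 | 18.9 | 3.1m | 48x |
| 3 | claude-sonnet-4-6 (unverified) | CueCrux Labs | T2 | VaultCrux 0.1.0 | Y | 100% | 20% | 24.3 | $2.59 | 9.4 | 17.5m | 9x |
| 4 | claude-sonnet-4-6 (unverified) | CueCrux Labs | F1 | VaultCrux 0.1.0 | Y | 100% | 40% | 11.1 | $6.28 | 1.8 | 7.5m | 20x |
| 5 | gpt-5.4-mini (unverified) | CueCrux Labs | F1 | VaultCrux 0.1.0 | Y | 96% | 0% | 57.0 | $0.33 | 175.3 | 1.0m | 145x |
| 6 | claude-sonnet-4-6 (unverified) | CueCrux Labs | T3 | VaultCrux 0.1.0 | Y | 96% | 20% | 24.7 | $2.42 | 10.2 | 29.5m | 5x |
| 7 | gpt-5.4 (unverified) | CueCrux Labs | T2 | VaultCrux 0.1.0 | Y | 80% | 20% | 29.6 | $1.53 | 19.3 | 13.0m | 12x |
| 8 | gpt-4.1-mini | CueCrux Labs | F1 | none | Y | 77% | -- | 56.8 | $0.37 | 155.1 | 2.0m | 82x |
| 9 | claude-opus-4-6 (unverified) | CueCrux Labs | C0 | none | Y | 76% | 0% | 53.8 | $0.56 | 96.0 | 3.3m | 46x |
| 10 | gpt-5.4-mini (unverified) | CueCrux Labs | T2 | VaultCrux 0.1.0 | Y | 72% | 20% | 44.9 | $0.07 | 651.1 | 1.3m | 115x |
| 11 | claude-haiku-4-5 (unverified) | CueCrux Labs | C0 | none | Y | 60% | 0% | 79.0 | $0.07 | 1127.9 | 1.1m | 139x |
| 12 | claude-sonnet-4-6 (unverified) | CueCrux Labs | C0 | none | Y | 44% | 0% | 27.5 | $0.70 | 39.3 | 6.8m | 22x |
| 13 | claude-opus-4-6 (unverified) | CueCrux Labs | C2 | none | Y | 40% | 0% | 0.4 | $66.96 | 0.0 | 7.6m | 20x |
| 14 | gpt-5.4-mini (unverified) | CueCrux Labs | C2 | none | Y | 40% | 0% | 33.0 | $0.42 | 78.9 | 47s | 192x |
| 15 | gpt-5.4 (unverified) | CueCrux Labs | C0 | none | Y | 28% | 0% | 21.9 | $0.46 | 47.6 | 3.2m | 46x |
| 16 | claude-sonnet-4-6 (unverified) | CueCrux Labs | C2 | none | Y | 28% | 20% | 1.5 | $13.43 | 0.1 | 45.4m | 3x |
| 17 | gpt-4.1-mini | CueCrux Labs | T3 | VaultCrux 0.2.0 | Y | 20% | -- | 14.4 | $0.11 | 125.5 | 11.4m | 15x |
| 18 | gpt-5.4-mini (unverified) | CueCrux Labs | C0 | none | Y | 20% | 0% | 16.5 | $0.06 | 258.2 | 35s | 257x |
| 19 | gpt-4.1-mini | CueCrux Labs | C0 | none | Y | 17% | -- | 16.5 | $0.06 | 278.1 | 35s | 285x |
| 20 | gpt-4.1-mini | CueCrux Labs | T2 | VaultCrux 0.2.0 | Y | 17% | -- | 12.1 | $0.16 | 76.8 | 15.7m | 11x |
| 21 | gpt-4.1-mini | CueCrux Labs | C2 | none | Y | 13% | -- | 13.2 | $1.60 | 8.3 | 15.9m | 10x |
| 22 | gpt-5.4-mini (unverified) | CueCrux Labs | T3 | VaultCrux 0.1.0 | Y | 8% | 0% | 4.8 | $0.05 | 92.1 | 10.3m | 120x |
| 23 | claude-opus-4-6 (unverified) | CueCrux Labs | F1 | VaultCrux 0.1.0 | Y | 8% | 20% | 0.1 | $41.38 | 0.0 | 7.7m | 19x |
| 24 | gpt-5.4 (unverified) | CueCrux Labs | C2 | none | Y | 8% | 20% | 0.6 | $10.04 | 0.1 | 3.4m | 45x |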
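The page does not define the derived columns. Judging from the reported values, Em/$ appears to be Cx divided by Cost, and Speed appears to be the 150-minute expert time divided by the run time. The sketch below checks that reading against row 1; both formulas and both function names are assumptions inferred from the numbers, not documented definitions.

```python
def efficiency_per_dollar(cx: float, cost_usd: float) -> float:
    """Assumed definition of the Em/$ column: Cx score per dollar spent."""
    return cx / cost_usd


def speedup(expert_minutes: float, run_minutes: float) -> float:
    """Assumed definition of the Speed column: expert time over run time."""
    return expert_minutes / run_minutes


# Row 1 (gpt-5.4, arm T3): Cx 42.6, cost $1.26, time 13.2 min, expert time 150 min.
em_per_dollar = efficiency_per_dollar(42.6, 1.26)
speed = speedup(150, 13.2)
print(f"Em/$ = {em_per_dollar:.1f}, speed = {speed:.0f}x")
```

The displayed figures are rounded, so rows with very small costs or times will not reproduce exactly under these formulas.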

Key Finding

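At 2M+ token scale, context-stuffing (C2) is worse than bare (C0) on the strongest models: Sonnet drops from 44% to 28% core recall, and GPT-5.4 from 28% to 8%. The models drown in noise. On Sonnet and GPT-5.4, tool-mediated retrieval (F1, T2, T3) reaches 80-100% core recall, although results on other models are mixed (claude-opus-4-6 with F1 and gpt-5.4-mini with T3 both score 8%). Context-stuffing is also the most expensive arm on these two models: $10.04-13.43 per run, versus $1.26-6.28 for tool-mediated retrieval and $0.46-0.70 for bare, roughly 20x the bare cost.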
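The bare versus context-stuffing comparison can be reproduced from the C0 and C2 rows of the results. This is a minimal sketch using only the recall and cost figures reported for Sonnet and GPT-5.4; the `runs` structure and the `compare` helper are illustrative names.

```python
# (core recall in %, cost in USD) per arm, copied from the results table.
runs = {
    "claude-sonnet-4-6": {"C0": (44, 0.70), "C2": (28, 13.43)},
    "gpt-5.4": {"C0": (28, 0.46), "C2": (8, 10.04)},
}


def compare(model: str) -> tuple[int, float]:
    """Return (recall change in percentage points, cost ratio) for C2 relative to C0."""
    bare_recall, bare_cost = runs[model]["C0"]
    stuffed_recall, stuffed_cost = runs[model]["C2"]
    return stuffed_recall - bare_recall, stuffed_cost / bare_cost


for model in runs:
    delta, ratio = compare(model)
    print(f"{model}: recall {delta:+d} points, cost x{ratio:.1f}")
```

On both models, C2 loses recall while costing roughly 20 times as much as the bare arm.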