Production Scale
Scale Benchmarks
These benchmarks measure memory-system effectiveness at production-realistic scale. The Delta benchmark tests retrieval across a 3,346-document enterprise SaaS corpus of more than 2M tokens. The Beta benchmark tests whether models take destructive actions during a database migration.

Enterprise SaaS Design Review
The corpus simulates an enterprise SaaS platform, with auth, payments, data pipeline, infrastructure, and compliance documentation: 3,346 documents totalling 2M+ tokens. The agent must find 25 core architectural decisions and 5 needle facts buried in noise. Five treatment arms compare the bare model (C0), context-stuffing (C2), and tool-mediated retrieval (F1, T2, T3).
| # | Model | Arm | Memory | Safe | Recall | Needle | Cx | Cost | Em/$ | Time | Speed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5.4 (unverified, CueCrux Labs) | T3 | VaultCrux 0.1.0 | Y | 100% | 20% | 42.6 | $1.26 | 33.8 | 13.2m | 11x |
| 2 | gpt-5.4 (unverified, CueCrux Labs) | F1 | VaultCrux 0.1.0 | Y | 100% | 20% | 33.2 | $1.76 | 18.9 | 3.1m | 48x |
| 3 | claude-sonnet-4-6 (unverified, CueCrux Labs) | T2 | VaultCrux 0.1.0 | Y | 100% | 20% | 24.3 | $2.59 | 9.4 | 17.5m | 9x |
| 4 | claude-sonnet-4-6 (unverified, CueCrux Labs) | F1 | VaultCrux 0.1.0 | Y | 100% | 40% | 11.1 | $6.28 | 1.8 | 7.5m | 20x |
| 5 | gpt-5.4-mini (unverified, CueCrux Labs) | F1 | VaultCrux 0.1.0 | Y | 96% | 0% | 57.0 | $0.33 | 175.3 | 1.0m | 145x |
| 6 | claude-sonnet-4-6 (unverified, CueCrux Labs) | T3 | VaultCrux 0.1.0 | Y | 96% | 20% | 24.7 | $2.42 | 10.2 | 29.5m | 5x |
| 7 | gpt-5.4 (unverified, CueCrux Labs) | T2 | VaultCrux 0.1.0 | Y | 80% | 20% | 29.6 | $1.53 | 19.3 | 13.0m | 12x |
| 8 | gpt-4.1-mini (CueCrux Labs) | F1 | none | Y | 77% | -- | 56.8 | $0.37 | 155.1 | 2.0m | 82x |
| 9 | claude-opus-4-6 (unverified, CueCrux Labs) | C0 | none | Y | 76% | 0% | 53.8 | $0.56 | 96.0 | 3.3m | 46x |
| 10 | gpt-5.4-mini (unverified, CueCrux Labs) | T2 | VaultCrux 0.1.0 | Y | 72% | 20% | 44.9 | $0.07 | 651.1 | 1.3m | 115x |
| 11 | claude-haiku-4-5 (unverified, CueCrux Labs) | C0 | none | Y | 60% | 0% | 79.0 | $0.07 | 1127.9 | 1.1m | 139x |
| 12 | claude-sonnet-4-6 (unverified, CueCrux Labs) | C0 | none | Y | 44% | 0% | 27.5 | $0.70 | 39.3 | 6.8m | 22x |
| 13 | claude-opus-4-6 (unverified, CueCrux Labs) | C2 | none | Y | 40% | 0% | 0.4 | $66.96 | 0.0 | 7.6m | 20x |
| 14 | gpt-5.4-mini (unverified, CueCrux Labs) | C2 | none | Y | 40% | 0% | 33.0 | $0.42 | 78.9 | 47s | 192x |
| 15 | gpt-5.4 (unverified, CueCrux Labs) | C0 | none | Y | 28% | 0% | 21.9 | $0.46 | 47.6 | 3.2m | 46x |
| 16 | claude-sonnet-4-6 (unverified, CueCrux Labs) | C2 | none | Y | 28% | 20% | 1.5 | $13.43 | 0.1 | 45.4m | 3x |
| 17 | gpt-4.1-mini (CueCrux Labs) | T3 | VaultCrux 0.2.0 | Y | 20% | -- | 14.4 | $0.11 | 125.5 | 11.4m | 15x |
| 18 | gpt-5.4-mini (unverified, CueCrux Labs) | C0 | none | Y | 20% | 0% | 16.5 | $0.06 | 258.2 | 35s | 257x |
| 19 | gpt-4.1-mini (CueCrux Labs) | C0 | none | Y | 17% | -- | 16.5 | $0.06 | 278.1 | 35s | 285x |
| 20 | gpt-4.1-mini (CueCrux Labs) | T2 | VaultCrux 0.2.0 | Y | 17% | -- | 12.1 | $0.16 | 76.8 | 15.7m | 11x |
| 21 | gpt-4.1-mini (CueCrux Labs) | C2 | none | Y | 13% | -- | 13.2 | $1.60 | 8.3 | 15.9m | 10x |
| 22 | gpt-5.4-mini (unverified, CueCrux Labs) | T3 | VaultCrux 0.1.0 | Y | 8% | 0% | 4.8 | $0.05 | 92.1 | 10.3m | 120x |
| 23 | claude-opus-4-6 (unverified, CueCrux Labs) | F1 | VaultCrux 0.1.0 | Y | 8% | 20% | 0.1 | $41.38 | 0.0 | 7.7m | 19x |
| 24 | gpt-5.4 (unverified, CueCrux Labs) | C2 | none | Y | 8% | 20% | 0.6 | $10.04 | 0.1 | 3.4m | 45x |
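The page does not define the Em/$ column. In the rows above it appears to track Cx divided by Cost, within the rounding of the displayed values. This is an inference from the numbers, not a documented formula, and the row labels and the `relative_gap` helper below are illustrative only. A minimal sketch of the check on four rows:

```python
# (Cx, Cost in USD, reported Em/$) copied from four rows of the table.
# Assumption: Em/$ is roughly Cx / Cost. The source does not state this.
rows = {
    "gpt-5.4 T3": (42.6, 1.26, 33.8),
    "gpt-5.4 F1": (33.2, 1.76, 18.9),
    "claude-sonnet-4-6 T2": (24.3, 2.59, 9.4),
    "claude-opus-4-6 C0": (53.8, 0.56, 96.0),
}


def relative_gap(cx: float, cost: float, reported: float) -> float:
    """Relative difference between Cx / Cost and the reported Em/$."""
    return abs(cx / cost - reported) / reported


for label, (cx, cost, reported) in rows.items():
    gap = relative_gap(cx, cost, reported)
    print(f"{label}: Cx/Cost = {cx / cost:.1f}, reported = {reported}, gap = {gap:.2%}")
```

Rows with very small costs can drift further from this ratio, because the displayed Cost is rounded to the cent.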
Key Finding
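At 2M+ token scale, context-stuffing (C2) is worse than the bare model (C0) on the strongest models: Sonnet drops from 44% to 28% core recall, GPT-5.4 from 28% to 8%, and Opus from 76% to 40%. The models drown in noise. On Sonnet and GPT-5.4, tool-mediated retrieval (F1, T2, T3) reaches 80-100% core recall; the smaller models and Opus F1 are far less consistent, ranging from 8% to 96%. Context-stuffing is also much more expensive on those two models: $10.04-13.43 per run, versus $1.26-6.28 for the tool-mediated arms (about 2-8x) and $0.46-0.70 bare (about 20x).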
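The Sonnet and GPT-5.4 comparison can be recomputed directly from the table. The sketch below copies the recall and cost figures for those two models. The `runs` layout and the `summarize` helper are illustrative names, not part of the benchmark harness.

```python
# (core recall %, cost in USD) per arm, copied from the table above.
runs = {
    "claude-sonnet-4-6": {
        "C0": (44, 0.70), "C2": (28, 13.43),
        "F1": (100, 6.28), "T2": (100, 2.59), "T3": (96, 2.42),
    },
    "gpt-5.4": {
        "C0": (28, 0.46), "C2": (8, 10.04),
        "F1": (100, 1.76), "T2": (80, 1.53), "T3": (100, 1.26),
    },
}

TOOL_ARMS = ("F1", "T2", "T3")  # tool-mediated retrieval arms


def summarize(arms: dict) -> dict:
    """Compare context-stuffing (C2) with bare (C0) and the tool-mediated arms."""
    bare_recall, bare_cost = arms["C0"]
    stuffed_recall, stuffed_cost = arms["C2"]
    tool_costs = [arms[a][1] for a in TOOL_ARMS]
    return {
        # Negative means context-stuffing lost recall relative to bare.
        "recall_change_pts": stuffed_recall - bare_recall,
        "min_tool_recall": min(arms[a][0] for a in TOOL_ARMS),
        "cost_vs_bare": stuffed_cost / bare_cost,
        # (ratio against the priciest tool arm, ratio against the cheapest)
        "cost_vs_tools": (stuffed_cost / max(tool_costs),
                          stuffed_cost / min(tool_costs)),
    }


for model, arms in runs.items():
    print(model, summarize(arms))
```

Keeping each comparison within a single model avoids mixing cost ranges across models of very different price.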
