Coding Quality Benchmark
How good is the code it builds?
Not merely "does it compile" or "does it pass one example." This benchmark measures the quality of produced software artefacts — correctness, robustness, maintainability, test quality, design quality, and security hygiene.
Entrants use their own generation style and problem-solving approach. The task and scoring are identical for all. The leaderboard separates prompt-only, tool-assisted, and full-agent modes.

Design Principles
What this benchmark values
Behavioural proof first
Hidden tests are the anchor. Static analysis supplements but never overrides whether the code actually works.
Quality, not quantity
Verbose comments, bloated abstractions, and framework polish are not rewarded. Terse but excellent code scores high.
Architecture-neutral
No provider-specific tool syntax. No reward for framework familiarity unless the task calls for it. Fixed rubric, not vibes.
Reproducible sandboxes
Fixed runtime versions, package manager policy, CPU/memory limits, network off by default. Deterministic fixtures.
Rankings
Leaderboard
| # | Model | Mode | Score | Tests | Quality | Design | Cost | Time |
|---|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-7 (CueCrux Labs) | C-A | 84% | 80% | 89% | 80% | $0.72 | 1.1m |
| 2 | claude-opus-4-6 (CueCrux Labs) | C-A | 82% | 78% | 84% | 80% | $0.76 | 1.6m |
| 3 | gpt-5.4-mini (CueCrux Labs) | C-A | 82% | 78% | 83% | 82% | $0.01 | 20s |
| 4 | gpt-5.4 (CueCrux Labs) | C-A | 82% | 78% | 83% | 82% | $0.06 | 58s |
| 5 | claude-sonnet-4-6 (CueCrux Labs) | C-A | 81% | 80% | 83% | 77% | $0.14 | 1.3m |
| 6 | gpt-4.1-mini (CueCrux Labs) | C-A | 80% | 76% | 85% | 77% | $0.01 | 35s |
| 7 | claude-haiku-4-5 (CueCrux Labs) | C-A | 80% | 77% | 82% | 77% | $0.04 | 33s |
Scoring
Three-Layer Quality Model
Objective Execution
Does the code work? The foundation layer — behavioural proof against visible and hidden tests.
- Compile / run success
- Unit test pass rate
- Hidden test pass rate
- Lint and static checks
- Type checking
Quality Heuristics
Is the code well-written? Automated quality signals that separate sloppy from maintainable output.
- Cyclomatic complexity
- Duplication score
- Dependency restraint
- Security smells
- Performance bounds
Rubric Review
Is the solution sensible? Selective review where automation cannot fully judge quality.
- Architecture quality
- Decomposition clarity
- Abstraction appropriateness
- Over-engineering detection
Weights
v1 Scoring Distribution
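As a rough illustration of how three layer scores could be folded into one number, here is a minimal sketch. The names `LayerScores`, `PLACEHOLDER_WEIGHTS`, and `compositeScore` are hypothetical, and the weights are placeholders chosen for the example, not the benchmark's v1 distribution.

```typescript
// Hypothetical shape: one normalised score (0..1) per layer of the quality model.
interface LayerScores {
  execution: number;  // Objective Execution
  heuristics: number; // Quality Heuristics
  rubric: number;     // Rubric Review
}

// Placeholder weights for illustration only; NOT the benchmark's published values.
const PLACEHOLDER_WEIGHTS: LayerScores = {
  execution: 0.5,
  heuristics: 0.3,
  rubric: 0.2,
};

// Weighted average of the three layers, normalised by the weight total
// so the weights do not have to sum to exactly 1.
function compositeScore(
  scores: LayerScores,
  weights: LayerScores = PLACEHOLDER_WEIGHTS,
): number {
  const values = [scores.execution, scores.heuristics, scores.rubric];
  if (values.some((v) => v < 0 || v > 1)) {
    throw new RangeError("layer scores must be in the range 0..1");
  }
  const total = weights.execution + weights.heuristics + weights.rubric;
  if (total <= 0) {
    throw new RangeError("weights must sum to a positive value");
  }
  const weighted =
    scores.execution * weights.execution +
    scores.heuristics * weights.heuristics +
    scores.rubric * weights.rubric;
  return weighted / total;
}
```

Giving the execution layer the largest placeholder weight reflects the "behavioural proof first" principle, but the actual proportions are an assumption of this sketch.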
Task Design
Five Task Families
Greenfield
Build a feature from a clear spec. Tests design choices, decomposition, and test coverage from scratch.
Bug Fix
Repair a broken implementation and preserve intended behaviour. Tests diagnostic reasoning and surgical precision.
Extension
Add functionality to an existing codebase. Tests integration skill and respect for existing patterns.
Refactor
Improve maintainability without changing outcomes. Tests restraint, structural understanding, and regression awareness.
Test Quality
Produce useful tests, not merely many tests. Tests understanding of edge cases and meaningful coverage.
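One possible reading of "useful tests, not merely many tests" is that a suite should fail when the implementation is subtly wrong. The sketch below illustrates that idea with a deliberately buggy variant of a small function. The `clamp` example and every name in it are hypothetical, and nothing here is claimed to be how the benchmark scores test quality.

```typescript
type Clamp = (value: number, min: number, max: number) => number;

// Reference implementation: restricts value to the range [min, max].
const correctClamp: Clamp = (v, min, max) => Math.min(Math.max(v, min), max);

// Deliberately buggy variant: ignores the upper bound.
const buggyClamp: Clamp = (v, min, _max) => Math.max(v, min);

// A test case returns true when the implementation behaves as expected.
type TestCase = (impl: Clamp) => boolean;

// Three tests that all exercise the same in-range path.
const manyShallowTests: TestCase[] = [
  (impl) => impl(5, 0, 10) === 5,
  (impl) => impl(6, 0, 10) === 6,
  (impl) => impl(7, 0, 10) === 7,
];

// Two tests that probe each boundary.
const fewUsefulTests: TestCase[] = [
  (impl) => impl(-1, 0, 10) === 0,
  (impl) => impl(11, 0, 10) === 10,
];

function suitePasses(suite: TestCase[], impl: Clamp): boolean {
  return suite.every((test) => test(impl));
}

// A suite "detects" the bug if it passes on the correct implementation
// and fails on the buggy one.
function detectsBug(suite: TestCase[]): boolean {
  return suitePasses(suite, correctClamp) && !suitePasses(suite, buggyClamp);
}
```

Under these assumptions the three in-range tests accept the buggy variant, while the two boundary tests reject it, which is the distinction between test count and test value.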
Benchmark Structure
- Tasks per run: 6–7. Rich tasks over trivial puzzles.
- Task families: 5. Greenfield, bug fix, extension, refactor, test quality.
- v1 language: TypeScript-first. Other languages as separate tracks later.
Fair Comparison
Run Modes
Different modes are never compared on the same leaderboard. The mode determines what the entrant can do during the run — the task and scoring are always identical.
C-A: Prompt-to-Code
The model receives the task and returns code. No external execution loop. No test-run feedback.
C-B: Tool-Assisted
The model may run tests, inspect files, and iterate within a sandbox. Declared tool budget.
C-C: Full Agent
The entrant orchestrates planning, file editing, testing, and retries inside declared limits.
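The rule that modes never share a leaderboard can be sketched as a partition step before ranking. The mode codes and labels come from the text above; the `Entry` shape and the `leaderboardsByMode` function are hypothetical names introduced for illustration.

```typescript
// Mode codes as listed above.
type RunMode = "C-A" | "C-B" | "C-C";

const RUN_MODE_LABELS: Record<RunMode, string> = {
  "C-A": "Prompt-to-Code",
  "C-B": "Tool-Assisted",
  "C-C": "Full Agent",
};

// Hypothetical entry shape; the real submission format is not specified here.
interface Entry {
  model: string;
  mode: RunMode;
  score: number; // overall score, 0..100
}

// Split entries into one board per mode, then rank each board by score
// (highest first). Entries from different modes are never ranked together.
function leaderboardsByMode(entries: Entry[]): Record<RunMode, Entry[]> {
  const boards: Record<RunMode, Entry[]> = { "C-A": [], "C-B": [], "C-C": [] };
  for (const entry of entries) {
    boards[entry.mode].push(entry);
  }
  for (const mode of Object.keys(boards) as RunMode[]) {
    boards[mode].sort((a, b) => b.score - a.score);
  }
  return boards;
}
```

Keeping the mode as a closed union type means an entry with an undeclared mode fails type checking instead of silently landing on a board.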
Fairness
How quality is judged
The fairest way is not to ask a judge model whether code is "nice." It is to combine behavioural proof, static quality evidence, and limited rubric-based review.
Behavioural Proof
The code must work against visible tests, hidden tests, edge cases, and malformed input cases where appropriate.
Static Quality Evidence
Standardised checks: ESLint, type checking, maintainability metrics, complexity metrics, dependency checks, security scans.
Rubric-Based Review
Applied only where automation is weak: are abstractions sensible, has the solution over-engineered the task, is the code understandable.
Environment
Sandbox Contract
- Runtime: fixed Node + pnpm
- Network: off by default
- Time limit: fixed per task
- CPU / Memory: fixed limits
- Packages: allowed manifest only
- Fixtures: deterministic
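A contract like this could be captured as a typed configuration object, which makes the "allowed manifest only" rule easy to check mechanically. The sketch below is an assumption about shape only: the field names, the version strings, the limits, and the package list are all placeholder values, not the benchmark's actual settings.

```typescript
// Hypothetical configuration shape mirroring the contract items above.
interface SandboxContract {
  nodeVersion: string;        // fixed Node version (placeholder value below)
  pnpmVersion: string;        // fixed pnpm version (placeholder value below)
  networkEnabled: boolean;    // off by default
  timeLimitMs: number;        // fixed per task
  cpuCores: number;           // fixed CPU limit
  memoryMb: number;           // fixed memory limit
  allowedPackages: readonly string[]; // the allowed manifest
}

// Placeholder values for illustration only.
const exampleContract: SandboxContract = {
  nodeVersion: "20.0.0",
  pnpmVersion: "9.0.0",
  networkEnabled: false,
  timeLimitMs: 120_000,
  cpuCores: 2,
  memoryMb: 2048,
  allowedPackages: ["typescript", "vitest"],
};

// Return every requested package that is not on the allowed manifest.
// An empty result means the submission respects the package policy.
function disallowedPackages(
  contract: SandboxContract,
  requested: readonly string[],
): string[] {
  const allowed = new Set(contract.allowedPackages);
  return requested.filter((name) => !allowed.has(name));
}
```

For example, under these placeholder values `disallowedPackages(exampleContract, ["typescript", "lodash"])` would flag `lodash` as outside the manifest.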
