Coding Quality Benchmark

How good is the code it builds?

Not merely "does it compile" or "does it pass one example." This benchmark measures the quality of produced software artefacts — correctness, robustness, maintainability, test quality, design quality, and security hygiene.

Entrants use their own generation style and problem-solving approach. The task and scoring are identical for all. The leaderboard separates prompt-only, tool-assisted, and full-agent modes.


Design Principles

What this benchmark values

Behavioural proof first

Hidden tests are the anchor. Static analysis supplements but never overrides whether the code actually works.

Quality, not quantity

Verbose comments, bloated abstractions, and framework polish are not rewarded. Terse but excellent code scores high.

Architecture-neutral

No provider-specific tool syntax. No reward for framework familiarity unless the task calls for it. Fixed rubric, not vibes.

Reproducible sandboxes

Fixed runtime versions, package manager policy, CPU/memory limits, network off by default. Deterministic fixtures.

Rankings

Leaderboard

#	Model	Mode	Score	Tests	Quality	Design	Cost	Time
1	claude-opus-4-7 (CueCrux Labs)	C-A	84%	80%	89%	80%	$0.72	1.1m
2	claude-opus-4-6 (CueCrux Labs)	C-A	82%	78%	84%	80%	$0.76	1.6m
3	gpt-5.4-mini (CueCrux Labs)	C-A	82%	78%	83%	82%	$0.01	20s
4	gpt-5.4 (CueCrux Labs)	C-A	82%	78%	83%	82%	$0.06	58s
5	claude-sonnet-4-6 (CueCrux Labs)	C-A	81%	80%	83%	77%	$0.14	1.3m
6	gpt-4.1-mini (CueCrux Labs)	C-A	80%	76%	85%	77%	$0.01	35s
7	claude-haiku-4-5 (CueCrux Labs)	C-A	80%	77%	82%	77%	$0.04	33s
8	gpt-5.5 (CueCrux Labs)	C-A	80%	78%	88%	82%	$0.00	1.8m

Scoring

Three-Layer Quality Model

C1: 45%

Objective Execution

Does the code work? The foundation layer — behavioural proof against visible and hidden tests.

Compile / run success
Unit test pass rate
Hidden test pass rate
Lint and static checks
Type checking

C2: 35%

Quality Heuristics

Is the code well-written? Automated quality signals that separate sloppy from maintainable output.

Cyclomatic complexity
Duplication score
Dependency restraint
Security smells
Performance bounds

C3: 20%

Rubric Review

Is the solution sensible? Selective review where automation cannot fully judge quality.

Architecture quality
Decomposition clarity
Abstraction appropriateness
Over-engineering detection
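
As an illustration of how the three layers could combine, here is a minimal sketch that assumes the layer weights above (45%, 35%, 20%) are applied as a plain weighted sum. The names `LAYER_WEIGHTS`, `LayerScores`, and `compositeScore` are hypothetical, and the benchmark does not state how checks within a layer are aggregated, so each layer score is simply taken as a fraction between 0 and 1.

```typescript
// Layer weights as stated in the three-layer model (C1 45%, C2 35%, C3 20%).
const LAYER_WEIGHTS = { c1: 0.45, c2: 0.35, c3: 0.2 } as const;

type LayerId = keyof typeof LAYER_WEIGHTS;

// Each layer score is a fraction in [0, 1]. How a layer's individual checks
// are folded into that fraction is NOT specified by the benchmark text.
type LayerScores = Record<LayerId, number>;

// Hypothetical aggregation: a plain weighted sum of the three layer scores.
function compositeScore(scores: LayerScores): number {
  let total = 0;
  for (const id of Object.keys(LAYER_WEIGHTS) as LayerId[]) {
    const s = scores[id];
    // Rejects out-of-range values and NaN.
    if (!(s >= 0 && s <= 1)) {
      throw new RangeError(`score for ${id} must be within [0, 1]`);
    }
    total += LAYER_WEIGHTS[id] * s;
  }
  return total;
}
```

Because C1 carries the largest weight, a solution that fails hidden tests cannot recover much through quality heuristics or rubric review alone, which matches the "behavioural proof first" principle. This sketch is not claimed to reproduce the published leaderboard scores.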

Weights

v1 Scoring Distribution

Correctness & hidden tests

45%

Code quality heuristics

20%

Maintainability & structure

15%

Test quality

10%

Efficiency & resources

Output compliance & docs

Task Design

Five Task Families

Greenfield

Build a feature from a clear spec. Tests design choices, decomposition, and test coverage from scratch.

Bug Fix

Repair a broken implementation and preserve intended behaviour. Tests diagnostic reasoning and surgical precision.

Extension

Add functionality to an existing codebase. Tests integration skill and respect for existing patterns.

Refactor

Improve maintainability without changing outcomes. Tests restraint, structural understanding, and regression awareness.

Test Quality

Produce useful tests, not merely many tests. Tests understanding of edge cases and meaningful coverage.

Benchmark Structure

6–7 tasks per run

Rich tasks over trivial puzzles

Task families

Greenfield, bug fix, extension, refactor, test quality

v1 Language

TypeScript-first. Other languages as separate tracks later.

Fair Comparison

Run Modes

Different modes are never compared on the same leaderboard. The mode determines what the entrant can do during the run — the task and scoring are always identical.

C-A: Prompt-to-Code

The model receives the task and returns code. No external execution loop. No test-run feedback.

No execution · No iteration

C-B: Tool-Assisted

The model may run tests, inspect files, and iterate within a sandbox. Declared tool budget.

Local tools · Test loop

C-C: Full Agent

The entrant orchestrates planning, file editing, testing, and retries inside declared limits.

Full orchestration · Declared limits
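
The rule that modes are never ranked against each other can be sketched as a simple partition step. The `Entry` shape and the `leaderboardsByMode` function below are hypothetical names for illustration; only the three mode identifiers come from the text above.

```typescript
// The three run modes named in this section.
type RunMode = "C-A" | "C-B" | "C-C";

// Hypothetical result record; the real submission format is not specified here.
interface Entry {
  model: string;
  mode: RunMode;
  score: number; // composite score as a fraction in [0, 1]
}

// Build one ranking per mode so that entries are only compared within a mode.
function leaderboardsByMode(entries: Entry[]): Record<RunMode, Entry[]> {
  const boards: Record<RunMode, Entry[]> = { "C-A": [], "C-B": [], "C-C": [] };
  for (const entry of entries) {
    boards[entry.mode].push(entry);
  }
  // Sort each board independently, highest score first.
  (Object.keys(boards) as RunMode[]).forEach((mode) => {
    boards[mode].sort((a, b) => b.score - a.score);
  });
  return boards;
}
```

Keeping a separate array per mode means a prompt-only entry never shares a rank list with a full-agent entry, even when their scores are numerically close.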

Fairness

How quality is judged

The fairest way is not to ask a judge model whether code is "nice." It is to combine behavioural proof, static quality evidence, and limited rubric-based review.

Behavioural Proof

The code must work against visible tests, hidden tests, edge cases, and malformed input cases where appropriate.

Static Quality Evidence

Standardised checks: eslint, typecheck, maintainability metrics, complexity metrics, dependency checks, security scans.

Rubric-Based Review

Applied only where automation is weak: are abstractions sensible, has the solution over-engineered the task, is the code understandable.

Environment

Sandbox Contract

Runtime

Fixed Node + pnpm

Network

Off by default

Time limit

Fixed per task

CPU / Memory

Fixed limits

Packages

Allowed manifest only

Fixtures

Deterministic
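
One way to picture the sandbox contract is as a typed configuration object. This is a sketch only: the field names, the `EXAMPLE_CONTRACT` values (versions, limits, package names), and the `disallowedPackages` helper are all hypothetical placeholders, since the page states that these settings are fixed but does not publish the actual values.

```typescript
// Hypothetical shape for the sandbox contract described above.
interface SandboxContract {
  runtime: { node: string; pnpm: string }; // pinned versions
  networkEnabled: boolean;                  // off by default
  timeLimitSeconds: number;                 // fixed per task
  cpuCores: number;                         // fixed limit
  memoryMb: number;                         // fixed limit
  allowedPackages: readonly string[];       // the allowed manifest
}

// Placeholder values for illustration only; NOT the benchmark's real settings.
const EXAMPLE_CONTRACT: SandboxContract = {
  runtime: { node: "x.y.z", pnpm: "x.y.z" },
  networkEnabled: false,
  timeLimitSeconds: 300,
  cpuCores: 2,
  memoryMb: 2048,
  allowedPackages: ["example-lib-a", "example-lib-b"],
};

// Return the requested packages that are not on the allowed manifest.
function disallowedPackages(
  contract: SandboxContract,
  requested: string[],
): string[] {
  return requested.filter((name) => contract.allowedPackages.indexOf(name) === -1);
}
```

A harness built along these lines could reject a submission before execution whenever `disallowedPackages` returns a non-empty list, which is one way to enforce the "allowed manifest only" policy.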