Coding Quality Benchmark

How good is the code it builds?

Not merely "does it compile" or "does it pass one example." This benchmark measures the quality of produced software artefacts — correctness, robustness, maintainability, test quality, design quality, and security hygiene.

Entrants use their own generation style and problem-solving approach. The task and scoring are identical for all. The leaderboard separates prompt-only, tool-assisted, and full-agent modes.

Design Principles

What this benchmark values

Behavioural proof first

Hidden tests are the anchor. Static analysis supplements the test results but never overrides them: what counts is whether the code actually works.

Quality, not quantity

Verbose comments, bloated abstractions, and framework polish are not rewarded. Terse but excellent code scores high.

Architecture-neutral

No provider-specific tool syntax. No reward for framework familiarity unless the task calls for it. Fixed rubric, not vibes.

Reproducible sandboxes

Fixed runtime versions, package manager policy, CPU/memory limits, network off by default. Deterministic fixtures.

Rankings

Leaderboard

#  Model                             Mode  Score  Tests  Quality  Design  Cost   Time
1  claude-opus-4-7 (CueCrux Labs)    C-A   84%    80%    89%      80%     $0.72  1.1m
2  claude-opus-4-6 (CueCrux Labs)    C-A   82%    78%    84%      80%     $0.76  1.6m
3  gpt-5.4-mini (CueCrux Labs)       C-A   82%    78%    83%      82%     $0.01  20s
4  gpt-5.4 (CueCrux Labs)            C-A   82%    78%    83%      82%     $0.06  58s
5  claude-sonnet-4-6 (CueCrux Labs)  C-A   81%    80%    83%      77%     $0.14  1.3m
6  gpt-4.1-mini (CueCrux Labs)       C-A   80%    76%    85%      77%     $0.01  35s
7  claude-haiku-4-5 (CueCrux Labs)   C-A   80%    77%    82%      77%     $0.04  33s

Scoring

Three-Layer Quality Model

C1: 45%

Objective Execution

Does the code work? The foundation layer — behavioural proof against visible and hidden tests.

  • Compile / run success
  • Unit test pass rate
  • Hidden test pass rate
  • Lint and static checks
  • Type checking

C2: 35%

Quality Heuristics

Is the code well-written? Automated quality signals that separate sloppy from maintainable output.

  • Cyclomatic complexity
  • Duplication score
  • Dependency restraint
  • Security smells
  • Performance bounds

C3: 20%

Rubric Review

Is the solution sensible? Selective review where automation cannot fully judge quality.

  • Architecture quality
  • Decomposition clarity
  • Abstraction appropriateness
  • Over-engineering detection

Weights

v1 Scoring Distribution

  • Correctness & hidden tests: 45%
  • Code quality heuristics: 20%
  • Maintainability & structure: 15%
  • Test quality: 10%
  • Efficiency & resources: 5%
  • Output compliance & docs: 5%
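
Both weight tables above sum to 100%, so either can be applied as a plain weighted average. The following is a minimal sketch of that arithmetic, not the benchmark's actual scoring code: the names `Weights`, `layerWeights`, `v1Weights`, and `weightedScore` are hypothetical, each category score is assumed to be on a 0 to 100 scale, and the example inputs are made up rather than taken from the leaderboard.

```typescript
// Hypothetical names throughout: the benchmark does not publish its scoring code.
type Weights = Record<string, number>;

// Three-layer model weights, in percent, as listed in the scoring section.
const layerWeights: Weights = { C1: 45, C2: 35, C3: 20 };

// v1 scoring distribution, in percent, as listed above.
const v1Weights: Weights = {
  correctness: 45,
  codeQuality: 20,
  maintainability: 15,
  testQuality: 10,
  efficiency: 5,
  compliance: 5,
};

// Combine per-category scores (each assumed to be 0..100) into one 0..100 score.
// Throws if the weights do not sum to 100 or a category score is missing.
function weightedScore(weights: Weights, scores: Record<string, number>): number {
  const total = Object.values(weights).reduce((a, b) => a + b, 0);
  if (total !== 100) throw new Error(`weights sum to ${total}, expected 100`);
  let sum = 0;
  for (const [name, weight] of Object.entries(weights)) {
    const score = scores[name];
    if (score === undefined) throw new Error(`missing score for ${name}`);
    sum += weight * score;
  }
  return sum / 100;
}

// Made-up inputs: (45*80 + 35*90 + 20*70) / 100 = 81.5
console.log(weightedScore(layerWeights, { C1: 80, C2: 90, C3: 70 })); // 81.5
```

Keeping the weights in a table and checking their sum at call time means a reweighted rubric fails loudly instead of silently producing scores on a different scale.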

Task Design

Five Task Families

1. Greenfield

Build a feature from a clear spec. Tests design choices, decomposition, and test coverage from scratch.

2. Bug Fix

Repair a broken implementation and preserve intended behaviour. Tests diagnostic reasoning and surgical precision.

3. Extension

Add functionality to an existing codebase. Tests integration skill and respect for existing patterns.

4. Refactor

Improve maintainability without changing outcomes. Tests restraint, structural understanding, and regression awareness.

5. Test Quality

Produce useful tests, not merely many tests. Tests understanding of edge cases and meaningful coverage.
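
To make "useful tests, not merely many tests" concrete, here is a small sketch that compares two suites by whether they catch a deliberately broken implementation. This mutation-style check is an assumption chosen for illustration; the benchmark does not say how it scores the Test Quality family, and `clamp`, `brokenClamp`, and `detects` are hypothetical names.

```typescript
type Impl = (x: number, lo: number, hi: number) => number;
type TestCase = (impl: Impl) => boolean; // true when the check passes

// Reference implementation.
const clamp: Impl = (x, lo, hi) => Math.min(Math.max(x, lo), hi);
// Deliberately broken variant: forgets the upper bound.
const brokenClamp: Impl = (x, lo, _hi) => Math.max(x, lo);

// Many tests, all in the easy middle of the range.
const shallowSuite: TestCase[] = [1, 2, 3, 4, 5].map(
  (x) => (impl: Impl) => impl(x, 0, 10) === x,
);

// Few tests, each aimed at a boundary.
const edgeSuite: TestCase[] = [
  (impl) => impl(-1, 0, 10) === 0,
  (impl) => impl(11, 0, 10) === 10,
  (impl) => impl(10, 0, 10) === 10,
];

// A suite detects the broken variant if it passes on the reference
// implementation and at least one of its tests fails on the broken one.
function detects(suite: TestCase[], reference: Impl, broken: Impl): boolean {
  const passesReference = suite.every((t) => t(reference));
  const failsBroken = suite.some((t) => !t(broken));
  return passesReference && failsBroken;
}

console.log(detects(shallowSuite, clamp, brokenClamp)); // false
console.log(detects(edgeSuite, clamp, brokenClamp)); // true
```

The five-test suite never leaves the middle of the range and misses the bug; the three-test suite catches it because each test targets a boundary.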

Benchmark Structure

6–7 tasks per run

Tasks per run: 6–7. Rich tasks over trivial puzzles.

Task families: 5. Greenfield, bug fix, extension, refactor, test quality.

v1 language: TS. TypeScript-first. Other languages as separate tracks later.

Fair Comparison

Run Modes

Different modes are never compared on the same leaderboard. The mode determines what the entrant can do during the run — the task and scoring are always identical.

C-A: Prompt-to-Code

The model receives the task and returns code. No external execution loop. No test-run feedback.

No execution · No iteration

C-B: Tool-Assisted

The model may run tests, inspect files, and iterate within a sandbox. Declared tool budget.

Local tools · Test loop

C-C: Full Agent

The entrant orchestrates planning, file editing, testing, and retries inside declared limits.

Full orchestration · Declared limits
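
The rule that modes are never compared on the same leaderboard can be sketched as a grouping step that runs before any ranking. The `Entry` shape, the `rankByMode` function, and the model names below are hypothetical; only the three mode labels come from the text above.

```typescript
type RunMode = "C-A" | "C-B" | "C-C";

interface Entry {
  model: string;
  mode: RunMode;
  score: number; // overall score in percent
}

// Split entries into one ranked list per mode. Entries in different modes
// are never ordered against each other.
function rankByMode(entries: Entry[]): Map<RunMode, Entry[]> {
  const boards = new Map<RunMode, Entry[]>();
  for (const entry of entries) {
    const board = boards.get(entry.mode) ?? [];
    board.push(entry);
    boards.set(entry.mode, board);
  }
  // Highest score first within each mode.
  boards.forEach((board) => board.sort((a, b) => b.score - a.score));
  return boards;
}

// Made-up entries for illustration only.
const boards = rankByMode([
  { model: "model-x", mode: "C-A", score: 80 },
  { model: "model-y", mode: "C-B", score: 90 },
  { model: "model-z", mode: "C-A", score: 84 },
]);

console.log(boards.get("C-A")?.map((e) => e.model)); // ["model-z", "model-x"]
console.log(boards.get("C-B")?.map((e) => e.model)); // ["model-y"]
```

Here `model-y` has the highest raw score, but it only ever appears on the C-B board, so its iteration budget is never weighed against a prompt-only entry.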

Fairness

How quality is judged

The fairest way is not to ask a judge model whether code is "nice." It is to combine behavioural proof, static quality evidence, and limited rubric-based review.

Behavioural Proof

The code must work against visible tests, hidden tests, edge cases, and malformed input cases where appropriate.
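
As a rough illustration of behavioural proof, the sketch below runs a solution against visible and hidden cases, including malformed input, and reports a pass rate for each group. The `Case` shape, the `parseCount` task, and the helper names are all hypothetical; the benchmark's real harness and tasks are not described in this document.

```typescript
interface Case {
  input: string;
  expected: number | "error"; // "error" means the solution must reject the input
  hidden: boolean;
}

type Solution = (input: string) => number;

// Example solution: parse a non-negative integer, throwing on malformed input.
const parseCount: Solution = (input) => {
  if (!/^\d+$/.test(input)) throw new Error(`invalid count: ${input}`);
  return Number(input);
};

// A naive solution that never rejects anything.
const naiveParse: Solution = (input) => Number(input);

function passes(solution: Solution, c: Case): boolean {
  try {
    const actual = solution(c.input);
    return c.expected !== "error" && actual === c.expected;
  } catch {
    return c.expected === "error";
  }
}

// Fraction of visible (hidden = false) or hidden (hidden = true) cases passed.
function passRate(solution: Solution, cases: Case[], hidden: boolean): number {
  const subset = cases.filter((c) => c.hidden === hidden);
  if (subset.length === 0) return 0;
  return subset.filter((c) => passes(solution, c)).length / subset.length;
}

const cases: Case[] = [
  { input: "42", expected: 42, hidden: false },
  { input: "0", expected: 0, hidden: false },
  { input: "", expected: "error", hidden: true }, // malformed: empty
  { input: "-3", expected: "error", hidden: true }, // malformed: negative
  { input: "007", expected: 7, hidden: true }, // edge case: leading zeros
];

console.log(passRate(naiveParse, cases, false)); // 1
console.log(passRate(parseCount, cases, true)); // 1
```

The naive solution passes every visible case but only one of the three hidden ones, which is the gap hidden tests are meant to expose.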

Static Quality Evidence

Standardised checks: ESLint, type checking, maintainability metrics, complexity metrics, dependency checks, security scans.

Rubric-Based Review

Applied only where automation is weak: are abstractions sensible, has the solution over-engineered the task, is the code understandable.

Environment

Sandbox Contract

  • Runtime: Fixed Node + pnpm
  • Network: Off by default
  • Time limit: Fixed per task
  • CPU / Memory: Fixed limits
  • Packages: Allowed manifest only
  • Fixtures: Deterministic
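
The contract above could be written down as a typed configuration plus a check that a submission only depends on allowed packages. This is a sketch under assumptions: every field name, limit, and package name below is a placeholder, since the document states only that these values are fixed, not what they are.

```typescript
// All field names and values are placeholders, not the benchmark's real settings.
interface SandboxContract {
  runtime: { node: string; pnpm: string };
  networkEnabled: boolean;
  timeLimitSeconds: number;
  cpuCores: number;
  memoryMb: number;
  allowedPackages: readonly string[];
}

const contract: SandboxContract = {
  runtime: { node: "<pinned-node-version>", pnpm: "<pinned-pnpm-version>" },
  networkEnabled: false, // off by default
  timeLimitSeconds: 300, // placeholder per-task limit
  cpuCores: 2, // placeholder
  memoryMb: 2048, // placeholder
  allowedPackages: ["typescript", "vitest"], // illustrative manifest
};

// Return the names of requested dependencies that are not on the allowed
// manifest, sorted for stable output.
function disallowedDependencies(
  sandbox: SandboxContract,
  dependencies: Record<string, string>,
): string[] {
  const allowed = new Set(sandbox.allowedPackages);
  return Object.keys(dependencies)
    .filter((name) => !allowed.has(name))
    .sort();
}

// Illustrative dependency map, shaped like the "dependencies" field of a package.json.
const requested = { typescript: "*", vitest: "*", "left-pad": "*" };
console.log(disallowedDependencies(contract, requested)); // ["left-pad"]
```

Checking dependencies against the manifest before installation fits with the network being off: a package outside the manifest could not be fetched inside the sandbox anyway.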