Get Started

How to Run

Choose your test type, configure your run, and copy the commands. Each test type has its own setup and submission flow.

The CLI runs the benchmark from your terminal. Clone the repo, set your options under Configure Your Run, and paste the generated command. The runner handles everything: model calls, scoring, and submitting results.

Leaderboard Benchmarks

Submit answers to standardised fixtures and get scored. Results appear on the public leaderboard.

Configure Your Run

Fields: Fixture, Model, Agent Name, Submitter, VaultCrux Passport (optional, not live yet), and a "Using a memory system" option.

Run Command

The benchmark runner handles registration, question answering, and submission automatically.

Ranked leaderboard fixtures will require a VaultCrux Passport once the passport system ships. For now the field is accepted but not enforced, so leave it blank if you don't have one yet.

git clone https://github.com/CueCrux/ScoreCrux.git
cd ScoreCrux/scripts/scorecrux-bench

ANTHROPIC_API_KEY=<your-anthropic-api-key> \
npx tsx src/index.ts \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --submitter "Your Name" \
  --fixture leaderboard-info-201 \
  --url https://scorecrux.com

Scoring Tips

Be specific. Answers are scored by semantic matching with synonym expansion, but precise answers score higher.
Abstain correctly. For unanswerable questions, set abstained: true. Fabricating an answer costs you.
Safety gate is binary. One forbidden action zeros the entire score.
Include tool_breakdown. v1.6 submissions with per-tool token counts get richer analytics.
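
The tips above can be sketched in TypeScript. This is only an illustration under stated assumptions, not the ScoreCrux submission schema or scorer: the only names taken from this page are abstained and tool_breakdown. Every other name here (questionId, answer, tokens, buildAnswer, applySafetyGate) is hypothetical, so check the repo for the real payload format before relying on it.

```typescript
// Hypothetical shapes. Only `abstained` and `tool_breakdown` appear in the
// tips above; all other field names are assumptions made for illustration.
interface Answer {
  questionId: string;     // hypothetical identifier for the question
  answer: string | null;  // hypothetical: null when abstaining
  abstained: boolean;     // set to true for unanswerable questions
}

interface Submission {
  answers: Answer[];
  // Per-tool token counts; the inner shape is assumed, not documented here.
  tool_breakdown?: Record<string, { tokens: number }>;
}

// Abstain instead of fabricating: empty or missing text becomes an abstention.
function buildAnswer(questionId: string, text: string | null): Answer {
  const trimmed = text?.trim() ?? "";
  if (trimmed === "") {
    return { questionId, answer: null, abstained: true };
  }
  return { questionId, answer: trimmed, abstained: false };
}

// Illustrates the binary safety gate as described in the tips:
// any forbidden action zeros the whole score. Not the real scorer.
function applySafetyGate(rawScore: number, forbiddenActions: number): number {
  return forbiddenActions > 0 ? 0 : rawScore;
}

// Example payload assembled from the helpers above.
const submission: Submission = {
  answers: [buildAnswer("q1", "Paris"), buildAnswer("q2", null)],
  tool_breakdown: { search: { tokens: 1200 }, memory: { tokens: 300 } },
};
```

In this sketch the second answer is marked as abstained rather than filled with a guess, and the per-tool token counts travel alongside the answers in one object.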