Get Started
How to Run
Choose your test type, configure your run, and copy the commands. Each test type has its own setup and submission flow.
The CLI runs the benchmark from your terminal. Clone the repo, set your options under Configure Your Run, and paste the generated command. The runner handles everything: model calls, scoring, and submitting results.

Leaderboard Benchmarks
Submit answers to standardised fixtures and get scored. Results appear on the public leaderboard.
Configure Your Run
Run Command
The benchmark runner handles registration, question answering, and submission automatically.
Ranked leaderboard fixtures will require a VaultCrux passport once the passport system ships. For now the field is accepted but not enforced, so leave it blank if you don't have one yet.
git clone https://github.com/CueCrux/ScoreCrux.git
cd ScoreCrux/scripts/scorecrux-bench
ANTHROPIC_API_KEY=<your-anthropic-api-key> \
npx tsx src/index.ts \
--model claude-sonnet-4-6 \
--provider anthropic \
--submitter "Your Name" \
--fixture leaderboard-info-201 \
--url https://scorecrux.com

Scoring Tips
- Be specific. Scoring uses semantic matching with synonym expansion, but precise answers score higher.
- Abstain correctly. For unanswerable questions, set abstained: true. Fabricating costs you.
- Safety gate is binary. One forbidden action zeros the entire score.
- Include tool_breakdown. v1.6 submissions with per-tool token counts get richer analytics.
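
The tips above can be illustrated with a minimal TypeScript sketch. Only `abstained` and `tool_breakdown` are names taken from this page; every other field name, the tool names, and both helper functions are hypothetical and shown only to make the rules concrete. The actual submission schema and scoring logic live in the runner and are not reproduced here.

```typescript
// Hypothetical shapes for illustration only. `abstained` and `tool_breakdown`
// come from the tips above; all other names are assumptions.
interface AnswerSketch {
  questionId: string;      // assumed field name
  answer: string | null;   // assumed field name
  abstained: boolean;      // named in the tips
}

interface SubmissionSketch {
  answers: AnswerSketch[];                 // assumed field name
  tool_breakdown: Record<string, number>;  // per-tool token counts
}

// Abstain instead of fabricating when the model produced nothing usable.
function toAnswer(questionId: string, modelOutput: string | null): AnswerSketch {
  const text = modelOutput?.trim() ?? "";
  if (text.length === 0) {
    return { questionId, answer: null, abstained: true };
  }
  return { questionId, answer: text, abstained: false };
}

// Illustrates the "binary" rule: any forbidden action zeros the whole score.
// How the real scorer counts forbidden actions is not shown in the source.
function applySafetyGate(rawScore: number, forbiddenActions: number): number {
  return forbiddenActions > 0 ? 0 : rawScore;
}

const submission: SubmissionSketch = {
  answers: [toAnswer("q1", "Paris"), toAnswer("q2", null)],
  // Tool names here are placeholders, not tools the benchmark defines.
  tool_breakdown: { search: 1200, read_file: 340 },
};
```

In this sketch, an empty model output becomes an explicit abstention rather than a guessed answer, and a single forbidden action overrides an otherwise high raw score.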
