Agent Effectiveness Metric Standard
Measure what matters.
In minutes, not abstractions.
ScoreCrux measures AI agent sessions in Effective Minutes — quality-adjusted minutes of expert work replaced by the agent, gated on safety. An unsafe session is always worth zero.

0 Em
Unsafe
Safety gate failed
<1 Em
Low
Trivial or poor quality
1–10 Em
Routine
Reasonable quality
10–60 Em
Significant
Expert work replaced
>60 Em
Complex
1+ hours replaced
Test Suites
Six ways to measure

Leaderboard
Head-to-head model comparison across standardised fixtures. Submit runs, get scored, rank against the field.

Top Floor
100-floor mystery game. Infiltrate Pinnacle Tower, solve puzzles, hack systems, survive memory wipes.

Scale
Memory system effectiveness from 27K to 2M+ tokens. Where context-stuffing breaks and tools become essential.

IQ Benchmark
Psychometric reasoning test. CHC cognitive factors, IRT calibration, IQ-equivalent scoring with confidence intervals.

Code Quality
How good is the code it builds? Correctness, maintainability, test quality, security hygiene across 5 task families.

Model Ranks
Cross-suite model leaderboard. Aggregate Em scores, specs, pricing, and context windows across every benchmark we run.
Design Principles
Why Effective Minutes
Time-anchored
"23 Em" = 23 quality-adjusted minutes of expert work. Not an abstract score. Convert to cost: 23 Em x $2/min = $46 of value.
Safety-gated
An agent that produces perfect output but takes a destructive action scores zero. No partial credit for "almost safe."
Decomposable
ScoreCrux reports core fundamentals plus versioned extensions across time, information, continuity, safety, and economics.
Immutable
Published metric definitions never change. New metrics may be added, existing ones deprecated with a pointer — never redefined.
Quick Start
Install and compute
npm
npm install scorecruxTypeScript — full input including tokens and tools
import { computeScoreCrux } from "scorecrux";
const result = computeScoreCrux({
// Time
T_orient_s: 4.2, // seconds to first substantive action
T_task_s: 156.3, // total task duration
T_human_s: 1800, // human baseline for this task
// Information quality
R_decision: 0.875, // decision recall
R_constraint: 1.0, // constraint adherence
R_incident: 1, // incident detection
P_context: 0.72, // context precision
A_coverage: 0.0, // abstention coverage
// Continuity
K_decision: 0.88, // decision persistence
K_causal: null, // causal chain (if applicable)
K_checkpoint: null, // checkpoint recovery
// Safety
S_gate: 1, // safety gate (0 = entire score zeroed)
S_detect: 1, // threat detection
S_stale: 1.0, // stale data handling
// Economics — tokens, tools, turns
C_tokens_usd: 0.024, // total token cost in USD
N_tools: 8, // number of tool calls made
N_turns: 14, // conversation turns
N_corrections: 0, // human corrections needed
});
console.log(result.composite.Cx_em); // => 26.04 Em
console.log(result.derived.V_time); // => time compression vs human baseline
console.log(result.derived.V_cost); // => cost per quality unitWhat We Measure
Versioned Fundamentals across 5 categories
Time
T_orientOrientation speedT_taskTask completionT_humanHuman baseline
Information
R_decisionDecision recallR_constraintConstraintsP_contextContext precision
Continuity
K_decisionPersistenceK_causalCausal chainsK_synthesisNovel synthesis
Safety
S_gateSafety gateS_detectThreat detectionS_staleStale handling
Economics
C_tokensToken costN_toolsTool callsN_turnsTurn count
