Agent Effectiveness Metric Standard

Measure what matters.
In minutes, not abstractions.

ScoreCrux measures AI agent sessions in Effective Minutes — quality-adjusted minutes of expert work replaced by the agent, gated on safety. An unsafe session is always worth zero.

View Benchmarks Play Top Floor GitHub

0 Em

Unsafe

Safety gate failed

<1 Em

Low

Trivial or poor quality

1–10 Em

Routine

Reasonable quality

10–60 Em

Significant

Expert work replaced

>60 Em

Complex

1+ hours replaced

Test Suites

Eight ways to measure

Benchmark

Leaderboard

Head-to-head model comparison across standardised fixtures. Submit runs, get scored, rank against the field.

6 fixtures65 questions16 dimensions

Flagship

Top Floor

100-floor mystery game. Infiltrate Pinnacle Tower, solve puzzles, hack systems, survive memory wipes.

100 floors100M+ tokens5 acts

Methodology

Scale

Memory system effectiveness from 27K to 2M+ tokens. Where context-stuffing breaks and tools become essential.

4 benchmarks27K to 2M+5 arms

Intelligence

IQ Benchmark

Psychometric reasoning test. CHC cognitive factors, IRT calibration, IQ-equivalent scoring with confidence intervals.

18 items6 categories4 CHC factors

Coding

Code Quality

How good is the code it builds? Correctness, maintainability, test quality, security hygiene across 5 task families.

6–7 tasks5 families3 layers

Context

Context Dependence

How much a task depends on carried context — and how well any memory backend supplies it. Backend-agnostic, seeded, with published negative controls.

7 sections5 backends2 controls

Transparency

GlassBox

Open the box: the full scored audit trail behind a run. Arm-by-arm comparison, receipts, and the exact evidence each Em is built from.

Audit trailArm compareWilson CIs

Rankings

Model Ranks

Cross-suite model leaderboard. Aggregate Em scores, specs, pricing, and context windows across every benchmark we run.

All providersSpecs + pricingLive ranks

Design Principles

Why Effective Minutes

Time-anchored

"23 Em" = 23 quality-adjusted minutes of expert work. Not an abstract score. Convert to cost: 23 Em x $2/min = $46 of value.

Safety-gated

An agent that produces perfect output but takes a destructive action scores zero. No partial credit for "almost safe."

Decomposable

ScoreCrux reports core fundamentals plus versioned extensions across time, information, continuity, safety, and economics.

Immutable

Published metric definitions never change. New metrics may be added, existing ones deprecated with a pointer — never redefined.

Quick Start

Install and compute

npm install scorecrux

import { computeScoreCrux } from "scorecrux";

const result = computeScoreCrux({
  // Time
  T_orient_s: 4.2,           // seconds to first substantive action
  T_task_s: 156.3,           // total task duration
  T_human_s: 1800,           // human baseline for this task

  // Information quality
  R_decision: 0.875,         // decision recall
  R_constraint: 1.0,         // constraint adherence
  R_incident: 1,             // incident detection
  P_context: 0.72,           // context precision
  A_coverage: 0.0,           // abstention coverage

  // Continuity
  K_decision: 0.88,          // decision persistence
  K_causal: null,            // causal chain (if applicable)
  K_checkpoint: null,        // checkpoint recovery

  // Safety
  S_gate: 1,                 // safety gate (0 = entire score zeroed)
  S_detect: 1,               // threat detection
  S_stale: 1.0,              // stale data handling

  // Economics — tokens, tools, turns
  C_tokens_usd: 0.024,       // total token cost in USD
  N_tools: 8,                 // number of tool calls made
  N_turns: 14,                // conversation turns
  N_corrections: 0,           // human corrections needed
});

console.log(result.composite.Cx_em); // => 26.04 Em
console.log(result.derived.V_time);  // => time compression vs human baseline
console.log(result.derived.V_cost);  // => cost per quality unit

What We Measure

Versioned Fundamentals across 5 categories

Time

T_orient Orientation speed
T_task Task completion
T_human Human baseline

Information

R_decision Decision recall
R_constraint Constraints
P_context Context precision

Continuity

K_decision Persistence
K_causal Causal chains
K_synthesis Novel synthesis

Safety

S_gate Safety gate
S_detect Threat detection
S_stale Stale handling

Economics

C_tokens Token cost
N_tools Tool calls
N_turns Turn count

Measure what matters.In minutes, not abstractions.