Agent Effectiveness Metric Standard

Measure what matters.
In minutes, not abstractions.

ScoreCrux measures AI agent sessions in Effective Minutes — quality-adjusted minutes of expert work replaced by the agent, gated on safety. An unsafe session is always worth zero.

Agent Effectiveness

  • 0 Em: Unsafe (safety gate failed)
  • <1 Em: Low (trivial or poor quality)
  • 1–10 Em: Routine (reasonable quality)
  • 10–60 Em: Significant (expert work replaced)
  • >60 Em: Complex (1+ hours replaced)
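
The bands above can be sketched as a small lookup. This is an illustrative helper, not part of the scorecrux package: the names `EmBand` and `classifyEm` are hypothetical, and the treatment of the shared boundaries (exactly 10 Em and exactly 60 Em) is an assumption, since the scale does not say which band they belong to.

```typescript
// Band labels come from the scale above; the type and function names are hypothetical.
type EmBand = "Unsafe" | "Low" | "Routine" | "Significant" | "Complex";

function classifyEm(em: number, safetyGatePassed: boolean): EmBand {
  if (!safetyGatePassed) return "Unsafe"; // gate failed: the session is worth 0 Em
  if (em < 1) return "Low";
  if (em < 10) return "Routine";          // assumption: exactly 10 Em counts as Significant
  if (em <= 60) return "Significant";     // assumption: exactly 60 Em is still Significant
  return "Complex";
}

console.log(classifyEm(26.04, true));  // "Significant"
console.log(classifyEm(26.04, false)); // "Unsafe"
```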

Test Suites

Six ways to measure

Design Principles

Why Effective Minutes

Time-anchored

"23 Em" = 23 quality-adjusted minutes of expert work. Not an abstract score. Convert to cost: 23 Em x $2/min = $46 of value.

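The cost conversion under the Time-anchored principle is a single multiplication. A minimal sketch, assuming a flat per-minute expert rate; `emToUsd` is a hypothetical helper and is not exported by scorecrux.

```typescript
// Hypothetical helper: converts Effective Minutes to dollars at a flat expert rate.
function emToUsd(em: number, expertRateUsdPerMin: number): number {
  if (!Number.isFinite(em) || em < 0) {
    throw new RangeError("em must be a non-negative finite number");
  }
  return em * expertRateUsdPerMin;
}

console.log(emToUsd(23, 2)); // 46
```

The rate is passed in rather than hard-coded because the $2/min figure is only the example rate used in the text.
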
Safety-gated

An agent that produces perfect output but takes a destructive action scores zero. No partial credit for "almost safe."
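
The gate behaves as a hard multiplier rather than a weighted term. A minimal sketch of that idea, assuming the gate is a 0/1 flag as in the `S_gate` input; `applySafetyGate` is a hypothetical name, and the real composite calculation inside scorecrux is not shown here.

```typescript
// Hypothetical helper: a failed gate zeroes the score; a passed gate leaves it untouched.
function applySafetyGate(qualityAdjustedMinutes: number, sGate: 0 | 1): number {
  return sGate === 1 ? qualityAdjustedMinutes : 0;
}

console.log(applySafetyGate(26.04, 1)); // 26.04
console.log(applySafetyGate(26.04, 0)); // 0
```

Typing the flag as `0 | 1` rules out fractional values at compile time, which matches the "no partial credit" rule.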

Decomposable

ScoreCrux reports core fundamentals plus versioned extensions across time, information, continuity, safety, and economics.

Immutable

Published metric definitions never change. New metrics may be added, and existing ones may be deprecated with a pointer to a replacement, but they are never redefined.
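
One way to picture this rule is an append-only registry. The sketch below is purely illustrative: `MetricDefinition`, `MetricRegistry`, and the `T_task_v2` identifier are hypothetical and do not describe how ScoreCrux stores its definitions.

```typescript
// Hypothetical sketch of an append-only registry. None of these names come from ScoreCrux.
interface MetricDefinition {
  readonly id: string;
  readonly description: string;
  readonly deprecatedInFavorOf?: string; // pointer to the replacement metric, if any
}

class MetricRegistry {
  private readonly metrics = new Map<string, MetricDefinition>();

  // Adding a new metric is allowed; redefining a published id is not.
  add(id: string, description: string): void {
    if (this.metrics.has(id)) {
      throw new Error(`Metric "${id}" is already published and cannot be redefined`);
    }
    this.metrics.set(id, { id, description });
  }

  // Deprecation keeps the original description and only attaches a pointer.
  deprecate(id: string, replacementId: string): void {
    const existing = this.metrics.get(id);
    if (existing === undefined) throw new Error(`Unknown metric "${id}"`);
    if (!this.metrics.has(replacementId)) {
      throw new Error(`Unknown replacement "${replacementId}"`);
    }
    this.metrics.set(id, { ...existing, deprecatedInFavorOf: replacementId });
  }

  get(id: string): MetricDefinition | undefined {
    return this.metrics.get(id);
  }
}

const registry = new MetricRegistry();
registry.add("T_task", "Task completion");
registry.add("T_task_v2", "Hypothetical successor metric");
registry.deprecate("T_task", "T_task_v2");
console.log(registry.get("T_task")?.deprecatedInFavorOf); // "T_task_v2"
```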

Quick Start

Install and compute

npm

npm install scorecrux

TypeScript — full input including tokens and tools

import { computeScoreCrux } from "scorecrux";

const result = computeScoreCrux({
  // Time
  T_orient_s: 4.2,           // seconds to first substantive action
  T_task_s: 156.3,           // total task duration
  T_human_s: 1800,           // human baseline for this task

  // Information quality
  R_decision: 0.875,         // decision recall
  R_constraint: 1.0,         // constraint adherence
  R_incident: 1,             // incident detection
  P_context: 0.72,           // context precision
  A_coverage: 0.0,           // abstention coverage

  // Continuity
  K_decision: 0.88,          // decision persistence
  K_causal: null,            // causal chain (if applicable)
  K_checkpoint: null,        // checkpoint recovery

  // Safety
  S_gate: 1,                 // safety gate (0 = entire score zeroed)
  S_detect: 1,               // threat detection
  S_stale: 1.0,              // stale data handling

  // Economics — tokens, tools, turns
  C_tokens_usd: 0.024,       // total token cost in USD
  N_tools: 8,                // number of tool calls made
  N_turns: 14,               // conversation turns
  N_corrections: 0,          // human corrections needed
});

console.log(result.composite.Cx_em); // => 26.04 Em
console.log(result.derived.V_time);  // => time compression vs human baseline
console.log(result.derived.V_cost);  // => cost per quality unit
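
To make "time compression vs human baseline" concrete, the sketch below divides the human baseline by the agent's task time, using the sample inputs above. This ratio is an assumption made for illustration: the exact formula behind `V_time` is not stated here, and `timeCompression` is a hypothetical function, not a scorecrux export.

```typescript
// Assumption: time compression = human baseline / agent task time.
// The actual V_time formula in scorecrux may differ.
function timeCompression(tHumanS: number, tTaskS: number): number {
  if (tTaskS <= 0) throw new RangeError("tTaskS must be positive");
  return tHumanS / tTaskS;
}

const ratio = timeCompression(1800, 156.3); // values from the sample input above
console.log(`${ratio.toFixed(1)}x faster than the human baseline`); // "11.5x faster than the human baseline"
```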

What We Measure

Versioned Fundamentals across 5 categories

Time

  • T_orient Orientation speed
  • T_task Task completion
  • T_human Human baseline

Information

  • R_decision Decision recall
  • R_constraint Constraints
  • P_context Context precision

Continuity

  • K_decision Persistence
  • K_causal Causal chains
  • K_synthesis Novel synthesis

Safety

  • S_gate Safety gate
  • S_detect Threat detection
  • S_stale Stale handling

Economics

  • C_tokens Token cost
  • N_tools Tool calls
  • N_turns Turn count