Within-session stability sub-scores

A long agent session can quietly get worse. The model starts thrashing between tools, loops on a call that keeps failing, or burns more and more tokens without actually finishing anything. None of that shows up as an HTTP error, so a status-code check sails right past it. The community named this the "work feels worse but no metric moved" problem.

The work is seeded by the Agent Stability Index (arXiv:2601.04170). The implementation is the pure function mcptest_core::eval::stability::agent_stability, which folds one multi-turn agent trace into four deterministic sub-scores, plus drift_flags, which turns those scores into an assertable signal.

These sub-scores complement, they do not replace, the cross-run model-compat drift report (mcptest model-baseline diff). The model-compat diff compares two captured baselines and tells you the model changed between runs. The stability sub-scores look inside a single run and tell you the session degraded as it went. You want both: one watches the model across releases, the other watches the session across turns.

Run this example. examples/agent-stability.yml declares the stability: gate described below. It runs the same agent prompt several times and asserts that the session stays stable across the runs.

ANTHROPIC_API_KEY=... mcptest run --record --config examples/agent-stability.yml

What it reads

The function operates on the serialized trace envelope, the JSON shape ConversationTrace::to_envelope produces:

{
  "tool_calls": [{ "name": "search", "server": "docs", "args": { "q": "a" } }],
  "conversation": {
    "tokens": { "total": 3000 },
    "turns": [{ "role": "assistant", "content": "..." }]
  }
}

It tolerates missing or short traces. A one-turn or empty trace has no within-session progression to measure, so every sub-score reports 1.0 (trivially stable). Missing fields are read conservatively rather than treated as errors.

The four sub-scores

Each is a heuristic in 0.0..=1.0 where higher means more stable. They are heuristics, not a model's judgement: read a low score as "look here," not as proof of a regression.

Sub-score	What it measures	Formula
`tool_usage_stability`	How concentrated the tool repertoire is across the call sequence. Low when the agent thrashes across a fresh tool on nearly every call.	`1 - (distinct_tools - 1) / (total_calls - 1)`, clamped to `0.0..=1.0`. `1.0` with fewer than two calls.
`response_consistency`	Whether assistant-turn lengths stay a similar size. Low when answers swing from a sentence to a wall of text.	`1 - min(1, cv)` where `cv` is the coefficient of variation (standard deviation over mean) of the assistant-turn character counts. `1.0` with fewer than two assistant turns or an all-empty set.
`redundancy`	The fraction of tool calls that are NOT exact repeats. Low when the agent loops the same failing call.	`distinct_calls / total_calls`, keyed on `(name, server, canonical-args)`. `1.0` when there are no tool calls.
`cost_per_progress`	Token spend relative to useful (distinct) tool calls. Low when the run burns budget without making progress.	`target / max(target, tokens / distinct_calls)` with `target = 2000` tokens per useful call. `1.0` when no tokens were spent; `0.0` when tokens were spent against zero distinct calls.

A few deliberate choices worth knowing:

A repeated call is not progress. Both redundancy and cost_per_progress count only distinct calls as useful work. A loop drags down both, because re-issuing the same call is neither novel nor productive.
Arguments are compared by canonical JSON. Two calls whose argument bags differ only in key order collide as one distinct call, so a model that re-emits the same call with shuffled keys is still caught as a repeat.
The 2000-token target is a relative anchor, not a price. It only sets where cost_per_progress starts to decay. It is not a claim about what a call should cost.

Gating on it from YAML

A single run can look stable by luck, so the YAML surface aggregates the sub-scores across the runs: independent runs of a multi-run agent test and exposes three assertable targets. You gate them with the same matchers you use everywhere else (schema with minimum / maximum, exact, not, ...), so there is one assertion grammar to learn and a failed assertion fails the test and sets a non-zero exit, exactly like a tool assertion.

Target	Meaning	How it is computed across the N runs
`stability.score`	The headline "how stable were the runs on average" number, in `0.0..=1.0`.	The mean of each run's `weakest_score` (the weakest of that run's four sub-scores).
`stability.weakest_score`	The strictest gate: the single worst run's weakest dimension, in `0.0..=1.0`. One bad run drives it down.	The minimum of each run's `weakest_score` across every run.
`stability.variance`	How much the runs swung between stable and degraded. Near `0.0` when every run was equally stable.	The population variance of each run's `weakest_score`. "Variance" here is the standard statistical variance, the mean of the squared deviations from the mean.

All three reduce the per-run weakest_score (defined above: the lowest of a single run's four sub-scores). So stability.* is a function of the four sub-scores folded once per run, then summarized across runs.

The smallest useful form omits expect: and gets the default gate, stability.weakest_score >= 0.5:

agents:
  - name: weather session stays stable
    model: claude-sonnet-4-5
    servers: [weather]
    prompt: What is the weather in Sacramento?
    runs: 5
    stability: {}        # default gate: stability.weakest_score >= 0.5

Write an explicit expect: to gate any of the three targets. This run must be stable on average, never drop a single run below the half-degraded line, and not swing run to run:

agents:
  - name: weather session is stable on every dimension
    model: claude-sonnet-4-5
    servers: [weather]
    prompt: What is the weather in Sacramento?
    runs: 5
    stability:
      expect:
        - target: stability.score
          matcher: { schema: { minimum: 0.7 } }
        - target: stability.weakest_score
          matcher: { schema: { minimum: 0.5 } }
        - target: stability.variance
          matcher: { schema: { maximum: 0.05 } }

The gate prints one line per (agent, model) cell and folds one PASS/FAIL row into the run report, so the exit code and every reporter (pretty, json, junit) see it next to the per-run rows.

A few rules:

Gating is opt-in. Omit the stability: block and the agent test behaves as before; nothing is computed or asserted.
It needs a sample. The targets aggregate the N runs, so a stability: block on a single-run agent test (runs: 1, or the default) is a load error: a sample of one has no variance and no cross-run weakest score. Set runs: to at least 2.
The matchers are deterministic. The gate reuses the deterministic matcher dispatch, so an LLM matcher (llm-jury, llm-judge, similar) inside a stability: expect: is an error, not a silent pass.

The Rust API underneath

The YAML gate is a thin wrapper over two pure functions. The single-trace folding is agent_stability; the cross-run aggregation is agent_stability_across, which produces the three numbers the targets resolve against.

drift_flags(&report, &floors) returns the names of the sub-scores that fell strictly below their configured floor. An empty vector means no dimension drifted. Because the comparison is strict, a floor of 0.0 never fires and a floor of 1.0 flags anything short of perfect.

use mcptest_core::eval::{agent_stability, drift_flags, StabilityFloors};

let report = agent_stability(&trace_envelope);
let flags = drift_flags(&report, &StabilityFloors::default());
assert!(flags.is_empty(), "within-session drift on: {flags:?}");

The default floors are a conservative 0.5 on every sub-score: a score under half means the dimension is more degraded than stable. Tighten or loosen per dimension by constructing StabilityFloors directly. This 0.5 is the same line the default YAML gate uses for stability.weakest_score.

StabilityReport::weakest_score returns the lowest of the four for a single headline number, and StabilityReport::summary renders all four on one line for an operator-facing log.

What it does not do

This is a deterministic subset of the twelve dimensions in the Agent Stability Index. It does not run a model in the loop, so it says nothing about reasoning quality, factual coherence, or inter-agent agreement. It measures the shape of the trace, not the meaning of the content. For semantic judgement, pair it with the LLM-judge matcher; for cross-release regressions, pair it with the model-compat baseline diff.