Reliability and trace metrics
These are the deterministic, model-free gates that score whether an agent's execution holds up: did it call the right tools in the right order, did it stay stable across runs and across a long session, did it recover from a dead tool, and did its closing story match what it actually did. Every metric here reads a recorded trace and produces a number (or a pass/fail) you gate on in CI. The checks run no model, so a recorded run replays the same numbers offline, free, and byte-stable.
Each section follows the same shape: what the metric measures, the YAML block that turns it on, and how to read the result. For the companion tool-selection and surface gates (selection F1, distractors, name-free discovery, token efficiency, description quality) see Tool-selection and surface metrics. For when to reach for a deterministic gate versus a model-graded judge, see Evaluation: judges, juries, and rubrics.
A note on the foundation underneath all of these: an mcptest verdict is decided by observable evidence, never by a model's narration of what it did. The observable-evidence oracle at the end of this page is the rule the rest of the gates rest on.
Offline trace validation (trajectory:, trajectory_axes:, golden_path:)
What it measures. Whether the agent called the right functions, with the right argument shapes, in the right order, checked entirely against the recorded trace with no model in the loop. Once you have recorded an agent run into a cassette, you usually do not want to call the model again on every CI run; the model is slow, costs money, and gives a slightly different answer each time. The trajectory check is cheaper and more stable. The pattern is borrowed from the Berkeley Function-Calling Leaderboard (BFCL, ICML 2025), whose AST check parses an emitted call into a structure (function name plus arguments) and matches it against an expected call without executing anything. mcptest applies the same idea to a recorded MCP trace, where each call is already a structured object (name, server, args).
What you assert. An expected trace is an ordered list of expected calls plus a match mode. Each expected call pins a function name (it must equal the recorded tool_calls[i].name, without the <server>__ wire prefix) and an argument shape:
| Args shape | Meaning | Backed by |
|---|---|---|
any | Arguments may be anything, including absent. Pins the name only. | nothing |
ignore | Skip argument checking for this call (a deliberate "do not look at the args here" for noisy arguments). | nothing |
exact | Recorded args must deep-equal this value. | deep equality |
subset | Recorded args must be a superset (object-subset, multiset arrays). Extra recorded keys are fine. | the contains matcher |
schema | Recorded args must validate against this JSON Schema document. | the schema matcher |
Reusing contains and schema means a malformed JSON Schema surfaces as an error (not a silent pass), and the per-call diffs are the same AssertionDiff shape every other matcher and reporter already speaks.
The five match modes decide how the expected (reference) calls line up against the recorded calls:
- strict (alias exact-sequence): the recorded calls must match the reference one-for-one. Lengths equal, every position matches, order matters, trailing extras fail. The tightest mode; use it when the exact call plan is the contract.
- subsequence: every reference call must appear, in reference order, somewhere among the recorded calls, but the model may interleave other calls between them.
- unordered: every reference call must appear, matched one-to-one against distinct recorded calls, in any order; extra recorded calls allowed.
- superset: at least the reference calls must be present, in any order; extras allowed. Same pass/fail outcome as
unordered; the name signals "the reference is a lower bound." - subset: the recorded calls must be a subset of the reference calls (the over-calling / wasted-call detector). It fails the moment the trace makes a call beyond the reference set, and tolerates the trace calling fewer.
An empty reference list is trivially satisfied under every mode except subset, where "no reference calls" means "no calls allowed," so a non-empty trace fails and an empty trace passes.
The YAML. The validator and the golden-path scorer are agent-test gates, so you write the expectation in a .yml suite and mcptest run enforces it with no Rust. Both score the recorded trace with no model in the loop: record the run once with --record and every replay is deterministic, free, and offline.
agents:
- name: trajectory pins the call plan
model: claude-sonnet-4-5
servers: [search]
runs: 3
prompt: What is the weather in Sacramento?
trajectory:
mode: subsequence
calls:
- name: get_weather
args:
subset: { city: Sacramento }
golden_path:
calls: [get_weather]
trajectory_axes:
# fetch_page must run after the search that produced its URL.
dependencies:
- producer: search
consumer: fetch_page
# authentication must precede any search.
order:
- first: authenticate
second: search
How to read it. A trajectory: block scores two targets: trajectory.passed (1 when every expected call matched under the mode, else 0) and trajectory.mismatch_count. An omitted expect: applies the default gate trajectory.passed >= 1. On a failure, each mismatch names the expected-call index, the recorded-call index it was looking at (or none when the trace ran out), the per-call diffs, and a one-line reason.
A trajectory_axes: block drops the exact-call pin and scores only two ordering constraints: trajectory.dependency_satisfaction (a consumer ran after the producer whose output it reads) and trajectory.order_satisfaction (a first call preceded a second), each a 0..=100 percent over the declared edges. An empty axis scores 100, and an omitted expect: requires both axes at 100. Reach for it when the agent may legitimately add steps, retry, or reorder unrelated calls, and the only thing that must hold is the data-flow: where trajectory: would flag those extra or reordered steps as mismatches, trajectory_axes: passes any order-respecting trace.
A golden_path: block scores the same trace for waste. It compares the recorded tool sequence against an ideal sequence and counts extra_steps (calls beyond the golden path length), backtracks (a return to a tool used earlier that is not the immediately preceding tool), and repeated_tools (a consecutive duplicate), folded into a penalty multiplier. The penalty is 1.0 / (1.0 + 0.5 * w) where w is the sum of the enabled waste counts, so it is 1.0 exactly when w == 0 and decreases monotonically toward 0.0 as waste grows. It scores golden_path.penalty (in 0.0..=1.0, 1.0 is no waste), plus golden_path.passed, extra_steps, backtracks, and repeated_tools. An omitted expect: applies golden_path.passed >= 1. All three counts are always reported even when a dimension's penalty is switched off, so a reporter shows the full picture. Trajectory asks "did the right calls happen?"; golden path asks "did they happen efficiently?"
The Rust functions (validate_trace, score_golden_path, tool_calls_from_envelope) stay available for SDK hosts that embed mcptest-core directly; validate_trace returns Err only on a structural problem with an expectation (a malformed JSON Schema), and a normal shape mismatch is a successful validation that returns passed = false. A runnable example is in examples/offline-trace-validation.yml.
Within-session stability (stability:)
What it measures. Whether a long agent session quietly gets worse: the model thrashing between tools, looping on a failing call, or burning more and more tokens without finishing anything. None of that shows up as an HTTP error, so a status-code check sails right past it. The work is seeded by the Agent Stability Index (arXiv:2601.04170). These sub-scores complement the cross-run model-compat drift report: that diff tells you the model changed between runs, while the stability sub-scores look inside a single run and tell you the session degraded as it went.
The function folds one multi-turn trace into four deterministic sub-scores, each in 0.0..=1.0 where higher means more stable. They are heuristics, not a model's judgement: read a low score as "look here," not as proof of a regression.
| Sub-score | What it measures | Formula |
|---|---|---|
tool_usage_stability | How concentrated the tool repertoire is across the call sequence. Low when the agent thrashes across a fresh tool on nearly every call. | 1 - (distinct_tools - 1) / (total_calls - 1), clamped. 1.0 with fewer than two calls. |
response_consistency | Whether assistant-turn lengths stay a similar size. Low when answers swing from a sentence to a wall of text. | 1 - min(1, cv) where cv is the coefficient of variation of the assistant-turn character counts. 1.0 with fewer than two assistant turns. |
redundancy | The fraction of tool calls that are NOT exact repeats. Low when the agent loops the same failing call. | distinct_calls / total_calls, keyed on (name, server, canonical-args). 1.0 with no tool calls. |
cost_per_progress | Token spend relative to useful (distinct) tool calls. Low when the run burns budget without making progress. | target / max(target, tokens / distinct_calls) with target = 2000 tokens per useful call. |
A repeated call is not progress: both redundancy and cost_per_progress count only distinct calls as useful work. Arguments are compared by canonical JSON, so a call re-emitted with shuffled keys is still caught as a repeat. The 2000-token target only sets where cost_per_progress starts to decay; it is not a claim about what a call should cost. A one-turn or empty trace has no within-session progression to measure, so every sub-score reports 1.0.
The YAML. A single run can look stable by luck, so the gate aggregates the sub-scores across the runs: independent runs and exposes assertable targets. The smallest form omits expect: and gets the default gate stability.weakest_score >= 0.5:
agents:
- name: weather session stays stable
model: claude-sonnet-4-5
servers: [weather]
prompt: What is the weather in Sacramento?
runs: 5
stability: {} # default gate: stability.weakest_score >= 0.5
Write an explicit expect: to gate any target. This run must be stable on average, never drop a single run below the half-degraded line, and not swing run to run:
agents:
- name: weather session is stable on every dimension
model: claude-sonnet-4-5
servers: [weather]
prompt: What is the weather in Sacramento?
runs: 5
stability:
expect:
- target: stability.score
matcher: { schema: { minimum: 0.7 } }
- target: stability.weakest_score
matcher: { schema: { minimum: 0.5 } }
- target: stability.variance
matcher: { schema: { maximum: 0.05 } }
How to read it. Six assertable targets. The first three reduce each run's weakest_score (the lowest of that run's four sub-scores), then summarize across runs:
| Target | Meaning | Across the N runs |
|---|---|---|
stability.score | The headline "how stable on average" number. | The mean of each run's weakest_score. |
stability.weakest_score | The strictest gate: the single worst run's weakest dimension. | The minimum of each run's weakest_score. |
stability.variance | How much the runs swung between stable and degraded. | The population variance of each run's weakest_score. |
Three more measure consistency between the runs, answering "did the agent take the same path every time?", which a per-run score cannot see:
| Target | Meaning | Across the N runs |
|---|---|---|
stability.tool_sequence_similarity | How alike the runs' tool-call orders are. 1.0 when every run called the same tools in the same order. | Mean pairwise longest-common-subsequence ratio over the per-run tool-name sequences. |
stability.argument_consistency | Whether aligned calls reused the same arguments. 1.0 when they always did. | For each run pair, the fraction of same-tool positions with byte-identical argument objects, averaged. |
stability.early_divergence | 1 when the runs tend to split apart in the first two steps, 0 otherwise. | 1 when a strict majority of diverging run pairs split at step index 0 or 1. |
These three are not in the default gate; assert them explicitly when reproducibility across runs is what you care about. A flaky agent that reaches the goal a different way each time clears the per-run sub-scores but shows a low tool_sequence_similarity. The gate is opt-in (omit the block and nothing is computed), needs a sample (a stability: block on a single-run test is a load error; set runs: to at least 2), and uses deterministic matchers only (an LLM matcher inside a stability: expect: is an error, not a silent pass). A runnable example is in examples/agent-stability.yml.
This is a deterministic subset of the twelve dimensions in the Agent Stability Index. It measures the shape of the trace, not the meaning of the content. For semantic judgement, pair it with an LLM-judge matcher; for cross-release regressions, pair it with the model-compat baseline diff.
Reliability reporting (pass@k, pass^k, decay)
What it measures. How much confidence k runs actually buy you, and how many runs you need to trust the result. One green run does not prove a non-deterministic agent is reliable: it might have passed by luck. The headline reliability metrics stay pass@k (optimistic: at least one of k runs passed) and pass^k (pessimistic: every one of k runs passed). Reliability reporting adds two deterministic helpers on top, both pure arithmetic.
Power-analysis run-count recommendation. How many runs N do you need so the estimated pass rate has a confidence interval no wider than a target half-width? The recommendation uses the standard normal-approximation (Wald) interval for a binomial proportion. For an observed pass rate p over N runs, the half-width is `z
- sqrt(p (1 - p) / N)
, where z is the standard-normal critical value (1.96 for 95 percent, 1.645 for 90, 2.576 for 99). The productp (1 - p)is largest at p = 0.5, so solving at the worst case guarantees the half-width for any observed rate:N = ceil( (z / half_width)^2 0.25 ). Given a chosen N instead, the worst-case half-width isz sqrt(0.25 / N)`.
For a 5 percent half-width at 95 percent confidence, N = ceil( (1.96 / 0.05)^2 * 0.25 ) = 385 runs. Run 100 times instead and the worst-case half-width is 1.96 * sqrt(0.25 / 100) = 0.098, about 10 percent.
Reliability-decay summary. Given the N per-run pass/fail outcomes of one multi-run test, three secondary statistics:
- Reliability Decay Curve: the cumulative pass^k as k goes 1..=N, a small vector of integer percents. Entry k is
(c_k / k)^k * 100where c_k is the count of passes among the first k runs. For outcomespass, pass, pass, failthe curve is[100, 100, 100, 31]. - Variance Amplification Factor: the population standard deviation of the per-run pass indicators divided by its theoretical maximum 0.5, as an integer percent. A stable run (all pass or all fail) scores 0; an even split scores 100.
- Graceful Degradation Score: 100 when every run passed, decaying as failures cluster late rather than early. Each run's pass is weighted by its position, so a late failure costs more than an early one. For four runs,
fail, pass, pass, passscores 90 andpass, pass, pass, failscores 60.
How to read it. The summary exposes these assertable targets through the same dot-path resolver the other eval reports use:
reliability.runsreliability.pass_at_k(surfaced as 0 or 100 so a numeric matcher can gate it)reliability.passhat_kreliability.variance_amplificationreliability.graceful_degradation
These helpers ship as a pure library surface in mcptest-core (eval::reliability: recommend_runs, confidence_band, summarize, and ReliabilitySummary). They have no model and no I/O. Wiring them into a suite gate and emitting the summary from the runner is preview, not yet a stable suite block.
Fault injection and recovery (faults: and recovery:)
What it measures. How well an agent recovers when a server does not answer. A server can be down (connection refused), return a bad response, or be unresponsive: reachable, connection open, but the call never comes back and no protocol error is ever returned. That last one is the hardest for an agent to handle and the easiest to skip in testing, because nothing fails loudly. The pattern follows the offline failure-injection work in ToolMisuseBench (arXiv:2604.01508): a reliable agent is not the one that never hits a dead tool, it is the one that notices quickly, stops hanging, and routes around the failure.
The fault kinds. Launch a mock with a fault baked in (mcptest mock --tools-from ./tools.yml --fault hang) or declare the fault in a suite. --fault takes one of:
| Kind | Effect |
|---|---|
none | Healthy server (the default). |
hang | Every tools/call never returns. Reachable but unresponsive: the canonical "dead but connected" fault. |
wedged | Same agent-visible effect as hang. A separate name to record the intent (a deadlocked backend vs a stalled network). |
slow:<ms> | The call answers, but only after <ms> milliseconds. Pushes a call past a timeout budget without an unbounded block. |
recover-after:<n> | The first n calls hang, then the server recovers. Exercises an agent that retries against a backend that comes back. |
reply-after-cancel:<ms> | The call answers after <ms> even if the request was cancelled: the non-conformant cancel-after-completion race. |
The fault only touches tools/call. initialize, tools/list, and the resources surface stay responsive, so an agent can connect and discover tools and only then hit the hang on the call itself.
The YAML. You do not have to launch a faulty mock; declare the fault in the suite and inject it into an agent test:
faults:
- name: hung-search
target: { tool: search } # a tool, a server, or both
kind: hang # hang | wedged | slow (+ delay_ms) | recover_after (+ failures)
agents:
- name: recovers from a hung search tool
model: claude-sonnet-4-5
servers: [faulty]
inject: [hung-search]
prompt: Search for the latest incident report.
recovery:
max_detection_ms: 3000 # per-call timeout budget
max_recovery_ms: 5000 # total recovery budget (omit for unbounded)
require_clean_timeout: true
When a tool call matches an injected fault, the executor synthesizes the unresponsive timeout as a directive instead of dispatching to the server, so nothing blocks: a recovery test is deterministic and CI-bounded with no wall-clock sleep. The hang is scored in virtual time, where each synthesized timeout consumes one max_detection_ms budget. A target with no tool matches every tool on the named server; a target with no server matches the named tool on any server.
How to read it. The scoring (score_recovery in mcptest_core::eval::recovery) folds the millisecond marks into a report:
time_to_detection_ms/time_to_recovery_ms: call start to detection or recovery, or none if it never happened.user_visible_hang_ms: how long the user saw the call hang. It ends the moment the agent detects the fault or recovers, whichever is earlier; when neither happens it is capped at the timeout budget.recovered: true when the agent recovered.clean_timeout: true when the agent detected the fault at or before its timeout budget. This is the key distinction: an agent that times out at its 3s budget timed out cleanly; one that hung for 9s before noticing did not, even though both eventually detected the fault.
The gate drives pass/fail. The run fails as a quality failure (exit 1) when the agent did not recover, took longer than max_recovery_ms, or, under require_clean_timeout, did not time out within its detection budget. A dangling inject: name or a connect failure is an infra error (exit 2), caught at load time. Fault injection belongs to the EDGE compliance family alongside the other edge-case probes; the orchestration.error_recovery sub-score remains the coarse gate, and the recovery: matcher is the fine-grained path for a named hang. A runnable example is in examples/fault-injection-recovery.yml.
Narrative-vs-trace divergence (narrative:)
What it measures. Whether the agent's closing story matches what it actually did. An agent ends a run by telling you what it did ("I created the issue and notified the team"), and that summary is the part a human reads and trusts, but nothing checks it against the trace. Offline trace validation checks the calls; within-session stability checks how steadily they happened; neither checks whether the final story matches the trace. This is the gap the MCP Pitfall Lab paper studied (Narrative-vs-Trace Divergence in Agent Evaluation, arXiv:2604.21477), which found the gap in most runs. The check closes it with a deterministic, model-free comparison: it reads the final assistant message, reads the recorded tool calls, and reports where the two disagree.
The three divergence categories.
- claimed-but-absent: the narrative asserts an action that has no matching tool call. The agent said it did something it never did.
- present-but-unclaimed: a tool call the narrative never mentions. The agent did something it never told you about. This matters most for mutating calls (a silent delete).
- arg-mismatch: the narrative names an argument value that disagrees with the recorded call.
The default extraction mode is RuleBased: deterministic, no model, working on plain tokens. It scans the narrative for a mutating verb (a fixed list: create, update, delete, remove, send, write, post, insert, set, put, patch, publish, destroy, drop, add, edit, upload, merge, close, cancel, approve, revoke) and pairs it with the next non-stopword noun, yielding an action stem like create_issue. A claimed action matches a recorded call when every token of the action appears among the tool's tokens. A recorded call is "mentioned" when every salient token of its name (length 3 or more) appears in the text. A tool counts as mutating when its name contains a mutating verb as a whole token; two per-suite overrides adjust this (mutating_tools always treats a name as a write, readonly_tools never does, and the never set wins).
The YAML. The smallest form omits expect: and gets the default gate (fail on any claimed-but-absent mutating action):
agents:
- name: triage agent tells the truth
model: claude-sonnet-4-5
servers: [issues]
prompt: Triage the CI flake issue.
narrative: {} # default gate: fail on claimed-but-absent mutating action
Write an explicit expect: to assert any target. This run must claim nothing it did not do and must stay under a divergence ceiling:
agents:
- name: triage agent claims only what it did
model: claude-sonnet-4-5
servers: [issues]
prompt: Triage the CI flake issue.
narrative:
mutating_tools: [run_job] # always treat run_job as a write
readonly_tools: [post_search] # never treat post_search as a write
max_divergence_score: 0.25
expect:
- target: narrative.claimed_but_absent
matcher: { schema: { maximum: 0 } }
- target: narrative.present_but_unclaimed
matcher: { schema: { maximum: 1 } }
How to read it. The check exposes five targets:
| Target | Meaning |
|---|---|
narrative.divergence_score | The normalized score, 0 to 1, higher is more divergent (flagged items over recorded calls plus claimed actions, capped at 1). |
narrative.claimed_but_absent | Count of claimed-but-absent items. |
narrative.present_but_unclaimed | Count of present-but-unclaimed items. |
narrative.arg_mismatch | Count of arg-mismatch items. |
narrative.gate_passed | 1 when the gate passes, 0 when it fails. |
The default gate fails on any claimed-but-absent mutating action, the case the paper found most often and the one that misleads a human reader the most. It leaves read-only mismatches and silent calls as reportable signal without failing the build. Set max_divergence_score (a number 0 to 1) to also fail when the score exceeds it. An optional LlmAssisted mode (llm_assisted: true) adds advisory signal for paraphrases the token matcher cannot catch, but two things always hold: the default is objective, and the CI gate never calls a model. Turning on llm_assisted does not change what the gate decides.
This is a shape-and-token check, not a meaning check. It catches a missing call, a silent call, and a stated value that disagrees; it does not judge whether the narrative is a good summary. For that, pair it with an LLM-judge matcher. The narrative-vs-trace check is the deterministic floor: the agent's story has to match its trace before anything else is worth asking.
The observable-evidence oracle
What it is. The rule the rest of these gates rest on: a security or agent verdict in mcptest is decided by observable evidence, never by a model's narration of what it did. When an agent runs against a server it produces two kinds of output. There are the observable artifacts (the tool calls it made, the arguments it sent, the results the server returned, the status codes, the payload shapes) and there is the narration (the model's own prose, where it says what it believes happened).
The threat. The narration can be confidently wrong. A model can write "done, the file was saved" while having made no tool call, or "authorization granted" while the server returned an error. A grader that reads the narration can be talked into a passing verdict the run did not earn. The research measures exactly this: rewriting only the chain-of-thought of a trajectory, holding actions and observations fixed, inflates judge false-positive rates by up to about 90 percent (Gaming the Judge, arXiv:2601.14691). Trivial token tricks push reward-model false positives to 80 percent (One Token to Fool, arXiv:2507.08794), and LLM judges are not reliable at measuring adversarial robustness at all (A Coin Flip for Safety, arXiv:2603.06594).
The rule. Grade against the observed tool result, not the narrated outcome:
- A success oracle targets an observable artifact:
tool_calls[i].name,tool_calls[i].args,tool_results[i], a returned payload's schema, or a status code. - Model-generated chain-of-thought or final-response narration, if it is passed to a grader at all, is untrusted. It never decides a verdict.
- For security verdicts the oracle is fully deterministic: no model is in the pass-or-fail path. An LLM may add an advisory signal, but advisory findings never flip a deterministic grade.
mcptest's matchers already make this the easy path. A matcher targets a value selected from the trace by path, so a suite author writes the assertion against the observable artifact directly:
agents:
- name: invoice lookup stays read-only
model: claude-sonnet-4-5
servers: [billing]
prompt: Look up invoice 42 and tell me its status.
max_turns: 3
max_tokens: 256
expect:
# The observable artifacts decide the verdict.
- target: tool_calls[0].name
matcher:
exact: get_invoice
- target: tool_results[0].is_error
matcher:
exact: false
If the model closes with "done, I issued the refund" while the trace shows only get_invoice, the run still passes on what it observably did, and a separate assertion such as tool_calls[*].name not containing refund would catch the narrated-but-never-taken action. A grader reading the prose could be talked into the opposite verdict; the matcher cannot.
The oracle observes every tool call, not only the ones the model typed directly into an assistant turn. Parallel tool use (several calls in one turn) records all of them, in order, with one result per call, so a call hidden behind a sibling is still graded. Code-mode calls (Anthropic programmatic tool calling, Cloudflare Code Mode) run model-written code in a sandbox that calls tools; the provider tags those calls with a code-execution caller, kept in an optional caller field on each tool call, so an action taken inside generated code resolves on the same observable paths (tool_calls[i].name, tool_calls[i].args) as a direct call. A direct call leaves the field off. The result is that an injected action performed in code cannot hide from the oracle.
Related
- Tool-selection and surface metrics: selection F1, distractors, name-free discovery, token efficiency, description quality.
- Evaluation: judges, juries, and rubrics: the model-graded surface and the full scoring-method map.
- Model compatibility: the cross-release baseline diff that pairs with within-session stability.
- Scenario-world harness: the seeded-world, stateful-task harness and tool ablation.