Fault injection and recovery scoring

Most of what mcptest does assumes a server that answers. The interesting failures are the ones where it does not. A server can be down (the connection is refused), it can return a bad response (a malformed or wrong reply), or it can be unresponsive: reachable, with the connection open, but the call never comes back and no protocol error is ever returned. That last case is the hardest for an agent to handle and the easiest to skip in testing, because nothing fails loudly. The call just hangs.

Run this example. examples/fault-injection-recovery.yml declares a hang fault, injects it into an agent test, and gates the run on a recovery: budget. The injected hang is synthesized in virtual time, so the run never blocks.

ANTHROPIC_API_KEY=... mcptest run --config examples/fault-injection-recovery.yml

Fault injection adds two things: a way to make a tool behave like an unresponsive backend, and a deterministic way to score how well an agent recovered from it. Both are reachable from a .yml suite (the faults: block and the recovery: matcher) and from the mock server's --fault flag.

Where the idea comes from

The pattern follows the offline failure-injection work in ToolMisuseBench (arXiv:2604.01508). That benchmark does more than check whether a tool call succeeds; it deliberately takes a tool offline and then measures task success and time-to-recovery. The point is that a reliable agent is not the one that never hits a dead tool; it is the one that notices quickly, stops hanging, and routes around the failure. mcptest applies the same idea to an unresponsive MCP server.

The fault-injection test type

Launch a mock server with a fault baked in:

mcptest mock --tools-from ./tools.yml --fault hang

--fault takes one of:

Kind	Effect
`none`	Healthy server (the default).
`hang`	Every `tools/call` never returns. Reachable but unresponsive: the canonical "dead but connected" fault.
`wedged`	Same agent-visible effect as `hang`. A separate name so you can record the intent (a deadlocked backend vs a stalled network).
`slow:<ms>`	The call answers, but only after `<ms>` milliseconds. Use it to push a call past a timeout budget without an unbounded block.
`recover-after:<n>`	The first `n` calls hang, then the server recovers and answers normally. Exercises an agent that retries against a backend that comes back.
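The table's vocabulary is small enough to model as one enum. The following is a minimal sketch of how a `--fault` value could be parsed; `FaultKind` and `parse_fault` are hypothetical names chosen for illustration and are not mcptest's actual types.

```rust
/// Hypothetical in-memory form of a `--fault` value (not mcptest's real type).
#[derive(Debug, PartialEq, Eq)]
enum FaultKind {
    None,
    Hang,
    Wedged,
    Slow { delay_ms: u64 },
    RecoverAfter { failures: u32 },
}

/// Parse one `--fault` argument, rejecting unknown kinds and bad numbers.
fn parse_fault(spec: &str) -> Result<FaultKind, String> {
    // Kinds without an argument.
    match spec {
        "none" => return Ok(FaultKind::None),
        "hang" => return Ok(FaultKind::Hang),
        "wedged" => return Ok(FaultKind::Wedged),
        _ => {}
    }
    // The remaining kinds carry a numeric argument after a colon.
    let (kind, arg) = spec
        .split_once(':')
        .ok_or_else(|| format!("unknown fault kind: {spec}"))?;
    match kind {
        "slow" => arg
            .parse::<u64>()
            .map(|delay_ms| FaultKind::Slow { delay_ms })
            .map_err(|e| format!("bad delay in {spec}: {e}")),
        "recover-after" => arg
            .parse::<u32>()
            .map(|failures| FaultKind::RecoverAfter { failures })
            .map_err(|e| format!("bad count in {spec}: {e}")),
        _ => Err(format!("unknown fault kind: {spec}")),
    }
}
```

Keeping `hang` and `wedged` as separate variants preserves the recorded intent even though the agent-visible effect is the same.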

The fault only touches tools/call. initialize, tools/list, and the resources surface stay responsive, so an agent can still connect and discover tools and only then hit the hang on the call itself. That matches the real failure: discovery worked, the tool is advertised, but invoking it goes nowhere.
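As an illustration of that scoping, here is a sketch of the decision a mock could make per request for the `recover-after:<n>` kind. The `RecoverAfter` struct and `should_hang` method are assumptions made for this example; only the method names `initialize`, `tools/list`, and `tools/call` come from the protocol.

```rust
/// Hypothetical mock-side state: hang the first `failures` tool calls, then recover.
struct RecoverAfter {
    failures: u32,
    calls_seen: u32,
}

impl RecoverAfter {
    fn new(failures: u32) -> Self {
        Self { failures, calls_seen: 0 }
    }

    /// Returns true when this request should hang.
    /// Only `tools/call` is faulted; discovery methods always answer
    /// and do not count toward the failure budget.
    fn should_hang(&mut self, method: &str) -> bool {
        if method != "tools/call" {
            return false;
        }
        let hang = self.calls_seen < self.failures;
        self.calls_seen = self.calls_seen.saturating_add(1);
        hang
    }
}
```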

A real hang never returns, which would leave a mock the operator cannot stop cleanly. The operator-facing mock therefore blocks for a bounded ten minutes before it gives up and emits a hang sentinel. Ten minutes is far past any sane agent timeout, so from the agent's side the call still hangs; the cap only exists so the process stays killable.
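One way to get a hang that is bounded and still killable is to wait on a stop channel with a timeout. This is a sketch of that idea using only the standard library, not the mock's actual implementation; `HangOutcome` and `bounded_hang` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// How a bounded hang ended. Names are hypothetical.
#[derive(Debug, PartialEq, Eq)]
enum HangOutcome {
    /// The cap elapsed; the mock gives up and reports the hang sentinel.
    Sentinel,
    /// The operator asked the mock to stop before the cap elapsed.
    Stopped,
}

/// Block until `cap` elapses or a stop signal arrives, whichever comes first.
fn bounded_hang(stop: &Receiver<()>, cap: Duration) -> HangOutcome {
    match stop.recv_timeout(cap) {
        Err(RecvTimeoutError::Timeout) => HangOutcome::Sentinel,
        // A message or a dropped sender both mean "stop now".
        Ok(()) | Err(RecvTimeoutError::Disconnected) => HangOutcome::Stopped,
    }
}
```

With the ten-minute cap described above, the call would pass `Duration::from_secs(600)`; a shutdown handler holding the sender can end the wait immediately.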

The recovery metrics

Scoring lives in the pure function score_recovery in mcptest_core::eval::recovery. It is transport-free: you time the call against the faulty server and hand it the millisecond marks, and it gives you back the numbers. A RecoveryObservation carries:

call_start_ms: when the call was issued.
detected_ms: when the agent noticed the call would not return (a timeout fired), or absent if it never noticed and is still hanging.
recovered_ms: when the agent recovered (retried elsewhere, fell back, or surfaced a clean error), or absent if it never did.
timeout_budget_ms: the timeout the agent was configured to honor.

score_recovery folds that into a RecoveryReport:

time_to_detection_ms: call start to detection, or None if never.
time_to_recovery_ms: call start to recovery, or None if never.
user_visible_hang_ms: how long the user saw the call hang. The hang ends the moment the agent either detects the fault or recovers, whichever is earlier. When neither ever happens, it is capped at the timeout budget, because that is the longest hang a correctly configured agent should ever expose to the user.
recovered: true when the agent recovered.
clean_timeout: true when the agent detected the fault within its timeout budget (detection exactly at the budget still counts). This is the key distinction. An agent that times out at its 3s budget timed out cleanly; an agent that hung for 9s before noticing did not, even though both eventually detected the fault. A clean timeout means the agent respected its own deadline rather than hanging past it (or forever).
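The field definitions above can be read as a small pure function. The following is a re-implementation sketch, not the code in mcptest_core: the field names follow the lists above, while the `u64` types, the `score_recovery_sketch` name, and the handling of marks stamped before the call started are assumptions.

```rust
/// Mirrors the observation fields described above (types are assumed).
struct RecoveryObservation {
    call_start_ms: u64,
    detected_ms: Option<u64>,
    recovered_ms: Option<u64>,
    timeout_budget_ms: u64,
}

/// Mirrors the report fields described above (types are assumed).
#[derive(Debug, PartialEq, Eq)]
struct RecoveryReport {
    time_to_detection_ms: Option<u64>,
    time_to_recovery_ms: Option<u64>,
    user_visible_hang_ms: u64,
    recovered: bool,
    clean_timeout: bool,
}

fn score_recovery_sketch(obs: &RecoveryObservation) -> RecoveryReport {
    // checked_sub yields None for a mark stamped before the call started,
    // so such a mark counts as "did not happen" instead of wrapping.
    let since_start = |mark: Option<u64>| mark.and_then(|m| m.checked_sub(obs.call_start_ms));
    let time_to_detection_ms = since_start(obs.detected_ms);
    let time_to_recovery_ms = since_start(obs.recovered_ms);

    // The hang ends at the earlier of detection and recovery; with neither,
    // it is capped at the timeout budget.
    let user_visible_hang_ms = match (time_to_detection_ms, time_to_recovery_ms) {
        (Some(d), Some(r)) => d.min(r),
        (Some(t), None) | (None, Some(t)) => t,
        (None, None) => obs.timeout_budget_ms,
    };

    RecoveryReport {
        time_to_detection_ms,
        time_to_recovery_ms,
        user_visible_hang_ms,
        recovered: time_to_recovery_ms.is_some(),
        // Clean means detection happened no later than the budget.
        clean_timeout: time_to_detection_ms.map_or(false, |d| d <= obs.timeout_budget_ms),
    }
}
```

Under these assumptions, the 3s and 9s agents from the clean_timeout description differ only in that flag and in user_visible_hang_ms; neither value depends on how the call was transported.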

To gate on it, meets_recovery_budget(&report, max_recovery_ms) returns true when the agent recovered within the budget. A run that never recovered fails; a run that recovered but took too long fails. Pair it with clean_timeout when you also want to forbid the hang-past-budget case:

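use mcptest_core::eval::{meets_recovery_budget, score_recovery, RecoveryObservation};

fn main() {
    // The agent hit its 3s timeout exactly on budget, then recovered 200ms later.
    let report = score_recovery(&RecoveryObservation {
        call_start_ms: 0,
        detected_ms: Some(3_000),
        recovered_ms: Some(3_200),
        timeout_budget_ms: 3_000,
    });
    // Detection at 3_000ms is within the 3_000ms budget, so the timeout was clean.
    assert!(report.clean_timeout);
    // Recovery at 3_200ms fits inside a 5_000ms recovery budget.
    assert!(meets_recovery_budget(&report, 5_000));
}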

The scoring is total: every input yields a well-formed report. Out-of-order marks (a recovery stamped before the call started) are treated as "did not happen" rather than producing a negative or wrapped duration, so the report never carries nonsense.
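The gate itself is a one-line predicate over the report. Below is a sketch of the budget check and of the stricter pairing with clean_timeout; `Report` is a cut-down stand-in for RecoveryReport, both function names are hypothetical, and treating a recovery exactly at the budget as passing is an assumption.

```rust
/// The two report fields the gate needs; a cut-down stand-in for RecoveryReport.
struct Report {
    time_to_recovery_ms: Option<u64>,
    clean_timeout: bool,
}

/// Sketch of the budget check: the agent recovered, and did so within `max_recovery_ms`.
fn meets_recovery_budget_sketch(report: &Report, max_recovery_ms: u64) -> bool {
    // None (never recovered) fails; Some(t) passes only when t fits the budget.
    report.time_to_recovery_ms.map_or(false, |t| t <= max_recovery_ms)
}

/// Stricter gate: within the recovery budget and no hang past the timeout.
fn passes_strict_gate(report: &Report, max_recovery_ms: u64) -> bool {
    meets_recovery_budget_sketch(report, max_recovery_ms) && report.clean_timeout
}
```

Because a mark that "did not happen" is represented as None, a report built from out-of-order marks fails the gate the same way a run that never recovered does.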

The `faults:` block and the `recovery:` matcher

You do not have to launch a faulty mock to test recovery. Declare the fault in the suite and inject it into an agent test:

faults:
  - name: hung-search
    target: { tool: search }   # a tool, a server, or both
    kind: hang                 # hang | wedged | slow (+ delay_ms) | recover_after (+ failures)
agents:
  - name: recovers from a hung search tool
    model: claude-sonnet-4-5
    servers: [faulty]
    inject: [hung-search]
    prompt: Search for the latest incident report.
    recovery:
      max_detection_ms: 3000     # per-call timeout budget
      max_recovery_ms: 5000      # total recovery budget (omit for unbounded)
      require_clean_timeout: true

When a tool call matches an injected fault, the executor synthesizes the unresponsive timeout as a directive instead of dispatching to the server. Nothing blocks, so a recovery test is deterministic and CI-bounded with no wall-clock sleep. The hang is scored in virtual time, where each synthesized timeout consumes one max_detection_ms budget. A run that hit the fault and then replied cleanly is scored as recovered at detections * max_detection_ms, where detections is the number of synthesized timeouts. A run that looped on the hung tool until its turn budget tripped, never replying, is scored as not recovered.
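The virtual-time rule reduces to a multiplication. A minimal sketch, assuming the executor tracks a count of synthesized timeouts and whether the agent produced a final reply; the function name and parameters are hypothetical.

```rust
/// Virtual-time recovery mark, in milliseconds, for a run that hit an injected hang.
/// `detections` is the number of synthesized timeouts; `replied` says whether the
/// agent produced a final reply instead of exhausting its turn budget.
fn virtual_recovery_ms(detections: u64, max_detection_ms: u64, replied: bool) -> Option<u64> {
    if !replied {
        // Looped on the hung tool until the turn budget tripped: never recovered.
        return None;
    }
    // Each synthesized timeout consumes one detection budget; saturate on overflow.
    Some(detections.saturating_mul(max_detection_ms))
}
```

With the suite above (max_detection_ms: 3000, max_recovery_ms: 5000), an agent that times out once and then replies recovers at 3000 ms and fits the budget; one that retries the hung tool a second time recovers at 6000 ms and exceeds it.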

The gate drives pass/fail:

The run fails as a quality failure (exit 1) when the agent did not recover, took longer than max_recovery_ms, or, under require_clean_timeout, did not time out within its detection budget.
A dangling inject: name (no matching faults: entry) or a connect failure is an infra error (exit 2), caught at load time.
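The two rules above map onto exit codes roughly as follows. This is a sketch under stated assumptions: `Gate`, `RunResult`, and `exit_code` are hypothetical names, and exit 0 for a passing run is assumed because the text only specifies 1 and 2.

```rust
/// Gate settings; field names echo the `recovery:` keys, the struct is hypothetical.
struct Gate {
    max_recovery_ms: Option<u64>, // None means the total budget is unbounded
    require_clean_timeout: bool,
}

/// Outcome of one agent test (hypothetical shape).
enum RunResult {
    /// Dangling inject name or connect failure.
    Infra,
    /// The run completed; `recovered_at_ms` is None when the agent never recovered.
    Completed { recovered_at_ms: Option<u64>, clean_timeout: bool },
}

fn exit_code(gate: &Gate, result: &RunResult) -> i32 {
    let (recovered_at_ms, clean_timeout) = match *result {
        RunResult::Infra => return 2, // infra error
        RunResult::Completed { recovered_at_ms, clean_timeout } => (recovered_at_ms, clean_timeout),
    };
    let recovered_in_time = match (recovered_at_ms, gate.max_recovery_ms) {
        (None, _) => false,               // never recovered
        (Some(_), None) => true,          // recovered, no total budget configured
        (Some(t), Some(max)) => t <= max, // recovered, must fit the budget
    };
    let clean_ok = !gate.require_clean_timeout || clean_timeout;
    if recovered_in_time && clean_ok { 0 } else { 1 } // 1 = quality failure
}
```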

A target with no tool matches every tool on the named server; a target with no server matches the named tool on any server. The fault kinds mirror the mock's --fault flag, so the same vocabulary drives a live faulty mock and an in-suite injected fault.
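The wildcard rule for targets can be sketched as a pair of optional fields where a missing field matches anything. `FaultTarget` and `matches` are hypothetical names, and exact string comparison is an assumption.

```rust
/// Hypothetical form of a fault `target`: a tool, a server, or both.
struct FaultTarget {
    server: Option<String>,
    tool: Option<String>,
}

impl FaultTarget {
    /// A missing field is a wildcard; a present field must match exactly.
    fn matches(&self, server: &str, tool: &str) -> bool {
        let server_ok = self.server.as_deref().map_or(true, |s| s == server);
        let tool_ok = self.tool.as_deref().map_or(true, |t| t == tool);
        server_ok && tool_ok
    }
}
```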

Where this fits

Fault injection belongs to the EDGE compliance family alongside the other edge-case probes (empty arrays, oversized payloads, invalid input). An unresponsive backend is an edge the agent has to survive, not a happy-path behavior, so a recovery gate reads naturally as an EDGE-class check. The orchestration.error_recovery sub-score remains the coarse gate (fraction of errored calls later recovered); the recovery: matcher is the fine-grained path for a named hang.