Fault taxonomy and execution safety
Three concerns sit at the runtime edge of robustness testing: which fault families a suite actually exercised, the safety policy that governs which tool calls a synthesizing feature may make, and whether a server stays correct when several clients talk to it at once. This page covers all three.
Runtime fault taxonomy coverage
A passing test suite tells you what your MCP server does right. It does not tell you which failure modes you actually checked for. MCP servers fail at runtime in recurring families: slips in the protocol lifecycle, tool-invocation crashes, schema gaps, state leaks, provider-integration breaks, security lapses, timeouts, and unenforced configuration. A suite can look green while never having probed half of them.
mcptest fault-coverage answers one question: which runtime fault families did this suite exercise? It loads a versioned fault taxonomy, detects the probes your suite uses, and marks each family covered, partially covered, uncovered, or not applicable, with an optional coverage threshold you can gate CI on.
This complements the other quality lanes. compliance and conformance check spec adherence; security checks adversarial inputs; fault-coverage reports, across all of them, which fault families your run touched.
The eight fault families
The taxonomy is grounded in "A Taxonomy of Runtime Faults in Model Context Protocol Servers". Each family has a stable category id; each leaf fault carries a FAULT-<CATEGORY>-<NAME> id.
| Category | Family | Example leaf fault |
|---|---|---|
| PROTO | Protocol interaction | request processed before initialize completes |
| TOOL | Tool invocation | tool crashes instead of returning an error result |
| SCHEMA | Schema enforcement | structuredContent violates the declared outputSchema |
| STATE | State management | state leaks across tests or sessions |
| PROVIDER | Model-provider integration | provider rejects the served tool schema |
| SECURITY | Security validation | tool-description injection followed |
| TIMEOUT | Timeout and cancellation | tool hangs without a clean timeout |
| CONFIG | Configuration not enforced | read-only hint not enforced |
PROVIDER faults apply only to suites that define an agents: block; a server-only suite reports that family as n/a rather than uncovered.
How coverage is computed
Each leaf fault is mapped to the mcptest probes that exercise it (a probe is a subcommand or a suite feature). A leaf is exercised when any of its probes ran. A family is then covered (every leaf was exercised), partial (some leaves were exercised), uncovered (the family applies but no leaf was exercised), or n/a (the family does not apply to this suite shape).
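The rule above can be sketched in a few lines of Python. This is an illustration of the stated logic, not mcptest's own implementation: the LeafFault type, the family_status function, the example fault ids, and the leaf-to-probe mapping are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafFault:
    fault_id: str           # hypothetical id in the FAULT-<CATEGORY>-<NAME> shape
    probes: frozenset[str]  # probe tokens that exercise this leaf


def family_status(leaves: list[LeafFault], ran: set[str], applies: bool = True) -> str:
    """Return covered / partial / uncovered / n/a for one fault family."""
    if not applies:
        return "n/a"
    # A leaf counts as exercised when any one of its probes ran.
    exercised = sum(1 for leaf in leaves if leaf.probes & ran)
    if leaves and exercised == len(leaves):
        return "covered"
    return "partial" if exercised else "uncovered"


# Hypothetical two-leaf family: one leaf needs fuzz, the other accepts either probe.
tool_family = [
    LeafFault("FAULT-TOOL-EXAMPLE-A", frozenset({"fuzz"})),
    LeafFault("FAULT-TOOL-EXAMPLE-B", frozenset({"negative-path", "fuzz"})),
]
print(family_status(tool_family, {"negative-path"}))
```

With only negative-path in the probe set, the second leaf is exercised and the first is not, so the family is partial; adding fuzz would make it covered.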
The command detects these probes from the suite YAML automatically:
| Suite feature | Probe token |
|---|---|
| a compliance: block | compliance, conformance |
| a security: block or a trust_boundary: | security, trust-boundary |
| faults: / agent recovery: / inject: | fault-recovery |
| a negative_path: check | negative-path |
| an output_schema: | output-schema |
| run_options.restart_policy | test-isolation |
| agents: (with trajectory/golden_path/distractors/equal_function) | agent, tool-selection |
Probes you run out of band (a separate mcptest fuzz or mcptest schema-lint pass) are folded in with --probe.
A worked example
# Which fault families does this suite exercise, plus a fuzz pass run separately?
mcptest fault-coverage examples/negative-path.yml --probe fuzz
Runtime fault family coverage
partial Protocol interaction 2/3 faults [PROTO]
covered Tool invocation 2/2 faults [TOOL]
partial Schema enforcement 2/4 faults [SCHEMA]
uncovered State management 0/2 faults [STATE]
n/a Model-provider integration 0/2 faults [PROVIDER]
uncovered Security validation 0/2 faults [SECURITY]
uncovered Timeout and cancellation 0/2 faults [TIMEOUT]
uncovered Configuration not enforced 0/1 faults [CONFIG]
38% of applicable faults exercised; 1 of 7 families fully covered
Gate CI on a floor with --threshold; the command exits 6 when coverage is below it. Use --format json for a machine-readable report, and --no-suite to print the bare taxonomy (combine with --probe to score an out-of-band pipeline).
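The summary line in the report can be re-derived by hand, which helps when choosing a floor. The sketch below recomputes it from the per-family counts shown in the report; the function names are hypothetical, and treating the threshold as a percentage compared against the unrounded value is an assumption, not documented behavior.

```python
# (exercised, total) leaf counts per applicable family, copied from the report.
# PROVIDER is n/a for this suite, so it is left out of the denominator.
COUNTS = {
    "PROTO": (2, 3), "TOOL": (2, 2), "SCHEMA": (2, 4), "STATE": (0, 2),
    "SECURITY": (0, 2), "TIMEOUT": (0, 2), "CONFIG": (0, 1),
}

BELOW_THRESHOLD_EXIT = 6  # exit code the command uses when coverage is below the floor


def coverage_percent(counts):
    exercised = sum(e for e, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return 100.0 * exercised / total if total else 0.0


def exit_code(counts, threshold):
    # Assumption: the floor is a percentage and is compared before rounding.
    return BELOW_THRESHOLD_EXIT if coverage_percent(counts) < threshold else 0


fully_covered = sum(1 for e, t in COUNTS.values() if e == t)
print(coverage_percent(COUNTS), fully_covered, exit_code(COUNTS, 50))
```

Six of sixteen applicable leaves gives 37.5, which the report displays as 38%, and only TOOL has every leaf exercised, matching "1 of 7 families fully covered".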
Fixtures
Every family ships a runnable fixture under examples/, so you can see each one exercised in isolation: examples/negative-path.yml (PROTO/SCHEMA), examples/fuzz/ (TOOL), examples/output-schema-conformance.yml (SCHEMA), examples/test-isolation.yml (STATE), examples/agent-weather.yml and examples/distractor-tools/ (PROVIDER), examples/security/ and examples/trust-boundary/ (SECURITY), examples/fault-injection-recovery.yml (TIMEOUT), and examples/config-enforcement/ (CONFIG).
Extending the taxonomy
The taxonomy lives in crates/mcptest-core/src/runtime_faults/registry.yml and is embedded in the binary; pass --registry <path> to load a customized copy instead. To add a fault: pick a category, give the leaf a FAULT-<CATEGORY>-<NAME> id, list the probes that exercise it, point fixture at an example, and run cargo test -p mcptest-core runtime_faults to confirm it loads.
Execution safety policy
Some mcptest features execute real tool calls with synthesized arguments: suite scaffolding, assertion proposal, and the probe tier. Against a tool like delete_file or a production SaaS backend, that is an agent autonomously causing side effects. The execution safety policy in mcptest-core::exec_policy is the single layer those features consult before calling anything.
Tool classification
Every tool from tools/list is classified before any call is planned. Explicit MCP tool annotations (the annotations object on the tool descriptor) always win over the name heuristic.
| Source | Condition | Class |
|---|---|---|
| Annotation | readOnlyHint: true | ReadOnly |
| Annotation | destructiveHint: true | Destructive |
| Annotation | idempotentHint: false | Mutating |
| Name heuristic | destructive-looking word | Destructive |
| Name heuristic | mutating-looking word | Mutating |
| Name heuristic | anything else | ReadOnlyPresumed |
ReadOnly and ReadOnlyPresumed are kept distinct so callers can tell "the server declared this read-only" apart from "we presume it is". Among annotations, read-only is checked first (the spec defines the other hints as meaningful only when it is false), then destructive, then non-idempotent. A malformed annotations object (for example "destructiveHint": "yes") is ignored and the name heuristic decides; the description lints flag the malformed object separately.
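The annotation precedence can be written out as a short Python sketch. It is not the code in mcptest-core::exec_policy: the function name is hypothetical, and returning None when no hint is decisive (so that the name heuristic takes over) is an assumption about a case the table does not spell out.

```python
HINTS = ("readOnlyHint", "destructiveHint", "idempotentHint")


def classify_by_annotations(annotations):
    """Return a class name, or None when the name heuristic should decide."""
    if not isinstance(annotations, dict):
        return None
    # One non-boolean hint makes the object malformed, so all of it is ignored.
    if any(h in annotations and not isinstance(annotations[h], bool) for h in HINTS):
        return None
    if annotations.get("readOnlyHint") is True:      # checked first
        return "ReadOnly"
    if annotations.get("destructiveHint") is True:   # then destructive
        return "Destructive"
    if annotations.get("idempotentHint") is False:   # then non-idempotent
        return "Mutating"
    return None  # assumption: no decisive hint, fall through to the name heuristic


print(classify_by_annotations({"readOnlyHint": True, "destructiveHint": True}))
print(classify_by_annotations({"destructiveHint": "yes"}))
```

The first call yields ReadOnly because read-only is checked before destructive; the second yields None because the string "yes" makes the object malformed.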
The name heuristic
Unannotated tool names are split into lowercase words at separators and camelCase boundaries (deleteFile, delete-file, and delete_file all contain the word delete), then matched against two word lists:
- Mutating: create, update, set, send, write, post, put, insert, patch, add, upload
- Destructive: delete, remove, drop, destroy, purge, erase, wipe, kill, revoke
A destructive word outranks a mutating word in the same name (create_or_delete is Destructive).
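A minimal Python sketch of the heuristic, using the two word lists above. The function names and the exact camelCase rule (a break wherever a lowercase letter or digit is followed by an uppercase letter) are assumptions; the real splitter may treat acronyms differently.

```python
import re

MUTATING = {"create", "update", "set", "send", "write", "post", "put",
            "insert", "patch", "add", "upload"}
DESTRUCTIVE = {"delete", "remove", "drop", "destroy", "purge", "erase",
               "wipe", "kill", "revoke"}


def split_words(name):
    # Mark each camelCase boundary with a space, then split on any run of
    # non-alphanumeric separators and lowercase the pieces.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def classify_by_name(name):
    words = set(split_words(name))
    if words & DESTRUCTIVE:   # a destructive word outranks a mutating one
        return "Destructive"
    if words & MUTATING:
        return "Mutating"
    return "ReadOnlyPresumed"


for tool in ("deleteFile", "create_or_delete", "sendMessage", "settings"):
    print(tool, classify_by_name(tool))
```

Matching whole words rather than substrings is what keeps a name like settings from being caught by set.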
What each class means at execution time
ExecutionPolicy::decide maps a class to one of four decisions:
- Execute: ReadOnly and ReadOnlyPresumed tools run freely, including stability double-calls.
- ExecuteOnce: Mutating tools run at most once per run. They are never double-called for stability, because the second call of a non-idempotent tool is itself a side effect. Generated tests for these tools are marked serial.
- GenerateOnly: Destructive tools without the override, in a context that can emit tests instead of running them. The generated test is prefixed with a marker comment beginning # review before first run so a human looks at it before it ever executes.
- Refuse: Destructive tools without the override, in a context that needs a live call (probing). The call is skipped with a typed reason.
Setting execute_destructive (a CLI flag) downgrades Destructive to ExecuteOnce. Even with the override, destructive tools are never double-called.
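The mapping from class to decision is small enough to write out. The Python below sketches the rules in this section and is not the Rust ExecutionPolicy::decide itself; the can_emit_tests parameter and the may_double_call helper are hypothetical names for "a context that can emit tests" and the stability double-call rule.

```python
from enum import Enum


class ToolClass(Enum):
    READ_ONLY = "ReadOnly"
    READ_ONLY_PRESUMED = "ReadOnlyPresumed"
    MUTATING = "Mutating"
    DESTRUCTIVE = "Destructive"


class Decision(Enum):
    EXECUTE = "Execute"
    EXECUTE_ONCE = "ExecuteOnce"
    GENERATE_ONLY = "GenerateOnly"
    REFUSE = "Refuse"


def decide(tool_class, *, execute_destructive=False, can_emit_tests=False):
    if tool_class in (ToolClass.READ_ONLY, ToolClass.READ_ONLY_PRESUMED):
        return Decision.EXECUTE
    # Mutating tools, and destructive tools with the override, run at most once.
    if tool_class is ToolClass.MUTATING or execute_destructive:
        return Decision.EXECUTE_ONCE
    # Destructive without the override: emit a test if possible, else refuse.
    return Decision.GENERATE_ONLY if can_emit_tests else Decision.REFUSE


def may_double_call(decision):
    # Only freely executable tools are eligible for stability double-calls.
    return decision is Decision.EXECUTE


print(decide(ToolClass.DESTRUCTIVE, can_emit_tests=True).value)
print(decide(ToolClass.DESTRUCTIVE, execute_destructive=True).value)
```

Because the override only ever yields ExecuteOnce, a destructive tool can never reach the double-call path, whatever the flags.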
Policy knobs and defaults
| Knob | Default | Meaning |
|---|---|---|
| execute_destructive | false | Allow executing destructive tools. |
| max_calls | unlimited | Total tool-call budget for the run. |
| concurrency | 2 | Maximum calls in flight at once. |
| call_delay | 100ms | Polite pause between HTTP calls. |
The call budget is a thread-safe counter (CallBudget). Every planned call must acquire from it first; under concurrency exactly max_calls acquisitions succeed and the rest fail with a typed BudgetExhausted error carrying the limit, so a run can stop scheduling cleanly. The delay applies between consecutive HTTP calls; callers decide whether the target transport is HTTP. Stdio targets may ignore it.
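The "exactly max_calls acquisitions succeed" guarantee is the property worth seeing in code. The Python sketch below mirrors the names CallBudget and BudgetExhausted from this section, but it is an illustration using a lock-guarded counter, not the Rust type; the acquire method and try_acquire helper are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BudgetExhausted(Exception):
    def __init__(self, limit):
        super().__init__(f"call budget of {limit} exhausted")
        self.limit = limit  # the error carries the configured limit


class CallBudget:
    def __init__(self, max_calls=None):
        self._limit = max_calls  # None means unlimited
        self._used = 0
        self._lock = threading.Lock()

    def acquire(self):
        # Check and increment under one lock so racing callers cannot overshoot.
        with self._lock:
            if self._limit is not None and self._used >= self._limit:
                raise BudgetExhausted(self._limit)
            self._used += 1


def try_acquire(budget):
    try:
        budget.acquire()
        return True
    except BudgetExhausted:
        return False


budget = CallBudget(max_calls=5)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(try_acquire, [budget] * 20))
granted = sum(results)
print(granted)
```

Twenty attempts race across eight threads, and five are granted regardless of scheduling, because the check and the increment happen atomically.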
Example
The policy reads the tool descriptors a server returns from tools/list, so the way to steer it is the annotations object on each tool. These two tools classify in opposite directions:
mock_server:
  name: records
  tools:
    # readOnlyHint wins over the name heuristic, so this runs freely
    # (stability double-calls included): ReadOnly -> Execute.
    - name: search_records
      description: "Find records matching a query."
      annotations:
        readOnlyHint: true
      response:
        content:
          - type: text
            text: "0 records"
    # No annotation, and the name contains "delete", so the heuristic
    # classifies it Destructive: GenerateOnly when a feature can emit a
    # test instead of running it, Refuse when a live call is required.
    - name: delete_record
      description: "Delete a record by id."
      response:
        content:
          - type: text
            text: "deleted"
Serve it with mcptest mock --tools-from records.yaml and point a synthesizing feature (scaffolding, proposal, or the probe tier) at it: the read-only tool is exercised, the destructive one is held back behind the # review before first run marker or skipped with a typed reason.
mcptest never cleans up after a mutating or destructive test. If a generated or probed call creates a record, sends a message, or uploads a file, removing that data afterwards is the developer's responsibility. Run synthesized suites against disposable or staging targets, not production.
Concurrent-session correctness
mcptest concurrency opens several sessions against a server at once and checks that it stays correct while they run. It is a correctness check, not a load test: mcptest does not measure throughput or latency (use k6, wrk, or vegeta for that). It checks the things a generic HTTP load tool cannot, and it works over stdio as well as HTTP.
# Four concurrent stdio sessions (the default).
mcptest concurrency "python my_server.py"
# Eight concurrent sessions against an HTTP server, as JSON, gated.
mcptest concurrency https://example.com/mcp --sessions 8 --json
The server argument is an http(s):// URL (streamable HTTP) or a stdio command line (anything else), the same shape mcptest conformance run --server uses.
It runs one baseline session, then N concurrent ones, each listing the catalog, and reports three failure families:
- liveness: a session that never completed (a hang, disconnect, or error under concurrency).
- determinism: a session whose catalog diverged from the single-session baseline, the signature of cross-talk or shared-state corruption.
- isolation: a value that leaked from one session into another. The analyzer reports leaks when a marker probe is configured; the default catalog probe does not inject markers.
It exits non-zero when any session fails, unless --no-fail is set. The report is a correctness verdict, not a latency histogram: it tells you whether one session's state bleeds into another, whether interleaved requests corrupt a response, and whether the server wedges. Those are exactly the properties a throughput tool does not check.
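The liveness and determinism checks reduce to comparing each concurrent session's catalog with the baseline. The Python sketch below shows that comparison only; the function names, the use of None for a session that never completed, and the exact-equality comparison of tool-name lists are assumptions, and the isolation check is omitted because it depends on a marker probe.

```python
def analyze_sessions(baseline, sessions):
    """baseline: catalog from the single-session run.
    sessions: one catalog per concurrent session, or None if it never completed."""
    failures = []
    for index, catalog in enumerate(sessions):
        if catalog is None:
            failures.append((index, "liveness"))     # hang, disconnect, or error
        elif catalog != baseline:
            failures.append((index, "determinism"))  # diverged from the baseline
    return failures


def should_fail(failures, no_fail=False):
    # Mirrors the exit rule: any failed session fails the run unless --no-fail.
    return bool(failures) and not no_fail


baseline = ["search_records", "delete_record"]
sessions = [list(baseline), None, ["search_records"], list(baseline)]
failures = analyze_sessions(baseline, sessions)
print(failures, should_fail(failures))
```

Here session 1 never completed and session 2 returned a shorter catalog, so the run reports one liveness and one determinism failure.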