Execution safety policy
Some mcptest features execute real tool calls with synthesized arguments: suite scaffolding, assertion proposal, and the probe tier. Against a tool like delete_file or a production SaaS backend, that is an agent autonomously causing side effects. The execution safety policy in mcptest-core::exec_policy is the single layer those features consult before calling anything.
Tool classification
Every tool from tools/list is classified before any call is planned. Explicit MCP tool annotations (the annotations object on the tool descriptor) always win over the name heuristic.
| Source | Condition | Class |
|---|---|---|
| Annotation | readOnlyHint: true | ReadOnly |
| Annotation | destructiveHint: true | Destructive |
| Annotation | idempotentHint: false | Mutating |
| Name heuristic | destructive-looking word | Destructive |
| Name heuristic | mutating-looking word | Mutating |
| Name heuristic | anything else | ReadOnlyPresumed |
ReadOnly and ReadOnlyPresumed are kept distinct so callers can tell "the server declared this read-only" apart from "we presume it is". Among annotations, read-only is checked first (the spec defines the other hints as meaningful only when it is false), then destructive, then non-idempotent. A malformed annotations object (for example "destructiveHint": "yes") is ignored and the name heuristic decides; the description lints flag the malformed object separately.
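The annotation precedence described above can be sketched roughly as follows. This is an illustration, not the code in mcptest-core::exec_policy: the `Annotations` struct, the `classify_by_annotations` function, and the choice to return `None` when no hint decides are all assumptions made for the example. A malformed annotations object is modelled by passing `None`.

```rust
/// Tool classes named in the classification table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolClass {
    ReadOnly,
    Destructive,
    Mutating,
    ReadOnlyPresumed,
}

/// Hypothetical, already-parsed view of a tool's `annotations` object.
/// A field is `None` when the hint was absent from the descriptor.
#[derive(Debug, Clone, Copy, Default)]
pub struct Annotations {
    pub read_only_hint: Option<bool>,
    pub destructive_hint: Option<bool>,
    pub idempotent_hint: Option<bool>,
}

/// Returns the class the annotations decide, or `None` when the name
/// heuristic should decide instead (annotations missing, malformed and
/// therefore ignored, or, by assumption here, present but undecisive).
pub fn classify_by_annotations(annotations: Option<&Annotations>) -> Option<ToolClass> {
    let a = annotations?;
    // Read-only is checked first: the other hints only matter when it is false.
    if a.read_only_hint == Some(true) {
        return Some(ToolClass::ReadOnly);
    }
    // Then destructive.
    if a.destructive_hint == Some(true) {
        return Some(ToolClass::Destructive);
    }
    // Then explicitly non-idempotent.
    if a.idempotent_hint == Some(false) {
        return Some(ToolClass::Mutating);
    }
    None
}
```

A caller would try this first and fall back to the name heuristic only on `None`, which keeps "annotations always win" in one place.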
The name heuristic
Unannotated tool names are split into lowercase words at separators and camelCase boundaries (deleteFile, delete-file, and delete_file all contain the word delete), then matched against two word lists:
- Mutating: create, update, set, send, write, post, put, insert, patch, add, upload
- Destructive: delete, remove, drop, destroy, purge, erase, wipe, kill, revoke
A destructive word outranks a mutating word in the same name (create_or_delete is Destructive).
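A minimal sketch of the heuristic, assuming whole-word matching against the two lists. The function names `split_words` and `classify_name` and the `NameClass` enum are hypothetical, and the splitter is deliberately simple: it breaks on any non-alphanumeric character and on a lowercase-or-digit to uppercase transition, and makes no attempt to split acronym runs such as `HTTPDelete`.

```rust
const MUTATING: &[&str] = &[
    "create", "update", "set", "send", "write", "post", "put", "insert", "patch", "add", "upload",
];
const DESTRUCTIVE: &[&str] = &[
    "delete", "remove", "drop", "destroy", "purge", "erase", "wipe", "kill", "revoke",
];

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NameClass {
    Destructive,
    Mutating,
    ReadOnlyPresumed,
}

/// Splits a tool name into lowercase words at separators and camelCase boundaries.
pub fn split_words(name: &str) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev_lower_or_digit = false;
    for ch in name.chars() {
        if !ch.is_alphanumeric() {
            // Separator such as '_' or '-': close the current word.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            prev_lower_or_digit = false;
            continue;
        }
        // camelCase boundary: lowercase (or digit) followed by uppercase.
        if ch.is_uppercase() && prev_lower_or_digit && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        current.extend(ch.to_lowercase());
        prev_lower_or_digit = ch.is_lowercase() || ch.is_numeric();
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

/// Classifies an unannotated tool by name; a destructive word outranks a mutating one.
pub fn classify_name(name: &str) -> NameClass {
    let words = split_words(name);
    if words.iter().any(|w| DESTRUCTIVE.contains(&w.as_str())) {
        NameClass::Destructive
    } else if words.iter().any(|w| MUTATING.contains(&w.as_str())) {
        NameClass::Mutating
    } else {
        NameClass::ReadOnlyPresumed
    }
}
```

Under this whole-word assumption, a name such as `reset_cache` is not treated as mutating, because its word is `reset` rather than `set`.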
What each class means at execution time
ExecutionPolicy::decide maps a class to one of four decisions:
- Execute: ReadOnly and ReadOnlyPresumed tools run freely, including stability double-calls.
- ExecuteOnce: Mutating tools run at most once per run. They are never double-called for stability, because the second call of a non-idempotent tool is itself a side effect. Generated tests for these tools are marked serial.
- GenerateOnly: Destructive tools without the override, in a context that can emit tests instead of running them. The generated test is prefixed with a marker comment beginning "# review before first run" so a human looks at it before it ever executes.
- Refuse: Destructive tools without the override, in a context that needs a live call (probing). The call is skipped with a typed reason.
Setting execute_destructive (to be exposed as a CLI flag in a later ticket) relaxes the decision for Destructive tools to ExecuteOnce. Even with the override, destructive tools are never double-called.
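The mapping from class to decision can be pictured as a single match. The sketch below is an assumption about shape only: the real signature of ExecutionPolicy::decide is not given here, and the `Context` enum (whether the caller can emit a test or needs a live call) is a hypothetical name introduced for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolClass {
    ReadOnly,
    ReadOnlyPresumed,
    Mutating,
    Destructive,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Execute,
    ExecuteOnce,
    GenerateOnly,
    Refuse,
}

/// Hypothetical: what the calling feature is able to do with the tool.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Context {
    /// The feature can emit a test instead of running the call.
    CanEmitTest,
    /// The feature needs a live call (probing).
    NeedsLiveCall,
}

pub struct ExecutionPolicy {
    pub execute_destructive: bool,
}

impl ExecutionPolicy {
    /// Sketch of the class-to-decision mapping described in the list above.
    pub fn decide(&self, class: ToolClass, ctx: Context) -> Decision {
        match class {
            ToolClass::ReadOnly | ToolClass::ReadOnlyPresumed => Decision::Execute,
            ToolClass::Mutating => Decision::ExecuteOnce,
            // With the override, destructive tools run once and are never double-called.
            ToolClass::Destructive if self.execute_destructive => Decision::ExecuteOnce,
            ToolClass::Destructive => match ctx {
                Context::CanEmitTest => Decision::GenerateOnly,
                Context::NeedsLiveCall => Decision::Refuse,
            },
        }
    }
}
```

Keeping the override as a match guard means there is no path on which a Destructive tool yields Execute, so stability double-calls cannot reach it.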
Policy knobs and defaults
| Knob | Default | Meaning |
|---|---|---|
| execute_destructive | false | Allow executing destructive tools. |
| max_calls | unlimited | Total tool-call budget for the run. |
| concurrency | 2 | Maximum calls in flight at once. |
| call_delay | 100ms | Polite pause between HTTP calls. |
The call budget is a thread-safe counter (CallBudget). Every planned call must acquire from it first; under concurrency exactly max_calls acquisitions succeed and the rest fail with a typed BudgetExhausted error carrying the limit, so a run can stop scheduling cleanly.
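One way to get the "exactly max_calls acquisitions succeed" property is a compare-and-swap loop on an atomic counter. The sketch below assumes that approach; the field names, the `new` and `acquire` signatures, and the use of `Option<usize>` for "unlimited" are illustrative choices, not the actual CallBudget API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Typed error carrying the configured limit.
#[derive(Debug, PartialEq, Eq)]
pub struct BudgetExhausted {
    pub limit: usize,
}

/// Thread-safe call budget; `limit == None` models the "unlimited" default.
pub struct CallBudget {
    limit: Option<usize>,
    used: AtomicUsize,
}

impl CallBudget {
    pub fn new(limit: Option<usize>) -> Self {
        Self { limit, used: AtomicUsize::new(0) }
    }

    /// Reserves one call, or fails once `limit` reservations have been made.
    pub fn acquire(&self) -> Result<(), BudgetExhausted> {
        let Some(limit) = self.limit else {
            return Ok(()); // unlimited budget: nothing to count
        };
        // fetch_update retries the closure until the compare-and-swap succeeds,
        // so two threads can never both take the last remaining slot.
        self.used
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < limit { Some(n + 1) } else { None }
            })
            .map(|_| ())
            .map_err(|_| BudgetExhausted { limit })
    }
}
```

Because the check and the increment happen in one atomic step, a scheduler can call `acquire` from any worker and stop planning new calls on the first `BudgetExhausted`.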
The delay applies between consecutive HTTP calls; callers decide whether the target transport is HTTP. Stdio targets may ignore it.
Example
The policy reads the tool descriptors a server returns from tools/list, so the way to steer it is the annotations object on each tool. These two tools classify in opposite directions:
    mock_server:
      name: records
      tools:
        # readOnlyHint wins over the name heuristic, so this runs freely
        # (stability double-calls included): ReadOnly -> Execute.
        - name: search_records
          description: "Find records matching a query."
          annotations:
            readOnlyHint: true
          response:
            content:
              - type: text
                text: "0 records"
        # No annotation, and the name contains "delete", so the heuristic
        # classifies it Destructive: GenerateOnly when a feature can emit a
        # test instead of running it, Refuse when a live call is required.
        - name: delete_record
          description: "Delete a record by id."
          response:
            content:
              - type: text
                text: "deleted"
Serve it with mcptest mock --tools-from records.yaml and point a synthesizing feature (scaffolding, proposal, or the probe tier) at it: the read-only tool is exercised, the destructive one is held back behind the # review before first run marker or skipped with a typed reason.
Data cleanup is your responsibility
mcptest never cleans up after a mutating or destructive test. If a generated or probed call creates a record, sends a message, or uploads a file, removing that data afterwards is the developer's responsibility. Run synthesized suites against disposable or staging targets, not production.