Execution safety policy

Some mcptest features execute real tool calls with synthesized arguments: suite scaffolding, assertion proposal, and the probe tier. Against a tool like delete_file or a production SaaS backend, that is an agent autonomously causing side effects. The execution safety policy in mcptest-core::exec_policy is the single layer those features consult before calling anything.

Tool classification

Every tool from tools/list is classified before any call is planned. Explicit MCP tool annotations (the annotations object on the tool descriptor) always win over the name heuristic.

Source	Condition	Class
Annotation	`readOnlyHint: true`	ReadOnly
Annotation	`destructiveHint: true`	Destructive
Annotation	`idempotentHint: false`	Mutating
Name heuristic	destructive-looking word	Destructive
Name heuristic	mutating-looking word	Mutating
Name heuristic	anything else	ReadOnlyPresumed

ReadOnly and ReadOnlyPresumed are kept distinct so callers can tell "the server declared this read-only" apart from "we presume it is". Among annotations, read-only is checked first (the spec defines the other hints as meaningful only when it is false), then destructive, then non-idempotent. A malformed annotations object (for example "destructiveHint": "yes") is ignored and the name heuristic decides; the description lints flag the malformed object separately.
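The precedence among annotations can be sketched as below. This is an illustration rather than the code in mcptest-core: the `Hints` struct, its field names, and `classify_by_annotations` are hypothetical, and it assumes that an annotations object with no matching hint falls through to the name heuristic in the same way a missing or malformed one does.

```rust
/// Classes from the classification table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolClass {
    ReadOnly,
    ReadOnlyPresumed,
    Mutating,
    Destructive,
}

/// Hypothetical, already-validated view of a tool's `annotations` object.
/// `None` in a field means the hint was absent.
#[derive(Debug, Clone, Copy, Default)]
pub struct Hints {
    pub read_only: Option<bool>,
    pub destructive: Option<bool>,
    pub idempotent: Option<bool>,
}

/// Returns the class decided by annotations, or `None` when the name
/// heuristic should decide instead.
pub fn classify_by_annotations(hints: Option<Hints>) -> Option<ToolClass> {
    // Missing or malformed annotations are modeled as `None` here.
    let h = hints?;
    // Read-only is checked first, then destructive, then non-idempotent.
    if h.read_only == Some(true) {
        return Some(ToolClass::ReadOnly);
    }
    if h.destructive == Some(true) {
        return Some(ToolClass::Destructive);
    }
    if h.idempotent == Some(false) {
        return Some(ToolClass::Mutating);
    }
    // Assumption: no matching hint means the name heuristic decides.
    None
}
```

Under this sketch a tool that declares both readOnlyHint: true and destructiveHint: true classifies as ReadOnly, because the read-only check runs first.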

The name heuristic

Unannotated tool names are split into lowercase words at separators and camelCase boundaries (deleteFile, delete-file, and delete_file all contain the word delete), then matched against two word lists:

Mutating: create, update, set, send, write, post, put, insert, patch, add, upload
Destructive: delete, remove, drop, destroy, purge, erase, wipe, kill, revoke

A destructive word outranks a mutating word in the same name (create_or_delete is Destructive).
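A minimal sketch of the heuristic follows, using the two word lists above. The function names `split_words` and `classify_by_name` are hypothetical, and the sketch assumes whole-word matching (so a name such as get_settings does not match set) and that a camelCase boundary is a lowercase letter followed by an uppercase one. The real implementation may differ on both points.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolClass {
    ReadOnlyPresumed,
    Mutating,
    Destructive,
}

const MUTATING: &[&str] = &[
    "create", "update", "set", "send", "write", "post", "put", "insert",
    "patch", "add", "upload",
];
const DESTRUCTIVE: &[&str] = &[
    "delete", "remove", "drop", "destroy", "purge", "erase", "wipe", "kill",
    "revoke",
];

/// Splits a tool name into lowercase words at non-alphanumeric separators
/// and at lowercase-to-uppercase (camelCase) boundaries.
pub fn split_words(name: &str) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for ch in name.chars() {
        if !ch.is_alphanumeric() {
            // Separator: close the current word, if any.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        if ch.is_uppercase() && prev_lower && !current.is_empty() {
            // camelCase boundary: start a new word.
            words.push(std::mem::take(&mut current));
        }
        current.extend(ch.to_lowercase());
        prev_lower = ch.is_lowercase();
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

/// Destructive words are checked before mutating words, so a destructive
/// word outranks a mutating one in the same name.
pub fn classify_by_name(name: &str) -> ToolClass {
    let words = split_words(name);
    if words.iter().any(|w| DESTRUCTIVE.contains(&w.as_str())) {
        ToolClass::Destructive
    } else if words.iter().any(|w| MUTATING.contains(&w.as_str())) {
        ToolClass::Mutating
    } else {
        ToolClass::ReadOnlyPresumed
    }
}
```

With this sketch, deleteFile, delete-file, and delete_file all split into delete and file and classify as Destructive, while create_or_delete is Destructive because the destructive list is consulted first.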

What each class means at execution time

ExecutionPolicy::decide maps a class to one of four decisions:

Execute: ReadOnly and ReadOnlyPresumed tools run freely, including stability double-calls.
ExecuteOnce: Mutating tools run at most once per run. They are never double-called for stability, because the second call of a non-idempotent tool is itself a side effect. Generated tests for these tools are marked serial.
GenerateOnly: Destructive tools without the override, in a context that can emit tests instead of running them. The generated test is prefixed with a marker comment beginning # review before first run so a human looks at it before it ever executes.
Refuse: Destructive tools without the override, in a context that needs a live call (probing). The call is skipped with a typed reason.

Setting execute_destructive (to be exposed as a CLI flag in a later ticket) changes the decision for Destructive tools to ExecuteOnce instead of GenerateOnly or Refuse. Even with the override, destructive tools are never double-called.
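The mapping from class to decision, including the override, could look like the sketch below. The signature of `decide` is an assumption: the `Context` enum and its variants are hypothetical stand-ins for however the real code distinguishes a feature that can emit a test from one that needs a live call, and the struct carries only the one knob needed here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolClass {
    ReadOnly,
    ReadOnlyPresumed,
    Mutating,
    Destructive,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Execute,
    ExecuteOnce,
    GenerateOnly,
    Refuse,
}

/// Hypothetical: what the calling feature is able to do with the tool.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Context {
    /// The feature can emit a test instead of running the call.
    CanEmitTest,
    /// The feature needs a live call (probing).
    NeedsLiveCall,
}

/// Reduced to the single knob this sketch needs.
pub struct ExecutionPolicy {
    pub execute_destructive: bool,
}

impl ExecutionPolicy {
    pub fn decide(&self, class: ToolClass, ctx: Context) -> Decision {
        match class {
            ToolClass::ReadOnly | ToolClass::ReadOnlyPresumed => Decision::Execute,
            ToolClass::Mutating => Decision::ExecuteOnce,
            // With the override, destructive tools run once, never twice.
            ToolClass::Destructive if self.execute_destructive => Decision::ExecuteOnce,
            ToolClass::Destructive => match ctx {
                Context::CanEmitTest => Decision::GenerateOnly,
                Context::NeedsLiveCall => Decision::Refuse,
            },
        }
    }
}
```

Note that no arm maps a Destructive tool to Execute, which is how the sketch encodes the rule that destructive tools are never double-called for stability.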

Policy knobs and defaults

Knob	Default	Meaning
`execute_destructive`	`false`	Allow executing destructive tools.
`max_calls`	unlimited	Total tool-call budget for the run.
`concurrency`	`2`	Maximum calls in flight at once.
`call_delay`	`100ms`	Polite pause between HTTP calls.

The call budget is a thread-safe counter (CallBudget). Every planned call must acquire from it first. When max_calls is set, exactly max_calls acquisitions succeed even under concurrency, and the rest fail with a typed BudgetExhausted error carrying the limit, so a run can stop scheduling cleanly. With the default (unlimited), acquisition never fails.
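One way to get the "exactly max_calls succeed" guarantee is a single atomic compare-and-swap loop, sketched below. Only the names CallBudget and BudgetExhausted come from this document; the fields, the `new` and `acquire` signatures, and the use of `Option<u64>` to model an unlimited budget are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Typed error carrying the limit that was hit.
#[derive(Debug, PartialEq, Eq)]
pub struct BudgetExhausted {
    pub limit: u64,
}

/// Thread-safe call budget. `limit: None` models the unlimited default.
pub struct CallBudget {
    limit: Option<u64>,
    used: AtomicU64,
}

impl CallBudget {
    pub fn new(limit: Option<u64>) -> Self {
        Self { limit, used: AtomicU64::new(0) }
    }

    /// Reserves one call, or fails once the limit has been reached.
    pub fn acquire(&self) -> Result<(), BudgetExhausted> {
        let Some(limit) = self.limit else {
            return Ok(()); // unlimited budget: always succeeds
        };
        // Increment only while below the limit. The update is atomic, so
        // concurrent callers can never push the count past `limit`.
        self.used
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < limit { Some(n + 1) } else { None }
            })
            .map(|_| ())
            .map_err(|_| BudgetExhausted { limit })
    }
}
```

A scheduler would call acquire before planning each tool call and stop scheduling on the first BudgetExhausted. Checking and incrementing in one atomic step avoids the race that a separate load followed by a store would have.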

The delay applies between consecutive HTTP calls; callers decide whether the target transport is HTTP. Stdio targets may ignore it.

Example

The policy reads the tool descriptors a server returns from tools/list, so the way to steer it is the annotations object on each tool. These two tools classify in opposite directions:

mock_server:
  name: records
  tools:
    # readOnlyHint wins over the name heuristic, so this runs freely
    # (stability double-calls included): ReadOnly -> Execute.
    - name: search_records
      description: "Find records matching a query."
      annotations:
        readOnlyHint: true
      response:
        content:
          - type: text
            text: "0 records"
    # No annotation, and the name contains "delete", so the heuristic
    # classifies it Destructive: GenerateOnly when a feature can emit a
    # test instead of running it, Refuse when a live call is required.
    - name: delete_record
      description: "Delete a record by id."
      response:
        content:
          - type: text
            text: "deleted"

Serve it with mcptest mock --tools-from records.yaml and point a synthesizing feature (scaffolding, proposal, or the probe tier) at it: the read-only tool is exercised, the destructive one is held back behind the # review before first run marker or skipped with a typed reason.

Data cleanup is your responsibility

mcptest never cleans up after a mutating or destructive test. If a generated or probed call creates a record, sends a message, or uploads a file, removing that data afterwards is the developer's responsibility. Run synthesized suites against disposable or staging targets, not production.