mcptest docs GitHub

Execution safety policy

Some mcptest features execute real tool calls with synthesized arguments: suite scaffolding, assertion proposal, and the probe tier. Against a tool like delete_file or a production SaaS backend, that is an agent autonomously causing side effects. The execution safety policy in mcptest-core::exec_policy is the single layer those features consult before calling anything.

Tool classification

Every tool from tools/list is classified before any call is planned. Explicit MCP tool annotations (the annotations object on the tool descriptor) always win over the name heuristic.

SourceConditionClass
AnnotationreadOnlyHint: trueReadOnly
AnnotationdestructiveHint: trueDestructive
AnnotationidempotentHint: falseMutating
Name heuristicdestructive-looking wordDestructive
Name heuristicmutating-looking wordMutating
Name heuristicanything elseReadOnlyPresumed

ReadOnly and ReadOnlyPresumed are kept distinct so callers can tell "the server declared this read-only" apart from "we presume it is". Among annotations, read-only is checked first (the spec defines the other hints as meaningful only when it is false), then destructive, then non-idempotent. A malformed annotations object (for example "destructiveHint": "yes") is ignored and the name heuristic decides; the description lints flag the malformed object separately.

The name heuristic

Unannotated tool names are split into lowercase words at separators and camelCase boundaries (deleteFile, delete-file, and delete_file all contain the word delete), then matched against two word lists:

A destructive word outranks a mutating word in the same name (create_or_delete is Destructive).

What each class means at execution time

ExecutionPolicy::decide maps a class to one of four decisions:

Setting execute_destructive (a CLI flag in a later ticket) downgrades Destructive to ExecuteOnce. Even with the override, destructive tools are never double-called.

Policy knobs and defaults

KnobDefaultMeaning
execute_destructivefalseAllow executing destructive tools.
max_callsunlimitedTotal tool-call budget for the run.
concurrency2Maximum calls in flight at once.
call_delay100msPolite pause between HTTP calls.

The call budget is a thread-safe counter (CallBudget). Every planned call must acquire from it first; under concurrency exactly max_calls acquisitions succeed and the rest fail with a typed BudgetExhausted error carrying the limit, so a run can stop scheduling cleanly.

The delay applies between consecutive HTTP calls; callers decide whether the target transport is HTTP. Stdio targets may ignore it.

Example

The policy reads the tool descriptors a server returns from tools/list, so the way to steer it is the annotations object on each tool. These two tools classify in opposite directions:

mock_server:
  name: records
  tools:
    # readOnlyHint wins over the name heuristic, so this runs freely
    # (stability double-calls included): ReadOnly -> Execute.
    - name: search_records
      description: "Find records matching a query."
      annotations:
        readOnlyHint: true
      response:
        content:
          - type: text
            text: "0 records"
    # No annotation, and the name contains "delete", so the heuristic
    # classifies it Destructive: GenerateOnly when a feature can emit a
    # test instead of running it, Refuse when a live call is required.
    - name: delete_record
      description: "Delete a record by id."
      response:
        content:
          - type: text
            text: "deleted"

Serve it with mcptest mock --tools-from records.yaml and point a synthesizing feature (scaffolding, proposal, or the probe tier) at it: the read-only tool is exercised, the destructive one is held back behind the # review before first run marker or skipped with a typed reason.

Data cleanup is your responsibility

mcptest never cleans up after a mutating or destructive test. If a generated or probed call creates a record, sends a message, or uploads a file, removing that data afterwards is the developer's responsibility. Run synthesized suites against disposable or staging targets, not production.