Spec: Schema evolution diff (mcptest diff)

Run this example. mcptest diff ships today. examples/diff-tools-baseline.json and examples/diff-tools-current.json differ by a removed tool and a newly required argument, so the diff reports two breaking changes and exits non-zero.

mcptest diff examples/diff-tools-baseline.json examples/diff-tools-current.json

Purpose

mcptest diff compares two snapshots of an MCP server's public surface and flags changes that are likely to break callers. The snapshot is the union of tools/list and resources/list (and any extensions to the discovery surface that ship later). The diff classifies each change by severity and emits structured output that a CI pipeline can act on.

This spec covers the command shape, the severity classification table, the output formats, the exit-code policy, and the worked example output. It does not cover protocol-level diffs (request shape, header behavior, error taxonomy) because those belong to the regular test path, not the catalog path.

Use cases

  1. Pre-merge CI guardrail. A service owner runs mcptest diff against the previous release's catalog baseline as part of CI. The build fails when a tool argument flips from optional to required, an enum value disappears, or a tool is removed without a deprecation window.
  2. Pre-deploy gate. A deploy pipeline runs mcptest diff between the staging cassette and the production cassette before promoting a build. Breaking changes block the promotion.
  3. Audit and changelog generation. A release engineer runs mcptest diff to produce the catalog-change section of the release notes. Output renders to Markdown and drops into the changelog.
  4. Consumer impact assessment. A consumer of a third-party MCP server records a cassette today, records another after the next release, and uses the diff to scope client work.
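
A minimal sketch of the pre-merge guardrail from use case 1, assuming mcptest is on PATH and returns the codes listed under "Exit codes". The helper names (gate_decision, run_gate) and the policy of blocking on anything other than 0 are illustrative assumptions, not part of mcptest.

```python
import subprocess
import sys

# Exit codes as listed in the "Exit codes" section of this spec.
EXIT_MESSAGES = {
    0: "no changes, or only NON-BREAKING changes",
    6: "BREAKING changes detected",
    7: "POTENTIALLY BREAKING changes detected",
    64: "CLI usage error",
    70: "internal error",
}


def gate_decision(returncode: int) -> tuple[bool, str]:
    """Return (allow, reason) for a given mcptest diff exit code."""
    reason = EXIT_MESSAGES.get(returncode, f"unexpected exit code {returncode}")
    # Assumed policy: only a clean 0 lets the build through.
    return returncode == 0, reason


def run_gate(baseline: str, current: str) -> int:
    """Run mcptest diff and translate its exit code into a CI verdict."""
    proc = subprocess.run(["mcptest", "diff", baseline, current])
    allow, reason = gate_decision(proc.returncode)
    print(f"mcptest diff: {reason}")
    return 0 if allow else 1


# Only runs the real command when invoked as: python gate.py <baseline> <current>
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(run_gate(sys.argv[1], sys.argv[2]))
```

Keeping the decision in a pure function makes the policy testable without a live mcptest binary.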

Command shapes

Three input combinations are supported. All three return the same diff shape; only the inputs differ.

Cassette to cassette

mcptest diff <baseline.cassette> <current.cassette>

Both arguments are recorded cassette files. The diff is computed from the recorded tools/list and resources/list entries in each cassette. If a cassette is missing either response, the command exits with a clear error pointing at the refresh path (see "Cassette refresh path" below).
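
As a rough illustration of where the tool half of the diff could start, the sketch below compares two tools/list results by tool name. It assumes each result has already been pulled out of its cassette as a dict with a tools array whose entries carry a name field (the shape MCP uses for tools/list results). The cassette layout itself is not specified here, and diff_tool_names is a hypothetical helper.

```python
def diff_tool_names(baseline: dict, current: dict) -> dict:
    """Compare two tools/list results and report added, removed, and kept tool names."""
    before = {tool["name"] for tool in baseline.get("tools", [])}
    after = {tool["name"] for tool in current.get("tools", [])}
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
        "kept": sorted(before & after),  # candidates for argument-level diffing
    }


baseline = {"tools": [{"name": "archive_issue"}, {"name": "create_issue"}]}
current = {"tools": [{"name": "create_issue"}, {"name": "delete_issue"}]}
print(diff_tool_names(baseline, current))
# {'added': ['delete_issue'], 'removed': ['archive_issue'], 'kept': ['create_issue']}
```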

Cassette to live server

mcptest diff <baseline.cassette> --against <url|command>

<url|command> is either an HTTP URL the runner can reach (URL transport) or a subprocess specification matching the servers[].command shape in the YAML schema. The runner connects to the live server, issues initialize plus tools/list plus resources/list, and diffs the response against the baseline cassette. Authentication, if needed, comes from the standard mcptest auth surface (env, dotenv, CLI flag).

Two baselines (advanced)

mcptest diff --baseline mcptest.baseline-catalog.yml --against <target>

mcptest.baseline-catalog.yml is a hand-authored or generated catalog file checked into the repo (see "Baseline catalog file" below). The --against target is a cassette path, a URL, or a subprocess specification. The two-baseline shape is the canonical "CI guardrail" usage: the baseline file moves only via deliberate review, and any drift between the baseline and the live server fails the build.
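
Because the --against value can name three kinds of input, something has to tell them apart. One plausible approach is sketched below. The precedence (URL first, then an existing file, then a command line) and the classify_target helper are assumptions made for illustration, not documented mcptest behavior.

```python
import os
import shlex
from urllib.parse import urlparse


def classify_target(target: str) -> tuple[str, object]:
    """Guess whether an --against value is a URL, a cassette path, or a command."""
    parsed = urlparse(target)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url", target
    if os.path.isfile(target):
        return "cassette", target
    # Assumed fallback: treat the value as a subprocess command line.
    return "command", shlex.split(target)


print(classify_target("https://mcp.example.com/rpc"))
print(classify_target("node server.js --stdio"))
```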

Severity classification

Every diff entry carries one of three severities. The classification table is the load-bearing artifact of this spec; reviewers should read it carefully because the severity decides the exit code.

| Change | Severity | Notes |
| --- | --- | --- |
| Tool removed | BREAKING | Existing callers fail immediately. |
| Tool renamed | BREAKING | Modeled as removal plus addition unless the rename is annotated (see "Open questions"). |
| Argument optional becomes required | BREAKING | Callers that previously omitted the argument now fail validation. |
| Argument default value changed | BREAKING | Behavior shift for callers relying on the old default, even though calls still validate. |
| Argument removed entirely | BREAKING | Callers that previously sent the argument now fail validation. |
| Argument type narrowed (e.g. `string\|null` to `string`) | BREAKING | Callers sending the variant that was narrowed out fail. |
| Enum value removed | BREAKING | Callers sending the removed value fail validation. |
| URI template shape changed | BREAKING | Resource subscribers expecting the old shape break. |
| Output schema tightened (e.g. field removed, type narrowed) | BREAKING | Consumers parsing the removed field fail. |
| Output schema gains a required field | POTENTIALLY BREAKING | Consumers that did not previously decode the field may still tolerate it; consumers that strict-parse fail. |
| Tool added | NON-BREAKING | New capability. |
| Argument added as optional with default | NON-BREAKING | Existing callers still validate. |
| Argument type widened (e.g. `string` to `string\|null`) | NON-BREAKING | Callers sending the old type still validate. |
| Enum value added | NON-BREAKING | Callers sending the old set still validate. |
| Output schema relaxed (field made optional) | NON-BREAKING | Consumers parsing the field still see it when present. |
| Description text changed | NON-BREAKING | Caller behavior is unaffected. Description quality is a separate concern (see docs/description-quality.md). |
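
A few rows of the table can be expressed as a small classifier. The sketch below covers only the required, default, and type rows for a single argument. The dict shape (required, default, types) is a hypothetical simplification of a JSON Schema property, not the diff engine's model, and treating a newly added required argument as BREAKING is an assumption the table does not state explicitly.

```python
from typing import Optional


def classify_arg_change(before: Optional[dict], after: Optional[dict]) -> str:
    """Classify one argument change using a subset of the severity table."""
    if before is None and after is None:
        raise ValueError("argument absent on both sides")
    if after is None:
        return "BREAKING"  # argument removed entirely
    if before is None:
        # Assumption: an added argument is safe only if callers may omit it.
        return "BREAKING" if after.get("required") else "NON-BREAKING"
    if not before.get("required") and after.get("required"):
        return "BREAKING"  # optional becomes required
    if before.get("default") != after.get("default"):
        return "BREAKING"  # default value changed or removed
    old_types = set(before.get("types", []))
    new_types = set(after.get("types", []))
    if old_types - new_types:
        return "BREAKING"  # a previously accepted type was dropped (narrowed)
    return "NON-BREAKING"  # unchanged or widened


narrowed = classify_arg_change(
    {"required": True, "types": ["string", "null"]},
    {"required": True, "types": ["string"]},
)
print(narrowed)  # BREAKING
```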

The severity of a diff as a whole is the maximum severity across all the changes it contains. A diff that adds an optional argument to one tool (NON-BREAKING) and removes another tool (BREAKING) is BREAKING overall.
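
The maximum rule can be sketched with an ordered enum. The numeric ordering and the names below are assumptions chosen so that max() picks the most severe entry; they are not taken from the implementation.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    # Ordered so that a larger value means a more severe change.
    NON_BREAKING = 0
    POTENTIALLY_BREAKING = 1
    BREAKING = 2


def overall_severity(changes: list[Severity]) -> Optional[Severity]:
    """Return the most severe entry, or None when the diff is empty."""
    return max(changes, default=None)


print(overall_severity([Severity.NON_BREAKING, Severity.BREAKING]).name)  # BREAKING
```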

Output formats

mcptest diff reuses the existing reporter set. The reporter is selected with --format. Five reporters are supported: pretty, JSON, JUnit, Markdown, and SARIF.

All five reporters share the same internal diff model. Adding a sixth reporter does not require rewriting the diff engine.
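
To make the shared-model idea concrete, here is a sketch in which two renderers consume the same list of change records. The Change fields mirror the keys in the JSON example under "Worked example output"; the class itself, the function names, and the Markdown layout are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class Change:
    """One diff entry; field names follow the JSON example in this spec."""
    kind: str
    severity: str
    path: str
    before: Optional[Any]
    after: Optional[Any]


def render_json(changes: list[Change]) -> str:
    """Serialize the shared model as JSON."""
    return json.dumps({"changes": [asdict(c) for c in changes]}, indent=2)


def render_markdown(changes: list[Change]) -> str:
    """Render the same model as a Markdown table (layout is illustrative)."""
    lines = ["| Path | Kind | Severity |", "| --- | --- | --- |"]
    lines += [f"| {c.path} | {c.kind} | {c.severity} |" for c in changes]
    return "\n".join(lines)


changes = [Change("tool_removed", "breaking", "tools.archive_issue", {"name": "archive_issue"}, None)]
print(render_markdown(changes))
```

A new reporter in this arrangement is one more function over list[Change]; the code that builds the list does not change.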

Exit codes

The exit code is the load-bearing CI signal. The policy is deliberately distinct from the test runner's exit codes so that mcptest run and mcptest diff can be invoked from the same script without ambiguity.

| Code | Meaning |
| --- | --- |
| 0 | No changes, or only NON-BREAKING changes detected. |
| 6 | BREAKING changes detected. |
| 7 | POTENTIALLY BREAKING changes detected, no BREAKING changes. (Optional, off by default; enabled with --strict-potentially-breaking.) |
| 64 | CLI usage error (missing file, bad flag). Inherits the EX_USAGE convention. |
| 70 | Internal error (panic, IO failure, cassette parse failure). Inherits the EX_SOFTWARE convention. |

Exit code 1 is reserved for mcptest run test failures; diff never uses it. Code 6 was chosen because it does not collide with any reserved Unix sysexit code and because it is mnemonic ("breaking changes").
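
The policy for the three diff outcomes can be written as a small function. This is a sketch under one assumption the table implies but does not spell out: without --strict-potentially-breaking, a diff containing only POTENTIALLY BREAKING changes exits 0. The function name and parameters are illustrative.

```python
def diff_exit_code(breaking: int, potentially_breaking: int,
                   strict_potentially_breaking: bool = False) -> int:
    """Map summary counts to the exit codes defined in this section."""
    if breaking > 0:
        return 6
    if potentially_breaking > 0 and strict_potentially_breaking:
        return 7
    # Assumed default: POTENTIALLY BREAKING alone does not fail the run.
    return 0


print(diff_exit_code(breaking=4, potentially_breaking=0))  # 6
print(diff_exit_code(breaking=0, potentially_breaking=1, strict_potentially_breaking=True))  # 7
```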

Cassette refresh path

Older cassettes recorded before the catalog-aware recorder shipped do not necessarily include tools/list and resources/list. When mcptest diff loads such a cassette, it exits with code 64 and prints:

error: cassette `<path>` does not include a tools/list response.
hint:  refresh the cassette catalog with:
       mcptest record --catalog-only --cassette <path>

mcptest record --catalog-only re-runs initialize, tools/list, and resources/list against the original server and patches the cassette in place. The flag is additive: existing recorded interactions inside the cassette are preserved. The recorder fails with a clear error if the original server cannot be reached.

Baseline catalog file

The two-baseline shape uses a YAML file (mcptest.baseline-catalog.yml) that names a set of tools and their argument shapes. The file is a human-readable subset of the cassette format: only the catalog matters, no recorded interactions. The format is the subject of a separate follow-up spec; the diff command treats the baseline file as opaque and delegates parsing to the catalog crate.

The reason to support a hand-authored baseline rather than insisting on a cassette: cassettes record real responses, which may include environment-specific data (account IDs, hostnames) that the catalog should not carry. A baseline file is the "intended public surface" in source control, edited deliberately, reviewed in PRs.

Worked example output

Pretty reporter, against a fixture pair with several intentional changes:

$ mcptest diff cassettes/v1.cassette cassettes/v2.cassette

Tool catalog diff: cassettes/v1.cassette -> cassettes/v2.cassette

Tools added (1):
  + delete_issue
      args: id (string, required)
      description: "Delete a Linear issue by ID."

Tools removed (1):
  - archive_issue (BREAKING)
      last seen with: args.id (string, required)

Tools changed (2):
  create_issue
      args.priority: optional -> required (BREAKING)
      args.labels: default value `[]` removed (BREAKING)
      args.estimate: added, optional, type number (NON-BREAKING)
  list_issues
      result.next_cursor: type widened string -> string|null (BREAKING)

Resources unchanged.

Summary: 4 BREAKING, 2 NON-BREAKING, 0 POTENTIALLY BREAKING.
Exit code: 6

The JSON reporter renders the same diff as follows; the changes array is abbreviated to three of the six entries:

{
  "summary": {
    "breaking": 4,
    "potentially_breaking": 0,
    "non_breaking": 2
  },
  "changes": [
    {
      "kind": "tool_added",
      "severity": "non_breaking",
      "path": "tools.delete_issue",
      "before": null,
      "after": { "name": "delete_issue", "args": [{ "name": "id", "type": "string", "required": true }] }
    },
    {
      "kind": "tool_removed",
      "severity": "breaking",
      "path": "tools.archive_issue",
      "before": { "name": "archive_issue" },
      "after": null
    },
    {
      "kind": "arg_optional_to_required",
      "severity": "breaking",
      "path": "tools.create_issue.args.priority",
      "before": { "required": false },
      "after": { "required": true }
    }
  ]
}

JUnit, Markdown, and SARIF outputs follow the same internal model.
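
Downstream tooling can consume the JSON reporter's output directly. The sketch below assumes the field names shown in the example above (changes, severity, path) and uses a sample report trimmed to the changes array; breaking_paths is a hypothetical helper.

```python
import json

# Sample trimmed from the worked example; only the fields used below are kept.
REPORT = """
{
  "changes": [
    {"kind": "tool_added", "severity": "non_breaking", "path": "tools.delete_issue"},
    {"kind": "tool_removed", "severity": "breaking", "path": "tools.archive_issue"},
    {"kind": "arg_optional_to_required", "severity": "breaking",
     "path": "tools.create_issue.args.priority"}
  ]
}
"""


def breaking_paths(report_text: str) -> list[str]:
    """Return the path of every change whose severity is 'breaking'."""
    report = json.loads(report_text)
    return [c["path"] for c in report["changes"] if c["severity"] == "breaking"]


print(breaking_paths(REPORT))
# ['tools.archive_issue', 'tools.create_issue.args.priority']
```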

Open questions

Three questions are deferred to the implementation ticket. Each is called out here so reviewers know they are unresolved.

  1. Resource URI templates with parameter renames. A URI template like issues/{issue_id} becoming issues/{id} changes the parameter name but not the structural shape. Is that a BREAKING change (clients parsing the parameter name fail), a POTENTIALLY BREAKING change (clients matching on shape are fine), or NON-BREAKING (most clients pass the URI through opaquely)? The current proposal is POTENTIALLY BREAKING, with a flag to upgrade to BREAKING for strict shops.
  2. "Renamed" versus "removed plus added". A tool that gets renamed from create_issue to create_ticket looks identical to a removal plus an addition. The cassette format has no rename annotation today. Options: (a) treat all renames as removal plus addition and document that authors should add a deprecation window; (b) add a rename annotation to the catalog format; (c) heuristic match on argument shape similarity. The proposal is (a) for the first implementation, with (b) as a future follow-up if user demand surfaces.
  3. Description-only changes inside otherwise unchanged tools. A tool whose description text changes but whose argument shape does not is NON-BREAKING by the table above. Some shops want to see those changes in the diff anyway (for changelog generation). The proposal is to emit them in the diff with severity: non_breaking but to suppress them from the summary unless --show-descriptions is passed. Verdict pending implementation.
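
The distinction in question 1 between a parameter rename and a shape change can be sketched by comparing templates with their parameter names blanked out. This follows the current proposal (rename is POTENTIALLY BREAKING) and handles only simple {name} expressions; RFC 6570 operators such as {?query} or {+path} are ignored, and all function names are hypothetical.

```python
import re

# Matches a simple {name} expression; operators and modifiers are not handled.
_PARAM = re.compile(r"\{[^{}]*\}")


def template_shape(template: str) -> str:
    """Blank out parameter names so only the structural shape remains."""
    return _PARAM.sub("{}", template)


def classify_template_change(before: str, after: str) -> str:
    """Classify a URI template change per the proposal in open question 1."""
    if before == after:
        return "UNCHANGED"
    if template_shape(before) != template_shape(after):
        return "BREAKING"  # structural shape changed
    return "POTENTIALLY BREAKING"  # same shape, parameter renamed


print(classify_template_change("issues/{issue_id}", "issues/{id}"))
# POTENTIALLY BREAKING
```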

Implementation notes (non-binding)
