Spec: Schema evolution diff (`mcptest diff`)

Status: Draft, deferred

Run this example. mcptest diff ships today. examples/diff-tools-baseline.json and examples/diff-tools-current.json differ by a removed tool and a newly required argument, so the diff reports two breaking changes and exits with code 6.

mcptest diff examples/diff-tools-baseline.json examples/diff-tools-current.json

Purpose

mcptest diff compares two snapshots of an MCP server's public surface and flags changes that are likely to break callers. The snapshot is the union of tools/list and resources/list (and any extensions to the discovery surface that ship later). The diff classifies each change by severity and emits structured output that a CI pipeline can act on.

This spec covers the command shape, the severity classification table, the output formats, the exit-code policy, and the worked example output. It does not cover protocol-level diffs (request shape, header behavior, error taxonomy) because those belong to the regular test path, not the catalog path.

Use cases

Pre-merge CI guardrail. A service owner runs mcptest diff against the previous release's catalog baseline as part of CI. The build fails when a tool argument flips from optional to required, an enum value disappears, or a tool is removed without a deprecation window.
Pre-deploy gate. A deploy pipeline runs mcptest diff between the staging cassette and the production cassette before promoting a build. Breaking changes block the promotion.
Audit and changelog generation. A release engineer runs mcptest diff to produce the catalog-change section of the release notes. Output renders to Markdown and drops into the changelog.
Consumer impact assessment. A consumer of a third-party MCP server records a cassette today, records another after the next release, and uses the diff to scope client work.

Command shapes

Three input combinations are supported. All three return the same diff shape; only the inputs differ.

Cassette to cassette

mcptest diff <baseline.cassette> <current.cassette>

Both arguments are recorded cassette files. The diff is computed from the recorded tools/list and resources/list entries in each cassette. If either cassette is missing one or both of those responses, the command exits with an error that points at the refresh path (see "Cassette refresh path" below).

Cassette to live server

mcptest diff <baseline.cassette> --against <url|command>

<url|command> is either an HTTP URL the runner can reach (URL transport) or a subprocess specification matching the servers[].command shape in the YAML schema. The runner connects to the live server, issues initialize, tools/list, and resources/list, and diffs the responses against the baseline cassette. Authentication, if needed, comes from the standard mcptest auth surface (env, dotenv, CLI flag).
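One way the runner could tell the two forms of the --against value apart is sketched below. This is a minimal illustration, not the shipped parser: the Target type, the parse_target function, and the rule "an http:// or https:// prefix means URL, anything else is a whitespace-split command line" are all assumptions made for the example, and the real servers[].command shape may carry more structure.

```rust
/// Hypothetical parsed form of the `--against` value.
#[derive(Debug, PartialEq)]
enum Target {
    /// URL transport: the runner connects over HTTP.
    Url(String),
    /// Subprocess transport: program name followed by its arguments.
    Command(Vec<String>),
}

/// Classify a raw `--against` value.
///
/// Assumption for this sketch: a value starting with `http://` or
/// `https://` is a URL; everything else is split on whitespace and
/// treated as a subprocess command line (no shell-style quoting).
fn parse_target(raw: &str) -> Option<Target> {
    let trimmed = raw.trim();
    if trimmed.is_empty() {
        // Nothing to connect to; the caller would report a usage error.
        return None;
    }
    if trimmed.starts_with("http://") || trimmed.starts_with("https://") {
        return Some(Target::Url(trimmed.to_string()));
    }
    Some(Target::Command(
        trimmed.split_whitespace().map(str::to_string).collect(),
    ))
}
```

An empty value yields None, which a caller would plausibly surface as a CLI usage error.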

Two baselines (advanced)

mcptest diff --baseline mcptest.baseline-catalog.yml --against <target>

mcptest.baseline-catalog.yml is a hand-authored or generated catalog file checked into the repo (see "Baseline catalog file" below). The --against target is one of: a cassette path, a URL, or a subprocess. The two-baseline shape is the canonical "CI guardrail" usage: the baseline file moves only via deliberate review, and any drift between the baseline and the live server fails the build.

Severity classification

Every diff entry carries one of three severities. The classification table is the load-bearing artifact of this spec; reviewers should read it carefully because the severity decides the exit code.

Change	Severity	Notes
Tool removed	BREAKING	Existing callers fail immediately.
Tool renamed	BREAKING	Modeled as removal plus addition unless the rename is annotated (see "Open questions").
Argument optional becomes required	BREAKING	Callers that previously omitted the argument now fail validation.
Argument default value changed	BREAKING	Behavior shift for callers relying on the old default, even though calls still validate.
Argument removed entirely	BREAKING	Callers that previously sent the argument now fail validation.
Argument type narrowed (e.g. `string|null` to `string`)	BREAKING	Callers sending the dropped variant (here `null`) fail validation.
Enum value removed	BREAKING	Callers sending the removed value fail validation.
URI template shape changed	BREAKING	Resource subscribers expecting the old shape break.
Output schema tightened (e.g. field removed, type narrowed)	BREAKING	Consumers parsing the removed field fail.
Output schema gains a required field	POTENTIALLY BREAKING	Consumers that did not previously decode the field may still tolerate it; consumers that strict-parse fail.
Tool added	NON-BREAKING	New capability.
Argument added as optional with default	NON-BREAKING	Existing callers still validate.
Argument type widened (e.g. `string` to `string|null`)	NON-BREAKING	Callers sending the old type still validate.
Enum value added	NON-BREAKING	Callers sending the old set still validate.
Output schema relaxed (field made optional)	NON-BREAKING	Consumers parsing the field still see it when present.
Description text changed	NON-BREAKING	Caller behavior is unaffected. Description quality is a separate concern (see `docs/description-quality.md`).

The severity for a given diff is the maximum severity across all the changes inside that diff. A diff that contains an added optional argument (NON-BREAKING) and a removed tool (BREAKING) is BREAKING overall.
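The "maximum severity" rule can be sketched with an ordered enum. The type and function names here (Severity, overall) are illustrative assumptions, not names taken from mcptest-core; the only thing the sketch relies on is that the three severities have a total order with BREAKING at the top.

```rust
/// The three severities from the classification table. The derived `Ord`
/// follows declaration order, so
/// `NonBreaking < PotentiallyBreaking < Breaking`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    NonBreaking,
    PotentiallyBreaking,
    Breaking,
}

/// Overall severity of a diff: the maximum over its individual changes.
/// Returns `None` for an empty diff (no changes at all).
fn overall(changes: &[Severity]) -> Option<Severity> {
    changes.iter().copied().max()
}
```

Deriving the order from the declaration keeps the rule in one place: a future fourth severity only needs to be inserted at the right position in the enum.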

Output formats

mcptest diff reuses the existing reporter set. The reporter is selected with --format. Supported values:

pretty (default): human-readable text, colored when writing to a TTY, plain ASCII otherwise. Used for local runs and developer terminals.
json: structured JSON for downstream tooling. The top-level shape is {summary, changes[]}, where each change is {kind, severity, path, before, after, description}.
junit: each tool change becomes a test case. A BREAKING change is a failed test, a NON-BREAKING change is a passed test, and a POTENTIALLY BREAKING change is a skipped test with a message. The shape fits the existing JUnit reporter.
markdown: a PR-comment-friendly summary. Tables for added, removed, and changed tools. Severity is rendered as a leading badge per row. Designed to be pasted into a GitHub or GitLab PR comment by a CI action.
sarif: each change becomes a SARIF result entry, and severity maps to the SARIF level (error for BREAKING, warning for POTENTIALLY BREAKING, note for NON-BREAKING). Designed for code-scanning surfaces that ingest SARIF (GitHub Advanced Security, Sonar).

All five reporters share the same internal diff model. Adding a sixth reporter does not require rewriting the diff engine.
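Because every reporter reads the same diff model, the per-format severity mappings described above reduce to small lookup functions. The sketch below is an illustration under assumed names (Severity, JunitOutcome, sarif_level, junit_outcome); the actual reporter interfaces in mcptest-report may look different.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Breaking,
    PotentiallyBreaking,
    NonBreaking,
}

/// SARIF `level` string for a change, following the mapping in the
/// sarif reporter description.
fn sarif_level(severity: Severity) -> &'static str {
    match severity {
        Severity::Breaking => "error",
        Severity::PotentiallyBreaking => "warning",
        Severity::NonBreaking => "note",
    }
}

/// Hypothetical JUnit test-case outcome for a change.
#[derive(Debug, PartialEq, Eq)]
enum JunitOutcome {
    Failed,
    Skipped,
    Passed,
}

/// JUnit outcome for a change, following the junit reporter description:
/// breaking fails, potentially breaking is skipped, non-breaking passes.
fn junit_outcome(severity: Severity) -> JunitOutcome {
    match severity {
        Severity::Breaking => JunitOutcome::Failed,
        Severity::PotentiallyBreaking => JunitOutcome::Skipped,
        Severity::NonBreaking => JunitOutcome::Passed,
    }
}
```

The exhaustive match means a new severity variant would fail to compile until each reporter decides how to render it.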

Exit codes

The exit code is the load-bearing CI signal. The policy is deliberately distinct from the test runner's exit codes so that mcptest run and mcptest diff can be invoked from the same script without ambiguity.

Code	Meaning
0	No changes, or only NON-BREAKING changes detected.
6	BREAKING changes detected.
7	POTENTIALLY BREAKING changes detected, no BREAKING changes. (Optional, off by default; enabled with `--strict-potentially-breaking`.)
64	CLI usage error (missing file, bad flag). Inherits the `EX_USAGE` convention.
70	Internal error (panic, IO failure, cassette parse failure). Inherits the `EX_SOFTWARE` convention.

Exit code 1 is reserved for mcptest run test failures; diff never uses it. Code 6 was chosen because it does not collide with any reserved Unix sysexit code and because it is mnemonic ("breaking changes").
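The policy for the three diff-result codes can be sketched as a pure function of the summary counts. The function name and signature are hypothetical. The sketch also makes one assumption the table does not state outright: when --strict-potentially-breaking is not set, a diff with only POTENTIALLY BREAKING changes exits 0.

```rust
const EXIT_OK: i32 = 0;
const EXIT_BREAKING: i32 = 6;
const EXIT_POTENTIALLY_BREAKING: i32 = 7;

/// Exit code for a completed diff (usage and internal errors, codes 64
/// and 70, are raised elsewhere and not modeled here).
///
/// Assumption: without the strict flag, potentially breaking changes
/// alone do not fail the run.
fn exit_code(
    breaking: usize,
    potentially_breaking: usize,
    strict_potentially_breaking: bool,
) -> i32 {
    if breaking > 0 {
        // BREAKING always wins, regardless of the strict flag.
        EXIT_BREAKING
    } else if potentially_breaking > 0 && strict_potentially_breaking {
        EXIT_POTENTIALLY_BREAKING
    } else {
        EXIT_OK
    }
}
```

Keeping the decision free of I/O makes the policy easy to cover with table-driven tests.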

Cassette refresh path

Older cassettes recorded before the catalog-aware recorder shipped do not necessarily include tools/list and resources/list. When mcptest diff loads such a cassette, it exits with code 64 and prints:

error: cassette `<path>` does not include a tools/list response.
hint:  refresh the cassette catalog with:
       mcptest record --catalog-only --cassette <path>

mcptest record --catalog-only re-runs initialize, tools/list, and resources/list against the original server and patches the cassette in place. The flag is additive: existing recorded interactions inside the cassette are preserved. The recorder fails with a clear error if the original server cannot be reached.

Baseline catalog file

The two-baseline shape uses a YAML file (mcptest.baseline-catalog.yml) that names a set of tools and their argument shapes. The file is a human-readable subset of the cassette format: only the catalog matters, no recorded interactions. The format is the subject of a separate follow-up spec; the diff command treats the baseline file as opaque and delegates parsing to the catalog crate.

The reason to support a hand-authored baseline rather than insisting on a cassette: cassettes record real responses, which may include environment-specific data (account IDs, hostnames) that the catalog should not carry. A baseline file is the "intended public surface" in source control, edited deliberately, reviewed in PRs.

Worked example output

Pretty reporter, against a fixture pair with several intentional changes:

$ mcptest diff cassettes/v1.cassette cassettes/v2.cassette

Tool catalog diff: cassettes/v1.cassette -> cassettes/v2.cassette

Tools added (1):
  + delete_issue
      args: id (string, required)
      description: "Delete a Linear issue by ID."

Tools removed (1):
  - archive_issue (BREAKING)
      last seen with: args.id (string, required)

Tools changed (2):
  create_issue
      args.priority: optional -> required (BREAKING)
      args.labels: default value `[]` removed (BREAKING)
      args.estimate: added, optional, type number (NON-BREAKING)
  list_issues
      result.next_cursor: type widened string -> string|null (BREAKING)

Resources unchanged.

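Summary: 4 BREAKING, 2 NON-BREAKING, 0 POTENTIALLY BREAKING.
Exit code: 6

The JSON reporter renders the same diff as follows (only the first three entries of changes are shown, and the description field is omitted for brevity):

{
  "summary": {
    "breaking": 4,
    "potentially_breaking": 0,
    "non_breaking": 2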
  },
  "changes": [
    {
      "kind": "tool_added",
      "severity": "non_breaking",
      "path": "tools.delete_issue",
      "before": null,
      "after": { "name": "delete_issue", "args": [{ "name": "id", "type": "string", "required": true }] }
    },
    {
      "kind": "tool_removed",
      "severity": "breaking",
      "path": "tools.archive_issue",
      "before": { "name": "archive_issue" },
      "after": null
    },
    {
      "kind": "arg_optional_to_required",
      "severity": "breaking",
      "path": "tools.create_issue.args.priority",
      "before": { "required": false },
      "after": { "required": true }
    }
  ]
}

JUnit, Markdown, and SARIF outputs follow the same internal model.
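The summary object is a straightforward tally over changes[]. The sketch below shows that tally using only the three entries visible in the JSON example, so its counts cover those three entries rather than the full diff. Change, Summary, summarize, and example_changes are hypothetical names for illustration, not the types exposed by the reporter crate.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Breaking,
    PotentiallyBreaking,
    NonBreaking,
}

/// Hypothetical in-memory form of one `changes[]` entry
/// (`before`, `after`, and `description` are left out of this sketch).
struct Change {
    kind: &'static str,
    severity: Severity,
    path: &'static str,
}

/// Counts that back the `summary` object.
#[derive(Debug, Default, PartialEq, Eq)]
struct Summary {
    breaking: usize,
    potentially_breaking: usize,
    non_breaking: usize,
}

/// Tally changes by severity.
fn summarize(changes: &[Change]) -> Summary {
    let mut summary = Summary::default();
    for change in changes {
        match change.severity {
            Severity::Breaking => summary.breaking += 1,
            Severity::PotentiallyBreaking => summary.potentially_breaking += 1,
            Severity::NonBreaking => summary.non_breaking += 1,
        }
    }
    summary
}

/// The three entries shown in the JSON example.
fn example_changes() -> Vec<Change> {
    vec![
        Change { kind: "tool_added", severity: Severity::NonBreaking, path: "tools.delete_issue" },
        Change { kind: "tool_removed", severity: Severity::Breaking, path: "tools.archive_issue" },
        Change {
            kind: "arg_optional_to_required",
            severity: Severity::Breaking,
            path: "tools.create_issue.args.priority",
        },
    ]
}
```

For those three entries, summarize counts two breaking changes and one non-breaking change.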

Open questions

Three questions are deferred to the implementation ticket. Each is called out here so reviewers know they are unresolved.

Resource URI templates with parameter renames. A URI template like issues/{issue_id} becoming issues/{id} changes the parameter name but not the structural shape. Is that a BREAKING change (clients parsing the parameter name fail), a POTENTIALLY BREAKING change (clients matching on shape are fine), or NON-BREAKING (most clients pass through opaque)? The current proposal is POTENTIALLY BREAKING, with a flag to upgrade to BREAKING for strict shops.
"Renamed" versus "removed plus added". A tool that gets renamed from create_issue to create_ticket looks identical to a removal plus an addition. The cassette format has no rename annotation today. Options: (a) treat all renames as removal plus addition and document that authors should add a deprecation window; (b) add a rename annotation to the catalog format; (c) heuristic match on argument shape similarity. The proposal is (a) for the first implementation, with (b) as a future follow-up if user demand surfaces.
Description-only changes inside otherwise unchanged tools. A tool whose description text changes but whose argument shape does not is NON-BREAKING by the table above. Some shops want to see those changes in the diff anyway (for changelog generation). The proposal is to emit them in the diff with severity: non_breaking but to suppress them from the summary unless --show-descriptions is passed. Verdict pending implementation.
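Option (a) from the rename question falls out naturally from a set difference over tool names. The sketch below is a minimal illustration with a hypothetical function name (diff_tool_names); it compares names only and performs no similarity matching, so a rename surfaces as one removal plus one addition.

```rust
use std::collections::BTreeSet;

/// Names-only diff of two tool catalogs.
/// Returns `(removed, added)`, each sorted alphabetically.
fn diff_tool_names(baseline: &[&str], current: &[&str]) -> (Vec<String>, Vec<String>) {
    let before: BTreeSet<&str> = baseline.iter().copied().collect();
    let after: BTreeSet<&str> = current.iter().copied().collect();

    // In the baseline but not in the current catalog: BREAKING removals.
    let removed: Vec<String> = before
        .difference(&after)
        .map(|name| (*name).to_string())
        .collect();

    // In the current catalog but not in the baseline: NON-BREAKING additions.
    let added: Vec<String> = after
        .difference(&before)
        .map(|name| (*name).to_string())
        .collect();

    (removed, added)
}
```

With create_issue renamed to create_ticket, the function reports create_issue as removed and create_ticket as added, which is why option (a) asks authors to keep a deprecation window.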

Implementation notes (non-binding)

The diff engine lives in mcptest-core (new module mcptest_core::catalog::diff). The crate already owns the protocol layer and is the natural home for catalog logic.
The cassette catalog reader lives in mcptest-cassette. It exposes Cassette::catalog() returning Catalog { tools, resources, ... }.
The JSON Schema entry for the baseline catalog file lives in schemas/v1.json under a new top-level key (catalog_baseline).
Reporter integration goes through mcptest-report. Each reporter picks up DiffSummary and renders its native format.
Tests live as fixture-pair integration tests under tests/diff/<scenario>/. The first fixtures are: identical catalogs, one new tool, one removed tool, one breaking arg change, one description-only change.

References

The cassette format (defines what the diff reads).
The expected-failures baseline (the test-result baseline pattern; this spec is the catalog-baseline analogue).
Cassette portability (cassettes recorded in one environment must be diffable in another).
Cassette determinism normalization (the diff must not flag normalized fields as catalog changes).
