
Metamorphic testing

Status: implemented behind the preview schema flag. Tracked as epic WOR-1236 and child WOR-1237. The relation engine and the runner that issues the transformed calls both ship; the metamorphic block is marked preview in the schema while the surface settles.

A matcher in mcptest answers one question: given these arguments, is this response correct? To ask it you have to know the correct response and write it down. That works for a tool with a stable answer. It does not work for a search, a summarizer, a recommender, or a ranking tool, where the right output drifts with the index, the model, and the day. Those tools get left untested, or pushed onto an llm-judge that costs money and is not deterministic.

Metamorphic testing removes the requirement. Instead of a golden output, you assert a relation between calls: a property that must hold between the output of one call and the output of a related call, whatever the actual values are. A search that returns a different result set when the same filters are passed in a different order has a bug, and you can say so without knowing what the results should be. This is the standard answer to the test-oracle problem, applied to REST APIs by recent work (Multi-Agent LLM-based Metamorphic Testing for REST APIs, arXiv:2605.28321). MCP tool schemas look like OpenAPI operations, so the relations carry over.

The default comparison runs no model.

The relation catalog

Each relation pairs an input transform with the property the output must keep. All of these are deterministic.
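As a rough illustration of that pairing, the sketch below models a relation as a transform plus a property and checks it against a toy tool. The names Relation, check, and toy_search are hypothetical and exist only for this example; they are not mcptest's internal types.

```python
from dataclasses import dataclass
from typing import Any, Callable

Args = dict[str, Any]
Tool = Callable[[Args], Any]


@dataclass(frozen=True)
class Relation:
    """Hypothetical model of a relation: an input transform plus an output property."""
    name: str
    transform: Callable[[Args], Args]   # builds the related call's arguments
    holds: Callable[[Any, Any], bool]   # property over (base output, related output)


def check(tool: Tool, args: Args, relation: Relation) -> bool:
    """Run the base call and the related call, then evaluate the property."""
    base = tool(args)
    related = tool(relation.transform(args))
    return relation.holds(base, related)


# Toy stand-in for an MCP tool: returns the documents that contain the query.
DOCS = ["anthropic docs", "anthropic blog", "other page"]


def toy_search(args: Args) -> list[str]:
    return [doc for doc in DOCS if args["query"] in doc]


idempotent = Relation(
    name="idempotent",
    transform=lambda args: dict(args),          # the same call again
    holds=lambda base, again: base == again,    # outputs must be equal
)

ok = check(toy_search, {"query": "anthropic"}, idempotent)
```

No expected output appears anywhere in the sketch: the property compares the two outputs with each other, not with a golden value.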

A relation needs two runs and a comparison, nothing else. Equality reuses the same snapshot normalization pass the cassette path uses, so two recordings diff cleanly and an incidental field like a timestamp does not trip a false violation.
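The effect of normalizing before comparing can be sketched as follows. This is a minimal illustration, not the actual normalization pass: the set of volatile keys (timestamp, request_id) and the function names are assumptions made for the example.

```python
from typing import Any

# Assumption: which fields count as incidental. The real pass may use other rules.
VOLATILE_KEYS = {"timestamp", "request_id"}


def normalize(value: Any) -> Any:
    """Recursively drop volatile keys so only meaningful content is compared."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


def equal_after_normalization(a: Any, b: Any) -> bool:
    return normalize(a) == normalize(b)


first = {"results": [{"id": 1, "timestamp": "2024-01-01T00:00:00Z"}], "request_id": "a"}
second = {"results": [{"id": 1, "timestamp": "2024-01-01T00:00:05Z"}], "request_id": "b"}

raw_equal = first == second                                 # differs on incidental fields
normalized_equal = equal_after_normalization(first, second)
```

Here the two responses differ only in their timestamp and request id, so a raw comparison reports a difference while the normalized comparison does not.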

Named presets

Picking relations for a new tool is the hard part: which relations in the catalog above even apply? Two built-in presets bundle the relations that need no per-tool argument, so a tool with no stable golden output is testable out of the box. Name them under presets:, either instead of relations: or alongside it:

    metamorphic:
      presets: [universal, text]

Presets expand to the concrete relations above and merge with any explicit relations:, de-duplicated, so the report and the metamorphic.* targets are identical to having listed those relations by hand. The argument-naming relations (noop_filter, monotone, subset_under_filter, and the rest) stay explicit: they need a tool-specific argument, so no preset can guess them. The preset catalog is drawn from the metamorphic-testing survey (arXiv:2511.02108) reduced to the structurally-applicable, parameter-free relations.
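The expand-and-merge step might look something like the sketch below. The preset members are deliberately placeholder names (relation_a, relation_b, relation_c) because the real contents of universal and text are not reproduced here. The merge order (presets first, then explicit relations) is also an assumption.

```python
import json

# Hypothetical preset contents: placeholder names, NOT the real preset members.
PRESETS = {
    "universal": ["relation_a", "relation_b"],
    "text": ["relation_b", "relation_c"],
}


def expand(presets: list[str], explicit: list) -> list:
    """Expand presets, append explicit relations, and drop duplicates in order."""
    merged, seen = [], set()
    candidates = [rel for name in presets for rel in PRESETS[name]] + list(explicit)
    for relation in candidates:
        # Relations may be plain names or mappings with arguments, so use a
        # canonical JSON string as the de-duplication key.
        key = json.dumps(relation, sort_keys=True)
        if key not in seen:
            seen.add(key)
            merged.append(relation)
    return merged


relations = expand(
    ["universal", "text"],
    ["relation_a", {"monotone": {"arg": "limit"}}],
)
```

A relation named both by a preset and explicitly is evaluated once, which is why the resulting report matches a hand-written list.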

A worked example

The base call searches for anthropic with two filters. Four relations are asserted against it.

    tools:
      - name: search is order-insensitive and idempotent
        tool: search
        args: { query: "anthropic", filters: ["lang:en", "type:doc"] }
        metamorphic:
          relations:
            - idempotent
            - arg_order_insensitive
            - { noop_filter: { arg: filters, value: "*" } }
            - { monotone: { arg: limit } }

The runner makes the base call once, then makes one transformed call per relation: the same call again (idempotent), the call with filters reversed and query keys permuted (arg_order_insensitive), the call with a select-everything filter appended (noop_filter), and the call with limit raised (monotone). Each transformed result is compared against the base under the relation's property. Any deterministic relation that fails is a violation.
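To make the four transformed calls concrete, here is a sketch of the argument transforms applied to the base arguments from the example. The helper names are hypothetical. The starting limit of 10 and the doubling factor are assumptions, since the base call does not set limit and how far the runner raises it is not specified here.

```python
import copy

base = {"query": "anthropic", "filters": ["lang:en", "type:doc"]}


def same_call(args: dict) -> dict:
    """idempotent: repeat the call unchanged."""
    return copy.deepcopy(args)


def reorder(args: dict) -> dict:
    """arg_order_insensitive: reverse the key order and every list-valued argument."""
    out = {}
    for key in reversed(list(args)):
        value = args[key]
        out[key] = list(reversed(value)) if isinstance(value, list) else value
    return out


def append_noop_filter(args: dict, arg: str, value: str) -> dict:
    """noop_filter: append a select-everything value to the named list argument."""
    out = copy.deepcopy(args)
    out[arg] = out.get(arg, []) + [value]
    return out


def raise_limit(args: dict, arg: str, default: int = 10, factor: int = 2) -> dict:
    """monotone: raise the named numeric argument (default and factor are assumed)."""
    out = copy.deepcopy(args)
    out[arg] = out.get(arg, default) * factor
    return out


transformed = {
    "idempotent": same_call(base),
    "arg_order_insensitive": reorder(base),
    "noop_filter": append_noop_filter(base, "filters", "*"),
    "monotone": raise_limit(base, "limit"),
}
```

Each transform returns a new dictionary and leaves the base arguments untouched, so the single base result can be compared against every transformed result.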

The gate

The gate fails the test on any violation of a deterministic relation. A violation is a concrete, reproducible disagreement (the same query returned a different result set when nothing that should matter changed), so it is safe to gate on without a flake budget.

Assertable targets

The check exposes three targets. The names are exact.

| Target | Meaning |
| --- | --- |
| metamorphic.relations_checked | Count of relations evaluated. |
| metamorphic.violations | Count of relations that failed. |
| metamorphic.gate_passed | 1 when no deterministic relation failed, 0 otherwise. |
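How the three targets relate to each other can be sketched as below. The summarize function and its input shape (relation name mapped to pass or fail) are hypothetical. The sketch also assumes every relation in the input is deterministic, so any failure closes the gate.

```python
def summarize(results: dict[str, bool]) -> dict[str, int]:
    """Derive the three targets from per-relation outcomes (True means it held)."""
    violations = sum(1 for held in results.values() if not held)
    return {
        "metamorphic.relations_checked": len(results),
        "metamorphic.violations": violations,
        "metamorphic.gate_passed": 1 if violations == 0 else 0,
    }


clean = summarize({"idempotent": True, "arg_order_insensitive": True})
broken = summarize({"idempotent": True, "arg_order_insensitive": False})
```

In this sketch gate_passed is derived entirely from violations, so asserting a maximum of 0 on metamorphic.violations is equivalent to the default gate.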

Write an explicit expect: to assert a target directly, or omit it to get the default gate (fail on any deterministic violation).

    metamorphic:
      relations: [idempotent, arg_order_insensitive]
      expect:
        - target: metamorphic.violations
          matcher: { schema: { maximum: 0 } }

The optional model-assisted relation

One relation, paraphrase_invariant, needs a model to generate a benign paraphrase of the input (a classification must not change when the prompt is reworded). It follows the same rule as the narrative-trace check:

What it does not do

Metamorphic testing checks that a tool is self-consistent, not that it is good. A search that returns confidently wrong results for every query is consistent under every relation here and passes them all. So this is a floor, not a ceiling. Pair it with a few golden-output tests for the inputs you do know the answer to, and with llm-judge for subjective quality. What it buys you is the long tail: the thousands of inputs you will never write a golden answer for, gated deterministically and for free.