mcptest docs GitHub

Concepts

This page is the mental model for mcptest. Read it once and the rest of the docs make sense. You will not learn every flag here; you will learn what each flag is for, where it lives, and which other page to open when you need depth.

If you have not run mcptest yet, skim getting-started.md first. It walks the hello-world path. Come back here when you start asking "but why does it work that way?"

The contributor rules in AGENTS.md cover code style, licensing, and the check gate. This page covers product surface area.

1. What mcptest is

mcptest is a CI-grade test runner for MCP servers. You point it at a Model Context Protocol server, hand it a YAML file describing how the server should behave, and it talks to the server, asserts on the responses, and emits a report.

The framing matters. mcptest is not a debugger, not a fuzzer, and not a load tester. It is the same shape of tool you reach for when you write unit tests for a library or integration tests for an API. The thing under test happens to speak MCP, so the runner knows how to initialize the protocol, list tools, call them, watch headers, and grade the responses against a schema.

Three properties define the product:

A test run boils down to four phases:

  1. Load. Read the YAML, merge env vars, resolve variables, validate the file against schemas/v1.json.
  2. Connect. Resolve the server target, open a transport (stdio subprocess or HTTP), perform the MCP initialize handshake.
  3. Execute. Walk each test, issue the call, capture the response, run matchers, record performance numbers.
  4. Report. Aggregate pass/fail, render reporters, emit cassettes if in record mode, exit with the right code.

Every concept on this page lives inside one of those phases. When something behaves unexpectedly, ask which phase you are in first.

2. Test anatomy

A test file is a single YAML document with three load-bearing pieces:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  local:
    command: ["./target/debug/my-mcp-server"]

variables:
  workspace_root:
    value: "/tmp/sandbox"

tools:
  - name: "list_directory returns at least one entry"
    server: local
    tool: list_directory
    args:
      path: "${workspace_root}"
    expect:
      assertions:
        - target: result.content
          matcher:
            schema:
              type: array
              minItems: 1
      max_duration_ms: 250

That structure repeats across every test file you will write. The servers block declares targets, the variables block declares substitutions, and the tools array (or compliance, or evals) declares the actual tests.

Inside each test the shape is consistent:

The full grammar lives in the YAML reference. This section is the overview; the reference has every field.

Tools, resources, prompts

MCP exposes three primitive surfaces:

mcptest tests each primitive with the same structural pattern. The only thing that changes is the key under the test (tool:, resource:, prompt:) and the shape of the response envelope you assert on. The YAML reference shows the exact fields per primitive.

Compliance tests

compliance: is a separate top-level array of protocol-level checks that do not look like normal calls. They probe the MCP handshake, the tools/list shape, the error model, transport-specific headers, and so on. They exist because protocol conformance is its own concern: a server can return the right answer for the right tool and still send a malformed initialize response. See section 7 for the corpus.

Evals

evals: runs a model-graded check against the response. The matcher calls an LLM with a rubric and records the verdict. Evals are the only kind of test that is non-deterministic by design. The cache marks them ineligible automatically (section 6).

3. Matchers

A matcher is the function that decides whether a response is acceptable. Every assertion uses exactly one. The six below carry the bulk of real suites, listed here in rough order of strictness. The deterministic set also includes subset, the string matchers (contains-all, contains-any, icontains, starts-with, levenshtein), is-json, is-valid-tools-call, and the not combinator that inverts any other matcher. The model-graded llm-jury matcher is covered in section 13 alongside llm-judge. The YAML reference documents every option.

exact

Equality check. The captured value must equal the literal you specify, structure and all.

matcher:
  exact: { ok: true, count: 3 }

Use this when the response is small, fully known, and you want any drift to fail loudly.

schema

JSON Schema validation. The captured value must validate against the supplied schema. Supports the same drafts the jsonschema crate supports.

matcher:
  schema:
    type: array
    minItems: 1
    items:
      type: object
      required: [name, path]

Use this for responses whose shape is fixed but whose contents vary. Most production tests are schema matchers.

regex

Regular-expression match against a string. The captured value is coerced to a string with serde_json::to_string if it is not already one, and the regex must match anywhere in it (or fully, if anchored).

matcher:
  regex: "^Hello, [A-Za-z]+!$"

Useful for free-form text output where you care about the format but not the exact wording.

contains

Structural containment. For objects, the captured value must contain every key-value pair (recursively, extra keys ignored). For arrays, every expected element must have a distinct match (multiset, any order). Scalars fall back to deep equality. This is not a string substring check; for that use icontains, contains-all, or regex.

matcher:
  contains:
    metadata:
      project: mcptest

Use this when the response has extra fields you do not care about and you only want to assert on a subset.

snapshot

Snapshot match against a stored value. The matcher value is a key. On the first run mcptest writes the captured value into <suite_dir>/snapshots/<key>.json. On every later run the captured value must deep-equal the stored one. Run with --update-snapshots (short form -u) to re-record after an intentional change.

matcher:
  snapshot: "lists-tools-content"

Use this for large structured responses where listing every field would be tedious and equality is what you want.

llm-judge

Model-graded match. mcptest sends the response, the prompt, and a rubric to an LLM and asks for a verdict plus reasoning. Records both, fails the test if the verdict is fail or the score is below the threshold.

matcher:
  llm-judge:
    rubric: "The response is a polite refusal that mentions the policy."
    model: anthropic/claude-opus-4-7
    pass_threshold: 0.7

Use this for evaluations where there is no fixed correct answer but there is a quality bar. Evals are not cacheable (section 6) and are billed by the upstream model provider, so the cost cap (section 12) applies.

4. Performance budgets

Most tests express "what" the response should be. Performance budgets express "how fast" or "how cheap" the response should be. They live in the same expect: array as matchers and they fail the test the same way.

The three budgets ship today:

expect:
  - target: result.content
    matcher:
      schema: { type: array }
  - max_duration_ms: 250
  - max_response_tokens: 800
  - response_headers:
      x-served-by: "mcp-edge"
  - response_headers_absent:
      - server
      - x-powered-by

Why these exist:

Performance budgets are intentionally a separate concept from matchers. A latency or token failure means "the response was right but too expensive," and reporters render those failures in a different bucket so triage stays sane.

Performance tests are not cacheable (section 6). Reading a cached duration would be a lie about the current state of the server.

5. Cassettes

A cassette is a recorded MCP conversation: every request, every response, every header. mcptest reads cassettes back in replay mode instead of talking to the live server. The same test, the same matchers, with the network swapped out for a JSON file.

The model is "record once, replay many times":

# First run: record against a live server. Cassettes land under
# <suite_dir>/cassettes/.
mcptest run --record

# Every later run: replay the matching cassette, no live server and no
# flag needed.
mcptest run

Cassettes solve three problems:

The catch: cassettes capture exactly what the server returned, and real servers return timestamps, UUIDs, and request ids that change every call. If you replay a raw recording the matchers will fail because today's timestamp is not yesterday's.

mcptest fixes this with determinism normalization. Before storing a response in a cassette and before comparing a live response against a cassette, the runner rewrites known-volatile fields to canonical placeholders. The defaults cover:

You can add your own normalization rules in the YAML under cassette.normalize:. See the YAML reference for the schema. The broad rule: anything that is not load-bearing for the assertion should be normalized away.

Two things cassettes deliberately do not capture: wall-clock duration and token counts. Both are properties of the live system, not the response, so performance budgets are skipped in replay mode (the matcher engine surfaces this as skipped: replay). If you need to assert on latency you have to run against the live server.

6. Cache

The cache is the layer above cassettes. A cassette is a recording you explicitly produce and replay. The cache is automatic: when a test is provably deterministic, mcptest stores its outcome and skips the re-run on later invocations until inputs change.

The eligibility engine (mcptest-core::cache::eligibility) decides per test whether the result can be cached at all. The full ruleset is in cache-eligibility.md; the summary:

Cacheable by default

Not cacheable by default

Never cacheable, regardless of test type

Author overrides

The mental model: the cache only kicks in when the runner can prove that today's result will equal yesterday's. Whenever the proof breaks down, the cache steps out of the way and the live test runs.

7. Compliance corpus

Compliance is the protocol-conformance layer. The corpus is a bundle of YAML rule definitions under compliance/, with registry.yml as the metadata source of truth. Each rule has a stable ID in the PREFIX-NNN form (for example PROTO-001). The prefixes group rules by area:

PrefixWhat it covers
PROTOinitialize handshake, version negotiation, JSON-RPC protocol.
SCHEMAFrame shape, id rules, error envelope conformance.
SEQSequencing: initialize before other calls, no calls after shutdown.
TOOLtools/list, tools/call, schema advertised vs returned.
RESresources/list, resources/read, URI handling.
PROMPTprompts/list, prompts/get, argument substitution.
AUTHOAuth flows, bearer header shape, refresh on 401.
TRANSPORTstdio framing vs HTTP status codes and chunked transfer.
HEADERRequired and forbidden HTTP headers per spec.
REPLAYCassette and replay determinism.
EDGEEmpty arrays, oversized payloads, Unicode names.

Newer protocol features add their own prefixes (ELICIT, COMPL, RESTPL, PROG, ROOTS, SAMPLE, and COMPAT), each gated on the matching server capability so the rule only runs when the server advertises the feature. The corpus README in compliance/README.md documents how a rule is authored.

Each rule carries an RFC 2119 severity:

Run the suite with mcptest compliance. The companion compliance baseline lets you accept known failures without losing the green build.

Rule IDs are stable. The naming is locked so a rule called PROTO-007 today means the same check next year. Numbers are never reused; a retired rule keeps its ID with a deprecation stamp.

8. Reporters

A reporter turns a finished run into output. --reporter picks the format and --output picks the sink: a file path, or stdout when omitted or set to -. Every format works on both surfaces, so you can render directly from a run or re-render a saved record later with mcptest report --format <FORMAT>.

# Render any format straight from a run, to stdout or a file.
mcptest run --reporter junit --output reports/results.xml
mcptest run --reporter tap            # TAP to stdout

# Or run once, save the JSON record, then re-render any format from it.
mcptest run --reporter json --output reports/run.json
mcptest report reports/run.json --format sarif --output reports/results.sarif

The report formats all read from the same run record, so they cannot drift from each other. If a test passes in one format and fails in another you have hit a bug, file it.

GitHub Actions annotations

--annotations composes inline GitHub Actions annotations on top of any format: the report goes to its chosen sink, and one ::error/::warning workflow command per failure goes to stderr, which Actions turns into inline annotations on the PR diff. auto (the default) emits only inside Actions, always forces them, never disables them.

mcptest run --reporter junit --output results.xml --annotations always

9. Server target sources

A run needs to know which MCP server to talk to. mcptest resolves the target from a chain of sources, with the higher-listed source winning when more than one defines the same server. The precedence:

  1. CLI flags. --server name=command... or --server name=url. Direct flags always win; this is the lever you use in CI to override a checked-in test file without editing it.
  2. --server-config <file>. A YAML file you point at with the flag. Overrides everything below.
  3. mcptest.local.yml in the working directory. This file is gitignored by convention and holds developer-local targets (different ports, alternate auth tokens).
  4. mcptest.yml in the working directory. The committed config.
  5. servers: block in the test file itself. The fallback. Useful for single-file examples and small repos.

The same precedence applies per server name. If mcptest.yml declares local as a subprocess server and you pass --server local=https://staging.example.com, the CLI flag wins for that name; everything else still loads from the file.

Run mcptest doctor to print the resolved target for each server along with the source it came from. When something is talking to the wrong place, doctor will tell you why.

10. Variable resolution

Tests interpolate ${variable_name} in their YAML. The interpolation engine resolves names against a stack of sources, again with the higher-listed source winning:

  1. --var KEY=VALUE on the CLI. Repeatable.
  2. --env-file <path> on the CLI. Repeatable; later files override earlier ones.
  3. OS environment variables present in the process.
  4. .env.local in the working directory.
  5. .env.test in the working directory.
  6. .env in the working directory.
  7. variables: block in the test file.

Lookup walks the list top-down for each ${name}. The first source that defines the name wins; the rest are ignored for that name.

The split between --env-file and .env.local is intentional. --env-file is for explicit CI loads ("this pipeline uses these env vars"). .env.local is for the developer's machine. .env.test and .env are committed defaults.

Variables can reference each other within the same source as long as the reference is fully resolvable at load time; circular references fail the run with a typed error.

Secrets resolve through the same engine. A field with the suffix *_Env: API_KEY reads the value of the env variable API_KEY through the same precedence stack. The runner redacts the resolved value from logs and reporters by default; see security/redaction.md.

11. Auth model

mcptest knows three ways to authenticate to an MCP server.

On a 401 response the runner does a single refresh-on-401 retry: for OAuth it asks the token endpoint for a new access token with the stored refresh token; for bearer-token mode it re-reads the env variable in case the user rotated it mid-run; for custom headers it re-resolves the *-Env values. The retry happens once per request. If the refreshed call also fails with 401 the test fails with a clear "auth still failed after refresh" error. See auth-refresh.md for the state machine.

Auth state is per-server, not per-test. Tests against the same server within a single run share the same access token (and the same refresh, if it triggers).

12. Exit codes

mcptest exits with a small fixed set of codes. CI systems read them to decide whether to fail the build and which classification to apply. The full table:

CodeMeaning
0All tests passed (or the command did what it was asked).
1One or more tests failed (matcher mismatch, eval verdict fail, compliance regression).
2Config error or invalid arguments (YAML did not load, schema validation failed, unknown flag).
5Cost cap exceeded (--max-cost on eval), or --update-snapshots refused under CI=true.
6Coverage below --coverage-threshold, or a model-compat DRIFT.
7No tests selected (empty suite, or --filter/--shard/--last-failed matched nothing) and --pass-with-no-tests was not set.

Codes outside this set are reserved. Codes 3 and 4 are not wired in the v1.0 binary, so do not write CI that special-cases them; a transport or auth failure surfaces as exit 1 with a diagnostic message. The full table, with the subcommand that returns each code, is in the CLI reference.

The CI integration guide (ci-integration.md) shows how each major CI provider surfaces these codes.

13. Agent end-to-end tests

The tool / resource / prompt test types check the protocol surface ("given args, expect this response shape"). The agent test type goes further: it points a real model at one or more MCP servers, drives the conversation, and asserts against the resulting trace. That covers the loop a real user hits, not just the wire format.

A minimal agent test:

servers:
  weather:
    command: ["./weather-server"]

agents:
  - name: weather query routes to get_weather
    model: claude-sonnet-4-5
    servers: [weather]
    prompt: What is the weather in Sacramento?
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
      - target: tool_calls[0].args.city
        matcher: { regex: "(?i)sacramento" }
      - target: final_response
        matcher: { contains: Sacramento }
      - target: conversation.tokens.total
        matcher: { regex: "^[0-9]+$" }

The matcher target grammar resolves against the conversation trace:

The driver walks the conversation: it lists tools on every server in servers:, sends the prompt to the model with the merged tool catalog attached, dispatches each tool the model calls back to the owning server, and stops when the model returns plain text or the max_turns: cap trips.

Three flavors of agent test

  1. Single model, single server (the example above): deterministic sanity check that the model still picks the right tool.
  2. Model matrix via models:: one test, many models, pass / fail per model in the report. The use case is "when a new Claude or GPT or Gemini drops, do my existing tests still pass?"

    agents:
      - name: weather query
        models:
          - claude-sonnet-4-5
          - gpt-5
          - gemini-2.5-pro
        servers: [weather]
        prompt: What is the weather in Sacramento?
        expect:
          - target: tool_calls[0].name
            matcher: { exact: get_weather }

    Each (test, model) pair becomes its own row in the report: mcptest::weather query [claude-sonnet-4-5] PASS. The reporter shows which assertion broke for which model.

  3. Multi-server agent: one model, tools from every named server. Useful when the real workflow spans multiple servers (issues + notifications, search + crawl, calendar + email).

    agents:
      - name: open issue and notify oncall
        model: claude-sonnet-4-5
        servers: [issues, notifications]
        prompt: Open a P1 issue for the failing CI run and ping #oncall.
        expect:
          - target: tool_calls[0].server
            matcher: { exact: issues }
          - target: tool_calls[1].server
            matcher: { exact: notifications }

Providers and credentials

Auto-detected families:

FamilyDetected when model starts withEnv var
Anthropicclaude-ANTHROPIC_API_KEY
OpenAIgpt-, chatgpt-, o<digit>, text-, davinci-OPENAI_API_KEY (+ optional OPENAI_ORG_ID)
Googlegemini-, models/gemini-GEMINI_API_KEY (or GOOGLE_API_KEY)
Mistralmistral-, codestral-, magistral-, ministral-, devstral-, pixtral-, open-mistral-, open-mixtral-MISTRAL_API_KEY

For Azure OpenAI, OpenRouter, vLLM, llama.cpp, LiteLLM, Together, Groq, Anyscale, Fireworks, or any other OpenAI-API-compatible endpoint, declare a named provider under top-level providers: and reference it from models::

providers:
  openrouter:
    type: openai
    base_url: https://openrouter.ai/api/v1
    api_key_env: OPENROUTER_API_KEY

agents:
  - name: my test
    models:
      - { provider: openrouter, id: anthropic/claude-3.5-sonnet }
    ...

When the env var for a model is missing, the run falls through to a deterministic stub provider so CI stays green; the reporter logs which provider was actually used so you can spot a partial matrix.

Cassettes and CI

Each (agent, model) (and (agent, provider, model) for named providers) gets its own cassette at cassettes/<agent_slug>__[<provider>__]<model_slug>.json. The workflow:

# Once, with whatever keys you have set:
mcptest run --record

# Every subsequent run replays the cassettes; no API key needed,
# no network calls to the provider, deterministic per-test cost
# of zero cents.
mcptest run

A stale cassette (the model id or prompt changed in YAML) surfaces as a cassette stale on field <field> error before any matchers run, so you cannot silently pass against an out-of-date recording.

Two extra matchers: llm-judge and llm-jury

Sometimes the assertion you want is "did the model actually answer the question, in a way that stays grounded in the tool result?" That is not expressible as a regex. Two matchers fill the gap:

Worked example: examples/agent-llm-judge.yml.

Spend caps

The top-level budget: block caps the dollars an agent run can spend per test and per suite. Both fields are USD cents, both default to unbounded.

budget:
  per_test_usd_cents: 50
  per_suite_usd_cents: 500

The runner stops the agent loop and surfaces a clear error when a cap trips, before the provider issues a surprise bill.

Further reading

Full docs in docs/models.md. Worked examples under examples/: agent-weather.yml (singleton), agent-matrix.yml (matrix), agent-custom-providers.yml (named providers), agent-llm-judge.yml (judge / jury), agent-issues-and-notifications.yml (multi-server).

14. The seven failure layers for URL targets

When a server target is an HTTP URL, opening the transport can fail in seven distinct places. mcptest reports the specific layer that failed so you do not have to guess. The layers, in the order the runner walks them:

  1. DNS. Hostname resolution failed (NXDOMAIN, timeout, etc.). Symptom: error: failed to resolve <host>.
  2. TCP. DNS resolved but the connection was refused or timed out. Symptom: error: tcp connect failed.
  3. TLS. TCP connected but the TLS handshake failed (expired cert, name mismatch, unsupported protocol). Symptom: error: tls handshake failed. mcptest uses rustls; native-tls is not available.
  4. HTTP. TLS finished but the HTTP request returned an unexpected status before MCP semantics applied (502 from a gateway, 404 at the wrong path). Symptom: error: http <status> from <url>.
  5. Authentication. HTTP succeeded but the server returned 401 or 403 and refresh-on-401 did not recover. Symptom: error: authentication failed after refresh.
  6. MCP initialize. Auth succeeded but the MCP initialize exchange failed (missing fields, version mismatch, protocol error). Symptom: error: mcp initialize rejected: <reason>.
  7. Readiness. Initialize succeeded but the server reported ready: false or did not advertise the capabilities the test requires. Symptom: error: server initialized but reported not ready.

Each layer maps to a different fix. DNS is your hostname; TCP is your firewall or process; TLS is your cert; HTTP is your gateway; auth is your credentials; initialize is your MCP implementation; readiness is your startup sequence.

mcptest doctor --url <url> walks the same layers in order and prints a one-line result per layer, so you can see how far you got without running an actual test.

Subprocess (stdio) servers fail in fewer places: spawn, stdio framing, MCP initialize, readiness. Same idea, four layers instead of seven.

15. Where to next?

This page is the overview. The detail is split across these pages.

If you finish this page and still feel uncertain about which matcher to use, which reporter to wire up, or which exit code your CI should special-case, that is a docs gap. File an issue at https://github.com/soapbucket/mcptest/issues, link this page, and we will fill it in.