Concepts
This page is the mental model for mcptest. Read it once and the rest of the docs make sense. You will not learn every flag here; you will learn what each flag is for, where it lives, and which other page to open when you need depth.
If you have not run mcptest yet, skim getting-started.md first. It walks the hello-world path. Come back here when you start asking "but why does it work that way?"
The contributor rules in AGENTS.md cover code style, licensing, and the check gate. This page covers product surface area.
1. What mcptest is
mcptest is a CI-grade test runner for MCP servers. You point it at a Model Context Protocol server, hand it a YAML file describing how the server should behave, and it talks to the server, asserts on the responses, and emits a report.
The framing matters. mcptest is not a debugger, not a fuzzer, and not a load tester. It is the same shape of tool you reach for when you write unit tests for a library or integration tests for an API. The thing under test happens to speak MCP, so the runner knows how to initialize the protocol, list tools, call them, watch headers, and grade the responses against a schema.
Three properties define the product:
- Deterministic by default. A passing run today should pass tomorrow with the same inputs. Cassettes, normalization, and the cache eligibility engine exist to make that true even when the underlying server is not perfectly deterministic.
- CI-friendly. Exit codes are stable, reporters cover the formats your CI already understands (JUnit, SARIF, GitLab, JSON), and a single run produces machine-readable evidence you can store as build artifacts.
- Single-developer scope. Everything you need on your own machine ships in the binary.
A test run boils down to four phases:
- Load. Read the YAML, merge env vars, resolve variables, validate the file against
schemas/v1.json. - Connect. Resolve the server target, open a transport (stdio subprocess or HTTP), perform the MCP
initializehandshake. - Execute. Walk each test, issue the call, capture the response, run matchers, record performance numbers.
- Report. Aggregate pass/fail, render reporters, emit cassettes if in record mode, exit with the right code.
Every concept on this page lives inside one of those phases. When something behaves unexpectedly, ask which phase you are in first.
2. Test anatomy
A test file is a single YAML document with three load-bearing pieces:
# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
local:
command: ["./target/debug/my-mcp-server"]
variables:
workspace_root:
value: "/tmp/sandbox"
tools:
- name: "list_directory returns at least one entry"
server: local
tool: list_directory
args:
path: "${workspace_root}"
expect:
assertions:
- target: result.content
matcher:
schema:
type: array
minItems: 1
max_duration_ms: 250
That structure repeats across every test file you will write. The servers block declares targets, the variables block declares substitutions, and the tools array (or compliance, or evals) declares the actual tests.
Inside each test the shape is consistent:
- A call: which server, which tool/resource/prompt, with which arguments.
- An expect block: one or more entries, each either a matcher against a JSON path target, a performance budget, or a header assertion.
The full grammar lives in the YAML reference. This section is the overview; the reference has every field.
Tools, resources, prompts
MCP exposes three primitive surfaces:
- Tools are functions the model can invoke. Each one has a name, a JSON Schema for arguments, and returns a result envelope.
- Resources are addressable content the model can read (a file, a URL, a database row).
- Prompts are server-side template snippets the model can ask for.
mcptest tests each primitive with the same structural pattern. The only thing that changes is the key under the test (tool:, resource:, prompt:) and the shape of the response envelope you assert on. The YAML reference shows the exact fields per primitive.
Compliance tests
compliance: is a separate top-level array of protocol-level checks that do not look like normal calls. They probe the MCP handshake, the tools/list shape, the error model, transport-specific headers, and so on. They exist because protocol conformance is its own concern: a server can return the right answer for the right tool and still send a malformed initialize response. See section 7 for the corpus.
Evals
evals: runs a model-graded check against the response. The matcher calls an LLM with a rubric and records the verdict. Evals are the only kind of test that is non-deterministic by design. The cache marks them ineligible automatically (section 6).
3. Matchers
A matcher is the function that decides whether a response is acceptable. Every assertion uses exactly one. The six below carry the bulk of real suites, listed here in rough order of strictness. The deterministic set also includes subset, the string matchers (contains-all, contains-any, icontains, starts-with, levenshtein), is-json, is-valid-tools-call, and the not combinator that inverts any other matcher. The model-graded llm-jury matcher is covered in section 13 alongside llm-judge. The YAML reference documents every option.
exact
Equality check. The captured value must equal the literal you specify, structure and all.
matcher:
exact: { ok: true, count: 3 }
Use this when the response is small, fully known, and you want any drift to fail loudly.
schema
JSON Schema validation. The captured value must validate against the supplied schema. Supports the same drafts the jsonschema crate supports.
matcher:
schema:
type: array
minItems: 1
items:
type: object
required: [name, path]
Use this for responses whose shape is fixed but whose contents vary. Most production tests are schema matchers.
regex
Regular-expression match against a string. The captured value is coerced to a string with serde_json::to_string if it is not already one, and the regex must match anywhere in it (or fully, if anchored).
matcher:
regex: "^Hello, [A-Za-z]+!$"
Useful for free-form text output where you care about the format but not the exact wording.
contains
Structural containment. For objects, the captured value must contain every key-value pair (recursively, extra keys ignored). For arrays, every expected element must have a distinct match (multiset, any order). Scalars fall back to deep equality. This is not a string substring check; for that use icontains, contains-all, or regex.
matcher:
contains:
metadata:
project: mcptest
Use this when the response has extra fields you do not care about and you only want to assert on a subset.
snapshot
Snapshot match against a stored value. The matcher value is a key. On the first run mcptest writes the captured value into <suite_dir>/snapshots/<key>.json. On every later run the captured value must deep-equal the stored one. Run with --update-snapshots (short form -u) to re-record after an intentional change.
matcher:
snapshot: "lists-tools-content"
Use this for large structured responses where listing every field would be tedious and equality is what you want.
llm-judge
Model-graded match. mcptest sends the response, the prompt, and a rubric to an LLM and asks for a verdict plus reasoning. Records both, fails the test if the verdict is fail or the score is below the threshold.
matcher:
llm-judge:
rubric: "The response is a polite refusal that mentions the policy."
model: anthropic/claude-opus-4-7
pass_threshold: 0.7
Use this for evaluations where there is no fixed correct answer but there is a quality bar. Evals are not cacheable (section 6) and are billed by the upstream model provider, so the cost cap (section 12) applies.
4. Performance budgets
Most tests express "what" the response should be. Performance budgets express "how fast" or "how cheap" the response should be. They live in the same expect: array as matchers and they fail the test the same way.
The three budgets ship today:
max_duration_mscaps the wall-clock time from request send to response receive. The runner records the exact duration on every test regardless; this is the threshold that turns red.max_response_tokenscaps the token count of the response payload. Counted with a fast heuristic tokenizer by default; pass a model id to use a model-specific tokenizer.response_headersasserts that the listed headers are present with the listed values.response_headers_absentasserts that the listed headers are missing.
expect:
- target: result.content
matcher:
schema: { type: array }
- max_duration_ms: 250
- max_response_tokens: 800
- response_headers:
x-served-by: "mcp-edge"
- response_headers_absent:
- server
- x-powered-by
Why these exist:
- Latency drift is a regression even when the response is correct. A tool that quietly takes ten seconds is broken, whether or not the output is right.
- Token bloat is a cost regression for downstream LLM workloads. An MCP tool that returns a 20 KB envelope every call is a tax on every agent that uses it.
- Header drift catches transport-level regressions: dropping a CORS header, adding
Server: nginxafter a proxy refactor, forgettingCache-Control.
Performance budgets are intentionally a separate concept from matchers. A latency or token failure means "the response was right but too expensive," and reporters render those failures in a different bucket so triage stays sane.
Performance tests are not cacheable (section 6). Reading a cached duration would be a lie about the current state of the server.
5. Cassettes
A cassette is a recorded MCP conversation: every request, every response, every header. mcptest reads cassettes back in replay mode instead of talking to the live server. The same test, the same matchers, with the network swapped out for a JSON file.
The model is "record once, replay many times":
# First run: record against a live server. Cassettes land under
# <suite_dir>/cassettes/.
mcptest run --record
# Every later run: replay the matching cassette, no live server and no
# flag needed.
mcptest run
Cassettes solve three problems:
- CI without infrastructure. Your test suite no longer requires the server to be running on the CI box. Record locally, commit the cassette, replay in CI.
- Reproducibility. A failure in CI can be reproduced byte-for-byte by checking out the same cassette and running the same matchers.
- Cost control for evals. Recording an LLM-graded test once and replaying it every CI run keeps your API bill flat.
The catch: cassettes capture exactly what the server returned, and real servers return timestamps, UUIDs, and request ids that change every call. If you replay a raw recording the matchers will fail because today's timestamp is not yesterday's.
mcptest fixes this with determinism normalization. Before storing a response in a cassette and before comparing a live response against a cassette, the runner rewrites known-volatile fields to canonical placeholders. The defaults cover:
- ISO 8601 timestamps to
<TIMESTAMP> - UUIDs (any version) to
<UUID> - Unix epoch seconds in numeric fields named
created_at,updated_atto<EPOCH> - Common request-id headers (
x-request-id,x-correlation-id) to<REQUEST_ID> - IP addresses and ephemeral ports in
host:portstrings to<HOST>:<PORT>
You can add your own normalization rules in the YAML under cassette.normalize:. See the YAML reference for the schema. The broad rule: anything that is not load-bearing for the assertion should be normalized away.
Two things cassettes deliberately do not capture: wall-clock duration and token counts. Both are properties of the live system, not the response, so performance budgets are skipped in replay mode (the matcher engine surfaces this as skipped: replay). If you need to assert on latency you have to run against the live server.
6. Cache
The cache is the layer above cassettes. A cassette is a recording you explicitly produce and replay. The cache is automatic: when a test is provably deterministic, mcptest stores its outcome and skips the re-run on later invocations until inputs change.
The eligibility engine (mcptest-core::cache::eligibility) decides per test whether the result can be cached at all. The full ruleset is in cache-eligibility.md; the summary:
Cacheable by default
exactmatchers.schemamatchers.containsmatchers.snapshotmatchers (the snapshot file is the cache).- Compliance tests (read-only protocol probes).
Not cacheable by default
- Performance tests (
max_duration_ms,max_response_tokens, header budgets). Cached durations would lie about the current state of the server. - Evals (
llm-judge). Sampling makes them non-deterministic by design.
Never cacheable, regardless of test type
- Any test with
effects: [external]. The test admits it changes the world; rerunning is the only safe behavior. - Any test that declares a
hooks:block. Hooks may have side effects the cache cannot see. - Tests against HTTP transports without an explicit
server_version:pin (compliance is exempt; see eligibility doc for the rationale).
Author overrides
cache: neveron a test pins it uncacheable, beats every other rule.cache: alwayson a test pins it cacheable, but cannot override the "never cacheable" exclusions above.
The mental model: the cache only kicks in when the runner can prove that today's result will equal yesterday's. Whenever the proof breaks down, the cache steps out of the way and the live test runs.
7. Compliance corpus
Compliance is the protocol-conformance layer. The corpus is a bundle of YAML rule definitions under compliance/, with registry.yml as the metadata source of truth. Each rule has a stable ID in the PREFIX-NNN form (for example PROTO-001). The prefixes group rules by area:
| Prefix | What it covers |
|---|---|
| PROTO | initialize handshake, version negotiation, JSON-RPC protocol. |
| SCHEMA | Frame shape, id rules, error envelope conformance. |
| SEQ | Sequencing: initialize before other calls, no calls after shutdown. |
| TOOL | tools/list, tools/call, schema advertised vs returned. |
| RES | resources/list, resources/read, URI handling. |
| PROMPT | prompts/list, prompts/get, argument substitution. |
| AUTH | OAuth flows, bearer header shape, refresh on 401. |
| TRANSPORT | stdio framing vs HTTP status codes and chunked transfer. |
| HEADER | Required and forbidden HTTP headers per spec. |
| REPLAY | Cassette and replay determinism. |
| EDGE | Empty arrays, oversized payloads, Unicode names. |
Newer protocol features add their own prefixes (ELICIT, COMPL, RESTPL, PROG, ROOTS, SAMPLE, and COMPAT), each gated on the matching server capability so the rule only runs when the server advertises the feature. The corpus README in compliance/README.md documents how a rule is authored.
Each rule carries an RFC 2119 severity:
- MUST: the server is non-conforming if the rule fails.
- SHOULD: a warning by default.
--strictpromotes SHOULD failures to exit-code failures. - MAY: informational. A failure is recorded but never flips the exit code.
Run the suite with mcptest compliance. The companion compliance baseline lets you accept known failures without losing the green build.
Rule IDs are stable. The naming is locked so a rule called PROTO-007 today means the same check next year. Numbers are never reused; a retired rule keeps its ID with a deprecation stamp.
8. Reporters
A reporter turns a finished run into output. --reporter picks the format and --output picks the sink: a file path, or stdout when omitted or set to -. Every format works on both surfaces, so you can render directly from a run or re-render a saved record later with mcptest report --format <FORMAT>.
pretty(default). ANSI-colored terminal output, grouped by file. Failure summaries print at the bottom with file:line links.jsonwrites the full run record (tests, matchers, durations, errors). The shape is documented underschemas/run-record.v1.json, and ajsonfile is the canonical recordmcptest reportre-renders.junitwrites JUnit XML that GitHub Actions, GitLab, Jenkins, CircleCI, and most other CI systems render as a test report tab.mdwrites a Markdown summary suitable for a GitHub PR comment or a job summary.htmlwrites a self-contained HTML report with collapsible sections per test, useful as a build artifact.sarifwrites a SARIF 2.1.0 file. GitHub Code Scanning ingests this directly and annotates the offending files in the PR diff.gitlabwrites GitLab's native Code Quality JSON, which the merge request view renders inline.ndjsonwrites one JSON record per line (atestrecord per result, then asummary). Stream it into a log pipeline orjq -c.tapwrites Test Anything Protocol v14 forprove/tappy-style consumers.quietprints nothing; only the exit code matters. (Run-only.)uploadposts the run record to a configured collector. Report-only; OSS users will rarely use this.
# Render any format straight from a run, to stdout or a file.
mcptest run --reporter junit --output reports/results.xml
mcptest run --reporter tap # TAP to stdout
# Or run once, save the JSON record, then re-render any format from it.
mcptest run --reporter json --output reports/run.json
mcptest report reports/run.json --format sarif --output reports/results.sarif
The report formats all read from the same run record, so they cannot drift from each other. If a test passes in one format and fails in another you have hit a bug, file it.
GitHub Actions annotations
--annotations composes inline GitHub Actions annotations on top of any format: the report goes to its chosen sink, and one ::error/::warning workflow command per failure goes to stderr, which Actions turns into inline annotations on the PR diff. auto (the default) emits only inside Actions, always forces them, never disables them.
mcptest run --reporter junit --output results.xml --annotations always
9. Server target sources
A run needs to know which MCP server to talk to. mcptest resolves the target from a chain of sources, with the higher-listed source winning when more than one defines the same server. The precedence:
- CLI flags.
--server name=command...or--server name=url. Direct flags always win; this is the lever you use in CI to override a checked-in test file without editing it. --server-config <file>. A YAML file you point at with the flag. Overrides everything below.mcptest.local.ymlin the working directory. This file is gitignored by convention and holds developer-local targets (different ports, alternate auth tokens).mcptest.ymlin the working directory. The committed config.servers:block in the test file itself. The fallback. Useful for single-file examples and small repos.
The same precedence applies per server name. If mcptest.yml declares local as a subprocess server and you pass --server local=https://staging.example.com, the CLI flag wins for that name; everything else still loads from the file.
Run mcptest doctor to print the resolved target for each server along with the source it came from. When something is talking to the wrong place, doctor will tell you why.
10. Variable resolution
Tests interpolate ${variable_name} in their YAML. The interpolation engine resolves names against a stack of sources, again with the higher-listed source winning:
--var KEY=VALUEon the CLI. Repeatable.--env-file <path>on the CLI. Repeatable; later files override earlier ones.- OS environment variables present in the process.
.env.localin the working directory..env.testin the working directory..envin the working directory.variables:block in the test file.
Lookup walks the list top-down for each ${name}. The first source that defines the name wins; the rest are ignored for that name.
The split between --env-file and .env.local is intentional. --env-file is for explicit CI loads ("this pipeline uses these env vars"). .env.local is for the developer's machine. .env.test and .env are committed defaults.
Variables can reference each other within the same source as long as the reference is fully resolvable at load time; circular references fail the run with a typed error.
Secrets resolve through the same engine. A field with the suffix *_Env: API_KEY reads the value of the env variable API_KEY through the same precedence stack. The runner redacts the resolved value from logs and reporters by default; see security/redaction.md.
11. Auth model
mcptest knows three ways to authenticate to an MCP server.
bearer_token_env: SOMETHINGreads$SOMETHINGfrom the variable stack and sends it asAuthorization: Bearer <value>on every request. Simplest case; covers static tokens, PATs, and most internal-tools auth.- OAuth 2.1 with PKCE. Declare
auth: { oauth: { ... } }on the server block with issuer, client id, scopes, and redirect URI. mcptest performs the full PKCE dance on first run, stores the resulting refresh token under~/.mcptest/auth/<server>/, and hands the access token to every subsequent request. - Custom headers with the
*-Envsuffix. Insideauth.headers:you can listX-API-Key-Env: MY_KEY, and the runner replaces the value at request time with$MY_KEYfrom the variable stack. Anything with the-Envsuffix is treated as a secret and redacted in logs.
On a 401 response the runner does a single refresh-on-401 retry: for OAuth it asks the token endpoint for a new access token with the stored refresh token; for bearer-token mode it re-reads the env variable in case the user rotated it mid-run; for custom headers it re-resolves the *-Env values. The retry happens once per request. If the refreshed call also fails with 401 the test fails with a clear "auth still failed after refresh" error. See auth-refresh.md for the state machine.
Auth state is per-server, not per-test. Tests against the same server within a single run share the same access token (and the same refresh, if it triggers).
12. Exit codes
mcptest exits with a small fixed set of codes. CI systems read them to decide whether to fail the build and which classification to apply. The full table:
| Code | Meaning |
|---|---|
| 0 | All tests passed (or the command did what it was asked). |
| 1 | One or more tests failed (matcher mismatch, eval verdict fail, compliance regression). |
| 2 | Config error or invalid arguments (YAML did not load, schema validation failed, unknown flag). |
| 5 | Cost cap exceeded (--max-cost on eval), or --update-snapshots refused under CI=true. |
| 6 | Coverage below --coverage-threshold, or a model-compat DRIFT. |
| 7 | No tests selected (empty suite, or --filter/--shard/--last-failed matched nothing) and --pass-with-no-tests was not set. |
Codes outside this set are reserved. Codes 3 and 4 are not wired in the v1.0 binary, so do not write CI that special-cases them; a transport or auth failure surfaces as exit 1 with a diagnostic message. The full table, with the subcommand that returns each code, is in the CLI reference.
The CI integration guide (ci-integration.md) shows how each major CI provider surfaces these codes.
13. Agent end-to-end tests
The tool / resource / prompt test types check the protocol surface ("given args, expect this response shape"). The agent test type goes further: it points a real model at one or more MCP servers, drives the conversation, and asserts against the resulting trace. That covers the loop a real user hits, not just the wire format.
A minimal agent test:
servers:
weather:
command: ["./weather-server"]
agents:
- name: weather query routes to get_weather
model: claude-sonnet-4-5
servers: [weather]
prompt: What is the weather in Sacramento?
expect:
- target: tool_calls[0].name
matcher: { exact: get_weather }
- target: tool_calls[0].args.city
matcher: { regex: "(?i)sacramento" }
- target: final_response
matcher: { contains: Sacramento }
- target: conversation.tokens.total
matcher: { regex: "^[0-9]+$" }
The matcher target grammar resolves against the conversation trace:
tool_calls[i].nameandtool_calls[i].args.<path>- which tool the model picked and what it passed.tool_calls[i].server- which MCP server handled the call (set even on single-server runs).tool_results[i].isError,tool_results[i].content[0].text, etc - the verbatim JSON the server returned.final_response- the model's last plain-text reply.conversation.tokens.total,conversation.tokens.prompt,conversation.tokens.completion- cumulative usage from the provider.conversation.duration_ms,conversation.message_count- run telemetry.
The driver walks the conversation: it lists tools on every server in servers:, sends the prompt to the model with the merged tool catalog attached, dispatches each tool the model calls back to the owning server, and stops when the model returns plain text or the max_turns: cap trips.
Three flavors of agent test
- Single model, single server (the example above): deterministic sanity check that the model still picks the right tool.
Model matrix via
models:: one test, many models, pass / fail per model in the report. The use case is "when a new Claude or GPT or Gemini drops, do my existing tests still pass?"agents: - name: weather query models: - claude-sonnet-4-5 - gpt-5 - gemini-2.5-pro servers: [weather] prompt: What is the weather in Sacramento? expect: - target: tool_calls[0].name matcher: { exact: get_weather }Each
(test, model)pair becomes its own row in the report:mcptest::weather query [claude-sonnet-4-5] PASS. The reporter shows which assertion broke for which model.Multi-server agent: one model, tools from every named server. Useful when the real workflow spans multiple servers (issues + notifications, search + crawl, calendar + email).
agents: - name: open issue and notify oncall model: claude-sonnet-4-5 servers: [issues, notifications] prompt: Open a P1 issue for the failing CI run and ping #oncall. expect: - target: tool_calls[0].server matcher: { exact: issues } - target: tool_calls[1].server matcher: { exact: notifications }
Providers and credentials
Auto-detected families:
| Family | Detected when model starts with | Env var |
|---|---|---|
| Anthropic | claude- | ANTHROPIC_API_KEY |
| OpenAI | gpt-, chatgpt-, o<digit>, text-, davinci- | OPENAI_API_KEY (+ optional OPENAI_ORG_ID) |
gemini-, models/gemini- | GEMINI_API_KEY (or GOOGLE_API_KEY) | |
| Mistral | mistral-, codestral-, magistral-, ministral-, devstral-, pixtral-, open-mistral-, open-mixtral- | MISTRAL_API_KEY |
For Azure OpenAI, OpenRouter, vLLM, llama.cpp, LiteLLM, Together, Groq, Anyscale, Fireworks, or any other OpenAI-API-compatible endpoint, declare a named provider under top-level providers: and reference it from models::
providers:
openrouter:
type: openai
base_url: https://openrouter.ai/api/v1
api_key_env: OPENROUTER_API_KEY
agents:
- name: my test
models:
- { provider: openrouter, id: anthropic/claude-3.5-sonnet }
...
When the env var for a model is missing, the run falls through to a deterministic stub provider so CI stays green; the reporter logs which provider was actually used so you can spot a partial matrix.
Cassettes and CI
Each (agent, model) (and (agent, provider, model) for named providers) gets its own cassette at cassettes/<agent_slug>__[<provider>__]<model_slug>.json. The workflow:
# Once, with whatever keys you have set:
mcptest run --record
# Every subsequent run replays the cassettes; no API key needed,
# no network calls to the provider, deterministic per-test cost
# of zero cents.
mcptest run
A stale cassette (the model id or prompt changed in YAML) surfaces as a cassette stale on field <field> error before any matchers run, so you cannot silently pass against an out-of-date recording.
Two extra matchers: llm-judge and llm-jury
Sometimes the assertion you want is "did the model actually answer the question, in a way that stays grounded in the tool result?" That is not expressible as a regex. Two matchers fill the gap:
llm-judgeruns a separate model over the candidate text with a rubric and a pass threshold.llm-juryruns N independent judges with a quorum requirement, plus an inter-juror agreement metric the reporter surfaces.
Worked example: examples/agent-llm-judge.yml.
Spend caps
The top-level budget: block caps the dollars an agent run can spend per test and per suite. Both fields are USD cents, both default to unbounded.
budget:
per_test_usd_cents: 50
per_suite_usd_cents: 500
The runner stops the agent loop and surfaces a clear error when a cap trips, before the provider issues a surprise bill.
Further reading
Full docs in docs/models.md. Worked examples under examples/: agent-weather.yml (singleton), agent-matrix.yml (matrix), agent-custom-providers.yml (named providers), agent-llm-judge.yml (judge / jury), agent-issues-and-notifications.yml (multi-server).
14. The seven failure layers for URL targets
When a server target is an HTTP URL, opening the transport can fail in seven distinct places. mcptest reports the specific layer that failed so you do not have to guess. The layers, in the order the runner walks them:
- DNS. Hostname resolution failed (NXDOMAIN, timeout, etc.). Symptom:
error: failed to resolve <host>. - TCP. DNS resolved but the connection was refused or timed out. Symptom:
error: tcp connect failed. - TLS. TCP connected but the TLS handshake failed (expired cert, name mismatch, unsupported protocol). Symptom:
error: tls handshake failed. mcptest uses rustls; native-tls is not available. - HTTP. TLS finished but the HTTP request returned an unexpected status before MCP semantics applied (502 from a gateway, 404 at the wrong path). Symptom:
error: http <status> from <url>. - Authentication. HTTP succeeded but the server returned 401 or 403 and refresh-on-401 did not recover. Symptom:
error: authentication failed after refresh. - MCP initialize. Auth succeeded but the MCP
initializeexchange failed (missing fields, version mismatch, protocol error). Symptom:error: mcp initialize rejected: <reason>. - Readiness. Initialize succeeded but the server reported
ready: falseor did not advertise the capabilities the test requires. Symptom:error: server initialized but reported not ready.
Each layer maps to a different fix. DNS is your hostname; TCP is your firewall or process; TLS is your cert; HTTP is your gateway; auth is your credentials; initialize is your MCP implementation; readiness is your startup sequence.
mcptest doctor --url <url> walks the same layers in order and prints a one-line result per layer, so you can see how far you got without running an actual test.
Subprocess (stdio) servers fail in fewer places: spawn, stdio framing, MCP initialize, readiness. Same idea, four layers instead of seven.
15. Where to next?
This page is the overview. The detail is split across these pages.
- Tutorials, install paths, your first test. Start at
getting-started.md. It walks the hello-world flow end-to-end and points back here when concepts come up. - The complete YAML grammar. Open
yaml-reference.md. Every field, every default, every cross-link from this page. - Wiring mcptest into CI. See
guides/ci-integration.mdfor recipes covering GitHub Actions, GitLab CI, CircleCI, and Jenkins including reporter selection, artifact upload, and badge generation. - When something is wrong.
troubleshooting.mdis organized by symptom; copy a verbatim error message into the page and the match should land on the right entry. - Why we built it this way. The spec references and prior-art notes live in
research-references.md. - Contributor rules.
AGENTS.mdis the rulebook for code, lint, docs, and tickets. Read it before touching the repo.
If you finish this page and still feel uncertain about which matcher to use, which reporter to wire up, or which exit code your CI should special-case, that is a docs gap. File an issue at https://github.com/soapbucket/mcptest/issues, link this page, and we will fill it in.