Testing popular MCP servers end to end
This guide walks through testing two widely used Model Context Protocol (MCP) servers with mcptest, from a clean checkout to a green run: the official filesystem server and the fetch server. Both suites call the servers' real tools, exercise a broad slice of mcptest's features, and need no application programming interface (API) key for the core runnable tests.
The two suites live under examples/real-world/:
- examples/real-world/filesystem/filesystem.yml with its README and recorded output.
- examples/real-world/fetch/fetch.yml with its README and recorded output.
More servers: the standalone mcptest-examples repo extends this guide to ten popular MCP servers, including git, SQLite, memory, the everything reference server, and the authenticated GitHub, Notion, and Brave Search servers (with account-creation steps). Each ships a README, a recorded run, and a CI workflow.
Everything below is reproducible. The example outputs shown here are real recorded runs against the actual servers, not hand-written samples.
A note before you start: what runs today
mcptest validates the full v1 YAML surface, and the test runner evaluates most matchers live. Two behaviors in the current build are worth knowing up front so the examples make sense:
- The snapshot matcher is parsed and schema-validated, but its live evaluation is part of the runner that is still being wired up. A live run reports a snapshot test as pending. Both suites include a snapshot test tagged deferred, and you skip it for a green run with --skip-tag deferred.
- The runner does not yet substitute ${...} variable references into the server command or into tool arguments. Both suites therefore use literal paths and literal uniform resource locators (URLs) in their tool arguments, and keep their variables: blocks for documentation and for the commented agent tests.
Neither limitation affects the headline result: each suite runs green against its real server today.
Prerequisites
- The mcptest binary. From the repository root it is built at ./target/debug/mcptest.
- For the filesystem server: Node.js with npx on your PATH. The first run downloads the server package.
- For the fetch server: uv with uvx on your PATH (see <https://docs.astral.sh/uv/>), plus outbound network access for the fetch tool tests.
Validate either suite at any time without a server present. Validation runs the published JSON Schema and nothing else:
mcptest validate --config examples/real-world/filesystem/filesystem.yml
mcptest validate --config examples/real-world/fetch/fetch.yml
Both print ok.
Part 1: the filesystem server
The official filesystem server is launched as a subprocess and speaks MCP over standard input and output (stdio):
npx -y @modelcontextprotocol/server-filesystem <allowed-dir>
Source: <https://github.com/modelcontextprotocol/servers/tree/main/src/filesystem>
It exposes a sandboxed set of file tools. The suite uses these real tools, every name and argument read from the server's own tools/list response: list_allowed_directories, list_directory, write_file, read_text_file, get_file_info, and directory_tree.
The server block
servers:
  filesystem:
    command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    env:
      LOG_LEVEL: "warn"
The single positional argument /tmp is the directory the server is allowed to read and write. Everything outside it is denied, which is the basis for the negative test below.
One-time fixture setup
The read tests target a prepared fixture file so they do not depend on a file written earlier in the same run (tests run concurrently, so a write-then-read pair in two separate tests could race). directory_tree targets a small directory so its JavaScript Object Notation (JSON) output stays compact. Create them once:
printf 'hello world\nsecond line\nthird line\n' > /tmp/mcptest-fixture.txt
mkdir -p /tmp/mcptest-tree/sub
printf 'a' > /tmp/mcptest-tree/one.txt
printf 'b' > /tmp/mcptest-tree/sub/two.txt
macOS note: on macOS /tmp is a symbolic link to /private/tmp, and the server canonicalizes its allowed root to /private/tmp. Create the fixtures under /private/tmp instead, and point the server there at run time with the --server-command override shown later.
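The fixture commands and the macOS note can be folded into one script. The sketch below is a convenience under stated assumptions, not part of mcptest: the function name make_fixtures is hypothetical, and it relies on pwd -P to resolve the physical path of the root, so the fixtures land under /private/tmp on macOS and /tmp on Linux.

```shell
#!/bin/sh
# Hypothetical helper: create the mcptest fixtures under a given root.
# Symbolic links are resolved first, so /tmp becomes /private/tmp on macOS.
make_fixtures() {
    root=$(cd "$1" && pwd -P) || return 1
    printf 'hello world\nsecond line\nthird line\n' > "$root/mcptest-fixture.txt"
    mkdir -p "$root/mcptest-tree/sub"
    printf 'a' > "$root/mcptest-tree/one.txt"
    printf 'b' > "$root/mcptest-tree/sub/two.txt"
    # Print the canonical root so the caller can reuse it.
    printf '%s\n' "$root"
}

# Usage: make_fixtures /tmp
```

The printed path is the directory to pass as the server's allowed root when the canonical location differs from /tmp.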
A matcher tour
The suite uses a deliberate spread of matchers. A few representative tests:
The exact matcher pins a value with strict equality. Here the first content block of a directory listing must be a text block:
- name: "lists the allowed directory without error"
  server: filesystem
  tool: "list_directory"
  args:
    path: "/tmp"
  expect:
    - target: "result.content"
      matcher:
        schema:
          type: array
          minItems: 1
    - target: "result.content[0].type"
      matcher:
        exact: "text"
That same test shows the schema matcher: the content is validated as a non-empty array without pinning any file name.
Here the contains matcher checks object subset membership (a content block that includes {type: "text"}). The contains-all matcher then asserts every listed substring is present in the echoed path; a plain contains: would match a single case-sensitive substring:
- name: "writes a file and confirms the path"
  server: filesystem
  tool: "write_file"
  args:
    path: "/tmp/mcptest-write.txt"
    content: "written by mcptest"
  expect:
    - target: "result.content[0]"
      matcher:
        contains:
          type: "text"
    - target: "result.content[0].text"
      matcher:
        contains-all: ["Successfully wrote to", "/tmp/mcptest-write.txt"]
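To make the contains-all semantics concrete, here is a minimal shell model of "every listed substring is present". It is a sketch of the described behavior, not mcptest's implementation. The function name contains_all is hypothetical, and the sample text is assembled from the two expected substrings rather than copied from a recorded run.

```shell
#!/bin/sh
# Return 0 only if every remaining argument occurs as a
# case-sensitive substring of the first argument.
contains_all() {
    haystack=$1
    shift
    for needle in "$@"; do
        case $haystack in
            *"$needle"*) ;;      # this substring is present, keep checking
            *) return 1 ;;       # one missing substring fails the whole check
        esac
    done
    return 0
}

# Illustrative text, assumed shape of the write_file reply.
reply='Successfully wrote to /tmp/mcptest-write.txt'

if contains_all "$reply" "Successfully wrote to" "/tmp/mcptest-write.txt"; then
    echo "all substrings present"
fi
```

A single missing or differently cased substring makes the function fail, which is the behavior the test above relies on.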
The regex matcher pins a pattern instead of a literal value. The get_file_info tool returns a small key/value block, so the size line and the isFile flag are matched with patterns:
- name: "reports file info with a size and isFile flag"
  server: filesystem
  tool: "get_file_info"
  args:
    path: "/tmp/mcptest-fixture.txt"
  expect:
    - target: "result.content[0].text"
      matcher:
        regex: "size: [0-9]+"
    - target: "result.content[0].text"
      matcher:
        regex: "isFile: true"
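Patterns like these can be tried locally before they go into a suite. The sketch below uses grep -E against an illustrative key/value block. The sample text is an assumed shape, not recorded server output, and the helper name matches is hypothetical. It also assumes the regex matcher searches within the text rather than requiring a full match, as the unanchored patterns above suggest.

```shell
#!/bin/sh
# Illustrative stand-in for a get_file_info text block (assumed shape).
sample='size: 35
isFile: true
isDirectory: false'

# Return 0 when the extended regex in $1 matches anywhere in $2.
matches() {
    printf '%s\n' "$2" | grep -Eq "$1"
}

matches 'size: [0-9]+' "$sample" && echo "size pattern matches"
matches 'isFile: true' "$sample" && echo "isFile pattern matches"
```

grep -E implements POSIX extended regular expressions, which may differ from the engine mcptest uses. These two patterns use only literals, a bracket expression, and +, which behave the same in common engines.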
The is-json matcher parses a string target and validates the parsed document against an inline schema. The directory_tree tool returns a JSON array of nodes as its text, so it is a natural fit:
- name: "directory_tree returns a JSON tree"
  server: filesystem
  tool: "directory_tree"
  args:
    path: "/tmp/mcptest-tree"
  expect:
    - target: "result.content[0].text"
      matcher:
        is-json:
          schema:
            type: array
            items:
              type: object
              required: ["name", "type"]
The suite also uses icontains (case-insensitive substring), starts-with (prefix), levenshtein (a "close enough" edit-distance check), and not (universal negation) on other tools. See the file for all of them.
The negative test
Reading a path outside the allowed root must fail. The server sets result.isError to true and explains why:
- name: "denies a read outside the allowed root"
  server: filesystem
  tool: "read_text_file"
  args:
    path: "/etc/hosts"
  expect:
    - target: "result.isError"
      matcher:
        exact: true
    - target: "result.content[0].text"
      matcher:
        regex: "Access denied|outside allowed director"
    - target: "result.content[0].text"
      matcher:
        not:
          contains-all: ["Successfully"]
Compliance, tool quality, and performance
The suite asserts protocol behavior with a compliance: block. The initialize check confirms the server negotiates a date-shaped protocol version and advertises a tools capability; the tools/list check asserts the catalog schema:
compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"
    expect:
      - target: "result.protocolVersion"
        matcher:
          regex: "^2\\d{3}-\\d{2}-\\d{2}$"
      - target: "result.capabilities"
        matcher:
          schema:
            type: object
            required: ["tools"]
  - name: "advertises the filesystem tool catalog"
    server: filesystem
    check: "tools/list"
    expect:
      - target: "result.tools"
        matcher:
          schema:
            type: array
            minItems: 1
            items:
              type: object
              required: ["name", "description", "inputSchema"]
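The date-shaped protocol version pattern can be checked in isolation. In the YAML above, the double-quoted "\\d" reaches the regex engine as \d. The sketch below spells that as [0-9] because POSIX extended regular expressions (grep -E) have no \d shorthand. The helper name is_date_shaped is hypothetical. The first two candidates are published MCP revision strings and the last two are deliberately malformed.

```shell
#!/bin/sh
# The suite's pattern with \d written as [0-9] for grep -E.
pattern='^2[0-9]{3}-[0-9]{2}-[0-9]{2}$'

# Return 0 when the whole string has the shape 2YYY-MM-DD.
is_date_shaped() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

for v in 2024-11-05 2025-06-18 1.0 2025-6-18; do
    if is_date_shaped "$v"; then
        echo "$v: accepted"
    else
        echo "$v: rejected"
    fi
done
```

The pattern checks shape only: it would also accept a string such as 2099-99-99, which is enough for a compliance smoke test but is not date validation.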
The tool_quality: block scores the server's tool descriptions with the deterministic Tool Description Quality Score (TDQS) heuristics and gates on the worst tool's score (min_score), the average (mean_score), and the count of critical lint findings (critical_count):
tool_quality:
  - name: "filesystem tool descriptions meet the quality bar"
    server: filesystem
    expect:
      - target: min_score
        matcher: { schema: { minimum: 0.30 } }
      - target: mean_score
        matcher: { schema: { minimum: 0.50 } }
      - target: critical_count
        matcher: { schema: { maximum: 0 } }
The top-level performance: block sets a default per-test timeout and an advisory 95th-percentile (p95) latency budget. The advisory budget highlights slow tests in the report but does not by itself fail the run:
performance:
  default_timeout_ms: 30000
  p95_latency_ms: 2000
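For readers unfamiliar with the term, a 95th-percentile figure can be computed from a list of per-test latencies as sketched below. This uses the nearest-rank method, which is an assumption: the guide does not say which percentile definition mcptest applies. The function name p95 and the sample latencies are hypothetical.

```shell
#!/bin/sh
# Nearest-rank p95 of newline-separated latencies in milliseconds (stdin).
p95() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            # Integer ceiling of 0.95 * NR, avoiding floating-point rounding.
            rank = int((NR * 95 + 99) / 100)
            print v[rank]
        }'
}

# Hypothetical per-test latencies, not taken from a recorded run.
printf '%s\n' 12 15 9 2300 14 11 13 10 16 12 | p95
```

With only ten samples the nearest-rank p95 is the slowest test, so one outlier decides the figure. That sensitivity is one reason an advisory budget, rather than a hard failure, is a reasonable default.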
Running it
On Linux or in continuous integration (CI), where the allowed root is /tmp:
mcptest run --config examples/real-world/filesystem/filesystem.yml --skip-tag deferred
On macOS, point the server at the canonical /private/tmp root and create the fixtures there first:
mcptest run --config examples/real-world/filesystem/filesystem.yml \
--skip-tag deferred --reporter minimal \
--server-command "npx -y @modelcontextprotocol/server-filesystem /private/tmp"
The recorded result (a real run, captured on macOS with the override above; --reporter minimal prints the compact one-line summary instead of the default per-test listing):
Server-target override applied: --server-command
ran 11 tests: 11 passed, 0 failed, 0 skipped (9ms)
The full recorded output, including the pending snapshot test from a run without --skip-tag deferred, is in example-output.txt.
Part 2: the fetch server
The fetch server is also a stdio subprocess, launched with uvx:
uvx mcp-server-fetch
Source: <https://github.com/modelcontextprotocol/servers/tree/main/src/fetch>
It exposes a single tool, fetch, whose argument shape was read from the server's own tools/list response:
- url (string, required): the URL to fetch.
- max_length (integer, default 5000): the maximum number of characters returned.
- start_index (integer, default 0): the character offset to start from, useful for paging through a truncated body.
- raw (boolean, default false): return the raw HyperText Markup Language (HTML) instead of simplified markdown.
By default the server converts a page to markdown and prefixes the body with a line Contents of <url>:. A successful call sets result.isError to false.
The server block
servers:
  fetch:
    command: ["uvx", "mcp-server-fetch"]
A matcher tour
The fetch tool makes real outbound HTTP requests, so those tests are tagged network. The suite uses example.com, a domain reserved by IANA for use in documentation and examples.
The starts-with matcher pins the markdown banner; exact confirms the call did not error:
- name: "fetches example.com as markdown"
  server: fetch
  tool: "fetch"
  args:
    url: "https://example.com"
    max_length: 500
  tags: ["network"]
  expect:
    - target: "result.isError"
      matcher:
        exact: false
    - target: "result.content[0].text"
      matcher:
        starts-with: "Contents of https://example.com/:"
The regex matcher pins the markdown link the page renders to iana.org:
- name: "renders a markdown link to iana.org"
  server: fetch
  tool: "fetch"
  args:
    url: "https://example.com"
    max_length: 500
  tags: ["network"]
  expect:
    - target: "result.content[0].text"
      matcher:
        regex: "\\[Learn more\\]\\(https://[^)]*iana\\.org"
The suite covers the max_length and start_index arguments directly. A small max_length truncates the body, and the server appends a note telling the caller which start_index to use next:
- name: "honors a small max_length and marks truncation"
  server: fetch
  tool: "fetch"
  args:
    url: "https://example.com"
    max_length: 80
    start_index: 0
  tags: ["network"]
  expect:
    - target: "result.content[0].text"
      matcher:
        contains-all: ["Content truncated", "start_index"]
A non-zero start_index skips the opening of the page, asserted with the not matcher:
- name: "honors start_index by skipping the opening"
  server: fetch
  tool: "fetch"
  args:
    url: "https://example.com"
    max_length: 200
    start_index: 60
  tags: ["network"]
  expect:
    - target: "result.content[0].text"
      matcher:
        not:
          contains-all: ["This domain is for use"]
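The interaction between max_length and start_index is easier to see on a local string. The sketch below models the two arguments as a character window. It is a simplification, not the server's code: the function name page is hypothetical, the sample body is illustrative, and cut works per line, so the model only applies to single-line text.

```shell
#!/bin/sh
# Return at most max_length characters of a single-line string,
# starting at the 0-based start_index.
page() {
    text=$1
    start_index=$2
    max_length=$3
    # cut counts from 1, so shift the 0-based offset by one.
    printf '%s\n' "$text" |
        cut -c "$((start_index + 1))-$((start_index + max_length))"
}

# Illustrative body, not a recorded fetch result.
body='Contents of https://example.com/: sample body text'

page "$body" 0 11     # first window
page "$body" 11 11    # next window, resuming where the first one stopped
```

Paging through a long body amounts to calling the tool repeatedly and advancing start_index by the number of characters already received.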
The suite also uses schema, contains-all, contains-any, and icontains on the fetched body. See the file for all of them.
The negative test (offline-safe)
A malformed URL is rejected by the server's own argument validation before any network call, so this test runs even with no network:
- name: "rejects a malformed URL"
  server: fetch
  tool: "fetch"
  args:
    url: "not-a-valid-url"
  expect:
    - target: "result.isError"
      matcher:
        exact: true
    - target: "result.content[0].text"
      matcher:
        regex: "validation error|valid URL|url_parsing"
Compliance, tool quality, and performance
The fetch suite includes the same kinds of compliance:, tool_quality:, and performance: blocks as the filesystem suite. The tools/list compliance check asserts the catalog advertises the fetch tool with a name, a description, and an input schema; the tool_quality: block scores the single tool's description; the performance: block sets a larger advisory p95 budget because network fetches are slower and more variable than local file calls:
performance:
  default_timeout_ms: 30000
  p95_latency_ms: 5000
Running it
Run the deterministic core (skips the snapshot test):
mcptest run --config examples/real-world/fetch/fetch.yml --skip-tag deferred --reporter minimal
The recorded result (a real run, captured with outbound network access; --reporter minimal prints the compact one-line summary):
ran 10 tests: 10 passed, 0 failed, 0 skipped (3093ms)
The roughly three-second wall time is dominated by the live HTTP fetches, not by mcptest. To run offline, drop the network tests with a second --skip-tag; the malformed-URL test still runs:
mcptest run --config examples/real-world/fetch/fetch.yml \
--skip-tag deferred --skip-tag network --reporter minimal
ran 2 tests: 2 passed, 0 failed, 0 skipped (1ms)
The full recorded output, including the pending snapshot test, is in example-output.txt.
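The online and offline invocations differ by one flag, so a CI job can build the command from a single switch. The sketch below is one way to do that under stated assumptions: the function name fetch_suite_cmd and its offline argument are hypothetical, and only flags shown in this guide are used. It prints the command rather than running it, so the sketch works on a machine without mcptest installed.

```shell
#!/bin/sh
# Print the mcptest command for the fetch suite.
# Pass "offline" as the first argument to also skip the network-tagged tests.
fetch_suite_cmd() {
    cmd="mcptest run --config examples/real-world/fetch/fetch.yml --skip-tag deferred"
    if [ "${1:-}" = "offline" ]; then
        cmd="$cmd --skip-tag network"
    fi
    printf '%s\n' "$cmd --reporter minimal"
}

fetch_suite_cmd            # command for a run with network access
fetch_suite_cmd offline    # command for a run without network access
```

In a pipeline, the printed line can be handed to sh -c once mcptest and uvx are on the PATH.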
Snapshot tests and agent tests
Each suite includes one snapshot test, tagged deferred. The snapshot matcher records a value on its first wired run and diffs against the recording on later runs. It is parsed and schema-validated today, and its live evaluation lands with the runner that is still being wired up, so a full run reports it as pending. Skip it with --skip-tag deferred for a green run; the test stays in the file so the suite is ready the moment live evaluation lands. The filesystem snapshot pins the allowed-directories banner; the fetch snapshot pins the example.com markdown.
Each suite also ends with a commented-out agents: block. An agent test runs a real language model against the server and asserts on the resulting tool calls. It is commented out because it needs an ANTHROPIC_API_KEY and its result is model-dependent, so it is not part of the deterministic core. Uncomment it and export the key to watch the model route to a tool. For the filesystem server the model should pick list_directory or directory_tree; for the fetch server it should call fetch with the URL. Here is the filesystem agent test as a literal example:
agents:
  - name: "model lists the directory when asked what is in it"
    model: claude-sonnet-4-5
    servers: [filesystem]
    prompt: "What files are in ${base_dir}? Use the tools available to you."
    max_turns: 4
    expect:
      - target: tool_calls[0].name
        matcher:
          contains-any: ["list_directory", "directory_tree", "read_text_file"]
      - target: tool_calls[0].server
        matcher:
          exact: filesystem
Feature coverage at a glance
Both suites exercise the same broad feature set, applied to each server's real tools.
| Feature | Filesystem suite | Fetch suite |
|---|---|---|
| stdio servers: entry | npx filesystem server | uvx fetch server |
| variables: block | literal and environment-backed | one literal value |
| exact matcher | yes | yes |
| contains (object subset) | yes | no |
| contains-all / contains-any | yes | yes |
| icontains | yes | yes |
| starts-with | no | yes |
| regex | yes | yes |
| schema | yes | yes |
| is-json | yes | no (fetch returns markdown) |
| levenshtein | yes | no |
| not | yes | yes |
| Negative test (result.isError: true) | read outside sandbox | malformed URL |
| compliance: initialize | yes | yes |
| compliance: tools/list | yes | yes |
| Snapshot test (tagged deferred) | yes | yes |
| tool_quality: block | yes | yes |
| performance: budget | yes | yes |
| Optional agents: block (needs API key) | commented | commented |
See also
- YAML test format reference, the full field-by-field reference.
- Snapshot tests.
- Compliance baseline.
- Agent and model testing.
- Docker and package runners, for running servers that ship as containers or packages.