Testing popular MCP servers end to end

This guide walks through testing two widely used Model Context Protocol (MCP) servers with mcptest, from a clean checkout to a green run: the official filesystem server and the fetch server. Both suites call the servers' real tools, exercise a broad slice of mcptest's features, and need no application programming interface (API) key for the core runnable tests.

The two suites live under examples/real-world/:

examples/real-world/filesystem/filesystem.yml with its README and recorded output.
examples/real-world/fetch/fetch.yml with its README and recorded output.

More servers: the standalone mcptest-examples repo extends this guide to ten popular MCP servers, including git, SQLite, memory, the everything reference server, and the authenticated GitHub, Notion, and Brave Search servers (with account-creation steps). Each ships a README, a recorded run, and a CI workflow.

Everything below is reproducible. The example outputs shown here are real recorded runs against the actual servers, not hand-written samples.

A note before you start: what runs today

mcptest validates the full v1 YAML surface, and the test runner evaluates most matchers live. Two behaviors in the current build are worth knowing up front so the examples make sense:

The snapshot matcher is parsed and schema-validated, but its live evaluation is part of the runner that is still being wired up. A live run reports a snapshot test as pending. Both suites include one snapshot test tagged deferred; skip it with --skip-tag deferred for a green run.
The runner does not yet substitute ${...} variable references into the server command or into tool arguments. Both suites therefore use literal paths and literal uniform resource locators (URLs) in their tool arguments, and keep their variables: blocks for documentation and for the commented agent tests.

Neither limitation affects the headline result: each suite runs green against its real server today.

Prerequisites

The mcptest binary. From the repository root it is built at ./target/debug/mcptest.
For the filesystem server: Node.js with npx on your PATH. The first run downloads the server package.
For the fetch server: uv with uvx on your PATH (see <https://docs.astral.sh/uv/>), plus outbound network access for the fetch tool tests.

Validate either suite at any time without a server present. Validation checks the file against the published JSON Schema and does nothing else:

mcptest validate --config examples/real-world/filesystem/filesystem.yml
mcptest validate --config examples/real-world/fetch/fetch.yml

Both print ok.

Part 1: the filesystem server

The official filesystem server is launched as a subprocess and speaks MCP over standard input and output (stdio):

npx -y @modelcontextprotocol/server-filesystem <allowed-dir>

Source: <https://github.com/modelcontextprotocol/servers/tree/main/src/filesystem>

It exposes a sandboxed set of file tools. The suite uses these real tools, every name and argument read from the server's own tools/list response: list_allowed_directories, list_directory, write_file, read_text_file, get_file_info, and directory_tree.

The server block

servers:
  filesystem:
    command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    env:
      LOG_LEVEL: "warn"

The single positional argument /tmp is the directory the server is allowed to read and write. Everything outside it is denied, which is the basis for the negative test below.

One-time fixture setup

The read tests target a prepared fixture file so they do not depend on a file written earlier in the same run (tests run concurrently, so a write-then-read pair in two separate tests could race). directory_tree targets a small directory so its JavaScript Object Notation (JSON) output stays compact. Create them once:

printf 'hello world\nsecond line\nthird line\n' > /tmp/mcptest-fixture.txt
mkdir -p /tmp/mcptest-tree/sub
printf 'a' > /tmp/mcptest-tree/one.txt
printf 'b' > /tmp/mcptest-tree/sub/two.txt

macOS note: on macOS /tmp is a symbolic link to /private/tmp, and the server canonicalizes its allowed root to /private/tmp. Create the fixtures under /private/tmp instead, and point the server there at run time with the --server-command override shown later.
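
To see which root applies on a given machine, the symbolic link can be resolved before the fixtures are created. The following is a minimal Python sketch, not part of mcptest or the server. The helper name canonical_root is hypothetical, and the sketch assumes the server's canonicalization is equivalent to resolving symbolic links.

```python
import os


def canonical_root(path: str) -> str:
    """Resolve symbolic links, approximating what a canonicalizing server sees."""
    return os.path.realpath(path)


# On macOS this is expected to print /private/tmp; on Linux it is usually /tmp.
print(canonical_root("/tmp"))
```

If the printed path differs from /tmp, create the fixtures under the printed path and pass it to the server.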

A matcher tour

The suite uses a deliberate spread of matchers. A few representative tests:

The exact matcher pins a value with strict equality. Here the first content block of a directory listing must be a text block:

- name: "lists the allowed directory without error"
  server: filesystem
  tool: "list_directory"
  args:
    path: "/tmp"
  expect:
    - target: "result.content"
      matcher:
        schema:
          type: array
          minItems: 1
    - target: "result.content[0].type"
      matcher:
        exact: "text"

That same test shows the schema matcher: the content is validated as a non-empty array without pinning any file name.

Here the contains matcher checks object subset membership (a content block that includes {type: "text"}). The contains-all matcher then asserts every listed substring is present in the echoed path; a plain contains: would match a single case-sensitive substring:

- name: "writes a file and confirms the path"
  server: filesystem
  tool: "write_file"
  args:
    path: "/tmp/mcptest-write.txt"
    content: "written by mcptest"
  expect:
    - target: "result.content[0]"
      matcher:
        contains:
          type: "text"
    - target: "result.content[0].text"
      matcher:
        contains-all: ["Successfully wrote to", "/tmp/mcptest-write.txt"]
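
The two semantics in this test can be sketched in a few lines of Python. This is an illustration of the behavior described above, not mcptest's implementation. The function names are hypothetical, the response block is a made-up sample, and the sketch assumes contains-all is case-sensitive like plain contains.

```python
from typing import Any, Iterable


def object_contains(actual: Any, expected: Any) -> bool:
    """Subset check: every key in `expected` must be in `actual` with a matching value."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and object_contains(actual[key], value)
            for key, value in expected.items()
        )
    return actual == expected


def contains_all(text: str, needles: Iterable[str]) -> bool:
    """True only when every substring occurs in `text` (assumed case-sensitive)."""
    return all(needle in text for needle in needles)


# Hypothetical content block shaped like the write_file result asserted above.
block = {"type": "text", "text": "Successfully wrote to /tmp/mcptest-write.txt"}
print(object_contains(block, {"type": "text"}))
print(contains_all(block["text"], ["Successfully wrote to", "/tmp/mcptest-write.txt"]))
```

The subset check ignores extra keys such as text, which is why the first assertion stays stable even if the server adds fields to the block.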

The regex matcher pins a pattern instead of a literal value. The get_file_info tool returns a small key/value block, so the size line and the isFile flag are matched with patterns:

- name: "reports file info with a size and isFile flag"
  server: filesystem
  tool: "get_file_info"
  args:
    path: "/tmp/mcptest-fixture.txt"
  expect:
    - target: "result.content[0].text"
      matcher:
        regex: "size: [0-9]+"
    - target: "result.content[0].text"
      matcher:
        regex: "isFile: true"

The is-json matcher parses a string target and validates the parsed document against an inline schema. The directory_tree tool returns a JSON array of nodes as its text, so it is a natural fit:

- name: "directory_tree returns a JSON tree"
  server: filesystem
  tool: "directory_tree"
  args:
    path: "/tmp/mcptest-tree"
  expect:
    - target: "result.content[0].text"
      matcher:
        is-json:
          schema:
            type: array
            items:
              type: object
              required: ["name", "type"]
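
The two-step nature of is-json (parse first, then validate) can be sketched in Python. This is a hand-written approximation of the inline schema above, not mcptest's validator. The function name is hypothetical and the sample tree is invented to resemble the fixture directory.

```python
import json
from typing import Any


def is_json_tree(text: str) -> bool:
    """Parse `text` as JSON, then require an array of objects with name and type."""
    try:
        doc: Any = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(doc, list) and all(
        isinstance(node, dict) and {"name", "type"} <= set(node) for node in doc
    )


# Hypothetical directory_tree text for the fixture directory; real output may differ.
sample = json.dumps([
    {"name": "one.txt", "type": "file"},
    {"name": "sub", "type": "directory",
     "children": [{"name": "two.txt", "type": "file"}]},
])
print(is_json_tree(sample))
print(is_json_tree("not json"))
```

Like the schema in the test, the sketch checks only the top-level nodes and does not descend into children.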

The suite also uses icontains (case-insensitive substring), starts-with (prefix), levenshtein (a "close enough" edit-distance check), and not (universal negation) on other tools. See the file for all of them.
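
For readers unfamiliar with edit distance, the idea behind a "close enough" check can be sketched as follows. This is a textbook Levenshtein implementation in Python, not mcptest's code. The close_enough helper and its max_distance parameter are hypothetical and do not reflect the matcher's actual YAML options.

```python
def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def close_enough(actual: str, expected: str, max_distance: int) -> bool:
    """Hypothetical threshold check built on the distance."""
    return edit_distance(actual, expected) <= max_distance


print(edit_distance("kitten", "sitting"))
```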

The negative test

Reading a path outside the allowed root must fail. The server sets result.isError to true and explains why:

- name: "denies a read outside the allowed root"
  server: filesystem
  tool: "read_text_file"
  args:
    path: "/etc/hosts"
  expect:
    - target: "result.isError"
      matcher:
        exact: true
    - target: "result.content[0].text"
      matcher:
        regex: "Access denied|outside allowed director"
    - target: "result.content[0].text"
      matcher:
        not:
          contains-all: ["Successfully"]

Compliance, tool quality, and performance

The suite asserts protocol behavior with a compliance: block. The initialize check confirms the server negotiates a date-shaped protocol version and advertises a tools capability; the tools/list check asserts the catalog schema:

compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"
    expect:
      - target: "result.protocolVersion"
        matcher:
          regex: "^2\\d{3}-\\d{2}-\\d{2}$"
      - target: "result.capabilities"
        matcher:
          schema:
            type: object
            required: ["tools"]

  - name: "advertises the filesystem tool catalog"
    server: filesystem
    check: "tools/list"
    expect:
      - target: "result.tools"
        matcher:
          schema:
            type: array
            minItems: 1
            items:
              type: object
              required: ["name", "description", "inputSchema"]
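
The protocol-version pattern in the initialize check is easy to try outside the suite. The Python sketch below uses the same pattern (in a double-quoted YAML string, \\d denotes the regex token \d). The helper name is hypothetical, and the candidate strings are illustrative; 2024-11-05 and 2025-06-18 are published MCP revisions.

```python
import re

# Same pattern as the YAML above, written as a Python raw string.
VERSION_PATTERN = re.compile(r"^2\d{3}-\d{2}-\d{2}$")


def is_date_shaped(version: str) -> bool:
    """True when the version looks like a 2xxx-xx-xx date string."""
    return VERSION_PATTERN.search(version) is not None


for candidate in ["2024-11-05", "2025-06-18", "1.0.0", "2024-11-05-draft"]:
    print(candidate, is_date_shaped(candidate))
```

The pattern checks shape only: a string such as 2099-99-99 would also pass, which is acceptable here because the goal is to avoid pinning one specific revision.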

The tool_quality: block scores the server's tool descriptions with the deterministic Tool Description Quality Score (TDQS) heuristics and gates on the worst tool's score (min_score), the average (mean_score), and the count of critical lint findings (critical_count):

tool_quality:
  - name: "filesystem tool descriptions meet the quality bar"
    server: filesystem
    expect:
      - target: min_score
        matcher: { schema: { minimum: 0.30 } }
      - target: mean_score
        matcher: { schema: { minimum: 0.50 } }
      - target: critical_count
        matcher: { schema: { maximum: 0 } }
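
How the three gates interact can be sketched in Python. This is not the TDQS scorer, only the gating arithmetic applied to made-up numbers. The function name and the per-tool scores are hypothetical, and the sketch assumes scores fall between 0 and 1, as the thresholds suggest.

```python
from statistics import mean


def quality_gate(scores, critical_count,
                 min_floor=0.30, mean_floor=0.50, max_critical=0):
    """Return the names of the gates that fail for a set of per-tool scores."""
    values = list(scores.values())
    failures = []
    if min(values) < min_floor:         # worst tool must reach the floor
        failures.append("min_score")
    if mean(values) < mean_floor:       # average must reach its floor
        failures.append("mean_score")
    if critical_count > max_critical:   # no critical lint findings allowed
        failures.append("critical_count")
    return failures


# Hypothetical per-tool scores; these are not recorded TDQS results.
scores = {"list_directory": 0.72, "write_file": 0.55, "directory_tree": 0.28}
print(quality_gate(scores, critical_count=0))
```

With these numbers the average clears its floor but one weak description still trips min_score, which is the reason for gating on the worst tool as well as the mean.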

The top-level performance: block sets a default per-test timeout and an advisory 95th-percentile (p95) latency budget. The advisory budget highlights slow tests in the report but does not by itself fail the run:

performance:
  default_timeout_ms: 30000
  p95_latency_ms: 2000
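
As a reminder of what a p95 budget measures, here is a small Python sketch. The source does not say which percentile method mcptest uses, so the nearest-rank method below is an assumption, and the function names and latency samples are hypothetical.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def over_budget(latencies_ms, budget_ms=2000):
    """Advisory check: the caller reports the result rather than failing the run."""
    return p95(latencies_ms) > budget_ms


# Hypothetical per-test latencies in milliseconds, with one slow outlier.
samples = [12, 15, 9, 40, 2300, 18, 22, 11, 14, 16]
print(p95(samples), over_budget(samples))
```

With only ten samples, a single outlier determines the p95, so small suites can exceed the budget because of one slow call.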

Running it

On Linux or in continuous integration (CI), where the allowed root is /tmp:

mcptest run --config examples/real-world/filesystem/filesystem.yml --skip-tag deferred

On macOS, point the server at the canonical /private/tmp root and create the fixtures there first:

mcptest run --config examples/real-world/filesystem/filesystem.yml \
  --skip-tag deferred --reporter minimal \
  --server-command "npx -y @modelcontextprotocol/server-filesystem /private/tmp"

The recorded result (a real run, captured on macOS with the override above; --reporter minimal prints the compact one-line summary instead of the default per-test listing):

Server-target override applied: --server-command
ran 11 tests: 11 passed, 0 failed, 0 skipped (9ms)

The full recorded output, including the pending snapshot test from a run without --skip-tag deferred, is in example-output.txt.

Part 2: the fetch server

The fetch server is also a stdio subprocess, launched with uvx:

uvx mcp-server-fetch

Source: <https://github.com/modelcontextprotocol/servers/tree/main/src/fetch>

It exposes a single tool, fetch, whose argument shape was read from the server's own tools/list response:

url (string, required): the URL to fetch.
max_length (integer, default 5000): the maximum number of characters returned.
start_index (integer, default 0): the character offset to start from, useful for paging through a truncated body.
raw (boolean, default false): return the raw HyperText Markup Language (HTML) instead of simplified markdown.

By default the server converts a page to markdown and prefixes the body with a line Contents of <url>:. A successful call sets result.isError to false.

The server block

servers:
  fetch:
    command: ["uvx", "mcp-server-fetch"]

A matcher tour

The fetch tool makes real outbound HTTP requests, so those tests are tagged network. The suite uses example.com, the domain reserved for documentation and examples, which can be used without asking permission.

The starts-with matcher pins the markdown banner; exact confirms the call did not error:

- name: "fetches example.com as markdown"
  server: fetch
  tool: "fetch"
  args:
    url: "https://example.com"
    max_length: 500
  tags: ["network"]
  expect:
    - target: "result.isError"
      matcher:
        exact: false
    - target: "result.content[0].text"
      matcher:
        starts-with: "Contents of https://example.com/:"

The regex matcher pins the markdown link the page renders to iana.org:

- name: "renders a markdown link to iana.org"
  server: fetch
  tool: "fetch"
  args:
    url: "https://example.com"
    max_length: 500
  tags: ["network"]
  expect:
    - target: "result.content[0].text"
      matcher:
        regex: "\\[Learn more\\]\\(https://[^)]*iana\\.org"

The suite covers the max_length and start_index arguments directly. A small max_length truncates the body, and the server appends a note telling the caller which start_index to use next:

- name: "honors a small max_length and marks truncation"
  server: fetch
  tool: "fetch"
  args:
    url: "https://example.com"
    max_length: 80
    start_index: 0
  tags: ["network"]
  expect:
    - target: "result.content[0].text"
      matcher:
        contains-all: ["Content truncated", "start_index"]

A non-zero start_index skips the opening of the page, asserted with the not matcher:

- name: "honors start_index by skipping the opening"
  server: fetch
  tool: "fetch"
  args:
    url: "https://example.com"
    max_length: 200
    start_index: 60
  tags: ["network"]
  expect:
    - target: "result.content[0].text"
      matcher:
        not:
          contains-all: ["This domain is for use"]
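
The paging behavior these two tests rely on can be modeled locally. The Python sketch below is a stand-in for the server, not its code: it assumes max_length and start_index act as a character window over the converted text, it does not reproduce the truncation note, and the function names and page text are hypothetical.

```python
def fetch_window(body: str, start_index: int = 0, max_length: int = 5000) -> str:
    """Local stand-in for one fetch call: return at most max_length characters."""
    return body[start_index:start_index + max_length]


def read_all(body: str, max_length: int) -> str:
    """Page through the body by advancing start_index by each window's length."""
    parts, start = [], 0
    while True:
        window = fetch_window(body, start, max_length)
        if not window:
            break
        parts.append(window)
        start += len(window)
    return "".join(parts)


# Hypothetical page text; the real example.com markdown is not reproduced here.
page = "Contents of https://example.com/:\n" + "lorem ipsum " * 20
print(len(fetch_window(page, 0, 80)))
print(read_all(page, 80) == page)
```

A window that starts at a non-zero offset no longer begins with the opening text, which is the property the not matcher asserts in the test above.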

The suite also uses schema, contains-all, contains-any, and icontains on the fetched body. See the file for all of them.

The negative test (offline-safe)

A malformed URL is rejected by the server's own argument validation before any network call, so this test runs even with no network:

- name: "rejects a malformed URL"
  server: fetch
  tool: "fetch"
  args:
    url: "not-a-valid-url"
  expect:
    - target: "result.isError"
      matcher:
        exact: true
    - target: "result.content[0].text"
      matcher:
        regex: "validation error|valid URL|url_parsing"
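
Why this test needs no network can be illustrated with a purely local check. The Python sketch below uses the standard library's urlparse and is only an analogy: the server's own validation is not shown in this guide and may apply different rules. The function name is hypothetical.

```python
from urllib.parse import urlparse


def looks_like_http_url(value: str) -> bool:
    """Local, offline check: require an http(s) scheme and a host part."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(looks_like_http_url("not-a-valid-url"))
print(looks_like_http_url("https://example.com"))
```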

Compliance, tool quality, and performance

The fetch suite includes the same kinds of compliance:, tool_quality:, and performance: blocks as the filesystem suite. The tools/list compliance check asserts the catalog advertises the fetch tool with a name, a description, and an input schema; the tool_quality: block scores the single tool's description; the performance: block sets a larger advisory p95 budget because network fetches are slower and more variable than local file calls:

performance:
  default_timeout_ms: 30000
  p95_latency_ms: 5000

Running it

Run the deterministic core (skips the snapshot test):

mcptest run --config examples/real-world/fetch/fetch.yml --skip-tag deferred --reporter minimal

The recorded result (a real run, captured with outbound network access; --reporter minimal prints the compact one-line summary):

ran 10 tests: 10 passed, 0 failed, 0 skipped (3093ms)

The roughly three-second wall time is dominated by the live HTTP fetches, not by mcptest. To run offline, drop the network tests with a second --skip-tag; the malformed-URL test still runs:

mcptest run --config examples/real-world/fetch/fetch.yml \
  --skip-tag deferred --skip-tag network --reporter minimal

ran 2 tests: 2 passed, 0 failed, 0 skipped (1ms)
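
The effect of stacking --skip-tag flags can be sketched as a simple filter. This Python sketch is an assumption about the selection rule (a test is dropped if it carries any skipped tag), not mcptest's code. The suite below is a shortened, hypothetical stand-in, and the snapshot test name is invented.

```python
def select_tests(tests, skip_tags):
    """Keep only the tests that carry none of the skipped tags."""
    skip = set(skip_tags)
    return [t for t in tests if not skip & set(t.get("tags", []))]


# Hypothetical, shortened suite; most names echo the tests shown above.
suite = [
    {"name": "fetches example.com as markdown", "tags": ["network"]},
    {"name": "renders a markdown link to iana.org", "tags": ["network"]},
    {"name": "rejects a malformed URL"},
    {"name": "snapshot of example.com", "tags": ["deferred", "network"]},
]
for test in select_tests(suite, ["deferred", "network"]):
    print(test["name"])
```

In the recorded summaries above, filtered tests do not appear in the skipped count, which suggests they are removed before the run rather than reported as skipped.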

The full recorded output, including the pending snapshot test, is in example-output.txt.

Snapshot tests and agent tests

Each suite includes one snapshot test, tagged deferred. The snapshot matcher records a value on its first wired run and diffs against the recording on later runs. It is parsed and schema-validated today, and its live evaluation lands with the runner that is still being wired up, so a full run reports it as pending. Skip it with --skip-tag deferred for a green run; the test stays in the file so the suite is ready the moment live evaluation lands. The filesystem snapshot pins the allowed-directories banner; the fetch snapshot pins the example.com markdown.
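
The record-then-diff idea can be sketched in Python. Because snapshot evaluation is not yet wired up, nothing here reflects mcptest's storage format or file layout. The function name, the .snap file name, and the banner text are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def check_snapshot(path: Path, value) -> str:
    """Record the value on first use; compare against the recording afterwards."""
    current = json.dumps(value, sort_keys=True, indent=2)
    if not path.exists():
        path.write_text(current)
        return "recorded"
    return "match" if path.read_text() == current else "mismatch"


with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "allowed-directories.snap"  # hypothetical file name
    print(check_snapshot(snap, {"text": "Allowed directories:\n/tmp"}))
    print(check_snapshot(snap, {"text": "Allowed directories:\n/tmp"}))
    print(check_snapshot(snap, {"text": "Allowed directories:\n/private/tmp"}))
```

The last call shows why a snapshot recorded on Linux would not match on macOS if the value includes the canonical root.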

Each suite also ends with a commented-out agents: block. An agent test runs a real language model against the server and asserts on the resulting tool calls. It is commented out because it needs an ANTHROPIC_API_KEY and its result is model-dependent, so it is not part of the deterministic core. Uncomment it and export the key to watch the model route to a tool. For the filesystem server the model should pick list_directory or directory_tree, although the assertion also accepts read_text_file; for the fetch server it should call fetch with the URL. Here is the filesystem agent test as a literal example:

agents:
  - name: "model lists the directory when asked what is in it"
    model: claude-sonnet-4-5
    servers: [filesystem]
    prompt: "What files are in ${base_dir}? Use the tools available to you."
    max_turns: 4
    expect:
      - target: tool_calls[0].name
        matcher:
          contains-any: ["list_directory", "directory_tree", "read_text_file"]
      - target: tool_calls[0].server
        matcher:
          exact: filesystem

Feature coverage at a glance

Both suites exercise the same broad feature set, applied to each server's real tools.

| Feature | Filesystem suite | Fetch suite |
| --- | --- | --- |
| stdio `servers:` entry | `npx` filesystem server | `uvx` fetch server |
| `variables:` block | literal and environment-backed | one literal value |
| `exact` matcher | yes | yes |
| `contains` (object subset) | yes | no |
| `contains-all` / `contains-any` | yes | yes |
| `icontains` | yes | yes |
| `starts-with` | no | yes |
| `regex` | yes | yes |
| `schema` | yes | yes |
| `is-json` | yes | no (fetch returns markdown) |
| `levenshtein` | yes | no |
| `not` | yes | yes |
| Negative test (`result.isError: true`) | read outside sandbox | malformed URL |
| `compliance:` `initialize` | yes | yes |
| `compliance:` `tools/list` | yes | yes |
| Snapshot test (tagged `deferred`) | yes | yes |
| `tool_quality:` block | yes | yes |
| `performance:` budget | yes | yes |
| Optional `agents:` block (needs API key) | commented | commented |
