CLI reference

Complete reference for every subcommand and global flag in the mcptest binary. Source of truth is crates/mcptest/src/cli/ (the Command enum in cli/mod.rs and the per-subcommand Args structs under cli/args/); this page mirrors that at the v1.0 cut. When a flag is wired to a stub handler, that is called out so a reader knows the implementation work is still pending.

For a friendlier introduction, start with getting-started.md. The YAML test format is documented in yaml-reference.md, and common failure modes are covered in troubleshooting.md.

Synopsis

mcptest [GLOBAL_OPTIONS] <SUBCOMMAND> [ARGS]

mcptest --help prints a clap-generated summary of every flag below. mcptest --version prints the build version. mcptest <SUBCOMMAND> --help prints the per-subcommand summary including any subcommand-specific flags.

Global flags are accepted before or after the subcommand name: both mcptest --debug run and mcptest run --debug parse identically.

Global options

These flags are declared on GlobalArgs in crates/mcptest/src/cli/global.rs. Every subcommand inherits them via #[command(flatten)], so they work uniformly regardless of which subcommand you call.

Output and logging

--no-color

--debug

--verbose

Logging

mcptest emits structured log events through tracing once the binary starts. The subscriber writes to stderr (stdout is reserved for reporter output) and is filtered by an EnvFilter resolved from the sources listed below.

--log-level <LEVEL>

# Trace cache decisions for one debugging session.
mcptest --log-level "mcptest_core::cache=debug" run

# Quietest possible run; exit code is the only signal.
mcptest --log-level off run

Filter resolution precedence

Highest first:

  1. --log-level <VAL>
  2. MCPTEST_LOG env var. Use this when a parent process sets RUST_LOG=trace for its own purposes and you do not want that flood to leak into mcptest output; it takes effect because it ranks above RUST_LOG.
  3. RUST_LOG env var (standard convention).
  4. --debug (back-compat: maps to mcptest=debug,mcptest_core=debug,mcptest_config=debug).
  5. --verbose (back-compat: maps to mcptest=info,mcptest_core=info,mcptest_config=info).
  6. Built-in default: warn.
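
The "first source that is set wins" rule in the list above can be sketched as follows. This is an illustrative Python model, not the Rust implementation: resolve_filter and its parameters are hypothetical names, and the two environment variables are collapsed into a single env_filter argument holding whichever one took effect.

```python
from typing import Optional

# Directive strings copied from the precedence list above.
DEBUG_FILTER = "mcptest=debug,mcptest_core=debug,mcptest_config=debug"
VERBOSE_FILTER = "mcptest=info,mcptest_core=info,mcptest_config=info"
DEFAULT_FILTER = "warn"


def resolve_filter(
    log_level: Optional[str] = None,   # value passed to --log-level, if any
    env_filter: Optional[str] = None,  # hypothetical: the env var value that took effect
    debug: bool = False,               # --debug was passed
    verbose: bool = False,             # --verbose was passed
) -> str:
    """Return the filter directive from the highest-precedence source that is set."""
    if log_level:
        return log_level
    if env_filter:
        return env_filter
    if debug:
        return DEBUG_FILTER
    if verbose:
        return VERBOSE_FILTER
    return DEFAULT_FILTER
```

Under this model, passing both --debug and --verbose resolves to the debug directives, and an explicit --log-level off silences either flag.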

Other logging knobs

What gets logged

Credentials are never logged. The connector and transport instrumentation redacts Authorization and any header name matching (?i)(token|secret|password|cookie|auth) to ***.
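
A minimal sketch of the redaction rule, assuming the pattern is applied as an unanchored search on the header name. The function name redact_headers is hypothetical; only the regular expression and the *** mask come from the paragraph above.

```python
import re
from typing import Dict

# Pattern quoted in the paragraph above. Authorization is named separately
# there, although it also matches the "auth" alternative.
SENSITIVE = re.compile(r"(?i)(token|secret|password|cookie|auth)")
MASK = "***"


def redact_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers with sensitive values masked for logging."""
    redacted = {}
    for name, value in headers.items():
        sensitive = name.lower() == "authorization" or SENSITIVE.search(name)
        redacted[name] = MASK if sensitive else value
    return redacted
```

Because the match is on the header name, a header such as Content-Type passes through unchanged even when its value happens to contain the word "token".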

--quiet

--reporter <FORMAT>

--output <PATH>

--annotations <WHEN>

--color <WHEN>

Configuration sources

--config <PATH>

--env-file <PATH>

--no-env-file

--var KEY=VALUE

--show-secrets

Test selection and execution

--filter <EXPR>

--parallel <N>

--timeout <SECONDS>

--retry <N>

--watch

--wait-for-ready[=DURATION]

Server target overrides

These flags let you change the server: block in the YAML suite at run time. They are useful for preview deploys and CI matrices where the YAML is authored without knowing the target URL.

--server-url <URL>

--server-command <CMD>

--server-auth-bearer-env <NAME>

--server-config <PATH>

HTTP transport

--header NAME=VALUE

--header-env NAME=VAR_NAME

--insecure-skip-verify

--ca-bundle <PATH>

--http-timeout <SECONDS>

--connect-timeout <SECONDS>

Proxy

Proxy flags apply to every outbound HTTP client mcptest builds: the StreamableHTTP and legacy SSE transports for MCP servers, plus the LLM provider clients (Anthropic, OpenAI, Google, Mistral, and any custom OpenAI-compat provider declared under providers:).

When no flag is set, reqwest reads HTTP_PROXY, HTTPS_PROXY, and NO_PROXY from the environment, so users behind a corporate proxy who already export those variables get the right behavior without changing anything. Use the flags below to override or disable.

--proxy <URL>

--http-proxy <URL>

--https-proxy <URL>

--no-proxy

--noproxy <HOSTLIST>

Verify what is in effect with mcptest doctor (prints a one-line proxy: summary) or mcptest run --print-config (includes the same summary plus the resolved test list).

Upload reporter

These flags are read by mcptest report --format upload. They parse on every command but have no effect outside that path.

--upload-endpoint <URL>

--upload-token-env <NAME>

--upload-organization <NAME>

Subcommands

run

mcptest [GLOBAL_OPTIONS] run

Description. Run the test suite. This is the primary command and the one you will type most often.

The runner loads the YAML config (default ./mcptest.yaml or whatever --config points to), resolves variables, applies any server-target overrides, and executes each test against the configured server. Results are printed by the reporter selected with --reporter (default pretty).

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --update-snapshots, -u | flag | optional | Rewrite every snapshot fixture encountered during the run. |
| --allow-update-in-ci | flag | optional | Permit --update-snapshots even when CI=true is set. |
| --models <ID,ID,...> | list | optional | Run the suite as a model matrix: every agent test fans across this comma-separated list (one cell per model), and the run defaults to the matrix reporter. See matrix-reporter.md. |
| --no-verdict-cache | flag | optional | Disable the LLM-judge verdict cache for this run. Overrides evals.cache.verdicts: true in YAML. |
| --coverage | flag | optional | Record per-tool, per-argument, per-error-path, and per-transport coverage during the run. Folds the result into JSON, pretty, markdown, and HTML reporters. |
| --coverage-threshold <SPEC> | string | optional | Quality gate against the coverage report. Accepts tools=80,args=60,error_paths=50,transports=100. Exits with code 6 when any dimension falls below its threshold. Requires --coverage. |
| --no-cache | flag | optional | Bypass the content-addressed cache for this run. Equivalent to passing both --no-cache-read and --no-cache-write. |
| --no-cache-read | flag | optional | Write fresh entries but ignore existing ones. Equivalent to "refresh the cache on this run." |
| --no-cache-write | flag | optional | Read existing entries but do not update the cache. Equivalent to "freeze the cache for this run." |
| --cache-filter <SET> | string | optional | Restrict execution to tests matching the named cache set. NEW is the only value v1 ships. |
| --record | flag | optional | For agent tests, dispatch every model live and overwrite the cassette on disk. Default behavior replays the cassette when present. |
| --bail, -x | flag | optional | Stop the runner after the first failing test. Subsequent tests are reported as skipped. |
| --maxfail <N> | integer | optional | Stop after the Nth failing test. Implies --bail when N=1. |
| --collect-only, --list-tests | flag | optional | Print discovered tests and exit 0 without running them. Honors --filter. Useful for verifying a selector before a long run. |
| --pass-with-no-tests | flag | optional | Treat "zero tests selected" as success. Without this flag, a run that picks nothing (after --filter, --shard, or --last-failed) exits 7. |
| --shard <INDEX/TOTAL> | string | optional | Run a deterministic slice of the discovered tests. One-based; --shard 1/3 runs the first third. The partition is stable across runs. Pair with --pass-with-no-tests so a worker with an empty slice does not fail the CI matrix. |
| --last-failed, --lf | flag | optional | Run only the tests that failed on the previous invocation. Reads .mcptest/last-run.json (rewritten after every run). |
| --failed-first, --ff | flag | optional | Reorder the test list so previous failures run first. Every selected test still runs. Mutually exclusive with --last-failed. |
| --print-config | flag | optional | Print the resolved suite (servers, providers, budget, selected tests) and exit. Provider API key env vars are listed by name, never resolved values. |
| --tag <NAME> | string (repeatable) | optional | Run only tests whose YAML tags: list contains NAME. Multiple --tag flags are OR'd. |
| --skip-tag <NAME> | string (repeatable) | optional | Drop tests whose tags: list contains NAME. Applied after --tag so a test matching both is dropped. |
| --random | flag | optional | Shuffle the test order to surface hidden ordering dependencies. The seed is logged so you can reproduce with --seed N. Conflicts with --failed-first. |
| --seed <N> | integer | optional | Pin the shuffle seed for --random. Implies --random. |
| --changed | flag | optional | Run only tests whose YAML or referenced cassette changed in the git working tree against --changed-base (default origin/main). Outside a git repo the selection is empty, so pair with --pass-with-no-tests. |
| --changed-base <REF> | git ref | optional | Base ref for --changed. Default origin/main. |
| -- <SERVER_ARG>... | trailing args | optional | Any argument after -- is appended to every stdio server's command line. Ignored for HTTP / SSE servers. |
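
The --coverage-threshold spec from the table above is a comma-separated list of dimension=percent pairs. The sketch below shows one plausible way to parse the spec and find the dimensions that would trip exit code 6. The function names are hypothetical, and treating a dimension missing from the report as 0 percent is an assumption rather than documented behavior.

```python
from typing import Dict, List

# Dimension names taken from the example spec in the table above.
DIMENSIONS = ("tools", "args", "error_paths", "transports")


def parse_threshold_spec(spec: str) -> Dict[str, float]:
    """Parse 'tools=80,args=60' into {'tools': 80.0, 'args': 60.0}."""
    thresholds: Dict[str, float] = {}
    for part in spec.split(","):
        name, sep, value = part.strip().partition("=")
        if not sep or name not in DIMENSIONS:
            raise ValueError(f"bad threshold entry: {part!r}")
        thresholds[name] = float(value)
    return thresholds


def failing_dimensions(coverage: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the dimensions whose measured percentage is below its floor."""
    # Assumption: a dimension absent from the report counts as 0 percent.
    return [dim for dim, floor in thresholds.items() if coverage.get(dim, 0.0) < floor]
```

A non-empty result from failing_dimensions corresponds to the exit-6 case; a dimension that sits exactly on its threshold passes.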

Agent-test env vars. Auto-detected provider families read these at run time. Missing keys are not an error; the runner falls back to a deterministic stub so CI stays green.

| Family | Env var | Notes |
| --- | --- | --- |
| Anthropic | ANTHROPIC_API_KEY | Claude models (claude-*). |
| OpenAI | OPENAI_API_KEY plus optional OPENAI_ORG_ID | gpt-*, chatgpt-*, o<digit>*, text-*, davinci-*. |
| Google | GEMINI_API_KEY (falls back to GOOGLE_API_KEY) | gemini-*. |
| Mistral | MISTRAL_API_KEY | mistral-*, codestral-*, etc. |

Custom OpenAI-compatible endpoints (Azure, OpenRouter, vLLM, LiteLLM, Together, Groq, Anyscale, Fireworks) are declared under top-level providers: in the YAML and reference whatever env var name you give them. See docs/models.md.

Examples.

# Smallest possible invocation; uses ./mcptest.yaml.
mcptest run

# Run a specific suite with JUnit output for CI.
mcptest --config tests/mcp.yaml --reporter junit --output reports/junit.xml run

# Point the suite at a preview deploy and wait for readiness.
mcptest --server-url https://preview-42.example.com \
        --wait-for-ready=2m \
        run

# Iterate on a single tag locally with debug logging.
mcptest --debug --filter '@smoke' run

# Record coverage and gate the run on a two-dimension threshold.
mcptest run --coverage --coverage-threshold tools=80,args=60

# Record agent cassettes against every key set in the environment.
ANTHROPIC_API_KEY=... OPENAI_API_KEY=... mcptest run --record

# Show which tests the runner would execute without running them.
mcptest --filter weather run --collect-only

# Stop at the first failure (developer's inner loop).
mcptest run --bail

# Iterate on yesterday's failures only; bail on the first one.
mcptest run --last-failed --bail

# CI matrix: three workers each take a stable third of the suite.
mcptest run --shard 1/3 --pass-with-no-tests

# See what the runner would dispatch without connecting to anything.
mcptest run --print-config

# Gate the run on a saved timing baseline.
mcptest run --check-baseline mcptest.timing-baseline.yml --tolerance-pct 25
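
For the --shard examples above, the sketch below illustrates what a stable, one-based partition can look like. The actual partitioning algorithm is not described on this page; sorting the test names and cutting contiguous slices is an assumption chosen only to match "--shard 1/3 runs the first third", and both function names are hypothetical.

```python
from typing import List, Tuple


def parse_shard(spec: str) -> Tuple[int, int]:
    """Parse 'INDEX/TOTAL' (one-based) into a pair of integers."""
    index_text, _, total_text = spec.partition("/")
    index, total = int(index_text), int(total_text)
    if not 1 <= index <= total:
        raise ValueError(f"shard index out of range: {spec!r}")
    return index, total


def shard_slice(tests: List[str], index: int, total: int) -> List[str]:
    """Return the contiguous slice of the sorted test list owned by this shard."""
    ordered = sorted(tests)  # sort so the slice does not depend on discovery order
    start = (index - 1) * len(ordered) // total
    end = index * len(ordered) // total
    return ordered[start:end]
```

With fewer tests than shards, some slices come back empty, which is the situation --pass-with-no-tests is meant to cover.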

Timing baseline (--check-baseline). Compares the run against a saved baseline file (written by mcptest baseline update). Any test whose elapsed milliseconds exceed p90 * (1 + tolerance_pct / 100) prints a regression line, and the run exits with code 1. The default tolerance is 0 (any overrun is a regression); on a busy CI runner, a tolerance of 25 or 50 absorbs noise. Cache hits and skipped tests are excluded from the check.
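
The regression rule above is a single comparison. A short sketch, with hypothetical function names, makes the boundary behavior explicit: a test that lands exactly on the allowed limit is not a regression, because the rule says "exceed".

```python
def allowed_ms(p90_ms: float, tolerance_pct: float = 0.0) -> float:
    """Upper limit for a test's elapsed time: p90 * (1 + tolerance_pct / 100)."""
    return p90_ms * (1 + tolerance_pct / 100)


def is_regression(elapsed_ms: float, p90_ms: float, tolerance_pct: float = 0.0) -> bool:
    """True when the elapsed time exceeds the tolerated limit."""
    return elapsed_ms > allowed_ms(p90_ms, tolerance_pct)
```

For a baseline p90 of 1000 ms and --tolerance-pct 25, the limit is 1250 ms, so a 1300 ms run is flagged and a 1250 ms run is not.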

Exit codes. 0 on success, 1 on any test failure or baseline regression, 2 on configuration error, 3 on --wait-for-ready timeout, 6 on --coverage-threshold miss, 7 on "no tests selected" (use --pass-with-no-tests to treat as success), 124 on a hard runner timeout.

baseline

Manage the timing-baseline file the mcptest run --check-baseline gate reads. v1 ships one subcommand:

mcptest baseline update [--samples 20] --from-report run.json BASELINE

update reads a saved canonical JSON run report (mcptest run --reporter json --output run.json produces one) and folds each non-skipped, non-cache-hit test row into the baseline file. When the baseline file does not exist, it is bootstrapped from the report; when it does, the per-test p50 / p90 are blended via a rolling window (--samples, default 20). Tests present in the baseline but absent from this run are preserved so a transient --tag filter does not silently delete them.

Workflow.

# Refresh the baseline from a clean run.
mcptest run --reporter json --output run.json
mcptest baseline update --from-report run.json mcptest.timing-baseline.yml

# Gate subsequent runs against the saved baseline.
mcptest run --check-baseline mcptest.timing-baseline.yml --tolerance-pct 25

conformance

Score a running MCP server against the vendored SEP corpus, refresh the corpus from upstream, or list which check ids the corpus carries. Three subcommands:

mcptest conformance run [FLAGS]          # score against the corpus
mcptest conformance refresh [FLAGS]      # pull the latest SEPs
mcptest conformance check-ids [FLAGS]    # list check ids

The corpus ships baked into the binary at compile time, so a cargo install mcptest user can score offline. The resolution order for the corpus directory is: --corpus-dir if set, else the XDG user cache (~/.cache/mcptest/conformance/), else the embedded fallback. Each subcommand surfaces corpus_source on its report so a reader can tell which path the run used.
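
The resolution order above can be modeled as a three-step fallback. In this sketch the source labels ("corpus-dir", "user-cache", "embedded") and the function name are hypothetical, and it assumes the user cache is used only when the directory exists on disk.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_corpus(corpus_dir: Optional[str], cache_dir: Path) -> Tuple[str, Optional[Path]]:
    """Return (corpus_source, path); path is None for the embedded fallback."""
    if corpus_dir is not None:
        return ("corpus-dir", Path(corpus_dir))  # explicit --corpus-dir wins
    if cache_dir.is_dir():                       # assumption: cache must exist to be used
        return ("user-cache", cache_dir)
    return ("embedded", None)                    # corpus baked into the binary
```

The first element of the result plays the role of the corpus_source field that each subcommand surfaces on its report.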

conformance run flags.

| Flag | Default | Purpose |
| --- | --- | --- |
| --server <URL> | none | MCP server to probe. Wire-probe integration ships as a follow-up; the v1 report scores from the corpus content only. |
| --target-version <V> | latest available | Which spec revision to score. Defaults to the lexicographically greatest version available locally. |
| --corpus-dir <PATH> | (resolution order) | Override the corpus location. |
| --format <FORMAT> | pretty | One of pretty, json, markdown, html. |
| --out <PATH> | stdout | Where to write the report. |
| --auto-refresh | off | Trigger refresh when the requested version is missing locally. Off by default so a run never silently makes a network call. |

conformance refresh flags.

| Flag | Default | Purpose |
| --- | --- | --- |
| --target-version <V> | latest | Spec version to fetch. latest resolves to the newest entry in mcptest_core::conformance::releases::KNOWN_RELEASES. |
| --corpus-dir <PATH> | user cache | Destination. Never special-cases the in-repo crates/mcptest-core/seps/ path; maintainers pass it explicitly when refreshing the vendored copy. |
| --url <URL> | upstream | Override the upstream repository. |
| --ref <REF> | (from KNOWN_RELEASES) | Pin to a specific tag or SHA. |
| --source-path <PATH> | src/seps | Subdirectory in the upstream tree to mirror. |
| --dry-run | off | Print what would be fetched and where, without writing. |

The refresh transport is an HTTPS GET to codeload.github.com/<owner>/<repo>/tar.gz/<ref>, extracted in memory. Set GITHUB_TOKEN to lift GitHub's 60-req/hr anonymous rate limit.

conformance check-ids flags.

| Flag | Default | Purpose |
| --- | --- | --- |
| --target-version <V> | latest available | Which corpus to inspect. |
| --corpus-dir <PATH> | (resolution order) | Override the corpus location. |
| --missing-only | off | Print only the unimplemented check ids. |
| --format <FORMAT> | pretty | One of pretty or json. |

Examples.

# Score the embedded corpus and write the JSON envelope.
mcptest conformance run --format json --out report.json

# Refresh the user cache from upstream (no token needed for the
# default rate-limited path).
mcptest conformance refresh

# List the check ids the runner has not implemented yet.
mcptest conformance check-ids --missing-only

init

mcptest [GLOBAL_OPTIONS] init [--with-jury] [--force] [--url URL | --from-discovered NAME]

Description. Scaffold a new mcptest project in the current directory. Creates a starter tests/example.yaml and a mcptest.yaml config. Safe to run inside an empty directory; refuses to overwrite existing files unless --force is supplied. --url switches to a URL-target template; --from-discovered scaffolds from a server found in a local MCP client config (see doctor).

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --with-jury | flag | optional | Append a v1.0 LLM-judge example to tests/example.yaml. The block is marked as a v1.0 feature in a comment so users understand it is forward-looking. |
| --force | flag | optional | Overwrite existing files. Default behavior is to refuse. |
| --from-discovered <NAME> | string | optional | Scaffold a stdio suite from a server discovered in a local MCP client config (the names mcptest doctor lists). Conflicts with --url. Env-var names are surfaced as a comment; their values are never copied into the scaffold. |

Examples.

# Scaffold a project in the current directory.
mkdir my-mcp-tests && cd my-mcp-tests
mcptest init

# Include the v1.0 jury example.
mcptest init --with-jury

# Scaffold from a server already configured in a local MCP client.
mcptest init --from-discovered github

# Overwrite stale scaffolding.
mcptest init --force

Exit codes. 0 on success, 2 when a target file already exists and --force was not supplied.

Status. Working.

doctor

mcptest [GLOBAL_OPTIONS] doctor [--no-tool-tokens] [--tokenizer NAME]

Description. Diagnose the local environment and server connectivity. Lists which dotenv files were discovered, how many variables came from each source, and (when wired) the cost of the server's tools/list catalog measured in tokens. It also prints a test-readiness inventory of the MCP servers found across local client configs (Claude Desktop, Claude Code, Cursor, VS Code, Windsurf, Codex), showing identity, transport, and presence-of-auth only, with secrets redacted (see server discovery).

Alongside the token total, doctor reports a tool-search posture signal: friendly or heavy. A catalog is friendly when its token cost is at or under 20K and it advertises ten tools or fewer; otherwise it is heavy. The 20K threshold sits well under the roughly 55K-token cost of a five-server MCP setup described in Anthropic's advanced-tool-use guidance. That guidance also reports real systems reaching about 134K tokens, and the Tool Search Tool deferring definitions to cut catalog token cost by about 85 to 95 percent. A friendly catalog is cheap enough to load up front; a heavy catalog is large enough that deferred loading (tool search) would pay off.
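
The posture rule reduces to two comparisons that must both hold. A minimal sketch, assuming "20K" means 20,000 tokens; the constant and function names are hypothetical.

```python
FRIENDLY_MAX_TOKENS = 20_000  # assumption: "20K" read as 20,000 tokens
FRIENDLY_MAX_TOOLS = 10       # "ten tools or fewer"


def posture(catalog_tokens: int, tool_count: int) -> str:
    """Classify a tools/list catalog as 'friendly' or 'heavy'."""
    if catalog_tokens <= FRIENDLY_MAX_TOKENS and tool_count <= FRIENDLY_MAX_TOOLS:
        return "friendly"
    return "heavy"
```

Either limit alone is enough to tip a catalog into heavy: eleven small tools are heavy, and so are three tools with very long descriptions.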

The pure computations behind doctor (env discovery, tokenizer accounting, posture classification) are fully unit-tested. The live tools/list probe is still pending.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --no-tool-tokens | flag | optional | Disable the tool catalog token cost check. Use when the server is unreachable and you want the rest of the doctor report. |
| --tokenizer <NAME> | string | optional (default cl100k_base) | Override the tokenizer used for the catalog token cost check. Supported: cl100k_base (GPT-3.5/4), o200k_base (GPT-4o), gpt2, claude (currently aliased to cl100k_base), and whitespace (transport-free approximation). |
| --lint-descriptions | flag | in-flight | Run the catalog description quality lint. Not present in the v1.0 binary; it will get its own exit code when the upcoming release ships. |

Examples.

# Default doctor run.
mcptest doctor

# Skip the live tools/list probe (offline triage).
mcptest doctor --no-tool-tokens

# Use a specific tokenizer for the catalog cost report.
mcptest doctor --tokenizer o200k_base

# Combine with --verbose to also see resolver decisions.
mcptest --verbose doctor

Exit codes. 0 when the report renders successfully. 1 is reserved for a doctor probe that fails outright. 7 is reserved for --lint-descriptions failures.

Migration probe (--target-version). Pair --url with --target-version 2026-07-28 to run the migration doctor. It adds a one-shot initialize probe after the regular pipeline and reports one row per breaking change from the migration pair-corpus. v1 detects the deprecated capabilities (Roots / Sampling / Logging); other categories surface as [SKIP] with a one-line rationale and a follow-up ticket reference (stateless transport, schema validator, auth pack). Pair with mcptest lint for the offline YAML and cassette scan. A [FAIL] row gates CI (exit 1).

mcptest doctor --url https://mcp.example.com --target-version 2026-07-28

Status. Working. The tool catalog token check is wired but the live tools/list call is still pending; until then the handler reports that the check is wired but not yet runnable. The tool-search posture signal is computed from the same catalog token cost. --lint-descriptions is in-flight.

validate

mcptest [GLOBAL_OPTIONS] validate

Description. Validate the YAML config against the published JSON Schema (schemas/v1.json). Useful as a pre-commit hook and as the first step in any CI pipeline: it catches typos and structural mistakes before any test runs.

The path to validate is taken from the global --config flag so behavior is consistent with run.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| (none) | | | The file to validate is taken from --config or ./mcptest.yaml. |

Examples.

# Validate the default ./mcptest.yaml.
mcptest validate

# Validate a specific suite (useful in a multi-suite repo).
mcptest --config tests/integration/mcp.yaml validate

# Run validate as a pre-commit step.
git diff --cached --name-only | grep -qE '\.ya?ml$' && mcptest validate

Exit codes. 0 when the file parses and validates. 2 on a schema violation, broken ${VAR} reference, missing import, or unreadable file. Every finding from every layer is reported in a single pass.

Status. Working.

schema

mcptest [GLOBAL_OPTIONS] schema [--version v1]

Description. Emit the JSON Schema for the YAML config to stdout. Output is byte-equivalent to https://mcptest.sh/schema/v1.json. Use it to wire mcptest into IDEs (VS Code's YAML extension, IntelliJ) so authors get autocomplete and inline validation while they type.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --version <VERSION> | string | optional, default v1 | Schema version to emit. Only v1 is shipped today; a future v2 will land as a separate match arm. |

Examples.

# Print the schema to stdout.
mcptest schema

# Save it to a file for an editor or tool that consumes JSON Schema.
mcptest schema > .vscode/mcptest.schema.json

# Validate ad-hoc YAML against the schema using a third-party tool.
mcptest schema > mcptest.schema.json
check-jsonschema --schemafile mcptest.schema.json my-tests.yaml

Exit codes. 0 on success. 2 when --version names an unknown schema revision.

Status. Working.

coverage

mcptest [GLOBAL_OPTIONS] coverage [--threshold PERCENT] [--format FORMAT]

Description. Compute per-tool and per-resource coverage metrics for the server's surface. Reports which tools and resources were exercised by the test suite and which were skipped, so authors can spot dead corners of their server.

The pure computation in mcptest_core::coverage is fully unit-tested. Running it requires a runner that records which tools and resources were exercised at execution time.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --threshold <PERCENT> | float, 0 to 100 | optional | Quality gate threshold as a percentage. When set and the computed coverage is below the threshold, the runner exits with code 6. |
| --format <FORMAT> | enum {pretty, json} | optional, default pretty | Output format for the coverage report. pretty renders a human-friendly table; json emits the structured CoverageReport. |

Examples.

# Local check after a test run, pretty output.
mcptest coverage

# Hard gate at 80% coverage in CI.
mcptest coverage --threshold 80

# Machine-readable JSON for downstream tooling.
mcptest coverage --format json > coverage.json

# Combine with --filter to scope coverage to a subset.
mcptest --filter '@public' coverage --threshold 90

Exit codes. 0 when coverage is computed (and meets the threshold when one is set). 6 when --threshold is set and the report does not meet it.

Status. Stub in v1.0. The handler prints the requested threshold and format, then exits 0. Live wiring lands when the runner records exercised tools and resources.

report

mcptest [GLOBAL_OPTIONS] report <INPUT> [--format FORMAT] [--output PATH]

Description. Re-render a previously-saved JSON run as another reporter format. Saves a re-run when CI already captured the canonical JSON but a different consumer (a PR comment, a GitHub job summary, a SARIF importer) wants a different shape.

The input is the JSON written by mcptest run --output run.json --reporter json or any equivalent invocation. The redaction policy is re-applied at the dispatch site so every output shape shares one redacted view.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <INPUT> | filesystem path | yes | Path to a JSON report previously written by mcptest run. |
| --format <FORMAT> | enum, see below | optional, default pretty | Reporter format to render in. |
| --output <PATH> | filesystem path | optional | Write to this file instead of stdout. |

Accepted --format values:

| Format | Description |
| --- | --- |
| pretty | Human-friendly text output (default). |
| json | Pretty-printed JSON, round-trippable through the same model. |
| junit | JUnit XML suitable for dorny/test-reporter and CircleCI Insights. |
| md | GitHub-flavored Markdown for PR comments and job summaries. |
| html | Single-file HTML report with inline CSS. |
| sarif | SARIF 2.1.0 for GitHub code-scanning and similar consumers. See sarif-reporter.md. |
| gitlab | GitLab Code Quality JSON for merge request widgets. See gitlab-code-quality.md. |
| ndjson | Newline-delimited JSON: one test record per line, then a summary. For log pipelines and jq -c. |
| tap | Test Anything Protocol v14, for prove/tappy-style consumers. |
| matrix | Self-contained HTML test-by-model comparison grid. See matrix-reporter.md. |
| matrix-md | The comparison grid as GitHub-flavored Markdown. |
| upload | HTTPS upload of the canonical run envelope to --upload-endpoint (preview). |

Examples.

# Re-render a saved run as JUnit XML for CI.
mcptest report run.json --format junit --output junit.xml

# Produce a Markdown summary for a PR comment.
mcptest report run.json --format md --output pr-summary.md

# Generate an HTML report for sharing in Slack.
mcptest report run.json --format html --output run.html

# Ship a run envelope to a collector.
mcptest --upload-endpoint https://collector.example.com/v1/runs \
        --upload-token-env COLLECTOR_TOKEN \
        --upload-organization acme \
        report run.json --format upload

# Print the round-trippable JSON to stdout.
mcptest report run.json --format json

Exit codes. 0 on a successful render or upload. 1 when the input file is missing or malformed, or when the upload CLI is misconfigured (no endpoint, bad URL). 2 when the collector returns an error or declines the upload.

Status. Working. Every format above is wired. The upload format is documented as a preview because the wire envelope schema is not yet finalized.

eval

mcptest [GLOBAL_OPTIONS] eval [--max-cost USD] [--no-verdict-cache] [--explain]

Description. Run quality evaluations against an MCP server using an LLM judge. v1.0 ships single-judge mode: every entry in the evals: block is graded by one juror, the verdict and rationale are pretty-printed, and a cost budget tracks cumulative spend. Multi-juror consensus, bias mitigations, and inter-juror agreement are not part of single-judge mode; they are planned for a later release.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --max-cost <USD> | float | optional | Hard ceiling in USD across every LLM-judge call. Accepts an optional leading $. The runner stops dispatching new tests once cumulative spend would exceed the cap. |
| --no-verdict-cache | flag | optional | Disable the LLM-judge verdict cache for this run. Overrides evals.cache.verdicts: true in YAML. |
| --explain | flag | optional | Print what each eval would grade (rubric, candidate source, judge model, judge-call count) without calling any provider or spending tokens, then exit. |

Examples.

# Run every eval in mcptest.yaml.
mcptest eval

# Cap total LLM-judge spend at one dollar.
mcptest eval --max-cost '$1.00'

# Force fresh verdicts even when the YAML opts into caching.
mcptest eval --no-verdict-cache

Exit codes. 0 on success. 1 on a failed evaluation. 5 when the configured cost cap is exceeded.

Status. Working in single-judge mode. Multi-juror consensus follows in a later release.

diff

mcptest [GLOBAL_OPTIONS] diff <OLD> <NEW> [--format FORMAT] [--fail-on-breaking BOOL] [--scorecard]

Description. Diff two saved tools/list JSON snapshots and report which tools were added, removed, or reshaped, flagging each change as breaking or non-breaking so a CI job can fail loudly on a real regression. The --scorecard flag appends a release grade summarizing the diff.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <OLD> | filesystem path | yes | Path to the old snapshot (the baseline). |
| <NEW> | filesystem path | yes | Path to the new snapshot (the candidate). |
| --format <FORMAT> | enum {pretty, json, markdown} | optional, default pretty | Output format for the diff report. |
| --fail-on-breaking <BOOL> | boolean | optional, default true | Exit with code 1 when at least one change is breaking. Set to false for advisory CI output rather than a hard gate. |
| --scorecard | flag | optional, default off | Append a release scorecard (A+ / A / B / C / D / F letter grade plus per-tool added / removed / regressed callouts) to the diff output. |

Scorecard grading. Aggregates the diff into one letter grade:

| Grade | Trigger |
| --- | --- |
| A+ | No changes at all. Identical snapshots. |
| A | At least one safe change, no breaking changes. |
| B | Exactly one breaking change. |
| C | Two or three breaking changes. |
| D | Four or more breaking changes. |
| F | Any tool removed between old and new (the most disruptive change a server can ship). |

The grade matches the spirit of the compliance grade table so the two scorecards line up in marketing material.
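
The grade table can be read as a decision ladder. The sketch below assumes the F row is checked first, so a removed tool overrides every other count; the page does not say whether a removal is also counted as a breaking change, so removals are passed separately here. The function name and parameters are hypothetical.

```python
def release_grade(safe: int, breaking: int, removed: int) -> str:
    """Map diff counts to the letter grade from the scorecard table."""
    if removed > 0:      # assumption: a removed tool overrides every other row
        return "F"
    if breaking >= 4:
        return "D"
    if breaking >= 2:
        return "C"
    if breaking == 1:
        return "B"
    if safe > 0:
        return "A"
    return "A+"          # identical snapshots
```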

Examples.

# Local diff against the previous saved snapshot, pretty output.
mcptest diff snapshots/old.json snapshots/new.json

# Hard fail in CI when any change is breaking.
mcptest diff snapshots/main.json snapshots/pr.json --fail-on-breaking true

# Advisory output for a PR comment (does not fail the job).
mcptest diff snapshots/main.json snapshots/pr.json \
        --format markdown \
        --fail-on-breaking false > pr-comment.md

# Machine-readable JSON for downstream tooling.
mcptest diff snapshots/main.json snapshots/pr.json --format json

Exit codes. 0 when there are no breaking changes (or --fail-on-breaking false is set). 1 when --fail-on-breaking is true and at least one change is breaking, or when a snapshot file is missing or malformed.

Status. Working. The diff engine lives in mcptest_core::diff; the CLI handler loads the JSON snapshots, calls diff_tool_catalogs, and renders the report.

lint

mcptest [GLOBAL_OPTIONS] lint [PATH...] [--format pretty|json] [--no-fail]

Scan YAML suites and JSON cassettes for usage of features the MCP 2026-07-28 spec deprecates. One-shot, offline: no server is contacted. Walks every *.yaml / *.yml / *.json file under each PATH (defaults to the current directory) and emits one finding per hit. target/, node_modules/, and .git/ are skipped.

Detected patterns:

| Kind | Trigger | Replacement |
| --- | --- | --- |
| roots | roots/list method | Tool parameters, resource URIs, or server configuration. |
| sampling | sampling/createMessage method | Direct integration with LLM provider APIs. |
| logging | logging/setLevel or notifications/message method | stderr for stdio servers; OpenTelemetry for structured observability. |
| tasks-list | tasks/list method (removed in the Tasks extension lifecycle) | tasks/get polling on a server-issued task handle. |
| legacy-error-code | Literal -32002 error code | Standard JSON-RPC -32602 Invalid Params code. |

Exit code. 0 when no findings land or with --no-fail. 1 when any finding is reported. Useful as a CI gate.
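
To make the detection table and the exit-code rule concrete, here is a rough model of the scan. It uses plain substring matching per line, which is an assumption; the real linter may parse YAML and JSON structurally. The names DEPRECATED, scan_text, and exit_code are hypothetical.

```python
from typing import Dict, List, Tuple

# kind -> literal triggers, copied from the detected-patterns table.
DEPRECATED: Dict[str, List[str]] = {
    "roots": ["roots/list"],
    "sampling": ["sampling/createMessage"],
    "logging": ["logging/setLevel", "notifications/message"],
    "tasks-list": ["tasks/list"],
    "legacy-error-code": ["-32002"],
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, kind) for every deprecated trigger found."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for kind, triggers in DEPRECATED.items():
            if any(trigger in line for trigger in triggers):
                findings.append((number, kind))
    return findings


def exit_code(findings: List[Tuple[int, str]], no_fail: bool = False) -> int:
    """0 when clean or in advisory mode, 1 when any finding is reported."""
    return 0 if no_fail or not findings else 1
```

A tools/list call is not flagged, since only the triggers in the table are matched.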

Example.

mcptest lint examples/ docs/
mcptest lint --format json suites/ > deprecations.json
mcptest lint --no-fail .  # advisory mode

Status. Working. The migration doctor will add a live probe that complements this offline scan.

migrate

mcptest [GLOBAL_OPTIONS] migrate [PATH...] [--to 2026-07-28] [--write]

Rewrite YAML suites toward an MCP spec revision. v1 covers the 2026-07-28 target only; any other --to value is a clear error. Two kinds of rewrite ship:

  1. Annotation. Every deprecated-feature hit gets a # TODO(mcptest-migrate) comment inserted immediately above the offending line, pointing at the replacement guidance from the migration corpus. The original line is left intact so the file still parses and so the operator can apply the human-judgement rewrite.
  2. Mechanical rewrite. The legacy -32002 JSON-RPC error code has a safe one-to-one replacement (-32602 Invalid Params). The migrator annotates the line with a TODO and substitutes the literal token.

The same deprecation catalog mcptest lint uses drives the migrator, so a run that lints clean migrates as a no-op.
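
The two rewrite kinds can be illustrated on a list of lines. This is a single-pass sketch under stated assumptions: the comment wording after the # TODO(mcptest-migrate) marker is invented for illustration, only one annotate-only trigger is included, and matching is by substring rather than by parsed YAML.

```python
from typing import List

MARKER = "# TODO(mcptest-migrate)"  # marker named in the text above
LEGACY_CODE, NEW_CODE = "-32002", "-32602"

# Hypothetical guidance text; the real tool reads it from its migration corpus.
ANNOTATE_ONLY = {"roots/list": "use tool parameters, resource URIs, or server configuration"}


def migrate_lines(lines: List[str]) -> List[str]:
    """Annotate deprecated lines and rewrite the legacy error code."""
    out: List[str] = []
    for line in lines:
        indent = line[: len(line) - len(line.lstrip())]
        if LEGACY_CODE in line:
            # Mechanical rewrite: add a TODO, then substitute the literal token.
            out.append(f"{indent}{MARKER} {LEGACY_CODE} replaced with {NEW_CODE} Invalid Params")
            out.append(line.replace(LEGACY_CODE, NEW_CODE))
            continue
        for trigger, guidance in ANNOTATE_ONLY.items():
            if trigger in line:
                # Annotation only: the original line is kept intact below.
                out.append(f"{indent}{MARKER} {trigger} is deprecated: {guidance}")
                break
        out.append(line)
    return out
```

Input with no deprecated triggers comes back unchanged, which mirrors the no-op behavior described for suites that lint clean.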

Flags.

| Flag | Default | Description |
| --- | --- | --- |
| --to <VERSION> | 2026-07-28 | Target spec revision. v1 supports 2026-07-28 only. |
| --write | off | Apply the rewrites in place. Default is dry-run (print the per-file action plan). |

Exit code. 0 on a successful run (writes applied or dry-run completed). 2 if --to names an unsupported target version.

Example.

mcptest migrate examples/             # dry-run, prints what would change
mcptest migrate --write suites/       # apply the rewrites in place

Status. Working: YAML annotation plus the legacy error-code rewrite. Cassette rewrites land in a later release alongside the streamable-HTTP transport rerouting.

discover

mcptest [GLOBAL_OPTIONS] discover <SERVER> [--output PATH] [--bearer-token-env NAME]

Description. Connect to an MCP server, run the initialize handshake, and call tools/list, resources/list, and prompts/list. Pretty-prints the discovered capabilities to stderr and writes a starter tests.yaml with one smoke test per tool.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <SERVER> | URL or name=url pair | yes | MCP server endpoint. Bare URLs are labelled discovered; name=https://... lets you pick a friendlier server name in the scaffolded YAML. |
| --output <PATH> | filesystem path | optional, default tests.yaml | Path for the scaffolded suite. |
| --bearer-token-env <NAME> | string | optional | Read a bearer token from this env var and send it as Authorization: Bearer <value> during the probe. |
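
The two accepted shapes of <SERVER> can be told apart with plain parameter expansion. The sketch below is illustrative only: split_server_arg is a hypothetical helper, and the rule it uses (an = that appears before any :// marks a name=url pair) is an assumption about how the two forms might be distinguished, not a description of the mcptest parser.

```shell
# split_server_arg <SERVER>
# Prints "<name> <url>". Bare URLs get the label "discovered".
# Hypothetical helper; the disambiguation rule is an assumption.
split_server_arg() {
  case "$1" in
    *=*) _ssa_prefix=${1%%=*} ;;          # text before the first "="
    *)   echo "discovered $1"; return 0 ;;
  esac
  case "$_ssa_prefix" in
    *://*) echo "discovered $1" ;;         # the "=" belongs to the URL itself
    *)     echo "$_ssa_prefix ${1#*=}" ;;  # name=url pair
  esac
}
```

Under that rule a URL with a query string, such as https://host.example/mcp?a=b, is still treated as a bare URL rather than a pair.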

Examples.

# Scaffold against a local server.
mcptest discover http://localhost:8080/mcp

# Scaffold against an authenticated server, into a custom path.
MCP_TOKEN=abc mcptest discover https://api.example.com/mcp \
        --bearer-token-env MCP_TOKEN \
        --output suites/example.yaml

Exit codes. 0 on success, 1 when the handshake fails or the server is unreachable.

Status. Working.

completions

mcptest [GLOBAL_OPTIONS] completions <SHELL>

Description. Emit a shell completion script for the chosen shell. Lives in this subcommand rather than as a flag so users can pipe straight into a shell init file.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <SHELL> | enum {bash, zsh, fish, powershell, elvish} | yes | Shell to emit completions for. Backed by clap_complete::Shell. |

Examples.

# Bash, system-wide.
mcptest completions bash | sudo tee /etc/bash_completion.d/mcptest > /dev/null

# Zsh, per-user.
mcptest completions zsh > ~/.zsh/completions/_mcptest

# Fish, per-user.
mcptest completions fish > ~/.config/fish/completions/mcptest.fish

# PowerShell.
mcptest completions powershell | Out-String | Invoke-Expression

Exit codes. 0 on success. clap rejects unknown shells with its standard error path (exit code 2).

Status. Working. The handler is a single call into clap_complete.

model-compat

mcptest [GLOBAL_OPTIONS] model-compat <SUBCOMMAND>

Description. Capture, diff, and replay model-compatibility baselines. The v1.0 headline workflow: snapshot the suite against a model, then compare a later run against that snapshot to classify every assertion as PASS, DRIFT, or FAIL.

Subcommands.

| Subcommand | Purpose |
| --- | --- |
| capture | Write a baseline JSON file for the given model. |
| diff | Compare two saved baselines and render the result. |
| run | Re-run the suite against a candidate and diff against the baseline. |

mcptest model-compat capture.

mcptest model-compat capture --model ID --output PATH --input PATH [--filter GLOB]

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --model <ID> | string | yes | Provider-qualified model identifier (for example anthropic:claude-sonnet-4.5). |
| --output <PATH> | filesystem path | yes | Destination for the baseline JSON. |
| --input <PATH> | filesystem path | yes | Source baseline file from the runner. The live runner integration is a follow-up ticket; for v1.0 this flag lets capture ride on a pre-built baseline so the CLI surface is exercised end to end. |
| --filter <GLOB> | string | optional | Narrow the captured assertion list with a *-style glob. |

mcptest model-compat diff.

mcptest model-compat diff <BASELINE> <CANDIDATE> [--format FORMAT] [--filter GLOB]

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <BASELINE> | filesystem path | yes | The baseline (left-hand side). |
| <CANDIDATE> | filesystem path | yes | The candidate (right-hand side). |
| --format <FORMAT> | enum {pretty, json} | optional, default pretty | Output format. JSON renders the full BaselineDiff. |
| --filter <GLOB> | string | optional | Narrow which assertion ids appear in the diff. |

mcptest model-compat run.

mcptest model-compat run --baseline PATH --model ID --candidate PATH

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --baseline <PATH> | filesystem path | yes | Saved baseline to diff against. |
| --model <ID> | string | yes | Candidate model identifier; printed in the report header. |
| --candidate <PATH> | filesystem path | yes | Pre-captured candidate baseline. The live runner lands in a follow-up ticket; for v1.0 this flag lets run exercise the PASS/DRIFT/FAIL exit handling. |

Examples.

# Capture a baseline against the current production model.
mcptest model-compat capture \
        --model anthropic:claude-sonnet-4.5 \
        --output baselines/sonnet-4.5.json \
        --input artifacts/last-run.json

# Compare two saved baselines as JSON for a CI gate.
mcptest model-compat diff baselines/sonnet-4.5.json baselines/sonnet-5.0.json --format json

# Diff a pre-captured candidate against the baseline and exit per the rubric.
mcptest model-compat run \
        --baseline baselines/sonnet-4.5.json \
        --model anthropic:claude-sonnet-5.0 \
        --candidate artifacts/sonnet-5.0.json

Exit codes. 0 PASS (every assertion classified PASS). 6 DRIFT (at least one DRIFT, no FAIL). 1 FAIL (any invariant violated or assertion missing).
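
A CI step can branch on these three codes instead of parsing the report. The following is a minimal sketch in POSIX shell; classify_model_compat and the labels it prints are hypothetical names chosen for illustration, not part of mcptest.

```shell
# classify_model_compat <exit-status>
# Maps the documented model-compat exit codes to a label.
# Hypothetical helper; only the 0 / 6 / 1 mapping comes from this page.
classify_model_compat() {
  case "$1" in
    0) echo "PASS" ;;
    6) echo "DRIFT" ;;
    1) echo "FAIL" ;;
    *) echo "UNEXPECTED" ;;   # anything else, for example 2 from clap
  esac
}

# Usage sketch (not executed here):
#   mcptest model-compat run --baseline B --model ID --candidate C
#   verdict=$(classify_model_compat "$?")
```

Treating DRIFT as a warning and FAIL as a hard stop is a policy choice for the pipeline; the sketch only names the outcome.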

Status. Working. Library entry points live in mcptest_core::model_compat; the CLI dispatches to commands::model_compat.

compliance

mcptest [GLOBAL_OPTIONS] compliance <SUBCOMMAND>

Description. Score an MCP server against the compliance rubric and render the result in one of four formats. The score is the same ComplianceScore from mcptest_core::compliance::scoring; the four renderers in mcptest_core::compliance::renderers (pretty, JSON, Markdown, HTML) own the presentation. When a baseline is supplied the four BaselineDecision outcomes drive the exit code so CI can stay green while a known set of MUSTs is still pending.

Subcommands.

| Subcommand | Purpose |
| --- | --- |
| run | Run the compliance corpus and render the score. |
| invariants | Evaluate spec-derived conformance invariants over a captured session, plus multi-server composition-safety checks. |

mcptest compliance run.

mcptest compliance run \
        --results-from PATH \
        [--format FORMAT] \
        [--server-label LABEL] \
        [--registry PATH] \
        [--capabilities LIST] \
        [--baseline PATH | --expected-failures PATH] \
        [--update-baseline] [--yes]

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --results-from <PATH> | filesystem path | yes | JSON list of CheckResult records produced by the runner. The live runner integration lands in a follow-up ticket; for v1.0 this flag lets the score and reporter surface ride on a pre-built artifact. |
| --format <FORMAT> | enum {pretty, json, markdown, html} | optional, default pretty | Output format. |
| --server-label <LABEL> | string | optional | Label printed in the report header. Falls back to the global --server-url when omitted. |
| --registry <PATH> | filesystem path | optional, default compliance/registry.yml | Path to the rule registry YAML. |
| --capabilities <LIST> | comma list | optional | Capabilities the server declared during initialize (for example tools,resources). Drives section applicability. |
| --baseline <PATH> | filesystem path | optional | Baseline file listing expected failures. Layers the four BaselineDecision outcomes on top of the run. |
| --expected-failures <PATH> | filesystem path | optional | Alias for --baseline. |
| --update-baseline | flag | optional | Rewrite the baseline file from the current run after a confirmation prompt. Requires --baseline (or --expected-failures). |
| --yes | flag | optional | Skip the confirmation prompt for --update-baseline. |

Exit codes. Without a baseline: 0 when no MUSTs failed, 1 when at least one did. With a baseline:

| Decision | Meaning | Exit |
| --- | --- | --- |
| NormalPass | Check passed and is not on the baseline. | 0 |
| ExpectedFailure | Check failed but is on the baseline. | 0 |
| NewRegression | Check failed and is NOT on the baseline. | 1 |
| StaleBaseline | Check passed but is still on the baseline. | 1 |
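
The four outcomes are the cross product of two facts about a check: whether it passed and whether it is listed on the baseline. The sketch below restates the table as a lookup; baseline_decision is a hypothetical helper and its yes/no arguments are an illustrative encoding, not the BaselineDecision type from mcptest_core.

```shell
# baseline_decision <passed: yes|no> <on_baseline: yes|no>
# Prints "<Decision> <exit>" for one check, following the table above.
# Hypothetical helper for illustration only.
baseline_decision() {
  case "$1:$2" in
    yes:no)  echo "NormalPass 0" ;;
    no:yes)  echo "ExpectedFailure 0" ;;
    no:no)   echo "NewRegression 1" ;;
    yes:yes) echo "StaleBaseline 1" ;;
    *)       echo "usage: baseline_decision yes|no yes|no" >&2; return 2 ;;
  esac
}
```

StaleBaseline exiting 1 is what keeps the baseline honest: a check that starts passing has to be removed from the file before CI goes green again.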

Examples.

# Render the score as Markdown for a PR comment.
mcptest compliance run \
        --results-from artifacts/compliance.json \
        --format markdown \
        --capabilities tools,resources

# Gate CI on a baseline so known failures stay green.
mcptest compliance run \
        --results-from artifacts/compliance.json \
        --baseline compliance-baseline.yml

# Regenerate the baseline after a deliberate cleanup.
mcptest compliance run \
        --results-from artifacts/compliance.json \
        --baseline compliance-baseline.yml \
        --update-baseline --yes

mcptest compliance invariants.

mcptest compliance invariants --capture PATH [--format FORMAT]

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --capture <PATH> | filesystem path | yes | JSON capture file. A single session object runs the per-server invariants; an array of sessions also runs the multi-server composition-safety checks. |
| --format <FORMAT> | enum {pretty, json} | optional, default pretty | Output format. |

The invariants are the INV-NNN family (handshake ordering, capability attestation, tool result-shape, JSON-RPC error envelopes). With two or more sessions the command also checks tool-namespace overlap and shared-transport id collisions. The capture is read from disk so the run is deterministic and contacts no server. Exit 0 when every invariant passes with no composition hazard, 1 otherwise. See docs/conformance-invariants.md.

Status. Working. Library entry points live in mcptest_core::compliance; the CLI dispatches to commands::compliance.

pipe

mcptest pipe <PIPELINE> [--url URL] [--bearer-token-env NAME]
                        [--var KEY=VALUE ...] [--format pretty|json]
                        [--dry-run [--estimate-cost]]
                        [--max-cost USD] [--max-tokens N]
                        [--max-cost-per-call USD] [--max-duration DURATION]
                        [--on-budget-exceeded stop|continue|warn]
                        [--pricing-table PATH]

Description. Run a declarative multi-step tool-call pipeline. Each step calls a tool, extracts values from earlier steps by reference, and binds them into later steps. The full pipeline YAML grammar, the reference expression language (${step.field}, ${var.X}, ${env.X}), the when: guard, on_error, and the cumulative budget controls are documented in reference/pipe.md.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <PIPELINE> | path | yes | The pipeline YAML file. |
| --url <URL> | string | optional | MCP server the pipeline's tool calls target. |
| --bearer-token-env <NAME> | string | optional | Env var holding a bearer token for every request. |
| --var <KEY=VALUE> | repeatable | optional | Inject a variable referenced as ${var.KEY}. |
| --format <FORMAT> | enum {pretty, json} | optional | pretty prints the last step's result; json prints the full execution trace. |
| --dry-run | flag | optional | Print the planned execution without making any tool calls. |
| --estimate-cost | flag | optional | With --dry-run, also print a projected cost estimate. |
| --max-cost <USD> | float | optional | Aggregate USD ceiling across all steps. |
| --max-tokens <N> | integer | optional | Aggregate token ceiling (input + output) across all steps. |
| --max-cost-per-call <USD> | float | optional | Per-step USD ceiling, layered on the aggregate cap. |
| --max-duration <DURATION> | duration | optional | Wall-clock ceiling (30s, 2m, 1h). |
| --on-budget-exceeded <MODE> | enum | optional | stop (default) fails fast, continue runs to completion then exits non-zero, warn runs and exits 0. |
| --pricing-table <PATH> | path | optional | Override the bundled pricing.yaml used for cost estimates. |

Example.

mcptest pipe examples/pipe-search-then-update.yml \
  --url http://localhost:8000/mcp --var USER_QUERY=alice --max-cost 0.50

Status. Working (pipelines and budgets).

tools, resources, prompts, capabilities

mcptest tools        --url URL [--bearer-token-env NAME] [--format pretty|json]
mcptest tools call <TOOL> --url URL [--args JSON | --args-from-stdin | --arg NAME=$.path ...]
                          [--bind NAME=$.path ...] [--then TOOL [--then-arg NAME=VALUE|NAME=:bound]...]
                          [--select $.path] [--json] [--max-cost USD]
mcptest resources    --url URL [--format pretty|json]
mcptest prompts      --url URL [--format pretty|json]
mcptest capabilities --url URL [--format pretty|json]

Description. Introspect a live server's catalog. The bare tools / resources / prompts forms list the catalog; capabilities prints the initialize capability block. tools call <TOOL> runs one tool imperatively with the chaining primitives so a shell pipeline can extract and forward values without reaching for jq.

tools call arguments.

| Argument | Type | Description |
| --- | --- | --- |
| <TOOL> | string | Tool to call. |
| --url <URL> | string | MCP server endpoint. |
| --args <JSON> | string | Literal args object as a JSON string. |
| --args-from-stdin | flag | Read the entire args object from stdin JSON. Conflicts with --args. |
| --arg <NAME=$.path> | repeatable | Extract a value from stdin JSON by JSONPath and use it as the named arg. |
| --bind <NAME=$.path> | repeatable | Capture a value from this call's output for a chained --then step. |
| --then <TOOL> | string | Chain a second tool call within the same invocation. |
| --then-arg <NAME=VALUE> | repeatable | Argument for the --then step. NAME=:bound references a --bind capture. |
| --select <$.path> | string | Project the output by JSONPath before printing. |
| --json | flag | Emit JSON instead of pretty output. |
| --max-cost <USD> | float | Aggregate USD ceiling across both calls in this invocation. |

Example.

# JSONPath arguments are single-quoted so the shell does not expand $ or glob [0].
mcptest tools call search --url "$URL" --args '{"query":"alice"}' \
  --bind 'user_id=$.results[0].id' \
  --then fetch_user --then-arg user_id=:user_id --select '$.email'

Status. Working (introspection and chaining).

inspect

mcptest inspect --url URL [--bearer-token-env NAME]
mcptest inspect -- <command> [args...]

Connect to one MCP server and explore it in an interactive REPL: the terminal sibling of the one-shot tools / resources / prompts / capabilities / discover commands. Target the server either over streamable HTTP (--url) or over stdio (everything after -- is the server command). The REPL reads commands from stdin, so a piped script drives it the same way an interactive session does.

REPL commands (type help to list them in-session):

| Command | Action |
| --- | --- |
| tools / ls | List tools. |
| call <tool> [json] | Call a tool with a JSON-object argument (default {}). |
| resources | List resources. |
| read <uri> | Read a resource. |
| prompts | List prompts. |
| prompt <name> [json] | Get a prompt. |
| capabilities / caps | Show the server capability block. |
| notifications / notif | Show notifications received this session. |
| discover [path] | Scaffold a tests.yaml from the live catalog (default tests.yaml). |
| quit / exit / q | Leave the session. |

Live server activity is surfaced as it arrives. Notifications (logging, progress, list_changed) print with a <- notification prefix. Server-initiated requests are fulfilled automatically so a server that drives the client can be exercised end to end: sampling/createMessage returns a stub assistant message (no real model call), elicitation/* is declined, and roots/list returns an empty list. inspect advertises the matching client capabilities during the handshake so the server knows it may use them.

Example.

# Stdio server
mcptest inspect -- npx -y @modelcontextprotocol/server-everything

# HTTP server, scripted (non-interactive)
printf 'tools\ncall search {"query":"alice"}\nquit\n' \
  | mcptest inspect --url "$URL" --bearer-token-env MCP_TOKEN

Status. Working (terminal-only; web viewers are out of scope).

mcp-server

mcptest mcp-server [--workspace PATH] [--enable-writes] [--mcptest-bin PATH]

Description. Run mcptest itself as a stdio MCP server so an MCP-aware agent (Claude Code, Cursor, mcp-inspector) can query runs, cassettes, coverage, and diagnostics from inside the editor. Read tools are always available; --enable-writes adds the run-triggering and cassette-recording tools. Full tool and resource catalog in mcp-server.md.

Arguments.

| Argument | Type | Description |
| --- | --- | --- |
| --workspace <PATH> | path | Workspace root. Defaults to the current directory. |
| --enable-writes | flag | Enable the write tools (trigger_run, record_cassette). Off by default. |
| --mcptest-bin <PATH> | path | Override the mcptest binary the write tools spawn. |

Status. Working. Registered in an agent config as command: "mcptest", args: ["mcp-server", "--workspace", "."].

generate

mcptest generate stubs --url URL [--bearer-token-env NAME]
                       [--output DIR] [--overwrite] [--stdout] [--check]

mcptest generate suite --from-config FILE [--server-name NAME]
                       [--models ID,ID,...] [--no-edge] [--no-violation]
                       [--output PATH | --update PATH]

Description. Scaffold runnable YAML tests from a server's tool catalog. Wrapped under a subcommand so generators land as siblings.

generate stubs introspects a live server and emits one test stub file per advertised tool.

generate suite synthesizes one self-contained suite document from a server's declared tools: a servers: block, a multi-model matrix placeholder, and three cases per tool (valid arguments, a boundary edge case, and a schema-violation case that expects an error). The emitted YAML validates against schemas/v1.json, so it runs as written.

generate stubs arguments.

| Argument | Type | Description |
| --- | --- | --- |
| --url <URL> | string | MCP server endpoint to introspect. |
| --bearer-token-env <NAME> | string | Env var holding a bearer token forwarded to every request. |
| --output <DIR> | path | Directory the generated YAML is written under. Default tests/tools. |
| --overwrite | flag | Replace existing stub files instead of skipping them. |
| --stdout | flag | Print every stub concatenated to stdout instead of writing to disk. |
| --check | flag | Exit 6 if any generated stub differs from the checked-in file. CI drift detection. |

generate suite arguments.

| Argument | Type | Description |
| --- | --- | --- |
| --from-config <FILE> | path | Read the declared tool list from a file: a tools/list JSON snapshot ({"tools": [...]}), a bare JSON tools array, or a YAML mock manifest (mock_server.tools[]). |
| --server-name <NAME> | string | Server key the generated tests reference, also the servers: entry. Default server. |
| --models <ID,ID,...> | list | Model identifiers for the model_compatibility: matrix. Omit to use the built-in default lineup. |
| --no-edge | flag | Skip the boundary edge case per tool. |
| --no-violation | flag | Skip the schema-violation case per tool. |
| --output <PATH> | path | Write the suite to a file instead of stdout. |
| --update <PATH> | path | Merge into an existing suite, keeping every hand-authored test and appending only tests whose name is new. Mutually exclusive with --output. |
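
The --update merge rule (keep what exists, append only unseen names) can be pictured as a set difference over test names. The sketch below works on two plain text files with one test name per line; names_to_append and that file layout are assumptions made for illustration, and the real merge operates on suite YAML rather than name lists.

```shell
# names_to_append <existing-names.txt> <generated-names.txt>
# Prints the generated names that are not already present, in generated order.
# Hypothetical helper; input files hold one test name per line.
names_to_append() {
  awk -v existing="$1" '
    BEGIN { while ((getline n < existing) > 0) seen[n] = 1 }
    !($0 in seen) { print; seen[$0] = 1 }   # also drops repeats in the generated list
  ' "$2"
}
```

Because existing names are never rewritten, a hand-edited test that shares a name with a generated one wins under this rule.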

Why a file, not a live connection. Reading tools from a file keeps generation deterministic and CI-reproducible. Live tools/list introspection reuses the same connector as generate stubs --url and is the planned follow-up.

Status. Working (stubs and suite).

mock

mcptest mock --tools-from PATH

Description. Spawn a YAML-driven stdio mock MCP server. The mock loads its tool catalog from --tools-from and serves it over stdio, so a client integration can be exercised without the real backend. v1.0 ships stdio only.

Arguments.

| Argument | Type | Description |
| --- | --- | --- |
| --tools-from <PATH> | path | A YAML manifest (mock_server.tools[]) or a tools/list baseline JSON snapshot. |

Status. Working. The cassette-driven mock (--cassettes) is a separate draft, see mcptest-mock.md.

exec

mcptest exec --connection-server [--config PATH] [--ipc-version VERSION]
             [--no-cache] [--no-cassette] [--record-cassettes]
             [--debug-output PATH] [--verbose]

Description. Run mcptest as an IPC co-process for the native SDKs. The SDK (pytest, Vitest, Go, etc.) spawns this command, pipes newline-delimited JSON-RPC over stdin/stdout, and reads canonical responses back. You do not run this by hand; the language SDKs invoke it.

Arguments.

| Argument | Type | Description |
| --- | --- | --- |
| --connection-server | flag | Required mode toggle (the only mode in v1). |
| --config <PATH> | path | mcptest config. Defaults to mcptest.yaml. |
| --ipc-version <VERSION> | string | Pin the IPC envelope version. The dispatcher rejects a version newer than it supports. |
| --no-cache | flag | Disable the cache for this session. |
| --no-cassette | flag | Disable the cassette layer (no replay, no record). |
| --record-cassettes | flag | Record new cassettes for any unmocked call seen this session. |
| --debug-output <PATH> | path | Write a verbatim transcript for SDK debugging. |
| --verbose | flag | Emit envelope counts and lifecycle events on stderr. |

Status. Working.

login

mcptest login [SERVER] [--url URL] [--client-id ID] [--all] [--force] [--no-browser]

Description. Interactive OAuth 2.1 + PKCE login that caches a token for later runs. Discovers the IdP from the target's .well-known/oauth-authorization-server, runs the browser flow against a loopback listener, and caches the token (and any Dynamic Client Registration metadata) for subsequent mcptest run invocations.

Arguments.

| Argument | Type | Description |
| --- | --- | --- |
| [SERVER] | string | Named server from mcptest.yml to log in to. Mutually exclusive with --url and --all. Omit for the single configured URL server or an interactive picker. |
| --url <URL> | string | Authenticate against a URL not recorded in mcptest.yml. Conflicts with --all. |
| --client-id <ID> | string | Fallback OAuth client_id for IdPs without a registration_endpoint. Ignored when DCR is available. |
| --all | flag | Log in to every configured URL server in declaration order. Stdio servers are skipped with a warning. |
| --force | flag | Clear the cached token and DCR metadata before running the flow. |
| --no-browser | flag | Print the authorization URL on stdout instead of opening a browser. The loopback listener still accepts the callback. CI and headless escape hatch. |

Status. Working.

prompt

mcptest prompt [--output PATH]

Description. Print a copy-paste-ready grounding prompt for an LLM assistant writing mcptest YAML. Prints to stdout by default.

Arguments.

| Argument | Type | Description |
| --- | --- | --- |
| --output <PATH> | path | Write the prompt to a file instead of stdout. |

Status. Working.

cache

mcptest cache [--cache-dir PATH] <list|stats|clear|prune>
mcptest cache clear [--older-than DURATION]

Description. Inspect or evict the local cache store. The --cache-dir override points the store at a non-default root and is accepted on every cache subcommand.

Subcommands.

| Subcommand | Description |
| --- | --- |
| list | List every cached entry with size and age. |
| stats | Print totals plus the hit-rate row. |
| clear [--older-than DURATION] | Remove entries. Without --older-than, removes everything. Duration like 30m, 2h, 7d. |
| prune | Remove entries older than 30 days. CI-friendly alias so scripts do not pick a number. |

Status. Working.

security

mcptest [GLOBAL_OPTIONS] security <SNAPSHOT> [--format FORMAT] [--fail-on SEVERITY]

Description. Scan a tools/list-style JSON snapshot with the bundled deterministic security checks and report the findings. No model decides a verdict: every check is a regex or structural predicate over the tool, prompt, and resource definitions, so a finding is reproducible. The first bundled lane is the tool-surface family (SEC-001 through SEC-009); see the security test catalog.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <SNAPSHOT> | filesystem path | yes | A JSON snapshot that may carry tools, prompts, and resources arrays. |
| --format <FORMAT> | enum {pretty, json, sarif, html, md} | optional, default pretty | Output format. SARIF 2.1.0 drops into code scanning; html/md emit a reviewer-grade vulnerability report with an OWASP LLM Top 10 coverage table (see security-report.md). |
| --fail-on <SEVERITY> | enum {info, low, medium, high, critical} | optional, default high | Exit code 1 when any finding is at or above this severity. |

Examples.

# Human-readable summary.
mcptest security tools-list.json

# Hard fail in CI on any high or critical finding.
mcptest security tools-list.json --fail-on high

# SARIF for code scanning.
mcptest security tools-list.json --format sarif > security.sarif

Exit codes. 0 when nothing fires at or above --fail-on, 1 when something does, 2 when the snapshot cannot be read or parsed.
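
The --fail-on gate is an ordered comparison over the five severities. A minimal sketch of that comparison follows; severity_rank and should_fail are hypothetical helpers, and the numeric ranks are an illustrative encoding of the order info < low < medium < high < critical.

```shell
# severity_rank <severity>
# Prints a rank for the severity, or returns 1 for an unknown value.
severity_rank() {
  case "$1" in
    info)     echo 0 ;;
    low)      echo 1 ;;
    medium)   echo 2 ;;
    high)     echo 3 ;;
    critical) echo 4 ;;
    *)        return 1 ;;
  esac
}

# should_fail <finding-severity> <fail-on>
# Returns 0 (true) when the finding is at or above the --fail-on floor.
# Hypothetical helper for illustration only.
should_fail() {
  _sf_finding=$(severity_rank "$1") || return 2
  _sf_floor=$(severity_rank "$2") || return 2
  [ "$_sf_finding" -ge "$_sf_floor" ]
}
```

With the default floor of high, a medium finding is reported but does not change the exit code.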

Subcommands. import, documented under security import below.

Status. Working. The engine lives in mcptest_core::security; the active probes and the integrity, namespace, and advisory lanes are tracked under the security-framework epic.

security import

mcptest security import [--sarif FILE]... [--snyk FILE]... [--supplement FILE]... \
    [--snapshot FILE] [--advisory] [--format FORMAT] [--fail-on SEVERITY]

Description. mcptest owns the ingest, not the scan. import normalizes a scanner you already run into the finding vocabulary, dedups it against the bundled catalog (an overlapping SEC rule is counted once), and prints one unified report. SARIF 2.1.0 is read with --sarif, Snyk agent-scan JSON with --snyk, and any other JSON shape (a top-level array or a findings array) with --supplement. Each flag is repeatable. With --snapshot, the bundled deterministic lanes also run and the imports fold in beside them. See the external-scanner supplement.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --sarif <FILE> | filesystem path, repeatable | one of the three | A SARIF 2.1.0 log. The scanner name is read from its tool driver. |
| --snyk <FILE> | filesystem path, repeatable | one of the three | A Snyk agent-scan ScanPathResult JSON file. |
| --supplement <FILE> | filesystem path, repeatable | one of the three | A generic scanner JSON file. |
| --snapshot <FILE> | filesystem path | optional | A tools/list snapshot to also scan with the bundled lanes. |
| --advisory | flag | optional | Mark every import advisory, so none of it gates. |
| --format <FORMAT> | enum {pretty, json, sarif, html, md} | optional, default pretty | Output format. html/md emit a vulnerability report with OWASP coverage (see security-report.md). |
| --fail-on <SEVERITY> | enum {info, low, medium, high, critical} | optional, default high | Exit 1 when any counted finding is at or above this floor. |

Examples.

# Fold an AgentSeal SARIF file and a Snyk agent-scan JSON file into one report.
mcptest security import \
  --sarif examples/security/agentseal.sarif.json \
  --snyk examples/security/snyk-agent-scan.json

# Combine an import with a bundled snapshot scan, emitting SARIF.
mcptest security import --sarif scan.sarif \
  --snapshot tools-list.json --format sarif > security.sarif

Exit codes. 0 when nothing counted fires at or above --fail-on, 1 when something does, 2 when a file cannot be read or no scanner file is given.

sbom

mcptest [GLOBAL_OPTIONS] sbom [--format FORMAT] [--out PATH] [--verify]

Description. Print the CycloneDX 1.5 Software Bill of Materials that the build script baked into the binary at compile time, list licenses, or verify the embedded blob has not been swapped at runtime. The full guide lives at Software Bill of Materials.

Arguments.

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --format <FORMAT> | enum {cyclonedx, licenses, names} | optional, default cyclonedx | cyclonedx is the raw embedded JSON; licenses is one line per dep with its SPDX expression; names is one line per dep with just name and version. |
| --out <PATH> | filesystem path | optional | Write the output here instead of stdout. |
| --verify | flag | optional | Re-hash the embedded BOM at runtime, compare to the build-time SHA, exit 0 on match and 2 on mismatch. |

Examples.

# Pipe the BOM into a scanner.
mcptest sbom > mcptest.cdx.json

# Quick license inventory.
mcptest sbom --format licenses

# Confirm the embedded blob has not been tampered with at runtime.
mcptest sbom --verify

Exit codes. 0 on success or successful verification, 2 when --verify detects a hash mismatch.

Status. Working.

evidence

mcptest [GLOBAL_OPTIONS] evidence <REPORT> [--security FILE] [--reproducible] [--out PATH] [--sign]
mcptest [GLOBAL_OPTIONS] evidence verify <EVIDENCE> [--max-age DURATION] [--signature FILE] [--require-signed]

Description. Aggregate a mcptest run --format json report into a portable evidence artifact (server identity, spec version, corpus hash, source provenance, grades, reproducibility), or verify one. --sign reuses the release Sigstore cosign path to attach a detached signature. See portable run evidence.

Arguments (emit).

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <REPORT> | filesystem path | yes (unless a subcommand) | A serialized mcptest run --format json report. Must carry run metadata. |
| --security <FILE> | filesystem path | optional | A mcptest security --format json report whose severity counts fold into the grades. |
| --reproducible | flag | optional | Mark the run byte-reproducible (the sbom --verify / SOURCE_DATE_EPOCH parity signal). |
| --out <PATH> | filesystem path | optional | Write the artifact here instead of stdout. Required with --sign. |
| --sign | flag | optional | Sign the artifact with cosign sign-blob (keyless, GitHub OIDC), writing <out>.sig and <out>.cert. Requires cosign on PATH. |

Arguments (verify).

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <EVIDENCE> | filesystem path | yes | The evidence.json artifact to verify. |
| --max-age <DURATION> | duration (720h, 30m) | optional | Reject evidence whose generated_at is older than this. |
| --signature <FILE> | filesystem path | optional | Detached signature; defaults to <evidence>.sig when present. |
| --require-signed | flag | optional | Reject the artifact when it is unsigned. |

Examples.

# Emit an artifact from a run, folding in a security scan.
mcptest evidence run.json --security security.json --reproducible --out evidence.json

# Sign it (needs cosign on PATH).
mcptest evidence run.json --out evidence.json --sign

# Verify: reject stale (>30d), forked, or unsigned evidence.
mcptest evidence verify evidence.json --max-age 720h --require-signed

Exit codes. Emit: 0 on success, 2 when the report cannot be read or carries no metadata (or --sign cannot run). Verify: 0 accepted, 1 rejected (reasons printed), 2 when the artifact cannot be read.

Status. Working. Cryptographic Sigstore verification (Rekor inclusion, certificate identity) is cosign verify-blob's job; evidence verify owns the freshness, commit-ancestry, and signature-presence policy.

ledger

mcptest [GLOBAL_OPTIONS] ledger emit <ENVELOPE> [--session-id ID] [--output PATH]
mcptest [GLOBAL_OPTIONS] ledger diff <BASELINE> <ACTUAL> [--max-diff N]

Description. Turn a saved agent run envelope into a session-ledger NDJSON file, or diff an actual ledger against a baseline trajectory. The ledger is the append-only, structured record of the tool calls a run made: one header record, then one tool_call record per call, in call order. See session ledger for the schema and field reference.

Arguments (emit).

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <ENVELOPE> | filesystem path | yes | A JSON file holding an agent run envelope with a tool_calls array (each entry has name, server, args). This is the shape a single agent test produces in mcptest run --reporter json. |
| --session-id <ID> | string | optional | Session id stamped on every record. |
| --output <PATH> | filesystem path | optional | Write the ledger here instead of stdout. |

Arguments (diff).

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| <BASELINE> | filesystem path | yes | The recorded baseline ledger NDJSON. |
| <ACTUAL> | filesystem path | yes | The fresh ledger to compare. |
| --max-diff <N> | integer | optional (default 0) | Maximum tolerated divergent tool calls. The command exits non-zero once divergences exceed this; 0 requires an exact match. |

Examples.

# Record a baseline trajectory from a saved envelope.
mcptest ledger emit envelope.json --session-id run-42 --output baseline.ndjson

# Gate a fresh run against the baseline in CI (exact match).
mcptest ledger diff baseline.ndjson actual.ndjson --max-diff 0

The diff compares tool calls position by position per agent_id: a different tool at a hop is a remove plus an add, a matching tool with different params is a param change.

  - removed  hop 1: get_weather
  + added    hop 1: delete
ledger diff: 2 divergence(s) exceed --max-diff 0

Exit codes. 0 clean (or divergences within --max-diff), 1 divergences exceed --max-diff, 2 when an input cannot be read.
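
The counting rule matters when choosing --max-diff: a swapped tool at one hop is a remove plus an add, so it counts as two divergences. The sketch below models only that rule. It is a simplification under stated assumptions: ledger_diff_sketch is a hypothetical helper, its inputs are plain text files with one tool name per line rather than ledger NDJSON, it ignores agent_id and params, it numbers hops from 1, and its treatment of ledgers of unequal length (trailing calls count as removed or added) is a guess.

```shell
# ledger_diff_sketch <baseline.txt> <actual.txt> <max-diff>
# Compares tool names position by position and returns 1 when the
# divergence count exceeds <max-diff>. Illustration only, not the real diff.
ledger_diff_sketch() {
  awk -v basefile="$1" -v max="$3" '
    BEGIN { while ((getline line < basefile) > 0) base[++nb] = line }
    { act[++na] = $0 }
    END {
      n = (nb > na) ? nb : na
      for (i = 1; i <= n; i++) {
        if (i <= nb && i <= na && base[i] == act[i]) continue
        if (i <= nb) { print "- removed hop " i ": " base[i]; d++ }
        if (i <= na) { print "+ added hop " i ": " act[i]; d++ }
      }
      print "divergences: " (d + 0)
      if ((d + 0) > (max + 0)) exit 1
    }
  ' "$2"
}
```

Under this model, tolerating a single swapped call needs a threshold of 2, not 1.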

Status. Working. The schema is owned here; see session ledger for the open-core boundary.

web-bot-auth

mcptest [GLOBAL_OPTIONS] web-bot-auth directory [--key PATH | --key-env VAR] [--algorithm ALG] [--agent URL]

Description. Emit the .well-known/http-message-signatures-directory JWK Set for a Web Bot Auth signing key. Only the public key is printed; the private key is never written to the output. See Web Bot Auth for the full signing and verification story.

Arguments (directory).

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| --key <PATH> | filesystem path | one of --key/--key-env | PKCS#8 PEM file holding the private signing key. Only the derived public key is emitted. |
| --key-env <VAR> | env var name | one of --key/--key-env | Env var holding the PKCS#8 PEM private key, so the key never appears on the command line. |
| --algorithm <ALG> | ed25519 or rsa-pss | optional (default ed25519) | Signature algorithm. Must match the key type. |
| --agent <URL> | URL | optional | Signature-Agent URL identifying the bot. Recorded in the validated config; it does not appear in the JWK Set itself. |

Examples.

# Emit the public JWK Set for an Ed25519 key.
mcptest web-bot-auth directory --key bot.ed25519

# Read the key from an env var instead of a file.
mcptest web-bot-auth directory --key-env BOT_SIGNING_KEY

Exit codes. 0 on success, non-zero when the key is missing or malformed.
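If you do not yet have a signing key, one way to produce a PKCS#8 PEM file is with OpenSSL. This is a sketch, assuming OpenSSL 1.1.1 or newer is on PATH; new_bot_key is a hypothetical helper name, and the file name bot.ed25519 simply matches the example above:

```shell
# Hypothetical helper: write a fresh Ed25519 private key as PKCS#8 PEM.
# Usage: new_bot_key PATH
new_bot_key() {
  # genpkey emits PKCS#8 ("-----BEGIN PRIVATE KEY-----") by default.
  openssl genpkey -algorithm ed25519 -out "$1" &&
    chmod 600 "$1"   # keep the private key readable by the owner only
}

# Typical follow-up, shown as comments because it needs the mcptest binary:
#   new_bot_key bot.ed25519
#   mcptest web-bot-auth directory --key bot.ed25519
#   BOT_SIGNING_KEY="$(cat bot.ed25519)" mcptest web-bot-auth directory --key-env BOT_SIGNING_KEY
```

The --key-env form keeps the key path and contents out of the process argument list, which is the reason the flag exists.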

Status. Working.

Exit codes

mcptest uses a small, stable set of exit codes so CI scripts can react without parsing stdout. Every subcommand documents which codes it can return; this table is the central reference.

| Code | Meaning | Source |
| --- | --- | --- |
| 0 | Success. The command did what it was asked to do. | All subcommands. |
| 1 | Test failures or a malformed input artefact. | run, report (bad input), diff (breaking change with --fail-on-breaking true), eval (failing verdict), compliance (regression vs baseline), model-compat (FAIL). |
| 2 | Configuration error or invalid arguments. | validate, init (write conflict), report (collector rejection), run (config load failed). clap also returns 2 for unknown flags. |
| 3 | --wait-for-ready budget expired before a URL server accepted connections. | run, doctor. |
| 5 | Cost cap exceeded, or run --update-snapshots refused under CI=true. | eval, run. |
| 6 | Coverage below threshold, or a model-compat DRIFT. | coverage and run --coverage-threshold, model-compat (DRIFT). |
| 7 | No tests selected. The suite is empty, or --filter, --shard, or --last-failed matched nothing, and --pass-with-no-tests was not passed. | run. |

Codes outside this set are reserved. An unexpected 2 most likely comes from clap rejecting the command line, not from the subcommand itself.
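A CI wrapper can turn these codes into readable log lines. The sketch below mirrors the table; mcptest_exit_label is a hypothetical helper name, and the labels are shortened paraphrases of the Meaning column:

```shell
# Hypothetical helper: translate an mcptest exit code into a short label.
# Usage: mcptest_exit_label CODE
mcptest_exit_label() {
  case "$1" in
    0) echo "success" ;;
    1) echo "test failures or malformed input" ;;
    2) echo "configuration error or invalid arguments" ;;
    3) echo "wait-for-ready budget expired" ;;
    5) echo "cost cap exceeded or snapshot update refused under CI" ;;
    6) echo "coverage below threshold or model-compat drift" ;;
    7) echo "no tests selected" ;;
    *) echo "reserved exit code $1" ;;
  esac
}

# Typical use in a CI step, shown as comments because it needs the binary:
#   code=0; mcptest run || code=$?
#   echo "mcptest: $(mcptest_exit_label "$code")" >&2
#   exit "$code"
```

Keeping the original code as the step's exit status means downstream tooling can still distinguish, for example, an empty selection (7) from a real failure (1).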

Note: a future doctor --lint-descriptions quality lint will land its own exit code when that feature ships; it is not wired in the v1.0 binary today.

Cross-references