Models and the compatibility matrix
mcptest's agent test type points a real model at a real MCP server and asserts against the resulting conversation: which tools the model called, what arguments it passed, the final text reply, and the conversation telemetry (tokens, duration, message count). A test written once can run against any number of models: when a new release ships, keep the test, add the new identifier to models:, and the report shows whether anything broke.
Provider families and env vars
| Family | Detected when model starts with | Env var(s) | Notes |
|---|---|---|---|
| Anthropic | claude- | ANTHROPIC_API_KEY | Tool use is first-class. |
| OpenAI | gpt-, chatgpt-, o<digit>, text-, davinci- | OPENAI_API_KEY (and optional OPENAI_ORG_ID) | Covers the gpt- and o-series reasoning models. |
| Google | gemini-, models/gemini- | GEMINI_API_KEY (falls back to GOOGLE_API_KEY) | AI Studio API. |
| Mistral | mistral-, codestral-, magistral-, ministral-, devstral-, pixtral-, open-mistral-, open-mixtral- | MISTRAL_API_KEY | La Plateforme API. |
When mcptest sees a model string it doesn't recognize, the run keeps going against a deterministic stub instead of failing the suite. The reporter logs which provider was used or skipped per case, so you can see at a glance whether the matrix actually exercised the model you expected.
Adding a new family is a five-line patch to crates/mcptest-agent/src/provider.rs (predicate + env-var lookup + constructor). No CLI changes required.
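The detection rules in the table amount to a prefix lookup followed by an environment-variable check. The Python sketch below illustrates that logic only; it is not the code in provider.rs, and the names `FAMILIES`, `detect_family`, and `resolve_key` are hypothetical.

```python
import os
import re

# Prefixes and env vars copied from the table above. The data layout and the
# function names are hypothetical, chosen for this sketch only.
FAMILIES = {
    "anthropic": (("claude-",), ("ANTHROPIC_API_KEY",)),
    "openai": (("gpt-", "chatgpt-", "text-", "davinci-"), ("OPENAI_API_KEY",)),
    "google": (("gemini-", "models/gemini-"), ("GEMINI_API_KEY", "GOOGLE_API_KEY")),
    "mistral": (
        ("mistral-", "codestral-", "magistral-", "ministral-",
         "devstral-", "pixtral-", "open-mistral-", "open-mixtral-"),
        ("MISTRAL_API_KEY",),
    ),
}

# The table's "o<digit>" rule (o1, o3-mini, ...) needs a pattern, not a prefix.
O_SERIES = re.compile(r"o[0-9]")


def detect_family(model):
    """Return the family name for a model id, or None if nothing matches."""
    if O_SERIES.match(model):
        return "openai"
    for family, (prefixes, _env_vars) in FAMILIES.items():
        if model.startswith(prefixes):
            return family
    return None


def resolve_key(family, env=None):
    """Return the first non-empty key among the family's env vars, else None."""
    env = os.environ if env is None else env
    for name in FAMILIES[family][1]:
        if env.get(name):
            return env[name]
    return None
```

Under these rules `detect_family("o3-mini")` returns `"openai"`, while `detect_family("llama-3")` returns `None`, which corresponds to the case where the run falls back to the stub.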
The matrix form
agents:
  - name: weather query routes to get_weather
    models:
      - claude-sonnet-4-5
      - claude-opus-4-7
      - gpt-5
      - gemini-2.5-pro
    servers: [weather]
    prompt: What is the weather in Sacramento?
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
      - target: tool_calls[0].args.city
        matcher: { regex: "(?i)sacramento" }
      - target: conversation.tokens.total
        matcher: { regex: "^[0-9]+$" }
What that prints (illustrative):
mcptest::weather query routes to get_weather [claude-sonnet-4-5] PASS
mcptest::weather query routes to get_weather [claude-opus-4-7] PASS
mcptest::weather query routes to get_weather [gpt-5] FAIL
tool_calls[0].name: expected get_weather, got search
mcptest::weather query routes to get_weather [gemini-2.5-pro] PASS
3 of 4 models passed.
model: <id> (singleton form) still works and is shorthand for models: [<id>]. Suites that already use model: keep running unchanged.
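The fan-out described above can be pictured as expanding each agent entry into one case per model, with the singleton form normalized first. This is a rough Python sketch over an already-parsed agent entry; `expand_cases` is a hypothetical name. How mcptest treats an entry that sets both keys is not stated here, so the sketch simply prefers `models`.

```python
def expand_cases(agent):
    """Return one (agent name, model entry) pair per matrix cell."""
    if "models" in agent:
        # ASSUMPTION: if both keys are present, "models" wins.
        models = list(agent["models"])
    elif "model" in agent:
        # Singleton form: model: <id> is shorthand for models: [<id>].
        models = [agent["model"]]
    else:
        models = []
    return [(agent["name"], model) for model in models]
```

An entry with `model: gpt-5` and one with `models: [gpt-5]` expand to the same single case, which is why older suites keep running unchanged.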
Record once, replay forever
Each (agent, model) pair gets its own cassette at cassettes/<agent_slug>__<model_slug>.json. The workflow:
# First record. Set every key you have; missing keys fall back to
# the stub so a partial record still produces a useful matrix.
ANTHROPIC_API_KEY=sk-ant-... \
OPENAI_API_KEY=sk-proj-... \
GEMINI_API_KEY=AIza... \
mcptest run --record
# Commit the cassettes. CI runs `mcptest run` (no flag) and replays
# them deterministically. No API key, no spend, exact same matchers.
git add cassettes/
git commit -m "record agent baseline"
Adding a new model to models: later does not invalidate the existing recordings. Run mcptest run --record again with the new provider's key set and only the new model's cassette is written.
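One way to picture the incremental behaviour is as a lookup from (agent, model) pairs to cassette paths, where only pairs without a file need a fresh recording. The Python sketch below rests on two assumptions: the slug rule (lowercase, runs of other characters collapsed to `-`) is invented for illustration, since the real rule is not documented here, and the recorder is assumed to skip pairs that already have a cassette. All function names are hypothetical.

```python
import re


def slug(text):
    # ASSUMPTION: illustrative slug rule, not necessarily what mcptest uses.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def cassette_path(agent, model, root="cassettes"):
    """Build <root>/<agent_slug>__<model_slug>.json for one matrix cell."""
    return f"{root}/{slug(agent)}__{slug(model)}.json"


def models_to_record(agent, models, existing_paths):
    """Return the models whose cassette is not in existing_paths yet."""
    return [m for m in models if cassette_path(agent, m) not in existing_paths]
```

With cassettes already present for two models, adding a third to the list leaves only that third model to record.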
When a release breaks something
The matrix is built for this case. Workflow:
- A new Claude / GPT / Gemini drops.
- Add the identifier to models: in your suite.
- With the relevant API key set, run mcptest run --record.
- mcptest writes a cassette for the new model only and shows you the first matcher that fails.
- Decide whether to fix your MCP server (it diverged), tighten the assertion (the old assertion was lucky), or document the regression.
If the new model also passes, commit the cassette and CI tracks the new baseline.
Cost guardrails
Both record runs and stub runs respect the per-test and per-suite budget knobs at the top of the YAML:
budget:
  per_test_usd_cents: 50
  per_suite_usd_cents: 200
Per-model fan-out multiplies the per-suite spend. If the matrix has four models, each costing around 1 cent per call and making one call, a recording pass costs about four cents. A runaway agent loop, on the other hand, hits the budget and terminates loudly rather than silently.
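The arithmetic behind that estimate is a simple product, sketched here in Python. The function names are hypothetical, and whether mcptest stops when spend reaches the limit or only when it exceeds it is an assumption of this sketch (it uses "exceeds").

```python
def recording_cost_cents(models, calls_per_model, cents_per_call):
    """Estimated spend for one recording pass over the whole matrix."""
    return models * calls_per_model * cents_per_call


def over_budget(spent_cents, limit_cents):
    # ASSUMPTION: the limit trips only once spend exceeds it.
    return spent_cents > limit_cents
```

Four models making one 1-cent call each give `recording_cost_cents(4, 1, 1) == 4`, well under a per-suite limit of 200. A single test that loops through 60 one-cent calls reaches 60 cents and would trip a per-test limit of 50.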
Key redaction
The cassette writer scrubs anything that looks like a provider key (sk-ant-..., sk-proj-..., sk-..., AIzaSy...) out of the serialized JSON before it lands on disk. The same redaction runs on any error message the CLI emits, so a reqwest transport failure won't leak a key into a run record or a CI log.
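A scrubber of this kind can be approximated with a single regular expression. The Python sketch below is illustrative rather than mcptest's actual rule: the pattern, the eight-character minimum, the `[REDACTED]` placeholder, and the name `redact` are all assumptions. The sample keys in the usage note are made up.

```python
import re

# "sk-" also covers sk-ant- and sk-proj- because "-" is in the character class.
# ASSUMPTION: key bodies are at least 8 characters of [A-Za-z0-9_-].
KEY_PATTERN = re.compile(r"\b(?:sk-|AIza)[A-Za-z0-9_-]{8,}")


def redact(text, placeholder="[REDACTED]"):
    """Replace anything that looks like a provider key with the placeholder."""
    return KEY_PATTERN.sub(placeholder, text)
```

For instance, `redact('{"auth": "Bearer sk-ant-abc123DEF456"}')` returns `'{"auth": "Bearer [REDACTED]"}'`. The leading word boundary keeps ordinary words such as `task-12345678` from being mistaken for keys.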
Multi-server agent tests
Real agent workflows usually span more than one MCP server: open an issue and ping a channel, search a corpus and write a summary, schedule a meeting and email an invite. List every server the agent needs under servers: and the driver exposes the merged tool catalog to the model. Tool calls are routed back to the owning server and the trace records which server handled each call so suites can pin the routing.
servers:
  issues:
    command: ["./issues-server"]
  notifications:
    command: ["./notifications-server"]
agents:
  - name: open issue and notify oncall
    model: claude-sonnet-4-5
    servers: [issues, notifications]
    # Quoted because an unquoted " #oncall" would start a YAML comment.
    prompt: "Open a P1 issue for the failing CI run and ping #oncall."
    expect:
      - target: tool_calls[0].server
        matcher: { exact: issues }
      - target: tool_calls[0].name
        matcher: { exact: create_issue }
      - target: tool_calls[1].server
        matcher: { exact: notifications }
      - target: tool_calls[1].name
        matcher: { exact: send_message }
The model sees each tool as <server>__<tool> so name collisions across servers cannot happen (issues__create_issue versus a hypothetical notifications__create_issue). Tool calls in the trace keep the bare tool name and gain a server field so the YAML target grammar stays clean. Single-server runs keep the bare tool name on the wire too, so existing one-server suites are unaffected.
A worked example lives at examples/agent-issues-and-notifications.yml.
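The naming rule described above (prefixed names on the wire for multi-server runs, bare names plus a server field in the trace) can be sketched as a pair of helpers. This Python version is an illustration with hypothetical names. It assumes server names never contain `__`, and that single-server traces also carry a server field, which the text above does not confirm.

```python
SEP = "__"


def exposed_name(server, tool, server_count):
    """Name the model sees: bare for one server, <server>__<tool> otherwise."""
    return tool if server_count == 1 else f"{server}{SEP}{tool}"


def to_trace_entry(wire_name, servers):
    """Map a tool name from the model back to its owning server and bare name."""
    if len(servers) == 1:
        # ASSUMPTION: single-server traces still record the server.
        return {"server": servers[0], "name": wire_name}
    # Split on the first separator, so tool names may themselves contain "__".
    server, sep, tool = wire_name.partition(SEP)
    if not sep or server not in servers:
        raise ValueError(f"cannot route tool call: {wire_name!r}")
    return {"server": server, "name": tool}
```

With servers `issues` and `notifications`, the wire name `notifications__send_message` maps back to `{"server": "notifications", "name": "send_message"}`, matching the `tool_calls[1]` targets in the suite above.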
Custom OpenAI-compatible endpoints
When you need to target an endpoint that auto-detection cannot identify (Azure OpenAI, OpenRouter, vLLM, llama.cpp server, LiteLLM, Together, Groq, Anyscale, Fireworks, Anthropic via Bedrock, etc.), declare it under a top-level providers: block and reference it from the models: list:
providers:
  openrouter:
    type: openai # the only supported wire protocol today
    base_url: https://openrouter.ai/api/v1
    api_key_env: OPENROUTER_API_KEY
  azure-prod:
    type: openai
    base_url: https://my-resource.openai.azure.com/openai/deployments/my-gpt-5
    api_key_env: AZURE_OPENAI_KEY
    organization: my-azure-org # optional, sent as OpenAI-Organization
  local-vllm:
    type: openai
    base_url: http://localhost:8000/v1
    # api_key_env omitted -> the runner sends `Authorization: Bearer EMPTY`
    # which the common self-hosted servers accept.
agents:
  - name: weather query routes to get_weather
    models:
      - claude-sonnet-4-5 # auto-detect (uses ANTHROPIC_API_KEY)
      - { provider: openrouter, id: openai/gpt-4o } # named provider
      - { provider: openrouter, id: anthropic/claude-3.5-sonnet }
      - { provider: azure-prod, id: my-gpt-5 }
      - { provider: local-vllm, id: meta-llama/Llama-3.1-70B-Instruct }
    servers: [weather]
    prompt: What is the weather in Sacramento?
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
The reporter labels rows that came from a named provider as [provider/id] (so weather query [openrouter/openai/gpt-4o]), keeping auto-detect rows at [id]. Cassettes for named entries land at cassettes/<agent>__<provider>__<model>.json so the same model id served by two providers stays isolated on disk.
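The labelling rule can be written as a small function over the two shapes a models: entry can take, a bare string or a provider/id mapping. This is a Python sketch with a hypothetical function name; it covers only the bracketed label, not the rest of the report line.

```python
def model_label(entry):
    """Bracket label for a report row: [id] or [provider/id]."""
    if isinstance(entry, str):
        # Auto-detected entry such as "claude-sonnet-4-5".
        return f"[{entry}]"
    # Named-provider entry such as {provider: openrouter, id: openai/gpt-4o}.
    return f"[{entry['provider']}/{entry['id']}]"
```

Because the provider name is part of the label, `openai/gpt-4o` served through `openrouter` and the same id served through another gateway get distinct rows.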
When to use which
- Auto-detect for the four big public APIs (Anthropic, OpenAI, Google, Mistral). Lowest-friction; just set the canonical env var and write the model id.
- Named provider for everything else (Azure, gateways, local). More YAML but you decide the endpoint and the key.
Adding a model the CLI doesn't recognize yet
Two paths:
- Quickest: declare a providers: entry pointed at the right endpoint and reference it from models:. No code change required.
- Cleanest: add a five-line predicate + env-var lookup in crates/mcptest-agent/src/provider.rs so the family is auto-detected by future suites.
Either works; the first is the right move when you're trying things out and the second is the right move when a family is going to be a permanent fixture.
For Ollama specifically, point a providers: entry at http://localhost:11434/v1 (Ollama's OpenAI-compatible shim) and reference your model ids from there. The native Ollama provider lives in mcptest-core for callers that want streaming, but the OpenAI-compatible shim is enough for agent tests today.