Scenario 17: one agent test across a model matrix
How this page relates to the model compatibility guide: scenarios are five-minute runnable walkthroughs; guides are the full reference. This page fans one agent test across a models: list and reads the resulting grid. For the baseline/candidate/diff rollout workflow (mcptest model-compat), the classification model, and the baseline file format, read the guide.
Your server works with the model you developed against. Then a customer connects a different one, and a tool that fired reliably stops firing on a phrasing the new model parses differently. Nothing in your server changed; the verdict still flipped. The cheapest way to catch this class of breakage is to take one agent test you already trust and run it against every model you care about, in one invocation, with one report.
Five minutes, three steps: add a models: list to an existing agent test, run the sweep, read the matrix.
Step 1: start from an agent test you already have
Any agent test works. This one asks for the weather and asserts that the model actually called the tool with the right argument. It is the first-test pattern plus an agent block:
# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  weather:
    command: ["./examples/reference-server/weather.sh"]
budget:
  per_test_usd_cents: 50
  per_suite_usd_cents: 500
agents:
  - name: weather query routes to get_weather
    model: claude-sonnet-4-5
    servers: [weather]
    prompt: What is the weather in Sacramento?
    max_turns: 3
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
      - target: tool_calls[0].args.city
        matcher: { icontains: sacramento }
Step 2: fan it across models
Swap the single model: for a models: list. The runner dispatches one cell per model; everything else about the test stays the same:
agents:
  - name: weather query routes to get_weather
    models:
      - claude-sonnet-4-5   # ANTHROPIC_API_KEY
      - gpt-5               # OPENAI_API_KEY
      - gemini-2.5-pro      # GEMINI_API_KEY or GOOGLE_API_KEY
    servers: [weather]
    prompt: What is the weather in Sacramento?
    max_turns: 3
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
      - target: tool_calls[0].args.city
        matcher: { icontains: sacramento }
Bare ids are provider-auto-detected from their prefix (claude-*, gpt-*, gemini-*, ...). Set the provider key for each model you want to exercise.
You can also skip the YAML edit entirely: --models fans every agent test in the suite across a comma-separated list, overriding any models: declared in the file:
mcptest run suite.yml --models claude-sonnet-4-5,gpt-5,gemini-2.5-pro
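Because each model needs its provider key, it can help to check the environment before spending budget on a sweep. The sketch below is a hypothetical preflight helper, not part of mcptest: `required_key_for` and `missing_keys` are names invented for this example, and the mapping covers only the three prefixes and key names listed in the YAML comments above. Any other prefix is reported as missing, since this sketch does not know which key it would need.

```shell
#!/bin/sh
# Hypothetical preflight helpers; not part of mcptest.

# Map a bare model id to the provider key(s) named in the scenario.
# Only the three prefixes shown on this page are covered (assumption).
required_key_for() {
  case "$1" in
    claude-*) echo "ANTHROPIC_API_KEY" ;;
    gpt-*)    echo "OPENAI_API_KEY" ;;
    gemini-*) echo "GEMINI_API_KEY GOOGLE_API_KEY" ;;  # either one is enough
    *)        echo "" ;;                               # unknown prefix
  esac
}

# Print every model in a comma-separated list that has no usable key set.
missing_keys() {
  old_ifs=$IFS
  IFS=,
  for model in $1; do
    IFS=$old_ifs            # restore whitespace splitting for the inner loop
    found=0
    for key in $(required_key_for "$model"); do
      eval "value=\${$key:-}"
      if [ -n "$value" ]; then
        found=1
      fi
    done
    if [ "$found" -eq 0 ]; then
      echo "$model"
    fi
  done
  IFS=$old_ifs
}
```

Calling `missing_keys "claude-sonnet-4-5,gpt-5,gemini-2.5-pro"` prints one line per model whose key is unset, and prints nothing when all three are ready to run.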
Step 3: read the matrix
With --models, the run defaults to the matrix reporter: one row per test, one column per model, a pass/fail cell with the score at each intersection. For a file you can open or attach to a PR:
mcptest run suite.yml \
--models claude-sonnet-4-5,gpt-5,gemini-2.5-pro \
--reporter matrix --output matrix.html
The grid shows weather query routes to get_weather as a row and the three models as columns. A green cell means that model passed every assertion; a red cell expands a why drill-down listing the failing rows (for example, tool_calls[0].name resolved to nothing because the model answered from memory instead of calling the tool). The summary row gives the per-model pass rate. --reporter matrix-md writes the same grid as a GitHub-flavored Markdown table.
The exit code follows the usual rule: the run fails if any cell fails, so a green run means every model passed the test. Tool, resource, and prompt tests carry no model dimension; they run once and stay out of the grid.
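In CI, the exit-code rule is all a gate needs. A minimal sketch, assuming only that a non-zero exit status means the run did not pass: `gate_on_sweep` is a hypothetical wrapper invented here, not an mcptest feature. It runs whatever command it is handed and turns the status into a one-line verdict.

```shell
#!/bin/sh
# Hypothetical CI wrapper; not part of mcptest.
# Runs the given command and reports on its exit status. A non-zero status
# is treated as "at least one cell failed, or the run itself errored".
gate_on_sweep() {
  if "$@"; then
    echo "sweep passed: every cell is green"
    return 0
  else
    status=$?
    echo "sweep failed (exit $status): a cell failed or the run errored" >&2
    return 1
  fi
}
```

A pipeline step would call it with the sweep command from this page, for example `gate_on_sweep mcptest run suite.yml --models claude-sonnet-4-5,gpt-5,gemini-2.5-pro`, and fail the job when it returns non-zero.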
Where to go from here
A passing matrix today does not prove next month's silent model update keeps passing. To gate a rollout on "the new model behaves like the old one", capture a baseline with mcptest model-compat capture and diff a candidate against it; that workflow, with its PASS / DRIFT / FAIL classification, is the model compatibility guide.
See also
- The matrix reporter formats and the re-render path: docs/matrix-reporter.md.
- A fuller multi-model example with budget notes: examples/agent-matrix.yml.
- Previous: Record and replay.
- Next: back to Guides or the model compatibility guide.