Scenario 13: tool overload and selection under noise
A model that picks the right tool from a list of three may not pick it from a list of twenty. Real MCP deployments rarely present a clean catalog: the candidate list is padded with similar-sounding tools, and a capable agent has to find the one tool that actually does the job amid the look-alikes. This scenario measures that directly, using the hosted test server.
The hosted endpoint https://test.mcptest.sh/mcp?scenario=distractors serves one real tool, get_forecast(city), buried among n near-duplicate decoys. Only get_forecast returns a real forecast; every decoy errors with JSON-RPC -32601 (method not found) when called. The real tool is placed last in the catalog, so an agent that always grabs the first plausible match cannot win by position. Raise n and you chart how selection accuracy decays as the catalog grows.
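For readers who want to recognise a decoy hit in raw traffic, here is a minimal Python sketch. The error object shape and the -32601 code follow the JSON-RPC 2.0 convention. The two response bodies are hand-written, hypothetical examples, not captured output from the hosted server, and the helper name is_decoy_error is illustrative only.

```python
import json

METHOD_NOT_FOUND = -32601  # JSON-RPC 2.0 reserved code for "Method not found"


def is_decoy_error(raw_response: str) -> bool:
    """Return True if a JSON-RPC response carries the method-not-found error."""
    message = json.loads(raw_response)
    error = message.get("error")
    return isinstance(error, dict) and error.get("code") == METHOD_NOT_FOUND


# Hypothetical response bodies, written by hand for illustration.
decoy_reply = json.dumps({
    "jsonrpc": "2.0",
    "id": 7,
    "error": {"code": -32601, "message": "Method not found"},
})
real_reply = json.dumps({
    "jsonrpc": "2.0",
    "id": 8,
    "result": {"content": [{"type": "text", "text": "placeholder forecast"}]},
})

print(is_decoy_error(decoy_reply))  # True
print(is_decoy_error(real_reply))   # False
```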
This is an agent test: an agents: block drives a real model through the tool-use loop, and you assert on the trace that the model called get_forecast and not a decoy. Because a real model is in the loop, this step needs a provider API key (for example ANTHROPIC_API_KEY). There is no offline stub for measuring selection under noise; the whole point is what the model actually does.
The YAML
Save this as tests/tool-overload.yml:
# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  weather:
    url: "https://test.mcptest.sh/mcp?scenario=distractors&n=8"
agents:
  - name: forecast survives 8 decoys
    model: claude-sonnet-4-5
    servers: [weather]
    runs: 4
    prompt: What is the weather forecast for Sacramento?
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_forecast }
      - target: tool_names
        matcher:
          contains-all: [get_forecast]
      - target: tool_calls[0].args.city
        matcher: { regex: "(?i)sacramento" }
What is happening here:
- The weather server points at the hosted distractors endpoint with n=8, so the model is shown get_forecast plus eight near-duplicate decoys. n is clamped to the range 1..20; values outside that range are pulled back to the nearest bound.
- The prompt asks for a forecast in plain language. It never names a tool. The model has to read the catalog and choose.
- tool_calls[0].name asserts that the very first tool the model reached for was get_forecast, not a decoy. This is the strict form: it fails if the model probes a decoy first.
- tool_names with contains-all: [get_forecast] is the lenient form: it passes as long as get_forecast appears somewhere in the trajectory, tolerating a model that probes a decoy, gets the -32601 error, and recovers. Keep one or the other depending on whether you are grading first-pick accuracy or eventual success.
- tool_calls[0].args.city confirms the model passed the city through. A model that picks the right tool but drops the argument is still a failure worth catching.
- runs: 4 repeats the agent four times. Model tool choice is not perfectly deterministic, so a single run can pass or fail by luck; averaging over a few runs gives a steadier signal.
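The difference between the strict and lenient assertions can be sketched in a few lines of Python. This is an illustration of the logic only, not how mcptest evaluates matchers: the trace structure, the helper names, and the use of a regex search (rather than a full match) are all assumptions made for the example.

```python
import re

# Hypothetical trace: the model probes a decoy, gets an error, then recovers.
trace = [
    {"name": "get_forecast_v2", "args": {"city": "Sacramento"}},
    {"name": "get_forecast", "args": {"city": "Sacramento"}},
]


def first_pick_is(trace, expected):
    """Strict form: the first tool call must be the expected tool."""
    return bool(trace) and trace[0]["name"] == expected


def contains_all(trace, expected_names):
    """Lenient form: every expected tool appears somewhere in the trajectory."""
    called = {call["name"] for call in trace}
    return set(expected_names) <= called


def first_city_matches(trace, pattern):
    """Argument check: the first call's city argument matches the regex."""
    if not trace:
        return False
    city = trace[0]["args"].get("city", "")
    return re.search(pattern, city) is not None


print(first_pick_is(trace, "get_forecast"))         # False: a decoy came first
print(contains_all(trace, ["get_forecast"]))        # True: the model recovered
print(first_city_matches(trace, "(?i)sacramento"))  # True
```

On this trace the strict check fails while the lenient one passes, which is the case where the choice of assertion changes the verdict.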
If you also want the objective selection metric, add a tool_selection floor via the equal-function-set gate documented in docs/tool-selection-f1.md. Group the real tool into a one-member class and gate tool_selection.f1 so the report carries a number you can track over time, with no extra model calls.
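To make the metric concrete, here is a generic precision / recall / F1 calculation over sets of tool names. It is a sketch under assumptions: the exact definition mcptest uses is in docs/tool-selection-f1.md and may differ (for example in how equal-function classes are collapsed), and selection_scores is a hypothetical helper, not part of the tool.

```python
def selection_scores(called, expected):
    """Return (precision, recall, f1) for called tool names vs. expected ones."""
    called, expected = set(called), set(expected)
    true_pos = len(called & expected)
    precision = true_pos / len(called) if called else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Clean run: only the real tool was called.
clean = selection_scores(["get_forecast"], ["get_forecast"])
# Noisy run: the model probed one decoy before reaching the real tool.
probed = selection_scores(["get_forecast_v2", "get_forecast"], ["get_forecast"])

print(tuple(round(x, 3) for x in clean))   # (1.0, 1.0, 1.0)
print(tuple(round(x, 3) for x in probed))  # (0.5, 1.0, 0.667)
```

Under this definition a recovered probe still reaches full recall but loses precision, so the score drops without reaching zero.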
Run it (note the provider key)
This is an agent run, so it dispatches a real model and needs a key for that model's provider:
ANTHROPIC_API_KEY=sk-ant-... mcptest run --config tests/tool-overload.yml
The model id claude-sonnet-4-5 auto-detects the Anthropic family and reads ANTHROPIC_API_KEY. Swap the model for a gpt-, gemini-, or mistral- id and set the matching key (OPENAI_API_KEY, GEMINI_API_KEY, MISTRAL_API_KEY) to measure a different model. If the key is missing, the run does not error: it falls back to a deterministic stub, which does not exercise real tool selection, so the result is meaningless for this scenario. Set the key.
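Because a missing key degrades silently to the stub, a pre-flight check in a wrapper script can be worth having. The sketch below is a standalone Python helper, not an mcptest feature. The gpt-, gemini-, and mistral- prefixes and the four variable names come from this page; the claude- prefix is inferred from the model id claude-sonnet-4-5 and is an assumption, as are the function names.

```python
import os

# Model-id prefix -> provider key variable. The "claude-" prefix is an
# assumption inferred from the example model id; the rest are from this page.
PROVIDER_KEYS = {
    "claude-": "ANTHROPIC_API_KEY",
    "gpt-": "OPENAI_API_KEY",
    "gemini-": "GEMINI_API_KEY",
    "mistral-": "MISTRAL_API_KEY",
}


def required_key(model_id):
    """Return the env var name expected for a model id, by prefix."""
    for prefix, env_var in PROVIDER_KEYS.items():
        if model_id.startswith(prefix):
            return env_var
    raise ValueError(f"no known provider for model id {model_id!r}")


def check_key(model_id, env=os.environ):
    """Fail loudly instead of letting the run fall back to the stub."""
    env_var = required_key(model_id)
    if not env.get(env_var):
        raise RuntimeError(f"{env_var} is not set; the run would use the stub")
    return env_var


print(required_key("claude-sonnet-4-5"))  # ANTHROPIC_API_KEY
```

Calling check_key("claude-sonnet-4-5") before invoking the run turns the silent fallback into an explicit failure.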
To compare several models in one pass, list them under models: or run a --models sweep, and the report renders a per-model grid. See docs/models.md for the matrix form.
Sweep the catalog size (raise n)
The interesting result is not whether the model wins at n=8; it is how fast accuracy falls as the catalog grows. Run the same suite at several values of n and read the trend. The cleanest way is to keep n in a variable and override it per run:
servers:
  weather:
    url: "https://test.mcptest.sh/mcp?scenario=distractors&n=${n}"
variables:
  n:
    default: "8"

# Walk the catalog from a clean baseline up to the clamp ceiling.
for n in 1 4 8 16 20; do
  echo "=== n=$n ==="
  ANTHROPIC_API_KEY=sk-ant-... mcptest run \
    --var n=$n \
    --config tests/tool-overload.yml
done
n=1 is the easy baseline: one real tool, one decoy. Each step up crowds the catalog with more look-alikes. A robust model holds its first-pick accuracy as n climbs to the 20 ceiling; a brittle one starts probing decoys and the tool_calls[0].name assertion begins to fail. Plot the pass rate against n and you have the accuracy-decay curve the tool-overload benchmarks report (see docs/distractor-tools.md for the background and the offline, model-free scoring variant).
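Turning the sweep into a curve is a matter of grouping per-run outcomes by n. The Python sketch below shows one way to do that. The results list is made-up illustrative data, not measured output from any model, and how you extract pass/fail per run from the report is left open because the report format is not described here.

```python
from collections import defaultdict

# Hypothetical per-run outcomes as (n, first_pick_passed). Illustrative only.
results = [
    (1, True), (1, True), (1, True), (1, True),
    (8, True), (8, True), (8, True), (8, False),
    (20, True), (20, False), (20, False), (20, True),
]


def pass_rate_by_n(results):
    """Group outcomes by catalog size and return {n: pass_rate}, sorted by n."""
    totals = defaultdict(lambda: [0, 0])  # n -> [passes, runs]
    for n, passed in results:
        totals[n][0] += int(passed)
        totals[n][1] += 1
    return {n: passes / runs for n, (passes, runs) in sorted(totals.items())}


for n, rate in pass_rate_by_n(results).items():
    print(f"n={n:>2}  pass rate {rate:.2f}")
```

With only four runs per point the rates move in steps of 0.25, so raise runs: before reading much into small differences between adjacent values of n.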
Expected output
A run with the key set, at n=8, where the model picks the real tool on every repeat:
mcptest run --config tests/tool-overload.yml
PASS forecast survives 8 decoys [claude-sonnet-4-5] (4 runs, 2.9s)
  tool_calls[0].name get_forecast
  tool_names [get_forecast]
  tool_calls[0].args.city Sacramento
1 passed, 0 failed in 3.1s
When the model takes the bait, the strict first-pick assertion fails and names the decoy it chose:
FAIL forecast survives 8 decoys [claude-sonnet-4-5] (4 runs, 3.4s)
tool_calls[0].name: expected get_forecast, got get_forecast_v2
The decoy get_forecast_v2 is one of the synthesized look-alikes. The model called it first, and the server answered with -32601. The lenient tool_names form would still pass here if the model recovered and called the real get_forecast on a later turn, which is why the two assertions answer different questions.
Troubleshooting
- Every run falls back to the stub. No provider key is set, so the runner used the deterministic stub instead of a real model. The stub does not exercise real tool selection. Set ANTHROPIC_API_KEY (or the key matching your model) and run again. mcptest run --print-config lists which provider env vars the suite expects, by name.
- The first-pick assertion flakes between runs. Model tool choice is not perfectly repeatable. Raise runs: for a steadier signal, or switch the strict tool_calls[0].name assertion to the lenient tool_names contains-all form if you only care that the model eventually reached the real tool.
- n seems capped. n is clamped to 1..20. A value of 0 or 50 is pulled to the nearest bound. The clamp is server-side, so the catalog you get is always within that range regardless of what you request.
- The model never connects. The endpoint is an HTTP URL target. If the run hangs on connect, confirm the host is reachable from your network and consider --wait-for-ready so the run polls for the listener before sending any MCP request.
- -32601 errors show up in the trace. That is the decoys behaving as designed: every decoy errors with method-not-found when called. Seeing one in tool_results means the model probed a decoy. Whether that is a failure depends on which assertion you gated on.
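The clamp described above behaves like the following Python sketch. It mirrors the documented behavior only; the server's actual implementation is not shown on this page, and clamp_n is a hypothetical name.

```python
def clamp_n(requested, low=1, high=20):
    """Pull a requested decoy count back into the documented 1..20 range."""
    return max(low, min(high, requested))


for requested in (0, 8, 50):
    print(requested, "->", clamp_n(requested))
# 0 -> 1
# 8 -> 8
# 50 -> 20
```

When sweeping n, this means any requested value above 20 produces the same catalog as n=20, so extra points past the ceiling add no information.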
See also
- docs/distractor-tools.md, the distractor injection model and the offline, model-free selection scoring.
- docs/tool-selection-f1.md, the objective precision / recall / F1 gate for tool selection.
- Previous: Multi-server suites.
- Next: Rate-limit backoff.