Scenario 13: tool overload and selection under noise

A model that picks the right tool from a list of three may not pick it from a list of twenty. Real MCP deployments rarely present a clean catalog: the candidate list is padded with similar-sounding tools, and a capable agent has to find the one tool that actually does the job amid the look-alikes. This scenario measures that directly, using the hosted test server.

The hosted endpoint https://test.mcptest.sh/mcp?scenario=distractors serves one real tool, get_forecast(city), buried among n near-duplicate decoys. Only get_forecast returns a real forecast; every decoy errors with JSON-RPC -32601 (method not found) when called. The real tool is placed last in the catalog, so an agent that always grabs the first plausible match cannot win by position. Raise n and you chart how selection accuracy decays as the catalog grows.
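
To make the catalog shape concrete, here is a minimal local sketch in Python. It is not the hosted server's implementation: the decoy naming scheme, the helper names build_catalog and call_tool, and the placeholder forecast payload are all assumptions for illustration. Only the real-tool-last ordering, the -32601 error code, and the 1..20 clamp on n come from this scenario's description.

```python
METHOD_NOT_FOUND = -32601  # JSON-RPC "method not found"

def build_catalog(n: int) -> list:
    """Return n decoy names followed by the real tool, which is always last."""
    n = max(1, min(20, n))  # mirror the documented clamp on n
    decoys = [f"get_forecast_v{i + 2}" for i in range(n)]  # hypothetical names
    return decoys + ["get_forecast"]

def call_tool(name: str, city: str) -> dict:
    """Answer like the scenario: a forecast for the real tool, an error otherwise."""
    if name == "get_forecast":
        return {"result": {"city": city, "forecast": "placeholder"}}
    return {"error": {"code": METHOD_NOT_FOUND, "message": "Method not found"}}

catalog = build_catalog(8)
# A naive agent that grabs the first plausible name lands on a decoy.
first_match = next(t for t in catalog if "forecast" in t)
print(len(catalog), catalog[-1], first_match)
print(call_tool(first_match, "Sacramento"))
```

Because the real tool sits at the end of the list, any strategy that picks by position rather than by reading the descriptions hits a decoy first.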

This is an agent test: an agents: block drives a real model through the tool-use loop, and you assert on the trace to confirm that the model called get_forecast and not a decoy. Because a real model is in the loop, this step needs a provider API key (for example ANTHROPIC_API_KEY). A stub cannot measure selection under noise; the whole point is what the model actually does.

The YAML

Save this as tests/tool-overload.yml:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  weather:
    url: "https://test.mcptest.sh/mcp?scenario=distractors&n=8"

agents:
  - name: forecast survives 8 decoys
    model: claude-sonnet-4-5
    servers: [weather]
    runs: 4
    prompt: What is the weather forecast for Sacramento?
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_forecast }
      - target: tool_names
        matcher:
          contains-all: [get_forecast]
      - target: tool_calls[0].args.city
        matcher: { regex: "(?i)sacramento" }

What is happening here:

The weather server points at the hosted distractors endpoint with n=8, so the model is shown get_forecast plus eight near-duplicate decoys. n is clamped to the range 1..20; values outside that range are pulled back to the nearest bound.
The prompt asks for a forecast in plain language. It never names a tool. The model has to read the catalog and choose.
tool_calls[0].name asserts the very first tool the model reached for was get_forecast, not a decoy. This is the strict form: it fails if the model probes a decoy first.
tool_names with contains-all: [get_forecast] is the lenient form: it passes as long as get_forecast appears somewhere in the trajectory, tolerating a model that probes a decoy, gets the -32601 error, and recovers. Keep one or the other depending on whether you are grading first-pick accuracy or eventual success.
tool_calls[0].args.city confirms the model passed the city through. A model that picks the right tool but drops the argument is still a failure worth catching.
runs: 4 repeats the agent four times. Model tool choice is not perfectly deterministic, so a single run can pass or fail by luck; repeating it a few times gives a steadier signal.
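
The difference between the strict and lenient assertions can be sketched in a few lines of Python. This is an illustration of the grading logic only, not mcptest's matcher code: the function names are hypothetical, the trace is reduced to a plain list of tool names, and the regex check assumes search (not full-match) semantics.

```python
import re

def first_pick_ok(tool_names: list, expected: str = "get_forecast") -> bool:
    """Strict: the very first call must be the real tool."""
    return bool(tool_names) and tool_names[0] == expected

def eventually_ok(tool_names: list, expected: str = "get_forecast") -> bool:
    """Lenient: the real tool appears anywhere in the trajectory."""
    return expected in tool_names

def city_ok(args: dict, pattern: str = r"(?i)sacramento") -> bool:
    """Argument check: the city made it into the call."""
    return re.search(pattern, str(args.get("city", ""))) is not None

clean = ["get_forecast"]
recovered = ["get_forecast_v2", "get_forecast"]  # probed a decoy, then recovered
for trace in (clean, recovered):
    print(trace, "strict:", first_pick_ok(trace), "lenient:", eventually_ok(trace))
```

The recovered trace is the case that separates the two: it fails the strict check and passes the lenient one.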

If you also want the objective selection metric, add a tool_selection floor via the equal-function-set gate documented in docs/tool-selection-f1.md. Group the real tool into a one-member class and gate tool_selection.f1 so the report carries a number you can track over time, with no extra model calls.
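
For intuition about what such a number captures, here is a generic set-based F1 in Python. Treat it as an assumption-laden sketch: the actual tool_selection.f1 definition lives in docs/tool-selection-f1.md and may differ, and the function name, the class label "forecast", and the class mapping are hypothetical.

```python
def selection_f1(called: list, expected: set, classes: dict) -> float:
    """Set-based F1 after mapping each called tool name to its class label."""
    # Tools without a class entry count as their own singleton class.
    called_classes = {classes.get(name, name) for name in called}
    hits = len(called_classes & expected)
    if hits == 0:
        return 0.0
    precision = hits / len(called_classes)
    recall = hits / len(expected)
    return 2 * precision * recall / (precision + recall)

classes = {"get_forecast": "forecast"}  # one-member class for the real tool
print(selection_f1(["get_forecast"], {"forecast"}, classes))
print(selection_f1(["get_forecast_v2", "get_forecast"], {"forecast"}, classes))
```

Under this definition a clean first pick scores higher than a decoy probe followed by recovery, because the extra call lowers precision while recall stays at 1.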

Run it (note the provider key)

This is an agent run, so it dispatches a real model and needs a key for that model's provider:

ANTHROPIC_API_KEY=sk-ant-... mcptest run --config tests/tool-overload.yml

The runner detects the Anthropic family from the model id claude-sonnet-4-5 and reads ANTHROPIC_API_KEY. Swap the model for a gpt-, gemini-, or mistral- id and set the matching key (OPENAI_API_KEY, GEMINI_API_KEY, MISTRAL_API_KEY) to measure a different model. If the key is missing, the run does not error: it falls back to a deterministic stub, which does not exercise real tool selection, so the result is meaningless for this scenario. Set the key.
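
Since a missing key degrades silently instead of failing, a small pre-flight check in a wrapper script can catch it before you spend time reading meaningless results. The Python sketch below is not part of mcptest: the function names are hypothetical, and the "claude-" prefix is inferred from the claude-sonnet-4-5 example rather than stated as a rule.

```python
import os
from typing import Optional

# Prefix-to-key table from the paragraph above; "claude-" is an inference.
PROVIDER_KEYS = {
    "claude-": "ANTHROPIC_API_KEY",
    "gpt-": "OPENAI_API_KEY",
    "gemini-": "GEMINI_API_KEY",
    "mistral-": "MISTRAL_API_KEY",
}

def required_key(model_id: str) -> Optional[str]:
    """Return the env var name a model id would need, or None if unknown."""
    for prefix, env_var in PROVIDER_KEYS.items():
        if model_id.startswith(prefix):
            return env_var
    return None

def key_is_set(model_id: str, env=None) -> bool:
    """True when the matching provider key is present and non-empty."""
    env = os.environ if env is None else env
    name = required_key(model_id)
    return name is not None and bool(env.get(name))

if not key_is_set("claude-sonnet-4-5"):
    print("Provider key missing: the run would fall back to the stub.")
```

Passing env explicitly keeps the check easy to test without touching the real environment.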

To compare several models in one pass, list them under models: or run a --models sweep, and the report renders a per-model grid. See docs/models.md for the matrix form.

Sweep the catalog size (raise n)

The interesting result is not whether the model wins at n=8; it is how fast accuracy falls as the catalog grows. Run the same suite at several values of n and read the trend. The cleanest way is to keep n in a variable and override it per run:

servers:
  weather:
    url: "https://test.mcptest.sh/mcp?scenario=distractors&n=${n}"

variables:
  n:
    default: "8"

Then override n from the shell, once per catalog size:

# Walk the catalog from a clean baseline up to the clamp ceiling.
for n in 1 4 8 16 20; do
  echo "=== n=$n ==="
  ANTHROPIC_API_KEY=sk-ant-... mcptest run \
    --var n=$n \
    --config tests/tool-overload.yml
done

n=1 is the easy baseline: one real tool, one decoy. Each step up crowds the catalog with more look-alikes. A robust model holds its first-pick accuracy as n climbs to the 20 ceiling; a brittle one starts probing decoys and the tool_calls[0].name assertion begins to fail. Plot the pass rate against n and you have the accuracy-decay curve the tool-overload benchmarks report (see docs/distractor-tools.md for the background and the offline, model-free scoring variant).
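
Once you have recorded the per-run outcomes of the strict assertion for each n, turning them into a curve is simple arithmetic. The Python sketch below assumes you collected the outcomes yourself as lists of booleans; the function name is hypothetical and the sample values are made up purely to show the shape of the output, not measured results.

```python
def pass_rates(results: dict) -> dict:
    """Map each catalog size n to the fraction of runs that passed."""
    return {n: sum(runs) / len(runs) for n, runs in sorted(results.items()) if runs}

# Made-up outcomes for illustration only, not measured data.
example = {
    1: [True, True, True, True],
    8: [True, True, True, False],
    20: [True, False, False, True],
}
for n, rate in pass_rates(example).items():
    bar = "#" * int(rate * 20)  # crude text plot, 20 columns wide
    print(f"n={n:>2}  pass rate {rate:.2f}  {bar}")
```

With only four runs per point the curve is coarse; raising runs: narrows the steps between achievable pass rates.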

Expected output

A run with the key set, at n=8, where the model picks the real tool on every repeat:

mcptest run --config tests/tool-overload.yml

  PASS  forecast survives 8 decoys [claude-sonnet-4-5]    (4 runs, 2.9s)
          tool_calls[0].name      get_forecast
          tool_names              [get_forecast]
          tool_calls[0].args.city Sacramento

1 passed, 0 failed in 3.1s

When the model takes the bait, the strict first-pick assertion fails and names the decoy it chose:

  FAIL  forecast survives 8 decoys [claude-sonnet-4-5]    (4 runs, 3.4s)
          tool_calls[0].name: expected get_forecast, got get_forecast_v2

The decoy get_forecast_v2 is one of the synthesized look-alikes. The model called it first, and the server answered with -32601. The lenient tool_names form would still pass here if the model recovered and called the real get_forecast on a later turn, which is why the two assertions answer different questions.

Troubleshooting

Every run falls back to the stub. No provider key is set, so the runner used the deterministic stub instead of a real model. The stub does not exercise real tool selection. Set ANTHROPIC_API_KEY (or the key matching your model) and run again. mcptest run --print-config lists which provider env vars the suite expects, by name.
The first-pick assertion flakes between runs. Model tool choice is not perfectly repeatable. Raise runs: for a steadier signal, or switch the strict tool_calls[0].name assertion to the lenient tool_names contains-all form if you only care that the model eventually reached the real tool.
n seems capped. n is clamped to 1..20. A value of 0 or 50 is pulled to the nearest bound. The clamp is server-side, so the catalog you get is always within that range regardless of what you request.
The model never connects. The endpoint is an HTTP URL target. If the run hangs on connect, confirm the host is reachable from your network and consider --wait-for-ready so the run polls for the listener before sending any MCP request.
-32601 errors show up in the trace. That is the decoys behaving as designed: every decoy errors with method-not-found when called. Seeing one in tool_results means the model probed a decoy. Whether that is a failure depends on which assertion you gated on.