Distractor tools and tool-overload scoring

A real MCP deployment rarely presents the agent with exactly the tools a task needs. The candidate list is padded with irrelevant or near-duplicate tools, and a capable agent must still pick the correct one amid the noise. This page explains how to inject N distractor tools into a scenario and assert that the agent still selects correctly. The scoring is objective and runs no model: a choice is correct when its id is in the declared correct set, and a choice is a distractor when its id is one of the injected distractors.

This is the tool-overload setup from MCPAgentBench (arXiv:2512.24565) and MCP-Atlas (arXiv:2602.00933), which inject distractor tools and report selection accuracy as a function of distractor count.

Two distractor sources

The distractors: block declares how many distractors to inject and where they come from. There are two sources:

from: catalog draws from a bundled catalog of plausible-but-irrelevant tools (weather, currency, calendar, and such). Use it to pad the candidate list with tools that are realistic but clearly unrelated to the task.
from: near_duplicate synthesizes name-collision variants of the real tools named in of:. Each named tool yields several deterministic variants by casing, pluralization, and suffix (for example, a _v2 or _internal variant). Use it to test whether the agent can tell the real tool from a look-alike. The of: list is required when from: is near_duplicate.
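To make the near-duplicate idea concrete, here is a minimal Python sketch that derives look-alike names by casing, pluralization, and suffix. The function names variants_of and near_duplicates, the exact transformations, their order, and the round-robin interleaving across tools are all assumptions for illustration; the actual synthesizer may produce different variants.

```python
from itertools import chain, zip_longest


def variants_of(name, suffixes=("_v2", "_internal")):
    """Hypothetical helper: look-alike names for one real tool name."""
    # Pluralization variant: drop a trailing "s" or add one.
    plural = name[:-1] if name.endswith("s") else name + "s"
    # Casing variant first, then pluralization, then suffix variants.
    return [name.upper(), plural] + [name + suffix for suffix in suffixes]


def near_duplicates(names, count):
    """Hypothetical sketch: up to `count` look-alikes, same output for same input."""
    real = set(names)
    per_tool = [variants_of(name) for name in names]
    picked = []
    # Interleave the tools (assumption) so each one contributes look-alikes.
    for candidate in chain.from_iterable(zip_longest(*per_tool)):
        if candidate is None or candidate in real or candidate in picked:
            continue  # never emit a real tool name or a repeat
        picked.append(candidate)
    return picked[:count]


print(near_duplicates(["search_products", "get_product"], 3))
```

Because nothing here is random, the same of: list and count always give the same look-alikes, which is the property that keeps a run reproducible.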

Accuracy as a function of distractor count

The point of the setup is the degradation curve: selection accuracy tends to fall as the distractor count rises. Setting count: 0 gives the noise-free baseline. Raising count injects more look-alikes and irrelevant tools, and a robust agent holds its accuracy as the count climbs. Running the same scenario at several counts traces the accuracy-vs-distractor-count curve the benchmarks report.
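As a sketch of how such a curve could be assembled from recorded results, the Python below groups per-run accuracy percents by distractor count and averages them. The function name accuracy_curve, the choice of the mean as the aggregate, and the sample numbers are assumptions for illustration only; the numbers are made up and are not measured results.

```python
def accuracy_curve(runs):
    """Hypothetical helper: map each distractor count to its mean accuracy.

    `runs` is an iterable of (distractor_count, accuracy_percent) pairs,
    one pair per recorded run.
    """
    by_count = {}
    for count, accuracy in runs:
        by_count.setdefault(count, []).append(accuracy)
    # Sort by count so the result reads as a curve from baseline upward.
    return {count: sum(values) / len(values)
            for count, values in sorted(by_count.items())}


# Made-up sample data: two runs each at count 0, 4, and 8.
recorded = [(0, 100), (0, 100), (4, 100), (4, 50), (8, 50), (8, 0)]
for count, mean in accuracy_curve(recorded).items():
    print(f"count={count:>2}  mean accuracy={mean:.1f}")
```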

Serial vs parallel complexity

The optional complexity: tag stratifies the reporting by invocation complexity, the MCPAgentBench axis. It is descriptive metadata and does not change the accuracy math; it lets a report group degradation curves by complexity:

serial: the correct tools must be invoked in sequence because one depends on another (for example, search, then fetch the first result).
parallel: the correct tools are independent and can be invoked in one turn.

The accuracy rule

Given the agent's chosen tool ids, the set of correct ids, and the set of injected distractor ids, the scorer counts, per chosen id:

a correct hit when the id is in the correct set;
a distractor hit when the id is in the distractor set;
everything else is ignored (an id that is neither correct nor an injected distractor is out of scope for this metric).

Matching is exact membership of the chosen id; there is no judge model. Accuracy is the integer percent chose_correct * 100 / chose_in_scope, where chose_in_scope = chose_correct + chose_distractor. A run that chose nothing in scope scores 0, because it never selected a correct tool. The one exception is the vacuous case where no correct ids are declared, which scores 100. The block exposes two assertable targets:

distractors.accuracy: the selection accuracy percent, 0 to 100.
distractors.chose_distractor: how many injected distractors were chosen, the headline failure signal.

An empty or omitted expect: applies the default gate distractors.accuracy >= 50.
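The rule above is small enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not the engine's code: the function names are hypothetical, "integer percent" is read as truncating integer division, repeated ids in the chosen list are counted once per occurrence, and an id is checked against the correct set before the distractor set.

```python
def score_selection(chosen, correct, distractors):
    """Score one recorded selection by exact id membership (no model involved)."""
    correct_ids = set(correct)
    distractor_ids = set(distractors)

    chose_correct = 0
    chose_distractor = 0
    for tool_id in chosen:
        if tool_id in correct_ids:
            chose_correct += 1
        elif tool_id in distractor_ids:
            chose_distractor += 1
        # Any other id is out of scope for this metric and is ignored.

    chose_in_scope = chose_correct + chose_distractor
    if chose_in_scope == 0:
        # Nothing in scope: 0, unless no correct ids were declared (vacuous case).
        accuracy = 100 if not correct_ids else 0
    else:
        # Assumption: "integer percent" means truncating integer division.
        accuracy = chose_correct * 100 // chose_in_scope

    return {
        "distractors.accuracy": accuracy,
        "distractors.chose_distractor": chose_distractor,
    }


def passes_default_gate(result):
    """Default gate applied when expect: is empty or omitted."""
    return result["distractors.accuracy"] >= 50


# Illustrative ids: one correct pick, one distractor pick, one out-of-scope pick.
result = score_selection(
    chosen=["catalog.search_products", "catalog.search_products_v2", "other.tool"],
    correct=["catalog.search_products"],
    distractors=["catalog.search_products_v2", "catalog.SEARCH_PRODUCTS"],
)
print(result, passes_default_gate(result))
```

In this example the out-of-scope id other.tool does not affect the score: one correct hit and one distractor hit give an accuracy of 50, which just meets the default gate.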

Inline example suite

This suite comes from examples/distractor-tools/. A single mock server, catalog, exposes the real tools; the distractors: block pads the candidate list and asserts that the agent still picks the right one. The server is served by mcptest mock from a static manifest, so the run is deterministic and offline (see mcptest mock for the manifest format).

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  catalog:
    command: ["mcptest", "mock", "--tools-from", "./servers/catalog.yml"]

agents:
  - name: resists bundled distractors
    model: claude-sonnet-4-5
    servers: [catalog]
    runs: 4
    prompt: Find products that match the keyword "notebook".
    distractors:
      count: 4
      source:
        from: catalog
      correct: [catalog.search_products]
      complexity: serial
      expect:
        - target: distractors.accuracy
          matcher:
            schema: { minimum: 80 }
        - target: distractors.chose_distractor
          matcher:
            schema: { maximum: 0 }

A near-duplicate scenario derives look-alikes from the real tool names:

    distractors:
      count: 3
      source:
        from: near_duplicate
        of: [search_products, get_product]
      correct: [catalog.search_products, catalog.get_product]
      complexity: parallel

The referenced manifest servers/catalog.yml is a mock_server block:

mock_server:
  name: catalog
  tools:
    - name: search_products
      description: Search the product catalog by keyword.
      input_schema:
        type: object
        required: [query]
        properties:
          query: { type: string }
      response:
        content:
          - type: text
            text: "Products matching ${args.query}: sku-1, sku-2."
    - name: get_product
      description: Get one product by sku.
      input_schema:
        type: object
        required: [sku]
        properties:
          sku: { type: string }
      response:
        content:
          - type: text
            text: "Product ${args.sku}: in stock."

Each expect: item is the standard assertion shape: a target: paired with a matcher:. A floor on the accuracy percent is matcher: { schema: { minimum: N } }; a ceiling on the distractor count is matcher: { schema: { maximum: N } }.
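To show how a floor and a ceiling read against scored values, here is a small Python sketch. It assumes the minimum and maximum keywords are inclusive bounds, as they are in JSON Schema; the function name within_bounds and the observed values are hypothetical, and the real matcher may support more keywords than these two.

```python
def within_bounds(value, schema):
    """Hypothetical check covering only the minimum/maximum keywords."""
    # Assumption: both bounds are inclusive, as in JSON Schema.
    if "minimum" in schema and value < schema["minimum"]:
        return False
    if "maximum" in schema and value > schema["maximum"]:
        return False
    return True


# The two expectations from the example suite, as (target, schema) pairs.
expectations = [
    ("distractors.accuracy", {"minimum": 80}),
    ("distractors.chose_distractor", {"maximum": 0}),
]

# Made-up observed values for one scored run.
observed = {"distractors.accuracy": 80, "distractors.chose_distractor": 1}

for target, schema in expectations:
    verdict = "pass" if within_bounds(observed[target], schema) else "fail"
    print(target, verdict)
```

Under this reading, an accuracy of exactly 80 satisfies minimum: 80, while maximum: 0 fails as soon as a single distractor is chosen.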

Objective, no model

The selection scoring is exact membership of the chosen ids against the correct and distractor sets, so it is byte-stable and free of model calls. The scoring math, the bundled catalog, and the near-duplicate synthesizer are all exercised offline by the tests for the core distractors eval module.

The distractors: block is marked x-mcptest-status: preview in the schema. Today the engine scores a recorded selection; runtime injection of the distractors into the presented tool list is still preview-stage runner wiring.

See also

Tool-selection F1 via equal-function sets: a sibling objective selection gate that scores against capability classes.
Name-free discovery and orchestration diagnostics: the intent-only selection setup.
mcptest mock: the deterministic mock server the example targets.