
Choosing an agent scoring method

mcptest ships several deterministic ways to score how an agent uses an MCP server's tools. They overlap, so this page is a map: what each one measures, when to reach for it, the trade-offs, and the research it comes from. Every method here is objective (exact membership or sequence checks, no model in the loop) unless noted, so a fixed trace always scores the same.

All of these are agent-test gates: a runs: count drives N independent runs and the gate folds them into one pass or fail. Pick the smallest method that answers your question; layer a second one only when you need the extra angle.

[Figure: a terminal session in which mcptest eval --explain prints the scoring plan for a rubric suite (the criteria, thresholds, judge model, and judge-call count) without calling a provider or spending a token.]

At a glance

| Method | Block | Measures | Reach for it when |
| --- | --- | --- | --- |
| Selection F1 | equal_function_sets: | Did the agent call the right tools (by capability class), counting misses and extras | You care about precision and recall over a set of acceptable tools, not exact names |
| Distractor accuracy | distractors: | Can the agent still pick correctly with N irrelevant or near-duplicate tools in the list | You want robustness under tool overload |
| Certified accuracy floor | distractors: -> certified_lower | The guaranteed lower bound on accuracy at 95 percent confidence | You are making a release or safety claim, not a quick check |
| Trajectory validation | trajectory: | Did the recorded calls match an expected sequence under a chosen match mode | You know the call plan and want to assert it (strict, subsequence, subset, ...) |
| Trajectory axes | trajectory_axes: | Did the trace respect data-flow (consumer after producer) and ordering edges, ignoring the rest of the path | You care about which call depends on which, not the exact sequence |
| Golden-path efficiency | golden_path: | Did the agent reach the goal without wasted steps, backtracks, or repeats | You care about how efficiently the agent worked, not just whether it succeeded |
| Tool-edge coverage | tool_edges: | Were only allowed tools called, and were restricted ones avoided | You have an allow/deny policy or delegation edges to enforce |
| Name-free discovery | discovery: | Can the agent find the tool from intent alone, with no tool named in the prompt | You want to test discovery, not follow-the-name |
| Pass^k stability | stability: | How consistent the agent is across runs (weakest run, variance) plus cross-run reproducibility (tool-sequence similarity, argument consistency, early divergence) | Flakiness matters as much as the average; you care whether the runs took the same path |
| Model-graded | eval: rubric / llm-jury | Open-ended response quality a deterministic check cannot express | The output is prose or a judgment call |

A worked example

Each method is a block under an agent. They compose, so a single agent can carry several gates. This one layers the two most common: did the right call happen (trajectory:), and did it happen without wasted steps (golden_path:)?

agents:
  - name: weather lookup is correct and efficient
    model: claude-sonnet-4-5
    servers: [weather]
    prompt: What is the weather in Sacramento?
    max_turns: 3
    max_tokens: 256
    # Did the right call happen, in an acceptable order?
    trajectory:
      mode: subsequence
      calls:
        - name: get_weather
          args:
            subset: { city: Sacramento }
    # Did it happen without backtracks or repeats?
    golden_path:
      calls: [get_weather]

Record the run once with --record; every replay then scores the same trace offline and at no cost. The rest of this page is the map for picking which blocks to add.

The deterministic selection methods

Selection F1 (equal_function_sets:) scores tool choice against capability classes: any member of a class counts as the right call, so the agent is not punished for picking web.search over google.search when both satisfy the task. It reports precision, recall, and F1, so a miss (needed tool not called) and an extra (unneeded tool called) are both visible. Use it when "the right tool" is really "any tool in this set." From MSC-Bench.
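
The arithmetic behind precision, recall, and F1 over capability classes can be sketched in a few lines of Python. This is an illustration and not mcptest's implementation: the function name selection_f1, the counting rules (a called tool is correct if it belongs to any needed class; a class is covered if any member was called), and the tool names fetch.url and calendar.list are assumptions made for the example.

```python
def selection_f1(called, classes):
    """Score tool choice against capability classes (illustrative only).

    called:  tool names the agent invoked
    classes: one set of interchangeable tool names per needed capability
    """
    called = set(called)
    acceptable = set().union(*classes)
    # Assumption: a called tool is correct if it belongs to any needed class.
    correct_calls = called & acceptable
    # Assumption: a class is covered if at least one of its members was called.
    covered = [c for c in classes if called & c]
    precision = len(correct_calls) / len(called) if called else 0.0
    recall = len(covered) / len(classes) if classes else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical task: needs a search capability and a fetch capability.
classes = [{"web.search", "google.search"}, {"fetch.url"}]

# One extra (calendar.list) and one miss (no fetch tool called).
print(selection_f1(["web.search", "calendar.list"], classes))
```

In this example the extra call lowers precision and the missing fetch call lowers recall, so both kinds of error show up separately before F1 combines them.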

Distractor accuracy (distractors:) pads the candidate list with N irrelevant or near-duplicate tools and asks whether the agent still selects correctly. It is the tool-overload axis: a model that is perfect on a clean list can fall apart when the list is crowded. Use it to measure robustness, not just correctness. From MCPAgentBench (arXiv:2512.24565) and MCP-Atlas (arXiv:2602.00933). See Distractor tools and tool-overload scoring.

Certified accuracy floor (distractors.certified_lower) turns the distractor pass rate into a statistical guarantee. A point estimate hides tail risk; the Clopper-Pearson exact lower bound is the accuracy that the true rate exceeds with 95 percent confidence, given the runs you have. It is conservative by design and needs a real sample: a single perfect run certifies a floor of only about 5 percent, so raise runs: before asserting a high floor. Reach for it on a release or safety claim. From LLMCert-T (arXiv:2510.03992), which shows clean accuracy near 75 percent collapsing to a certified bound near 20 percent under distractor saturation. The exact bound is used rather than the normal-approximation (Wald) interval because Wald is badly miscalibrated near 0 and 100 percent, which is exactly where distraction pushes accuracy.
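
To make the bound concrete, here is a minimal standard-library sketch of a one-sided Clopper-Pearson lower bound next to the Wald lower bound. It assumes the one-sided form at alpha = 0.05, which is the form consistent with a single perfect run certifying about 5 percent. The function names are illustrative and are not mcptest's API.

```python
from math import comb, sqrt


def upper_tail(p, passes, runs):
    """P(X >= passes) for X ~ Binomial(runs, p)."""
    return sum(
        comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(passes, runs + 1)
    )


def certified_lower(passes, runs, alpha=0.05):
    """One-sided Clopper-Pearson lower bound on the true pass rate."""
    if passes == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    # Bisection: the upper tail is increasing in p, so find where it hits alpha.
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid, passes, runs) < alpha:
            lo = mid
        else:
            hi = mid
    return lo


def wald_lower(passes, runs, z=1.645):
    """Normal-approximation (Wald) lower bound, shown for comparison only."""
    p = passes / runs
    return p - z * sqrt(p * (1 - p) / runs)


# All runs pass: compare the two bounds as the sample grows.
for runs in (1, 10, 50):
    print(runs, round(certified_lower(runs, runs), 3), wald_lower(runs, runs))
```

When every run passes, the Wald bound is 1.0 at any sample size because its standard error collapses to zero, while the exact bound is 0.05 for one run and roughly 0.74 for ten. That gap is the miscalibration at the extremes described above.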

The trajectory methods

Trajectory validation (trajectory:) checks the recorded call sequence against an expected one under a match mode: strict (exact, in order), subsequence (in order, extras allowed), unordered, subset (no over-calling), or superset. Use it when you know the call plan and want to assert it. See Offline trace validation.
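
The match modes can be read as five small predicates over call names. The sketch below is one plausible reading, not mcptest's matcher: it compares tool names only (no argument matching), and the multiset interpretation of unordered, subset, and superset is an assumption based on the short glosses above.

```python
from collections import Counter


def is_subsequence(expected, actual):
    """True if expected appears in actual in order, with gaps allowed."""
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the first match.
    return all(name in remaining for name in expected)


def matches(mode, expected, actual):
    """Check a recorded call list against an expected one (names only)."""
    exp, act = Counter(expected), Counter(actual)
    if mode == "strict":       # exact, in order
        return actual == expected
    if mode == "subsequence":  # in order, extras allowed
        return is_subsequence(expected, actual)
    if mode == "unordered":    # assumed: same calls, any order
        return act == exp
    if mode == "subset":       # assumed: nothing beyond the expected calls
        return not (act - exp)
    if mode == "superset":     # assumed: every expected call present
        return not (exp - act)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical trace with one repeated call.
expected = ["search", "fetch"]
actual = ["search", "search", "fetch"]
for mode in ("strict", "subsequence", "unordered", "subset", "superset"):
    print(mode, matches(mode, expected, actual))
```

With the repeated search call, only subsequence and superset pass in this sketch, which shows how the choice of mode decides whether an extra call counts as a mismatch.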

Trajectory axes (trajectory_axes:) drops the call-list pin and scores only two ordering constraints: dependency satisfaction (a consumer ran after the producer whose output it reads) and order satisfaction (one call preceded another). Each is reported as a percentage from 0 to 100 over the edges you declare. Use it when the agent may legitimately add steps, retry, or reorder unrelated calls, and the only thing that must hold is the data-flow: fetch after the search that found the URL, authenticate before any query. Where trajectory: would flag those extra or reordered steps as mismatches, trajectory_axes: passes any order-respecting trace. See Offline trace validation.
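
As a rough illustration of scoring edges instead of a full call list, the sketch below computes the percentage of declared (earlier, later) edges that a trace respects. The satisfaction rule is an assumption: an edge counts only when both tools appear and the first call of the earlier tool precedes the first call of the later one. The function name and tool names are hypothetical.

```python
def edge_satisfaction(trace, edges):
    """Percent of (earlier, later) edges that the trace respects."""
    if not edges:
        return 100.0  # assumption: no declared edges means nothing to violate
    first = {}
    for index, name in enumerate(trace):
        first.setdefault(name, index)  # remember each tool's first call
    satisfied = sum(
        1
        for earlier, later in edges
        if earlier in first and later in first and first[earlier] < first[later]
    )
    return 100.0 * satisfied / len(edges)


# Hypothetical trace with an unrelated step and a retry.
trace = ["authenticate", "search", "log", "fetch", "fetch"]
dependency_edges = [("search", "fetch")]
order_edges = [("authenticate", "search"), ("authenticate", "fetch")]

print(edge_satisfaction(trace, dependency_edges))
print(edge_satisfaction(trace, order_edges))
print(edge_satisfaction(["fetch", "search"], dependency_edges))
```

The extra log call and the repeated fetch do not affect either score here, whereas the reversed trace in the last line fails its only dependency edge.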

Golden-path efficiency (golden_path:) scores the same trace for waste: extra steps beyond the ideal path, backtracks, and repeated tools, folded into a penalty multiplier. Trajectory asks "did the right calls happen?"; golden-path asks "did they happen efficiently?" Layer it on trajectory when the cost or latency of the agent's exploration matters.

Tool-edge coverage (tool_edges:) enforces an allow/deny policy: which tools may be called, which are restricted, and which delegation hand-offs are allowed. Use it for a least-privilege or trust-boundary assertion rather than a correctness one. See Tool-edge coverage.
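
The allow/deny part of such a policy amounts to a membership check per call. The sketch below is a simplified illustration with hypothetical names; it leaves out delegation hand-offs entirely and does not reflect the tool_edges: schema.

```python
def policy_violations(trace, allowed, restricted):
    """Return the calls that break an allow/deny policy, in trace order."""
    violations = []
    for name in trace:
        if name in restricted:
            violations.append((name, "restricted"))
        elif name not in allowed:
            violations.append((name, "not allowed"))
    return violations


# Hypothetical least-privilege policy for a read-only agent.
allowed = {"get_weather", "search"}
restricted = {"delete_record"}

print(policy_violations(["get_weather", "delete_record", "send_email"],
                        allowed, restricted))
```

An empty result means the trace stayed inside the policy; any entry names the offending call and why it was flagged.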

When a deterministic check is not enough

The methods above are exact, so prefer them whenever the thing you are checking can be expressed as membership or sequence. When the output is open-ended (prose quality, a judgment call, faithfulness), reach for the model-graded surface: an eval: rubric, an llm-judge matcher, or an llm-jury panel. Those cost provider calls and carry model variance, so cap and stabilize them (runs:, a jury quorum, and a jury max_cost) and keep a deterministic floor alongside where you can.