Tool description quality

Why this exists: agents pay tokens for every tool description they read, and bad descriptions silently degrade tool-selection accuracy. The research backing this is in docs/research-references.md (Hasan et al., arXiv:2602.14878; and arXiv:2602.18914).

There are two surfaces. The tool_quality: block gates a server's catalog in a test suite, with one PASS/FAIL row and a non-zero exit when the bar is missed. The mcptest doctor --lint-descriptions lint emits per-tool findings for an interactive read. They share the same heuristics.

Gate on it: the `tool_quality:` block

A top-level tool_quality: list connects to a server, scores every tool description with the deterministic TDQS heuristics, and gates on the result:

servers:
  filesystem:
    command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]

tool_quality:
  - name: "tool descriptions meet the quality bar"
    server: filesystem
    expect:                          # optional; defaults apply if omitted
      - target: min_score
        matcher: { schema: { minimum: 0.50 } }
      - target: mean_score
        matcher: { schema: { minimum: 0.70 } }
      # fail on any research-backed critical finding (the default, made explicit)
      - target: critical_count
        matcher: { schema: { maximum: 0 } }
      # power user: gate one tool by name
      - target: tool["read_file"].score
        matcher: { schema: { minimum: 0.60 } }

The targets:

min_score: the worst tool's score, 0..1. One badly written tool drags the catalog, so this is the "no tool falls below the floor" check.
mean_score: the average tool score, 0..1.
tool["<name>"].score: one named tool's score, for gating a critical tool harder than the catalog average.
critical_count: how many DESC-NNN findings landed at Critical severity across the catalog. A critical means an agent will likely struggle to use the tool reliably, and one bad tool can carry a critical while the averaged scores still look fine, so this is the "no tool trips a research-backed critical rule" check.
warning_count: how many DESC-NNN findings landed at Warning severity. Reported for tuning; it is not in the default gate, since warnings are workable.

Omit expect: and the engine applies its defaults: min_score >= 0.50, mean_score >= 0.70, and critical_count <= 0 (any critical finding fails the gate). Each entry emits one PASS/FAIL row, and a failing entry exits non-zero like any other test. An unreachable server is a run error, the same as for a tool test.

critical_count and mean_score watch different things. The scores are an averaged 0..1 signal over six heuristics, so a single tool that duplicates its name (DESC-003) or ships a one-word description (DESC-001) can still leave the mean above 0.70. critical_count catches that tool by severity rather than by average, which is why it is in the default gate. Together, the score floors and the critical count fail both a quietly mediocre catalog and a catalog with a single research-backed critical.
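The interaction between the averaged scores and the critical count can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's code: the function name, the dictionary layout, and the sample scores are hypothetical, and only the three default thresholds come from the text above.

```python
from statistics import mean

# Default thresholds as stated in the text; the names here are hypothetical.
DEFAULTS = {"min_score": 0.50, "mean_score": 0.70, "critical_count": 0}


def evaluate_gate(scores, critical_count, thresholds=DEFAULTS):
    """Return (passed, failed_targets) for per-tool scores in 0..1.

    Assumes `scores` is a non-empty mapping of tool name to score.
    """
    failed = []
    if min(scores.values()) < thresholds["min_score"]:
        failed.append("min_score")
    if mean(scores.values()) < thresholds["mean_score"]:
        failed.append("mean_score")
    if critical_count > thresholds["critical_count"]:
        failed.append("critical_count")
    return (not failed, failed)


# Illustrative numbers only: three strong tools and one weak one.
scores = {"read_file": 0.90, "write_file": 0.85, "list_dir": 0.90, "x": 0.55}

# The mean (0.80) and the minimum (0.55) both clear their floors,
# so only the critical count can fail this catalog.
passed, failed = evaluate_gate(scores, critical_count=1)
```

With these numbers the score floors alone would pass the catalog; the gate fails only because of the single critical finding.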

The scoring is deterministic (description presence and length, parameter documentation, conciseness, return-format and annotation signals), so a CI gate is reproducible. The LLM quality advisory is intentionally out of scope for the gate; it needs an API key, and a gate should stay deterministic.

The worked suite is examples/tool-quality.yml.

The lint: `mcptest doctor --lint-descriptions`

For an interactive read rather than a gate, the doctor lint runs thirteen rules against the tool catalog returned by tools/list and emits findings. Each finding has a severity (Pass, Warning, Critical), a stable rule ID, a message, and an optional suggestion.

The thirteen rules

Rule ID	What it catches
DESC-001	Description is empty or under 20 chars
DESC-002	Description over 500 chars (probably missing inputSchema constraints)
DESC-003	Description equals the tool name (no signal)
DESC-004	Description contains no common verb (agent cannot categorize)
DESC-005	Description uses positional phrases ("see above", "previous tool") that mean nothing in a flat catalog
DESC-006	A required argument has no description
DESC-007	An enum-typed argument has a description but does not mention the enum values
DESC-008	An argument description is longer than the tool description (inverted information density)
DESC-009	A tool with a non-trivial input schema provides no usage examples
DESC-010	The description states an action but never says what the tool returns
DESC-011	An annotation hint (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) is present with a non-boolean value
DESC-012	A tool declares no annotations object at all
DESC-013	A string argument lists its allowed values in the description but declares no `enum`
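As a rough illustration of the simplest rows in the table, the sketch below checks DESC-001, DESC-002, and DESC-003 on a tool object shaped like a tools/list entry. The function name is hypothetical, and the whitespace trimming and case-insensitive name comparison are assumptions rather than documented behaviour.

```python
def tool_level_rules(tool):
    """Return the IDs of the length and name rules that fire for one tool."""
    desc = (tool.get("description") or "").strip()  # trimming is an assumption
    fired = []
    if len(desc) < 20:
        fired.append("DESC-001")  # empty or under 20 chars
    if len(desc) > 500:
        fired.append("DESC-002")  # over 500 chars
    # Case-insensitive comparison is an assumption of this sketch.
    if desc and desc.lower() == (tool.get("name") or "").lower():
        fired.append("DESC-003")  # description equals the tool name
    return fired


short = {"name": "search", "description": "search"}
good = {
    "name": "read_file",
    "description": "Read a UTF-8 text file and return its contents.",
}
```

In this sketch a description that merely repeats a short tool name trips both DESC-001 and DESC-003; whether the real lint reports both is not stated in the text.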

Severities:

Critical: violates a research-backed quality rule. The agent will likely struggle to use this tool reliably.
Warning: imperfect but workable. Tool-selection accuracy may suffer at the margins.
Pass: meets the quality bar for that rule.

A clean tool (no rule fires) gets one synthetic Pass finding so the report always represents every checked tool.
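The shape of a finding and the synthetic Pass can be sketched as follows. The class, field, and function names are hypothetical, and the placeholder rule ID used for the synthetic Pass is an assumption, since the text does not say which ID it carries.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    PASS = "Pass"
    WARNING = "Warning"
    CRITICAL = "Critical"


@dataclass
class Finding:
    tool: str
    severity: Severity
    rule_id: str
    message: str
    suggestion: Optional[str] = None


def report_for(tool_name, findings):
    """Return the tool's findings, or one synthetic Pass if no rule fired."""
    if findings:
        return list(findings)
    # "(none)" is a placeholder; the real rule ID for a Pass is not documented here.
    return [Finding(tool_name, Severity.PASS, "(none)", "no rule fired")]


def count_by_severity(findings, severity):
    """Count findings at one severity, as critical_count and warning_count do."""
    return sum(1 for f in findings if f.severity is severity)
```

Emitting a Pass for clean tools means a report can be read as a complete roster: a tool missing from the output was not checked, rather than silently clean.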

DESC-009: usage examples for non-trivial parameters

Anthropic's advanced-tool-use guidance reports that adding worked examples raised accuracy from 72% to 90% on complex parameter handling, and recommends 1-5 concise examples per tool. DESC-009 fires (Warning) when a tool has a non-trivial input schema yet ships no examples.

A schema counts as non-trivial when it has more than one property, or a single property that is required or not a plain string. Trivial cases (one optional free-text string) are exempt because a worked example adds little there.

The rule treats a tool as documented if it carries an examples array on the tool object, or if any parameter declares an examples array, a singular example, or a default value an agent can copy. Accepting all of these keeps false positives low: a tool that documents even one parameter by example passes.
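The two conditions above (non-trivial schema, no examples anywhere) can be sketched as a pair of predicates. This is a minimal sketch, not the lint's implementation: the function names are hypothetical, "plain string" is read here as type "string", a schema with no properties is treated as trivial, and an empty examples array is treated as no examples.

```python
def is_non_trivial(schema):
    """More than one property, or a single one that is required or not a string."""
    schema = schema or {}
    props = schema.get("properties") or {}
    if len(props) > 1:
        return True
    if len(props) == 1:
        (name, prop), = props.items()
        required = name in (schema.get("required") or [])
        return required or prop.get("type") != "string"
    return False  # assumption: no properties counts as trivial


def has_examples(tool):
    """True if the tool or any parameter carries something an agent can copy."""
    if tool.get("examples"):
        return True
    props = (tool.get("inputSchema") or {}).get("properties") or {}
    return any(
        bool(p.get("examples")) or "example" in p or "default" in p
        for p in props.values()
    )


def desc_009_fires(tool):
    return is_non_trivial(tool.get("inputSchema")) and not has_examples(tool)


two_params = {"inputSchema": {"properties": {
    "query": {"type": "string"},
    "limit": {"type": "integer"},
}}}
one_optional = {"inputSchema": {"properties": {"note": {"type": "string"}}}}
with_default = {"inputSchema": {"properties": {
    "query": {"type": "string"},
    "limit": {"type": "integer", "default": 10},
}}}
```

Here two_params trips the rule, one_optional is exempt as trivial, and with_default passes because a single copyable default is enough.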

DESC-010: documenting the return format

A description like "Execute order query" or "Run the search" tells the agent what the tool does but not what it gets back, so the agent cannot plan the next step or shape the call. Good descriptions say what comes back, for example "Returns a list of order objects, each with id, total, and status."

DESC-010 fires (Warning) when a non-empty description mentions no return/output/result keyword and the tool declares no outputSchema. The heuristic is deliberately coarse: it can tell whether a description gestures at a return shape at all, but not whether the documented shape is accurate or complete. That limit is why it stays at Warning rather than Critical, keeping false positives cheap. Empty descriptions are left to DESC-001 so the two rules do not double-report.
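A keyword check of this kind might look like the sketch below. The function name and the exact pattern are assumptions: the text names return, output, and result as keywords but does not give the matching rule, so this version matches any word starting with one of those stems.

```python
import re

# Assumed pattern: any word beginning with "return", "output", or "result".
RETURN_WORDS = re.compile(r"\b(?:return|output|result)\w*", re.IGNORECASE)


def desc_010_fires(tool):
    desc = (tool.get("description") or "").strip()
    if not desc:
        return False  # empty descriptions are left to DESC-001
    if tool.get("outputSchema") is not None:
        return False  # a declared outputSchema documents the return shape
    return RETURN_WORDS.search(desc) is None
```

The coarseness is visible here: a description that says "no result is guaranteed" would pass, which is the kind of miss that keeps the rule at Warning.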

DESC-011: malformed annotation hints

The MCP spec lets a tool carry an annotations object with four optional boolean hints: readOnlyHint, destructiveHint, idempotentHint, and openWorldHint. A client reads these to present and gate the tool, for example to skip a confirmation prompt on a read-only call or to warn before a destructive one. Every hint must be a JSON boolean.

DESC-011 fires (Warning) when one of those hints is present but holds a non-boolean value, such as the string "yes" or a number. This is a conformance check, not a heuristic: it inspects only the JSON type, so it never guesses. It ignores tools that declare no annotations (that case belongs to DESC-012) and it ignores hint fields that are absent, since the spec marks every hint optional.
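Because the rule only inspects JSON types, it reduces to a small type check. The sketch below assumes the tool has been parsed from JSON into Python dictionaries; the function name is hypothetical, and treating a JSON null as a non-boolean value is an assumption.

```python
HINTS = ("readOnlyHint", "destructiveHint", "idempotentHint", "openWorldHint")


def desc_011_bad_hints(tool):
    """Return the hint names that are present but not JSON booleans."""
    annotations = tool.get("annotations")
    if not isinstance(annotations, dict):
        return []  # no annotations object: that case belongs to DESC-012
    return [
        hint
        for hint in HINTS
        # isinstance(1, bool) is False in Python, so numbers are caught too.
        if hint in annotations and not isinstance(annotations[hint], bool)
    ]


tool = {
    "name": "delete_row",
    "annotations": {
        "readOnlyHint": "yes",     # string, flagged
        "destructiveHint": False,  # valid boolean
        "idempotentHint": 1,       # number, flagged
    },
}
```

Absent hints such as openWorldHint in this example are skipped, matching the spec's treatment of every hint as optional.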

DESC-012: missing annotations

Annotations help a client present and gate a tool. Without them the client cannot tell whether a call is read-only, destructive, or idempotent, so it falls back to treating every tool the same. DESC-012 fires (Warning) when a tool declares no annotations object.

The check is intentionally shallow. It cannot judge whether the hints would have been accurate, only that none are declared, and it does not require any particular hint, since not every tool needs every hint. An annotations object that is present, even an empty one, passes. That is why the rule stays at Warning rather than Critical.
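In code the check is correspondingly small. A minimal sketch, assuming the tool is a parsed JSON object; the function name is hypothetical, and treating a null annotations value as missing is an assumption.

```python
def desc_012_fires(tool):
    """Fire only when the tool declares no annotations object at all."""
    # An empty object still counts as declared, so it passes.
    return not isinstance(tool.get("annotations"), dict)
```

A tool with `"annotations": {}` passes this check, and one with malformed hint values also passes it, since that case is DESC-011's job.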

DESC-013: choices in prose without an enum

When a string argument's description spells out a fixed set of allowed values (for example "the new status, one of open, closed, or pending") but the schema leaves the argument a free string with no enum, the model has to reproduce a valid value from prose. That is a common source of invalid tool calls, since the client cannot constrain or validate the value. DESC-013 fires (Warning) on a string (or untyped) argument whose schema declares no enum and whose description contains choice language such as "one of", "allowed values", "valid values", "options are", "either", or an `a | b` pipe list.

{
  "name": "set_status",
  "inputSchema": {
    "type": "object",
    "properties": {
      "status": {
        "type": "string",
        "description": "The new status, one of open, closed, or pending"
      }
    }
  }
}

The fix is to add an enum: ["open", "closed", "pending"] so the value is constrained. DESC-013 is the inverse of DESC-007, which flags an enum whose values are not documented in the description. The heuristic is deliberately conservative to avoid flagging prose that merely lists examples; if the value set is genuinely open-ended, leave the argument a free string and the rule does not fire.
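The detection described in this section can be approximated with one regular expression over each argument description. This is a sketch under assumptions: the function name and the exact pattern are hypothetical, built only from the choice phrases listed above, and the real heuristic may be stricter.

```python
import re

# Assumed pattern built from the choice phrases named in the text.
CHOICE_LANGUAGE = re.compile(
    r"\bone of\b|\ballowed values\b|\bvalid values\b|\boptions are\b"
    r"|\beither\b|\w+\s*\|\s*\w+",
    re.IGNORECASE,
)


def desc_013_args(tool):
    """Return names of string or untyped arguments that list choices without an enum."""
    props = (tool.get("inputSchema") or {}).get("properties") or {}
    flagged = []
    for name, prop in props.items():
        if prop.get("type", "string") != "string":  # untyped is treated as string
            continue
        if "enum" in prop:
            continue
        if CHOICE_LANGUAGE.search(prop.get("description") or ""):
            flagged.append(name)
    return flagged


set_status = {
    "name": "set_status",
    "inputSchema": {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "description": "The new status, one of open, closed, or pending",
            }
        },
    },
}

# The same tool after the fix: the enum constrains the value.
fixed = {
    "name": "set_status",
    "inputSchema": {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "enum": ["open", "closed", "pending"],
                "description": "The new status, one of open, closed, or pending",
            }
        },
    },
}
```

The set_status example from this section is flagged on its status argument, and the fixed variant is not.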

Gate vs lint

The tool_quality: block and the doctor lint share the same heuristics but serve different jobs. Use tool_quality: in a suite to gate the build: it produces a single score per tool and fails when a tool or the catalog average falls below the floor, or when any critical finding is present. Use mcptest doctor --lint-descriptions interactively to see which specific rule each tool tripped, so you know what to fix. Fix the findings the lint reports, then watch the tool_quality: scores climb.

References

: the original eight rules.
: DESC-009 (parameter examples) and DESC-010 (return format).
: DESC-011 (malformed annotation hints) and DESC-012 (missing annotations).
: the TDQS scoring behind the tool_quality: block.
: the tool_quality: check block.
: the critical_count / warning_count lint-finding targets on tool_quality:.
: tool catalog token cost (sibling doctor check; cost vs quality are the two axes).
: research bibliography.
arXiv:2602.14878 (Hasan et al., "MCP Tool Descriptions Are Smelly!").