Oracle-free robustness: an end-to-end walkthrough
Most testing checks one thing: given these arguments, expect this response. That is a golden-output oracle, and it covers the inputs you can write an answer for. It says nothing about the long tail of inputs you cannot, the bad inputs a real client will send, or the non-deterministic probes that never settle cleanly. This guide walks one server through the whole oracle-free robustness surface, in order, so the pieces read as one workflow.
Everything here runs offline against the bundled mock, with no model and no key. The umbrella suite is examples/robustness-walkthrough/suite.yml; each feature also has its own focused example and reference doc, linked inline.
The server under test
The mock exposes one search tool with a constrained schema and a static result set, so every gate below has a clean answer:
servers:
  catalog:
    command: ["mcptest", "mock", "--tools-from", "./server.yml"]
1. The golden-output floor
Start with what you know. For the queries you can write an answer for, assert the shape of the result.
tools:
  - name: search is correct, self-consistent, and robust
    server: catalog
    tool: search
    args: { query: "anthropic", limit: 10 }
    expect:
      - target: result.content
        matcher: { schema: { type: array, minItems: 1 } }
This is the floor, not the ceiling. It covers one input. The sections that follow cover what it cannot.
2. Metamorphic relations (no golden output)
A search has no stable golden output, but it still has properties that must hold: calling it twice returns the same result, reordering filters or changing the query's letter case changes nothing, and raising a limit never shrinks the result set. Assert the relation, not the answer. See metamorphic-testing.md.
metamorphic:
  relations:
    - idempotent
    - arg_order_insensitive
    - case_insensitive
    - { monotone: { arg: limit } }
A violation is a reproducible disagreement between two calls, so the gate fails without a flake budget. The targets are metamorphic.relations_checked, metamorphic.violations, and metamorphic.gate_passed.
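To make "assert the relation, not the answer" concrete, here is a minimal Python sketch of three of the relations checked against a stand-in tool. It is an illustration only, not mcptest's implementation: the `search` function, the `CATALOG` data, and the `check_*` helper names are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical static catalog standing in for the mock server's result set.
CATALOG = ["Anthropic SDK", "anthropic cookbook", "MCP spec", "Anthropic quickstarts"]


def search(query: str, limit: int) -> List[str]:
    """Hypothetical search tool: case-insensitive substring match, capped at limit."""
    hits = [item for item in CATALOG if query.lower() in item.lower()]
    return hits[:limit]


def check_idempotent(tool: Callable[..., List[str]], args: dict) -> bool:
    # Two identical calls must return the same result.
    return tool(**args) == tool(**args)


def check_case_insensitive(tool: Callable[..., List[str]], args: dict, arg: str = "query") -> bool:
    # Swapping the letter case of one argument must not change the result.
    swapped = dict(args, **{arg: args[arg].swapcase()})
    return tool(**args) == tool(**swapped)


def check_monotone(tool: Callable[..., List[str]], args: dict, arg: str = "limit") -> bool:
    # Raising the limit must never shrink the result set.
    larger = dict(args, **{arg: args[arg] + 1})
    return len(tool(**larger)) >= len(tool(**args))


def run_relations(tool: Callable[..., List[str]], args: dict) -> Dict[str, bool]:
    return {
        "idempotent": check_idempotent(tool, args),
        "case_insensitive": check_case_insensitive(tool, args),
        "monotone": check_monotone(tool, args),
    }


results = run_relations(search, {"query": "anthropic", "limit": 2})
violations = [name for name, ok in results.items() if not ok]
print(results, violations)
```

No expected result set appears anywhere in the sketch. Each check compares two calls with each other, which is why a violation reproduces without a golden answer.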
3. Input fuzzing (bad input fails cleanly)
A robust server survives bad arguments. The fuzzer derives malformed cases from the tool's inputSchema (omit a required field, wrong-type a field, oversize a string, add an unexpected field) and checks each fails cleanly: a JSON-RPC error or a valid result, never a crash, a hang, or a leaked stack trace. The cases are seeded, so the run is reproducible. See input-fuzzing.md.
fuzz:
  seed: 1729
  cases: 32
Targets: fuzz.cases_run, fuzz.crashes, fuzz.hangs, fuzz.protocol_violations, fuzz.leaks. To fuzz every tool a server exposes without writing a suite, use mcptest fuzz --server-command "...".
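The four mutation kinds named above can be sketched in a few lines of Python. This is a hedged illustration of schema-derived, seeded case generation, not the fuzzer mcptest ships: the `SCHEMA` contents, the `derive_cases` function, and the mutation labels are assumptions made for the example.

```python
import random
from typing import Any, Dict, List

# Hypothetical inputSchema for the search tool (the real one lives in server.yml).
SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "maxLength": 64},
        "limit": {"type": "integer", "minimum": 1, "maximum": 50},
    },
    "required": ["query"],
    "additionalProperties": False,
}
VALID_ARGS: Dict[str, Any] = {"query": "anthropic", "limit": 10}

# A deliberately wrong value for each JSON Schema type.
WRONG_TYPE = {"string": 12345, "integer": "ten", "number": "1.5",
              "boolean": "yes", "array": {}, "object": []}


def derive_cases(schema: Dict[str, Any], valid: Dict[str, Any],
                 seed: int, cases: int) -> List[Dict[str, Any]]:
    rng = random.Random(seed)  # seeded, so the same seed yields the same cases
    props = schema.get("properties", {})
    names = sorted(props)
    required = sorted(schema.get("required", []))
    strings = [n for n in names if props[n].get("type") == "string"]

    # Only offer mutations the schema can actually support.
    kinds = ["extra_field"]
    if names:
        kinds.append("wrong_type")
    if required:
        kinds.append("omit_required")
    if strings:
        kinds.append("oversize_string")

    out = []
    for _ in range(cases):
        kind = rng.choice(kinds)
        args = dict(valid)
        if kind == "omit_required":
            args.pop(rng.choice(required), None)
        elif kind == "wrong_type":
            name = rng.choice(names)
            args[name] = WRONG_TYPE.get(props[name].get("type"))
        elif kind == "oversize_string":
            args[rng.choice(strings)] = "A" * 100_000
        else:  # extra_field
            args["unexpected_field"] = True
        out.append({"kind": kind, "args": args})
    return out


cases = derive_cases(SCHEMA, VALID_ARGS, seed=1729, cases=32)
print(len(cases))
```

Each generated case would then be sent to the tool, and the response classified as a clean failure, a crash, a hang, or a leak. That step is omitted here because it depends on the transport.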
4. Negative-path conformance (clean rejection)
Where fuzzing sweeps broadly for crashes, negative-path conformance checks specific, taxonomy-keyed bad requests and asserts the server rejects each with a proper JSON-RPC error rather than accepting it silently. See negative-path.md.
negative_path:
  checks: [unknown_tool, missing_required]
Each probe maps to a published MCP fault-taxonomy id, so a finding points back to the literature. The targets are negative_path.checks_run, negative_path.failures, and negative_path.gate_passed.
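What counts as "a proper JSON-RPC error" can be shown with a small Python sketch. It relies only on the JSON-RPC 2.0 response shape (an `error` object with an integer `code` and a string `message`, and no `result`). The function name, the sample responses, and the specific error code and message are illustrative assumptions, not output captured from mcptest or the mock.

```python
from typing import Any, Dict


def is_clean_rejection(response: Dict[str, Any]) -> bool:
    """True if the response is a well-formed JSON-RPC 2.0 error."""
    if response.get("jsonrpc") != "2.0":
        return False
    if "result" in response:
        return False  # the server accepted the bad request
    error = response.get("error")
    if not isinstance(error, dict):
        return False
    code = error.get("code")
    # bool is a subclass of int in Python, so exclude it explicitly.
    code_ok = isinstance(code, int) and not isinstance(code, bool)
    return code_ok and isinstance(error.get("message"), str)


# Illustrative responses to two bad requests (contents are assumptions).
probes = {
    "unknown_tool": {
        "jsonrpc": "2.0", "id": 1,
        "error": {"code": -32602, "message": "Unknown tool: no_such_tool"},
    },
    "missing_required": {
        "jsonrpc": "2.0", "id": 2,
        "result": {"content": []},  # silently accepted: this is the failure mode
    },
}

failures = [name for name, resp in probes.items() if not is_clean_rejection(resp)]
print(failures)  # ['missing_required']
```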
Run the four blocks above (expect, metamorphic, fuzz, and negative_path) together:
$ mcptest run --config examples/robustness-walkthrough/suite.yml
[PASS] search is correct, self-consistent, and robust
Summary: 1 passed, 0 failed, 0 skipped
5. Catch it earlier: the strict schema lint
The fuzzer and the negative-path probes find under-constrained schemas at runtime. The schema lint finds them statically, which is cheaper. It flags an untyped property (SCH-003, critical), a missing required (SCH-001), an open additionalProperties (SCH-002), and unbounded strings or arrays (SCH-004), and ships an autofix that tightens the schema. See tool-schema-lint.md.
tool_quality:
  - name: tool schemas are well constrained
    server: catalog
    expect:
      - target: schema_criticals
        matcher: { schema: { maximum: 0 } }
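The four rules can be approximated in Python to show what a static pass looks for. This is a rough sketch: the rule ids come from the paragraph above, but the exact conditions each rule applies in mcptest are not documented here, so the checks below are assumptions. The `lint_schema` function and both sample schemas are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def lint_schema(schema: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (rule_id, message) findings for one tool inputSchema."""
    findings: List[Tuple[str, str]] = []
    props = schema.get("properties", {})

    if props and not schema.get("required"):
        findings.append(("SCH-001", "no required list"))
    # JSON Schema treats a missing additionalProperties as open.
    if schema.get("additionalProperties", True) is True:
        findings.append(("SCH-002", "additionalProperties is open"))

    for name, spec in sorted(props.items()):
        if "type" not in spec:
            findings.append(("SCH-003", f"{name}: untyped property"))
        elif spec["type"] == "string" and "maxLength" not in spec:
            findings.append(("SCH-004", f"{name}: unbounded string"))
        elif spec["type"] == "array" and "maxItems" not in spec:
            findings.append(("SCH-004", f"{name}: unbounded array"))
    return findings


loose = {"type": "object", "properties": {"query": {}, "tag": {"type": "string"}}}
tight = {
    "type": "object",
    "properties": {"query": {"type": "string", "maxLength": 64}},
    "required": ["query"],
    "additionalProperties": False,
}

loose_findings = lint_schema(loose)
# Per the text above, SCH-003 is the critical rule.
schema_criticals = sum(1 for rule, _ in loose_findings if rule == "SCH-003")
print([rule for rule, _ in loose_findings], schema_criticals)
print(lint_schema(tight))  # []
```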
6. Did the run stay in bounds: tool-edge coverage
For an agent run, structural tool-edge coverage gates the tools the agent must exercise (allowed), the tools it must never call (restricted), and the hand-offs it should perform (delegation). A restricted-tool attempt fails the gate and is a security signal. See tool-edge-coverage.md.
agents:
  - name: triage agent stays within its allowed tools
    model: claude-sonnet-4-5
    servers: [catalog]
    prompt: Find and summarize the open issues.
    tool_edges:
      allowed: [search, summarize]
      restricted: [delete_repo]
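The gate logic for the allowed and restricted lists reduces to set arithmetic over the tools an agent actually called. The Python sketch below assumes the gate fails on any restricted attempt or any allowed tool left unexercised. That reading, the `gate_tool_edges` name, and the returned field names are assumptions. Delegation edges and tools that appear in neither list are left out.

```python
from typing import Any, Dict, List


def gate_tool_edges(calls: List[str], allowed: List[str],
                    restricted: List[str]) -> Dict[str, Any]:
    """Compare the tools an agent called against its declared edges."""
    called = set(calls)
    restricted_attempts = sorted(called & set(restricted))
    unexercised = sorted(set(allowed) - called)
    return {
        "restricted_attempts": restricted_attempts,  # security signal
        "unexercised": unexercised,                  # coverage gap
        "gate_passed": not restricted_attempts and not unexercised,
    }


ALLOWED = ["search", "summarize"]
RESTRICTED = ["delete_repo"]

in_bounds = gate_tool_edges(["search", "search", "summarize"], ALLOWED, RESTRICTED)
out_of_bounds = gate_tool_edges(["search", "delete_repo"], ALLOWED, RESTRICTED)
print(in_bounds["gate_passed"])   # True
print(out_of_bounds)
```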
The third verdict: inconclusive
Several of these blocks can run a probe N times and land in a genuinely undecided state. The run_options.inconclusive policy reports that as INCONCLUSIVE rather than forcing a pass or a fail. By default it is neutral (it does not fail the build) and surfaces in the summary. See verdicts.md.
run_options:
  inconclusive:
    on: true
    min_trials: 5
    confidence_pct: 95
    treat_as: neutral
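One way to picture a three-way verdict is a confidence interval over the pass rate of repeated trials. The Python sketch below uses a Wilson score interval and a pass-rate threshold of 0.5. Both are assumptions chosen for illustration: this guide does not say how mcptest computes the verdict, and the `verdict` function and its `threshold` parameter are hypothetical. Only `min_trials` and `confidence_pct` mirror the config keys above.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(passes: int, trials: int, confidence_pct: float) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (requires 0 < confidence_pct < 100)."""
    z = NormalDist().inv_cdf(0.5 + confidence_pct / 200.0)
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def verdict(passes: int, trials: int, min_trials: int = 5,
            confidence_pct: float = 95, threshold: float = 0.5) -> str:
    if trials < min_trials:
        return "INCONCLUSIVE"  # not enough evidence to decide either way
    low, high = wilson_interval(passes, trials, confidence_pct)
    if low > threshold:
        return "PASS"
    if high < threshold:
        return "FAIL"
    return "INCONCLUSIVE"      # the interval straddles the threshold


print(verdict(3, 3))    # INCONCLUSIVE (fewer than min_trials)
print(verdict(20, 20))  # PASS
print(verdict(0, 20))   # FAIL
print(verdict(3, 5))    # INCONCLUSIVE
```

The point of the third outcome is visible in the last call: three passes out of five is neither strong evidence of a working probe nor of a broken one, so forcing either verdict would misreport it.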
How the pieces fit
| Layer | Block | Catches |
|---|---|---|
| Golden | expect | a wrong result for an input you know |
| Oracle-free | metamorphic | a tool that is not self-consistent |
| Negative path | fuzz, negative_path | a server that mishandles bad input |
| Static | tool_quality (SCH-NNN) | an under-constrained schema, before runtime |
| Coverage | tool_edges | an agent that left its declared bounds |
| Verdict | run_options.inconclusive | a non-deterministic probe, reported honestly |
Start with the golden floor for the inputs you know, add metamorphic relations for the ones you do not, fuzz and probe the negative path, lint the schema to catch faults before they run, and gate an agent's tool edges. Together they cover the surface a single golden-output assertion leaves open.