Oracle-free robustness: an end-to-end walkthrough

Most tests check one thing: given these arguments, expect this response. That is a golden-output oracle, and it covers only the inputs you can write an answer for. It says nothing about the long tail of inputs you cannot write an answer for, the bad inputs a real client will send, or the non-deterministic probes that never settle cleanly. This guide walks one server through the whole oracle-free robustness surface, in order, so the pieces read as one workflow.

Everything here runs offline against the bundled mock, with no model and no key. The umbrella suite is examples/robustness-walkthrough/suite.yml; each feature also has its own focused example and reference doc, linked inline.

The server under test

The mock exposes one search tool with a constrained schema and a static result set, so every gate below has a clean answer:

servers:
  catalog:
    command: ["mcptest", "mock", "--tools-from", "./server.yml"]

1. The golden-output floor

Start with what you know. For the queries you can write an answer for, assert the shape of the result.

tools:
  - name: search is correct, self-consistent, and robust
    server: catalog
    tool: search
    args: { query: "anthropic", limit: 10 }
    expect:
      - target: result.content
        matcher: { schema: { type: array, minItems: 1 } }

This is the floor, not the ceiling. It covers one input. The blocks that follow cover what it cannot.

2. Metamorphic relations (no golden output)

A search has no stable golden output, but it still has properties that must hold: calling twice returns the same result, reordering filters changes nothing, raising a limit never shrinks the result set. Assert the relation, not the answer. See metamorphic-testing.md.

    metamorphic:
      relations:
        - idempotent
        - arg_order_insensitive
        - case_insensitive
        - { monotone: { arg: limit } }

A violation is a reproducible disagreement between two calls, not a flake, so the gate fails outright with no flake budget to absorb it. The targets are metamorphic.relations_checked, metamorphic.violations, and metamorphic.gate_passed.
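
To make the idea of asserting a relation rather than an answer concrete, here is a minimal Python sketch. It is not how mcptest implements these relations; the `search` function, the `CATALOG` data, and the `check_*` helpers are all hypothetical stand-ins for a call to the real tool.

```python
from typing import Callable

# Hypothetical stand-in for the server's search tool: a static catalog,
# matched case-insensitively and truncated to `limit`.
CATALOG = ["Anthropic SDK", "anthropic-cookbook", "MCP spec", "Anthropic docs"]


def search(query: str, limit: int = 10) -> list[str]:
    hits = [item for item in CATALOG if query.lower() in item.lower()]
    return hits[:limit]


def check_idempotent(tool: Callable[..., list[str]], **args) -> bool:
    # Two calls with identical arguments must agree.
    return tool(**args) == tool(**args)


def check_case_insensitive(tool: Callable[..., list[str]], query: str, **rest) -> bool:
    # Changing only the case of the query must not change the result.
    return tool(query=query.lower(), **rest) == tool(query=query.upper(), **rest)


def check_monotone(tool: Callable[..., list[str]], arg: str, low: int, high: int, **rest) -> bool:
    # Raising `arg` must never shrink the result set.
    small = tool(**{arg: low}, **rest)
    large = tool(**{arg: high}, **rest)
    return len(small) <= len(large)


results = {
    "idempotent": check_idempotent(search, query="anthropic", limit=10),
    "case_insensitive": check_case_insensitive(search, "Anthropic", limit=10),
    "monotone(limit)": check_monotone(search, "limit", 1, 10, query="anthropic"),
}
violations = [name for name, ok in results.items() if not ok]
gate_passed = not violations
```

None of the checks needs to know what the correct search result is. Each one compares two calls against each other, which is why a failure reproduces on every run.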

3. Input fuzzing (bad input fails cleanly)

A robust server survives bad arguments. The fuzzer derives malformed cases from the tool's inputSchema (omit a required field, wrong-type a field, oversize a string, add an unexpected field) and checks each fails cleanly: a JSON-RPC error or a valid result, never a crash, a hang, or a leaked stack trace. The cases are seeded, so the run is reproducible. See input-fuzzing.md.

    fuzz:
      seed: 1729
      cases: 32

Targets: fuzz.cases_run, fuzz.crashes, fuzz.hangs, fuzz.protocol_violations, fuzz.leaks. To fuzz every tool a server exposes without writing a suite, use mcptest fuzz --server-command "...".
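
The four mutations named above can be sketched in a few lines of Python. This illustrates the technique only, under stated assumptions: `SCHEMA` is an invented inputSchema for the search tool, and `derive_cases` and `wrong_type` are hypothetical helpers, not part of mcptest.

```python
import random
from typing import Any

# Assumed inputSchema for the search tool (illustrative, not the real one).
SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "maxLength": 256},
        "limit": {"type": "integer", "minimum": 1, "maximum": 100},
    },
    "required": ["query"],
    "additionalProperties": False,
}

VALID_ARGS = {"query": "anthropic", "limit": 10}


def wrong_type(spec: dict) -> Any:
    # Return a value whose type contradicts the declared one.
    return 12345 if spec.get("type") == "string" else "not-a-number"


def derive_cases(schema: dict, valid: dict, seed: int, cases: int) -> list[dict]:
    rng = random.Random(seed)  # seeded, so the same seed yields the same cases
    props = schema["properties"]
    required = schema.get("required", [])
    strings = [k for k, v in props.items() if v.get("type") == "string"]

    # Only offer mutations that apply to this schema.
    mutations = ["wrong_type", "extra_field"]
    if required:
        mutations.append("omit_required")
    if strings:
        mutations.append("oversize")

    out = []
    for _ in range(cases):
        args = dict(valid)
        mutation = rng.choice(mutations)
        if mutation == "omit_required":
            args.pop(rng.choice(required), None)
        elif mutation == "wrong_type":
            key = rng.choice(sorted(props))
            args[key] = wrong_type(props[key])
        elif mutation == "oversize":
            key = rng.choice(strings)
            args[key] = "x" * (props[key].get("maxLength", 1024) * 4)
        else:  # extra_field
            args["unexpected_" + str(rng.randrange(1000))] = True
        out.append({"mutation": mutation, "args": args})
    return out


cases = derive_cases(SCHEMA, VALID_ARGS, seed=1729, cases=32)
```

Because the generator is driven by a private `random.Random(seed)`, rerunning with the same seed and schema yields the same list of malformed arguments, which is what makes a fuzz finding reportable.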

4. Negative-path conformance (clean rejection)

Where fuzzing sweeps broadly for crashes, negative-path conformance checks specific, taxonomy-keyed bad requests and asserts the server rejects each with a proper JSON-RPC error rather than accepting it silently. See negative-path.md.

    negative_path:
      checks: [unknown_tool, missing_required]

Each probe maps to a published MCP fault-taxonomy id, so a finding points back to the literature. The targets are negative_path.checks_run, negative_path.failures, and negative_path.gate_passed.
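
What counts as a proper JSON-RPC error can be sketched as a small classifier. The function name `classify_rejection` and its three labels are hypothetical; the shape it checks (an `error` object with an integer `code` and a string `message`) follows the JSON-RPC 2.0 specification. As a simplification, the sketch treats any `result` as acceptance.

```python
import json


def classify_rejection(raw: str) -> str:
    """Classify a raw JSON-RPC 2.0 response to a deliberately bad request."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return "protocol_violation"
    if not isinstance(msg, dict) or msg.get("jsonrpc") != "2.0":
        return "protocol_violation"

    err = msg.get("error")
    if err is None:
        # A result for a bad request means the server accepted it.
        return "accepted_silently" if "result" in msg else "protocol_violation"

    code = err.get("code") if isinstance(err, dict) else None
    message = err.get("message") if isinstance(err, dict) else None
    well_formed = (
        isinstance(code, int)
        and not isinstance(code, bool)  # bool is a subclass of int in Python
        and isinstance(message, str)
    )
    return "clean_rejection" if well_formed else "protocol_violation"
```

A probe in this sketch passes only on `clean_rejection`. The other two labels separate a server that let the bad request through from one that answered with something that is not valid JSON-RPC at all.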

Run the four blocks above together:

$ mcptest run --config examples/robustness-walkthrough/suite.yml
  [PASS] search is correct, self-consistent, and robust
Summary: 1 passed, 0 failed, 0 skipped

5. Catch it earlier: the strict schema lint

The fuzzer and the negative-path probes find under-constrained schemas at runtime. The schema lint finds them statically, which is cheaper. It flags an untyped property (SCH-003, critical), a missing required (SCH-001), an open additionalProperties (SCH-002), and unbounded strings or arrays (SCH-004), and ships an autofix that tightens the schema. See tool-schema-lint.md.

tool_quality:
  - name: tool schemas are well constrained
    server: catalog
    expect:
      - target: schema_criticals
        matcher: { schema: { maximum: 0 } }
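
A static check of this kind needs nothing but the schema itself. The Python sketch below is an approximation, not the shipped lint: the rule ids come from the text above, but the exact condition behind each id is an assumption (in particular, SCH-001 is read here as "no required list declared"), and `lint_schema` and `schema_criticals` are hypothetical helper names.

```python
def lint_schema(schema: dict) -> list[tuple[str, str]]:
    """Return (rule_id, message) findings for one tool inputSchema."""
    findings = []
    # Assumed reading of SCH-001: the schema declares no `required` list.
    if "required" not in schema:
        findings.append(("SCH-001", "no required list declared"))
    # Absent or True means any extra field is accepted.
    if schema.get("additionalProperties", True) is True:
        findings.append(("SCH-002", "additionalProperties is open"))
    for name, spec in schema.get("properties", {}).items():
        kind = spec.get("type")
        if kind is None:
            findings.append(("SCH-003", f"property {name!r} has no type"))
        elif kind == "string" and "maxLength" not in spec:
            findings.append(("SCH-004", f"string {name!r} has no maxLength"))
        elif kind == "array" and "maxItems" not in spec:
            findings.append(("SCH-004", f"array {name!r} has no maxItems"))
    return findings


def schema_criticals(findings: list[tuple[str, str]]) -> int:
    # Only SCH-003 is described as critical in the text above.
    return sum(1 for rule, _ in findings if rule == "SCH-003")


loose = {"type": "object", "properties": {"query": {"type": "string"}, "filters": {}}}
findings = lint_schema(loose)
```

For the `loose` schema, every rule fires once and `schema_criticals(findings)` is 1, so a gate that demands zero criticals would fail before any request is sent.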

6. Did the run stay in bounds: tool-edge coverage

For an agent run, structural tool-edge coverage gates three things: the tools the agent must exercise (allowed), the tools it must never call (restricted), and the hand-offs it should perform (delegation). A restricted-tool attempt fails the gate and is a security signal. See tool-edge-coverage.md.

agents:
  - name: triage agent stays within its allowed tools
    model: claude-sonnet-4-5
    servers: [catalog]
    prompt: Find and summarize the open issues.
    tool_edges:
      allowed: [search, summarize]
      restricted: [delete_repo]
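
The gate logic can be approximated as a comparison between the tool calls an agent attempted and the declared edges. This Python sketch assumes the run is available as a flat list of tool names and leaves delegation out; `check_tool_edges` and its result keys are hypothetical.

```python
def check_tool_edges(calls: list[str], allowed: set[str], restricted: set[str]) -> dict:
    """Compare attempted tool calls against declared allowed/restricted sets."""
    # Any attempt on a restricted tool is a failure, even if it was refused.
    restricted_attempts = [name for name in calls if name in restricted]
    # Allowed tools the run never exercised.
    unexercised = sorted(allowed - set(calls))
    return {
        "restricted_attempts": restricted_attempts,
        "unexercised": unexercised,
        "gate_passed": not restricted_attempts and not unexercised,
    }


clean = check_tool_edges(
    ["search", "summarize", "search"],
    allowed={"search", "summarize"},
    restricted={"delete_repo"},
)
breach = check_tool_edges(
    ["search", "delete_repo"],
    allowed={"search", "summarize"},
    restricted={"delete_repo"},
)
```

Here `clean` passes, while `breach` fails on two counts: it attempted `delete_repo` and never called `summarize`. Keeping the two lists separate lets a report distinguish a coverage gap from a security signal.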

The third verdict: inconclusive

Several of these blocks can run a probe N times and land in a genuinely undecided state. The run_options.inconclusive policy reports that as INCONCLUSIVE rather than forcing a pass or a fail. By default it is neutral (it does not fail the build) and surfaces in the summary. See verdicts.md.

run_options:
  inconclusive:
    on: true
    min_trials: 5
    confidence_pct: 95
    treat_as: neutral
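
One way to picture a three-way verdict is a confidence interval around the observed pass rate. The sketch below uses a Wilson score interval, which is an assumption: this guide does not say which statistic mcptest uses. The `threshold` parameter and both function names are hypothetical; `min_trials` and `confidence_pct` mirror the config keys above.

```python
import math

# Two-sided normal quantiles for common confidence levels.
Z = {90: 1.645, 95: 1.960, 99: 2.576}


def wilson_interval(passes: int, trials: int, confidence_pct: int = 95) -> tuple[float, float]:
    """Wilson score interval for a pass rate of `passes` out of `trials`."""
    z = Z[confidence_pct]
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def verdict(passes: int, trials: int, threshold: float = 0.5,
            min_trials: int = 5, confidence_pct: int = 95) -> str:
    # Too few trials: refuse to decide either way.
    if trials < min_trials:
        return "INCONCLUSIVE"
    low, high = wilson_interval(passes, trials, confidence_pct)
    if low > threshold:
        return "PASS"   # the whole interval sits above the threshold
    if high < threshold:
        return "FAIL"   # the whole interval sits below the threshold
    return "INCONCLUSIVE"  # the interval straddles the threshold
```

Under these assumptions, 5 passes out of 5 clears a 0.5 threshold and 0 out of 5 fails it, while 3 out of 5 stays INCONCLUSIVE because the interval still straddles the threshold. More trials narrow the interval until the verdict settles.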

How the pieces fit

Layer	Block	Catches
Golden	`expect`	a wrong result for an input you know
Oracle-free	`metamorphic`	a tool that is not self-consistent
Negative path	`fuzz`, `negative_path`	a server that mishandles bad input
Static	`tool_quality` (SCH-NNN)	an under-constrained schema, before runtime
Coverage	`tool_edges`	an agent that left its declared bounds
Verdict	`run_options.inconclusive`	a non-deterministic probe, reported honestly

Start with the golden floor for the inputs you know, add metamorphic relations for the ones you do not, fuzz and probe the negative path, lint the schema to catch faults before they run, and gate an agent's tool edges. Together they cover the surface a single golden-output assertion leaves open.