Verdicts: pass, fail, and inconclusive

Status: the runner contract, the decision rule, and the reporters ship. The run_options.inconclusive policy is parsed and schema-validated; the neutral default is live at the exit code, and the treat_as: fail knob lands with the rest of the run_options wiring. Tracked as epic WOR-1236 and child WOR-1240.

A test has always been pass or fail. That binary breaks down for a non-deterministic probe: a relation that holds in 7 of 10 trials, or a fuzz sweep whose flaky tool sometimes hangs, is neither a clean pass nor a clean fail. Forcing it one way either flakes the suite or hides the signal. The third verdict, inconclusive, names that state directly. The idea comes from AgentAssay (Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows, arXiv:2603.02601).

The decision rule

Given the number of passing trials out of N and a confidence policy, the deterministic rule (mcptest_core::eval::verdict::decide) builds a Wald confidence interval around the pass rate and decides:

pass when the whole interval sits above 0.5 (the probe reliably passes),
fail when it sits below 0.5 (the probe reliably fails),
inconclusive when the interval straddles 0.5, or there were fewer than min_trials trials.

The same counts and policy always yield the same verdict, so a green run stays reproducible. Higher confidence widens the interval, so a borderline probe lands inconclusive more often.

The policy

run_options:
  inconclusive:
    on: true            # opt in; off by default, so every test stays binary
    min_trials: 5       # fewer trials is always inconclusive
    confidence_pct: 95  # two-sided confidence, 80 to 99
    treat_as: neutral   # neutral | fail

treat_as: neutral (the default) keeps an inconclusive verdict out of the failure count, so a flaky probe does not flake CI; it still shows in the summary and an advisory counter so a team can choose to gate on it. treat_as: fail counts it toward the non-zero exit code.

How the reporters render it

Inconclusive is a first-class status in the run envelope and every reporter:

Reporter	Rendering
pretty	a distinct `INCONC` mark, and `, N inconclusive` in the summary when present
junit	`<skipped message="inconclusive"/>`
tap	`ok N - <desc> # TODO inconclusive`
ndjson / json	`status: "inconclusive"` (round-trips through `mcptest report`)
gitlab	an `info`-severity entry
html	an `INCONC` badge

What it does not do

An inconclusive verdict is a statement about confidence, not correctness. It does not tell you why the probe was undecided, only that the trials did not settle it. Raise min_trials to collect more evidence, or tighten the probe so it is deterministic and the question does not arise.