Verdicts: pass, fail, and inconclusive
Status: the runner contract, the decision rule, and the reporters ship. The
run_options.inconclusive policy is parsed and schema-validated; the neutral default is live at the exit code, and the treat_as: fail knob lands with the rest of the run_options wiring. Tracked as epic WOR-1236 and child WOR-1240.
A test has always been pass or fail. That binary breaks down for a non-deterministic probe: a relation that holds in 7 of 10 trials, or a fuzz sweep whose flaky tool sometimes hangs, is neither a clean pass nor a clean fail. Forcing it one way either flakes the suite or hides the signal. The third verdict, inconclusive, names that state directly. The idea comes from AgentAssay (Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows, arXiv:2603.02601).
The decision rule
Given the number of passing trials out of N and a confidence policy, the deterministic rule (mcptest_core::eval::verdict::decide) builds a Wald confidence interval around the pass rate and decides:
- pass when the whole interval sits above 0.5 (the probe reliably passes),
- fail when it sits below 0.5 (the probe reliably fails),
- inconclusive when the interval straddles 0.5, or there were fewer than
min_trials trials.
The same counts and policy always yield the same verdict, so a green run stays reproducible. Higher confidence widens the interval, so a borderline probe lands inconclusive more often.
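The rule above can be sketched in a few lines of Rust. This is a minimal illustration, not the code behind mcptest_core::eval::verdict::decide: the names Policy, z_score, and decide_sketch are hypothetical, the z-score comes from a textbook rational approximation rather than whatever the crate uses, and treating an interval that touches 0.5 exactly as inconclusive is an assumption. The interval is the usual Wald form, p ± z · sqrt(p(1 − p) / N), where p is the observed pass rate.

```rust
/// The three verdicts described in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Pass,
    Fail,
    Inconclusive,
}

/// Hypothetical policy struct mirroring two of the documented knobs.
#[derive(Debug, Clone, Copy)]
pub struct Policy {
    pub min_trials: u32,
    /// Two-sided confidence in percent; the documented range is 80 to 99.
    pub confidence_pct: u32,
}

/// Approximate z-score for a two-sided confidence level, using the
/// Abramowitz-Stegun 26.2.23 rational approximation (error below 4.5e-4).
/// Only meaningful inside the documented 80..=99 range.
fn z_score(confidence_pct: u32) -> f64 {
    let tail = (1.0 - f64::from(confidence_pct) / 100.0) / 2.0;
    let t = (-2.0 * tail.ln()).sqrt();
    let num = 2.515517 + 0.802853 * t + 0.010328 * t * t;
    let den = 1.0 + 1.432788 * t + 0.189269 * t * t + 0.001308 * t * t * t;
    t - num / den
}

/// Sketch of the decision rule: build a Wald interval around the pass rate
/// and compare it with 0.5. The boundary handling is an assumption.
pub fn decide_sketch(passes: u32, trials: u32, policy: Policy) -> Verdict {
    // Too few trials is always inconclusive; zero trials has no pass rate.
    if trials == 0 || trials < policy.min_trials {
        return Verdict::Inconclusive;
    }
    let passes = passes.min(trials); // guard against inconsistent counts
    let n = f64::from(trials);
    let p = f64::from(passes) / n;
    let half_width = z_score(policy.confidence_pct) * (p * (1.0 - p) / n).sqrt();
    let (lower, upper) = (p - half_width, p + half_width);

    if lower > 0.5 {
        Verdict::Pass
    } else if upper < 0.5 {
        Verdict::Fail
    } else {
        Verdict::Inconclusive
    }
}
```

Under these assumptions, 7 passes out of 10 at 95% confidence gives an interval of roughly [0.42, 0.98], which straddles 0.5 and is inconclusive, while 70 out of 100 narrows it to roughly [0.61, 0.79] and passes. Raising the confidence to 99% turns 8 out of 10 from pass into inconclusive, because the lower bound drops from about 0.55 to about 0.47.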
The policy
run_options:
  inconclusive:
    on: true             # opt in; off by default, so every test stays binary
    min_trials: 5        # fewer trials is always inconclusive
    confidence_pct: 95   # two-sided confidence, 80 to 99
    treat_as: neutral    # neutral | fail
treat_as: neutral (the default) keeps an inconclusive verdict out of the failure count, so a flaky probe does not flake CI; it still shows in the summary and an advisory counter so a team can choose to gate on it. treat_as: fail counts it toward the non-zero exit code.
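The effect of treat_as on the exit code can be illustrated with a small sketch. The types Summary and TreatAs, the functions summarize and exit_code, and the specific exit value 1 are all assumptions made for illustration; the source only states that treat_as: fail counts an inconclusive verdict toward a non-zero exit code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Pass,
    Fail,
    Inconclusive,
}

/// Hypothetical mirror of the documented `treat_as: neutral | fail` knob.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TreatAs {
    Neutral,
    Fail,
}

/// Hypothetical run summary; `inconclusive` plays the role of the
/// advisory counter that stays visible even when it does not gate.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Summary {
    pub passed: usize,
    pub failed: usize,
    pub inconclusive: usize,
}

/// Tally verdicts into the three counters.
pub fn summarize(verdicts: &[Verdict]) -> Summary {
    let mut summary = Summary::default();
    for verdict in verdicts {
        match verdict {
            Verdict::Pass => summary.passed += 1,
            Verdict::Fail => summary.failed += 1,
            Verdict::Inconclusive => summary.inconclusive += 1,
        }
    }
    summary
}

/// Returns 0 when nothing gates the run and 1 otherwise.
/// The concrete non-zero value is an assumption.
pub fn exit_code(summary: &Summary, treat_as: TreatAs) -> i32 {
    let gating = match treat_as {
        // Neutral: inconclusive verdicts are reported but never gate.
        TreatAs::Neutral => summary.failed,
        // Fail: inconclusive verdicts count alongside real failures.
        TreatAs::Fail => summary.failed + summary.inconclusive,
    };
    if gating == 0 { 0 } else { 1 }
}
```

In this sketch a run with two passes and one inconclusive verdict exits 0 under Neutral and 1 under Fail, while the inconclusive count stays available in the summary either way.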
How the reporters render it
Inconclusive is a first-class status in the run envelope and every reporter:
| Reporter | Rendering |
|---|---|
| pretty | a distinct INCONC mark, plus ", N inconclusive" in the summary when present |
| junit | <skipped message="inconclusive"/> |
| tap | ok N - <desc> # TODO inconclusive |
| ndjson / json | status: "inconclusive" (round-trips through mcptest report) |
| gitlab | an info-severity entry |
| html | an INCONC badge |
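As an illustration of the tap row, the sketch below formats one TAP line per verdict. The function name tap_line is hypothetical, and the pass and fail lines follow the general TAP convention (ok / not ok) rather than anything stated in this document; only the inconclusive form is taken from the table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Pass,
    Fail,
    Inconclusive,
}

/// Format a single TAP test line. `index` is the 1-based test number.
/// Only the Inconclusive arm reflects the table above; the other two
/// arms are plain TAP conventions and assumed here.
pub fn tap_line(index: usize, desc: &str, verdict: Verdict) -> String {
    match verdict {
        Verdict::Pass => format!("ok {index} - {desc}"),
        Verdict::Fail => format!("not ok {index} - {desc}"),
        Verdict::Inconclusive => format!("ok {index} - {desc} # TODO inconclusive"),
    }
}
```

Because the inconclusive line starts with ok, a TAP consumer does not count it as a failure, which matches the neutral default.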
What it does not do
An inconclusive verdict is a statement about confidence, not correctness. It does not tell you why the probe was undecided, only that the trials did not settle it. Raise min_trials to collect more evidence, or tighten the probe so it is deterministic and the question does not arise.