
Verdicts: pass, fail, and inconclusive

Status: the runner contract, the decision rule, and the reporters have shipped. The run_options.inconclusive policy is parsed and schema-validated. The neutral default already applies at the exit code; the treat_as: fail knob will land with the rest of the run_options wiring. Tracked as epic WOR-1236 and child WOR-1240.

A test has always been pass or fail. That binary breaks down for a non-deterministic probe: a relation that holds in 7 of 10 trials, or a fuzz sweep whose flaky tool sometimes hangs, is neither a clean pass nor a clean fail. Forcing it to fail makes the suite flaky; forcing it to pass hides the signal. The third verdict, inconclusive, names that state directly. The idea comes from AgentAssay (Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows, arXiv:2603.02601).

The decision rule

Given the number of passing trials out of N and a confidence policy, the deterministic rule (mcptest_core::eval::verdict::decide) builds a Wald confidence interval around the pass rate and decides the verdict from that interval.

The same counts and policy always yield the same verdict, so a green run stays reproducible. Higher confidence widens the interval, so a borderline probe lands inconclusive more often.
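
As an illustration only, the sketch below shows one way a Wald-interval rule of this kind could work. It is not the source of mcptest_core::eval::verdict::decide. The names decide_sketch, Policy, and z_for, the pass-rate threshold parameter, the mapping from interval to verdict, and the small lookup table of critical values are all assumptions made for the example.

```rust
/// Three-way verdict, mirroring the statuses named on this page.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Pass,
    Fail,
    Inconclusive,
}

/// Hypothetical mirror of the two policy fields the rule needs.
#[derive(Debug, Clone, Copy)]
pub struct Policy {
    pub min_trials: u32,
    pub confidence_pct: u32,
}

/// Two-sided standard-normal critical value for a few common levels.
/// Assumption: a lookup table keeps the sketch dependency-free;
/// any other level returns None.
fn z_for(confidence_pct: u32) -> Option<f64> {
    match confidence_pct {
        80 => Some(1.2816),
        90 => Some(1.6449),
        95 => Some(1.9600),
        99 => Some(2.5758),
        _ => None,
    }
}

/// Wald interval (lower, upper) around the observed pass rate, clamped to [0, 1].
/// The caller must pass trials > 0.
pub fn wald_interval(passes: u32, trials: u32, z: f64) -> (f64, f64) {
    let n = f64::from(trials);
    let p = f64::from(passes) / n;
    let margin = z * (p * (1.0 - p) / n).sqrt();
    ((p - margin).max(0.0), (p + margin).min(1.0))
}

/// Assumed mapping: pass when the whole interval sits at or above `threshold`,
/// fail when it sits entirely below, inconclusive when it straddles it.
/// Returns None for impossible counts or an unsupported confidence level.
pub fn decide_sketch(passes: u32, trials: u32, policy: Policy, threshold: f64) -> Option<Verdict> {
    if passes > trials {
        return None;
    }
    let z = z_for(policy.confidence_pct)?;
    // Fewer trials than min_trials is always inconclusive.
    if trials == 0 || trials < policy.min_trials {
        return Some(Verdict::Inconclusive);
    }
    let (lower, upper) = wald_interval(passes, trials, z);
    Some(if lower >= threshold {
        Verdict::Pass
    } else if upper < threshold {
        Verdict::Fail
    } else {
        Verdict::Inconclusive
    })
}
```

Under these assumptions and a threshold of 0.5, 7 passes in 10 trials is inconclusive at 95% confidence (the interval is roughly 0.42 to 0.98) but passes at 80% (roughly 0.51 to 0.89). The same 70% rate over 100 trials narrows the 95% interval to roughly 0.61 to 0.79 and passes. The function is pure, so identical inputs always give the same verdict.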

The policy

run_options:
  inconclusive:
    on: true            # opt in; off by default, so every test stays binary
    min_trials: 5       # fewer trials is always inconclusive
    confidence_pct: 95  # two-sided confidence, 80 to 99
    treat_as: neutral   # neutral | fail

treat_as: neutral (the default) keeps an inconclusive verdict out of the failure count, so a flaky probe does not flake CI. The verdict still shows in the summary and in an advisory counter, so a team can choose to gate on it. treat_as: fail adds it to the failure count, which makes the exit code non-zero.
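
The effect of treat_as on the exit code can be sketched as follows. This is a minimal illustration, not mcptest's implementation: the TreatAs and Summary types, the function names, and the use of 1 as the non-zero exit code are assumptions.

```rust
/// Hypothetical mirror of the treat_as policy values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TreatAs {
    Neutral,
    Fail,
}

/// Hypothetical per-run tally of verdicts.
#[derive(Debug, Clone, Copy, Default)]
pub struct Summary {
    pub passed: u32,
    pub failed: u32,
    pub inconclusive: u32,
}

/// Number of results that count against the run under the given policy.
pub fn failure_count(summary: Summary, treat_as: TreatAs) -> u32 {
    match treat_as {
        // Neutral: inconclusive results are reported but never gate the run.
        TreatAs::Neutral => summary.failed,
        // Fail: inconclusive results count the same as failures.
        TreatAs::Fail => summary.failed + summary.inconclusive,
    }
}

/// Exit code for the run. Assumption: 1 stands in for "non-zero".
pub fn exit_code(summary: Summary, treat_as: TreatAs) -> i32 {
    if failure_count(summary, treat_as) == 0 {
        0
    } else {
        1
    }
}
```

A run with nine passes and one inconclusive result exits 0 under neutral and non-zero under fail. The inconclusive count stays in the summary either way.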

How the reporters render it

Inconclusive is a first-class status in the run envelope and every reporter:

Reporter        Rendering
pretty          a distinct INCONC mark, and ", N inconclusive" in the summary when present
junit           <skipped message="inconclusive"/>
tap             ok N - <desc> # TODO inconclusive
ndjson / json   status: "inconclusive" (round-trips through mcptest report)
gitlab          an info-severity entry
html            an INCONC badge
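
To make two of these renderings concrete, here is a small sketch of how a reporter might map a status to a TAP line or a JUnit child element. Only the inconclusive forms come from the list above. The pass and fail TAP lines follow the standard TAP convention, while the Status type, the function names, and the `<failure/>` element for a failing case are assumptions rather than mcptest's actual output.

```rust
/// Hypothetical status type for the sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pass,
    Fail,
    Inconclusive,
}

/// One TAP result line. An inconclusive test is reported as `ok` with a
/// TODO directive, so TAP consumers do not count it as a failure.
pub fn tap_line(index: u32, desc: &str, status: Status) -> String {
    match status {
        Status::Pass => format!("ok {} - {}", index, desc),
        Status::Fail => format!("not ok {} - {}", index, desc),
        Status::Inconclusive => format!("ok {} - {} # TODO inconclusive", index, desc),
    }
}

/// Child element placed inside a JUnit <testcase>. A passing case has none.
/// Assumption: a failing case uses a bare <failure/> element.
pub fn junit_child(status: Status) -> &'static str {
    match status {
        Status::Pass => "",
        Status::Fail => "<failure/>",
        Status::Inconclusive => "<skipped message=\"inconclusive\"/>",
    }
}
```

Mapping inconclusive onto each format's existing "skipped" or "TODO" notion lets CI dashboards that know nothing about a third verdict still display it without marking the run red.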

What it does not do

An inconclusive verdict is a statement about confidence, not correctness. It does not tell you why the probe was undecided, only that the trials did not settle it. Raise min_trials to collect more evidence, or tighten the probe so it is deterministic and the question does not arise.