
Reliability reporting beyond pass^k

The headline reliability metrics remain pass@k and pass^k.

Reliability reporting adds two deterministic, model-free helpers on top. Both are pure arithmetic, so a fixed input always yields the same result, and neither calls a model. The curves below are secondary detail that explains the headline; they do not replace it.

Power-analysis run-count recommendation

How many runs N do you need so that the confidence interval around the estimated pass rate is no wider than a target half-width? The terms are spelled out below.

The recommendation uses the standard normal-approximation (Wald) interval for a binomial proportion. For an observed pass rate p over N runs, the half-width is:

half_width = z * sqrt(p * (1 - p) / N)

z is the standard-normal critical value for the confidence level: 1.96 for 95 percent, 1.645 for 90 percent, 2.576 for 99 percent. The product p * (1 - p) is largest at p = 0.5 (worst case, value 0.25), so solving at p = 0.5 guarantees the half-width for any observed rate:

N = ceil( (z / half_width)^2 * 0.25 )

Given instead a chosen N, the worst-case half-width that run budget buys is:

half_width = z * sqrt(0.25 / N)

Worked example

For a 5 percent half-width at 95 percent confidence:

N = ceil( (1.96 / 0.05)^2 * 0.25 ) = ceil(384.16) = 385 runs

Run 100 times instead and the worst-case half-width is:

half_width = 1.96 * sqrt(0.25 / 100) = 1.96 * 0.05 = 0.098, about 10 percent

So at 95 percent confidence, 100 runs pin the pass rate to within about plus or minus 10 percentage points in the worst case, and 385 runs tighten that to plus or minus 5 percentage points.
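The two worst-case formulas above can be sketched in a few lines of Rust. This is a minimal illustration, not the mcptest-core implementation: the function names `runs_for_half_width` and `worst_case_half_width` and their signatures are hypothetical, chosen only for this example.

```rust
/// Hypothetical helper (not the mcptest-core API): smallest N whose
/// worst-case (p = 0.5) Wald half-width is at most `half_width`,
/// given the standard-normal critical value `z`.
fn runs_for_half_width(z: f64, half_width: f64) -> u64 {
    assert!(z > 0.0 && half_width > 0.0, "z and half_width must be positive");
    // N = ceil( (z / half_width)^2 * 0.25 )
    ((z / half_width).powi(2) * 0.25).ceil() as u64
}

/// Hypothetical helper: worst-case half-width that a budget of `n` runs buys.
fn worst_case_half_width(z: f64, n: u64) -> f64 {
    assert!(n > 0, "need at least one run");
    // half_width = z * sqrt(0.25 / N)
    z * (0.25 / n as f64).sqrt()
}
```

With z = 1.96, `runs_for_half_width(1.96, 0.05)` reproduces the 385 runs of the worked example, and `worst_case_half_width(1.96, 100)` gives roughly 0.098.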

Reliability-decay summary

Given the N per-run pass/fail outcomes of one multi-run test, the summary reports three secondary statistics, each spelled out in full below.

Reliability Decay Curve

The cumulative pass^k as k goes 1..=N, reported as a small vector of integer percents. Entry k is the empirical pass^k over the first k runs:

entry_k = (c_k / k)^k * 100

where c_k is the count of passes among the first k runs. This is the maximum-likelihood estimate of the probability that k independent runs all pass, given the pass rate observed so far. It decays as failures accumulate.

Example, outcomes pass, pass, pass, fail:

k=1: (1/1)^1 = 1.000 -> 100
k=2: (2/2)^2 = 1.000 -> 100
k=3: (3/3)^3 = 1.000 -> 100
k=4: (3/4)^4 = 0.316 -> 31
curve = [100, 100, 100, 31]
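A rough Rust sketch of the curve computation follows. The function name `decay_curve` is hypothetical, not the library's API. The example above maps 0.316 to 31, which suggests the percent is truncated rather than rounded; the sketch assumes truncation on that basis alone.

```rust
/// Hypothetical helper (not the mcptest-core API): cumulative pass^k
/// for k = 1..=N, as integer percents.
fn decay_curve(outcomes: &[bool]) -> Vec<u32> {
    let mut passes = 0u32; // c_k: passes among the first k runs
    outcomes
        .iter()
        .enumerate()
        .map(|(i, &passed)| {
            if passed {
                passes += 1;
            }
            let k = (i + 1) as i32;
            let rate = f64::from(passes) / f64::from(k);
            // Assumption: truncate toward zero, matching 0.316 -> 31 above.
            (rate.powi(k) * 100.0) as u32
        })
        .collect()
}
```

Calling `decay_curve(&[true, true, true, false])` yields `[100, 100, 100, 31]`, the curve from the example.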

Variance Amplification Factor

The observed spread of the per-run pass indicator, scaled to 0..=100. A stable set of runs (all pass or all fail) has zero spread; a maximally flaky set (half pass, half fail) has the most. The factor is the population standard deviation of the per-run pass indicators (each 0 or 1) divided by its theoretical maximum of 0.5, expressed as an integer percent:

variance_amplification = round( std_dev / 0.5 * 100 )

Example: outcomes pass, fail, pass, fail split evenly, so std_dev = 0.5 and the factor is 100. Outcomes pass, pass, pass, pass have std_dev = 0 and a factor of 0.
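As a sketch of the arithmetic: for 0/1 indicators with pass rate p, the population standard deviation is sqrt(p * (1 - p)), so the factor can be computed from the pass count alone. The function name `variance_amplification` and the choice to return `None` for an empty input are assumptions made for this illustration, not documented behaviour.

```rust
/// Hypothetical helper (not the mcptest-core API): spread of the 0/1 pass
/// indicators relative to the maximum 0.5, as an integer percent.
/// Returns None for an empty slice (an assumption; the source does not
/// say how empty input is handled).
fn variance_amplification(outcomes: &[bool]) -> Option<u32> {
    if outcomes.is_empty() {
        return None;
    }
    let n = outcomes.len() as f64;
    let p = outcomes.iter().filter(|&&passed| passed).count() as f64 / n;
    // Population standard deviation of a 0/1 variable with mean p.
    let std_dev = (p * (1.0 - p)).sqrt();
    Some((std_dev / 0.5 * 100.0).round() as u32)
}
```

The two cases from the example come out as `Some(100)` for pass, fail, pass, fail and `Some(0)` for four passes.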

Graceful Degradation Score

A 0..=100 score that is 100 when every run passed and decays as failures cluster late rather than early. Each run's pass is weighted by its position (1..=N), so a late failure costs more than an early one (an agent that degrades late is less graceful than one that fails fast and recovers):

graceful = round( 100 * sum(position_i * pass_i) / sum(position_i) )

Example with 4 runs and one failure (same pass count, different position):

fail early (fail, pass, pass, pass): 100 * (2+3+4) / (1+2+3+4) = 900/10 = 90
fail late  (pass, pass, pass, fail): 100 * (1+2+3) / (1+2+3+4) = 600/10 = 60

The late failure scores lower, as intended. All-pass scores 100, all-fail scores 0.
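The position-weighted score can be sketched as below. The function name `graceful_degradation` is hypothetical, and returning `None` for an empty input is an assumption, since the source does not define that case.

```rust
/// Hypothetical helper (not the mcptest-core API): position-weighted pass
/// score in 0..=100. Run i (1-based) carries weight i, so late failures
/// cost more than early ones.
fn graceful_degradation(outcomes: &[bool]) -> Option<u32> {
    if outcomes.is_empty() {
        return None; // assumption: no score without at least one run
    }
    let mut weighted_passes = 0usize; // sum(position_i * pass_i)
    let mut total_weight = 0usize; // sum(position_i)
    for (i, &passed) in outcomes.iter().enumerate() {
        let position = i + 1;
        total_weight += position;
        if passed {
            weighted_passes += position;
        }
    }
    Some((100.0 * weighted_passes as f64 / total_weight as f64).round() as u32)
}
```

For the two four-run examples, the sketch returns `Some(90)` for the early failure and `Some(60)` for the late one.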

Targets

The summary exposes these assertable targets through the same dot-path resolver the other eval reports use:

Surface and runner emission

These helpers ship as a pure library surface in mcptest-core (eval::reliability): recommend_runs, confidence_band, summarize, and ReliabilitySummary with resolve_target and metrics_object. They have no model in the loop and perform no I/O. Wiring them into a suite gate and emitting the summary from the runner is still in preview and is not yet a stable suite block.