Reliability reporting beyond pass^k
The headline reliability metrics stay pass@k and pass^k:
- pass@k (optimistic): at least one of k runs passed.
- pass^k (pessimistic): every one of k runs passed.
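As a minimal sketch of the two headline definitions, assuming the k outcomes arrive as a slice of booleans: the function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from the crate.

```rust
/// pass@k (optimistic): true when at least one of the k runs passed.
/// Hypothetical helper name, used for illustration only.
fn pass_at_k(outcomes: &[bool]) -> bool {
    outcomes.iter().any(|&passed| passed)
}

/// pass^k (pessimistic): true when every one of the k runs passed.
/// Hypothetical helper name, used for illustration only.
fn pass_hat_k(outcomes: &[bool]) -> bool {
    outcomes.iter().all(|&passed| passed)
}
```

With outcomes pass, fail, pass, `pass_at_k` is true and `pass_hat_k` is false. For an empty slice this sketch returns false and true respectively, which is simply how `any` and `all` behave; the document does not say how zero runs should be treated.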
Reliability reporting adds two deterministic, model-free helpers on top. Both are pure arithmetic, so a fixed input always yields the same result, and neither calls a model. The statistics below are secondary detail that explains the headline; they do not replace it.
Power-analysis run-count recommendation
How many runs N do you need so that the confidence interval around the estimated pass rate has a half-width no larger than a target value? Spelling out the terms:
- Confidence interval: the range that, at a stated confidence level, brackets the true pass rate.
- Half-width: half the full width of that interval. A half-width of 5 percent means plus or minus 5 percent.
- Confidence level: how often the interval would contain the true rate over many repeats (90, 95, or 99 percent here).
The recommendation uses the standard normal-approximation (Wald) interval for a binomial proportion. For an observed pass rate p over N runs, the half-width is:
half_width = z * sqrt(p * (1 - p) / N)
z is the standard-normal critical value for the confidence level: 1.96 for 95 percent, 1.645 for 90 percent, 2.576 for 99 percent. The product p * (1 - p) is largest at p = 0.5 (worst case, value 0.25), so solving at p = 0.5 guarantees the half-width for any observed rate:
N = ceil( (z / half_width)^2 * 0.25 )
Given instead a chosen N, the worst-case half-width that run budget buys is:
half_width = z * sqrt(0.25 / N)
Worked example
For a 5 percent half-width at 95 percent confidence:
N = ceil( (1.96 / 0.05)^2 * 0.25 ) = ceil(384.16) = 385 runs
Run 100 times instead and the worst-case half-width is:
half_width = 1.96 * sqrt(0.25 / 100) = 1.96 * 0.05 = 0.098, about 10 percent
So 100 runs pin the pass rate to plus or minus 10 percent, and 385 runs tighten it to plus or minus 5 percent.
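The two worst-case formulas and the worked example can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not the crate's implementation: the function names and signatures are hypothetical and are not claimed to match `recommend_runs` or `confidence_band`. The z lookup covers only the three confidence levels listed above.

```rust
/// Standard-normal critical value for the three confidence levels in the text.
/// Returns None for any other level; this sketch does not compute quantiles.
fn z_for_confidence(percent: u32) -> Option<f64> {
    match percent {
        90 => Some(1.645),
        95 => Some(1.96),
        99 => Some(2.576),
        _ => None,
    }
}

/// Worst-case (p = 0.5) run count for a target half-width:
/// N = ceil((z / half_width)^2 * 0.25).
/// Assumes half_width > 0; hypothetical helper name.
fn runs_for_half_width(z: f64, half_width: f64) -> u64 {
    ((z / half_width).powi(2) * 0.25).ceil() as u64
}

/// Worst-case half-width bought by a fixed run budget:
/// half_width = z * sqrt(0.25 / N).
/// Assumes n > 0; hypothetical helper name.
fn half_width_for_runs(z: f64, n: u64) -> f64 {
    z * (0.25 / n as f64).sqrt()
}
```

Calling `runs_for_half_width(1.96, 0.05)` gives 385, and `half_width_for_runs(1.96, 100)` gives approximately 0.098, matching the worked example.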
Reliability-decay summary
Given the N per-run pass/fail outcomes of one multi-run test, the summary computes three secondary statistics, each spelled out in full.
Reliability Decay Curve
The cumulative pass^k as k goes 1..=N, reported as a small vector of integer percents. Entry k is the empirical pass^k over the first k runs:
entry_k = (c_k / k)^k * 100
where c_k is the count of passes among the first k runs. This is the maximum-likelihood estimate of the probability that k independent runs all pass, given the pass rate observed so far. It decays as failures accumulate.
Example, outcomes pass, pass, pass, fail:
k=1: (1/1)^1 = 1.000 -> 100
k=2: (2/2)^2 = 1.000 -> 100
k=3: (3/3)^3 = 1.000 -> 100
k=4: (3/4)^4 = 0.316 -> 31 (31.6 truncated, not rounded)
curve = [100, 100, 100, 31]
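The curve computation can be sketched as follows. The function name `decay_curve` is hypothetical. Converting to an integer percent by truncation is an assumption chosen to reproduce the 31 in the example; rounding would give 32 for the last entry.

```rust
/// Empirical pass^k over each prefix of the outcomes, as integer percents.
/// Entry k is (c_k / k)^k * 100, where c_k counts passes in the first k runs.
/// Truncation toward zero is an assumption made to match the example's 31.
fn decay_curve(outcomes: &[bool]) -> Vec<u32> {
    let mut passes = 0u32;
    outcomes
        .iter()
        .enumerate()
        .map(|(i, &passed)| {
            let k = (i + 1) as i32; // 1-based prefix length
            if passed {
                passes += 1;
            }
            let rate = passes as f64 / k as f64;
            (rate.powi(k) * 100.0) as u32
        })
        .collect()
}
```

For the outcomes pass, pass, pass, fail this returns `[100, 100, 100, 31]`.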
Variance Amplification Factor
The observed spread of the per-run pass indicator, scaled to 0..=100. A stable test (all runs pass or all runs fail) has zero spread; a maximally flaky test (half pass, half fail) has the most. It is the population standard deviation of the per-run pass indicators (each 0 or 1) divided by its theoretical maximum 0.5, expressed as an integer percent:
variance_amplification = round( std_dev / 0.5 * 100 )
Example: outcomes pass, fail, pass, fail split evenly, so std_dev = 0.5 and the factor is 100. Outcomes pass, pass, pass, pass have std_dev = 0 and a factor of 0.
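A minimal sketch of this formula, using the fact that the population standard deviation of a 0/1 indicator with mean p is sqrt(p * (1 - p)). The function name is hypothetical, and returning 0 for an empty input is an assumption the document does not address.

```rust
/// Population std dev of the 0/1 pass indicators, divided by its maximum 0.5,
/// as a rounded integer percent. Hypothetical helper name.
fn variance_amplification(outcomes: &[bool]) -> u32 {
    if outcomes.is_empty() {
        return 0; // assumption: no runs means no observed spread
    }
    let n = outcomes.len() as f64;
    let p = outcomes.iter().filter(|&&passed| passed).count() as f64 / n;
    // For a 0/1 indicator with mean p, the population variance is p * (1 - p).
    let std_dev = (p * (1.0 - p)).sqrt();
    (std_dev / 0.5 * 100.0).round() as u32
}
```

An even split returns 100 and an all-pass input returns 0, as in the examples. Three passes and one failure give p = 0.75, std_dev of about 0.433, and a factor of 87.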
Graceful Degradation Score
A 0..=100 score that is 100 when every run passed and decays as failures cluster late rather than early. Each run's pass is weighted by its position (1..=N), so a late failure costs more than an early one (an agent that degrades late is less graceful than one that fails fast and recovers):
graceful = round( 100 * sum(position_i * pass_i) / sum(position_i) )
Example with 4 runs and one failure (same pass count, different position):
fail early (fail, pass, pass, pass): 100 * (2+3+4) / (1+2+3+4) = 900/10 = 90
fail late (pass, pass, pass, fail): 100 * (1+2+3) / (1+2+3+4) = 600/10 = 60
The late failure scores lower, as intended. All-pass scores 100, all-fail scores 0.
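The position-weighted score can be sketched like this. The function name is hypothetical, and returning 0 when there are no runs is an assumption; the document only defines the score for N of at least 1.

```rust
/// 100 * sum(position_i * pass_i) / sum(position_i), rounded to an integer,
/// with positions numbered 1..=N. Hypothetical helper name.
fn graceful_degradation(outcomes: &[bool]) -> u32 {
    let mut weighted_passes = 0u64;
    let mut total_weight = 0u64;
    for (i, &passed) in outcomes.iter().enumerate() {
        let position = (i + 1) as u64; // later runs carry more weight
        total_weight += position;
        if passed {
            weighted_passes += position;
        }
    }
    if total_weight == 0 {
        return 0; // assumption: an empty input scores 0
    }
    (100.0 * weighted_passes as f64 / total_weight as f64).round() as u32
}
```

The inputs fail, pass, pass, pass and pass, pass, pass, fail return 90 and 60, matching the two example rows.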
Targets
The summary exposes these assertable targets through the same dot-path resolver the other eval reports use:
- reliability.runs
- reliability.pass_at_k (surfaced as 0 or 100 so a numeric matcher can gate it)
- reliability.passhat_k
- reliability.variance_amplification
- reliability.graceful_degradation
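To illustrate how a dot-path might map onto a numeric value, here is a hypothetical stand-in. `SummarySketch` and its `resolve` method are invented for this example and are not the crate's `ReliabilitySummary` or `resolve_target`; the real types and signatures may differ. Surfacing `passhat_k` as 0 or 100 is also an assumption, since the document states that encoding only for `pass_at_k`.

```rust
/// Hypothetical stand-in for the summary; field names mirror the target paths.
struct SummarySketch {
    runs: u32,
    pass_at_k: bool,
    passhat_k: bool,
    variance_amplification: u32,
    graceful_degradation: u32,
}

impl SummarySketch {
    /// Resolve a dot-path to a number, or None for an unknown path.
    fn resolve(&self, path: &str) -> Option<u32> {
        // Booleans become 0 or 100 so a numeric matcher can gate them.
        let as_percent = |flag: bool| if flag { 100 } else { 0 };
        match path.strip_prefix("reliability.")? {
            "runs" => Some(self.runs),
            "pass_at_k" => Some(as_percent(self.pass_at_k)),
            // Assumption: passhat_k uses the same 0/100 encoding.
            "passhat_k" => Some(as_percent(self.passhat_k)),
            "variance_amplification" => Some(self.variance_amplification),
            "graceful_degradation" => Some(self.graceful_degradation),
            _ => None,
        }
    }
}
```

In this sketch a numeric gate such as "reliability.graceful_degradation is at least 80" reduces to comparing the resolved integer against a threshold, and an unrecognized path resolves to `None` rather than to a default value.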
Surface and runner emission
These helpers ship as a pure library surface in mcptest-core (eval::reliability): recommend_runs, confidence_band, summarize, and ReliabilitySummary with resolve_target and metrics_object. They have no model in the loop and perform no I/O. Wiring them into a suite gate and emitting the summary from the runner is still in preview and is not yet a stable suite block.