Jury consensus

A jury runs several LLM judges over the same assertion and combines their votes into one verdict. In mcptest the jury is the llm-jury matcher: a shared rubric, a list of jurors, a per-juror pass threshold, and a quorum. Each juror grades the response independently and counts as a pass when its score clears the threshold. The jury passes when the fraction of jurors that passed meets the quorum.

Run this example. examples/llm-jury-consensus/ is a three-juror suite with a two-of-three quorum.

mcptest run --config examples/llm-jury-consensus/tests/jury-consensus.yml

The shape

- target: "result.content[0].text"
  matcher:
    llm-jury:
      rubric: |
        Pass if the summary names the service and the release tag and reads as
        an internal deployment note, not marketing copy.
        Return {"pass": bool, "score": 0.0-1.0, "reason": "<one sentence>"}.
      threshold: 0.7        # per-juror pass line, 0..1, default 0.7
      quorum: 0.67          # fraction of jurors that must pass, default 0.5
      jurors:
        - model: "claude-opus-4-7"
        - model: "gpt-4o"
        - model: "gemini-2.5-pro"
  message: "summary should read as a deployment note across the jury"

The fields:

rubric (required): the grading instructions every juror receives. Have it return strict JSON so the score is machine-readable, for example {"pass": bool, "score": 0.0-1.0, "reason": "<one sentence>"}.
jurors (required): one entry per model. A juror may carry its own threshold to override the default for that juror alone.
threshold (default 0.7): the score a juror's grade must reach to count as a pass.
quorum (default 0.5): the fraction of passing jurors the jury needs to pass. The jury passes when passing / total >= quorum.

Choosing a quorum

The quorum is how you set the consensus policy. There are no named methods; you pick the fraction that matches how contested the criterion is.

Policy	`quorum`	When to use it
Simple majority	`0.5` (default)	Contested, but a majority settles it.
Two of three	`0.67`	One dissenter is fine, two is a warning sign.
Unanimous	`1.0`	Safety, PII (personally identifiable information), or compliance: if any reasonable judge calls the response bad, treat it as bad.
k of n	`k / n`	Any explicit ratio. Four of five is `0.8`.

Worked example: three jurors, quorum: 0.67. Verdicts [pass, pass, fail] give a passing fraction of 2/3 = 0.667, which meets the quorum, so the jury passes. Verdicts [pass, fail, fail] give 1/3 = 0.333, below the quorum, so the jury fails.

Worked example: unanimous, quorum: 1.0. Any single fail drops the passing fraction below 1.0, so one dissent fails the jury. This is the right setting for a property where a false negative is expensive.

A note on ties. Because the rule is fraction >= quorum, a four-juror even split passes at the default quorum: 0.5 (2/4 = 0.5). If you want an even split to fail, raise the quorum above 0.5 (for example 0.6), or use an odd number of jurors.

Put deterministic checks first

List the deterministic assertions (schema, contains) before the jury in the same expect block. They fail fast, so a structurally broken response never spends a juror call. The example suite does this: a schema length check and a contains tag check run before the jury weighs the subjective criterion.

Confidence band and escalation flag

A pass or a fail does not say how much the jury agreed. Two jurors at 0.9 and one at 0.05 still produce a majority pass, but that pass rests on a split the bare verdict hides. The jury attaches two extra fields so a reader can tell a confident verdict from a coin flip.

The confidence field is a band derived from the inter-juror agreement metric (Krippendorff's alpha). It is a relabel of the agreement interpretation, not a new statistic:

agreement interpretation	`confidence`
high agreement (alpha >= 0.8)	`high`
tentative (0.667 <= alpha < 0.8)	`medium`
low agreement (alpha < 0.667)	`low`

The escalate_for_review field is a boolean. It is set when agreement is low (alpha below the 0.667 threshold). When it is set, the jury keeps the verdict the quorum picked but marks it for a human to look at rather than reporting a split as if it were settled. A high-disagreement jury is flagged for review, not averaged away.

Both fields default to absent: confidence is null and escalate_for_review is false when the runner did not compute agreement (for example a single-juror verdict) or when reading a report written before these fields existed. Existing reports parse unchanged.

This follows Dechtiar, Katz, Jaume, and Wang (2025), SSRN 5937996, which treats inter-judge disagreement as a calibrated uncertainty signal and argues for a confidence band on scorecards rather than a bare grade (see research-references.md).

Gate on the jury's analytics

The agreement, the band, and the escalation flag are not just report fields; they are assertable targets. Add an assertion whose target is one of these and the existing matchers gate on it, so a split decision fails the test even when the quorum squeaked through:

target	type	meaning
`jury.agreement`	number `0..1`	Krippendorff's alpha across the jurors.
`jury.confidence`	`high` \| `medium` \| `low`	the confidence band.
`jury.escalate`	bool	true when the jury split badly enough to escalate.

- target: "result.content[0].text"
  matcher:
    llm-jury:
      rubric: "Pass when the explanation is correct and clear."
      jurors:
        - { model: claude-sonnet-4-5 }
        - { model: gpt-5 }
        - { model: gemini-2.5-pro }
      quorum: 0.66
# Gate on the jury's own analytics:
- target: jury.agreement
  matcher: { schema: { minimum: 0.7 } }
- target: jury.confidence
  matcher: { not: { exact: low } }
- target: jury.escalate
  matcher: { exact: false }

These targets read from the one llm-jury assertion in the same test, so the test must contain exactly one jury. A test that uses a jury.* target with zero or several juries is a load error with a clear message. Gating stays opt-in: a test with no jury.* assertion runs the jury exactly as before.

A single-juror jury leaves Krippendorff's alpha undefined, so jury.agreement resolves to null and its assertion is skipped with a note rather than failing on the null. The band and escalation flag still resolve (they read low and true for the undefined-agreement case).

The worked suite is examples/agent-jury-agreement.yml.

Bias signals

A jury can agree for the wrong reason. If one of the jurors is from the same vendor family as the model that wrote the response, that juror may prefer its own family's output, so the agreement it contributes is suspect. The jury exposes two more targets for that case:

target	type	meaning
`jury.same_family`	bool	true when a juror shares a vendor family (anthropic, openai, gemini, mistral) with the generator model under test.
`jury.bias_warning`	string or `null`	a folded warning, set only when the jury hit low agreement AND a same-family juror was present. `null` otherwise.

The generator model is the agent's own model in an agent test. A tool-result jury has no generating model, so jury.same_family reads false there and jury.bias_warning stays null: with nothing to compare against, the engine cannot claim a shared family.

jury.bias_warning is deliberately quiet. It folds the same-family signal with the agreement band and fires only when both land together (a same-family juror on a low-agreement run), the same gate the reporter uses, so the two never disagree. A confident, unanimous jury never trips it even with a same-family juror, because high agreement is not the case the warning is for.

# Keep a vendor-family juror from quietly inflating the agreement: fail when
# a same-family juror lands on a low-agreement split.
- target: jury.same_family
  matcher: { exact: false }
- target: jury.bias_warning
  matcher: { exact: null }

Both targets read from the same single jury in the test as jury.agreement, so the one-jury rule applies to them too. The worked suite is examples/agent-jury-bias.yml.