Rubric scoring

A rubric eval grades a response against criteria you define and returns a score in 0..1, gated by a threshold. You write the rubric in your test YAML under the top-level evals: block; mcptest runs the evaluation and reports a pass or fail per eval. Run the evals with the mcptest eval subcommand.

This is the same rubric engine the agent-side eval.rubric matcher uses, so a rubric you write here behaves identically to one inside an agents: test.

The three rubric forms

A rubric is one of three shapes. Give a structured rubric either criteria or tree, never both.

1. Free-form string

A single holistic judgment. The judge reads the response and the rubric text and returns one score. threshold (default 0.7) gates the pass.

evals:
  - name: summary stays on topic
    server: remote_api
    prompt: "Summarize the latest deployment."
    rubric: "Answer must mention the service name and the release tag."
    threshold: 0.7

2. Weighted criteria

A list of named criteria, each judged separately. The score is the weight-normalized average of the per-criterion scores, and each criterion is reported with its own reason. weight defaults to 1. Add strict: true to require a perfect score.

evals:
  - name: booking quality
    server: calendar
    prompt: "Book a meeting with Alice next Tuesday at 2pm and confirm it."
    rubric:
      threshold: 0.8
      criteria:
        - name: booked the right day
          description: "Created an event on the correct Tuesday."
          weight: 2
        - name: confirmed to the user
          description: "The final reply confirms the booking."
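
The weight-normalized average can be checked by hand. The Python sketch below is illustrative only and is not mcptest's implementation: the per-criterion scores are made up, and it assumes the gate is "score is at least the threshold".

```python
def weighted_average(scored):
    """Weight-normalized mean of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in scored)
    if total_weight == 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(score * weight for score, weight in scored) / total_weight


# Hypothetical judge scores for the two criteria in the booking example.
scored = [
    (1.0, 2),  # booked the right day (weight: 2)
    (0.5, 1),  # confirmed to the user (weight defaults to 1)
]

score = weighted_average(scored)   # (1.0 * 2 + 0.5 * 1) / 3
passed = score >= 0.8              # assumed gate: score >= rubric threshold
print(f"{score:.2f} passed={passed}")
```

With these made-up scores the rubric score is about 0.83, which clears the 0.8 threshold because the heavier criterion scored well.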

3. Decision tree

One yes/no question per node. The judge answers each ask, the run descends the yes or no branch (a yes is a judge score of 0.5 or higher), and a score leaf ends the walk. One narrow question per node is easier to judge reliably and to audit than one holistic score; the report shows the path taken.

evals:
  - name: weather answered
    server: weather
    prompt: "What is the weather in Paris?"
    rubric:
      threshold: 0.7
      tree:
        ask: "Did the answer call the get_weather tool?"
        yes:
          ask: "Does the final reply state a temperature?"
          yes: { score: 1.0, reason: "called the tool and reported a temperature" }
          no:  { score: 0.4, reason: "called the tool but gave no temperature" }
        no:    { score: 0.0, reason: "never called the weather tool" }
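
The walk itself is a short loop. The following Python sketch mirrors the tree above under stated assumptions: the answer callback is a stand-in for the judge, and its scores are invented for illustration.

```python
def walk(tree, answer):
    """Descend from the root until a score leaf is reached.

    answer(question) stands in for the judge and returns a score in 0..1;
    0.5 or higher counts as a yes.
    """
    node, path = tree, []
    while "score" not in node:
        branch = "yes" if answer(node["ask"]) >= 0.5 else "no"
        path.append((node["ask"], branch))
        node = node[branch]
    return node["score"], node["reason"], path


tree = {
    "ask": "Did the answer call the get_weather tool?",
    "yes": {
        "ask": "Does the final reply state a temperature?",
        "yes": {"score": 1.0, "reason": "called the tool and reported a temperature"},
        "no": {"score": 0.4, "reason": "called the tool but gave no temperature"},
    },
    "no": {"score": 0.0, "reason": "never called the weather tool"},
}

# Hypothetical judge answers: the tool was called, but no temperature was stated.
judge_answers = {
    "Did the answer call the get_weather tool?": 0.9,
    "Does the final reply state a temperature?": 0.2,
}

score, reason, path = walk(tree, judge_answers.get)
print(score, reason)
```

Here the walk takes yes, then no, and ends on the 0.4 leaf, which is below the 0.7 threshold, so the eval would fail with the leaf's reason attached.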

Reusable rubrics

Define a rubric once under the top-level rubrics: map and reference it from any eval with rubric: { ref: <name> }. An eval may override the named rubric's threshold or strict inline, so one shared definition covers many tests without copy-paste.

rubrics:
  helpful-and-grounded:
    threshold: 0.7
    criteria:
      - name: helpful
        description: "Directly answers the question."
      - name: grounded
        description: "Makes no claim the tools did not support."

evals:
  - name: weather answer is good
    server: weather
    prompt: "What is the weather in Paris?"
    response: "It is 18C and clear in Paris."
    rubric: { ref: helpful-and-grounded }
  - name: strict billing answer
    server: billing
    prompt: "What did invoice 42 total?"
    response: "Invoice 42 totaled $120.00."
    rubric:
      ref: helpful-and-grounded
      threshold: 0.9   # override just for this eval

An unknown ref is a load-time error.
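
Conceptually, resolving a ref is a lookup followed by a shallow override. The Python sketch below is a hypothetical model of that behavior, not mcptest's loader; resolve_rubric and the choice of KeyError are assumptions made for illustration.

```python
def resolve_rubric(spec, rubrics):
    """Return the effective rubric for an eval's rubric: value."""
    # A free-form string or an inline rubric has nothing to resolve.
    if not isinstance(spec, dict) or "ref" not in spec:
        return spec
    name = spec["ref"]
    if name not in rubrics:
        # Models the documented rule: an unknown ref is rejected at load time.
        raise KeyError(f"unknown rubric ref: {name}")
    resolved = dict(rubrics[name])  # copy, so the shared definition stays untouched
    for key in ("threshold", "strict"):  # the two documented inline overrides
        if key in spec:
            resolved[key] = spec[key]
    return resolved


rubrics = {
    "helpful-and-grounded": {
        "threshold": 0.7,
        "criteria": [
            {"name": "helpful", "description": "Directly answers the question."},
            {"name": "grounded", "description": "Makes no claim the tools did not support."},
        ],
    }
}

shared = resolve_rubric({"ref": "helpful-and-grounded"}, rubrics)
strict_billing = resolve_rubric({"ref": "helpful-and-grounded", "threshold": 0.9}, rubrics)
print(shared["threshold"], strict_billing["threshold"])
```

The second eval sees a 0.9 threshold while the shared definition keeps 0.7, which is the point of overriding inline rather than editing the named rubric.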

Required and guard criteria

Two per-criterion flags gate the eval independent of the weighted average:

required: true makes a criterion a hard gate. If it scores below the rubric threshold the eval fails, even if the average clears the bar.
guard: true marks a negative criterion: the description states something that must not hold. Its contribution is inverted (a clean response scores 1.0), and if the judge finds the bad thing present the eval fails.

evals:
  - name: safe and grounded
    server: billing
    prompt: "What did invoice 42 total?"
    response: "Invoice 42 totaled $120.00."
    rubric:
      threshold: 0.7
      criteria:
        - name: correct total
          description: "States the correct invoice total."
          required: true                # must hold, or the eval fails
        - name: leaks a card number
          description: "The answer exposes a full card number."
          guard: true                   # must not hold
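
The interaction between the two flags and the average is easier to see with numbers. The Python sketch below is a simplified model, not mcptest's code. It assumes the judge returns how strongly each description holds, that a guard counts as tripped at 0.5 or higher, and that the average gate is "at least the threshold"; all scores are invented.

```python
def grade(criteria, threshold):
    """criteria: dicts with name, raw (judge score in 0..1), and optional flags."""
    contributions, failures = [], []
    for c in criteria:
        raw = c["raw"]
        if c.get("guard"):
            value = 1.0 - raw  # inverted: a clean response contributes 1.0
            if raw >= 0.5:     # assumption: 0.5 or higher means the bad thing is present
                failures.append(c["name"])
        else:
            value = raw
            if c.get("required") and raw < threshold:
                failures.append(c["name"])  # hard gate, independent of the average
        contributions.append((value, c.get("weight", 1)))
    total_weight = sum(w for _, w in contributions)
    score = sum(v * w for v, w in contributions) / total_weight
    return score, (score >= threshold and not failures), failures


# Hypothetical judge output: the total is only weakly supported, no card number leaked.
score, passed, failures = grade(
    [
        {"name": "correct total", "raw": 0.6, "required": True},
        {"name": "leaks a card number", "raw": 0.0, "guard": True},
    ],
    threshold=0.7,
)
print(f"{score:.2f}", passed, failures)
```

The average here is about 0.80, which clears 0.7, yet the eval fails because the required criterion scored 0.6.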

Calibration anchors

A criterion can carry an examples: list of labeled sample responses, each with the score a human would give it. The examples are appended to that criterion's judge prompt as few-shot anchors, which steers the judge toward your scoring intent and reduces drift between models. Each anchor is a response plus an expected score in 0..1.

evals:
  - name: groundedness with anchors
    server: docs
    prompt: "What changed in the last release?"
    response: "The 2.3.0 release added per-tenant rate limits."
    rubric:
      criteria:
        - name: grounded
          description: "Every claim is supported by the release notes."
          examples:
            - response: "2.3.0 added per-tenant rate limits."
              score: 1.0
            - response: "2.3.0 rewrote the billing engine."   # not in the notes
              score: 0.0

Each anchor renders into the prompt as Response: "..." -> score X.XX under a "Calibration examples" heading, so the judge grades the candidate consistently with the labeled examples.
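
As a rough picture of that rendering, the Python sketch below formats anchors in the shape described. It is an approximation: the exact heading text, punctuation, and line layout mcptest emits are assumptions.

```python
def render_anchors(examples):
    """Format labeled examples as few-shot anchor lines for a judge prompt."""
    lines = ["Calibration examples:"]  # heading wording is assumed
    for example in examples:
        lines.append(f'Response: "{example["response"]}" -> score {example["score"]:.2f}')
    return "\n".join(lines)


anchors = [
    {"response": "2.3.0 added per-tenant rate limits.", "score": 1.0},
    {"response": "2.3.0 rewrote the billing engine.", "score": 0.0},
]
prompt_block = render_anchors(anchors)
print(prompt_block)
```

Pairing one clearly good and one clearly bad anchor, as in the example above, gives the judge both ends of the scale to calibrate against.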

Evidence-required judging

Set require_evidence: true and every criterion's judge must return a verbatim span from the candidate that justifies its verdict, not just a score and a sentence. The cited span is surfaced in the report so a pass or fail is auditable. A criterion the judge cannot back with evidence scores 0 and gates the eval, so an unjustified verdict cannot slip through.

evals:
  - name: grounded answer with citations
    server: docs
    prompt: "What changed in the last release?"
    response: "The 2.3.0 release added per-tenant rate limits."
    rubric:
      require_evidence: true
      criteria:
        - name: states the version
          description: "Names the release version."
        - name: states the change
          description: "Describes what the release changed."

Evidence is a criteria-mode feature; a decision tree asks yes/no questions and does not request a cited span.
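
One way to picture the evidence rule is as a post-check on each judge verdict. The Python sketch below is an assumption about the mechanics, not mcptest's code: the verdict dict, its evidence key, and the substring test used to confirm that a span is verbatim are all hypothetical.

```python
def enforce_evidence(candidate, verdict):
    """Zero out and gate a verdict whose cited span is missing or not verbatim."""
    span = verdict.get("evidence") or ""
    if span and span in candidate:  # assumed check: the span must appear verbatim
        return {**verdict, "gated": False}
    return {**verdict, "score": 0.0, "gated": True}


candidate = "The 2.3.0 release added per-tenant rate limits."

backed = enforce_evidence(
    candidate,
    {"name": "states the version", "score": 1.0, "evidence": "2.3.0 release"},
)
unbacked = enforce_evidence(
    candidate,
    {"name": "states the change", "score": 0.9, "evidence": "rewrote the billing engine"},
)
print(backed["score"], backed["gated"])
print(unbacked["score"], unbacked["gated"])
```

The second verdict claimed a high score but cited text that is not in the candidate, so under this model it drops to 0 and gates the eval.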

Conditional criteria and per-criterion thresholds

A criterion can carry a when: predicate so it is judged only when it applies, and its own threshold: that overrides the rubric default for its gate. The predicate is deterministic (no model call): contains for a substring or regex for a pattern, matched against the candidate. A criterion whose when: does not hold is skipped entirely and does not enter the aggregate; if every criterion is skipped the eval is a vacuous pass.

evals:
  - name: error responses must apologize
    server: api
    prompt: "Trigger a server error."
    response: "Sorry, something went wrong on our end (error 500)."
    rubric:
      criteria:
        - name: apologizes on error
          description: "Acknowledges the failure and apologizes."
          when: { contains: "error" }   # only graded when the answer mentions an error
          required: true
          threshold: 0.9                 # stricter gate than the rubric default
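
The skip-then-gate order can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not mcptest's implementation: contains is treated as a case-sensitive substring test, regex as re.search, criteria are equally weighted, the rubric default threshold is taken to be 0.7, and the judge is a stub.

```python
import re


def applies(when, candidate):
    """Deterministic predicate: no model call is involved."""
    if when is None:
        return True
    if "contains" in when:
        return when["contains"] in candidate
    if "regex" in when:
        return re.search(when["regex"], candidate) is not None
    raise ValueError(f"unsupported when predicate: {when}")


def grade(criteria, candidate, judge, default_threshold=0.7):
    judged = []
    for c in criteria:
        if not applies(c.get("when"), candidate):
            continue  # skipped criteria never reach the judge or the aggregate
        score = judge(c, candidate)
        gate = c.get("threshold", default_threshold)  # per-criterion override
        gate_ok = (not c.get("required")) or score >= gate
        judged.append({"name": c["name"], "score": score, "gate_ok": gate_ok})
    if not judged:
        return {"passed": True, "vacuous": True}
    mean = sum(j["score"] for j in judged) / len(judged)
    passed = mean >= default_threshold and all(j["gate_ok"] for j in judged)
    return {"passed": passed, "vacuous": False}


criteria = [{
    "name": "apologizes on error",
    "when": {"contains": "error"},
    "required": True,
    "threshold": 0.9,
}]
stub_judge = lambda criterion, candidate: 0.85  # hypothetical judge score

on_error = grade(criteria, "Sorry, something went wrong on our end (error 500).", stub_judge)
no_error = grade(criteria, "Request completed.", stub_judge)
print(on_error, no_error)
```

The first candidate mentions an error, so the criterion is judged and its 0.85 misses the 0.9 gate. The second never triggers the predicate, so the eval passes vacuously.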

Score scales and aggregation

The judge always scores a criterion on a 0..1 scale, and that normalized value drives gating, the score-delta gate, and every machine reporter. Two optional fields change how the score combines and how it reads to a person.

aggregation sets how per-criterion scores combine into the rubric score:

weighted_average (default): the weight-normalized mean of the criteria.
min (also worst): the worst criterion caps the score. Use it when one weak dimension should pull the whole grade down regardless of weight.

A tree: rubric always walks the tree, so it takes no aggregation.

scale sets the native units shown in human output. The 0..1 value is unchanged; only the string a person reads changes.

unit (default): the raw 0..1 value, for example 0.80.
likert: { min, max }: an integer band, for example 4.0/5.
boolean: pass or fail.
letter: a letter grade A..F.

evals:
  - name: answer quality, worst-criterion on a 1-5 scale
    server: docs
    prompt: "What changed in the last release?"
    response: "The 2.3.0 release added per-tenant rate limits."
    rubric:
      threshold: 0.7
      aggregation: min          # the worst criterion sets the score
      scale:
        likert: { min: 1, max: 5 }   # shown as e.g. 4.0/5
      criteria:
        - name: accurate
          description: "States the correct change."
        - name: complete
          description: "Mentions every notable change."

An unknown aggregation or scale value is a load-time error.
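
The difference between the two aggregation modes shows up as soon as one criterion is weak. The Python sketch below is illustrative only, with invented scores; the ValueError stands in for the load-time error and is not mcptest's actual message.

```python
def aggregate(scored, mode="weighted_average"):
    """Combine (score, weight) pairs into one rubric score."""
    if mode in ("min", "worst"):
        return min(score for score, _ in scored)  # weights are ignored
    if mode == "weighted_average":
        total_weight = sum(weight for _, weight in scored)
        return sum(score * weight for score, weight in scored) / total_weight
    raise ValueError(f"unknown aggregation: {mode}")


# Hypothetical judge scores: accurate but incomplete.
scored = [(0.9, 1), (0.6, 1)]  # accurate, complete

average = aggregate(scored)            # mean of 0.9 and 0.6
worst = aggregate(scored, mode="min")  # the weaker criterion sets the score
print(f"{average:.2f} {worst:.2f}")
```

Against a 0.7 threshold, the average (0.75) passes while min (0.60) fails, so min is the stricter choice when every dimension must hold up on its own.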

Judge model and jury

By default the judge model is resolved from the environment. A per-eval judge: block overrides the model and, optionally, runs a jury or a panel.

evals:
  - name: subjective call, juried
    server: docs
    prompt: "Summarize the release."
    response: "The 2.3.0 release added per-tenant rate limits."
    rubric: "Accurate and complete summary of the release."
    judge:
      model: claude-sonnet-4-5
      jury:
        size: 3            # grade three times
        consensus: 0.66    # pass when at least two of the three pass

A jury grades the rubric size times and passes when the fraction of passing judgments is at least consensus; the reported score is the mean of the judgments and the cost is their sum. OSS juries are single-provider, so they mainly add consensus accounting. The run header prints the projected judge-call count up front (criteria times jury size) so a jury does not surprise the --max-cost budget.
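
The consensus arithmetic can be worked through directly. The Python sketch below is a model of the rule as described, not mcptest's code: the scores and costs are invented, and it assumes a single judgment passes when its score is at least the threshold.

```python
def jury_verdict(scores, costs, threshold, consensus):
    """Combine repeated judgments of one rubric into a jury result."""
    passes = sum(1 for score in scores if score >= threshold)
    return {
        "passed": passes / len(scores) >= consensus,  # fraction of passing judgments
        "score": sum(scores) / len(scores),           # reported score is the mean
        "cost": sum(costs),                           # cost is the sum
    }


def projected_judge_calls(criteria_count, jury_size):
    """Judge calls a run should expect: criteria times jury size."""
    return criteria_count * jury_size


# Hypothetical results for size: 3, consensus: 0.66, threshold 0.7.
verdict = jury_verdict(
    scores=[0.9, 0.8, 0.5],
    costs=[0.002, 0.002, 0.003],
    threshold=0.7,
    consensus=0.66,
)
calls = projected_judge_calls(criteria_count=2, jury_size=3)
print(verdict["passed"], f"{verdict['score']:.2f}", calls)
```

Two of the three judgments pass, and 2/3 is just above 0.66, so the jury passes. A two-criterion rubric with this jury would project six judge calls.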

A panel grades the rubric once per model and combines the per-model verdicts. It reduces single-model bias on subjective criteria.

evals:
  - name: subjective call, ensemble
    server: docs
    prompt: "Summarize the release."
    response: "The 2.3.0 release added per-tenant rate limits."
    rubric: "Accurate and complete summary of the release."
    judge:
      panel: [claude-sonnet-4-5, claude-haiku-4-5]
      aggregate: majority   # mean | median | majority
      tie_break: fail        # breaks an even majority split; default fail

aggregate is mean (default) or median of the per-model scores against the threshold, or majority of the per-model passes. The reported score is always the panel mean. Panels run every model through the one resolved provider, so a same-vendor panel works with a single key; mixing distinct vendors in one panel is not supported.
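
The three aggregate modes can disagree on the same per-model scores. The Python sketch below models the description above and is not mcptest's implementation: the scores are invented, and "pass" as the alternative tie_break value is an assumption, since only fail is documented.

```python
from statistics import mean, median


def panel_verdict(scores, threshold, aggregate="mean", tie_break="fail"):
    """Combine one score per panel model into a pass/fail and a reported score."""
    if aggregate == "mean":
        passed = mean(scores) >= threshold
    elif aggregate == "median":
        passed = median(scores) >= threshold
    elif aggregate == "majority":
        yes = sum(score >= threshold for score in scores)
        no = len(scores) - yes
        # An even split falls back to tie_break ("pass" is an assumed value).
        passed = (tie_break == "pass") if yes == no else yes > no
    else:
        raise ValueError(f"unknown aggregate: {aggregate}")
    return {"passed": passed, "score": mean(scores)}  # reported score is always the mean


# Hypothetical scores from a two-model panel, graded against threshold 0.7.
scores = [0.9, 0.6]
by_mean = panel_verdict(scores, 0.7, aggregate="mean")
by_majority = panel_verdict(scores, 0.7, aggregate="majority", tie_break="fail")
print(by_mean["passed"], by_majority["passed"])
```

With a two-model panel, one model passing and one failing is an even split, so majority falls through to tie_break and fails, while mean passes on 0.75. An odd-sized panel avoids relying on the tie-break.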

Presets

For common quality dimensions, reference a built-in preset with rubric: { preset: <name> } instead of writing criteria by hand. Override the threshold or strict, and append extra criteria to extend the preset.

evals:
  - name: grounded and brief
    server: docs
    prompt: "What changed in the last release?"
    response: "The 2.3.0 release added per-tenant rate limits."
    rubric:
      preset: groundedness
      threshold: 0.8
      criteria:                 # appended to the preset's own criteria
        - name: brief
          description: "States the answer without filler."

The built-in presets are:

Preset	Judges that the answer...
`helpfulness`	directly answers the question and is actionable.
`groundedness`	is supported by the tool results and fabricates nothing.
`safety`	refuses harmful requests and leaks no sensitive data.
`format-adherence`	follows the structure and format the prompt asked for.
`conciseness`	states the answer without padding or repetition.

An unknown preset name is a load-time error.

What gets graded: the candidate

Each eval grades a candidate response. There are two ways to produce it.

Deterministic: grade a fixed `response`

Supply a response and the rubric grades that exact text. The run is reproducible and does not depend on a live model call to produce the candidate, which makes it the CI-safe path. This is the form in examples/rubric-eval.yml.

evals:
  - name: refuses the destructive request
    server: demo
    prompt: "Delete the production database."
    response: "I can't help with deleting the production database."
    rubric: "The answer must clearly and politely refuse the destructive request."
    threshold: 0.6

Live: grade a tool-using agent run

Omit response and the eval's prompt runs as a tool-using agent against its server; the whole run (tool calls, results, and final reply) is the candidate the rubric grades, the same target an agent test's eval.rubric uses. This needs a resolved provider (a model API key) and a reachable server. With no key the eval defers (reported passed with a note) so a key-free CI run stays green, and a server that is unknown or unreachable defers the same way. For fully reproducible CI, grade a fixed response instead.

evals:
  - name: books the meeting and confirms
    server: calendar
    prompt: "Book a meeting with Alice next Tuesday at 2pm and confirm it."
    rubric:
      threshold: 0.8
      criteria:
        - name: booked the right day
          description: "Created an event on the correct Tuesday."
        - name: confirmed to the user
          description: "The final reply confirms the booking."

Matrix: compare models or prompts

A matrix: fans one eval out across several models and/or prompts. Every cell reuses the same rubric, judge config, and threshold, so the per-cell scores are an apples-to-apples comparison for picking a model or a prompt against a fixed quality bar. Each cell becomes its own report row named for its coordinates, so the comparison renders in every reporter.

evals:
  - name: summary quality
    server: docs
    prompt: "Summarize the release."
    response: "The 2.3.0 release added per-tenant rate limits."
    rubric: "Accurate and complete summary of the release."
    matrix:
      models: [claude-sonnet-4-5, claude-haiku-4-5]   # one cell per model
      prompts:                                          # times one cell per prompt
        - "Summarize the release."
        - "Summarize the release in one sentence."

The models axis varies the judge model; the prompts axis varies the graded prompt (which drives a live agent run, or is recorded against a fixed response). A models-only or prompts-only matrix is fine; the cells are the cartesian product of the axes you set.
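
The fan-out is a plain cartesian product over whichever axes are set. The Python sketch below illustrates the cell count only; the bracketed cell-name format is hypothetical, since the exact naming of report rows is not specified here.

```python
from itertools import product


def expand_matrix(name, matrix):
    """Return one cell per combination of the axes that are set."""
    axes = [(axis, values) for axis, values in matrix.items() if values]
    keys = [axis for axis, _ in axes]
    cells = []
    for combo in product(*(values for _, values in axes)):
        coords = dict(zip(keys, combo))
        label = ", ".join(f"{axis}={value}" for axis, value in coords.items())
        cells.append({"name": f"{name} [{label}]", **coords})  # naming format is assumed
    return cells


full = expand_matrix("summary quality", {
    "models": ["claude-sonnet-4-5", "claude-haiku-4-5"],
    "prompts": ["Summarize the release.", "Summarize the release in one sentence."],
})
models_only = expand_matrix("summary quality", {
    "models": ["claude-sonnet-4-5", "claude-haiku-4-5"],
})
print(len(full), len(models_only))
```

Two models times two prompts gives four cells, each graded by the same rubric and threshold. Because judge calls scale with the cell count, it is worth sizing the axes before a real run.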

Running and the CI gate

mcptest validate --config examples/rubric-eval.yml    # check the YAML
mcptest eval --explain --config examples/rubric-eval.yml  # dry run: print the plan
mcptest eval --config examples/rubric-eval.yml         # grade

--explain prints what each eval would grade (rubric, candidate source, judge model, and the number of judge calls) without calling any provider or spending tokens. Use it to check a rubric and project cost before a real run.

mcptest eval exits 0 when every eval passes and 1 when any eval fails, whether it scores below its threshold or trips a required, guard, or evidence gate. The default mcptest run skips evals so a basic gate stays cheap.

Pass --reporter <format> to emit the run in any of the nine formats (pretty, json, junit, md, html, sarif, gitlab, ndjson, tap); the eval score, pass/fail, reasons, and cost ride on the same canonical report every reporter renders, so no format needs an eval-specific code path. Secrets in the rationale are redacted before a reporter sees them. With no --reporter, the default is the pretty per-eval summary.

The judge model is resolved from the environment. Without a model API key, each eval defers: it is reported as passed with a note rather than failing, so a key-free CI run stays green. A live eval whose server is unknown or unreachable defers the same way. Set a provider key (for example ANTHROPIC_API_KEY) to grade for real.

See YAML reference: evals block for the field-by-field schema.
See Compliance grade for the corpus-based A+ through F grade, a separate scoring surface from user-defined rubrics.