
Compliance score delta (CI PR comment and merge gate)

The compliance baseline gives you a binary signal: a rule is on the expected-failures list or it is not, and CI flips red the moment an off-list rule fails. That answers "did anything break?" but not "by how much did the score move?".

The score-delta report is the continuous companion. It takes two whole compliance scores, a stored prior run and the run produced on a pull request, and reports how the overall letter grade and pass rate moved, plus a per-section improved / unchanged / regressed verdict.

Gate on it: the grade_delta: block

A compliance suite (the object form, with tests:) gates the run against a saved prior scorecard with a grade_delta: block:

compliance:
  spec_version: v2025-06-18
  grade_delta:
    against: ./compliance-scorecard.json   # a scorecard saved from a prior run
    expect:                                # optional; defaults below apply if omitted
      - target: grade_regressed
        matcher: { exact: false }
      - target: regressed_sections
        matcher: { schema: { maxItems: 0 } }
  tests:
    - name: PROTO-002
      server: local
      check: "initialize"

The targets:

Omit expect: and the engine applies the defaults: grade_regressed must be false and regressed_sections must be empty. The whole flow:

# 1. Save this run's scorecard.
mcptest compliance run --from-suite compliance.yml \
  --save-scorecard compliance-scorecard.json

# 2. Commit compliance-scorecard.json, point grade_delta.against: at it.

# 3. Future runs gate against it. A grade drop or a regressed section fails.
mcptest compliance run --from-suite compliance.yml

A first run with no prior scorecard passes with a "no baseline" note, so adding the gate never fails the build on the day it lands. A scorecard that is present but unreadable is a load error. The gate prints a single grade_delta: PASS|FAIL line, and a failure exits non-zero.
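The missing-versus-unreadable distinction can be sketched as below. This is an illustration only, not mcptest's code: the function name load_prior_scorecard and the tuple it returns are hypothetical, and the scorecard is assumed to be a JSON file.

```python
import json
from pathlib import Path


def load_prior_scorecard(path):
    """Return (scorecard, note). Hypothetical helper, not an mcptest API."""
    p = Path(path)
    if not p.exists():
        # First run: nothing to compare against, so the gate passes with a note.
        return None, "no baseline"
    try:
        return json.loads(p.read_text(encoding="utf-8")), None
    except (OSError, ValueError) as exc:
        # Present but unreadable or malformed: report a load error
        # instead of silently passing the gate.
        raise RuntimeError(f"cannot load scorecard {path}: {exc}") from exc
```

Treating a corrupt file as an error rather than as "no baseline" keeps a damaged scorecard from quietly disabling the gate.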

The worked suite is examples/compliance-grade-delta.yml.

How it pairs with the baseline

The two features sit side by side and answer different questions:

You can run both. The baseline keeps the floor; the delta watches the slope.

What it compares

The delta is computed from two ComplianceScore values. For each it reads:

A section that was applicable in both runs is comparable and gets a verdict. A section that was not applicable in at least one of the two runs (for example, the server stopped declaring tools) is reported as "not comparable" and never trips the gate, because there is no meaningful score to diff.
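A minimal sketch of the comparability rule, under stated assumptions: the SectionScore type and section_verdict function are hypothetical names, and the verdict is derived from the section pass rate alone, which may differ from how the engine actually decides it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SectionScore:
    applicable: bool
    pass_rate: float  # percentage, 0.0 to 100.0


def section_verdict(prior: SectionScore, current: SectionScore) -> Optional[str]:
    """Return improved / unchanged / regressed, or None when not comparable."""
    if not (prior.applicable and current.applicable):
        # Not applicable in at least one run: there is no score to diff.
        return None
    if current.pass_rate > prior.pass_rate:
        return "improved"
    if current.pass_rate < prior.pass_rate:
        return "regressed"
    return "unchanged"
```

Returning None for the non-comparable case means a caller that collects only "regressed" verdicts cannot trip the gate on a section that disappeared.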

The section breakdown

The comparison the gate reads is per-section, not just a single overall grade. A regressed run reads like this (overall grade, overall pass rate, then a per-section verdict):

Compliance score delta: A+ -> C (regressed)
Overall pass rate: 100.0 -> 72.7 (-27.3)
0 improved, 4 unchanged, 1 regressed.

Lifecycle        A+ -> A+   100.0 -> 100.0   unchanged
Tools            A+ -> C    100.0 -> 50.0    regressed
Resources        A+ -> A+   100.0 -> 100.0   unchanged
Prompts          A+ -> A+   100.0 -> 100.0   unchanged
Transport / Auth A+ -> A+   100.0 -> 100.0   unchanged

regressed_sections here is ["tools"], so the default gate (regressed_sections must be empty) fails the run. grade_regressed is also true, because the overall grade fell from A+ to C. The verdict words are plain (improved, unchanged, regressed), with no emoji or arrows.
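To make the default gate concrete, here is a small sketch that feeds the section rows from the report above into the two default conditions. Only the "tools" key is confirmed by the text. The other section keys and both function names are assumptions for illustration.

```python
# Rows from the report above: (section key, prior pass rate, current pass rate).
# Only "tools" is a key confirmed by the text; the other keys are hypothetical.
ROWS = [
    ("lifecycle", 100.0, 100.0),
    ("tools", 100.0, 50.0),
    ("resources", 100.0, 100.0),
    ("prompts", 100.0, 100.0),
    ("transport_auth", 100.0, 100.0),
]


def regressed_sections(rows):
    """Keys of sections whose pass rate fell, in the order given."""
    return [key for key, prior, current in rows if current < prior]


def default_gate_passes(grade_regressed, regressed):
    """Default expect: grade_regressed is false and regressed_sections is empty."""
    return (not grade_regressed) and len(regressed) == 0


regressed = regressed_sections(ROWS)
print(regressed)                             # ['tools']
print(default_gate_passes(True, regressed))  # False
```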

The merge gate

The grade_delta: block is the gate. With the default expect:, the run fails when the overall grade fell or any section regressed. To tolerate small moves, write an explicit expect: that loosens the bar, for example gating only on the grade and the current score:

grade_delta:
  against: ./compliance-scorecard.json
  expect:
    - target: grade_regressed
      matcher: { exact: false }
    - target: score
      matcher: { schema: { minimum: 90 } }

A grade fall is the strongest signal: a run can newly fail a single SHOULD rule and drop a letter grade while the numeric score barely moves, and grade_regressed catches that. Because the gate reuses the standard assertion grammar, you tune it with the same matchers every other check uses.
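The effect of loosening expect: can be sketched with a toy evaluator. It handles only the matcher keywords that appear in the examples on this page (exact, and schema with minimum or maxItems). The real assertion grammar is presumably richer, and the names matches and failed_targets and the sample score of 91.5 are hypothetical.

```python
def matches(value, matcher):
    """Evaluate one matcher; supports only the keywords shown on this page."""
    if "exact" in matcher:
        expected = matcher["exact"]
        # Compare types too, so that False does not match 0.
        return type(value) is type(expected) and value == expected
    if "schema" in matcher:
        schema = matcher["schema"]
        if "minimum" in schema and value < schema["minimum"]:
            return False
        if "maxItems" in schema and len(value) > schema["maxItems"]:
            return False
        return True
    raise ValueError(f"unsupported matcher: {matcher!r}")


def failed_targets(delta, expect):
    """Return the targets whose matcher did not hold, in expect order."""
    return [e["target"] for e in expect if not matches(delta[e["target"]], e["matcher"])]


# Hypothetical delta: the grade held, but one section regressed.
delta = {"grade_regressed": False, "regressed_sections": ["tools"], "score": 91.5}

default_expect = [
    {"target": "grade_regressed", "matcher": {"exact": False}},
    {"target": "regressed_sections", "matcher": {"schema": {"maxItems": 0}}},
]
loosened_expect = [
    {"target": "grade_regressed", "matcher": {"exact": False}},
    {"target": "score", "matcher": {"schema": {"minimum": 90}}},
]

print(failed_targets(delta, default_expect))   # ['regressed_sections']
print(failed_targets(delta, loosened_expect))  # []
```

The same delta fails the default gate and passes the loosened one, which is the trade the explicit expect: makes: section-level regressions are tolerated as long as the grade holds and the score stays at or above 90.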

Determinism

score_delta is pure: identical inputs produce identical output, the section list is ordered by the canonical section order, and the summary is pre-computed so reporters never re-scan. ScoreDelta is serializable, so CI can cache the delta JSON alongside the run for a trend tool to chart later. The hosted historical baseline and trend dashboard are out of scope; the OSS engine ships the single-run diff, comment, and gate.
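The determinism property can be illustrated with a short sketch. The canonical order below is assumed to be the row order of the report shown earlier, and the section keys, function names, and JSON layout are hypothetical rather than the engine's actual format.

```python
import json

# Assumed canonical order, taken from the row order of the report above.
CANONICAL_ORDER = ["lifecycle", "tools", "resources", "prompts", "transport_auth"]


def ordered_sections(verdicts):
    """Return [name, verdict] pairs in canonical order, whatever the input order."""
    return [[name, verdicts[name]] for name in CANONICAL_ORDER if name in verdicts]


def delta_to_json(grade_regressed, verdicts):
    """Serialize with sorted keys so equal inputs give byte-identical output."""
    payload = {
        "grade_regressed": grade_regressed,
        "sections": ordered_sections(verdicts),
    }
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


a = delta_to_json(True, {"tools": "regressed", "lifecycle": "unchanged"})
b = delta_to_json(True, {"lifecycle": "unchanged", "tools": "regressed"})
print(a == b)  # True
```

Byte-identical output is what makes the cached delta JSON safe to diff or hash between CI runs.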
