Suite composition

Large suites get hard to maintain when every test repeats the same assertions and every assertion is treated as all-or-nothing. These primitives let a suite weight assertions, group them into sets with their own thresholds, gate a test on a combined score, share a baseline across tests, derive reported metrics, and run setup and teardown commands. They mirror promptfoo's config model.

Every primitive is optional with a back-compatible default. A suite that uses none of them behaves exactly as before: a test passes only when every assertion passes.

Run this example: examples/suite-composition.yml weights assertions, groups them into sets with their own thresholds, gates a test on a combined score, shares a defaultTest baseline, and derives a metric.

mcptest run --config examples/suite-composition.yml

Per-assertion weight

Each assertion may carry a weight (default 1.0). The weight matters only when a combined score is computed: the test declares a threshold, or the assertion sits inside an assert-set. In every other case the weight is inert, and every assertion still has to pass.

tools:
  - name: search returns a useful result
    server: api
    tool: search
    args: { q: "rust" }
    threshold: 0.7
    expect:
      - target: result.isError
        matcher: { exact: false }
        weight: 3.0
      - target: result.content[0].text
        matcher: { contains: "rust" }
        weight: 1.0

The combined score is the weighted pass-fraction: sum(weight of passing) / sum(all weights). If the first assertion passes and the second fails, the score is 3 / 4 = 0.75, which clears the 0.7 threshold, so the test passes.
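The arithmetic can be checked with a few lines of Python. This is a minimal sketch of the formula as stated above, not mcptest's implementation; the function name and the zero-weight fallback are assumptions made for illustration.

```python
def weighted_pass_fraction(items):
    """items is a list of (passed, weight) pairs.

    Returns sum(weight of passing) / sum(all weights).
    """
    total = sum(weight for _, weight in items)
    if total == 0:
        return 0.0  # assumption: nothing to weigh scores 0.0
    return sum(weight for passed, weight in items if passed) / total


# The example above: the isError check passes (weight 3.0),
# the text check fails (weight 1.0).
score = weighted_pass_fraction([(True, 3.0), (False, 1.0)])
passes = score >= 0.7
print(score, passes)  # 0.75 True
```

Swapping the two outcomes gives 1 / 4 = 0.25, which falls below the threshold: the heavier assertion dominates the score.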

Assert-set

An assert-set is a named bundle of assertions with its own threshold (0 to 1) and an optional weight. The set passes when its internal weighted pass-fraction is at or above the set threshold. The set then contributes to the parent test's combined score as a single unit: a passing set adds its weight, a failing set adds zero.

tools:
  - name: response mentions enough of the keywords
    server: api
    tool: summarize
    args: { topic: "deployment" }
    expect:
      - assert-set:
          name: keyword-coverage
          threshold: 0.6
          weight: 1.0
          assertions:
            - target: result.content[0].text
              matcher: { icontains: "rollback" }
            - target: result.content[0].text
              matcher: { icontains: "canary" }
            - target: result.content[0].text
              matcher: { icontains: "bluegreen" }

Two of the three string checks passing gives an internal score of 2 / 3 = 0.667, which clears the 0.6 set threshold, so the set passes. With no test-level threshold here, the legacy rule applies to the set as a unit: the set passing means the test passes.
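The two-level evaluation (score inside the set, then all-or-nothing contribution to the parent) might be sketched as follows. The helper names are hypothetical, and the member assertions are given the default weight of 1.0 because the example declares none.

```python
def pass_fraction(items):
    """Weighted pass-fraction over (passed, weight) pairs."""
    total = sum(weight for _, weight in items)
    if total == 0:
        return 0.0  # assumption: an empty set scores 0.0
    return sum(weight for passed, weight in items if passed) / total


def evaluate_set(results, threshold, weight=1.0):
    """Return (set_passed, weight contributed to the parent test)."""
    internal = pass_fraction([(passed, 1.0) for passed in results])
    set_passed = internal >= threshold
    # A passing set adds its whole weight, a failing set adds zero.
    return set_passed, (weight if set_passed else 0.0)


# keyword-coverage: "rollback" and "canary" found, "bluegreen" missing.
two_of_three = evaluate_set([True, True, False], threshold=0.6)
one_of_three = evaluate_set([True, False, False], threshold=0.6)
print(two_of_three)  # (True, 1.0)
print(one_of_three)  # (False, 0.0)
```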

Test-level threshold

A test's combined score is the weighted pass-fraction across its assertions and assert-sets (each passing item contributes its weight, each failing item contributes zero). When the test declares a threshold (0 to 1), it passes when the combined score is at or above it.

When a test declares no threshold, behavior is unchanged from earlier releases: every assertion and assert-set must pass. The score model is strictly opt-in. The combined score is reported only for tests that set a threshold, so the report JSON for legacy tests is byte-for-byte the same.

tools:
  - name: three of four checks is good enough
    server: api
    tool: profile
    threshold: 0.75
    expect:
      - target: result.name
        matcher: { schema: { type: string } }
      - target: result.email
        matcher: { regex: "@" }
      - target: result.age
        matcher: { exact: 30 }
      - target: result.country
        matcher: { exact: "US" }

If three of the four equal-weight assertions pass, the score is 0.75, which clears the threshold (the comparison is inclusive), so the test passes.
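The difference between the legacy rule and the opt-in score model comes down to one branch. The sketch below assumes a hypothetical verdict function and represents a missing threshold as None; it illustrates the rule described in this section rather than the runner's actual code.

```python
from typing import List, Optional, Tuple


def verdict(items: List[Tuple[bool, float]], threshold: Optional[float] = None) -> bool:
    """items holds one (passed, weight) pair per assertion or assert-set."""
    if threshold is None:
        # Legacy rule: every assertion and assert-set must pass.
        return all(passed for passed, _ in items)
    total = sum(weight for _, weight in items)
    score = sum(weight for passed, weight in items if passed) / total if total else 0.0
    return score >= threshold  # inclusive comparison


three_of_four = [(True, 1.0), (True, 1.0), (True, 1.0), (False, 1.0)]
print(verdict(three_of_four, threshold=0.75))  # True
print(verdict(three_of_four))                  # False
```

The same results pass with a 0.75 threshold and fail without one, which is why the threshold has to be an explicit opt-in.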

defaultTest

defaultTest is a suite-level baseline merged into every test. Its assertions and assert-sets are appended to each test (the test's own come first), and its options act as defaults that an explicit per-test value overrides. This avoids repeating the same "tool did not error" assertion on every test.

defaultTest:
  threshold: 0.8
  expect:
    - target: result.isError
      matcher: { exact: false }

tools:
  - name: inherits the baseline
    server: api
    tool: list
    expect:
      - target: result.items
        matcher: { schema: { type: array } }

  - name: overrides the threshold
    server: api
    tool: flaky
    threshold: 0.5
    expect:
      - target: result.items
        matcher: { schema: { type: array } }

The first test runs with two assertions (its own result.items check plus the baseline result.isError check) and inherits the 0.8 threshold. The second test also gets both assertions but keeps its own 0.5 threshold, because an explicit value always wins over the default.
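The merge rule can be sketched with plain dictionaries standing in for the parsed YAML. The merge_default helper is hypothetical, and a shallow merge of options is an assumption; only the ordering of assertions and the override rule come from the description above.

```python
def merge_default(test: dict, default_test: dict) -> dict:
    """Merge a suite-level baseline into one test."""
    # Options: defaults first, explicit per-test values win.
    merged = {**default_test, **test}
    # Assertions: the test's own come first, the baseline is appended.
    merged["expect"] = list(test.get("expect", [])) + list(default_test.get("expect", []))
    return merged


default_test = {
    "threshold": 0.8,
    "expect": [{"target": "result.isError", "matcher": {"exact": False}}],
}
tests = [
    {"name": "inherits the baseline", "expect": [{"target": "result.items"}]},
    {"name": "overrides the threshold", "threshold": 0.5,
     "expect": [{"target": "result.items"}]},
]

merged = [merge_default(test, default_test) for test in tests]
for m in merged:
    print(m["name"], m["threshold"], [a["target"] for a in m["expect"]])
# inherits the baseline 0.8 ['result.items', 'result.isError']
# overrides the threshold 0.5 ['result.items', 'result.isError']
```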

derivedMetrics

A derived metric is a value computed by weighted aggregation over named assertion and assert-set scores. Reference other scores by name and combine them with weighted_sum or weighted_average. This is not an expression language and there is no JavaScript evaluator: a derived metric is only a named weighted aggregation. That limitation is deliberate so every reported number is auditable.

Derived metrics evaluate in declaration order, so a metric may reference an earlier derived metric by name (chained derivation), in addition to named assertions and assert-sets. A reference to a derived metric declared later or to itself is a load-time error, since it could never resolve.

A derived metric is reported on the run. When it carries a threshold, it also gates: the test fails if the metric is below the threshold, even when the test's own combined-score threshold is met.

Across a run, derived metrics roll up to a suite-level summary: each metric name gets a mean (over the tests that reported it) and a count. The Markdown and comparison-matrix reporters render this rollup as a "Metric rollups" table beneath the per-test results.

tools:
  - name: blended quality score
    server: api
    tool: answer
    threshold: 0.5
    expect:
      - target: result.content[0].text
        matcher: { icontains: "sources" }
        name: cites_sources
      - target: result.content[0].text
        matcher: { regex: "^[A-Z]" }
        name: well_formed
    derivedMetrics:
      - name: quality
        threshold: 0.7
        value:
          weighted_average:
            - { ref: cites_sources, weight: 2.0 }
            - { ref: well_formed, weight: 1.0 }

If cites_sources passes (score 1.0) and well_formed fails (score 0.0), the quality metric is (1.0 * 2 + 0.0 * 1) / (2 + 1) = 0.667. That is below the metric's 0.7 threshold, so the test fails even though the assertion pass-fraction (0.5) met the test threshold. A referenced name that does not match any named assertion or set scores 0.0 and is reported so a typo is visible.
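A minimal sketch of the aggregation, assuming that weighted_sum means the sum of score times weight and that weighted_average divides that sum by the total weight. The evaluate_derived function and its error message are hypothetical; the declaration-order rule and the 0.0 score for an unknown name follow the text above.

```python
def evaluate_derived(metrics, scores):
    """metrics: derived metric dicts in declaration order.
    scores: name -> 0..1 score for named assertions and assert-sets.
    """
    derived_names = {m["name"] for m in metrics}
    known = dict(scores)
    values = {}
    for metric in metrics:
        kind, terms = next(iter(metric["value"].items()))
        pairs = []
        for term in terms:
            ref = term["ref"]
            if ref in derived_names and ref not in values:
                # A later or self reference could never resolve.
                raise ValueError(f"{metric['name']} references {ref} before it is defined")
            pairs.append((known.get(ref, 0.0), term["weight"]))  # unknown name scores 0.0
        total = sum(score * weight for score, weight in pairs)
        if kind == "weighted_average":
            weight_sum = sum(weight for _, weight in pairs)
            total = total / weight_sum if weight_sum else 0.0
        values[metric["name"]] = known[metric["name"]] = total
    return values


metrics = [{
    "name": "quality",
    "threshold": 0.7,
    "value": {"weighted_average": [
        {"ref": "cites_sources", "weight": 2.0},
        {"ref": "well_formed", "weight": 1.0},
    ]},
}]
result = evaluate_derived(metrics, {"cites_sources": 1.0, "well_formed": 0.0})
gate_ok = result["quality"] >= metrics[0]["threshold"]
print(round(result["quality"], 3), gate_ok)  # 0.667 False
```

Because each metric is stored under its own name once computed, a later metric can reference it the same way it references an assertion.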

Lifecycle hooks

Hooks run commands around the run and around each test. beforeAll and afterAll run once for the whole run. beforeEach and afterEach run around each test, which is what stateful MCP session setup and teardown need. Each phase takes a single command string or a list of command strings, run in declaration order.

hooks:
  beforeAll: ./scripts/seed-db.sh
  afterAll:
    - ./scripts/dump-logs.sh
    - ./scripts/teardown-db.sh
  beforeEach: ./scripts/reset-session.sh
  afterEach: ./scripts/snapshot-session.sh

The ordering across a two-test run is:

beforeAll
  beforeEach -> test 1 -> afterEach
  beforeEach -> test 2 -> afterEach
afterAll

A non-zero exit from beforeAll fails every test in the run, because no test could run against the setup it needed. A non-zero exit from beforeEach fails that one test without dispatching it. Failures from afterAll and afterEach are reported but do not by themselves fail a test, since the test already ran. When a suite declares any per-test hook, the runner executes tests serially so the bracket holds for stateful sessions, and declaring hooks opts the run out of result caching.
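The failure rules can be traced with a small simulation. Everything here is hypothetical: run_suite, hook_ok, and dispatch stand in for the runner, the hook process, and the tool call. The sketch also assumes that afterEach and afterAll still run after a failed before hook, which this section does not state.

```python
def run_suite(tests, hook_ok, dispatch):
    """hook_ok(phase, test) -> True on a zero exit.
    dispatch(test) -> True when the test itself passes.
    """
    trace, verdicts = [], {}
    trace.append("beforeAll")
    setup_ok = hook_ok("beforeAll", None)
    for test in tests:
        if not setup_ok:
            verdicts[test] = "fail"  # beforeAll failed: every test fails
            continue
        trace.append(f"beforeEach {test}")
        if hook_ok("beforeEach", test):
            trace.append(f"test {test}")
            verdicts[test] = "pass" if dispatch(test) else "fail"
        else:
            verdicts[test] = "fail"  # failed without being dispatched
        trace.append(f"afterEach {test}")
        hook_ok("afterEach", test)  # a failure is reported, the verdict stays
    trace.append("afterAll")
    hook_ok("afterAll", None)
    return trace, verdicts


# beforeEach exits non-zero for t1 only.
trace, verdicts = run_suite(
    ["t1", "t2"],
    hook_ok=lambda phase, test: not (phase == "beforeEach" and test == "t1"),
    dispatch=lambda test: True,
)
print(verdicts)  # {'t1': 'fail', 't2': 'pass'}
```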

A hook is any executable

A hook command is not limited to bash. It is any command on your PATH (or an absolute or relative path to a script), so you can write hooks in whatever language fits the job:

hooks:
  beforeAll: node ./setup.js
  beforeEach: python seed.py
  afterAll: ./cleanup.sh

mcptest splits the command string with POSIX shell rules and runs the program directly. It does not wrap the command in sh -c, so a hook is one program plus its arguments, not a shell pipeline. If you need shell features (pipes, redirection, &&), put them in a script file and call the script, or invoke the shell yourself: bash -c '...'.
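Python's shlex.split applies POSIX-style quoting rules, so it is a convenient way to see what "one program plus its arguments" means. It stands in for mcptest's own splitter here, which may differ in detail; the script names are hypothetical.

```python
import shlex

# A plain command: program plus arguments.
plain = shlex.split("node ./setup.js --seed 42")
print(plain)  # ['node', './setup.js', '--seed', '42']

# Without a shell, "|" and ">" are just more arguments passed to dump.sh.
piped = shlex.split("./dump.sh | gzip > out.gz")
print(piped)  # ['./dump.sh', '|', 'gzip', '>', 'out.gz']

# Quoting hands the whole pipeline to bash as a single argument.
wrapped = shlex.split("bash -c './dump.sh | gzip > out.gz'")
print(wrapped)  # ['bash', '-c', './dump.sh | gzip > out.gz']
```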

Suite-level environment variables pass through to the hook's environment, the same way they reach the server subprocess. A node ./setup.js hook reads them from process.env, a python seed.py hook from os.environ, and a shell script from $NAME. That is how a hook learns which database URL to seed or which fixture port to bind.

These command hooks are for side-effecting setup and teardown: start a fixture server, seed a store, write a temp file, remove it again. They run for their effects and their exit code. They do not receive the run context on stdin and mcptest ignores anything they print. If you need a hook that reads the phase, suite, test, env, and vars, or that returns data to inject vars or skip a test, use the context-aware hooks described next, which add a JSON-over-stdio contract on top of the plain command form.

Context-aware hooks

A hook step can opt into a richer contract that hands the hook the run context as JSON and lets it return a small patch. A step is either a plain command string (everything above) or a map that declares mode: context:

hooks:
  beforeEach:
    - echo plain-still-works
    - command: node ./compute-vars.js
      mode: context

mode: context is required on the map form; a map without it is a config error, so the opt-in is never ambiguous. A bare string stays a plain hook.

A context hook reads one JSON object on its stdin:

{
  "phase": "beforeEach",
  "suite": { "name": "my-suite" },
  "test": { "name": "lists tools" },
  "env": { "BASE_URL": "http://localhost:8080" },
  "vars": { "region": "us-west" },
  "result": null
}

test is null for the beforeAll and afterAll phases (there is no single current test). result is null except in the afterEach and afterAll phases, where it carries the test outcome ({ "verdict": "pass" }, with a message when the test failed). The phase also arrives in the MCPTEST_HOOK_PHASE environment variable, so a jq-style hook can branch without parsing the payload.

The hook may write one JSON object to stdout, restricted to three keys:

{ "env": { "TOKEN": "..." }, "vars": { "session": "..." }, "skip": { "reason": "no creds" } }

Every key is optional, so empty output (or no output at all) is a valid no-op. Any key outside env, vars, and skip is an error, so a hook cannot mutate arbitrary suite state.
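A context hook written in Python might look like the sketch below. The decision logic (skip when BASE_URL is missing, otherwise inject a session var) and the names build_patch and run_hook are invented for illustration. The demonstration feeds the hook in-memory streams; a real script would call run_hook(sys.stdin, sys.stdout).

```python
import io
import json


def build_patch(ctx: dict) -> dict:
    """Decide what to hand back to mcptest for one hook invocation."""
    if ctx["phase"] != "beforeEach":
        return {}  # nothing to contribute: an empty patch is a valid no-op
    if "BASE_URL" not in ctx.get("env", {}):
        return {"skip": {"reason": "BASE_URL is not set"}}
    return {"vars": {"session": "demo-session"}}  # hypothetical value


def run_hook(stdin, stdout) -> None:
    """Read one JSON object from stdin, write one JSON patch to stdout."""
    ctx = json.load(stdin)
    json.dump(build_patch(ctx), stdout)


payload = {
    "phase": "beforeEach",
    "suite": {"name": "my-suite"},
    "test": {"name": "lists tools"},
    "env": {"BASE_URL": "http://localhost:8080"},
    "vars": {"region": "us-west"},
    "result": None,
}
out = io.StringIO()
run_hook(io.StringIO(json.dumps(payload)), out)
print(out.getvalue())  # {"vars": {"session": "demo-session"}}
```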

The before and after phases treat the patch differently:

For beforeAll and beforeEach, mcptest applies the patch. env and vars merge into the context the following hooks see (a beforeAll injection reaches every beforeEach hook). skip skips the affected tests with the given reason: a beforeAll skip skips the whole run, a beforeEach skip skips that one test.
For afterAll and afterEach, the hook still receives the result on stdin, but its patch is advisory and ignored. An after hook may not retroactively fail or rewrite a test that already ran. Inspect the result, log it, ship it somewhere, but the verdict is fixed by then.
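Seen from the runner's side, the two rules reduce to a key check followed by a phase check. The apply_patch function below is a hypothetical sketch, not mcptest's code; it raises ValueError for unparseable output or an unknown key and leaves it to the caller to turn that into a test failure or a report entry.

```python
import json

ALLOWED_KEYS = {"env", "vars", "skip"}
BEFORE_PHASES = {"beforeAll", "beforeEach"}


def apply_patch(phase, context, stdout_text):
    """Return (context for the following hooks, skip reason or None)."""
    # Empty output is a valid no-op. json.JSONDecodeError is a ValueError.
    patch = json.loads(stdout_text) if stdout_text.strip() else {}
    if not isinstance(patch, dict):
        raise ValueError("hook output must be a JSON object")
    unknown = set(patch) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unsupported patch keys: {sorted(unknown)}")
    if phase not in BEFORE_PHASES:
        return context, None  # after phases: the patch is advisory and ignored
    updated = dict(context)
    updated["env"] = {**context.get("env", {}), **patch.get("env", {})}
    updated["vars"] = {**context.get("vars", {}), **patch.get("vars", {})}
    skip = patch.get("skip")
    return updated, (skip.get("reason") if skip else None)


ctx = {"env": {"BASE_URL": "http://localhost:8080"}, "vars": {"region": "us-west"}}
before_ctx, before_skip = apply_patch("beforeEach", ctx, '{"vars": {"session": "abc"}}')
after_ctx, after_skip = apply_patch("afterEach", ctx, '{"skip": {"reason": "too late"}}')
print(before_ctx["vars"])  # {'region': 'us-west', 'session': 'abc'}
print(after_skip)          # None
```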

A context hook fails the same way a plain hook does: a spawn failure, a non-zero exit, a timeout, or unparseable stdout fails the affected test in a before phase and is reported (not fatal) in an after phase. The injected vars and env flow through the hook chain and the skip decision, not into the tool-call arguments themselves (those are resolved from the YAML before the run, the same boundary the transform step documents). See examples/hooks-context/ for a runnable script.

Reporting

The JSON reporter surfaces two new per-test fields when the score model is active: score (the combined 0 to 1 value, present only for tests with a threshold) and derived_metrics (each with its name, value, and, when gated, passed). Legacy tests omit both fields, so existing consumers are unaffected.