
Session ledger

The session ledger is a structured, append-only record of the MCP tool calls a run made, keyed to a session. It exists so you can answer the question behavioral eval actually asks ("did the agent call the right tools, in the right order, with the right parameters, given the right inputs?") by querying structure rather than parsing a transcript.

Two failure modes the ledger avoids:

Format

A ledger file is newline-delimited JSON (NDJSON): exactly one header record, then one tool_call record per call, in call order. Before anything is written, payloads are redacted with the same redactor the reporters and cassettes use.

The records are validated by schemas/session-ledger-v1.json.

{"type":"header","schema_version":"v1","session_id":"019...","run_id":"019...","started_at":"2026-06-05T12:00:00Z","mcptest_version":"1.0.0","suite":"tests/agent.yml"}
{"type":"tool_call","session_id":"019...","agent_id":null,"hop_index":0,"tool_name":"search","server":"web","params":{"q":"weather sacramento"},"result":{"content":[{"type":"text","text":"..."}]},"is_error":false,"inputs_digest":"a1b2c3d4e5f60718","started_at":"2026-06-05T12:00:01Z","duration_ms":142,"caller":"direct"}
{"type":"tool_call","session_id":"019...","agent_id":null,"hop_index":1,"tool_name":"get_weather","server":"weather","params":{"city":"Sacramento"},"result":{"content":[{"type":"text","text":"72F"}]},"is_error":false,"inputs_digest":"f0e1d2c3b4a59687","started_at":"2026-06-05T12:00:02Z","duration_ms":88,"caller":"direct"}
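Because the layout is one header followed by tool_call records in call order, a consumer can load a ledger with nothing more than a JSON parser. The sketch below is a minimal illustration, not part of mcptest: `parse_ledger` and `tool_names` are hypothetical helper names, the sample records are abbreviated versions of the ones above, and the checks cover only the record ordering described here, not the full schema in schemas/session-ledger-v1.json.

```python
import json
from typing import Any

Record = dict[str, Any]


def parse_ledger(text: str) -> tuple[Record, list[Record]]:
    """Split NDJSON ledger text into (header, tool_calls), checking record order."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    if not records or records[0].get("type") != "header":
        raise ValueError("ledger must start with a header record")
    header, calls = records[0], records[1:]
    for rec in calls:
        if rec.get("type") != "tool_call":
            raise ValueError(f"unexpected record type: {rec.get('type')!r}")
    return header, calls


def tool_names(calls: list[Record]) -> list[str]:
    """Tool names in the order the records appear (the ledger is in call order)."""
    return [call["tool_name"] for call in calls]


# Abbreviated sample: only a few of the fields shown in the format example.
SAMPLE = "\n".join(
    json.dumps(rec)
    for rec in [
        {"type": "header", "schema_version": "v1", "session_id": "019..."},
        {"type": "tool_call", "session_id": "019...", "hop_index": 0,
         "tool_name": "search", "params": {"q": "weather sacramento"}},
        {"type": "tool_call", "session_id": "019...", "hop_index": 1,
         "tool_name": "get_weather", "params": {"city": "Sacramento"}},
    ]
)

header, calls = parse_ledger(SAMPLE)
print(header["schema_version"], tool_names(calls))
```

With the records parsed, "right tools, right order, right parameters" becomes ordinary list and dict comparison instead of transcript parsing.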

Fields

Every tool_call record carries the structure eval queries:

The same fields are exposed on the agent run envelope for inline assertions: tool_calls[i].hop_index, tool_calls[i].agent_id, tool_calls[i].inputs_digest (see yaml-reference.md's agent target grammar), alongside tool_names and redundant_tool_calls.

CLI: emit and diff

mcptest ledger emit turns a saved agent run envelope (the tool_calls shape mcptest produces, for example a single agent test extracted from mcptest run --reporter json) into a ledger:

mcptest ledger emit envelope.json --session-id run-42 --output baseline.ndjson

mcptest ledger diff gates an actual run against a baseline. It compares the tool-call shape position by position, per agent_id: a different tool at a hop counts as a remove plus an add, while a matching tool with different params counts as a param change. The command exits non-zero when the number of divergences exceeds --max-diff (default 0, meaning an exact match is required), so it drops straight into CI:

mcptest ledger diff baseline.ndjson actual.ndjson --max-diff 0
  - removed  hop 1: fetch
  + added    hop 1: delete
ledger diff: 2 divergence(s) exceed --max-diff 0

This grades behavior at the tool boundary, not prose: the baseline is a recorded trajectory, CI captures a fresh ledger, and the diff fails the build when the agent stops doing the right thing.
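The comparison rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not mcptest's implementation: `group_by_agent`, `diff_calls`, and `gate` are hypothetical names, the hop number reported is the position within each agent's call list, and treating a length mismatch as plain removes or adds is a guess the text above does not confirm.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Any, Optional

Call = dict[str, Any]
Divergence = tuple[str, int, str]  # (kind, position, tool_name)


def group_by_agent(calls: list[Call]) -> dict[Optional[str], list[Call]]:
    """Bucket tool calls by agent_id, preserving call order within each bucket."""
    groups: dict[Optional[str], list[Call]] = defaultdict(list)
    for call in calls:
        groups[call.get("agent_id")].append(call)
    return groups


def diff_calls(baseline: list[Call], actual: list[Call]) -> list[Divergence]:
    """Compare two call lists position by position, per agent_id."""
    base, act = group_by_agent(baseline), group_by_agent(actual)
    divergences: list[Divergence] = []
    # Visit agents in first-seen order without sorting (agent_id may be None).
    for agent in dict.fromkeys([*base, *act]):
        pairs = zip_longest(base.get(agent, []), act.get(agent, []))
        for pos, (b, a) in enumerate(pairs):
            if b is not None and a is not None and b["tool_name"] == a["tool_name"]:
                # Same tool at this position: only the params can diverge.
                if b.get("params") != a.get("params"):
                    divergences.append(("changed", pos, a["tool_name"]))
                continue
            # Different tool, or one side ran out (assumption): remove and/or add.
            if b is not None:
                divergences.append(("removed", pos, b["tool_name"]))
            if a is not None:
                divergences.append(("added", pos, a["tool_name"]))
    return divergences


def gate(divergences: list[Divergence], max_diff: int = 0) -> int:
    """Exit status: non-zero when the divergence count exceeds max_diff."""
    return 1 if len(divergences) > max_diff else 0


baseline = [
    {"agent_id": None, "tool_name": "search", "params": {"q": "x"}},
    {"agent_id": None, "tool_name": "fetch", "params": {"url": "u"}},
]
actual = [
    {"agent_id": None, "tool_name": "search", "params": {"q": "x"}},
    {"agent_id": None, "tool_name": "delete", "params": {"url": "u"}},
]
divergences = diff_calls(baseline, actual)
print(divergences, gate(divergences))
```

Here a swapped tool at position 1 yields two divergences (one remove, one add), matching the count in the sample output above, so the gate fails at the default threshold of 0.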

Scope

The ledger captures one thing and leaves the rest to downstream tooling:

Because the schema is defined here, every emitter (the mcptest CLI in test or replay, a runtime proxy in production) and every consumer speaks one contract.