Observability and eval-platform exports

mcptest is the MCP-specific source of truth for a run: which tools were called, which checks passed, what the judges decided, what it cost. The eval and observability market (Braintrust, LangSmith, Arize, Galileo, Patronus, plus any OpenTelemetry collector) speaks in traces and spans. The openinference export projects a run into that world so you keep one MCP-aware runner and still see results in the dashboards your team already uses, with no vendor SDK bundled into mcptest.

The export

openinference is a reporter, so it works from both CLI surfaces, mcptest run and mcptest report:

# Straight from a run.
mcptest run tests.yaml --reporter openinference --output run-trace.jsonl

# Or re-render a saved JSON run later (no re-execution).
mcptest run tests.yaml --reporter json --output run.json
mcptest report run.json --format openinference > run-trace.jsonl

The output is JSONL: one span per line, each self-describing with a schema_version of mcptest.dev/openinference/v1. The shape reuses the run report and the pinned OpenTelemetry span-name conventions (mcptest.run, mcptest.test) rather than inventing a parallel trace model.
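Because every line is a standalone JSON object, a consumer needs nothing beyond a JSON parser. The following is a minimal sketch of a reader, not part of mcptest: load_spans and count_by_name are hypothetical helper names, and the only fields it relies on are schema_version and name from the description above.

```python
import json
from collections import Counter
from typing import Iterable, List


def load_spans(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL text into span dicts, one per non-blank line."""
    spans = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines such as a trailing newline
        span = json.loads(line)
        # Each span is described as self-describing, so require the marker.
        if "schema_version" not in span:
            raise ValueError(f"line {lineno}: span has no schema_version")
        spans.append(span)
    return spans


def count_by_name(spans: List[dict]) -> Counter:
    """Tally spans by name, e.g. mcptest.run versus mcptest.test."""
    return Counter(span["name"] for span in spans)
```

To use it on a real export, pass an open file handle, for example load_spans(open("run-trace.jsonl", encoding="utf-8")).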

Span shape

A run becomes a small span tree:

{"schema_version":"mcptest.dev/openinference/v1","trace_id":"d58a4d69...","span_id":"61cc95b2...","parent_span_id":null,"name":"mcptest.run","span_kind":"CHAIN","status_code":"OK","duration_ms":12,"attributes":{"mcptest.run.id":"019...","mcptest.run.total":3,"mcptest.run.passed":3,"mcptest.run.failed":0}}
{"schema_version":"mcptest.dev/openinference/v1","trace_id":"d58a4d69...","span_id":"01cfdb87...","parent_span_id":"61cc95b2...","name":"mcptest.test","span_kind":"CHAIN","status_code":"OK","duration_ms":3,"attributes":{"mcptest.test.name":"search returns a hit","mcptest.test.status":"pass"}}

trace_id (32 hex characters) and span_id (16 hex characters) are derived deterministically from the run id, so re-exporting the same run produces the same ids, and a consumer can correlate runs and link back to the source.
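A consumer can check these properties without knowing how the ids are derived. The sketch below is an illustration under assumptions, not mcptest code: the helper names are hypothetical, and it accepts both lowercase and uppercase hex because the document does not say which is emitted. It validates id lengths, indexes spans by parent to recover the tree, and compares two exports of the same run.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

HEX32 = re.compile(r"[0-9a-fA-F]{32}")
HEX16 = re.compile(r"[0-9a-fA-F]{16}")


def ids_well_formed(span: dict) -> bool:
    """True if trace_id is 32 hex chars and span ids are 16 hex chars."""
    parent = span.get("parent_span_id")
    return (
        HEX32.fullmatch(span["trace_id"]) is not None
        and HEX16.fullmatch(span["span_id"]) is not None
        and (parent is None or HEX16.fullmatch(parent) is not None)
    )


def children_index(spans: List[dict]) -> Dict[Optional[str], List[dict]]:
    """Group spans by parent_span_id; root spans land under the key None."""
    index: Dict[Optional[str], List[dict]] = defaultdict(list)
    for span in spans:
        index[span.get("parent_span_id")].append(span)
    return index


def same_ids(export_a: List[dict], export_b: List[dict]) -> bool:
    """True if two exports contain exactly the same (trace_id, span_id) pairs."""
    def pairs(spans: List[dict]) -> list:
        return sorted((s["trace_id"], s["span_id"]) for s in spans)
    return pairs(export_a) == pairs(export_b)
```

With spans loaded from an export, children_index(spans)[None] gives the root mcptest.run span, and looking up its span_id gives the test spans beneath it.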

status_code follows OpenTelemetry: OK for a pass, ERROR for a fail, UNSET for a skip or an inconclusive verdict. The run span is ERROR when any test failed.
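These rules can be turned into a consistency check on the consumer side. The sketch below is hypothetical and rests on assumptions: only the "pass" value of mcptest.test.status appears in the sample above, so the strings "fail", "skip", and "inconclusive" are guesses at the other values, and unknown values are skipped rather than flagged.

```python
from typing import List

# Assumed status strings; only "pass" is confirmed by the sample spans.
EXPECTED_STATUS_CODE = {
    "pass": "OK",
    "fail": "ERROR",
    "skip": "UNSET",
    "inconclusive": "UNSET",
}


def status_mismatches(spans: List[dict]) -> List[str]:
    """Return human-readable problems; an empty list means consistent."""
    problems = []
    tests = [s for s in spans if s["name"] == "mcptest.test"]
    for span in tests:
        status = span.get("attributes", {}).get("mcptest.test.status")
        expected = EXPECTED_STATUS_CODE.get(status)
        # Skip statuses this sketch does not know about.
        if expected is not None and span["status_code"] != expected:
            problems.append(
                f"test span {span['span_id']}: status {status!r} "
                f"should map to {expected}, got {span['status_code']}"
            )
    # The run span is ERROR when any test failed.
    any_failed = any(s["status_code"] == "ERROR" for s in tests)
    for run in (s for s in spans if s["name"] == "mcptest.run"):
        if any_failed and run["status_code"] != "ERROR":
            problems.append("run span should be ERROR because a test failed")
    return problems
```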

Using it with eval and observability tools

The export is a neutral interchange format: each platform ingests it through its own collector or import path, with no mcptest-side coupling.

Vendor-specific uploaders (which push directly to a hosted endpoint) are deliberately kept out of the OSS core to avoid bundling vendor SDKs; the JSONL is the stable contract that an adapter builds on.

Provenance and evidence

Pair the trace with an evidence pack: the pack carries the run's grades, coverage, and a signed digest, while the trace carries the per-span detail. Both key off the same run id, so a reviewer can move from a dashboard span to the signed governance artifact for the same run.

Stability

The span names, span_kind values, and schema_version are stable; new attributes may be added under the mcptest.* namespace without a version bump. A breaking change to the shape bumps the version suffix (the N in mcptest.dev/openinference/vN).
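For an adapter, this policy suggests a tolerant reader: gate on the version number and ignore attributes it does not recognize. The sketch below shows one way to do that; the function names and SUPPORTED_VERSION are hypothetical and are not defined by mcptest.

```python
import re
from typing import Iterable, Optional

SCHEMA_RE = re.compile(r"mcptest\.dev/openinference/v(\d+)")
SUPPORTED_VERSION = 1  # the version this hypothetical adapter was written for


def schema_version_number(span: dict) -> Optional[int]:
    """Extract N from mcptest.dev/openinference/vN, or None if malformed."""
    match = SCHEMA_RE.fullmatch(span.get("schema_version", ""))
    return int(match.group(1)) if match else None


def can_read(span: dict) -> bool:
    """Accept only the supported version; a bump signals a breaking change."""
    return schema_version_number(span) == SUPPORTED_VERSION


def pick_attributes(span: dict, wanted: Iterable[str]) -> dict:
    """Keep only the attributes the adapter knows, ignoring newer additions."""
    attrs = span.get("attributes", {})
    return {key: attrs[key] for key in wanted if key in attrs}
```

Selecting attributes by name rather than rejecting unknown keys means an attribute added later under mcptest.* does not break the adapter.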