Importing external benchmarks and eval corpora

A published MCP benchmark or a third-party eval config is a list of tasks, each pairing a prompt with an expectation. mcptest import normalizes that list into an mcptest agent suite, so you can run someone else's corpus through your own server with the same engine, reports, and evidence packs you already use, instead of rebuilding the harness. Every import records where it came from and what it skipped, so the generated suite is auditable.

This is the agent/eval side of importing. The security-scanner side (SARIF, Snyk, generic findings) is mcptest security import, which folds external findings into the security report as advisory supplements. See security-report.md.

Quick start

# Turn a generic task corpus into a suite that targets your server.
mcptest import corpus.json --server my-server --output imported.yml

# Edit the placeholder server command, then run it like any suite.
mcptest run --config imported.yml

The generated suite opens with a provenance header and declares each referenced server with a placeholder command for you to point at the real one:

# Imported by `mcptest import` (mcptest.dev/benchmark-import/v1).
# source: generic
# source_url: https://example.test/bench
# license: CC-BY-4.0
# imported: 12 task(s); skipped: 1
#   skipped row-7 (no prompt/goal field to drive the agent)
servers:
  my-server:
    command:
      - ./your-server
agents:
  - name: weather lookup
    model: claude-sonnet-4-5
    servers: [my-server]
    prompt: What is the weather in Reno?
    expect:
      - target: final_response
        matcher: { contains: Reno }

Input shapes

--kind generic (default) reads mcptest's documented task list: a top-level JSON array, or { "tasks": [ ... ] }. Each task uses the first present of:

prompt: prompt, goal, input, or query (required; a row without one is skipped),
name: name, description, or id (falls back to imported task N),
expectation: an expect object with one of regex, contains, or exact/equals, mapped to a final_response matcher.
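
The field precedence above can be sketched in a few lines of Python. This is an illustration of the documented rules, not mcptest's own code: the function names, the 1-based task numbering, the order in which expect keys are checked, and the choice to emit exact for both exact and equals are all assumptions of the sketch.

```python
from typing import Any, Optional

PROMPT_KEYS = ("prompt", "goal", "input", "query")
NAME_KEYS = ("name", "description", "id")
# Check order among expect keys is an assumption of this sketch.
EXPECT_KEYS = ("regex", "contains", "exact", "equals")


def first_present(row: dict, keys: tuple) -> Optional[Any]:
    """Return the value of the first key in `keys` that the row defines."""
    for key in keys:
        if row.get(key) not in (None, ""):
            return row[key]
    return None


def load_tasks(doc: Any) -> list:
    """Accept a top-level array or an object with a "tasks" array."""
    if isinstance(doc, list):
        return doc
    return doc.get("tasks", [])


def map_generic(doc: Any):
    """Map generic rows to suite-like dicts; return (imported, skipped)."""
    imported, skipped = [], []
    for n, row in enumerate(load_tasks(doc), start=1):  # 1-based: assumption
        prompt = first_present(row, PROMPT_KEYS)
        if prompt is None:
            skipped.append((n, "no prompt/goal field to drive the agent"))
            continue
        name = first_present(row, NAME_KEYS)
        task = {
            "name": str(name) if name is not None else f"imported task {n}",
            "prompt": prompt,
        }
        expect = row.get("expect") or {}
        for key in EXPECT_KEYS:
            if key in expect:
                # Treat "equals" as a synonym of "exact" (assumed key name).
                matcher_key = "exact" if key == "equals" else key
                task["expect"] = [
                    {"target": "final_response",
                     "matcher": {matcher_key: expect[key]}}
                ]
                break
        imported.append(task)
    return imported, skipped


corpus = {"tasks": [
    {"goal": "What is the weather in Reno?",
     "description": "weather lookup",
     "expect": {"contains": "Reno"}},
    {"id": "row-2"},                  # no prompt-like field: skipped
    {"query": "List open tickets"},   # no name-like field: fallback name
]}
imported, skipped = map_generic(corpus)
```

Here the second row is skipped with a recorded reason rather than dropped, and the third row receives the fallback name because it carries none of name, description, or id.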

--kind promptfoo reads a promptfoo config: the tests: array (or redteam.tests:). The prompt comes from vars (prompt/query/input/question); the first usable assert of type contains/icontains/regex/equals becomes the expectation.
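
A comparable sketch for the promptfoo shape, working on an already-parsed config so it needs no YAML library. The helper names are hypothetical, and two behaviours are assumptions rather than documented facts: preferring tests over redteam.tests when both exist, and returning None for a test with no usable prompt variable. The type and value fields are promptfoo's own assertion keys.

```python
from typing import Optional

VAR_KEYS = ("prompt", "query", "input", "question")
ASSERT_TYPES = ("contains", "icontains", "regex", "equals")


def promptfoo_tests(config: dict) -> list:
    """Return the tests array, falling back to redteam.tests (assumed order)."""
    tests = config.get("tests")
    if isinstance(tests, list):
        return tests
    return (config.get("redteam") or {}).get("tests") or []


def map_test(test: dict) -> Optional[dict]:
    """Map one promptfoo test to a prompt plus an optional expectation."""
    variables = test.get("vars") or {}
    prompt = next((variables[k] for k in VAR_KEYS if variables.get(k)), None)
    if prompt is None:
        return None  # assumption: nothing to drive the agent with
    task = {"prompt": prompt}
    for assertion in test.get("assert") or []:
        usable = assertion.get("type") in ASSERT_TYPES
        if usable and assertion.get("value") is not None:
            # Only the first usable assertion becomes the expectation.
            task["expect"] = {assertion["type"]: assertion["value"]}
            break
    return task


config = {"tests": [{
    "vars": {"question": "What is the weather in Reno?"},
    "assert": [
        {"type": "llm-rubric", "value": "is polite"},  # not a usable type
        {"type": "icontains", "value": "reno"},
        {"type": "regex", "value": "Reno|Nevada"},     # ignored: not first
    ],
}]}
mapped = [map_test(t) for t in promptfoo_tests(config)]
```

The llm-rubric assertion is passed over because it is not one of the four mappable types, and the trailing regex assertion is ignored because an earlier usable one already supplied the expectation.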

Provenance and attribution

The header is the import's audit trail. --source, --source-url, and --license record where the corpus came from and under what terms; transform_version pins the mapping (mcptest.dev/benchmark-import/v1) so a re-import with a newer mcptest is traceable. Skipped rows are listed with the reason they could not be mapped, so the import never silently drops a task.
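
Because the audit trail lives in plain comment lines, a reviewer or CI step can read it back without running anything. The following is a minimal sketch that assumes the header keeps exactly the layout shown in the quick-start example; read_provenance and the dictionary keys it returns are hypothetical names, and a different transform version might need different patterns.

```python
import re


def read_provenance(suite_text: str) -> dict:
    """Collect provenance fields from the leading comment block of a suite."""
    meta = {"skipped_rows": []}
    for line in suite_text.splitlines():
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        body = line[1:].strip()
        counts = re.match(r"imported: (\d+) task\(s\); skipped: (\d+)$", body)
        if body.startswith("Imported by"):
            version = re.search(r"\(([^)]+)\)", body)
            if version:
                meta["transform_version"] = version.group(1)
        elif counts:
            meta["imported"] = int(counts.group(1))
            meta["skipped"] = int(counts.group(2))
        elif body.startswith("skipped "):
            meta["skipped_rows"].append(body[len("skipped "):])
        elif ": " in body:
            key, value = body.split(": ", 1)
            meta[key] = value
    return meta


SUITE = """\
# Imported by `mcptest import` (mcptest.dev/benchmark-import/v1).
# source: generic
# source_url: https://example.test/bench
# license: CC-BY-4.0
# imported: 12 task(s); skipped: 1
#   skipped row-7 (no prompt/goal field to drive the agent)
servers:
  my-server:
    command:
      - ./your-server
"""
meta = read_provenance(SUITE)
```

A check like this could fail a review when the license field is missing or when the skipped count rises unexpectedly between re-imports.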

When you redistribute an imported corpus, keep its upstream license: record it with --license and add a NOTICE entry if the corpus is vendored into the repo.

Reports and evidence

An imported suite is a normal suite. Running it produces the usual report, and that report folds into an evidence pack and the OpenInference export like any other run, so an imported benchmark contributes to the same governance artifacts as your own tests. No separate path, no parallel model.

Keeping it out of fast CI

Imported corpora are often large and need a live model, so they do not belong in the default fast gate. Keep the generated suite in a separate directory and run it on its own schedule (a nightly job, or a --record pass that commits cassettes for offline replay), the same way the bundled agent examples are recorded once and replayed key-free.