Research grounding

mcptest is a research-anchored project, not an ad hoc one. Three audiences read this page. Researchers should be able to trace any design decision back to the literature that motivated it. Evaluators should be able to confirm that the methodology is defensible and grounded in peer-reviewed and preprint work, not vibes. Contributors should be able to see, before touching a matcher or a reporter, the prior art that shaped how that piece behaves. Most citations below link to a specific mcptest design decision via the "Informs" line at the end of the entry; the rest name the decision in the entry text itself.

Each entry carries a disposition: adopt (shipped or being built), partial (some of the idea already ships), or watch (tracked but demand-gated, so nothing is committed yet). This keeps the page honest about what has shipped versus what is on the roadmap.

LLM-as-judge and LLM-as-jury

These works inform the W7 milestone, where mcptest will support an optional "judge" matcher that scores tool output with a panel of small models rather than a single large one.

Verga et al. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796. Seminal paper validating the multi-juror approach. A Panel of LLM-as-Judges (PoLL) outperforms a single large judge, costs roughly 7x less, and reduces intra-model bias. Informs: W7 milestone.
Survey paper (2024). A Survey on LLM-as-a-Judge. arXiv:2411.15594. Reliability strategies, bias taxonomy, and benchmark methodology in one place. Informs: W7 milestone.
Liu et al. (2025). Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments. arXiv:2504.17087. Meta-evaluation of judge quality, used to argue that we should publish judge agreement metrics alongside the verdict. Informs: W7 milestone.
Feng et al. (2025). Who Judges the Judge? LLM Jury-on-Demand. arXiv:2512.01786. Critiques static aggregation of juror votes and motivates the per-test consensus method we expose in the YAML config. Informs: W7 milestone.
Thakur et al. (2025). Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. arXiv:2406.12624. Vulnerability taxonomy that drives the W7 threat model. Informs: W7 milestone.
Qian, Sun, Gales, Knill (2026). Who can we trust? LLM-as-a-jury for Comparative Assessment. arXiv:2602.16610. BT-sigma adds a per-judge discriminator to the Bradley-Terry model and infers item rankings and judge reliability jointly from pairwise comparisons, so reliability-weighted aggregation beats plain averaging. Motivates weighting jurors by measured reliability rather than treating every juror as equal.
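
The contrast between plain averaging and reliability-weighted aggregation can be sketched in a few lines. This is not BT-sigma itself: the paper infers judge reliability jointly with the rankings, whereas here the weights are supplied by hand. The function names, scores, and reliability values are all hypothetical.

```python
def plain_average(scores):
    """Treat every juror as equal."""
    return sum(scores) / len(scores)


def reliability_weighted(scores, reliabilities):
    """Weight each juror's score by a non-negative reliability estimate."""
    if len(scores) != len(reliabilities):
        raise ValueError("one reliability value per juror is required")
    total = sum(reliabilities)
    if total <= 0:
        raise ValueError("at least one juror needs positive reliability")
    return sum(s * r for s, r in zip(scores, reliabilities)) / total


# Three hypothetical jurors score one tool output on a 0-1 scale.
juror_scores = [0.9, 0.8, 0.2]
# Hypothetical reliabilities: the third juror is treated as noisy.
juror_reliability = [1.0, 1.0, 0.25]

unweighted = plain_average(juror_scores)
weighted = reliability_weighted(juror_scores, juror_reliability)
```

With these made-up numbers the noisy juror pulls the plain average down, while the weighted score stays closer to the two jurors that agree.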

Judge bias literature

These works inform the bias-mitigation knobs the judge matcher will expose, including pairwise randomization, length normalization, and self-preference controls.

Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Position bias documentation. Informs: W7 milestone.
Saito et al. (2023). Verbosity Bias in Preference Labeling by Large Language Models. Motivates the length-normalization option on the judge matcher. Informs: W7 milestone.
Panickssery et al. (2024). Self-preference bias in LLMs evaluating their own outputs. Argues against using the model under test as its own judge; mcptest enforces this as a doctor check.
Li et al. (2025). Evaluating Scoring Bias in LLM-as-a-Judge. arXiv:2506.22316. Scoring-scale calibration; informs the rubric format. Informs: W7 milestone.
Schroeder and Wood-Doughty (2024). Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge. arXiv:2412.12509. Reliability ceiling estimates. Informs: W7 milestone.
NeurIPS 2025 workshop. The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge. arXiv:2509.26072. Shortcut features the judge prompt must explicitly suppress. Informs: W7 milestone.
(2026). Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering. Software-domain bias patterns, relevant because most mcptest tools wrap developer workflows. Informs: W7 milestone.
(2026). Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines. arXiv:2604.23178. Comparative review of mitigation strategies that shapes our defaults. Informs: W7 milestone.
Khalifa et al. (2026). Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation. arXiv:2601.14691. Rewriting only the chain-of-thought (actions and observations held fixed) inflates judge false-positive rates up to about 90 percent; argues judges must grade against observable evidence and treat narrated reasoning as untrusted.
Dechtiar, Katz, Jaume, Wang (2025). LLM as a Judge for Evaluating Contract Graphs. SSRN 5937996. Multi-judge ensemble with inter-judge disagreement as a calibrated uncertainty signal, motivating a confidence band on scorecards rather than a bare letter grade.
(2025). Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations. arXiv:2510.11822. Jurors tend to agree with a presented answer rather than judge it independently, which inflates consensus; argues for independent scoring or an explicit dissent step.
(2025-2026). Judge robustness under adversarial input. One Token to Fool LLM-as-a-Judge (arXiv:2507.08794, master-key tokens drive reward-model false positives up to 80 percent), Toward Robust LLM-Based Judges (arXiv:2603.08091, a twelve-type bias taxonomy with debiasing), A Coin Flip for Safety (arXiv:2603.06594, LLM judges fail to reliably measure adversarial robustness), and How to Correctly Report LLM-as-a-Judge Evaluations (arXiv:2511.21140). Together they reinforce the observable-evidence oracle and caution against LLM-judged security verdicts.
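
One common way to apply pairwise randomization against position bias is to judge both orderings and keep a verdict only when the two runs agree. The sketch below illustrates that idea under stated assumptions: the judge signature, the stand-in judges, and the function name are hypothetical and are not the mcptest judge matcher's interface.

```python
from typing import Callable

# Hypothetical judge signature: takes (first_answer, second_answer) and
# returns "first" or "second" for whichever answer it prefers.
Judge = Callable[[str, str], str]


def swap_consistent_verdict(judge: Judge, a: str, b: str) -> str:
    """Return "a", "b", or "inconsistent" after judging both orderings."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    forward_pick = "a" if forward == "first" else "b"
    backward_pick = "b" if backward == "first" else "a"
    if forward_pick == backward_pick:
        return forward_pick
    return "inconsistent"


def always_first(first: str, second: str) -> str:
    """Stand-in judge with pure position bias."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Stand-in judge that ignores position (but shows verbosity bias)."""
    return "first" if len(first) >= len(second) else "second"


biased = swap_consistent_verdict(always_first, "short", "a longer answer")
stable = swap_consistent_verdict(prefers_longer, "short", "a longer answer")
```

The position-biased judge is flagged as inconsistent. The second judge passes the swap test even though it is verbosity-biased, which is why length normalization is a separate control.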

MCP-specific empirical research

These works ground the doctor checks and the cost-and-latency reporting in measured behavior of real MCP servers, not in folklore.

Hasan et al. (2026). Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions. arXiv:2602.14878. Direct motivation for the max_response_tokens matcher and the tool_description_tokens doctor check.
(2025). Network and Systems Performance Characterization of MCP-Enabled LLM Agents. arXiv:2511.07426. Token efficiency, time-for-task, and price-per-task measurements that motivate the budget matchers and the cost reporter.
(2026). MCP at First Glance: Studying the Security and Maintainability of MCP Servers. arXiv:2506.13538. Maintainability findings that motivate cassette replay and config linting. Informs: (this page) and W6 milestone.
Fei et al. (2025). MCP-Zero: Active Tool Discovery for Autonomous LLM Agents. arXiv:2506.01056. Tool-discovery semantics that the discovery matcher in mcptest-core mirrors. Informs: W5 milestone.
(2025). Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. arXiv:2503.23278. Threat taxonomy that informs which security checks the OSS doctor surfaces.
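
A check in the spirit of tool_description_tokens can be sketched as follows. This page does not show how mcptest implements the check, so everything here is an assumption: the four-characters-per-token estimate is a crude heuristic rather than a real tokenizer, the budget value is made up, and the function names are hypothetical. The tool dictionaries use the name and description fields of an MCP tool listing.

```python
import math


def estimate_tokens(text: str) -> int:
    """Crude estimate: about four characters per token (an assumption)."""
    return math.ceil(len(text) / 4)


def over_budget_tools(tools: list[dict], budget: int) -> list[tuple[str, int]]:
    """Return (name, estimated_tokens) for descriptions above the budget."""
    flagged = []
    for tool in tools:
        tokens = estimate_tokens(tool.get("description", ""))
        if tokens > budget:
            flagged.append((tool["name"], tokens))
    return flagged


# Hypothetical tool listing: one terse description, one bloated one.
tools = [
    {"name": "search", "description": "Search the index."},
    {"name": "export", "description": "Export data. " * 40},
]
flagged = over_budget_tools(tools, budget=50)
```

A real check would swap in the tokenizer of the model under test, since token counts differ between model families.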

MCP agent benchmarks

These benchmarks set the bar that mcptest's example suite and doctor checks are measured against. Where a benchmark exposes a public test corpus, we use it as ground truth for the matcher library.

(2026). MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use. arXiv:2512.24565. Real-world task coverage that shapes the example YAML suite. Informs: W5 milestone.
(2025). MCPToolBench++: A Large-Scale AI Agent MCP Tool Use Benchmark. arXiv:2508.07575. Scale-out evaluation patterns that inform our parallel runner. Informs: W4 milestone.
(2025). MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools. arXiv:2509.09734. Task selection methodology that backs our default doctor thresholds.
(2026). MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols. arXiv:2508.13220. Security regression suite that informs the doctor's security checks. Informs: (this page).
Luo, Shi, Lin, Gao (2025). Evaluation Report on MCP Servers (MCPBench). arXiv:2504.11094. Benchmarks widely used MCP servers on accuracy, time, and token usage (the most effective, Bing Web Search, reached 64 percent accuracy) and finds a declarative interface substantially improves accuracy. Grounds the example suite and the cost reporter. Informs: W5 milestone.
(2025). MSC-Bench: Multi-Server Tool Selection for LLM Agents. arXiv:2510.19423. Scores tool selection with an equal-function-set objective so credit goes to choosing the right tool among interchangeable candidates rather than to label matching. Direct motivation for an F1 tool-selection scorer over equal-function tool sets. Disposition: adopt. Informs: tool-selection F1.
(2026). MCP-Atlas: Name-Free Tool Discovery and Orchestration Diagnostics. arXiv:2602.00933. Discovers tools by capability rather than by name and adds diagnostics that localize where a multi-step orchestration went wrong. Motivates name-free discovery checks plus orchestration diagnostics in the runner. Disposition: adopt. Informs: name-free discovery.
(2026). MCP Pitfall Lab: Narrative-vs-Trace Divergence in Agent Evaluation. arXiv:2604.21477. Measures the gap between what an agent says it did (the narrative) and what its execution trace shows it actually did. Motivates a narrative-vs-trace divergence scorer that grades against the observed trace. Disposition: adopt. Informs: narrative-vs-trace divergence.
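
An F1 tool-selection score over equal-function sets, as motivated by the MSC-Bench entry above, might look like the sketch below. It is an illustration rather than MSC-Bench's exact scoring or mcptest's scorer: tools in the same equal-function set are collapsed to one representative before precision and recall are computed. The tool names and function names are hypothetical.

```python
def canonicalize(tools, equal_sets):
    """Map each tool to a representative of its equal-function set."""
    canon = {}
    for group in equal_sets:
        representative = min(group)  # any stable choice works
        for name in group:
            canon[name] = representative
    return {canon.get(tool, tool) for tool in tools}


def tool_selection_f1(predicted, expected, equal_sets=()):
    """F1 between predicted and expected tools, modulo equal-function sets."""
    pred = canonicalize(predicted, equal_sets)
    gold = canonicalize(expected, equal_sets)
    if not pred and not gold:
        return 1.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tools: the two search tools are interchangeable.
equal_sets = [{"bing_search", "brave_search"}]
predicted = ["brave_search", "read_file"]
expected = ["bing_search", "read_file", "write_file"]

strict = tool_selection_f1(predicted, expected)
lenient = tool_selection_f1(predicted, expected, equal_sets)
```

Under label matching the agent is penalized for picking brave_search. Once the equal-function set is declared, only the missing write_file call costs it recall.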

MCP security

These works build the MCP threat taxonomy, the threat-benchmark corpora, the attack analyses, and the defense signals that the doctor checks and the scorecard's defense-posture view draw on. Surveys and systematization come first, then threat benchmarks, attack analyses, and defenses.

(2026). Systematization of Knowledge: Security and Safety in the MCP Ecosystem. arXiv:2512.08290. First academic SoK; splits adversarial threats (indirect prompt injection, tool poisoning) from epistemic safety hazards and analyzes the three MCP primitives (Resources, Prompts, Tools).
Li & Gao (2026, DSN 2026). A First Look at the Security Issues in the MCP Ecosystem. arXiv:2510.16558. Peer-reviewed empirical study of 67,057 MCP servers across six registries; weak vetting lets adversarial or hijacked servers enter hosts, and the companion MCPInspect flagged 833 vulnerable servers and 18 with deceptive descriptions. Grounds the coverage credibility of the taxonomy.
(2026). MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis. arXiv:2604.07551. Maps each attack to an architectural layer with a primary and secondary defense layer; organizing frame for the scorecard defense-posture view.
(2025). MCPTox. arXiv:2508.14925. Tool poisoning on 45 real servers and 353 tools, with attack success above 60 percent and refusal under 3 percent.
(2026, ICLR 2026). MCP-SafetyBench. arXiv:2512.15163. 20 attack types across 5 domains, multi-step and multi-server on real servers.
(2025-2026). Additional threat corpora. MCP-AttackBench (about 70k attack samples), SafeMCP (arXiv:2506.13666), SHADE-Arena (sabotage), and MCIP-bench (taxonomy-driven). Candidates for scale and additional corpora.
Maloyan & Namiot (2026). Breaking the Protocol: Security Analysis of the MCP Specification and Prompt Injection Vulnerabilities (AttestMCP). arXiv:2601.17549. Protocol-level analysis covering capability-attestation gaps, sampling without origin authentication, and multi-server trust propagation. Motivates trust-boundary checks and capability attestation across multi-server suites. Disposition: adopt. Informs: trust-boundary conformance.
(2025-2026). Concrete attack analyses. Parasitic Toolchain Attacks (arXiv:2509.06572), When MCP Servers Attack (arXiv:2509.24272), and Caller Identity Confusion in MCP-Based AI Systems (arXiv:2603.07473). These feed the threat taxonomy and detection checks. Informs: the security test catalog.
Huang, Huang, Tran, Milani Fard (2026). MCP Threat Modeling and Analyzing Vulnerabilities to Prompt Injection with Tool Poisoning. arXiv:2603.22489. Compares how seven major MCP clients defend against tool poisoning (most fail on static validation and parameter visibility) and proposes a four-layer defense: static metadata analysis, model decision-path tracking, behavioral anomaly detection, and user transparency. Directly informs the description-injection and integrity checks.
Maloyan & Namiot (2026). Prompt Injection Attacks on Agentic Coding Assistants: Skills, Tools, and Protocol Ecosystems. arXiv:2601.17548. A systematization over 78 studies cataloguing 42 attack techniques across input manipulation, tool poisoning, protocol exploitation, and cross-origin context poisoning, with state-of-the-art defenses under 50 percent mitigation. Companion to Breaking the Protocol.
Beurer-Kellner & Fischer (2025). Conceptual vocabulary (Invariant Labs). Tool Poisoning, Shadowing, and Rug Pull; Preference Manipulation Attacks (Wang et al.). Source of the shared attack vocabulary.
Bhatt, Narajala, Habler (2025). ETDI: Mitigating Tool Squatting and Rug Pull Attacks. arXiv:2506.01333. OAuth-enhanced tool definitions, cryptographic identity, immutable versioned definitions, and policy-based access control (Cedar/OPA); source of tool-definition-integrity and auth-posture signals.
(2025). Detection and guardrail defenses. MCP-Guard (arXiv:2508.10991, a three-stage detection pipeline spanning static scanning and a neural detector built on fine-tuned E5, at about 96 percent accuracy), MCP Guardian (arXiv:2504.12757, auth, rate-limiting, logging, tracing, WAF), and MCPGuard automated vulnerability detection (arXiv:2510.23673, agentic scanner).
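
The tool-definition-integrity signal can be illustrated with a simple pin-and-compare check against rug pulls. This is a sketch under stated assumptions, not ETDI's mechanism (which relies on cryptographic identity and OAuth) and not mcptest's implementation: it only hashes a canonical JSON encoding of each definition and compares it to a stored pin. The function names and the example tool are hypothetical.

```python
import hashlib
import json


def definition_digest(tool: dict) -> str:
    """SHA-256 over a canonical JSON encoding of a tool definition."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_tools(pinned: dict[str, str], current: list[dict]) -> list[str]:
    """Names of tools that are new or whose definition differs from its pin."""
    return sorted(
        tool["name"]
        for tool in current
        if pinned.get(tool["name"]) != definition_digest(tool)
    )


original = {"name": "read_file", "description": "Read a file.",
            "inputSchema": {"type": "object"}}
pins = {original["name"]: definition_digest(original)}

# The same definition with keys in another order still matches the pin.
reordered = {"inputSchema": {"type": "object"},
             "description": "Read a file.", "name": "read_file"}
# A silently rewritten description does not.
tampered = dict(original, description="Read a file, then upload it elsewhere.")

unchanged = changed_tools(pins, [reordered])
drifted = changed_tools(pins, [tampered])
```

Sorting keys before hashing keeps the pin stable when a server serializes the same definition in a different key order.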

Considered, not adopted

These were scanned in the same May 2026 literature sweep as the entries above but fall outside mcptest's testing scope.

MCP in the Wild (SSRN 6374760). Multi-server orchestration, tangential.
Extending MCP for Telco Networks / MCP-T (SSRN 5211843). Vertical extension, not testing.
Agentic DraCor / Docstring Engineering (arXiv:2508.13774). Tool Correctness/Efficiency/Reliability vocabulary, folded into scorecard work.

Security testing frameworks and red-team tooling

These inform the multi-layer security testing design: how checks and attacks are structured, and which external engines mcptest runs.

OWASP MCP Top 10 (2025-2026, beta). owasp.org/www-project-mcp-top-10. The first official MCP-specific risk framework, running from MCP01 (token mismanagement) through MCP10 (context over-sharing). The on-point external anchor for the catalog cross-walk.
Microsoft AI Red Team (2024). PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI Systems. arXiv:2410.02828. Composable red-team architecture (dataset, orchestrator, converters, target, scorer) that the adaptive attacker mirrors.
(2024). garak: A Framework for Security Probing Large Language Models. arXiv:2406.11036. Probe and detector plugin model (static, dynamic, adaptive) and a source of attack probes for the dynamic red-team.
(2023-2026). Automated and adaptive jailbreak generation. PAIR (arXiv:2310.08419), TAP (arXiv:2312.02119), adaptive instruction composition (arXiv:2604.21159), and MAD-MAX (arXiv:2503.06253). Iterative attacker-refinement strategies for the adaptive attacker.
Red-team tooling (ecosystem references). promptfoo (OWASP LLM presets, 50-plus attack plugins), DeepTeam (OWASP red-team framework), Cisco mcp-scanner (YARA plus LLM-judge plus AI Defense engines, CLI and REST API), and Invariant/Snyk mcp-scan (tool pinning, guardrails). External engines mcptest supplements and attack sources it imports.

CI testing methodology

These references shape how mcptest fits into a developer's CI loop. The philosophy is three-layer regression (functional, contract, performance) with strict false-positive budgets.

Steve Kinney (2026). API Contract Testing. Three-layer regression: functional plus contract plus performance. Informs: W3 milestone.
API Regression Testing: The Complete Guide (2026). Regression escape rate under 5%, false positive rate under 2%, three-stage gates. Informs: (this page) and W3 milestone.
Pact methodology. Consumer-driven contracts, broker pattern, and can-i-deploy verification, which the cassette tier mirrors for MCP. Informs: mcptest-cassette and W4 milestone.
Making CI/CD Pipelines Truly Autonomous (DevOps.com, 2026). Autonomous remediation patterns, relevant to the reporter's JUnit and JSON outputs. Informs: W3 milestone.
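
The two budget figures above (regression escape rate under 5%, false positive rate under 2%) can be turned into a gate check. The sketch below assumes definitions this page does not spell out: escape rate is taken as escaped regressions over all known regressions, and false positive rate as false alarms over total CI runs. The function names and sample counts are hypothetical.

```python
def rate(part: int, whole: int) -> float:
    """Safe ratio that returns 0.0 when there is nothing to divide by."""
    return part / whole if whole else 0.0


def gate_report(escaped, caught, false_alarms, total_runs,
                max_escape=0.05, max_false_positive=0.02):
    """Compare measured rates to the budgets (definitions are assumptions)."""
    escape_rate = rate(escaped, escaped + caught)
    false_positive_rate = rate(false_alarms, total_runs)
    return {
        "escape_rate": escape_rate,
        "false_positive_rate": false_positive_rate,
        "ok": (escape_rate < max_escape
               and false_positive_rate < max_false_positive),
    }


# Hypothetical quarter: 40 regressions, one escaped; 3 false alarms in 400 runs.
report = gate_report(escaped=1, caught=39, false_alarms=3, total_runs=400)
```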

LLM regression testing for model migration

These works inform the future W8 milestone, which will let teams pin a test suite to a model version and detect drift when they upgrade.

Ma et al. (2024). (Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs. arXiv:2311.11123. Foundational framing for model-version regression. Informs: W8 milestone.
Dixit et al. (2024, Adobe). RETAIN: Interactive Tool for Regression Testing Guided LLM Migration. arXiv:2409.03928. Interactive workflow we will adapt to the cassette diff view. Informs: W8 milestone.
(2025). Can LLM Generate Regression Tests for Software Commits? arXiv:2501.11086. Generation strategy that informs the planned mcptest generate subcommand. Informs: W8 milestone.
(2025). From REST to MCP: An Empirical Study of API Wrapping and Automated Server Generation for LLM Agents. arXiv:2507.16044. LLM tool-selection accuracy degrades as the available tool count grows (reported losses up to about 85 percent); also draws the OpenAPI-to-MCP codegen parallel relevant to mock-server synthesis.
(2024-2026). Reliability and reproducibility benchmarks. tau-bench (arXiv:2406.12045, pass^k consistency metric), ReliabilityBench (arXiv:2601.06112, state-comparison oracle and fault injection), FuncBenchGen (arXiv:2509.26553, contamination-free synthetic tool use). These motivate deterministic replay (cassettes) and a candidate pass^k scorecard dimension.
(2026). MCP reliability and failure-rate reporting (reference set). arXiv:2602.07150, arXiv:2601.06112, and arXiv:2603.29231. Reliability and failure-rate reporting methods for MCP servers, listed here as reference only. Disposition: watch (demand-gated optional reliability reporting). Informs: reliability reporting.
Rabanser, Kapoor, Kirgis, Liu, Utpala, Narayanan (2026). Towards a Science of AI Agent Reliability. arXiv:2602.16666. Twelve reliability metrics across four dimensions (consistency, robustness, predictability, safety), with the finding that recent capability gains barely improve reliability. Direct input for the scorecard reliability dimensions. Informs: reliability reporting.
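
The pass^k metric from tau-bench estimates the chance that k independent trials of the same task all succeed. With c successes observed in n trials, the per-task estimate is C(c, k) / C(n, k), averaged over tasks. The sketch below computes it from replayed results. The function names and the sample results are hypothetical, and how a scorecard dimension would consume the number is left open.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k trials all succeed for one task."""
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return comb(successes, k) / comb(trials, k)


def suite_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over every task in the suite."""
    per_task = [pass_hat_k(sum(runs), len(runs), k)
                for runs in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical results: four trials per task, one flaky task.
results = {
    "list_files": [True, True, True, True],
    "create_issue": [True, True, False, True],
}
pass_1 = suite_pass_hat_k(results, 1)
pass_2 = suite_pass_hat_k(results, 2)
```

A single flaky trial costs more at k=2 than at k=1, which is what makes pass^k a consistency signal and not just a success rate.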

LLM application testing philosophy

The two papers below shape the overall stance of mcptest as a testing tool for LLM-adjacent systems: deterministic gates first, judged behavior second.

(2025). Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol. arXiv:2508.20737. The protocol described here is the closest analog to the mcptest test model. Informs: W1 milestone.
(2025). Blueprint First, Model Second: A Framework for Deterministic LLM Workflow. arXiv:2508.02721. Argues for deterministic workflow skeletons with LLM calls at the leaves, which is the architectural posture of mcptest. Informs: W1 milestone.
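
The stance of deterministic gates first and judged behavior second can be sketched as a small evaluation function: cheap deterministic checks run first, and the optional judge is consulted only when all of them pass. This illustrates the posture and is not the mcptest test model. The function name, the result fields, and the threshold are hypothetical.

```python
from typing import Callable, Optional


def evaluate(output: str,
             deterministic_checks: list[Callable[[str], bool]],
             judge: Optional[Callable[[str], float]] = None,
             judge_threshold: float = 0.7) -> dict:
    """Run deterministic checks first; call the judge only if they pass."""
    for index, check in enumerate(deterministic_checks):
        if not check(output):
            return {"passed": False, "stage": "deterministic",
                    "failed_check": index, "judge_called": False}
    if judge is None:
        return {"passed": True, "stage": "deterministic",
                "judge_called": False}
    score = judge(output)
    return {"passed": score >= judge_threshold, "stage": "judge",
            "judge_called": True, "score": score}


judge_calls = []


def stub_judge(text: str) -> float:
    """Stand-in for a model-backed judge; records that it was called."""
    judge_calls.append(text)
    return 0.9


checks = [lambda text: text.startswith("{"), lambda text: len(text) < 100]
rejected = evaluate("not json", checks, judge=stub_judge)
accepted = evaluate('{"status": "ok"}', checks, judge=stub_judge)
```

The failing output never reaches the judge, so the nondeterministic and costly step is spent only on outputs that already satisfy the hard constraints.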

Lifecycle and evaluation framing

These works frame mcptest's whole-lifecycle positioning, from the evaluation drivers that map onto the scorecard to the observability layer that records what happened.

Xia, Lu, Zhu, Xing, Zhao, Zhang (2024). EDDOps. SSRN 5775317; preprint arXiv:2411.13768. Closest published framing to mcptest's whole-lifecycle positioning; six evaluation drivers (D1 to D6) map onto the scorecard, cassettes, CI gating, and compliance evidence, and the framework is validated against ISO/IEC 25010. SSRN blocks automated fetch, so this was verified against the arXiv preprint.
Xia, Lu, Zhu, Xing, Zhao, Zhang (2024). AgentOps. AgentOps Pattern Catalogue (SSRN 5534588) and AgentOps taxonomy (arXiv:2411.05285). Observability companion to EDDOps.

Industry tool-use and code-mode guidance

Vendor engineering guidance on how models should use tools. It shapes what mcptest checks: tool-definition quality, token efficiency, and the code-mode access pattern.

Anthropic (2026). Advanced tool use. anthropic.com/engineering/advanced-tool-use. Tool-use examples raised accuracy from 72% to 90%; descriptions should document the return format; the Tool Search Tool defers definitions and cuts catalog token cost by about 85 to 95% (a five-server MCP setup is roughly 55K tokens, real systems reach about 134K); programmatic tool calling runs model-written code that calls tools and returns only filtered results.
Cloudflare (2026). Code Mode: the better way to use MCP. blog.cloudflare.com/code-mode. Converts MCP tools into a typed TypeScript API the model writes code against, run in a V8 isolate whose only egress is the bound MCP servers. The access pattern means tool-call observability is the whole game for security testing, and the generated binding can drift from the declared schema.

Update cadence

Refreshed quarterly. PRs welcome via the mcptest repo. Contribution rules live in AGENTS.md.