The observable-evidence oracle

A security or agent verdict in mcptest is decided by observable evidence, never by a model's narration of what it did. This page explains the rule and why it exists.

The threat

When an agent runs against a server, it produces two very different kinds of output. There are the observable artifacts: the tool calls it actually made, the arguments it sent, the results the server actually returned, the status codes, the payload shapes. And there is the narration: the model's own prose in its final response, where it says what it believes happened.

The narration can be confidently wrong. A model can write "done, the file was saved" while having made no tool call at all, or "authorization granted" while the server returned an error. A grader that reads the narration can be talked into a passing verdict the run did not earn. The research filed measures exactly this failure: rewriting only the chain-of-thought of a trajectory, holding the actions and observations fixed, inflates judge false-positive rates by up to about 90 percent (Gaming the Judge, arXiv:2601.14691). Trivial token tricks push reward-model false positives to 80 percent (One Token to Fool, arXiv:2507.08794), and LLM judges are not reliable at measuring adversarial robustness at all (A Coin Flip for Safety, arXiv:2603.06594).

The rule

Grade against the observed tool result, not the narrated outcome. Concretely:

A success oracle targets an observable artifact: tool_calls[i].name, tool_calls[i].args, tool_results[i], a returned payload's schema, or a status code.
Model-generated chain-of-thought or final-response narration, if it is passed to a grader at all, is untrusted. It never decides a verdict.
For security verdicts the oracle is fully deterministic: no model is in the pass-or-fail path. An LLM may add an advisory signal (see the security framework), but advisory findings never flip a deterministic grade.

mcptest's matchers already make this the easy path. A matcher targets a value selected from the trace by path, so a suite author writes the assertion against the observable artifact directly. The regression guard in crates/mcptest-core/tests/scorer_cot_regression.rs pins the rule: each adversarial fixture narrates a confident success while its observable artifact shows a failure, and the deterministic matcher returns FAIL every time. A narration-reading grader would have passed all of them. That gap is the regression the guard protects against.

Code-mode and parallel calls

The oracle observes every tool call in a run, not only the ones the model typed directly into an assistant turn. Two cases matter here.

First, parallel tool use. When a model emits several tool calls in one turn, the driver records all of them, in order, with one result per call. A suite can assert on each, so a call hidden behind a sibling call is still graded.

Second, code-mode calls. Anthropic programmatic tool calling and Cloudflare Code Mode run model-written code in a sandbox, and that code calls tools. The provider tags those calls with a code-execution caller. The trace keeps that tag in an optional caller field on each tool call, so an action taken inside generated code shows up on the same observable paths as a direct call: tool_calls[i].name and tool_calls[i].args resolve to it either way. A direct call leaves the field off, which keeps the envelope shape stable for suites written before this change. The result is that an injected action performed in code cannot hide from the oracle.

References

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation. arXiv:2601.14691.
One Token to Fool LLM-as-a-Judge. arXiv:2507.08794.
A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness. arXiv:2603.06594.