Code-mode testing

Code mode is the access pattern where the model does not call tools directly. It writes code that calls a typed tool API, and the code runs in a sandbox whose only egress is the bound tools (Anthropic programmatic tool calling, Cloudflare Code Mode). A server can behave differently under code mode, and a harmful action taken inside generated code is easy to miss if you only watch the direct tool-calling path.

mcptest tests code mode without embedding a JavaScript engine. The run is captured as a conversation trace whose tool calls carry a code-execution marker, and the harness asserts on the calls and results the code actually made. An action performed in generated code is still a tool call on the wire, so it stays observable. The live code execution sits behind the trace, so a code-mode run replays from a cassette without a model or a sandbox.

Run this example. examples/codemode.yml records a code-mode agent run and replays it from a cassette, asserting the call the generated code made and the shape of the result it received.

ANTHROPIC_API_KEY=... mcptest run --record --config examples/codemode.yml
mcptest run --config examples/codemode.yml

What the harness checks

Given a captured run and the tool catalog, the harness reports:

Pure code mode. Every tool call came from the code-execution sandbox and at least one did. A direct call in the trace means the run was not actually code mode, and the harness says so rather than passing it.
Error and timeout surfacing. Tool results the server flagged as errors. This is the observable signal that an RPC failure or a timeout reached the generated code instead of vanishing inside the sandbox.
Schema drift. For each result whose tool declared an outputSchema, the harness runs the structured-output conformance check (SCHEMA-006). A failure is drift between the declared tool schema and what the generated binding received, which is exactly the breakage code mode is prone to.

Cassette-replayable example

The fixtures under crates/mcptest-agent/tests/fixtures/codemode/ are a recorded code-mode run and its catalog. The run makes two code-execution calls; the second returns a string where the schema requires an integer. Replaying it shows the harness reporting a pure code-mode run with one schema-drift finding, with no model in the loop. That is the pattern for adding code-mode coverage: capture a run once, then assert on it deterministically.