Code-mode testing
Code mode is the access pattern where the model does not call tools directly. It writes code that calls a typed tool API, and the code runs in a sandbox whose only egress is the bound tools (Anthropic programmatic tool calling, Cloudflare Code Mode). A server can behave differently under code mode, and a harmful action taken inside generated code is easy to miss if you only watch the direct tool-calling path.
mcptest tests code mode without embedding a JavaScript engine. The run is captured as a conversation trace whose tool calls carry a code-execution marker, and the harness asserts on the calls and results the code actually made. An action performed in generated code is still a tool call on the wire, so it stays observable. Live code execution is needed only to produce the trace, so a code-mode run replays from a cassette without a model or a sandbox.
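To make the idea of a marked trace concrete, here is a minimal sketch in Python. It is not mcptest's trace format: the field names (`caller`, `is_error`, `structured`), the marker values, and the tool names are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical marker values; the real trace format may use different ones.
CODE_EXECUTION = "code_execution"
DIRECT = "direct"


@dataclass
class ToolCall:
    tool: str                          # tool name as it appeared on the wire
    arguments: dict
    caller: str                        # CODE_EXECUTION or DIRECT
    is_error: bool = False             # server flagged the result as an error
    structured: Optional[Any] = None   # structured result, if the tool returned one


def calls_from_code(trace: list[ToolCall]) -> list[ToolCall]:
    """Return only the calls that generated code made inside the sandbox."""
    return [call for call in trace if call.caller == CODE_EXECUTION]


# Both calls were issued by generated code, yet each is an ordinary
# tool call in the trace, so a harness can assert on it.
trace = [
    ToolCall("get_count", {"bucket": "a"}, CODE_EXECUTION, structured={"count": 3}),
    ToolCall("delete_bucket", {"bucket": "a"}, CODE_EXECUTION),
]

observed = [call.tool for call in calls_from_code(trace)]
```

Because the marker travels with each call, an assertion such as "the generated code called `delete_bucket`" needs nothing more than the recorded trace.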
Run this example. examples/codemode.yml records a code-mode agent run and replays it from a cassette, asserting the call the generated code made and the shape of the result it received.
ANTHROPIC_API_KEY=... mcptest run --record --config examples/codemode.yml
mcptest run --config examples/codemode.yml
What the harness checks
Given a captured run and the tool catalog, the harness reports:
- Pure code mode. Every tool call came from the code-execution sandbox, and the run contains at least one such call. A direct call in the trace means the run was not actually code mode, and the harness says so rather than passing it.
- Error and timeout surfacing. The harness lists the tool results the server flagged as errors. These are the observable signal that an RPC failure or a timeout reached the generated code instead of vanishing inside the sandbox.
- Schema drift. For each result whose tool declared an outputSchema, the harness runs the structured-output conformance check (SCHEMA-006). A failure is drift between the declared tool schema and what the generated binding received, which is exactly the breakage code mode is prone to.
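The first two checks can be sketched in a few lines of Python. This is an illustration of the logic as described above, not the harness's implementation; the dictionary keys (`caller`, `is_error`), the marker values, and the tool name are hypothetical.

```python
# Field names and marker values are assumptions made for this sketch.
CODE_EXECUTION = "code_execution"


def is_pure_code_mode(calls):
    """True when there is at least one call and every call came from the sandbox."""
    return len(calls) > 0 and all(c["caller"] == CODE_EXECUTION for c in calls)


def surfaced_errors(calls):
    """Tool results the server flagged as errors (RPC failures, timeouts)."""
    return [c for c in calls if c.get("is_error", False)]


pure_run = [
    {"tool": "get_count", "caller": CODE_EXECUTION},
    {"tool": "get_count", "caller": CODE_EXECUTION, "is_error": True},
]
# Same run plus one call that bypassed the sandbox.
mixed_run = pure_run + [{"tool": "get_count", "caller": "direct"}]

print(is_pure_code_mode(pure_run))     # True
print(is_pure_code_mode(mixed_run))    # False: one direct call
print(is_pure_code_mode([]))           # False: no code-execution call at all
print(len(surfaced_errors(pure_run)))  # 1
```

Note that an empty trace is deliberately not "pure": requiring at least one code-execution call keeps a run that never reached the sandbox from passing vacuously.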
Cassette-replayable example
The fixtures under crates/mcptest-agent/tests/fixtures/codemode/ are a recorded code-mode run and its catalog. The run makes two code-execution calls; the second returns a string where the schema requires an integer. Replaying it shows the harness reporting a pure code-mode run with one schema-drift finding, with no model in the loop. That is the pattern for adding code-mode coverage: capture a run once, then assert on it deterministically.
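The scenario described above can be mimicked with a small Python sketch. The JSON below is a hypothetical stand-in, not the content or format of the real fixtures, and the check is a deliberately reduced substitute for a full JSON Schema validator: it looks only at top-level properties declared as `integer`.

```python
import json

# Hypothetical catalog and recorded run; tool and field names are assumptions.
CATALOG = json.loads("""
{"get_count": {"outputSchema": {
    "type": "object",
    "properties": {"count": {"type": "integer"}},
    "required": ["count"]}}}
""")
RUN = json.loads("""
[{"tool": "get_count", "caller": "code_execution", "structured": {"count": 3}},
 {"tool": "get_count", "caller": "code_execution", "structured": {"count": "3"}}]
""")


def is_integer(value):
    # JSON Schema "integer" excludes booleans, but Python's bool subclasses int.
    return isinstance(value, int) and not isinstance(value, bool)


def drift_findings(run, catalog):
    """Report (call index, tool, property) where an integer property is not an integer."""
    findings = []
    for index, call in enumerate(run):
        schema = catalog.get(call["tool"], {}).get("outputSchema")
        if schema is None:
            continue  # the tool declared no output schema, nothing to check
        structured = call.get("structured") or {}
        for name, prop in schema.get("properties", {}).items():
            if (name in structured and prop.get("type") == "integer"
                    and not is_integer(structured[name])):
                findings.append((index, call["tool"], name))
    return findings


findings = drift_findings(RUN, CATALOG)
print(findings)  # [(1, 'get_count', 'count')]
```

Everything here runs from recorded data alone, which is the point of the pattern: once the run is captured, the assertion is deterministic.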