Why We Killed MCP, And What It Freed Us To Build
I. The Pattern That Gave It Away
We built a Model Context Protocol server. Sixteen tools, three thousand lines of Rust, hand-tuned tool descriptions, lazy schema warming, a four-tier launcher fallback chain, holdco-aware crate metadata. It was good code. It still is. It is also the artifact that taught us we had built the wrong thing.
The teaching moment was a comment in crates/stax-mcp/src/tools/mod.rs. The comment annotates a lazy-load mechanism we call internally the Pi pattern. On every fresh session, our MCP server's tools/list response returns lean stub schemas: name, description, and {"type":"object"} for the input shape. Only after a tool has actually been invoked does the server start returning the full JSON Schema for that tool on subsequent tools/list calls. The comment in the code is unambiguous about why:
"A fat schema injection at session start poisons the prompt cache for every later turn. Once a tool is invoked we know the LLM is actually going to use it, so subsequent tools/list calls return the full description + schema for that tool. This is the lazy-load substitute for changing the wire format."
Read the comment again. The wire format injects schemas the model does not need into a prompt cache that has to hold them anyway. The cost is paid every turn, for every tool, whether or not the model invokes any of them. Our defense is to ship lies on the first contact and tell the truth only after the lie has cost the model nothing. The defense works. The defense is also Exhibit A that the abstraction is fighting itself.
When the workaround you wrote to make the abstraction survivable is more interesting than the abstraction it defends, the abstraction is wrong for your problem. We had built a workaround. The workaround was good. The architecture it defended was not ours to defend.
II. What MCP Actually Is
MCP, the Model Context Protocol, is a clean JSON-RPC contract between an LLM-facing client (Claude.ai, ChatGPT desktop, Cursor, Cline) and a tool-hosting server. It does three things well. It standardizes the contract so a hosted-agent vendor can ship a single integration surface and let third parties plug in. It provides structured tool schemas so the vendor's autocomplete and validation can work the same way across hundreds of integrations. And it gives the vendor a security boundary they can audit: the server is a process they did not write, talking to their client over a wire format they did write, and they can quarantine it accordingly.
These are real benefits. For hosted-agent vendors, they are decisive. Anthropic, Cursor, OpenAI, and the next ten companies in this space need a clean integration boundary into a UI they control. MCP is the right shape of contract for that audience.
We are not that audience. We are an operator who owns the stack. We are not selling tool integrations to a third-party vendor. We are not asking another company's UI team to validate our schemas. We are running an agent against tools we wrote, deployed on a machine we control, executed against state we own. The hosted-vendor benefits do not accrue to us. The protocol overhead does.
III. The Spread
The agent has shell access. Bash is a native tool in every modern agent harness: Claude Code, Codex, Gemini CLI, Cursor's agent mode, Cline. Reading stax brief --help is a single Bash invocation that returns a structured help text the agent already knows how to parse, costs the agent zero tokens at session start, and updates the moment the underlying binary updates. The same operation through MCP requires a JSON-RPC server process to be alive, an initialize handshake on session start, a tools/list response that fills the prompt cache with every tool's schema whether or not any of them will be used this session, and a per-call tools/call JSON-RPC frame around what is structurally a function call.
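To make the comparison concrete, here is a minimal sketch of the two shapes the same request takes. The `tools/call` method and the `name`/`arguments` params follow the MCP convention. The tool name `brief_render` and both helper functions are hypothetical, written for illustration rather than taken from stax-mcp.

```rust
/// Builds an MCP `tools/call` request frame (JSON-RPC 2.0).
/// `arguments_json` must already be a serialized JSON object. `tool` is
/// assumed to contain no characters that need JSON escaping.
pub fn tools_call_frame(id: u64, tool: &str, arguments_json: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"{tool}","arguments":{arguments_json}}}}}"#
    )
}

/// The same request expressed as a CLI argument vector.
pub fn cli_argv(date: &str) -> Vec<String> {
    ["stax", "brief", "render", "--date", date, "--format", "json"]
        .iter()
        .map(|s| s.to_string())
        .collect()
}
```

The frame is an envelope around a name and an argument map. The argument vector carries the same name and the same arguments with no envelope and no server process on the other end.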
The MCP layer is doing real work. It is also doing real work that nobody who owns the stack is asking for. The protocol is the spread-scalper between the agent and the CLI: a layer that does not own the bottleneck, and instead extracts overhead from every call that crosses it.
Concretely, for our sixteen-tool server, the costs we measured or directly observed:
- Schema cache cost. Each tool description with full schema runs ~1.5–2.5K tokens. Sixteen tools, full schemas, baseline tools/list response: roughly 24–40K tokens injected into the prompt cache on every cold-start session. Multiplied by every fresh Claude Code session, every Cursor agent invocation, every Cline run. The Pi pattern reduces this by injecting only stubs first (a real win, on the order of 75% of the cache cost), but the win exists only as a workaround.
- Process supervision cost. The MCP server is a long-running process. It needs a launcher (our stax-mcp-launcher has a four-tier fallback chain to handle the case where the release binary is missing, the debug binary is missing, and the workspace needs to be built before it can serve). It needs a cargo build story. It needs to be restarted when its binary changes. It is one more thing to supervise.
- Versioning cost. The MCP protocol itself is moving; Anthropic has revised the spec twice in the time we have been operating against it. Each revision is a coordination event between server and client. We do not own that calendar.
- Discoverability cost. Adding a tool requires editing a Rust module, rebuilding the binary, restarting the session. Adding a CLI subcommand requires writing a function and rebuilding the binary. Same work; the MCP path adds JSON-RPC plumbing on every tool.
For a hosted-agent vendor running thousands of third-party integrations, these costs are the cost of doing business. For us, running our own tools against our own state on our own machines, they are pure spread.
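The schema-cache arithmetic above is simple enough to check by hand, and a small sketch makes the inputs explicit. The function names are ours for illustration. The 75% figure is the essay's own order-of-magnitude estimate, not a benchmark result.

```rust
/// Token range (low, high) injected by a full-schema tools/list response.
pub fn cold_start_tokens(tools: u32, per_tool_low: u32, per_tool_high: u32) -> (u32, u32) {
    (tools * per_tool_low, tools * per_tool_high)
}

/// Tokens still injected after a given percentage of the cost is avoided.
pub fn after_reduction(tokens: u32, percent_saved: u32) -> u32 {
    tokens * (100 - percent_saved) / 100
}
```

With sixteen tools at 1,500 to 2,500 tokens each, `cold_start_tokens(16, 1_500, 2_500)` gives the 24K to 40K range quoted above. A reduction on the order of 75% would leave roughly 6K to 10K tokens of stubs.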
IV. The Pi Pattern (Exhibit A: The Architecture Fighting Itself)
Return to the Pi pattern. Here is the actual mechanism, condensed from crates/stax-mcp/src/tools/mod.rs:
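    use std::collections::HashSet;
    use std::sync::{LazyLock, RwLock};

    // Names of the tools that have been invoked at least once this session.
    static WARMED: LazyLock<RwLock<HashSet<&'static str>>> =
        LazyLock::new(|| RwLock::new(HashSet::new()));

    // Called from the tools/call handler the first time a tool is touched.
    pub fn mark_warmed(name: &'static str) {
        if let Ok(mut w) = WARMED.write() {
            w.insert(name);
        }
    }

    // Checked per tool in the tools/list handler; a poisoned lock reads as cold.
    pub fn is_warmed(name: &str) -> bool {
        WARMED.read().map(|w| w.contains(name)).unwrap_or(false)
    }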
In the tools/list handler, the server checks is_warmed(name) for each tool. If the tool has not been invoked in this session, the response returns the tool's short description and an empty {"type": "object"} input schema. If the tool has been invoked at least once, the response returns the full description and full JSON Schema. The mark_warmed call fires inside the tools/call handler, the first time a tool is touched.
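The handler side of that description can be sketched as follows. This is not the stax-mcp handler: the `ToolDef` struct, its field names, and `list_entry` are hypothetical, and the warmed set is passed in explicitly so the sketch stands alone.

```rust
use std::collections::HashSet;

/// Hypothetical tool record; field names are illustrative only.
pub struct ToolDef {
    pub name: &'static str,
    pub short_description: &'static str,
    pub full_description: &'static str,
    /// Pre-serialized JSON Schema for the tool's input.
    pub full_schema: &'static str,
}

/// The stub input schema returned for tools that have not been invoked yet.
pub const STUB_SCHEMA: &str = r#"{"type":"object"}"#;

/// One tools/list entry as (name, description, input schema).
/// Cold tools get the stub; warmed tools get the full payload.
pub fn list_entry<'a>(tool: &'a ToolDef, warmed: &HashSet<&str>) -> (&'a str, &'a str, &'a str) {
    if warmed.contains(tool.name) {
        (tool.name, tool.full_description, tool.full_schema)
    } else {
        (tool.name, tool.short_description, STUB_SCHEMA)
    }
}
```

The branch is the whole mechanism. Everything else in the pattern is bookkeeping about when a name enters the warmed set.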
Read what this code is saying. The wire format mandates that the server tell the client about every tool's full schema, every time the client asks. The prompt cache pays the cost of that mandate, every turn, for every tool. The defense is to violate the wire format's intent on the first contact (to ship a structurally valid but semantically empty response) and to repair the violation only after the client has demonstrated, by actually calling the tool, that the cost will be amortized.
This is what the architecture fighting itself looks like at the code level. The defense is not a bug fix. The defense is not a performance optimization in the conventional sense. The defense is a small, careful violation of the protocol's own contract, hidden inside the server because the protocol has no mechanism for the server to say "these are my tools, but you do not need their schemas until you ask."
A protocol that requires its own server implementations to ship structurally-valid lies in order to be tractable at scale is a protocol that has the wrong abstraction for the workload. The Pi pattern is the canary, not the cure.
For comparison, here is the equivalent operation in the post-MCP architecture we are migrating to:
    # Discovery: agent runs this once when it cares about a tool.
    $ stax brief --help

    # Invocation: agent runs this when it wants the work done.
    $ stax brief render --date 2026-05-12 --format json
There is no schema to warm. There is no protocol to version. There is no process to supervise. The discovery surface is --help, which the agent reads on demand and the OS caches in the filesystem for free. The invocation surface is the binary. The cost of adding a tool is the cost of writing a function in the binary and shipping a new release.
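From the harness side, the whole integration reduces to spawning a process and reading its output. A minimal sketch using only the Rust standard library follows. `run_tool` and `discover` are hypothetical helpers, and the sketch assumes a `stax` binary on the PATH.

```rust
use std::io;
use std::process::Command;

/// Runs `program` with `args` and returns (exit success, captured stdout).
/// Fails with an io::Error if the program cannot be spawned at all.
pub fn run_tool(program: &str, args: &[&str]) -> io::Result<(bool, String)> {
    let output = Command::new(program).args(args).output()?;
    let stdout = String::from_utf8_lossy(&output.stdout).into_owned();
    Ok((output.status.success(), stdout))
}

/// Discovery on demand: read a subcommand's help text only when it matters.
pub fn discover(subcommand: &str) -> io::Result<(bool, String)> {
    run_tool("stax", &[subcommand, "--help"])
}
```

A missing binary surfaces as an ordinary spawn error at the moment of use, not as a dead server discovered at session start.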
V. The Three-Layer Alternative
We are replacing the MCP layer with a three-layer architecture that is honest about which layer owns which workload.
Layer 1, Zig CLI primitives. A single static binary, stax, with subcommands replacing each of the sixteen MCP tools: stax brief render, stax schedule today, stax memory grep, stax telemetry today, stax pnl rollup, and the rest. --help is the discovery surface. Stdout is JSON when --format=json is passed, plain text otherwise. Exit codes are predictable. The binary distributes through channels operators already own and trust: a Homebrew tap, a nixpkgs derivation, an AUR PKGBUILD, a crates.io-style direct release for the Zig ecosystem. The agent is exactly as capable as it would be with the MCP server, and it pays none of the schema-cache cost on cold-start. We chose Zig over Rust for the rewrite for three reasons: cross-compilation to every target triple is a one-line command, single-binary static linking with the system C library is the default, and the language's commitment to explicit memory and explicit control matches the layer's role as the foundation everything else stands on.
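The output contract is small enough to sketch. The version below is written in Rust to match the excerpt earlier in this essay; the planned Layer 1 is Zig, and nothing here is its actual argument router. `Format`, `parse_format`, and `render` are hypothetical names. The sketch accepts both the `--format=json` and `--format json` spellings.

```rust
/// Output mode for a subcommand.
#[derive(Debug, PartialEq)]
pub enum Format {
    Text,
    Json,
}

/// Selects JSON when `--format=json` or `--format json` is present,
/// plain text otherwise.
pub fn parse_format(args: &[&str]) -> Format {
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        if *arg == "--format=json" {
            return Format::Json;
        }
        if *arg == "--format" && it.next().copied() == Some("json") {
            return Format::Json;
        }
    }
    Format::Text
}

/// Renders one key/value result. Assumes neither string needs JSON escaping.
pub fn render(format: &Format, key: &str, value: &str) -> String {
    match format {
        Format::Json => format!(r#"{{"{key}":"{value}"}}"#),
        Format::Text => format!("{key}: {value}"),
    }
}
```

The point of the sketch is that the machine-readable path is opt-in per invocation, so the same binary serves an agent and a human at a terminal.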
Layer 2, Elixir/BEAM orchestration. The orchestration surface that previously hid in scattered launchd plists, ad-hoc Bash hooks, and Python scripts moves into an OTP-supervised Elixir application. Long-running watchers, scheduled invocations, fan-out of one event to multiple stax X subprocess calls, hot-swappable supervision trees: these are the workloads BEAM was built for. Each stax X invocation runs as a supervised Port subprocess; if it crashes, OTP restarts it; if we deploy a new binary, BEAM picks it up without restarting the supervisor. A Phoenix LiveView dashboard renders live job state for the operator. This is the layer the MCP server pretended to be and was not. MCP gave us a single long-running process with no supervision tree, no hot-swap, no fan-out, no scheduling, just a serializer in front of function calls.
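The restart behavior described above can be illustrated with a toy policy. This is not OTP and not stax-conductor: it is a Rust sketch of restart-on-failure with a bounded restart budget, and it ignores the time window that a real OTP supervisor applies to its restart intensity. `Outcome` and `supervise` are hypothetical names.

```rust
/// Result of supervising one job.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed { restarts: u32 },
    GaveUp { restarts: u32 },
}

/// Runs `job` until it reports success, restarting it after each failure,
/// and gives up once `max_restarts` restarts have been spent.
pub fn supervise<F: FnMut() -> bool>(mut job: F, max_restarts: u32) -> Outcome {
    let mut restarts = 0;
    loop {
        if job() {
            return Outcome::Completed { restarts };
        }
        if restarts == max_restarts {
            return Outcome::GaveUp { restarts };
        }
        restarts += 1;
    }
}
```

In the planned architecture the job would be a `stax X` subprocess behind a Port, and the budget would belong to the supervision tree rather than to a loop.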
Layer 3, Stax meta. Doctrine, defaults, brand identity. The naming conventions, the evidence-level discipline, the Type-I/Type-II audit lens, the decision matrices that tell future implementations which layer a new workload belongs in. This is the layer that catches the next "should we wrap this as an MCP server?" question and routes it through the working test: could this be a Zig CLI subcommand with --help and stdout? If yes, that is the path. If no, document why.
VI. What Survives
MCP is not dead in general. MCP is dead as our primary integration layer. The protocol is exactly the right shape of contract for the audience that needs it: hosted-agent vendors who need a clean, audited, schema-typed boundary into third-party tools they did not write. Anthropic's MCP servers for Google Drive and Canva still run in our agent harness, and they should; those tools live behind OAuth and remote APIs, on the other side of a vendor boundary we do not control. The Canva integration is exactly the case MCP was designed for. The Google Drive integration is exactly the case MCP was designed for. The integration with our own brief renderer, our own memory store, our own P&L rollup, our own telemetry: those were never the case MCP was designed for. We just used MCP for them because the documentation pointed that direction and the alternatives were less obvious.
The Pi pattern survives, too, as a learning artifact. The stax-mcp crate stays in the monorepo with a RETROSPECTIVE.md explaining what we built, what we learned, and why we replaced it. The cache-warming mechanism is publishable engineering for any MCP-server author who, for their own reasons, still wants to live inside the protocol. We have no further need of it.
VII. The Migration
The migration plan is mechanical and bounded. Sixteen MCP tools become sixteen stax subcommands. The Rust implementations of the tool logic mostly stay intact: we are not rewriting stax-brief or stax-memory from scratch. We are removing the JSON-RPC plumbing around them and exposing them through a Zig argument router that calls into the existing logic via FFI. Over time, logic moves to Zig where it earns the move.
The fourteen Anthropic Skills and thirty-plus subagent manifests that previously declared mcp__stax__* tool allowlists get a sweep: their tool surface becomes Bash, and their invocations become stax X --help followed by stax X --format=json. The agent reads the help text and runs the command. The agent's existing pattern-matching against descriptions still works exactly the same way; it never needed the JSON Schema to dispatch, it just needed the description, which it still has.
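The allowlist sweep is the kind of rewrite a short helper can do. The sketch below assumes, purely for illustration, that allowlist entries look like `mcp__stax__<word>_<word>` and that each underscore-separated word maps to one subcommand word. The real tool names may not follow that convention, and `to_cli` is a hypothetical function.

```rust
/// Maps a hypothetical MCP allowlist entry to a CLI argument vector.
/// Returns None for wildcard entries and for entries outside the
/// `mcp__stax__` namespace, which should be left alone.
pub fn to_cli(entry: &str) -> Option<Vec<String>> {
    let rest = entry.strip_prefix("mcp__stax__")?;
    if rest.is_empty() || rest == "*" {
        return None;
    }
    let mut argv = vec!["stax".to_string()];
    argv.extend(rest.split('_').map(str::to_string));
    Some(argv)
}
```

Entries for third-party MCP servers fall through as `None`, which matches the intent: only the tools we own move to the CLI.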
The ~/.claude/hooks/ scripts and the slash commands at ~/.claude/commands/ get the same sweep. /mcp-rebuild becomes /stax-rebuild. /curate still calls a binary; that binary is now stax curate. The Stop hook that auto-checkpoints session state calls stax journal checkpoint instead of inline cargo check + git commit.
Layer 2 (the Elixir orchestrator) is new work. We are building stax-conductor as an OTP application that supervises long-running stax X invocations, with a Phoenix LiveView dashboard for live job state. It replaces seven launchd plists and four Python ad-hoc daemons. First production cut is a thirty-day reach.
The whole migration is roughly one hundred to two hundred hours of work spread across ninety to one hundred eighty days. The payoff is owning the bottleneck instead of paying the spread.
VIII. Falsification
This essay's central claim is that MCP is the wrong abstraction for operators who own the stack. The claim is falsifiable. It is falsified if any of the following turns out to be true.
If the Pi-pattern cache savings prove to be smaller than five percent of total session token cost (meaning the schema-injection problem we identified is real but trivial), the architectural argument weakens substantially. We expect the savings to be much larger; the public benchmark is owed to readers and will run inside the next thirty days.
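Stated as a check, the five-percent condition looks like the sketch below. The function names are hypothetical, and the inputs would come from the benchmark once it exists. No measured numbers are implied here.

```rust
/// Share of total session tokens saved by stub-first schema loading.
pub fn savings_share(saved_tokens: u64, total_session_tokens: u64) -> f64 {
    if total_session_tokens == 0 {
        return 0.0;
    }
    saved_tokens as f64 / total_session_tokens as f64
}

/// True when savings fall below the five-percent threshold named above.
pub fn argument_weakened(saved_tokens: u64, total_session_tokens: u64) -> bool {
    savings_share(saved_tokens, total_session_tokens) < 0.05
}
```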
If the Zig CLI rewrite turns out to require sustaining a JSON-RPC layer for some non-obvious reason (for example, if an agent harness we want to support drops Bash access in favor of MCP-only tool dispatch), the working test fails for that surface, and we will document the exception.
If hosted-agent vendors successfully ship server-side schema-warming in the protocol itself (Anthropic could plausibly do this in a future MCP revision), the Pi-pattern argument becomes a historical artifact, but the spread-scalping argument remains, because the protocol is still wrong-shape for stack-owners regardless of whether its cache cost is fixed.
If the Elixir orchestration layer cannot be made to supervise stax X Port subprocesses at the latency we need for the brief-rendering pipeline, the three-layer split is wrong at Layer 2, and we will reconsider the orchestrator language without reconsidering Layer 1.
Until those conditions trigger, the doctrine stands: scalping the spread is what Counter-Example merchants do; owning the bottleneck is what merchants do. The Pi pattern was good engineering. It was also the warning. We are not staying inside the layer that needed it.
Postscript: what Layer 1 looks like once we stopped paying the wrapper spread
The AI inference stack in Zig is the worked example of this doctrine for the inference path itself: four single-purpose Zig 0.16 libraries (safetensors-zig → tokenizers-zig → vllm-zig → faiss-zig) that compose into real TinyLlama inference on CPU, no Python in the serving path, no MCP between the kernels and the orchestrator. Each repo carries its own pre-1.0 substrate framing and explicit non-claims. They are not a vLLM replacement. They are the Layer 1 that the three-layer architecture above this postscript called for, demonstrated.