AI & ML

Model Context Protocol Explained: What I Learned Shipping a Real MCP Server

MCP is USB-C for AI, but the abstract version teaches you nothing. Here's what actually happened when I shipped an MCP server for Headroom, my context-compression tool: why three narrow tools beat one, why I deleted a clever cache trick, and how a reversible compress/retrieve pair made a lossy transform safe to expose to an agent.

Dhileep KumarJun 10, 20267 min read

Model Context Protocol Explained: What I Learned Shipping a Real MCP Server

Eighteen months ago, wiring a language model to your database, your logs, and your file store meant writing three bespoke integrations and then rewriting them for the next model and the next framework. Every team rebuilt the same plumbing, slightly differently, and none of it was reusable. The Model Context Protocol drained that swamp: it is an open client-server standard for how AI apps talk to external tools and data, and it spread fast enough that by late 2025 the spec had been handed to a neutral foundation to steward.

I am not going to explain MCP from the outside. I actually shipped an MCP server, as one of three delivery modes for a context-compression tool I built called Headroom, and I also consumed MCP from the client side in a little VS Code extension I wrote. That combination taught me things the top Google results skip. So this post is the mental model plus what actually happened when I ran the protocol through a real project.

What MCP actually is (the mental model)

The analogy the community settled on is USB-C for AI: one port that speaks to everything. The host application, your IDE or your agent, runs an MCP client; each capability you expose runs behind an MCP server. They talk JSON-RPC over a defined transport, and the model discovers what is available at runtime rather than you hard-coding a tool list into every app. The protocol gives you a few primitives to expose:

Tools: functions the model can call, with typed inputs. This is the primitive people reach for first, and the one I leaned on hardest.
Resources: read-only data the host can pull into context (files, records, documents). Addressable context, not actions.
Prompts: reusable parameterized templates the server offers so common workflows are not re-typed.
Transport: stdio for local servers (the host launches your server as a subprocess) or streaming HTTP for remote ones, which is where production is heading.

The math is why it spread. The old world was M times N, every model paired with every tool through a custom integration. MCP makes it M plus N: write a server once and every MCP-aware client can use it. That asymmetry is real, and I felt it directly, because I did not write a client for Headroom at all. I wrote the server, and existing MCP hosts could already speak to it.

The server I actually shipped, and the one design call that mattered

Headroom compresses everything an agent reads (tool output, logs, RAG chunks, JSON) before it reaches the model, so you pay for fewer tokens without changing the answer. It runs three ways that share one pipeline: an inline library, a drop-in OpenAI/Anthropic-compatible proxy, and an MCP server. The MCP surface is deliberately small. It exposes three tools, not a wrapper around the whole system:

python

# The three tools my headroom MCP server exposes.
# Note: compress is reversible. The originals are cached locally
# (16-char SHA256 key, ~5-minute TTL, LRU eviction) so the model
# can pull anything back with headroom_retrieve. Worst case = retrieve everything.
TOOLS = [
    {
        "name": "headroom_compress",
        "description": (
            "Compress tool output, logs, JSON, or code before it enters "
            "context. Deterministic (no LLM call), reversible, keeps errors/anomalies."
        ),
    },
    {
        "name": "headroom_retrieve",
        "description": "Pull back the full original for a compressed block by its hash key.",
    },
    {
        "name": "headroom_stats",
        "description": "Report tokens in vs out and the compression ratio for this session.",
    },
]

Three narrow tools, not one god-tool, is not a stylistic choice. The single most useful thing I learned building an MCP server is that the model picks tools from their descriptions, so the description is part of the API, not a comment. A vague docstring causes wrong calls. And the pairing here encodes a real safety property: compress is lossy, so on its own it would be a one-way door. The retrieve tool is what makes it survivable. Compression is reversible because SmartCrusher keeps the full original in a local store keyed by a 16-char SHA256 hash with a default five-minute TTL, and the model can call headroom_retrieve to pull it back. My own note in the design docs put it bluntly: worst case equals retrieve everything.

The MCP server is not a convenience layer, it is an attack surface. The model decides what to call; your server decides what is allowed. I only understood that in my gut once my server could silently drop data a model needed.

That reversibility is the whole reason I trusted an MCP tool to sit between an agent and its data at all. Compression is deterministic by design: there is no LLM call inside it, it is all statistical analysis, pattern matching, and rule-based transforms. That gives predictable output and no added API cost, but deterministic and lossy still means it can guess wrong on some weird input. Making the guess reversible, and exposing the undo as its own MCP tool, is how a compressor stops being scary.

The gotcha that cost me a whole code path

Here is the war story I would not have gotten from a tutorial. Headroom also tries to keep provider prompt caches actually hitting, which matters because a cache hit is a 90% read discount on Anthropic and a 50% discount on OpenAI. Cache hits require a byte-identical prefix. So my original CacheAligner was clever: it detected dynamic content in the system prompt (today's date, a UUID, a session id), stripped it out, and re-appended it at the tail so the cached prefix stayed stable.

I deleted that code. It violated a core invariant I had written for myself: the cache hot zone, the system prompt, must never be mutated. The problem is subtle and expensive. Mutating even one byte of a cached prefix changes the provider's cache key, which drops the hit rate to zero and, in my own words in the source, silently torches the customer's bill. The rewrite that was supposed to save cache hits was the thing most likely to destroy them. So the transform became detector-only: it now finds the volatile content and warns, but never touches the prompt.

python

if all_findings:
    counts_str = ", ".join(f"{k}={v}" for k, v in sorted(counts.items()))
    msg_text = (
        f"CacheAligner: detected volatile content in system prompt "
        f"({counts_str}); cache prefix unstable. "
        "Move dynamic values out of the system prompt to recover cache hits."
    )
    warnings.append(msg_text)
    logger.warning(msg_text)

Two lessons fell out of that. First, for anything an MCP server exposes to an agent, prefer detect-and-warn over silent auto-fix when the fix could corrupt something you cannot see. Second, be honest that docs drift: parts of my own architecture docs still describe the old rewrite behavior as if it were live, and only the code tells you the truth. If you build an MCP server, assume the same about everyone else's: trust the code, not the prose.

On the client side, and on honest numbers

The client side taught me the flip lesson. In a separate project, a roughly 300-line VS Code extension I built as a minimal take on Cursor, I was the MCP consumer rather than the provider. There the hard part was not the protocol, it was context: when no text is selected I send the whole active file but cap it at 12,000 characters to keep tokens sane. That cap is a blunt instrument, and it is exactly the kind of thing a well-designed MCP tool (compress-then-retrieve) would replace. Standing on both sides made the protocol click: the client is only as good as the tools it can discover, and the server is only as good as the descriptions and safety rails it ships with.

Now the honesty caveat, because it is the whole point of writing from real experience. Headroom's headline numbers are real but self-reported and captured on different versions, so I will not launder them into one clean benchmark. On one representative JSON eval, 100 production log entries with a critical error buried at position 67 compressed from 10,144 tokens to 1,260 (about 87.6%) while all four evaluation answers were still preserved. That is the shape of the win: big token cuts without losing the error that matters. But some of those numbers come from a v0.5.18 suite, some from an older v0.3.7 latency run, and the fleet-wide figures are production telemetry I collected, not an independent audit. Treat them as my measurements, not laws of physics.

And compression is not free. My own break-even analysis says the latency win only shows up on slower, pricier models; on fast cheap ones the compression overhead can exceed the prefill time it saves. There are even inputs, like grep output and source code, where the compressor reports 0% because they are already compact, and passing them through untouched is the correct behavior, not a failure. Building the server forced me to measure the cases where my own tool should do nothing.

That is the information gain you cannot get from the top ten results. MCP in the abstract is USB-C for AI. MCP in practice is a small set of narrow tools with descriptions the model actually reads, a hard rule about which bytes you are never allowed to touch, an undo tool that makes a lossy transform safe to expose, and a discipline of reporting numbers honestly even when they are your own. Build one server, wire one client, and the agentic ecosystem stops looking like magic and starts looking like something you can ship, and reason about, and occasionally have to delete a clever idea from.

Enjoyed this?

Get the next deep dive in your inbox. No spam — just the stories worth reading.

Subscribe to the newsletter

What MCP actually is (the mental model)

The server I actually shipped, and the one design call that mattered

The gotcha that cost me a whole code path

On the client side, and on honest numbers

Enjoyed this?

Comments