
Context Engineering in Practice: What Building a Token-Compression Layer Taught Me About the Window

Context engineering sounds abstract until you have to ship it. Building Headroom -- a local-first layer that shrinks tool outputs, logs, and RAG chunks by an advertised 60-95% before they reach the model -- turned tidy context-window advice into concrete design calls: reversible compression, a cache you can torch with one byte, and savings you literally cannot measure head-on.

Dhileep Kumar | Jun 10, 2026 | 8 min read


Everyone agrees the context window is a scarce resource. You've read the posts: context rot, lost-in-the-middle, cost that scales with every token. It's all true, and it all stays comfortably abstract until you have to build a system whose entire job is deciding what the model sees. I built one. Headroom is a local-first compression layer that sits between an AI agent and its LLM and shrinks everything the agent reads -- tool outputs, logs, RAG chunks, files, conversation history -- by an advertised 60-95% before it reaches the model. Shipping it turned a lot of tidy context-engineering advice into concrete, sometimes uncomfortable, engineering decisions.

This is the version I wish I'd read first. The concept is real, and the four canonical moves -- select, structure, compress, isolate -- are a genuinely good map. But building them surfaced trade-offs the blog version glosses over: compression that has to be reversible or you don't dare use it, a provider cache you can quietly torch by editing one byte, and 'savings' numbers you cannot measure head-on no matter how badly you want a clean chart. Here's what actually happened, with the real numbers and their caveats attached.

Compress -- but keep a way back

The standard advice says compress: summarize old turns, distill long documents, drop the redundant. The instinct is right. The usual implementation -- ask the LLM to summarize -- is where I diverged hard. LLM summarization is non-deterministic, adds latency, adds API cost, and can hallucinate a fact that wasn't in the source. For a layer that runs on every single request, that combination is disqualifying.

So Headroom's compression does no LLM calls at all. A ContentRouter detects each content type and dispatches it to a specialized deterministic compressor: SmartCrusher does statistical analysis on JSON and arrays (factoring out constant fields, preserving anomalies, spikes, and errors), a CodeCompressor does AST-aware trimming with tree-sitter (keeping imports and signatures), and an opt-in ModernBERT model handles prose. Everything is pattern-matching and rule-based, which buys predictable output and, critically, no added token bill for the compression itself.
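To make "factoring out constant fields" concrete, here is a minimal sketch of the idea on a list of JSON objects. It is not SmartCrusher's actual algorithm; the helper names `factor_constant_fields` and `expand` and the sample log rows are hypothetical, chosen only for illustration.

```python
import json
from typing import Any


def factor_constant_fields(rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Split a list of JSON objects into shared constants and per-row variation.

    Hypothetical helper for illustration; not Headroom's SmartCrusher.
    """
    if not rows:
        return {"constants": {}, "rows": []}

    # A field is constant if every row carries it with the same value.
    first = rows[0]
    constants = {
        key: value
        for key, value in first.items()
        if all(key in row and row[key] == value for row in rows)
    }
    # Keep only the fields that actually vary, row by row.
    varying = [
        {k: v for k, v in row.items() if k not in constants} for row in rows
    ]
    return {"constants": constants, "rows": varying}


def expand(factored: dict[str, Any]) -> list[dict[str, Any]]:
    """Inverse of factor_constant_fields (key order inside a row may differ)."""
    return [{**factored["constants"], **row} for row in factored["rows"]]


# Made-up sample rows: two fields repeat on every line, two vary.
logs = [
    {"service": "api", "region": "us-east-1", "level": "INFO", "msg": "ok"},
    {"service": "api", "region": "us-east-1", "level": "ERROR", "msg": "timeout"},
    {"service": "api", "region": "us-east-1", "level": "INFO", "msg": "ok"},
]
factored = factor_constant_fields(logs)
print(json.dumps(factored, indent=2))
```

The repeated `service` and `region` values are stated once, while the one `ERROR` row stays fully visible. The transform is deterministic and, in this simplified form, exactly invertible.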

But deterministic still means lossy, and lossy means you will sometimes guess wrong about what mattered. The move that made me trust the whole system is CCR -- Compress-Cache-Retrieve. When SmartCrusher compresses something, it stashes the full original in a local store keyed by a SHA-256 hash truncated to 16 characters, with a default 5-minute TTL and LRU eviction. The model can call a headroom_retrieve tool (or hit a retrieve endpoint) to pull the original back on demand, and there's even a BM25 search over the cached originals. Compression stops being a one-way door. The framing I kept returning to: worst case, the model retrieves everything and you're no worse off than if you hadn't compressed at all.
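The cache half of CCR can be sketched in a few lines. This is an in-memory stand-in under stated assumptions, not Headroom's store: the class name `OriginalStore`, the `max_entries` limit, and the injectable `clock` are hypothetical, and the key is assumed to be the first 16 hex characters of the SHA-256 digest.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Optional


class OriginalStore:
    """Tiny in-memory store for originals: hash key, TTL, LRU eviction."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,  # 5-minute default, as in the text
        max_entries: int = 1000,     # hypothetical capacity
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._items: "OrderedDict[str, tuple[float, str]]" = OrderedDict()

    def put(self, original: str) -> str:
        # Assumed key scheme: first 16 hex chars of SHA-256 over the content.
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._items[key] = (self._clock() + self._ttl, original)
        self._items.move_to_end(key)         # mark as most recently used
        while len(self._items) > self._max:  # evict least recently used
            self._items.popitem(last=False)
        return key

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, original = entry
        if self._clock() >= expires_at:      # expired: drop it, report a miss
            del self._items[key]
            return None
        self._items.move_to_end(key)
        return original


store = OriginalStore()
key = store.put('{"rows": ["full tool output goes here"]}')
print(key, store.get(key) is not None)
```

A compressed payload would carry the returned key so that a retrieve call can hand back the exact original while the entry is still alive. After expiry or eviction the lookup simply misses.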

Reversible compression changes the risk calculus entirely. A wrong guess about what mattered isn't data loss -- it's one extra tool call. That single property is what let me compress aggressively instead of timidly.

This is the compress move from the concept posts, but with a safety net the tidy version never mentions. And the numbers it buys are real: on a SmartCrusher eval of 100 production log entries with a critical error deliberately planted at position 67, the input dropped from 10,144 tokens to 1,260 -- 87.6% -- while all four evaluation questions were still answered correctly, error included. To be precise about provenance: that's Headroom's own eval on an Apple M-series machine, not an independent reproduction. I'll flag which numbers are self-reported throughout.

The 'structure' move has one hard rule: never touch a cached byte

Here's where context engineering collided with provider billing in a way no concept post prepared me for. Provider prompt caches -- Anthropic cache_control, OpenAI automatic prefix caching, Google CachedContent -- give enormous discounts for reusing a stable prefix. Anthropic reads a cached block at a 90% discount; OpenAI's prefix cache is a 50% discount but needs a byte-identical prefix of at least 1024 tokens; Google's is 75% but wants a 32,768-token minimum cache. These are not rounding errors. Getting the cache to hit is often a bigger win than the compression is.
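A rough way to see why the cache hit matters so much is to price a request in "full-price token" units using only the read discounts quoted above. The function name and the 40,000/36,000 token split are hypothetical, and the sketch deliberately ignores anything not stated here, such as cache-write pricing or per-provider eligibility rules.

```python
def cached_input_cost(total_tokens: int, cached_tokens: int, read_discount: float) -> float:
    """Input cost in full-price-token units when cached_tokens hit the cache."""
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    uncached = total_tokens - cached_tokens
    return uncached + cached_tokens * (1.0 - read_discount)


# Read discounts as quoted in the paragraph above.
READ_DISCOUNT = {"anthropic": 0.90, "openai": 0.50, "google": 0.75}

# Hypothetical request: 40,000 input tokens, 36,000 of them a stable prefix.
for provider, discount in READ_DISCOUNT.items():
    cost = cached_input_cost(40_000, 36_000, discount)
    saved = 1 - cost / 40_000
    print(f"{provider}: {cost:,.0f} token-equivalents, {saved:.1%} saved")
```

The same function with `cached_tokens=0` shows what a silent cache miss costs: the full 40,000.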

The catch: a prompt cache is keyed on bytes. One dynamic token -- today's date, a session UUID, a request ID -- sitting in your system prompt changes the prefix on every call, so the cache never hits, and you quietly pay full price forever while believing you're cached.
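The byte-keyed behavior is easy to model. The sketch below uses a SHA-256 fingerprint as a stand-in for a provider's cache key; that is an assumption for illustration, since providers do not document their keys as SHA-256, only that the prefix must match exactly. The prompts and function names are hypothetical.

```python
import hashlib


def prefix_fingerprint(system_prompt: str) -> str:
    """Stand-in for a provider cache key: a hash over the exact prefix bytes."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]


STATIC_RULES = "You are a code-review assistant. Follow the team style guide."


def volatile_prompt(today: str) -> str:
    # Anti-pattern: a per-day value baked into the cached prefix.
    return f"Today is {today}. {STATIC_RULES}"


def stable_prompt() -> str:
    # Dynamic values live elsewhere, for example in the latest user message.
    return STATIC_RULES


day1 = prefix_fingerprint(volatile_prompt("2026-06-09"))
day2 = prefix_fingerprint(volatile_prompt("2026-06-10"))
print(day1 == day2)  # False: one changed character means a different key

call1 = prefix_fingerprint(stable_prompt())
call2 = prefix_fingerprint(stable_prompt())
print(call1 == call2)  # True: identical bytes, so a hit is possible
```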

My first design did the obvious 'smart' thing: detect the volatile values in the system prompt, strip them out, and re-append them at the tail so the prefix goes stable. It worked. I deleted it anyway. It violated an invariant I'd written down as I2 -- the cache hot zone (the system prompt) must never be mutated -- because the moment you rewrite a cached byte you change the cache key, drop the hit rate to zero, and, in the words of the Rust core's own comment, silently torch the customer's bill. The optimization would have broken the exact thing it was optimizing. The CacheAligner is now strictly a detector: it finds volatile content, warns you, and changes nothing.

python

if all_findings:
    counts_str = ", ".join(f"{k}={v}" for k, v in sorted(counts.items()))
    msg_text = (
        f"CacheAligner: detected volatile content in system prompt "
        f"({counts_str}); cache prefix unstable. "
        "Move dynamic values out of the system prompt to recover cache hits."
    )
    warnings.append(msg_text)
    logger.warning(msg_text)

One detail I'm oddly proud of: the volatile-content detection uses real parsers, not regex. A 32-character dashless string could be a UUID or an MD5 hash, and a regex would happily mislabel it. Headroom defers to the standard-library UUID parser and only accepts the canonical 36-character dashed form, so hashes get classified as hashes and cache-busting IDs get classified correctly.

python

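import uuid as _uuid  # alias assumed from the excerpt's use of _uuid.UUID

# The excerpt does not show this module-level constant; 36 is assumed from
# the canonical dashed form (32 hex digits plus 4 dashes).
_UUID_CANONICAL_LEN = 36


def _is_uuid(token: str) -> bool:
    # Accepts only the canonical 36-char form with dashes. The 32-char
    # dashless form is indistinguishable from an MD5 hex digest and would
    # misclassify hashes; we treat that case as a hex hash instead.
    if len(token) != _UUID_CANONICAL_LEN:
        return False
    if token.count("-") != 4:
        return False
    try:
        _uuid.UUID(token)
    except (ValueError, AttributeError):
        return False
    return True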
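To see why the length check matters, note that Python's `uuid.UUID` happily parses a bare 32-character hex string. The sketch below is a hypothetical classifier in the same spirit, not Headroom's detector; `classify_token` and its labels are invented for illustration.

```python
import string
import uuid

_HEX_DIGITS = set(string.hexdigits)
# Hypothetical labels keyed by hex length.
_HASH_LABELS = {32: "md5-like hex", 40: "sha1-like hex", 64: "sha256-like hex"}


def classify_token(token: str) -> str:
    """Label a token as 'uuid', a hex-hash family, or 'other'."""
    if len(token) == 36 and token.count("-") == 4:
        try:
            uuid.UUID(token)
        except ValueError:
            return "other"
        return "uuid"
    if len(token) in _HASH_LABELS and all(c in _HEX_DIGITS for c in token):
        return _HASH_LABELS[len(token)]
    return "other"


md5_of_empty = "d41d8cd98f00b204e9800998ecf8427e"

# The stdlib parser alone would accept the dashless digest as a UUID...
print(uuid.UUID(md5_of_empty))
# ...so the length and dash checks are what keep it classified as a hash.
print(classify_token(md5_of_empty))
print(classify_token("123e4567-e89b-12d3-a456-426614174000"))
print(classify_token("not-a-uuid"))
```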

What Headroom does mutate is the safe, structural stuff outside the hot zone, and only on pay-as-you-go auth: the Rust proxy sorts the tools array alphabetically, recursively sorts JSON-schema keys (while preserving genuinely ordered arrays like oneOf), and places the cache_control markers deterministically. The freeze logic is blunt about the contract -- markers in the system and tools blocks are unconditionally part of the hot zone and never bump the freeze floor; only per-message content markers do.

rust

//! Headroom's compressor must **never** modify any byte that's part
//! of that prefix -- doing so changes the cache key, drops the hit
//! rate to 0, and silently torches the customer's bill.
pub fn compute_frozen_count(parsed: &Value) -> usize {
    let mut highest_message_index: Option<usize> = None;
    walk_messages(parsed, &mut highest_message_index);
    walk_system(parsed);  // logging + TTL check only -- never bumps floor
    walk_tools(parsed);   // logging + TTL check only -- never bumps floor
    highest_message_index.map(|i| i + 1).unwrap_or(0)
}
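The structural normalization described before the Rust excerpt (sorting tools, sorting schema keys, leaving ordered arrays alone) can be sketched as follows. This is a conservative illustration, not the Rust proxy's logic: it sorts object keys everywhere and never reorders any array except the top-level tools list. The function names and sample tool definitions are hypothetical.

```python
import json
from typing import Any


def sort_keys_deep(node: Any) -> Any:
    """Return a copy with object keys sorted; array order is left untouched."""
    if isinstance(node, dict):
        return {key: sort_keys_deep(node[key]) for key in sorted(node)}
    if isinstance(node, list):
        return [sort_keys_deep(item) for item in node]
    return node


def canonical_tools(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Sort tools by name, then normalize key order inside each definition."""
    ordered = sorted(tools, key=lambda tool: tool.get("name", ""))
    return [sort_keys_deep(tool) for tool in ordered]


def serialize(tools: list[dict[str, Any]]) -> str:
    return json.dumps(canonical_tools(tools), separators=(",", ":"))


search = {
    "name": "search",
    "input_schema": {
        "type": "object",
        "properties": {"q": {"oneOf": [{"type": "string"}, {"type": "null"}]}},
    },
}
# Same tool, keys written in a different order.
search_shuffled = {
    "input_schema": {
        "properties": {"q": {"oneOf": [{"type": "string"}, {"type": "null"}]}},
        "type": "object",
    },
    "name": "search",
}
fetch = {"name": "fetch", "input_schema": {"type": "object"}}

a = serialize([search, fetch])
b = serialize([fetch, search_shuffled])
print(a == b)  # both orderings serialize to the same bytes
```

Because two semantically identical requests now produce identical bytes, an agent framework that emits tools in a different order per call no longer changes the prefix.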

The numbers that hold up -- and the honesty tax on the ones that don't

Input compression is measurable, so I can show you real workloads. From Headroom's proof table on actual agent traffic:

Code search, 100 results: 17,765 -> 1,408 tokens (92%)
SRE incident debugging: 65,694 -> 5,118 tokens (92%)
GitHub issue triage: 54,174 -> 14,761 tokens (73%)
Codebase exploration: 78,502 -> 41,254 tokens (47%)

The spread is the point. Compression is not a flat percentage -- it's a function of how redundant the content is. Codebase exploration compresses least because code is already dense. And that leads to a caveat the concept posts never mention: some content should not compress at all. Grep results and source code get passed through unchanged, and the docs deliberately flag that '0.0%' line because a naive reader sees it and assumes something broke. Zero is correct there; it means the compressor refused to damage already-compact data.
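The percentages in the workload list are plain ratios of the quoted token counts, which makes them easy to recheck. A small sketch, with `reduction` as a hypothetical helper name:

```python
def reduction(before: int, after: int) -> float:
    """Fraction of input tokens removed by compression."""
    return 1 - after / before


# Token counts copied from the workload list and the log eval in this post.
WORKLOADS = {
    "Log eval (100 entries)": (10_144, 1_260),
    "Code search, 100 results": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub issue triage": (54_174, 14_761),
    "Codebase exploration": (78_502, 41_254),
}

for name, (before, after) in WORKLOADS.items():
    print(f"{name}: {before:,} -> {after:,} tokens ({reduction(before, after):.1%})")
```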

Accuracy is the other axis, because compression is worthless if it changes answers. On Headroom's own runs at N=100, GSM8K math held at 0.870 before and after (delta zero) and TruthfulQA ticked up from 0.530 to 0.560 -- plausibly because trimming noise removed some distractors, though at N=100 a 0.03 shift is also well within sampling error. These are self-reported evals captured on different tool versions, so treat them as the project's own measurements, not a peer-reviewed benchmark or a single coherent run.

Now the genuinely honest part. Output-token savings -- trimming what the model writes back -- are unmeasurable head-on, because you never observe the response the model would have written without your intervention. There is no ground truth for a counterfactual. So Headroom does not quote a hard number: it reports output savings as an estimate with a 95% confidence interval, and if you want a real measured figure it ships a randomized holdout that leaves 10% of conversations unshaped as a control group. That's the most defensible way I know to talk about a counterfactual metric, and it's the opposite of the confident round numbers you usually see in this space.
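One way to picture the holdout approach: assign roughly 10% of conversations to an unshaped control group, then compare mean output tokens between groups with a confidence interval. The sketch below is an assumption-laden illustration, not Headroom's implementation. Hash-based assignment stands in for randomization, the interval uses a normal approximation, and the sample numbers are invented purely to exercise the function.

```python
import hashlib
import math
import statistics


def in_holdout(conversation_id: str, holdout_fraction: float = 0.10) -> bool:
    """Deterministically place about holdout_fraction of conversations in control."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < holdout_fraction


def savings_estimate(
    control: list[float], treated: list[float]
) -> tuple[float, float, float]:
    """Mean output tokens saved per conversation with a ~95% interval."""
    diff = statistics.fmean(control) - statistics.fmean(treated)
    std_err = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(treated) / len(treated)
    )
    return diff, diff - 1.96 * std_err, diff + 1.96 * std_err


# Invented output-token counts per conversation, for illustration only.
control_tokens = [520.0, 480.0, 610.0, 555.0, 498.0, 530.0]
treated_tokens = [410.0, 450.0, 390.0, 470.0, 430.0, 405.0, 445.0, 420.0]

estimate, low, high = savings_estimate(control_tokens, treated_tokens)
print(f"estimated saving: {estimate:.1f} tokens (95% CI {low:.1f} to {high:.1f})")
```

The interval is the honest part: with a small control group it will be wide, and reporting it that way says so.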

The gotchas the tidy version skips

A few things I only learned by shipping, none of which fit the clean four-moves narrative:

Compression can cost more latency than it saves. The break-even is explicit: on fast, cheap models the compression overhead exceeds the LLM prefill time it saves in every scenario I measured. The latency win only appears on slower, pricier models where prefill dominates -- and even then one JSON-array scenario stayed net-negative. 'Compress everything' is wrong advice; 'compress when the model is expensive enough to earn it' is right.
Your docs will drift from your code. Headroom's own architecture docs still describe the system-prompt-rewrite behavior I deleted, while the code documents that path as removed. I caught it because I was reading source, not prose -- a standing reminder that a nice writeup can describe a system that no longer exists.
Corporate networks break the install. Behind an SSL-inspecting proxy, the Rust build backend fetches its toolchain over a MITM'd TLS connection and dies with CERTIFICATE_VERIFY_FAILED; two runtime model assets need CA-bundle trust too. None of that is on the happy path, and all of it is where real users get stuck.
Telemetry is off by default and everything runs locally. For a tool that sees every token of your prompts, 'your data stays here' isn't a marketing bullet -- it's the precondition for anyone trusting it at all.
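The latency break-even from the first gotcha reduces to one comparison: prefill time saved versus compression time spent. A minimal sketch, assuming prefill time scales linearly with input tokens; the function name, the prefill rates, and the 600 ms overhead are hypothetical placeholders, not measurements from Headroom.

```python
def net_latency_ms(
    tokens_before: int,
    tokens_after: int,
    prefill_tokens_per_s: float,
    compress_ms: float,
) -> float:
    """Positive means compression made the request faster; negative, slower."""
    saved_prefill_ms = (tokens_before - tokens_after) / prefill_tokens_per_s * 1000
    return saved_prefill_ms - compress_ms


# Token counts from the log eval; rates and overhead are made-up placeholders.
fast_model = net_latency_ms(10_144, 1_260, prefill_tokens_per_s=20_000, compress_ms=600)
slow_model = net_latency_ms(10_144, 1_260, prefill_tokens_per_s=3_000, compress_ms=600)

print(f"fast model: {fast_model:+.0f} ms")
print(f"slow model: {slow_model:+.0f} ms")
```

With these placeholder rates the fast model comes out net-negative and the slow one net-positive, which is the shape of the result described above. Real break-even points depend on measured prefill speed and compressor overhead.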

The concept posts are right that the context window is the one part of an LLM app you fully control, and that treating it as a scarce, curated resource is the discipline that separates a demo from a product. What building Headroom taught me is that the discipline is mostly made of unglamorous invariants: keep compression reversible so a wrong guess costs a tool call, not your data; never mutate a cached byte even when it looks clever; and refuse to quote a number you can't actually measure. Select, structure, compress, isolate is the map. The invariants are the terrain you learn by walking it.
