AI & ML

Semantic Caching Is the Second Cache: What I Learned Cutting an Agent's Token Bill

Everyone reaches for semantic caching to cut LLM costs. Building Headroom, an open-source context-compression proxy, taught me the bigger win for agents is the provider prompt cache -- and the fastest way to lose it is to "helpfully" rewrite your system prompt. A first-hand look at byte-determinism, a mistake I shipped then deleted, and why I detect volatile tokens with parsers instead of regex.

Dhileep KumarJun 12, 20267 min read

Semantic Caching Is the Second Cache: What I Learned Cutting an Agent's Token Bill

There is a failure mode for AI features that has nothing to do with whether they work. They work fine, and then someone looks at the monthly token bill, divides it by the value the feature delivers, and quietly turns it off. Semantic caching is the usual first answer: if a new question means the same thing as one you already answered, return the stored answer and skip the model. It is a good idea. But when I actually went to shave the token bill on agent workloads, semantic caching turned out to be the wrong layer to start at, and I want to explain why with numbers from the thing I built.

I maintain an open-source project called Headroom. It is a local-first context-compression layer that sits between an AI agent and the LLM and shrinks everything the agent reads (tool output, logs, RAG chunks, files, conversation history) before it reaches the model. It ships three ways that share one pipeline: an inline library you call as compress(messages), a drop-in OpenAI/Anthropic-compatible proxy you run with headroom proxy --port 8787 and zero code changes, and an MCP server exposing headroom_compress, headroom_retrieve, and headroom_stats. Building it forced me to be precise about which caching problem I was actually solving, because there are two, and people conflate them.

Two different caches, and the one nobody tunes

Semantic caching answers the question "have I seen this meaning before? " and it lives in your application. Provider-side prompt caching answers a much dumber question: "is the prefix of this request byte-for-byte identical to one I saw recently? " and it lives at the model vendor. These are not the same lever. Semantic caching skips whole model calls when a user rephrases a question. Prompt caching discounts the tokens you resend on every single turn of a long conversation, whether or not any human ever repeats themselves.

For a chatbot with lots of repeat FAQs, the semantic cache is the big win. But for an agent (which re-sends a giant system prompt, a tool schema, and a growing message history on every turn) the provider prompt cache is where the money is, and it is astonishingly easy to break by accident. That was the surprising part of building Headroom: the highest-leverage caching fix was not a similarity threshold. It was making requests byte-deterministic so the provider cache would actually hit.

The economics are worth internalizing, because they differ per provider. From the provider docs I had to model against: Anthropic cache_control blocks give a 90% read discount but charge a 25% write premium with a 5-minute TTL; OpenAI automatic prefix caching gives 50% off but demands a byte-identical prefix and a 1024-token minimum; Google's CachedContent API gives 75% off but needs a 32,768-token minimum cache. A single dynamic token in the wrong place forfeits all of that.

One volatile token (today's date, a UUID, a session ID) near the top of your system prompt changes the cache key, drops the provider hit-rate to zero, and silently torches the bill. The cache you never tuned is usually the one costing you the most.

The mistake I shipped, then deleted

Headroom's cache-stabilization component is called the CacheAligner. Its job is to keep the provider prompt cache hitting. My first version was clever in exactly the wrong way: it detected dynamic content in the system prompt (dates, UUIDs), stripped it out, and re-appended it at the tail so the stable part of the prefix stayed constant. It worked in a demo. Then it collided with a hard invariant I had written down for myself: the cache hot zone (the system prompt) must never be mutated. Rewriting it, even to "help" the cache, changes the very bytes the provider hashes to build the cache key. You are trying to save the cache and you are the one breaking it.

So I deleted the rewrite path. The CacheAligner is now a detector only: it finds volatile content, emits a warning, and never touches the prompt. The rule I enforce is blunt, and I keep it in the Rust core as a comment so future-me does not get clever again:

rust

//! Headroom's compressor must **never** modify any byte that's part
//! of that prefix -- doing so changes the cache key, drops the hit
//! rate to 0, and silently torches the customer's bill.
pub fn compute_frozen_count(parsed: &Value) -> usize {
    let mut highest_message_index: Option<usize> = None;
    walk_messages(parsed, &mut highest_message_index);
    walk_system(parsed);  // logging + TTL check only -- never bumps floor
    walk_tools(parsed);   // logging + TTL check only -- never bumps floor
    highest_message_index.map(|i| i + 1).unwrap_or(0)
}

An honest footnote: some of my own docs still describe the old rewrite behavior as if it ships. It does not. The detector-only code is what actually runs, and the prose drifted behind it. I mention that partly as a mea culpa and partly because "the docs describe what the code used to do" is one of the most common ways caching advice goes stale on the internet, including mine.

Detecting volatile content without regex

If you are only going to warn, you had better be right about what is volatile, because a false positive teaches people to ignore your warnings. I made an explicit build constraint: pattern detection uses parsers, not regex. A regex that "looks like a UUID" will happily flag a 32-character MD5 hash. So instead of matching shapes with a pattern string, the detector defers to the standard library. UUIDs go through uuid. UUID, timestamps through datetime. fromisoformat, and the UUID check accepts only the canonical 36-character dashed form precisely so it cannot swallow a hex hash:

python

def _is_uuid(token: str) -> bool:
    # Accepts only the canonical 36-char form with dashes. The 32-char
    # dashless form is indistinguishable from an MD5 hex digest and would
    # misclassify hashes; we treat that case as a hex hash instead.
    if len(token) != _UUID_CANONICAL_LEN:
        return False
    if token.count("-") != 4:
        return False
    try:
        _uuid.UUID(token)
    except (ValueError, AttributeError):
        return False
    return True

The other half of stabilization is boring and mechanical, which is why it works. The proxy sorts the tools array alphabetically by name, recursively sorts JSON-schema keys (while preserving genuinely ordered arrays like oneOf), and auto-places up to four Anthropic cache_control breakpoints at the natural boundaries: system, tools, the stable-history line, and the latest user message. Two turns that were logically identical but serialized in a different key order are now byte-identical, so they hit the same cache entry instead of two.

The one guardrail I would not skip

There is one detail that gates all of the above, and I learned it the cautious way. All the automatic cache_control placement and prompt_cache_key injection only runs in pay-as-you-go auth mode. On OAuth or subscription auth, Headroom forwards the request byte-for-byte and does nothing, because "helpfully" adding cache markers there can void scope or clobber the client's own caching. When you are not sure your optimization is safe for a given caller, the correct optimization is none. That instinct saved me more grief than any clever transform did.

Separate the two caches. Semantic caching skips repeated meanings in your app; prompt caching discounts resent prefixes at the provider. Agents usually need the second one first.
Guard the hot zone. Never mutate the system prompt to "fix" caching. Detect volatile content and warn; do not rewrite. Mutating a cached byte is how you drop the hit rate to zero.
Make requests deterministic. Sort tools, sort schema keys, place cache markers at fixed boundaries. Byte-identical prefixes are the whole game for OpenAI's 50% and Anthropic's 90% discounts.
Detect with parsers, not regex. A UUID regex will flag an MD5 hash; uuid. UUID will not. False positives train people to ignore you.
Do nothing when unsure. Gate risky optimizations behind the auth modes where they are provably safe, and no-op everywhere else.

Where semantic caching still wins

None of this means semantic caching is wrong. For an app full of repeated intents (support, FAQs, internal search) keying on meaning instead of exact strings genuinely stops you from paying twice for the same answer, and the mechanism really is a few lines: embed the request, find the nearest past question, and if it is inside a similarity threshold you tuned on real traffic, return the stored answer. The threshold is the whole job; set it too loose and you serve confidently wrong answers, too tight and you save nothing. Sample your cache hits the way you would sample model output.

What building Headroom changed for me is the ordering. Before you reach for the semantic cache, check whether you are quietly paying full price for a prefix the provider would happily discount by 50 to 90 percent, if only you stopped changing one byte of it every turn. That fix is not glamorous and there is no embedding model in it. It is just refusing to mutate the bytes the cache is counting on, and letting determinism do the rest. The cheapest token is the one you were about to resend and the provider already had cached.

Enjoyed this?

Get the next deep dive in your inbox. No spam — just the stories worth reading.

Subscribe to the newsletter