Semantic Caching: Cut Your LLM Bill Without Hurting Quality

Teams are quietly killing AI features — not because they don’t work, but because the token bill doesn’t justify them. Semantic caching is the fix: serve a cached answer when someone asks the same thing in different words.

Dhileep Kumar · 7 min read
There’s a failure mode for AI features that has nothing to do with whether they work. They work fine — and then someone looks at the monthly token bill, divides it by the value the feature delivers, and quietly shuts it off. In 2026 that’s happening a lot, and the culprit is usually the same: the app pays a frontier model to generate an answer it has already generated, over and over, because each user phrased the question slightly differently.

Semantic caching is the fix, and the idea is almost obvious once you see it. A normal cache keys on the exact input: same string in, cached value out. That barely helps for natural language, where “what are your hours?” and “when are you open?” are the same question wearing different words. A semantic cache keys on meaning instead: if a new question is close enough to one you’ve answered before, you return the stored answer and skip the model entirely.

Why exact-match caching fails for LLMs

The instinct to cache LLM calls is right; the instinct to cache them like API responses is wrong. Human language defeats exact matching in ways that tank your hit rate.

  • Infinite phrasings. The same intent has a thousand surface forms. Exact-match keys treat every one as a brand-new request, so the cache almost never hits.
  • Typos and casing. “Cancel my order” and “cancel my oder” should hit the same entry; to a string comparison they’re different keys.
  • Word order. “hours on Sunday” and “Sunday hours” mean the same thing, but exact matching sees two unrelated keys.
  • Low hit rate, high cost. With exact keys your cache catches only literal repeats — a tiny fraction of traffic — and you keep paying for the rest.
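
To make the failure concrete, here is a small sketch of an exact-match cache with the usual cheap normalization (lowercasing, collapsing whitespace). The `normalize` and `lookup` helpers and the sample questions are illustrative assumptions, not taken from any particular library.

```python
# Exact-match cache with basic normalization: an illustrative sketch.
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace, about the most an exact key can do cheaply."""
    return " ".join(text.lower().split())

exact_cache = {normalize("Cancel my order"): "Here is how to cancel..."}

def lookup(question: str):
    """Return the cached answer, or None on a miss."""
    return exact_cache.get(normalize(question))

# Casing and stray spaces can be recovered by normalizing the key...
casing_hit = lookup("  cancel MY order ")
# ...but a typo, a reordering, or a paraphrase still produces a different key.
typo_hit = lookup("cancel my oder")
reorder_hit = lookup("my order, cancel")
paraphrase_hit = lookup("I want to stop my purchase")
```

Only the first lookup hits. Normalization rescues casing, but every other variation falls through to the model.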

How semantic caching works

A semantic cache sits in front of the model and works in embedding space. When a request comes in, you embed it and search your cache of past questions for the nearest one. If the closest match is within a similarity threshold you’ve set, you return its stored answer — no model call, near-zero cost, instant response. If nothing is close enough, you call the model, then store the new question-and-answer pair so the next person who asks it differently gets the cached version.
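
As a rough sketch of the "embed and search" step, the snippet below uses cosine similarity, which is a common choice but an assumption here, since the article doesn't name a metric. The three-dimensional vectors are invented stand-ins for real embeddings, which would come from an embedding model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for embeddings of previously answered questions.
cached = {
    "what are your hours?": [0.9, 0.1, 0.0],
    "how do I cancel?": [0.0, 0.2, 0.9],
}
incoming = [0.85, 0.2, 0.05]   # pretend this is the embedding of "when are you open?"

# Nearest-neighbour search by brute force: score every entry, keep the best.
nearest_question, nearest_score = max(
    ((q, cosine_similarity(incoming, v)) for q, v in cached.items()),
    key=lambda pair: pair[1],
)
```

With these toy vectors the nearest entry is the hours question, with a score close to 1, while the cancellation question scores far lower. A vector database does the same search at scale with an approximate index instead of a full scan.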

The whole system turns on one number: the similarity threshold. Set it too loose and you’ll serve the answer to “how do I cancel?” for a question about refunds; set it too tight and you’re back to almost never hitting. Tuning that threshold against real traffic is the actual work of semantic caching; the mechanism itself is a few lines.
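
One way to see the loose-versus-tight trade-off is to replay labeled examples at several thresholds. The sketch below assumes you have pairs of (similarity score, whether a reviewer judged the cached answer correct); the scores here are invented for illustration, and `evaluate` is a hypothetical helper.

```python
# Each pair: (similarity score, did the cached answer actually fit the question?).
# Invented values; in practice these come from logged and reviewed traffic.
labeled_pairs = [
    (0.98, True), (0.96, True), (0.94, True), (0.93, False),
    (0.91, True), (0.89, False), (0.86, True), (0.80, False),
]

def evaluate(threshold, pairs):
    """Return (hit_rate, wrong_hit_rate) if the cache served every pair >= threshold."""
    hits = [ok for score, ok in pairs if score >= threshold]
    hit_rate = len(hits) / len(pairs)
    wrong_rate = (hits.count(False) / len(hits)) if hits else 0.0
    return hit_rate, wrong_rate

# Sweep a few candidate thresholds and compare savings against wrong answers.
results = {t: evaluate(t, labeled_pairs) for t in (0.85, 0.90, 0.95)}
```

On this toy data, 0.85 hits on 87.5% of requests but about 29% of those hits are wrong, 0.90 hits on 62.5% with 20% wrong, and 0.95 hits on 25% with none wrong. Real traffic will give different numbers, but the curve has the same shape.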

A minimal semantic cache

Strip it to the essentials and a semantic cache is: embed, search, threshold, store. Here’s the core loop, with a list standing in for the real vector store:

python
# Semantic cache: reuse an answer when a question means the same thing.
# embed(), similarity(), and call_model() are placeholders for your embedding
# model, a vector similarity function, and your LLM client.
cache = []   # in production: a vector database

THRESHOLD = 0.92   # tune this against real traffic

def answer(question):
    q = embed(question)
    # Score every cached question once and keep the closest match.
    scored = [(similarity(q, c["vec"]), c) for c in cache]
    score, best = max(scored, key=lambda s: s[0], default=(0.0, None))
    if best is not None and score >= THRESHOLD:
        return best["answer"]          # cache hit: no model call
    result = call_model(question)      # miss: pay for it once
    cache.append({"vec": q, "answer": result})
    return result

That threshold of 0.92 is doing all the heavy lifting, and the right value depends entirely on your data — there’s no universal number. The discipline is to measure: log hits and misses, sample the hits to make sure they’re actually answering the question that was asked, and move the threshold until the savings are real and the wrong answers are rare.
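
The measuring discipline can start very small. The following is a minimal sketch, assuming an in-memory log; `record`, `sample_hits`, and `hit_rate` are hypothetical helpers, and the scores in the example traffic are invented.

```python
import random

hit_log = []                       # (incoming question, matched cached question, score)
stats = {"hits": 0, "misses": 0}

def record(question, matched_question, score, threshold):
    """Count the lookup and keep hits so a human can review them later."""
    if matched_question is not None and score >= threshold:
        stats["hits"] += 1
        hit_log.append((question, matched_question, score))
    else:
        stats["misses"] += 1

def sample_hits(k, seed=None):
    """Pick up to k logged hits at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(hit_log, min(k, len(hit_log)))

def hit_rate():
    """Fraction of lookups served from the cache so far."""
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 0.0

# Example traffic with invented scores.
record("when are you open?", "what are your hours?", 0.95, threshold=0.92)
record("can I get a refund?", "how do I cancel?", 0.88, threshold=0.92)
record("opening times?", "what are your hours?", 0.93, threshold=0.92)
to_review = sample_hits(2, seed=0)
```

Reviewing the sampled pairs side by side is what tells you whether a hit answered the question that was actually asked.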

A semantic cache is a bet that your users ask the same things in different words. They almost always do — which is why the feature you were about to kill for cost can usually be saved with a threshold and an embedding.

Where it bites

  • The threshold is everything. Too loose and you serve confidently wrong answers; too tight and you save nothing. Tune it on real traffic, not a guess.
  • Personalized answers. If the response depends on who’s asking, a shared cache leaks one user’s answer to another. Key the cache per user, or don’t cache those.
  • Stale entries. Cached answers age. When the underlying facts change, the cache keeps serving the old answer — add expiry, or invalidate on update.
  • Caching the wrong thing. Time-sensitive or one-off questions shouldn’t be cached at all. Decide what’s cacheable before you cache it.
  • No quality check on hits. A hit that returns a subtly wrong answer is worse than a miss. Sample your cache hits the way you’d sample model output.
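
Several of these pitfalls can be guarded against before the similarity search even runs. Here is a hedged sketch, assuming each entry carries a scope and a creation time; `Entry`, `is_cacheable`, `usable`, the one-day TTL, and the keyword list are all hypothetical choices for illustration.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Entry:
    vec: list
    answer: str
    scope: str                      # "shared", or a user id for personalized answers
    created: float = field(default_factory=time.time)

TTL_SECONDS = 24 * 60 * 60          # assumed expiry of one day; pick what fits your data

def is_cacheable(question: str) -> bool:
    """Crude, hypothetical filter: skip obviously time-sensitive questions."""
    time_words = ("today", "right now", "current", "latest")
    return not any(word in question.lower() for word in time_words)

def usable(entry: Entry, user_id: str, now: float) -> bool:
    """An entry may be served only if it is fresh and visible to this user."""
    fresh = (now - entry.created) <= TTL_SECONDS
    visible = entry.scope in ("shared", user_id)
    return fresh and visible

# Fixed clock value so the example is deterministic.
now = 1_000_000.0
shared = Entry([0.1], "We open at 9am.", scope="shared", created=now - 60)
private = Entry([0.2], "Your order ships Friday.", scope="user-42", created=now - 60)
stale = Entry([0.3], "Old pricing.", scope="shared", created=now - 2 * TTL_SECONDS)
```

Filtering candidates with `usable` before comparing vectors keeps one user's answer from reaching another and lets old entries age out. Invalidating on update would need a separate hook wherever the underlying facts change.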

Worth it?

For any app with repetitive questions (support, search, FAQs, internal tools), semantic caching is one of the highest-leverage changes you can make. Hit rates of thirty to sixty percent are common, which means roughly a third to more than half of your model calls simply stop happening, along with their cost and their latency. The feature that didn’t justify its bill often justifies it easily once it stops paying twice for the same answer.
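
A back-of-the-envelope calculation shows how a hit rate translates into spend. All figures below are invented for illustration (request volume, per-call prices, a 40% hit rate), and `monthly_cost` is a hypothetical helper. It also ignores the cost of running the vector store itself.

```python
def monthly_cost(calls, hit_rate, cost_per_call, embed_cost_per_call):
    """Estimated spend: every request pays for an embedding, only misses pay for the model."""
    misses = calls * (1 - hit_rate)
    return calls * embed_cost_per_call + misses * cost_per_call

# Invented numbers: 1M requests/month, $0.01 per model call, $0.0001 per embedding.
baseline = monthly_cost(1_000_000, 0.0, 0.01, 0.0)         # no cache, no embeddings
with_cache = monthly_cost(1_000_000, 0.40, 0.01, 0.0001)   # 40% hit rate
savings = baseline - with_cache
```

Under these assumptions the bill drops from about $10,000 to about $6,100 a month. The embedding overhead is small next to the avoided model calls, but it is not zero, so the cache only pays off when the hit rate is meaningfully above it.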

The mistake is treating cost as fixed — as the price of doing AI — when a large share of it is just the same answer generated again and again. Semantic caching turns that waste into a cache hit. It won’t help an app where every question is unique, but most apps aren’t like that. Most apps ask the same handful of things, endlessly, in slightly different words — and that’s exactly the pattern a semantic cache was built to catch.
