The AI Gateway You Already Built By Accident (And How to Design It On Purpose)

An AI gateway is not really an API gateway with a new logo. It is a money meter and a blast-radius controller for language models. Here is a mental model, a worked routing example, and the failure modes that only show up at 2am -- including the timeout trap and semantic-cache poisoning the docs gloss over.

Dhileep Kumar · Jun 9, 2026 · 7 min read

Here is the pattern nobody plans for. You ship one LLM feature with a helper function. It works. Someone adds a retry. Then a fallback model for when the primary is slow. Then per-team keys, a spending cap, a cache, and a log of every prompt so you can debug the weird outputs. Nine months later you are running an AI gateway. You just never named it, never drew it, and never decided on purpose where all those policies live.

Most explainers stop at 'a gateway is an API gateway for models.' That framing is true and almost useless, because it hides the two things that actually make an AI gateway hard. So let me give you a sharper mental model to design against.

It is a money meter bolted to a blast-radius controller

A normal API gateway routes requests between services you own. The cost of a request is basically fixed and small, and if a downstream service is down, that one feature degrades. An AI gateway is different on both axes, and both differences are the whole point.

First, every request costs real, variable money, metered by tokens you cannot see until after the model has already spent them. A single buggy loop that re-prompts a long document can quietly turn into a four-figure invoice line before anyone notices. So the gateway is, before anything else, a money meter: the one place that can attribute spend to a team, cap it, and stop the bleeding.
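
To make the money-meter idea concrete, here is a minimal sketch in Python, not a production billing system. The class name `SpendMeter`, the per-1k-token prices, and the in-memory storage are all assumptions for illustration; a real gateway would take prices from the provider's rate card and persist spend somewhere durable.

```python
from collections import defaultdict
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a team has already spent its monthly cap."""


@dataclass
class SpendMeter:
    # Hypothetical prices in USD per 1,000 tokens; real prices vary by provider and model.
    usd_per_1k_input: float
    usd_per_1k_output: float
    monthly_cap_usd: dict[str, float]
    spent_usd: dict[str, float] = field(default_factory=lambda: defaultdict(float))

    def check(self, team: str) -> None:
        """Call before forwarding a request. Teams with no configured cap are refused."""
        if self.spent_usd[team] >= self.monthly_cap_usd.get(team, 0.0):
            raise BudgetExceeded(f"{team} has reached its monthly cap")

    def record(self, team: str, input_tokens: int, output_tokens: int) -> float:
        """Call after the response, because token counts are only known once the model has run."""
        cost = (input_tokens / 1000) * self.usd_per_1k_input
        cost += (output_tokens / 1000) * self.usd_per_1k_output
        self.spent_usd[team] += cost
        return cost
```

Note the asymmetry: `check` runs before the call but `record` can only run after it, so a cap like this can be overshot by one request. That is the 'tokens you cannot see until after' problem in miniature.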

Second, most teams end up funneling every AI feature through one small set of providers. That concentration means a single provider outage, or a single bad key, can take down search, support replies, summaries, and onboarding all at once. So the gateway is also a blast-radius controller: the one place that can fail over, shed load, and keep a provider incident from becoming a company incident.

If you cannot point at the file where your AI spend cap, your fallback order, and your PII policy actually live, you do not have those policies. You have hopes wearing a lanyard.

Hold onto that framing, because it tells you what belongs in the gateway (anything about money, safety, or what-happens-when-a-provider-dies) and what does not (business logic, which stays in your app). Most gateway messes come from putting the wrong things in the box.

A worked example: routing you can read

Concretely, a request should walk through the gateway as a short pipeline: authenticate the caller with a gateway key, check their quota, look for a cache hit, run any input guardrail, route to a provider, stream the response back while it counts tokens and latency, and fail over to a backup if the primary breaks. The value is that this is legible in one place. Here is a routing config for a support-summary feature that shows what that buys you.

```yaml
routes:
  - name: support-summary
    match: { team: support, task: summarize }
    primary:  { provider: anthropic, model: claude-haiku }
    fallback: { provider: openai,    model: gpt-mini }
    # cache identical prompts for 1h; personalized routes set ttl 0
    cache:    { mode: exact, ttl_seconds: 3600 }
    guardrails: [redact_pii_input]
    limits:   { rpm: 120, monthly_usd: 400 }
    # gateway must give up BEFORE the caller does -- see the timeout trap
    timeout_ms: 8000
```

Read what changed. Swapping the default summary model for the whole company is a one-line edit here, not a twelve-service deploy. Turning on PII redaction for the support team never touches the support team's code. The cache TTL sits right next to the route it governs, so the policy is visible instead of buried in some helper three repos away. And the spend cap is a number a finance person could actually find. That legibility is the deliverable, not the routing itself.
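
For a feel of what sits behind a route like that, here is a rough Python sketch of the middle of the pipeline: guardrail, exact-match cache, primary, fallback. Everything in it is an assumption for illustration: the `Route` class, the toy email-only `redact_pii_input`, and providers modeled as plain callables. It deliberately leaves out authentication, quotas, streaming, and timeouts.

```python
import hashlib
import re
import time
from typing import Callable

# A provider is modeled as a plain callable: prompt in, text out.
Provider = Callable[[str], str]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_pii_input(prompt: str) -> str:
    """Toy guardrail: masks email addresses only."""
    return EMAIL.sub("[email]", prompt)


class Route:
    def __init__(self, primary: Provider, fallback: Provider, ttl_seconds: int):
        self.primary = primary
        self.fallback = fallback
        self.ttl_seconds = ttl_seconds
        self._cache: dict[str, tuple[float, str]] = {}

    def handle(self, prompt: str) -> str:
        prompt = redact_pii_input(prompt)                  # input guardrail
        key = hashlib.sha256(prompt.encode()).hexdigest()  # exact-match cache key
        hit = self._cache.get(key)
        if hit and time.monotonic() - hit[0] < self.ttl_seconds:
            return hit[1]
        try:
            answer = self.primary(prompt)
        except Exception:
            answer = self.fallback(prompt)                 # fail over on any primary error
        if self.ttl_seconds > 0:                           # ttl 0 means never cache
            self._cache[key] = (time.monotonic(), answer)
        return answer
```

The cache key is computed after redaction, so two prompts that differ only in the redacted email share an entry. Whether that is what you want is exactly the kind of policy decision that should be visible on the route.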

Build, buy, or skip: an honest decision

You almost never need to write this from scratch. Open-source proxies like LiteLLM, gateways from Kong and Cloudflare, and hosted layers like Portkey already implement most of the pipeline. The honest decision has three branches, not two, and 'skip' is a real answer people forget.

- Skip it (for now) if you have exactly one service calling one model with one key. A gateway adds a hop and an ops burden to solve problems you do not have yet. Add it the day you get your second caller, second key, or first spend surprise, not before.
- Adopt open source or buy if your needs are standard: a few providers, ordinary rate limits, basic logging and fallback. Self-hosting an open-source proxy keeps the extra hop local to your network. As a rough rule of thumb, a self-hosted proxy adds anywhere from a few milliseconds to a few tens of milliseconds per call, while a SaaS gateway also pays a network round-trip to the vendor. That is fine for chat, but potentially not for a latency-critical path.
- Build a thin layer yourself only when the routing logic is genuinely yours: bespoke cost rules, domain-specific guardrails, or a private model fleet a generic proxy does not understand. Even then, wrap an existing proxy rather than reinventing streaming, retries, and token accounting.

One decision survives all three branches: provider keys live in the gateway and nowhere else. That single rule pays for itself the first time you have to rotate a leaked key, because it is a config change in one place instead of an archaeology dig across a dozen repos.
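
One way to picture the keys rule, as a hedged Python sketch: callers hold gateway-issued keys that only identify a team, and the provider secret is looked up inside the gateway process. The key values, the `GATEWAY_KEYS` table, and the `<PROVIDER>_API_KEY` environment naming are all hypothetical; a real deployment would likely use a secrets manager.

```python
import hmac
import os
from typing import Mapping, Optional

# Hypothetical gateway-issued keys, mapped to the team each one identifies.
GATEWAY_KEYS = {"gw_support_123": "support", "gw_search_456": "search"}


def authenticate(gateway_key: str) -> Optional[str]:
    """Return the team for a gateway key, or None if the key is unknown."""
    for known, team in GATEWAY_KEYS.items():
        # compare_digest avoids leaking how much of the key matched via timing.
        if hmac.compare_digest(known.encode(), gateway_key.encode()):
            return team
    return None


def provider_key(provider: str, env: Mapping[str, str] = os.environ) -> str:
    """Provider secrets are read only here, inside the gateway."""
    return env[f"{provider.upper()}_API_KEY"]
```

With this split, rotating a leaked provider key changes one environment value, and revoking a team's access removes one entry from the gateway's own table.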

The failure modes that only show up at 2am

The generic advice ('do not make it a single point of failure') is correct but toothless. Here are the specific traps, in the order they tend to actually hurt teams, with what the quickstarts skip.

The timeout ordering trap. If your gateway's timeout is longer than or equal to the caller's timeout, failover is a lie. The caller gives up and retries while the gateway is still patiently waiting on the dead primary, so you pay for the slow attempt, never reach the fallback, and get a duplicate request on top. The gateway's total budget must be strictly shorter than the caller's, with room left for the fallback attempt inside it. Almost no quickstart mentions this, and it is the most common reason 'we have fallbacks' does not survive contact with a real outage.
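
The ordering rule is easy to check mechanically. Here is a small Python sketch that refuses a bad configuration and splits the gateway's budget across attempts. The function name and the even split are assumptions; you might reasonably give the primary a larger share.

```python
def attempt_budgets_ms(caller_timeout_ms: int, gateway_timeout_ms: int, attempts: int = 2) -> list[int]:
    """Split the gateway's total budget across the primary and fallback attempts.

    Raises ValueError if the gateway would outlast the caller, since the caller
    would then give up before any fallback could answer.
    """
    if attempts < 1:
        raise ValueError("need at least one attempt")
    if gateway_timeout_ms >= caller_timeout_ms:
        raise ValueError("gateway timeout must be strictly shorter than the caller's timeout")
    per_attempt = gateway_timeout_ms // attempts
    return [per_attempt] * attempts
```

With the route's `timeout_ms: 8000` and a hypothetical caller timeout of 10 seconds, `attempt_budgets_ms(10_000, 8_000)` gives each of the two attempts 4 seconds, which leaves the fallback a real chance to answer before the caller walks away.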

Streaming versus failover. The clean 'retry on another provider' story quietly breaks once you stream tokens. The moment you have flushed the first token of a bad response to the user, you cannot transparently fail over -- they have already seen half an answer. Decide up front: buffer the first chunk so you can still switch (which adds a little to time-to-first-token), or accept that failover only protects requests that fail before streaming starts. There is no free version.
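
The 'buffer the first chunk' option can be sketched in a few lines of Python. Providers here are hypothetical functions that return an iterator of text chunks; real streaming clients have their own interfaces, so treat this as an illustration of the trade-off rather than a drop-in.

```python
from typing import Callable, Iterator

# A streaming provider is modeled as a function returning an iterator of text chunks.
StreamingProvider = Callable[[str], Iterator[str]]


def stream_with_failover(
    prompt: str, primary: StreamingProvider, fallback: StreamingProvider
) -> Iterator[str]:
    """Hold back the first chunk so failover is still possible before anything is flushed."""
    try:
        stream = primary(prompt)
        first = next(stream)  # a failure here is invisible to the user
    except Exception:
        # Nothing has been sent yet, so switching providers is transparent.
        # An empty primary stream raises StopIteration, which also lands here.
        yield from fallback(prompt)
        return
    yield first
    # From here on the user has seen output, so a mid-stream error can only propagate.
    yield from stream
```

The cost is visible in the code: nothing reaches the user until the primary has produced its first chunk, and once it has, there is no switching back.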

Semantic-cache poisoning. Semantic caching (matching near-identical prompts via embeddings, usually with a vector store like Redis or Qdrant) sounds like free money and is the sharpest edge in the box. Set the similarity threshold too loose and 'summarize this contract' returns a cached summary of a different contract, because the embeddings were close enough. Now you have served a confidently wrong, possibly cross-tenant answer, and it is cached, so you will serve it again. Exact-match caching fails safe; semantic caching fails silent. Use it only where a near-miss is harmless, and never share a semantic cache across tenants.
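
Here is a toy Python sketch of the two safeguards: a strict similarity threshold and one cache bucket per tenant. It uses plain lists and a linear scan in place of a real vector store, and the class name, threshold values, and two-dimensional 'embeddings' are all made up for illustration.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TenantSemanticCache:
    """Toy semantic cache: one bucket per tenant, linear scan, fixed threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self._entries: dict[str, list[tuple[list[float], str]]] = {}

    def put(self, tenant: str, embedding: list[float], answer: str) -> None:
        self._entries.setdefault(tenant, []).append((embedding, answer))

    def get(self, tenant: str, embedding: list[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        # Only this tenant's bucket is searched; other tenants are never visible.
        for cached_embedding, answer in self._entries.get(tenant, []):
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

Two prompts whose embeddings have a cosine similarity of 0.8 would hit a cache with `threshold=0.75` and miss one with `threshold=0.95`. The right number depends on your embedding model and your tolerance for a wrong answer, so measure it on real prompts rather than copying a default.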

- Caching anything personalized or sensitive by default. Set ttl 0 per route unless you can prove the response is identical for every caller.
- Logging full prompts and responses with no redaction, then discovering months of customer PII sitting in your logs. Redact before you store, not before you show it to auditors.
- Piling synchronous stages (guardrail, moderation, embedding lookup) into the hot path until the gateway itself is the latency. Measure the overhead you add, keep the hot path short, and push what you can to async.
- Forgetting the gateway is now tier-0. It sits on the critical path of every AI feature you ship, so it needs the redundancy, alerting, and on-call story of your most important service, not the monitoring of a side project.
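
'Redact before you store' is the easiest of these to show in code. The Python sketch below builds the log line from already-redacted text, so the raw prompt never reaches the log sink. The two regexes are illustrative only and will miss plenty; real PII detection needs a proper library or service, and the function names are hypothetical.

```python
import json
import re

# Illustrative patterns only; real PII detection needs far more than two regexes.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each match with a bracketed label such as [email]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def log_record(team: str, prompt: str, response: str) -> str:
    """Build the JSON line that gets stored; redaction happens before serialization."""
    return json.dumps({"team": team, "prompt": redact(prompt), "response": redact(response)})
```

Because redaction happens inside `log_record`, there is no code path that writes the unredacted prompt, which is a stronger guarantee than scrubbing logs after the fact.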

None of these show up in a demo. They show up in month five, at 2am, which is exactly why they are worth designing against on day one.

The bottom line

An AI gateway is not a product you buy once and forget. It is where your organization's answers to the boring, decisive questions physically live: what we spend, what we log, what we allow, and what happens when a provider falls over. Build it on purpose and early, even if version one is fifty lines and a YAML file in front of a single model. The alternative is not 'no gateway.' It is the accidental one, undocumented and unmetered with keys everywhere, that you will be reverse-engineering under an incident a year from now.
