Deploying LLM Apps on GKE: The Bill Is the Design Document

Most GKE-for-LLM guides hand you manifests. None of them tell you the one number that should drive every decision: cost per thousand tokens served. Here's a mental model that starts from the bill and works backward — including the two mistakes that turn a $300/month deploy into a $4,000 one.

Dhileep Kumar · Jun 9, 2026 · 8 min read

Search "deploy LLM on GKE" and you get a dozen near-identical walkthroughs: create a cluster, add a GPU node pool, apply a Deployment, expose a Service, done. They are all technically correct and all miss the point. The manifest was never the hard part. The hard part is that a Kubernetes cluster will run a wasteful, fragile inference service exactly as faithfully as a tight one — and it will bill you for the difference every hour, whether or not a single user shows up.

So this post inverts the usual order. Instead of starting from YAML and hoping the economics work out, we start from one number and let it dictate the architecture. That number is cost per thousand tokens served — measured against your real traffic, not the GPU's spec sheet. Once you anchor on it, most of the "which node pool, which autoscaler, which timeout" questions answer themselves.

The mental model: a GPU is a taxi with the meter always running

Here is the framing that fixes most first-deploy mistakes. A CPU web service is like electricity — you pay roughly for what you use, and idle costs round to nothing. A GPU node is a taxi you've hired: the meter runs the entire time the car is at the curb, whether you're riding or the car is parked outside your house overnight. The instance is billed by the second of wall-clock uptime, not by the second of actual token generation.

That single asymmetry explains almost everything downstream. Scale-to-zero matters because a parked taxi is pure waste. Cold starts matter because getting a new taxi takes minutes, not seconds. Batching matters because one taxi carrying four passengers is four times cheaper per head. Once you internalize "the meter is always running," you stop optimizing for peak throughput and start optimizing for meter-off time and passengers-per-trip. Those are different goals, and the generic guides optimize for neither.

First decision: are you even renting a taxi?

Before any of this applies, split your deployment into the only two categories that matter. This is the fork every honest GKE-for-LLM guide should open with, because the two paths share almost nothing except the word "cluster."

Calling a hosted model (Anthropic, OpenAI, Vertex AI): your pod is stateless and CPU-only. It is a normal web app that happens to make outbound HTTPS calls. There is no taxi — you're paying the model provider per token and paying Google cents for CPU. The hard problems are timeouts, retries, and rate-limit backoff, not hardware.
Self-hosting open weights (Llama, Mistral, Qwen, and friends): now you're renting taxis. GPU nodes, multi-gigabyte weight files, minute-long warm-ups, and VRAM arithmetic that has to be right or the pod OOMs on the first long prompt.

The non-obvious part: these can and often should coexist in one cluster. A stateless gateway Deployment fans out to hosted models for the 90% of traffic that's cheap and easy, while a single GPU-backed Deployment serves the one fine-tuned model you actually need to own for cost, latency, or data-residency reasons. Trying to self-host everything on day one is the most expensive rookie move there is.
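For the hosted path, the retry and rate-limit handling mentioned above can be sketched in a few lines. This is a minimal illustration, not any provider's SDK: `RateLimited` and `call_with_backoff` are hypothetical names, and `send` stands in for whatever function makes your outbound request with its own per-request timeout.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error your client code raises on an HTTP 429."""


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call send() and retry rate-limit errors with capped exponential backoff.

    send is any zero-argument callable that performs the outbound request
    and raises RateLimited when the provider asks you to slow down.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay,
            # so many pods retrying at once do not hit the provider in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists only so the delay can be swapped out in tests; in a real gateway you would also cap total time spent retrying so it stays inside your upstream timeout.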

Deploying an LLM on GKE isn't really a Kubernetes problem. It's a capacity-planning problem wearing a Kubernetes costume, and the cluster is just the place the invoice gets generated.

A worked example: pricing the deploy before you write YAML

Let's walk a realistic scenario end to end, because the arithmetic is where the intuition lives. Say you're self-hosting a 7B-class model to serve an internal support assistant. Illustrative, round numbers only — plug in your own before you commit anything.

Suppose one GPU node rents for roughly $3/hour and your model, batched, can serve about 20 requests per second at an average of 500 output tokens each. If you pin one node running 24/7 "to be safe," that's about $2,160/month — and here's the trap: an internal tool is idle maybe 16 hours a day. You're paying full fare for a taxi parked in the garage two-thirds of the time. Cut it to a 10-hour workday with scale-to-zero overnight and weekends off, and the same node drops toward roughly $650/month. Same code, same manifest sans one autoscaling setting, roughly a third of the bill.
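The arithmetic behind those two monthly figures fits in a few lines. This is a sketch using the illustrative $3/hour rate from the text; the 30-day month and the 22 working days are assumptions, so swap in your own calendar and pricing.

```python
HOURLY_RATE = 3.00  # USD per GPU node-hour (illustrative figure from the text)


def monthly_cost(hours_per_day, days_per_month, hourly_rate=HOURLY_RATE):
    """Cost of one node billed for wall-clock uptime, whether or not it serves traffic."""
    return hours_per_day * days_per_month * hourly_rate


always_on = monthly_cost(24, 30)  # pinned 24/7, assuming a 30-day month
workday = monthly_cost(10, 22)    # 10 h/day, assuming about 22 working days

print(f"always on:    ${always_on:,.0f}/month")
print(f"workday only: ${workday:,.0f}/month")
print(f"ratio:        {workday / always_on:.2f}")
```

With these inputs the workday schedule comes to $660 against $2,160, which is where the "roughly a third" comes from.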

Now the second lever: passengers per trip. If your traffic actually arrives one request at a time because you never enabled continuous batching in the serving runtime, that $3/hour buys you a fraction of the throughput. The GPU sits 80% idle mid-trip. The fix isn't a bigger node — it's a serving engine (vLLM, TGI, or similar) that packs concurrent requests into one forward pass. Same taxi, four passengers, one-quarter the cost per answer.
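To see the passengers-per-trip effect in the headline metric, here is a rough sketch that converts throughput into cost per thousand output tokens. The 20 requests/second and 500 tokens come from the scenario above; the unbatched figure of 5 requests/second is an assumption chosen to match the "four passengers" ratio, not a measurement.

```python
def cost_per_1k_tokens(hourly_rate, requests_per_second, tokens_per_request):
    """USD per 1,000 output tokens for a node kept busy at the given rate."""
    tokens_per_hour = requests_per_second * tokens_per_request * 3600
    return hourly_rate / tokens_per_hour * 1000


batched = cost_per_1k_tokens(3.00, 20, 500)   # continuous batching enabled
unbatched = cost_per_1k_tokens(3.00, 5, 500)  # assumed: one quarter of the throughput

print(f"batched:   ${batched:.6f} per 1k tokens")
print(f"unbatched: ${unbatched:.6f} per 1k tokens")
print(f"penalty:   {unbatched / batched:.1f}x")
```

Note that this only describes hours when the node is actually busy. Any idle, billed hour pushes the real number up, which is why the two levers multiply rather than add.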

Notice what just happened: the two biggest cost levers — meter-off time and passengers-per-trip — are both configuration, not hardware. That's the whole thesis. You defended the bill before you ever argued about GPU models.

The node pool and probe settings that actually move the needle

For the self-hosted path, two config choices dominate reliability and cost far more than anything else in your manifest: a dedicated, scale-to-zero GPU node pool, and a readiness probe honest about model load time. Here's a node pool created to go to zero, and a probe that refuses traffic until the weights are actually resident in VRAM.

bash

# Dedicated GPU pool that parks the taxi (min-nodes 0) when nothing runs.
# Taint it so ONLY inference pods land here and CPU workloads never wake a GPU node.
gcloud container node-pools create gpu-inference \
  --cluster my-llm-cluster \
  --machine-type g2-standard-8 \
  --accelerator type=nvidia-l4,count=1 \
  --enable-autoscaling --min-nodes 0 --max-nodes 4 \
  --node-taints nvidia.com/gpu=present:NoSchedule \
  --spot   # spot/preemptible = big discount; only if your service tolerates eviction

The readiness probe is the part of the manifest that earns its keep. Loading a multi-gigabyte model into VRAM can take 60 to 120 seconds. Without a probe that waits for it, Kubernetes marks the pod Ready as soon as its container is running, routes live traffic to it, and every user during a scale-up gets a connection to a model that hasn't finished loading. The generous initialDelaySeconds below isn't laziness; it's the difference between a clean rollout and a flapping one.

yaml

readinessProbe:
  httpGet:
    path: /health   # this endpoint must return 200 ONLY after weights are in VRAM
    port: 8080
  # Do NOT let K8s send traffic during the 60-120s weight load.
  initialDelaySeconds: 90
  periodSeconds: 10
  failureThreshold: 3
startupProbe:
  # Guards slow cold starts: gives up to 5 min before the pod is declared failed.
  httpGet: { path: /health, port: 8080 }
  periodSeconds: 10
  failureThreshold: 30

The pairing matters: the startupProbe covers the long, one-time weight load, so a liveness probe (if you add one) can't kill a healthy-but-slow pod mid-load, while the readinessProbe gates traffic. Kubernetes holds the other probes back until the startup probe has succeeded. Most single-probe examples online conflate these and produce pods that either get killed while loading or receive traffic before they're ready.
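Both probes are only as honest as the /health endpoint behind them. Serving engines typically ship their own, but if you wrap the model in your own server, a sketch of the contract looks like this. It uses only the Python standard library; `weights_loaded`, `load_weights`, and `serve` are hypothetical names, and the real model load is left as a placeholder.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

weights_loaded = threading.Event()  # set once the model is resident in VRAM


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # 503 makes the probe fail, so the pod receives no traffic while loading.
        status = 200 if weights_loaded.is_set() else 503
        self.send_response(status)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep probe traffic out of the logs


def load_weights():
    """Placeholder for the real, slow model load (hypothetical)."""
    weights_loaded.set()


def serve(port=8080):
    """Start the health server in a background thread and return it."""
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A process would call `serve()` first and `load_weights()` afterwards, so the endpoint answers 503 for the whole load window and flips to 200 only once the model can take requests.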

What the docs don't tell you: the five that bite in production

Autoscale on CPU and you've already lost. CPU utilization is meaningless for a GPU-bound service — the GPU can be pegged at 100% while the CPU naps. Scale on requests-in-flight, queue depth, or GPU utilization via a custom or external metric. HPA on CPU will happily under-provision through a spike and over-provision when you're idle.
Cold starts are measured in minutes, not seconds. A scale-from-zero event chains: provision a node (1-3 min), pull a multi-GB image, then load weights (1-2 min). If your autoscaler assumes seconds, it falls behind every spike. Keep a warm minimum of one during business hours if latency SLOs are tight; the taxi-in-the-garage math still says park it overnight.
Timeouts are silently capped upstream. A 90-second generation will be severed at 30 seconds by a default GKE Ingress / load-balancer backend timeout, and you'll blame the model. Raise backend and ingress timeouts deliberately, and set them consistently across every hop, or streamed responses die mid-sentence.
Weights baked into the image tax every deploy. A 12 GB image makes every rollout and every scale-up crawl while the node drags it over the network. Bake only code; pull weights at startup from a bucket, a persistent volume, or GCS FUSE. Your image stays under a gigabyte and deploys stay fast.
Spot GPU eviction is a feature, not a failure. Spot nodes are dramatically cheaper but can be reclaimed with ~30 seconds notice. Great for batch and async work, quietly catastrophic for a user-facing chat if you have no on-demand fallback pool. Decide which workload can eat an eviction before you enable spot, not after.
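The first two items can be put into numbers. The sketch below is a simplification under stated assumptions: `desired_replicas` follows the proportional rule the Horizontal Pod Autoscaler documents (total load divided by a per-pod target, rounded up), and `cold_start_seconds` just adds the stages of a scale-from-zero event. The function names, the target of 8 in-flight requests per pod, and the 60-second image pull are all illustrative, not values from the source.

```python
import math


def desired_replicas(in_flight_total, target_per_pod, min_replicas=1, max_replicas=4):
    """Replica count that keeps average in-flight requests per pod at the target."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(in_flight_total / target_per_pod)
    # Clamp to the configured bounds, as an autoscaler would.
    return max(min_replicas, min(max_replicas, wanted))


def cold_start_seconds(node_provision, image_pull, weight_load):
    """Worst-case scale-from-zero delay: the stages run one after another."""
    return node_provision + image_pull + weight_load


# 20 requests in flight, each pod comfortable with 8 (assumed target).
print(desired_replicas(20, 8))
# Upper end of the ranges in the text, plus an assumed 60 s image pull.
print(cold_start_seconds(node_provision=180, image_pull=60, weight_load=120))
```

If the second number is larger than the spike you are trying to absorb, no autoscaler setting will save you, and the warm minimum of one during business hours is the cheaper fix.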

None of these are in the quickstart because the quickstart's job is to get you a green pod, not a defensible invoice. They surface on week three, usually as a Slack message from whoever owns the cloud budget.

The one dashboard that changes behavior

If you take one operational habit from this: put cost per thousand tokens served on a dashboard, next to p95 latency and error rate, and look at it daily. Latency and errors tell you if the service is healthy. Cost-per-token tells you if it's sane. A deploy can be fast, reliable, and quietly hemorrhaging money because a node never scaled to zero or batching silently regressed after a config change — and none of your health dashboards will say a word.
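For the dashboard itself, the metric has to come from what was billed and what was served, not from the spec sheet. A minimal sketch, assuming you can already export billed node-seconds and a served-token counter for a time window (both inputs and the function name are hypothetical):

```python
def observed_cost_per_1k_tokens(billed_node_seconds, hourly_rate, tokens_served):
    """Cost per 1,000 tokens over a window, from billed uptime and real traffic."""
    spend = billed_node_seconds / 3600 * hourly_rate
    if tokens_served == 0:
        # Paying for uptime with nothing served: the parked-taxi case.
        return float("inf") if spend > 0 else 0.0
    return spend / tokens_served * 1000


# One billed node-hour at the illustrative $3 rate, one million tokens served.
print(observed_cost_per_1k_tokens(3600, 3.00, 1_000_000))
```

An infinite value is deliberate: a window with spend and no tokens is exactly the failure the health dashboards stay silent about, so it should be impossible to miss.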

When to reach for self-hosted GPUs on GKE at all: only when a real number forces it. Hosted models behind a CPU-only Deployment are the correct default, and you graduate to owning weights when per-token cost at your volume, tail latency, or a data-residency requirement makes the taxi cheaper than the ride. When NOT to: a spiky, low-volume, or early-stage product where you'll spend more engineering hours defending the GPU bill than you'd ever pay a hosted API. Kubernetes makes an LLM manageable: repeatable deploys, sane rollbacks, autoscaling you can reason about. It never makes one cheap. That part is still on you, and the bill is where you'll find out whether you did it.
