Deploying LLM Apps on GKE, Step by Step

Dhileep Kumar · 8 min read
There’s a wide, quiet gap between an LLM app that works on your laptop and one that survives contact with real users on Kubernetes. The demo is a single process with your API key in an environment variable. Production is replicas, autoscaling, secrets, health checks, GPU scheduling, and a bill that can quietly triple overnight. Google Kubernetes Engine (GKE) closes much of that gap, but only if you know which parts it solves for you and which it leaves to you.

GKE gives you a managed control plane, autoscaling node pools, GPU support, and the usual Kubernetes primitives. What it doesn’t give you is judgment about how an LLM workload differs from a normal web service — and that difference is where most first deployments go wrong.

First, decide what you’re actually deploying

Before you touch a manifest, answer one question, because it changes everything downstream: are you calling a hosted model, or serving your own weights? These are two completely different deployments that happen to share a cluster.

  • Calling a hosted model (OpenAI, Anthropic, Vertex AI): your service is stateless and CPU-only. It’s basically a web app that makes outbound HTTPS calls — easy to scale, cheap to run, and the hard parts are timeouts and rate limits, not hardware.
  • Self-hosting open weights (Llama, Mistral, and friends): now you need GPU nodes, gigabytes of model files, long warm-up times, and careful memory math. This is the expensive, interesting path.
  • Most teams start with the first and only move to the second for cost, privacy, or latency reasons — not for fun.
  • You can mix both in one cluster: a stateless gateway calling hosted models, plus a GPU-backed service for the one model you self-host.
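For the hosted-model path, the whole deployment can be as small as the sketch below. It is illustrative rather than a reference manifest: the name llm-gateway, the image path, the PROVIDER_API_KEY variable, and the llm-provider-key Secret are all assumptions made up for this example. The point is what is missing (no GPU request, no toleration) and that the API key comes from a Kubernetes Secret instead of being baked into the image.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: llm-gateway }              # hypothetical name
spec:
  replicas: 2                                # stateless, so replicas are cheap
  selector: { matchLabels: { app: llm-gateway } }
  template:
    metadata: { labels: { app: llm-gateway } }
    spec:
      containers:
        - name: gateway
          image: us-docker.pkg.dev/my-proj/llm/gateway:0.1.0   # hypothetical image
          env:
            - name: PROVIDER_API_KEY         # hypothetical variable name
              valueFrom:
                secretKeyRef: { name: llm-provider-key, key: api-key }
          resources:
            requests: { cpu: 250m, memory: 256Mi }   # placeholder sizing, CPU only
          ports:
            - containerPort: 8080
```

The Secret has to exist before the pods start; one way to create it is `kubectl create secret generic llm-provider-key --from-literal=api-key=...`. The CPU and memory requests are placeholders to size against your own traffic.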

The cluster setup that matters

For the self-hosted path, the node pool is the decision that dominates your bill and your reliability. You want GPUs on their own pool, separate from your CPU workloads, with autoscaling that can go to zero when nothing is running. Keeping a GPU node idle overnight is the single most common way teams set money on fire.

```bash
# A GPU node pool that scales to zero when idle.
# In a regional cluster, --min-nodes and --max-nodes apply per zone,
# so --max-nodes 4 can mean more than four GPU nodes in total.
gcloud container node-pools create gpu-pool \
  --cluster llm-cluster \
  --region us-central1 \
  --machine-type g2-standard-8 \
  --accelerator type=nvidia-l4,count=1 \
  --enable-autoscaling --min-nodes 0 --max-nodes 4 \
  --node-taints nvidia.com/gpu=present:NoSchedule

# Install the NVIDIA drivers via Google's managed DaemonSet.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```

Containerize the inference service

The container itself is mostly normal, with two LLM-specific wrinkles: the image is big because the runtime and dependencies are heavy, and the model weights should almost never live inside the image. Bake the code, mount or download the weights at startup, and your deploys stay fast while your image stays sane.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: llm-inference }
spec:
  replicas: 1
  selector: { matchLabels: { app: llm-inference } }
  template:
    metadata: { labels: { app: llm-inference } }
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
      containers:
        - name: server
          image: us-docker.pkg.dev/my-proj/llm/server:1.4.0
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8080
          readinessProbe:          # don't send traffic until weights are loaded
            httpGet: { path: /healthz, port: 8080 }
            initialDelaySeconds: 120
            periodSeconds: 10
```

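The readiness probe is the line that earns its keep. Loading a multi-gigabyte model can take a minute or two, and without a probe that waits for it, Kubernetes routes traffic to a pod that isn’t ready and your users see errors during every scale-up. This only works if /healthz keeps failing until the weights are actually in memory. The generous initial delay isn’t laziness: it skips probes that are certain to fail while the model loads. If you also add a liveness probe, give it the same patience (or put a startup probe in front of it), because a liveness check that fires mid-load restarts the pod, and that is the difference between a clean rollout and a flapping one.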
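Kubernetes also offers a startup probe, which is one way to express "be patient while loading, strict afterwards" without a fixed delay. The fragment below is a sketch of how the probes on the container above might look with one added. The thresholds are illustrative assumptions, and it assumes /healthz returns success only once the model is loaded.

```yaml
# Fragment of the "server" container spec; values are illustrative assumptions.
startupProbe:                  # readiness and liveness checks wait until this succeeds
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
  failureThreshold: 30         # up to 30 x 10s = 300s for weights to load
readinessProbe:                # after startup, gate traffic on the same endpoint
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
livenessProbe:                 # restart only if the server stops answering later
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 20
  failureThreshold: 3
```

The trade-off against a fixed initialDelaySeconds is that a pod whose weights load in 40 seconds becomes ready at about 40 seconds instead of waiting out the full delay, while a slow load still gets the whole 300-second budget before the container is restarted.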

The things that bite in production

  • Cold starts. A new GPU pod can take minutes to come up: the autoscaler has to provision a node, the node has to pull the image, and the server has to load weights. Autoscaling that assumes seconds will fall behind under a spike.
  • Scaling on the wrong metric. CPU usage tells you almost nothing about an LLM service. Scale on requests in flight, queue depth, or GPU utilization instead.
  • GPU cost. An idle accelerator bills at the same rate as a busy one. Scale-to-zero and aggressive consolidation are not optional if you want to stay solvent.
  • Timeouts everywhere. Long generations outlast default ingress and load-balancer timeouts (the backend service behind a GKE Ingress defaults to 30 seconds); raise them deliberately or watch requests get cut off mid-stream.
  • Weights in the image. A 10 GB image makes every deploy crawl and every scale-up slower. Pull weights at startup from a bucket or a persistent volume instead.
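As a rough illustration of the last point, one common pattern is an init container that copies weights from a bucket into a scratch volume before the server starts. This is a sketch under assumptions, not a prescribed layout: the bucket gs://my-model-bucket, the llama-8b directory, and the MODEL_DIR variable are hypothetical, and it assumes the pod's service account can already read the bucket (for example through Workload Identity).

```yaml
# Fragment of the pod template (spec.template.spec of the Deployment).
# Bucket, paths, and MODEL_DIR are hypothetical placeholders.
volumes:
  - name: weights
    emptyDir: {}                 # node-local scratch, refilled on every pod start
initContainers:
  - name: fetch-weights
    image: gcr.io/google.com/cloudsdktool/google-cloud-cli:slim
    command: ["gcloud", "storage", "cp", "--recursive",
              "gs://my-model-bucket/llama-8b", "/models/"]
    volumeMounts:
      - { name: weights, mountPath: /models }
containers:
  - name: server
    image: us-docker.pkg.dev/my-proj/llm/server:1.4.0
    env:
      - { name: MODEL_DIR, value: /models/llama-8b }   # where the server would look
    volumeMounts:
      - { name: weights, mountPath: /models, readOnly: true }
    # GPU limits, ports, and probes stay as in the Deployment above.
```

The image stays small and deploys stay fast, but every new pod pays the download again, so this shortens image pulls rather than cold starts. A persistent volume that already holds the weights avoids the repeated copy, at the price of managing that volume.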

Kubernetes doesn’t make a large language model cheap or fast. It makes it manageable — repeatable deploys, sane rollbacks, and autoscaling you can reason about. Confusing those two is how the first GKE bill becomes a meeting with finance.

Scaling without lighting money on fire

  1. Start on hosted models and a CPU-only deployment. Only move to self-hosted GPUs when a real number — cost, latency, or privacy — forces it.
  2. Put GPUs in their own autoscaling pool with a minimum of zero, so idle time costs nothing.
  3. Autoscale on a metric that reflects load — in-flight requests or queue length — not CPU.
  4. Cache aggressively and reuse embeddings; the cheapest inference is the one you don’t run.
  5. Watch cost per request as a first-class dashboard, next to latency and error rate. If you can’t see it, you can’t control it.
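Step 3 might look roughly like the HorizontalPodAutoscaler below. Treat it as a sketch: the metric name inflight_requests and the target of 4 per pod are assumptions, and a Pods metric like this is only visible to the autoscaler if the service exports it and a custom metrics adapter is installed in the cluster.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: llm-inference }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric: { name: inflight_requests }              # hypothetical custom metric
        target: { type: AverageValue, averageValue: "4" } # assumed per-pod capacity
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 min of low load; cold starts are slow
```

One consequence worth noticing: by default an HPA does not go below one replica, so with this manifest one GPU pod and its node stay up. The node pool only reaches zero when the Deployment itself is scaled to zero by some other means, so step 2 and step 3 have to be designed together.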

Deploying an LLM app on GKE isn’t really a Kubernetes problem; it’s a discipline problem wearing a Kubernetes costume. The platform will happily run a wasteful, fragile service exactly as faithfully as a tight one. The teams that do this well aren’t the ones with the cleverest manifests — they’re the ones who decided early what they were deploying, scaled on the right signal, and treated the GPU bill as a number to defend rather than discover.
