LLM Observability: Tracing What Your AI Does in Production
You shipped the LLM feature, the demo worked, and now it’s a black box serving real users. Observability is how you see what your AI is actually doing — before a silent quality drop becomes a support queue.
There’s a particular dread that arrives the week after you ship an AI feature. The demo worked. The evals passed. And now thousands of real users are sending prompts you never imagined, the model is responding in ways you can’t see, and your only signal that something’s wrong is the support tickets starting to pile up. The feature became a black box the moment it left your laptop.
Observability is how you open that box. If evals are the tests you run before shipping, observability is the monitoring you run after — the traces, metrics, and logs that tell you what your AI is actually doing in production, in real time. It’s a different discipline from evaluation, and in 2026 it’s no longer optional for anything serving real traffic.
Why LLM apps need their own observability
Traditional monitoring assumes failures are loud: a 500 error, a crash, a spiked latency graph. LLM apps break quietly, in ways your APM dashboard was never built to catch.
- Failures are silent. The model returns a fluent, confident, wrong answer with a 200 OK. Nothing errors; the quality just degrades where no graph is looking.
- Cost is per-token and sneaky. A prompt that quietly grew, or a retry loop, can multiply your bill overnight. You need to see tokens and dollars per call, not just request counts.
- Latency is variable and felt. Users notice a slow model. Time-to-first-token and total generation time matter in ways a single latency number hides.
- Behavior drifts underneath you. A model version, a prompt tweak, a changed data source — any can move quality without a deploy. You only catch it if you’re watching.
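To make the latency point concrete, here is a minimal sketch of measuring time-to-first-token and total generation time separately. It assumes the model response arrives as an iterable of text chunks; `timed_stream` and `fake_stream` are hypothetical names used for illustration, not part of any particular SDK.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a stream of text chunks and return (text, ttft_seconds, total_seconds)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()  # the moment the user sees something
        parts.append(chunk)
    end = time.perf_counter()
    # An empty stream has no first token, so fall back to the total time.
    ttft = (first_chunk_at - start) if first_chunk_at is not None else (end - start)
    return "".join(parts), ttft, end - start


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming model response (hypothetical)."""
    time.sleep(0.05)
    yield "Hello"
    time.sleep(0.05)
    yield ", world"


text, ttft, total = timed_stream(fake_stream())
```

Recording both numbers per call lets you tell a model that is slow to start apart from one that is simply writing a long answer.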
The three pillars: traces, metrics, logs
Borrow the structure from classic observability and bend it to LLMs. The same three pillars apply, with AI-specific contents, plus a fourth signal the classic stack never had: user feedback.
- Traces and spans. An agent run is a tree — a model call, a tool call, a retrieval, another model call. Tracing captures that tree so you can see the whole chain, not just the final answer.
- Metrics. Tokens in and out, cost per call, time-to-first-token, error rate, cache-hit rate. The numbers you chart and alert on.
- Logs. The actual prompts and responses (carefully — they contain user data). When a metric looks wrong, the log is where you see what really happened.
- Feedback. Thumbs up and down, corrections, regenerations. Real user signal about quality that no automatic metric fully captures.
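The "tree" in the first bullet can be sketched as a small data structure. This is an illustrative model only, not the schema of OpenTelemetry or any vendor; the `Span` class, its field names, and the example run are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    name: str
    kind: str  # e.g. "agent", "llm", "tool", "retrieval"
    input_tokens: int = 0
    output_tokens: int = 0
    children: List["Span"] = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def total_tokens(self) -> int:
        """Tokens used by this span and everything beneath it."""
        return sum(s.input_tokens + s.output_tokens for s in self.walk())


def heaviest_step(root: Span) -> Span:
    """Return the single span that used the most tokens on its own."""
    return max(root.walk(), key=lambda s: s.input_tokens + s.output_tokens)


# One agent run: a planning call, a retrieval, then the final model call.
run = Span("answer_question", "agent", children=[
    Span("plan", "llm", input_tokens=400, output_tokens=60),
    Span("search_docs", "retrieval"),
    Span("final_answer", "llm", input_tokens=2600, output_tokens=300),
])
```

With the run stored as a tree, `run.total_tokens()` gives the cost of the whole chain and `heaviest_step(run)` points at the step to look at first.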
Instrumenting a call
You don’t need a vendor to begin — you need to wrap each model call in a span that records what went in, what came out, and what it cost. Tools like OpenTelemetry, LangSmith, and Langfuse standardize this, but the idea is simple:
    # Wrap every model call in a span that captures the essentials.
    # `model`, `record_span`, and `estimate_cost` stand in for your own client and helpers.
    import time

    def traced_completion(prompt, model):
        start = time.perf_counter()  # monotonic clock, safe for measuring durations
        resp = model.complete(prompt)  # the actual LLM call
        latency_ms = int((time.perf_counter() - start) * 1000)
        record_span({
            "model": model.name,
            "input_tokens": resp.usage.input,
            "output_tokens": resp.usage.output,
            "cost_usd": estimate_cost(resp.usage, model.name),
            "latency_ms": latency_ms,
            "prompt": prompt,  # redact PII before storing
            "response": resp.text,
        })
        return resp

Do that for every call, and for every tool invocation inside an agent, and the black box becomes a glass one. You can replay any user’s session, see which step blew the token budget, and find the prompt that produced the bad answer instead of guessing.
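The snippet above calls `record_span` and `estimate_cost` without defining them, and leaves redaction as a comment. Below is one possible sketch of those helpers. The model names and prices in `PRICES_PER_MTOK` are placeholders, not real rates; the `usage.input` and `usage.output` attributes mirror the snippet; and the in-memory `SPANS` list stands in for whatever exporter or database you actually use.

```python
import json
import re
from types import SimpleNamespace

# Hypothetical prices in USD per million tokens; replace with your provider's real rates.
PRICES_PER_MTOK = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

SPANS = []  # stand-in sink; a real system would export spans elsewhere


def estimate_cost(usage, model_name: str) -> float:
    """Estimate the cost of one call from its token counts and a price table."""
    price = PRICES_PER_MTOK[model_name]
    return (usage.input * price["input"] + usage.output * price["output"]) / 1_000_000


def redact(text: str) -> str:
    """Mask email addresses; a real redactor would cover more kinds of PII."""
    return EMAIL_RE.sub("[EMAIL]", text)


def record_span(span: dict) -> None:
    """Store the span as one JSON line."""
    SPANS.append(json.dumps(span))


usage = SimpleNamespace(input=1200, output=300)
cost = estimate_cost(usage, "example-small")
record_span({
    "model": "example-small",
    "cost_usd": cost,
    "prompt": redact("Reach me at ada@example.com"),
})
```

Keeping the price table in one place means a provider price change is a one-line edit rather than a hunt through call sites.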
Evals tell you the model was good before you shipped. Observability tells you it’s still good now. You need both, and only one of them runs at 3 a.m. while you’re asleep.
What to watch
- Cost per request and per user. Catch the runaway prompt or retry loop before it shows up on the invoice.
- Token usage trends. A slow creep in prompt size is the most common silent cost regression.
- Latency, especially time-to-first-token. It’s what users actually feel, long before total generation time.
- Error and fallback rates. Rate limits, timeouts, and guardrail blocks tell you where the system is straining.
- Quality signals. Sample and score live outputs, and watch user feedback — the only way to catch a silent quality drop.
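Two of the checks above are easy to sketch once spans are being recorded. The example assumes each span is a dict carrying a `user_id` and a `cost_usd` field, and that you already compute a daily average prompt size; the 7-day window and the 1.25x threshold are arbitrary illustrative defaults, not recommendations.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def cost_per_user(spans: List[dict]) -> Dict[str, float]:
    """Sum recorded cost by user so a single runaway account stands out."""
    totals: Dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span["user_id"]] += span["cost_usd"]
    return dict(totals)


def prompt_creep(daily_avg_prompt_tokens: List[float],
                 window: int = 7,
                 threshold: float = 1.25) -> bool:
    """True when the latest window's mean is at least `threshold` times the previous window's."""
    if len(daily_avg_prompt_tokens) < 2 * window:
        return False  # not enough history to compare two windows
    previous = mean(daily_avg_prompt_tokens[-2 * window:-window])
    recent = mean(daily_avg_prompt_tokens[-window:])
    return previous > 0 and recent / previous >= threshold


spans = [
    {"user_id": "a", "cost_usd": 0.5},
    {"user_id": "b", "cost_usd": 0.25},
    {"user_id": "a", "cost_usd": 0.25},
]
per_user = cost_per_user(spans)
creeping = prompt_creep([1000.0] * 7 + [1300.0] * 7)
```

Comparing window against window, rather than day against day, keeps one noisy day from triggering the check.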
Make it boring
The goal of observability isn’t a beautiful dashboard; it’s the absence of surprises. You want the unglamorous ability to answer “what is my AI doing right now, and what is it costing me?” without opening a single user’s session by hand. Instrument first, chart second, alert on the few things that actually page you.
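"Alert on the few things" can be as plain as a short list of threshold rules checked against your latest metrics. The sketch below is one way to express that; the metric names and thresholds are made up for illustration and should come from your own baselines.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    message: str


# Hypothetical rules; keep the list short enough that every alert is worth a page.
RULES = (
    AlertRule("error_rate", 0.05, "error rate above 5%"),
    AlertRule("p95_ttft_ms", 2000, "p95 time-to-first-token above 2s"),
    AlertRule("cost_usd_per_hour", 50.0, "hourly spend above $50"),
)


def firing_alerts(metrics: Dict[str, float],
                  rules: Sequence[AlertRule] = RULES) -> List[str]:
    """Return the message of every rule whose metric exceeds its threshold."""
    return [r.message for r in rules if metrics.get(r.metric, 0.0) > r.threshold]


alerts = firing_alerts({
    "error_rate": 0.01,
    "p95_ttft_ms": 3500,
    "cost_usd_per_hour": 12.0,
})
```

A missing metric is treated as zero here, which keeps the check quiet; you may prefer to alert on missing data instead.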
Shipping an LLM feature without observability is flying a plane with the instruments taped over — fine until the weather turns. Wrap your calls in spans, watch tokens and latency and quality, and you trade that week-after dread for something better: the quiet confidence of actually being able to see.