AI & ML

LLM Observability: Sample the Costly, Trace the Weird, Score the Rest

Most LLM observability advice tells you to "log everything. " That is how you get a $4,000 tracing bill and a dashboard nobody reads. Here is a sampling-first mental model, a worked incident, and the gotchas that only show up once real traffic hits your spans.

Dhileep KumarJun 11, 20267 min read

LLM Observability: Sample the Costly, Trace the Weird, Score the Rest

Every LLM observability post you have read ends with the same instruction: wrap your calls in spans and log the prompts and responses. It is correct, and it is where the interesting decisions actually begin. Because the moment you "log everything" against real traffic, you discover that prompts and responses are the single most expensive thing you can store, the most dangerous thing you can retain, and the least useful thing to look at in aggregate. The advice that sounds complete on a blog is the advice that generates a surprise invoice from your tracing vendor.

So this post takes a harder position than the genre usually allows: your observability system has its own cost, its own privacy blast radius, and its own failure modes, and the entire craft is deciding what NOT to keep. Metrics are nearly free and you should keep all of them. Payloads -- the actual prompt and response text -- are expensive and radioactive, and you should keep them deliberately. Get that one split right and everything else falls into place.

The mental model: two tiers, not three pillars

The classic "three pillars" framing (traces, metrics, logs) is a fine borrow from web observability, but it hides the decision that actually matters for LLMs. A more useful split is by cost of retention, and there are only two tiers.

Tier one is the numbers: tokens in, tokens out, cost per call, model name, time-to-first-token, latency, error and fallback flags, cache-hit rate. Each of these is a few dozen bytes. You can keep 100% of them for every call and barely notice. These are what your charts and alerts run on.

Tier two is the text: the full prompt (often several thousand tokens once you have stuffed in a system prompt, retrieved documents, and conversation history) and the full response. This is where the debuggability lives -- and where your storage bill, your PII exposure, and your compliance obligations all live too. Tier two is the tier you sample.

The skill in LLM observability is not capturing more. It is keeping every number and almost none of the text -- and being deliberate about which text you keep.

Why does this matter so much more for LLMs than for a normal web service? A REST endpoint's payload is a small JSON body you would rarely bother storing. An LLM call's payload is the interesting part AND it is huge AND it is user data. The three properties that make it worth capturing are the same three that make capturing all of it a mistake.

The instrumentation, and the sampler that makes it survivable

Here is the wrap-your-call snippet every post shows you. The difference is one parameter -- whether to keep the payload -- and the fact that the decision is made once, up front, before the call runs.

python

from contextlib import contextmanager
import time

# A minimal span. No vendor SDK needed to start.
# The point is not the library -- it is deciding WHAT to record
# and, crucially, WHETHER to keep the full payload.

@contextmanager
def llm_span(name, keep_payload):
    rec = {"name": name, "t_start": time.time()}
    try:
        yield rec
    finally:
        rec["duration_ms"] = int((time.time() - rec["t_start"]) * 1000)
        # ALWAYS emit the cheap numbers.
        emit_metric(rec)
        # Emit the expensive prompt/response text only when asked.
        if keep_payload:
            emit_payload(rec)

def handle(request):
    # Decide sampling ONCE, up front, so trace + payload agree.
    keep = should_keep_payload(request)   # see the four rules below
    with llm_span("chat.completion", keep) as rec:
        resp = client.chat(request.messages)
        rec["tokens_in"]  = resp.usage.prompt_tokens
        rec["tokens_out"] = resp.usage.completion_tokens
        rec["cost_usd"]   = price(resp.usage)
        rec["model"]      = resp.model
        rec["ttft_ms"]    = resp.ttft_ms
        if keep:
            rec["prompt"]   = redact(request.messages)
            rec["response"] = resp.text
        return resp

The interesting code is not the span; it is the sampler that feeds it. A naive "keep 5% at random" throws away exactly the traces you will wish you had, and keeps 5% of the boring ones. Bias the sampling toward the pathological instead: anything that retried or fell back, anything with a suspiciously large prompt, and a deterministic per-user slice so you can reconstruct whole sessions rather than orphaned single calls.

python

# The sampler is the whole design. Four rules, checked in order.
def should_keep_payload(request):
    # 1. Keep everything that already went wrong or looks wrong.
    if request.is_retry or request.fallback_triggered:
        return True
    # 2. Keep the expensive tail -- big prompts hide the cost bugs.
    if estimate_tokens(request) > 8000:
        return True
    # 3. Keep a deterministic slice per USER, not per call, so you
    #    can reconstruct a whole session instead of orphaned turns.
    if stable_hash(request.user_id) % 100 < 5:   # ~5% of users
        return True
    # 4. Otherwise: metrics only. Cheap, and enough for the charts.
    return False

The per-user hash in rule three is the non-obvious one. If you sample individual calls at random, a kept call is an island: you see the bad answer but not the three turns of context that produced it. Hashing on user id means that when you decide to keep someone, you keep their entire conversation -- which is the only thing that lets you actually replay what happened.

A worked incident: the bill that doubled with no deploy

Let me walk through the exact shape of failure this design is built to catch, because it is the one that trips teams up. Say your chat feature normally costs, as a rough illustration, about 1,200 tokens per request. One Monday the finance dashboard shows model spend has roughly doubled week over week. Nobody deployed anything. Support has not escalated. The feature works fine in the demo, as always.

With tier-one metrics on every call, the first question is answerable in one chart: plot tokens_in over time. You see it did not step up on a deploy boundary -- it drifted upward over four days. That rules out a code change and points at data. Now break tokens_in down by whatever your retrieval step feeds the prompt. One document source is contributing a growing slice.

This is where the sampler pays for itself. Rule two kept the full payload of every call with a prompt over 8,000 tokens -- exactly the tail that was growing. You open three of those retained payloads and see it immediately: a knowledge-base article was edited to paste in a giant changelog, your retriever is now pulling that whole wall of text into every relevant prompt, and it never errored because a bigger prompt is still a valid prompt. No 500. No alert in a traditional APM. Just a quiet, compounding cost regression that a token-trend chart caught and a biased payload sample explained.

Notice what you did NOT need: 100% payload retention. You needed the numbers on everything and the text on the expensive tail. That is the whole thesis in one incident.

When to reach for this -- and when it is overkill

Observability is not free, and treating it as universally mandatory is how you end up instrumenting a prototype that has eleven users. Match the investment to the stakes.

Full sampled tracing + payloads: any user-facing feature on real traffic, especially agents with tool calls where a single run fans out into a tree and the failure could be in any node. This is the default once you are past the prototype.
Metrics only, no payloads: internal tools, low-stakes batch jobs, or anywhere the prompts are your own and privacy is a non-issue. Keep the cheap numbers, skip the expensive text entirely.
Almost nothing: a prototype behind a feature flag with a handful of internal testers. A log line per call is fine. Do not build a pipeline for traffic you do not have yet.
Payload retention OFF by policy: regulated data (health, finance, minors). Here the default flips -- you keep metrics and a redacted, structural view, and treat any raw prompt retention as a decision that needs sign-off, not a default.

The trap is the middle: teams building heavyweight tracing for a demo, or shipping a real feature with nothing but request counts. Pick the row that matches your actual traffic and risk, not the one that looks most thorough.

Gotchas that only appear under real traffic

These are the things that do not show up in the tutorial, because the tutorial runs on ten requests. They show up on day two.

Your tracing latency counts against the user. If you emit a span synchronously and it does a network write before you return the response, you just added that write to time-to-first-token. Buffer and flush spans asynchronously -- observability must never sit on the hot path of the response.
Streaming breaks your token and latency numbers. When you stream tokens, the response object you would read usage from does not exist until the stream closes. You have to accumulate usage as chunks arrive, and record time-to-first-token at the first chunk, not at the end. A naive wrapper records zero tokens for every streamed call.
Redaction has to happen before the payload leaves your process, not in the dashboard. If you ship raw prompts to a third-party tracing vendor and redact in their UI, the raw PII already left your building. Redact in the span, at rule-evaluation time, so the unredacted text never crosses your network boundary.
Sampling on individual calls destroys agent traces. An agent run is a tree of calls; if your sampler flips a coin per call, you keep the root and drop two of its children, and the trace is unreadable. Decide keep or drop once at the top of the run and propagate that decision to every child span.
Cost per token is not one number. Cached input tokens, cache writes, and output tokens are often priced differently, and the ratio shifts as prompts grow. If your price() function multiplies total tokens by a single rate, your cost chart is confidently wrong in exactly the situation -- a bloated prompt -- where you most need it to be right.

Make it boring, on purpose

The goal was never a beautiful dashboard. It is the absence of surprises: the ability to answer "what is my AI doing right now, and what is it costing me? " from charts, and to answer "why did THIS go wrong? " from a sampled payload you had the foresight to keep. Evals told you the model was good before you shipped. This tells you it is still good now -- and it is the only one of the two running at 3 a. m. while you sleep.

So invert the usual advice. Do not log everything. Keep every number, keep the text you can justify, redact before it leaves, sample toward the weird, and decide once per run. Shipping an LLM feature with metrics-on-everything and a biased payload sample is not less observable than logging it all -- it is the version that survives contact with real traffic, a real invoice, and a real privacy review.

Enjoyed this?

Get the next deep dive in your inbox. No spam — just the stories worth reading.

Subscribe to the newsletter