Streaming LLM Responses Is a UX Trick, Not a Speed-Up — Here's Where That Bites You

Everyone tells you to stream token-by-token because it feels fast. True — but streaming quietly turns one atomic HTTP response into a distributed system with partial failures, half-parsed JSON, and cost you can't read until the last chunk. Here's a mental model, a worked before/after, and a table of when NOT to bother.

Dhileep Kumar · Jun 12, 2026 · 7 min read

Every post about streaming LLM output makes the same promise: pipe the tokens to the browser as they arrive and your slow app suddenly feels fast. That part is true, and I'm not going to re-litigate it. What those posts skip is the bill that comes due afterward. The moment you turn on streaming, you have swapped a single atomic HTTP response for a long-lived, stateful, partially-successful stream — and almost every hard bug I've reasoned my way through in AI apps lives in that swap, not in the happy path.

So this is the version of the streaming article I wish I'd read first. The mental model, one concrete before/after, a decision table for when streaming is actively the wrong call, and the specific places it breaks once real infrastructure sits between your model and your user.

The mental model: you traded a transaction for a broadcast

A normal API call is a transaction. You send a request, you wait, you get back one payload that either succeeded or failed. There is exactly one moment where things can go wrong, and one object to reason about. Your error handling is a single try/catch and your UI has two states: loading, then done.

Streaming is not that. Streaming is a broadcast that has already started playing. By the time the connection drops, you have already handed the user 200 words. You cannot un-send them. There is no single success/failure boolean — there is a timeline, and the failure can land anywhere on it. This one shift is the source of nearly every streaming gotcha below. Hold onto it: the token-by-token UX is the reward; the timeline-with-no-undo is the cost.

Non-streaming asks 'did the request succeed?' Streaming asks 'how far did we get before it didn't?', and your UI has to be able to answer the second question.

A worked example: the summarizer that looked done but wasn't

Let me walk a concrete scenario, because the abstract version hides the trap. Say you build a document summarizer. Non-streaming, the flow is trivial: POST the document, await the response, render the summary. If the model call throws, you show an error toast and the user retries. Clean.

Now you stream it because a 400-word summary took nine seconds and the spinner felt broken. New flow: open the stream, append each chunk to a div. It demos beautifully. Then it ships, and a support ticket says a user 'got a summary that stopped in the middle and lied to them.' Here's what happened: the model streamed three of five bullet points, the upstream connection hiccuped, and your client, which only knew how to append text, left three bullets on screen with no error, no spinner, nothing. To the user, three bullets that render cleanly ARE the summary. The app didn't crash. It confidently showed a truncated answer as if it were complete.

That is the whole lesson in miniature. The non-streaming version fails loud (an error, or nothing). The streaming version fails silent (a plausible-looking partial). Fixing it isn't about catching the error — it's about the client tracking a completion signal, so a stream that ends without the model's natural stop reason gets visibly marked as interrupted rather than passed off as finished.

To make that concrete in code: the naive server loop that every tutorial shows looks like this — pull tokens, push tokens, done:

javascript

// Naive: forwards tokens, forgets the ending
// (`model` stands in for your provider SDK client)
async function stream(req, res) {
  res.setHeader('Content-Type', 'text/event-stream');
  const { messages } = req.body; // assumes a JSON body parser ran earlier
  const completion = await model.chat({ messages, stream: true });
  for await (const chunk of completion) {
    const text = chunk.choices[0]?.delta?.content || '';
    res.write('data: ' + JSON.stringify({ text }) + '\n\n');
  }
  res.end(); // an early stop ends exactly like a finished answer; a throw never gets here at all
}

The problem is the ending. If the upstream iterator simply stops early, res.end() fires exactly as it would after a complete answer. If the loop throws halfway through, res.end() is never reached and the connection is left to hang or be torn down by the framework. Either way, the client has no way to tell a complete answer from an amputated one. Here's the version that carries the ending as data, not just as a closed socket:

javascript

// Better: the ending is a first-class event
// (`model` stands in for your provider SDK client)
async function stream(req, res) {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache'); // keep intermediaries from caching the stream
  res.setHeader('X-Accel-Buffering', 'no'); // tell nginx to stop buffering
  let finishReason = null;
  try {
    const { messages } = req.body; // assumes a JSON body parser ran earlier
    const completion = await model.chat({ messages, stream: true });
    for await (const chunk of completion) {
      const choice = chunk.choices[0];
      if (choice?.delta?.content) {
        res.write('data: ' + JSON.stringify({ text: choice.delta.content }) + '\n\n');
      }
      if (choice?.finish_reason) finishReason = choice.finish_reason;
    }
    // finishReason stays null if the stream ended without a stop reason
    res.write('event: done\ndata: ' + JSON.stringify({ finishReason }) + '\n\n');
  } catch (err) {
    console.error('stream failed mid-response:', err);
    // tokens already sent can't be recalled, so signal the break explicitly
    res.write('event: error\ndata: ' + JSON.stringify({ partial: true }) + '\n\n');
  } finally {
    res.end();
  }
}

The diff is small — a try/catch, a tracked finish_reason, an explicit done vs error event — but it converts a silent partial into a visible one. That's the single highest-leverage change you can make to a streaming endpoint, and almost no tutorial includes it.
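The client half of that contract might look something like the sketch below. It assumes the `done` and `error` event names and the `{ text }` / `{ finishReason }` payloads from the server code above. The names `parseSseFrame`, `reduceStream`, and the three `status` values are mine, and treating only a finish reason of `stop` as "complete" is an assumption you should check against your provider.

```javascript
// Parse one SSE frame (the text between blank lines) into { event, data }.
// Only handles the `event:` and `data:` fields this endpoint sends.
function parseSseFrame(frame) {
  let event = 'message'; // SSE default when no `event:` field is present
  const dataLines = [];
  for (const line of frame.split('\n')) {
    if (line.startsWith('event:')) event = line.slice(6).trim();
    else if (line.startsWith('data:')) dataLines.push(line.slice(5).trimStart());
  }
  return { event, data: dataLines.length ? JSON.parse(dataLines.join('\n')) : null };
}

// Fold one parsed frame (or a synthetic { event: 'close' }) into UI state.
// status is one of: 'streaming' | 'complete' | 'interrupted'
function reduceStream(state, { event, data }) {
  if (state.status !== 'streaming') return state; // terminal states stick
  switch (event) {
    case 'message':
      return { ...state, text: state.text + (data?.text ?? '') };
    case 'done':
      // Assumption: only a natural 'stop' counts as a finished answer.
      return { ...state, status: data?.finishReason === 'stop' ? 'complete' : 'interrupted' };
    case 'error':
    case 'close': // socket closed before any done event arrived
      return { ...state, status: 'interrupted' };
    default:
      return state;
  }
}

const initialState = { text: '', status: 'streaming' };
```

The UI then renders `interrupted` as a visible marker (a "response cut off" notice and a retry button, say) instead of leaving three clean bullets on screen. Dispatching a synthetic `close` when the connection ends is what catches the case where the socket dies before the server gets to write anything.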

When to stream, when not to, and what breaks

Streaming is the default for text a human reads in real time. It is not a universal good. Here's the decision framework I actually use:

STREAM: chat, long-form answers, anything a person reads as it appears. Time-to-first-token is the metric users feel, and streaming crushes it. This is the case the whole internet already sold you on.
DON'T STREAM: the output feeds another program, not an eyeball. If a downstream service consumes the result, streaming buys nothing and costs you partial-parse headaches. Wait for the whole thing.
DON'T STREAM: short outputs that finish in under ~2 seconds. If time-to-completion is already near time-to-first-token, streaming adds protocol and failure surface for a shrug of perceived gain. A yes/no classifier does not need SSE.
BE CAREFUL: structured JSON the UI acts on. Streaming half-formed JSON means the client sees invalid objects for most of the stream. Either stream plain text, or use a tolerant incremental parser and only act on fields once they're provably complete.
RECONSIDER: anything behind a gateway you don't control. Some proxies, CDNs, and serverless platforms buffer the whole response before flushing, silently defeating streaming. If you can't disable buffering, you're paying streaming's complexity for none of its benefit.
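For the structured-JSON case, the most conservative reading of "only act once it's provably complete" is to buffer and act on nothing until the whole document parses. Here is a minimal sketch of that idea, not a tolerant incremental parser. `createJsonAccumulator` is a hypothetical helper, and it assumes the model's top-level output is a single JSON object or array.

```javascript
// Accumulates streamed text and yields a value only once the whole
// document parses. Assumes the top-level value is an object or array,
// so no strict prefix of it is itself valid JSON.
function createJsonAccumulator() {
  let buffer = '';
  return {
    push(chunk) {
      buffer += chunk;
      const trimmed = buffer.trim();
      // Cheap guard: an object/array can only be complete if it ends in } or ].
      if (!/[}\]]$/.test(trimmed)) return undefined;
      try {
        return JSON.parse(trimmed);
      } catch {
        return undefined; // still mid-document; keep buffering
      }
    },
  };
}
```

The client can still show a progress indicator while chunks arrive. It just never hands a half-formed object to code that acts on it.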

The gotchas nobody's tutorial mentions

These are the ones that follow from the transaction-to-broadcast shift, not from any specific SDK. They're worth internalizing because they'll bite you on infrastructure you didn't write.

Buffering is the silent killer. A reverse proxy like nginx will happily accumulate your entire response and flush it in one lump — which looks exactly like streaming being 'broken' with no error anywhere. The tell: tokens arrive in one clump instead of a trickle. The fix lives outside your code (disable proxy buffering, send the X-Accel-Buffering header, avoid gzip on the stream). If your tokens clump, suspect the network path before you suspect your loop.
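One rough way to check for clumping is to record when each chunk reaches the client and look at how the arrivals are spread. The sketch below is a diagnostic aid under my own assumptions: `describeArrivals` is a hypothetical helper, and the 10% threshold is an arbitrary starting point, not a measured constant.

```javascript
// Given the request start time and each chunk's arrival time (both in ms),
// report how spread out the chunks were.
function describeArrivals(startMs, arrivalsMs) {
  if (arrivalsMs.length === 0) {
    return { chunks: 0, firstChunkMs: null, spreadMs: 0, clumped: false };
  }
  const first = arrivalsMs[0];
  const last = arrivalsMs[arrivalsMs.length - 1];
  const totalMs = last - startMs;
  const spreadMs = last - first;
  // Assumed heuristic: many chunks squeezed into under 10% of the total wait
  // suggests something upstream held them and flushed them together.
  const clumped = arrivalsMs.length > 1 && totalMs > 0 && spreadMs / totalMs < 0.1;
  return { chunks: arrivalsMs.length, firstChunkMs: first - startMs, spreadMs, clumped };
}
```

Feed it timestamps captured with `Date.now()` in the client's read loop. If the same request looks trickled when you hit the app server directly and clumped through the public URL, the network path is the suspect.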

Cancellation has to reach all the way back. When a user hits stop, closing the browser connection feels like enough. It isn't. Unless you propagate that cancel to the upstream model call, the model keeps generating — and you keep paying — into a socket nobody is reading. Wire an abort signal from the client's disconnect through to the provider SDK's cancel path, and verify it, because this failure is invisible until the invoice arrives.
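In a Node handler, the wiring might look roughly like this. `abortOnDisconnect` is a hypothetical helper built on the standard `AbortController` and the response's `close` event. Whether your provider SDK accepts a `signal` option on the streaming call is an assumption to verify in its docs.

```javascript
// Returns an AbortController that fires when the client goes away
// before the response was finished normally.
function abortOnDisconnect(res) {
  const controller = new AbortController();
  res.on('close', () => {
    // 'close' also fires after a normal res.end(); only abort on early exits.
    if (!res.writableEnded) controller.abort();
  });
  return controller;
}

// Inside the handler (hypothetical SDK call; the `signal` option is an assumption):
//   const controller = abortOnDisconnect(res);
//   const completion = await model.chat({ messages, stream: true, signal: controller.signal });
```

To verify it, start a long generation, kill the browser tab, and confirm in the provider's logs or dashboard that the request actually stopped early.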

Cost and usage arrive last, not first. With most providers, token counts and billing metadata show up at or near the end of the stream, and some APIs only include them if you opt in. Any metering, rate-limiting, or budget-guard logic that assumed it could read usage up front has to be restructured to wait for the end, and to handle the case where the stream died before that final chunk ever arrived, leaving you with tokens spent and no usage record.
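One way to avoid the "tokens spent, no record" hole is to keep a running tally as text streams and fall back to an estimate if the real numbers never show up. This is a sketch under loud assumptions: `createUsageMeter` and the `completionTokens` field are hypothetical names, and the 4-characters-per-token ratio is a crude placeholder, not a tokenizer.

```javascript
// Tracks output as it streams so there is always *some* usage record.
function createUsageMeter() {
  let chars = 0;
  let reported = null;
  return {
    onText(text) { chars += text.length; },
    onUsage(usage) { reported = usage; }, // call if/when the provider reports usage
    finalize() {
      if (reported) return { source: 'reported', ...reported };
      // Assumption: ~4 characters per token is a rough stand-in only.
      return { source: 'estimated', completionTokens: Math.ceil(chars / 4) };
    },
  };
}
```

Calling `finalize()` from the handler's `finally` block means a record gets written on both the clean path and the interrupted one, and the `source` field tells your billing code how much to trust it.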

Retries are not free anymore. With a transaction, a failed call is retried cleanly because nothing was shown. With a stream, you've already painted 200 words on the screen. 'Retry' now means deciding between restarting from scratch (jarring: the text the user was reading vanishes and rewrites) or resuming (usually impractical, because the model keeps no state between calls, so you would have to re-send the exact partial text and hope the continuation picks up cleanly). Most teams pick restart-from-scratch and just accept the flicker. The point is that it's a product decision streaming forces on you, and one the non-streaming version never had to make.
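If you do go with restart-from-scratch, the policy can be written down in a few lines. This is a sketch, not a recommendation on attempt counts: `streamWithRestart` is a hypothetical helper, `startStream(onText)` is assumed to resolve to `'complete'` or `'interrupted'`, and `view` is assumed to expose `clear()` and `append(text)`.

```javascript
// Runs a stream, restarting from scratch if it ends interrupted.
async function streamWithRestart(startStream, view, maxAttempts = 2) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    view.clear(); // restart-from-scratch: drop whatever was painted before
    const status = await startStream((text) => view.append(text));
    if (status === 'complete') return { status, attempts: attempt };
  }
  // Out of attempts: leave the last partial on screen, marked as interrupted.
  return { status: 'interrupted', attempts: maxAttempts };
}
```

Making `clear()` an explicit step puts the flicker in the code where a designer can find it, for example to fade the old text out instead of blanking it.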

So: stream by default, but budget for the tail

The standard advice is right: for anything a person reads, ship streaming first and treat the non-streaming version as the degraded fallback. Time-to-first-token is the number users actually feel, and streaming is close to free at runtime — same model, same tokens, same bill, dramatically better perceived speed. I'm not walking any of that back.

What I'm adding is that the plumbing everyone calls trivial — 'just a generator on the server and an append on the client' — is trivial only for the happy path. The real work is the tail: the interrupted stream that renders as a confident partial, the proxy that buffers your tokens into a lump, the cancel that never reaches the model, the usage record that arrives last or never. Budget for that tail up front and streaming stays the superpower it's advertised as. Skip it, and you've shipped an app that feels fast right up until the first time it quietly lies to a user.
