Streaming AI Responses: SSE, WebSockets, and the Pitfalls

A streaming LLM response feels fast even when total generation takes ten seconds, because the user sees tokens arriving immediately. The trade is operational: streaming is a long-lived connection with backpressure, partial-failure modes, and a different shape from a normal HTTP request. Here is what you should know before shipping.
SSE or WebSockets

Two transports dominate streaming LLM responses:
data: ... \n\n). Auto-reconnect built in. The default for most LLM streaming.For a typical "user types a message, model streams a response" chat UX, use SSE. It is proxy-friendly, simpler to debug, and the protocol overhead is lower. Reach for WebSockets only when bidirectional matters. (Hivenet on streaming)
TTFT: the metric users feel

Time-to-first-token (TTFT) is the latency between request submission and the first token arriving. Users perceive TTFT as system responsiveness — total generation time matters less than how soon something starts appearing.
[Inference] Common targets:
Drivers of TTFT: prompt length (prefill cost), model size, GPU contention, network distance, and any server-side work between request receipt and first token emission.
A reported optimization: prefix caching in vLLM can reduce TTFT from ~800 ms to under 100 ms for requests that share a prefix with a previous request. (sysart) [Unverified]
Backpressure: the failure mode most teams miss

Streaming connections are long-lived. If the client falls behind — slow network, throttled tab, mobile background — the server's write buffer fills up. Without backpressure handling, your process leaks memory until something OOMs.
Node.js example (SSE over Express/Fastify):
const written = res.write(chunk);
if (!written) {
await new Promise(resolve => res.once('drain', resolve));
}The Node.js docs put it bluntly: "the golden rule of streams is to always respect backpressure; never call .write() after it returns false but wait for drain instead." (Node.js docs)
For WebSockets, watch ws.bufferedAmount. If it grows beyond a threshold (say, 64 KB), pause emission. Resume when it drains.
Client disconnect: the leakier failure mode
The browser tab closes. The user navigates away. The mobile network drops. Your server is mid-generation, mid-write. What happens?
If you do not handle it, a common chain in Python (FastAPI / Starlette):
ClientDisconnect or CancelledError somewhere in the generatorfinally: block runs — but if it does DB writes against the request-scoped session, the cancellation can leave a transaction "idle in transaction" on Postgres, holding row locks and exhausting the pool[Inference] This pattern is widespread enough to be a known anti-pattern. The fix is to do post-stream writes against a fresh session with asyncio.shield so they complete even if the caller is cancelled.
In Node.js, listen to req.on('close') and abort the upstream model call. In Python/FastAPI, check await request.is_disconnected() periodically and bail out if it returns true.
Error recovery mid-stream
Once the headers are flushed, you cannot send a normal HTTP error. Options:
event: error\ndata: {...} in SSE. Client switches to error UI on receipt.The approach that works best is to send a typed event protocol end-to-end: event: token, event: usage, event: error, event: done. The client handles each event type explicitly; ambiguous protocols cause flaky UIs.
Reconnect, resume, and idempotency
Streaming connections drop. Mobile. Wifi. Long-tail networks. Three patterns to handle this:
Last-Event-ID; the server replays events from that point. Works only if the server retains state (often via Redis or DB log).For voice agents and other very-long sessions, design for resume from the start. For simple chat UIs, "lost connection, please retry" is acceptable.
Server architecture: connections, not requests

A streaming server is sized in concurrent active connections, not requests per second. A 30-second-average generation at 100 RPS requires headroom for ~3,000 simultaneous connections. Default web-server connection limits often need to be raised; long-running async runtimes (uvicorn, fastify, starlette) handle this better than thread-per-request models.
Token rendering on the client
Watch out for bursty token dumps — frameworks that batch token events and flush every 200 ms make streams feel jerky. Render small chunks frequently. Some teams smooth this on the client by adding a tiny artificial delay between word-level renders for visual rhythm; opinions vary on whether it helps or feels patronizing.
A short checklist
drain in Node, throttle on bufferedAmount in WS, slow upstream consumption when needed.token, error, usage, done) — never overload data:.Bottom line
Streaming is a different shape than request/response. Most teams ship the happy path quickly and discover the failure modes — backpressure leaks, idle Postgres transactions, half-rendered messages — under load. Design for the failure cases up front, instrument TTFT, and treat client disconnect as a first-class event, not an exception.


