
Streaming LLM Responses: Reduce Perceived Latency

Token streaming is a UX technique that masks latency by displaying the model's output as tokens are generated, rather than waiting for the entire response before showing anything. Streaming does not reduce actual latency (time-to-first-token and total generation time are essentially unchanged), but it sharply reduces perceived latency because users see immediate progress. For the same 2-second response, a streamed answer starts appearing almost immediately while a buffered answer shows nothing until the end, so the streamed experience can feel several times faster even though it finishes no sooner.

Why Streaming Reduces Perceived Latency

Psychological research in human-computer interaction shows that perceived time correlates with progress visibility. When a system provides feedback (progress bar, incremental output), users perceive the same wait time as 50-70% shorter (Ling et al., 2005). Token streaming leverages this effect: instead of the user staring at a blank screen for 2 seconds, they see text appear incrementally, creating the sense that the system is highly responsive.

A buffered system: [2s wait] ... [response appears all at once]

A streaming system: [100ms first token] [tokens flow in] [response complete at 2.5s]

The streaming system feels faster despite taking slightly longer overall (server overhead) because the user sees immediate feedback. This is why major LLM products such as ChatGPT, Claude, and Gemini stream responses: not to reduce latency, but to improve UX.
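The two timelines above can be reproduced with a little arithmetic. The sketch below is illustrative only: the function names, the 96-token response, the 20 ms inter-token gap, and the 0.5 s streaming overhead are assumptions chosen to match the 2 s and 2.5 s figures, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Timeline:
    first_visible_s: float  # when the user first sees any output
    complete_s: float       # when the full response is on screen


def buffered_timeline(ttft_s: float, n_tokens: int, inter_token_s: float) -> Timeline:
    """Buffered: nothing is shown until the last token has been generated."""
    total = ttft_s + (n_tokens - 1) * inter_token_s
    return Timeline(first_visible_s=total, complete_s=total)


def streaming_timeline(ttft_s: float, n_tokens: int, inter_token_s: float,
                       overhead_s: float = 0.0) -> Timeline:
    """Streaming: the first token is shown at TTFT; overhead only delays completion."""
    total = ttft_s + (n_tokens - 1) * inter_token_s + overhead_s
    return Timeline(first_visible_s=ttft_s, complete_s=total)


# Assumed workload: 100 ms TTFT, 96 tokens, 20 ms between tokens.
b = buffered_timeline(0.1, 96, 0.02)
s = streaming_timeline(0.1, 96, 0.02, overhead_s=0.5)
print(f"buffered:  first output {b.first_visible_s:.1f}s, complete {b.complete_s:.1f}s")
print(f"streaming: first output {s.first_visible_s:.1f}s, complete {s.complete_s:.1f}s")
```

The only number that differs in the user's favor is the time to first visible output, which is the quantity streaming is meant to improve.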

HTTP Protocols for Streaming

Three HTTP mechanisms enable token streaming:

  1. Server-Sent Events (SSE): Unidirectional streaming from server to client. The server sends a single HTTP response and keeps it open, writing events (formatted as data: ... lines) until it closes the stream. Simple, HTTP/1.1 compatible, ideal for LLM responses.

  2. WebSocket: Bidirectional streaming. Requires an additional connection upgrade, but supports full-duplex communication. Useful for chat systems with user interrupts.

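  3. Chunked Transfer Encoding: Raw HTTP chunked responses with an application-defined payload format. Lower-level and less structured than SSE (which, over HTTP/1.1, is typically delivered using chunked encoding itself), but works in all browsers. Rarely used directly for LLMs today.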

Server-Sent Events is the standard for LLM streaming due to simplicity and browser support. A typical SSE response looks like:

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"token": "Hello"}

data: {"token": " there"}

data: {"token": ", how"}

data: {"token": " are"}

data: {"token": " you"}

data: [DONE]

The client parses the payload of each data: line as JSON and appends the token to the UI. The connection stays open until the model finishes generating and the server sends a final [DONE] sentinel, which is a plain string rather than JSON and must be checked before parsing.
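A minimal sketch of that client-side parsing, assuming the single-line data: {"token": ...} events and the [DONE] sentinel shown above. The function name iter_sse_tokens is hypothetical, and the full SSE format (multi-line data fields, event: and id: fields) is deliberately not handled here.

```python
import json
from typing import Iterable, Iterator


def iter_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "token" field of each SSE data event until [DONE] is seen."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other fields are skipped
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel; it is not JSON
        yield json.loads(payload)["token"]


# Lines as they would arrive on the wire, one list item per line.
stream = [
    'data: {"token": "Hello"}', "",
    'data: {"token": " there"}', "",
    "data: [DONE]", "",
]
print("".join(iter_sse_tokens(stream)))  # prints "Hello there"
```

Checking for the sentinel before calling json.loads matters: passing [DONE] to a JSON parser that expects an object would raise or return the wrong type.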

Implementing Token Streaming with Python + FastAPI

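Below is a simplified example of a streaming LLM endpoint using FastAPI and vLLM. It demonstrates the SSE plumbing; as the code comments note, production use needs vLLM's async engine so that tokens are sent while generation is still in progress: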

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from vllm import LLM, SamplingParams
import json

app = FastAPI()

# Load LLM once at startup
llm = LLM(model="meta-llama/Llama-2-7b-hf", gpu_memory_utilization=0.9)

def generate_tokens(prompt: str, max_tokens: int = 256):
    """
    Generator that yields tokens as SSE events.

    This is a plain (sync) generator on purpose: StreamingResponse iterates
    sync generators in a thread pool, so the blocking llm.generate() call
    does not stall the event loop.
    """
    sampling_params = SamplingParams(max_tokens=max_tokens)

    # use_tqdm=False only disables the progress bar. llm.generate() still
    # blocks until the whole completion is finished, so the loop below
    # replays tokens that have already been generated.
    results = llm.generate(
        prompts=[prompt],
        sampling_params=sampling_params,
        use_tqdm=False
    )

    # In production, use vLLM's AsyncLLM for truly async generation,
    # so tokens are sent while later ones are still being produced.
    tokenizer = llm.get_tokenizer()
    for output in results:
        for token_id in output.outputs[0].token_ids:
            # Decoding one token at a time is a simplification: some
            # tokenizers need neighbouring tokens to restore spaces and
            # multi-byte characters correctly.
            token_text = tokenizer.decode(
                [token_id], skip_special_tokens=True
            )
            yield f"data: {json.dumps({'token': token_text})}\n\n"

    # Sentinel that tells the client the stream is complete
    yield "data: [DONE]\n\n"

@app.post("/v1/completions/stream")
async def stream_completion(request: dict):
    """
    Endpoint that streams tokens via Server-Sent Events.
    Request format: {"prompt": "...", "max_tokens": 256}
    """
    prompt = request.get("prompt", "")
    max_tokens = request.get("max_tokens", 256)

    return StreamingResponse(
        generate_tokens(prompt, max_tokens),
        media_type="text/event-stream"
    )

This endpoint accepts a POST request and immediately returns an open SSE stream. The client (JavaScript, curl, or any SSE-compatible client) consumes tokens as they arrive:

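<script>
// EventSource can only issue GET requests, so a POST endpoint needs
// fetch() plus a stream reader instead.
async function streamCompletion(prompt) {
  const response = await fetch("/v1/completions/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, max_tokens: 256 }),
  });
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const target = document.getElementById("response");
  let buffer = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE events are separated by a blank line.
    const events = buffer.split("\n\n");
    buffer = events.pop(); // keep any incomplete trailing event
    for (const event of events) {
      if (!event.startsWith("data: ")) continue;
      const payload = event.slice(6);
      if (payload === "[DONE]") return; // sentinel, not JSON
      // textContent avoids interpreting model output as HTML.
      target.textContent += JSON.parse(payload).token;
    }
  }
}

streamCompletion("Hello world").catch((err) => console.error("Stream failed:", err));
</script>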

Streaming Latency Trade-offs

While streaming improves perceived latency, it introduces real tradeoffs:

  1. Rendering overhead: The client must buffer incoming tokens and render each one to the DOM. When tokens arrive less than 50 ms apart, per-token rendering can cause UI jank. Throttle DOM updates to once every 50-100 ms for smooth rendering.

  2. Network chattiness: Each token is sent as a separate SSE event, often in its own packet, even though it carries only a few bytes. This increases framing overhead relative to payload. Batch tokens (for example, 5 per event) to reduce packet count.

  3. Server resource holding: The HTTP connection stays open until the response completes, holding a socket and thread/async task on the server. Careless scaling can exhaust server file descriptors. Use non-blocking async I/O (FastAPI's async generator, not threads).

  4. Connection interruptions: Users may close the browser tab or lose network connection mid-stream. The server must detect and clean up orphaned streams. Implement a reasonable timeout (e.g., if a client does not consume tokens for 30s, close the connection).
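The token batching suggested in trade-off 2 can be sketched as an async generator that wraps the token source. This is a hedged illustration, not a library API: batch_tokens, fake_model, and the default thresholds are hypothetical, and the time check only runs when a token arrives, so a stalled model would leave pending tokens unflushed until the next token or the end of the stream.

```python
import asyncio
import json
import time
from typing import AsyncIterator, List


def sse_event(text: str) -> str:
    """Format one SSE data event carrying a chunk of text."""
    return f"data: {json.dumps({'token': text})}\n\n"


async def batch_tokens(tokens: AsyncIterator[str], max_tokens: int = 5,
                       max_wait_s: float = 0.1) -> AsyncIterator[str]:
    """Flush when max_tokens have accumulated, or when a token arrives
    more than max_wait_s after the previous flush."""
    pending: List[str] = []
    last_flush = time.monotonic()
    async for token in tokens:
        pending.append(token)
        if len(pending) >= max_tokens or time.monotonic() - last_flush >= max_wait_s:
            yield sse_event("".join(pending))
            pending.clear()
            last_flush = time.monotonic()
    if pending:  # flush whatever is left when the model finishes
        yield sse_event("".join(pending))
    yield "data: [DONE]\n\n"


async def fake_model() -> AsyncIterator[str]:
    """Stand-in for a real token source (assumption for the demo)."""
    for tok in ["Hello", " there", ",", " how", " are", " you", "?"]:
        yield tok


async def main() -> List[str]:
    # A large max_wait_s makes the demo depend on the count threshold only.
    return [e async for e in batch_tokens(fake_model(), max_tokens=3, max_wait_s=10.0)]


events = asyncio.run(main())
for e in events:
    print(repr(e))
```

With seven tokens and max_tokens=3, the client receives three data events plus the sentinel instead of seven separate events.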

Streaming Best Practices

  • Set a token send interval: Do not send every token individually. Buffer 2-5 tokens and send together every 50-100ms for optimal client-side rendering performance and reduced network overhead.
  • Handle disconnects gracefully: If the client closes the connection, the server should stop generating tokens. Most async frameworks cancel the handler when this happens, but verify it: catch asyncio.CancelledError in the generator and confirm that generation actually stops.
  • Measure perceived latency, not actual latency: Use client-side instrumentation to record when the first token appears in the UI and when the user sees the complete response. These metrics matter more than server-side latency.
  • Combine with TTFT optimization: Streaming masks latency only after the first token. If TTFT is 2 seconds, streaming still feels slow. Use streaming + TTFT optimization (tensor parallelism, caching) together.
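The disconnect-handling practice above can be checked in isolation. The sketch below simulates a disconnect by cancelling the consuming task, which is an assumption about how the framework surfaces it; token_stream, consume, and cleanup_log are hypothetical names, and asyncio.sleep stands in for waiting on the model.

```python
import asyncio
from typing import AsyncIterator, List

cleanup_log: List[str] = []


async def token_stream(request_id: str) -> AsyncIterator[str]:
    """Yield SSE events until the consuming task is cancelled."""
    try:
        n = 0
        while True:
            await asyncio.sleep(0.01)  # stand-in for waiting on the next token
            yield f"data: token-{n}\n\n"
            n += 1
    except asyncio.CancelledError:
        # Cancellation arrives at the await above; record it and re-raise
        # so the task is still marked as cancelled.
        cleanup_log.append(f"{request_id}: cancelled")
        raise
    finally:
        # Release per-request resources here (e.g. abort generation).
        cleanup_log.append(f"{request_id}: released")


async def consume(request_id: str) -> None:
    async for _ in token_stream(request_id):
        pass


async def main() -> None:
    task = asyncio.create_task(consume("req-1"))
    await asyncio.sleep(0.05)
    task.cancel()  # simulates the handler being cancelled on disconnect
    try:
        await task
    except asyncio.CancelledError:
        pass


asyncio.run(main())
print(cleanup_log)  # ['req-1: cancelled', 'req-1: released']
```

Re-raising CancelledError is deliberate: swallowing it would let the generator keep running after the client has gone.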

Streaming Implementation Checklist

| Step | Implementation | Notes |
| --- | --- | --- |
| Choose protocol | Server-Sent Events | Simpler than WebSocket; widely supported |
| Implement generator | Async function that yields tokens | Use AsyncLLM in vLLM for true async |
| Return StreamingResponse | FastAPI's StreamingResponse(generator, media_type="text/event-stream") | Sets the Content-Type header |
| Format events | data: {JSON}\n\n | Each event ends with two newlines |
| Client parsing | EventSource API or manual fetch with streaming | Handles connection lifecycle; EventSource supports GET only |
| Buffer tokens on client | Apply 2-5 tokens per DOM update | Reduces jank; throttle to 50-100 ms |
| Handle disconnects | Server detects closed sockets; client detects error events | Graceful cleanup |

Key Takeaways

  • Streaming masks latency rather than reducing it: Perceived latency improves dramatically, but actual latency is unchanged or slightly worse (buffering overhead).
  • Server-Sent Events is the standard: Simple, HTTP/1.1 compatible, ideal for LLMs.
  • Combine streaming with TTFT optimization: Streaming alone cannot fix a 2-second TTFT; you still need prefill optimization.
  • Batch tokens: Sending and rendering every token individually adds network overhead and causes jank; group tokens and throttle DOM updates to every 50-100 ms.
  • Monitor connection lifecycle: Implement timeouts and graceful disconnect handling to avoid resource leaks.

Frequently Asked Questions

Does streaming reduce actual latency or just perceived latency?

Streaming reduces perceived latency (time until first visible output) but does not reduce actual latency (time until all tokens are generated). Total generation time may even increase slightly due to buffering overhead. Streaming is a UX technique, not a performance optimization.
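The two quantities distinguished above can be recorded by wrapping the token stream in a small timer. This is an illustrative sketch: StreamTimer is a hypothetical helper, and in a browser the same idea would be implemented with client-side timestamps rather than Python.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


class StreamTimer:
    """Records time to first token and total time for one streamed response."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.start = clock()
        self.first_token_at: Optional[float] = None
        self.done_at: Optional[float] = None

    def wrap(self, tokens: Iterable[str]) -> Iterator[str]:
        """Pass tokens through unchanged while timestamping first and last."""
        for token in tokens:
            if self.first_token_at is None:
                self.first_token_at = self._clock()
            yield token
        self.done_at = self._clock()

    @property
    def ttft_s(self) -> float:
        return self.first_token_at - self.start

    @property
    def total_s(self) -> float:
        return self.done_at - self.start


timer = StreamTimer()
text = "".join(timer.wrap(["Hello", " there"]))
print(f"{text!r}: first token after {timer.ttft_s:.4f}s, complete after {timer.total_s:.4f}s")
```

Streaming improves the first number as the user experiences it; the second is what a buffered response would have made the user wait for before showing anything.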

Can I use streaming with request batching?

Yes, but with caveats. If you batch five requests together, their tokens are interleaved in the decode phase, and each token must be demultiplexed back to the correct client stream. This is complex to build yourself; most production systems rely on a dedicated streaming scheduler (vLLM includes one). Do not hand-roll batching on top of a streaming endpoint; use an engine that handles both, so you keep the throughput gain without owning the complexity.
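To make the demultiplexing step concrete, here is a toy sketch that routes interleaved tokens to per-request queues. It is not how any particular engine implements it: the function names, the (request_id, token) pair format, and the use of None as an end-of-stream marker are all assumptions for illustration.

```python
import asyncio
from typing import Dict, List, Optional, Tuple

Step = List[Tuple[str, str]]  # one decode step: (request_id, token) pairs


async def demultiplex(steps: List[Step],
                      queues: Dict[str, "asyncio.Queue[Optional[str]]"]) -> None:
    """Route each token from a batched decode loop to its request's queue."""
    for step in steps:
        for request_id, token in step:
            await queues[request_id].put(token)
    for queue in queues.values():
        await queue.put(None)  # None marks end of stream


async def read_stream(queue: "asyncio.Queue[Optional[str]]") -> str:
    """What one client's streaming handler would do: drain its own queue."""
    parts: List[str] = []
    while (token := await queue.get()) is not None:
        parts.append(token)
    return "".join(parts)


async def main() -> Dict[str, str]:
    steps: List[Step] = [
        [("a", "Hel"), ("b", "Good")],
        [("a", "lo"), ("b", "bye")],
        [("b", "!")],  # request "a" has finished; "b" continues
    ]
    queues: Dict[str, "asyncio.Queue[Optional[str]]"] = {
        "a": asyncio.Queue(), "b": asyncio.Queue(),
    }
    _, a, b = await asyncio.gather(
        demultiplex(steps, queues), read_stream(queues["a"]), read_stream(queues["b"])
    )
    return {"a": a, "b": b}


results = asyncio.run(main())
print(results)  # {'a': 'Hello', 'b': 'Goodbye!'}
```

A real scheduler also has to admit and retire requests mid-batch and apply backpressure for slow clients, which is where most of the complexity lies.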

What latency happens before streaming starts?

Streaming begins after the prefill phase completes and the first token is generated. Until then, users see nothing. This is why TTFT optimization is essential; streaming cannot hide a 1-second prefill latency.

How do I handle user interrupts in streaming?

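WebSocket is the more natural fit when the client needs to send messages mid-stream: when a user clicks "stop," the client sends a message to the server, which stops token generation and closes the connection. SSE is unidirectional, so the client cannot send a stop message over the same stream, but it can still interrupt. Closing the connection (EventSource.close(), or aborting the fetch request) lets the server detect the disconnect and stop generating, and a separate HTTP request can carry an explicit cancel signal.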
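Whichever transport delivers the stop signal, the server side needs a way to map it to the running generation. The sketch below shows one possible shape using a per-request asyncio.Event; stop_flags, generate, and request_stop are hypothetical names, and a real engine would also need to abort the underlying model call.

```python
import asyncio
from typing import AsyncIterator, Dict, List

stop_flags: Dict[str, asyncio.Event] = {}


async def generate(request_id: str, tokens: List[str]) -> AsyncIterator[str]:
    """Yield tokens until finished or until a stop is requested for this id."""
    stop = stop_flags.setdefault(request_id, asyncio.Event())
    try:
        for token in tokens:
            if stop.is_set():
                break
            yield token
            await asyncio.sleep(0)  # give other tasks (e.g. a stop handler) a turn
    finally:
        stop_flags.pop(request_id, None)  # always unregister the request


def request_stop(request_id: str) -> bool:
    """Called when a stop message arrives; returns False for unknown ids."""
    flag = stop_flags.get(request_id)
    if flag is None:
        return False
    flag.set()
    return True


async def main() -> List[str]:
    received: List[str] = []
    async for token in generate("req-1", ["a", "b", "c", "d", "e"]):
        received.append(token)
        if len(received) == 2:
            request_stop("req-1")  # the user clicks "stop" after two tokens
    return received


received = asyncio.run(main())
print(received)  # ['a', 'b']
```

The generator checks the flag between tokens, so a stop takes effect at the next token boundary rather than instantly.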

Further Reading