Streaming AI Responses: Real-Time UX
Users hate waiting for AI responses. A 5-second delay feels like an eternity when a chatbot is thinking. Streaming the response token by token makes the wait feel far shorter: under good conditions the user sees the first token within about 100 ms, and words appear continuously as the LLM generates them. This article shows how to architect streaming responses using Server-Sent Events (SSE) or WebSockets, handle mid-stream failures, and manage buffering on both client and server.
What is Streaming in AI Responses?
Streaming is the practice of sending LLM output as a sequence of small text chunks (typically one or a few tokens each) rather than waiting for the complete response and sending it all at once. Instead of waiting 3 seconds for a full 300-word response to arrive, the user sees the first word in about 100 ms, the first sentence by 500 ms, and the full response by 3 seconds. Total generation time is unchanged; what improves is the time to first visible output. The user perceives the AI as "thinking with them" rather than disappearing into a black box. Streaming is supported by all major LLM providers (OpenAI, Anthropic, Google, Cohere) via their streaming APIs.
Server-Sent Events (SSE): Unidirectional Streaming
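SSE is a simple, HTTP-based protocol where the server pushes text messages to the client over a persistent connection. Each message is one or more "data:" lines followed by a blank line. Use SSE when the client sends one request and waits for the streamed response. Note that the browser's built-in EventSource API only issues GET requests and cannot set custom headers, so a client that needs to POST a prompt with an Authorization header reads the same event stream through fetch instead.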
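To make the wire format concrete, here is a minimal sketch of how a single SSE message can be framed. The helper name `format_sse` and its parameters are assumptions made for illustration and are not part of any library; the `event`, `id`, and `data` field names and the blank-line terminator come from the SSE format itself.

```python
import json
from typing import Optional


def format_sse(payload: dict, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Frame one JSON payload as a single SSE message (hypothetical helper)."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    # json.dumps escapes newlines inside strings, so the payload always fits on one data line
    lines.append(f"data: {json.dumps(payload)}")
    # A blank line terminates the message
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    print(repr(format_sse({"type": "message_stop"})))
```

Encoding the payload as JSON is a design choice rather than a requirement of SSE: it keeps raw newlines out of the `data:` line, so the receiver can treat every blank line as a message boundary.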
Backend Implementation
# FastAPI server: stream LLM response via SSE
import json
import os

import anthropic
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
# Async client, so waiting on the LLM does not block the event loop
client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


class CompletionRequest(BaseModel):
    """JSON body sent by the client: {"prompt": "..."}."""
    prompt: str


@app.post("/api/v1/stream-completion")
async def stream_completion(request: Request, body: CompletionRequest):
    """Stream LLM response as Server-Sent Events."""
    # Verify tenant context (assumes auth middleware has set request.state.tenant_id)
    if not getattr(request.state, "tenant_id", None):
        raise HTTPException(status_code=401, detail="Unauthorized")

    async def generate_stream():
        """Generator that yields SSE-formatted chunks."""
        try:
            # Use the streaming API from Anthropic
            async with client.messages.stream(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": body.prompt}],
            ) as stream:
                # Each item is a piece of generated text
                async for text in stream.text_stream:
                    # SSE format: data: <json>\n\n
                    chunk = json.dumps({
                        "type": "content_block_delta",
                        "delta": {"type": "text_delta", "text": text},
                    })
                    yield f"data: {chunk}\n\n"
            # Send completion signal
            final_event = json.dumps({"type": "message_stop"})
            yield f"data: {final_event}\n\n"
        except Exception as e:
            # Report the failure in-band so the client can keep the partial text
            error_event = json.dumps({"type": "error", "error": str(e)})
            yield f"data: {error_event}\n\n"

    return StreamingResponse(generate_stream(), media_type="text/event-stream")
Frontend Implementation (JavaScript)
// Client: read the SSE stream with fetch and update the UI in real time
async function streamCompletion(prompt) {
  const responseDiv = document.getElementById("response");
  responseDiv.textContent = ""; // Clear previous response
  try {
    const response = await fetch("/api/v1/stream-completion", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${localStorage.getItem("auth_token")}`
      },
      body: JSON.stringify({ prompt })
    });
    if (!response.ok) {
      throw new Error(`HTTP ${response.status}`);
    }
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = ""; // Holds a partial event between reads
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters intact when split across reads
      buffer += decoder.decode(value, { stream: true });
      // Events end with a blank line; one read may hold several events or part of one
      const events = buffer.split("\n\n");
      buffer = events.pop(); // Last piece is incomplete (or empty)
      for (const rawEvent of events) {
        for (const line of rawEvent.split("\n")) {
          if (!line.startsWith("data: ")) continue;
          try {
            const event = JSON.parse(line.slice(6)); // Remove "data: " prefix
            if (event.type === "content_block_delta") {
              // Append token to response
              responseDiv.textContent += event.delta.text;
            } else if (event.type === "error") {
              responseDiv.style.color = "red";
              responseDiv.textContent += `\n\nError: ${event.error}`;
            } else if (event.type === "message_stop") {
              console.log("Stream completed");
            }
          } catch (e) {
            console.error("Failed to parse event:", e);
          }
        }
      }
    }
  } catch (error) {
    console.error("Stream error:", error);
    // Keep any partial output and append the error below it
    responseDiv.textContent += `\n\nError: ${error.message}`;
  }
}
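One detail deserves care in any SSE client: network reads do not line up with event boundaries, so a single read may contain several events, half an event, or half of a multi-byte character. The sketch below shows one way to buffer for that. It is written in Python to match the server examples, and the same logic applies to a browser read loop. The class name `SSEParser` is hypothetical, and the sketch assumes events carry JSON in `data:` lines and use `\n` line endings.

```python
import codecs
import json


class SSEParser:
    """Incrementally extract JSON payloads from an SSE byte stream (illustrative sketch)."""

    def __init__(self) -> None:
        # The incremental decoder holds back partial multi-byte UTF-8 sequences between feeds
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""

    def feed(self, chunk: bytes) -> list:
        """Add raw bytes and return the payloads of every message completed so far."""
        self._buffer += self._decoder.decode(chunk)
        events = []
        # A message is complete only once its terminating blank line has arrived
        while "\n\n" in self._buffer:
            raw, self._buffer = self._buffer.split("\n\n", 1)
            data_lines = [
                line[5:].removeprefix(" ")  # drop "data:" and one optional space
                for line in raw.split("\n")
                if line.startswith("data:")
            ]
            if data_lines:
                events.append(json.loads("\n".join(data_lines)))
        return events


if __name__ == "__main__":
    parser = SSEParser()
    print(parser.feed(b'data: {"type": "content_block_'))  # incomplete, nothing emitted yet
    print(parser.feed(b'delta", "delta": {"text": "Hi"}}\n\n'))
```

Whatever remains in the buffer after a call is the start of the next event, which is why the parser keeps state between calls instead of parsing each chunk in isolation.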
WebSockets for Bidirectional Streaming
For chat applications where the user can interrupt, edit, or send follow-up messages while the AI is still responding, WebSockets are better than SSE because they allow two-way communication over a single connection.
# FastAPI WebSocket: bidirectional chat streaming
import json

from fastapi import WebSocket, WebSocketDisconnect

# get_tenant_from_websocket() and stream_llm_response() are application helpers
# that are not shown here; `app` is the FastAPI instance created earlier.


@app.websocket("/ws/chat/{user_id}")
async def websocket_chat(websocket: WebSocket, user_id: str):
    """Bidirectional chat with streaming responses."""
    await websocket.accept()
    # Verify tenant context (extract from WebSocket session)
    tenant_id = get_tenant_from_websocket(websocket)
    if not tenant_id:
        await websocket.close(code=1008)  # 1008 = policy violation
        return
    try:
        while True:
            # Receive message from client
            data = await websocket.receive_text()
            message = json.loads(data)
            if message.get("type") == "user_message":
                prompt = message["content"]
                # Start streaming response.
                # Note: no client messages are read while this loop runs, so
                # interruption needs the stream to run in a separate task.
                async with stream_llm_response(prompt) as stream:
                    async for token in stream:
                        # Send token to client
                        await websocket.send_json({"type": "token", "content": token})
                # Signal completion
                await websocket.send_json({"type": "done"})
    except WebSocketDisconnect:
        # Clean up: cancel any in-flight LLM requests
        print(f"Client {user_id} disconnected")
    except Exception as e:
        await websocket.send_json({"type": "error", "error": str(e)})
        await websocket.close(code=1011)  # 1011 = internal error
Streaming Strategy Comparison
| Approach | Best For | Complexity | Latency | Error Recovery |
|---|---|---|---|---|
| SSE (Server-Sent Events) | Request-response streams | Low | <100 ms first token | Client reconnects and re-sends |
| WebSocket | Bidirectional chat | Medium | <100 ms first token | Server retains connection context |
| Long-polling | Legacy browsers, no WebSocket | High | 100+ ms per poll | Client retries the failed poll; simple but inefficient |
Error Handling in Streams
Mid-stream errors are tricky: the client has already received partial data. Never close the connection silently.
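# Server: graceful error handling in stream
import json
import logging

from anthropic import RateLimitError

logger = logging.getLogger(__name__)


def sse_event(payload):
    """Frame a JSON payload as one SSE message."""
    return f"data: {json.dumps(payload)}\n\n"


async def generate_stream_with_recovery(llm_stream):
    """Stream text from llm_stream (an async iterator of str) with in-band error reporting."""
    try:
        # Start streaming
        async for token in llm_stream:
            yield sse_event({
                "type": "content_block_delta",
                "delta": {"type": "text_delta", "text": token},
            })
    except RateLimitError:
        # Send error event and suggest retry
        yield sse_event({
            "type": "error",
            "error": "Rate limited. Please try again in 60 seconds.",
            "recoverable": True,
        })
    except Exception as e:
        # Log error for debugging
        logger.error(f"Stream error: {e}", exc_info=True)
        # Report the error before the stream ends (partial response is better than nothing)
        yield sse_event({
            "type": "error",
            "error": "LLM service error. Your partial response is above.",
            "recoverable": False,
        })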
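On the receiving side, the consumer has to decide what to do with the text it already holds when an error event arrives. The following is a minimal sketch of that bookkeeping, assuming events have already been decoded into dictionaries using the `content_block_delta`, `error`, and `message_stop` types from the examples above. The names `StreamResult` and `collect_events` are hypothetical, and treating a stream that ends without `message_stop` as retryable is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StreamResult:
    text: str                    # everything received before the stream ended
    completed: bool              # True only if message_stop was seen
    error: Optional[str] = None
    recoverable: bool = False    # whether a retry is worth offering


def collect_events(events: Iterable[dict]) -> StreamResult:
    """Fold decoded stream events into the partial text plus a final status."""
    parts = []
    for event in events:
        kind = event.get("type")
        if kind == "content_block_delta":
            parts.append(event["delta"]["text"])
        elif kind == "error":
            # Keep the partial text and stop reading; later events are ignored
            return StreamResult("".join(parts), False, event.get("error"),
                                bool(event.get("recoverable", False)))
        elif kind == "message_stop":
            return StreamResult("".join(parts), True)
    # Assumption: a stream that ends with neither signal was cut off and can be retried
    return StreamResult("".join(parts), False, "stream ended unexpectedly", True)
```

A UI built on this can show `text` in every case and offer a retry button only when `recoverable` is true.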
Key Takeaways
- Use SSE for simple request-response streams; use WebSockets for interactive chat where the client needs two-way communication.
- Stream tokens as soon as they arrive from the LLM; do not buffer or batch tokens, as this increases perceived latency.
- Handle mid-stream errors gracefully: send an explicit error event before the stream ends rather than closing the connection silently, so the client can keep the partial output and tell the user what happened.
- On the client, update the DOM incrementally as tokens arrive; do not wait for the complete response before rendering.
- Add a timeout for streaming requests (e.g., 5 minutes) to prevent orphaned connections from consuming server resources.
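The timeout in the last takeaway can be sketched as a wrapper around any async token iterator. This is one possible approach using only `asyncio`; the name `with_deadline` is hypothetical, and the 5-minute figure from the list would be passed as `total_seconds=300`.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def with_deadline(source: AsyncIterator[T], total_seconds: float) -> AsyncIterator[T]:
    """Yield items from `source`, raising asyncio.TimeoutError once total_seconds have elapsed."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total_seconds
    iterator = source.__aiter__()
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise asyncio.TimeoutError
        try:
            # wait_for cancels the pending read if the deadline passes first
            item = await asyncio.wait_for(iterator.__anext__(), timeout=remaining)
        except StopAsyncIteration:
            return
        yield item
```

A streaming handler would iterate over `with_deadline(tokens, 300)` instead of `tokens` and turn the timeout into an error event, so the client learns why the stream stopped.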
Frequently Asked Questions
How do I handle user interruption while the LLM is streaming?
Send a cancellation message from the client (e.g., {"type": "cancel"}). On the server, run the stream in its own asyncio task and call task.cancel() when the cancel message arrives; this raises asyncio.CancelledError inside the stream generator, and any async context manager around the provider call closes the upstream connection on the way out. Whether the provider stops generating (and billing) once that connection closes varies by provider, so confirm the behavior in its documentation.
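A minimal sketch of that pattern follows, written against plain callables so it does not depend on a particular framework. The names `stream_with_cancel`, `send`, and `receive` are hypothetical; with a FastAPI WebSocket, `send` and `receive` would plausibly be `websocket.send_json` and `websocket.receive_json`. The sketch discards any non-cancel messages that arrive while streaming, which a real chat handler would need to queue instead.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_with_cancel(
    tokens: AsyncIterator[str],
    send: Callable[[dict], Awaitable[None]],
    receive: Callable[[], Awaitable[dict]],
) -> None:
    """Forward tokens through `send` until done, or until a {"type": "cancel"} message arrives."""

    async def pump() -> None:
        async for token in tokens:
            await send({"type": "token", "content": token})
        await send({"type": "done"})

    async def wait_for_cancel() -> None:
        while True:
            message = await receive()
            if message.get("type") == "cancel":
                return

    pump_task = asyncio.create_task(pump())
    cancel_task = asyncio.create_task(wait_for_cancel())
    done, pending = await asyncio.wait(
        {pump_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Wait for cancelled tasks to unwind so their cleanup code has run
    await asyncio.gather(*pending, return_exceptions=True)
    if pump_task in done:
        pump_task.result()  # re-raise any error from the stream
    else:
        await send({"type": "cancelled"})
```

Running the token pump and the cancel listener as separate tasks is what lets the server react to the client while a response is still being generated.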
Should I buffer tokens for throughput, or send each token immediately?
Send each token immediately. Buffering introduces latency (users see the response slower) and complexity (you need to decide buffer size). The network layer already buffers; let it do its job.
What happens if the user closes their browser tab mid-stream?
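The server is notified that the connection has closed. With WebSockets, this typically surfaces as a WebSocketDisconnect exception on the next receive. With SSE over plain HTTP, the framework reports the disconnect to the application (in Starlette and FastAPI, `await request.is_disconnected()` returns True), or the next write fails with a broken pipe error. In either case, clean up any in-flight LLM requests and log the cancellation for analytics.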
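For the SSE case, one way to stop early is to check for disconnection between tokens and close the upstream generator when the client is gone. The sketch below takes the check as a callable so it stays framework-neutral; the name `stream_until_disconnect` is hypothetical, and Starlette's `Request.is_disconnected` is an example of a coroutine function with the expected shape.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_until_disconnect(
    tokens: AsyncIterator[str],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncIterator[str]:
    """Yield tokens until the source is exhausted or the client has gone away."""
    try:
        async for token in tokens:
            if await is_disconnected():
                break
            yield token
    finally:
        # Close the upstream generator so its own cleanup (for example,
        # releasing the provider connection) runs promptly
        aclose = getattr(tokens, "aclose", None)
        if aclose is not None:
            await aclose()
```

The `finally` block also runs if the framework itself cancels or closes this generator, so the upstream cleanup does not depend on the explicit check firing.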
How do I prevent a slow client from causing server backpressure?
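A slow client stops draining its receive buffer, which eventually makes the server's writes to that connection wait. In an async server that awaits each send, the handler stops pulling tokens from the LLM while the client catches up, so most streaming libraries apply backpressure automatically. If your code queues tokens between the LLM reader and the socket writer, bound that queue so the server only produces data when there is space, and time out clients that stay blocked for too long.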
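A bounded queue between the LLM reader and the socket writer is one way to express this. The sketch below is illustrative only: `relay_with_backpressure` and `max_pending` are hypothetical names, and `send` stands in for whatever coroutine writes to the client.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

_DONE = object()  # sentinel marking the end of the stream


async def relay_with_backpressure(
    tokens: AsyncIterator[str],
    send: Callable[[str], Awaitable[None]],
    max_pending: int = 32,
) -> None:
    """Relay tokens to `send`, queuing at most `max_pending` unsent tokens."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)

    async def produce() -> None:
        try:
            async for token in tokens:
                # put() suspends while the queue is full, pausing reads from the LLM
                await queue.put(token)
            await queue.put(_DONE)
        except Exception as exc:
            # Hand upstream failures to the consumer instead of leaving it waiting
            await queue.put(exc)

    producer = asyncio.create_task(produce())
    try:
        while True:
            item = await queue.get()
            if item is _DONE:
                break
            if isinstance(item, Exception):
                raise item
            await send(item)
    finally:
        producer.cancel()
        await asyncio.gather(producer, return_exceptions=True)
```

With a slow `send`, the producer runs at most a few tokens ahead of the consumer, so memory per connection stays capped no matter how fast the model generates.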