Building a Chat Interface for Your RAG System
A RAG system is invisible without a user interface. The chat interface is where users interact with your knowledge base: typing queries, reading answers, asking follow-ups, and accessing citations. Building an effective chat UI requires attention to streaming responses (showing text as it arrives), conversation context management (remembering previous queries), citation display (linking answers to sources), and error handling (gracefully handling retrieval failures). This article covers the full stack: backend message handling, frontend streaming, and UX patterns for knowledge-base chat.
The Chat Architecture: Backend to Frontend
A typical RAG chat system has three components:
- Backend API: Accepts messages, runs retrieval and generation, returns streamed responses.
- Frontend UI: Renders messages, displays citations, handles user input.
- State Management: Tracks conversation history, manages context for follow-up queries.
Here is a minimal backend API using FastAPI:
import asyncio
import json
from typing import AsyncIterator

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()  # Reads OPENAI_API_KEY from the environment


class ConversationState:
    """In-memory conversation history (use a database in production)."""

    def __init__(self):
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_context(self, max_turns: int = 5):
        """Return the last N turns (one turn = user message + assistant reply)."""
        return self.messages[-max_turns * 2:]


conversations = {}  # In production, key by a session ID or the user ID from a JWT
async def retrieve_and_stream(
    query: str,
    conversation_id: str,
    retriever
) -> AsyncIterator[str]:
    """Retrieve documents and stream generation."""
    # Step 1: Retrieve relevant documents and number them so the model
    # can cite them as [1], [2], ...
    retrieved_docs = retriever.search(query, k=5)
    context = "\n\n".join(
        f"[{i}] {doc['text']}" for i, doc in enumerate(retrieved_docs, start=1)
    )

    # Step 2: Build prompt with conversation history. If the current query
    # was already stored, drop it so it is not sent to the model twice.
    conversation = conversations.setdefault(conversation_id, ConversationState())
    history_context = conversation.get_context()
    if history_context and history_context[-1] == {"role": "user", "content": query}:
        history_context = history_context[:-1]

    # Step 3: Stream response from LLM
    system_prompt = """You are a helpful assistant answering questions from a knowledge base.
Always cite your sources using [1], [2], etc., corresponding to the numbered documents below.
Keep responses concise (under 200 words) unless asked for more detail."""
    messages = [
        {"role": "system", "content": system_prompt},
        *history_context,  # Previous conversation turns
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
    ]

    # Stream the response using the OpenAI chat completions API
    answer_parts = []
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=500,
        messages=messages,
        temperature=0.2,
        stream=True
    )
    async for chunk in stream:
        # Some chunks carry no text (for example the final one), so skip them
        if chunk.choices and chunk.choices[0].delta.content:
            text = chunk.choices[0].delta.content
            answer_parts.append(text)
            yield text  # Stream individual chunks to client

    # Store the assistant reply so follow-up questions can see it
    conversation.add_message("assistant", "".join(answer_parts))

    # Append citations metadata as a trailing marker for the client to parse
    citations = [
        {
            "index": i + 1,
            "source": doc["source"],
            "url": doc.get("url", ""),
            "snippet": doc["text"][:150]
        }
        for i, doc in enumerate(retrieved_docs)
    ]
    yield f"\n\n<!-- CITATIONS: {json.dumps(citations)} -->"


@app.post("/chat")
async def chat(message: dict):
    """Handle chat messages with streaming."""
    query = str(message.get("message", "")).strip()
    if not query:
        raise HTTPException(status_code=400, detail="Message must not be empty")
    conversation_id = message.get("conversation_id", "default")

    # Store user message
    conversation = conversations.setdefault(conversation_id, ConversationState())
    conversation.add_message("user", query)

    return StreamingResponse(
        # your_retriever is a placeholder for your own retriever instance
        retrieve_and_stream(query, conversation_id, retriever=your_retriever),
        media_type="text/plain"
    )


@app.get("/conversation/{conversation_id}")
async def get_conversation(conversation_id: str):
    """Retrieve full conversation history."""
    conversation = conversations.get(conversation_id)
    if not conversation:
        raise HTTPException(status_code=404, detail="Conversation not found")
    return {"messages": conversation.messages}
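The backend above assumes a retriever object whose search(query, k) method returns a list of dicts with "text" and "source" keys and an optional "url" key. The article does not define it, so the following is only a minimal sketch of that interface. KeywordRetriever is a hypothetical class that ranks documents by term overlap; a real system would more likely use vector search, but any object with the same search signature should fit the code above.

```python
import re


class KeywordRetriever:
    """Toy in-memory retriever (hypothetical; stands in for a vector store)."""

    def __init__(self, docs: list[dict]):
        # Each doc is a dict with "text" and "source", plus an optional "url".
        self.docs = docs

    @staticmethod
    def _tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def search(self, query: str, k: int = 5) -> list[dict]:
        """Return up to k docs that share at least one term with the query."""
        query_terms = self._tokens(query)
        scored = []
        for doc in self.docs:
            overlap = len(query_terms & self._tokens(doc["text"]))
            if overlap:
                scored.append((overlap, doc))
        # Highest overlap first; the sort is stable, so ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:k]]


docs = [
    {"text": "Refunds are issued within 14 days.", "source": "refund-policy.md"},
    {"text": "Shipping takes 3 to 5 business days.", "source": "shipping.md"},
]
your_retriever = KeywordRetriever(docs)
```

Returning an empty list when nothing matches lets the caller decide how to respond, for example by telling the user that the knowledge base has no relevant content.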
Frontend: React Chat Component
Here is a React component that displays the chat and streams responses:
import React, { useState, useRef, useEffect } from 'react';

export default function RAGChat() {
  const [messages, setMessages] = useState([]);
  const [input, setInput] = useState("");
  const [isLoading, setIsLoading] = useState(false);
  const messagesEndRef = useRef(null);

  const scrollToBottom = () => {
    messagesEndRef.current?.scrollIntoView({ behavior: "smooth" });
  };

  useEffect(() => {
    scrollToBottom();
  }, [messages]);

  // Update the current assistant message (or create it) without mutating state
  const updateAssistant = patch =>
    setMessages(prev => {
      const last = prev[prev.length - 1];
      return last?.role === "assistant"
        ? [...prev.slice(0, -1), { ...last, ...patch }]
        : [...prev, { role: "assistant", content: "", citations: [], ...patch }];
    });

  const handleSendMessage = async () => {
    const query = input.trim();
    if (!query || isLoading) return;

    // Add user message to UI immediately
    setMessages(prev => [...prev, { role: "user", content: query }]);
    setInput("");
    setIsLoading(true);

    try {
      const response = await fetch("/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          message: query,
          conversation_id: "default"
        })
      });
      if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
      }

      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let buffer = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        // stream: true keeps multi-byte characters intact across chunks
        buffer += decoder.decode(value, { stream: true });

        // The citations marker is sent at the end and may be split across
        // chunks, so always work from the accumulated buffer
        const markerStart = buffer.indexOf("<!-- CITATIONS:");
        const text = markerStart === -1 ? buffer : buffer.slice(0, markerStart);
        updateAssistant({ content: text.trimEnd() });
      }
      buffer += decoder.decode(); // Flush any remaining bytes

      // Attach citations to this message only, not to every assistant message
      const citationMatch = buffer.match(/<!-- CITATIONS: (.*) -->/);
      if (citationMatch) {
        updateAssistant({ citations: JSON.parse(citationMatch[1]) });
      }
    } catch (error) {
      console.error("Error:", error);
      updateAssistant({ content: "Sorry, an error occurred. Please try again." });
    } finally {
      setIsLoading(false);
    }
  };

  // Show the loading indicator only until the first chunk arrives
  const waitingForFirstChunk =
    isLoading && messages[messages.length - 1]?.role === "user";

  return (
    <div style={{ maxWidth: "800px", margin: "0 auto", height: "100vh", display: "flex", flexDirection: "column" }}>
      {/* Messages */}
      <div style={{ flex: 1, overflowY: "auto", padding: "16px", backgroundColor: "#f5f5f5" }}>
        {messages.map((msg, idx) => (
          <div
            key={idx}
            style={{
              marginBottom: "12px",
              padding: "12px",
              borderRadius: "8px",
              backgroundColor: msg.role === "user" ? "#007bff" : "#e9ecef",
              color: msg.role === "user" ? "white" : "black",
              textAlign: msg.role === "user" ? "right" : "left"
            }}
          >
            <p>{msg.content}</p>
            {/* Show citations for assistant messages */}
            {msg.role === "assistant" && msg.citations?.length > 0 && (
              <div style={{ marginTop: "12px", fontSize: "0.85em", borderTop: "1px solid #ccc", paddingTop: "8px" }}>
                <strong>Sources:</strong>
                {msg.citations.map(cite => (
                  <div key={cite.index} style={{ marginTop: "4px" }}>
                    {cite.url ? (
                      <a href={cite.url} target="_blank" rel="noopener noreferrer">
                        [{cite.index}] {cite.source}
                      </a>
                    ) : (
                      <span>[{cite.index}] {cite.source}</span>
                    )}
                    <p style={{ margin: "4px 0", fontSize: "0.9em", color: "#666" }}>
                      {cite.snippet}...
                    </p>
                  </div>
                ))}
              </div>
            )}
          </div>
        ))}
        {waitingForFirstChunk && (
          <div style={{ padding: "12px", textAlign: "center", color: "#999" }}>
            Thinking...
          </div>
        )}
        <div ref={messagesEndRef} />
      </div>
      {/* Input */}
      <div style={{ padding: "16px", borderTop: "1px solid #ddd", backgroundColor: "white" }}>
        <div style={{ display: "flex", gap: "8px" }}>
          <input
            type="text"
            value={input}
            onChange={e => setInput(e.target.value)}
            onKeyDown={e => e.key === "Enter" && handleSendMessage()}
            placeholder="Ask a question..."
            style={{
              flex: 1,
              padding: "10px",
              borderRadius: "4px",
              border: "1px solid #ddd"
            }}
          />
          <button
            onClick={handleSendMessage}
            disabled={isLoading || !input.trim()}
            style={{
              padding: "10px 20px",
              borderRadius: "4px",
              backgroundColor: "#007bff",
              color: "white",
              border: "none",
              cursor: "pointer"
            }}
          >
            Send
          </button>
        </div>
      </div>
    </div>
  );
}
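The backend sends citations as a trailing `<!-- CITATIONS: ... -->` marker, which can arrive split across network chunks or share a chunk with the end of the answer. A client is therefore safest parsing the accumulated stream rather than individual chunks. Below is a minimal Python sketch of that parsing, which can also help when testing the backend without a browser. The name split_answer_and_citations is hypothetical, and the marker format is assumed to match the backend code shown earlier.

```python
import json
import re
from typing import Iterable

# Matches the trailing marker the backend appends after the answer text.
MARKER = re.compile(r"\n\n<!-- CITATIONS: (.*) -->\s*$", re.DOTALL)


def split_answer_and_citations(chunks: Iterable[str]) -> tuple[str, list[dict]]:
    """Join streamed chunks, then separate answer text from citation metadata."""
    buffer = "".join(chunks)
    match = MARKER.search(buffer)
    if not match:
        # No marker was sent (for example, the stream ended early).
        return buffer, []
    return buffer[: match.start()], json.loads(match.group(1))


# The marker is deliberately split across two chunks here.
chunks = [
    "Refunds take 14 days [1].",
    "\n\n<!-- CITA",
    'TIONS: [{"index": 1, "source": "refund-policy.md"}] -->',
]
answer, citations = split_answer_and_citations(chunks)
```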
Context Management and Follow-Ups
A critical feature is handling follow-up questions that reference previous messages:
def build_context_aware_prompt(
    current_query: str,
    conversation_history: list[dict],
    retrieved_docs: list[dict]
) -> list[dict]:
    """Build a prompt that includes conversation context.

    conversation_history holds {"role", "content"} dicts and must not yet
    include current_query.
    """
    # Condense the previous turn (truncated, not a true summary) to save tokens
    context_summary = None
    if len(conversation_history) >= 2:
        prev_user, prev_assistant = conversation_history[-2:]
        if prev_user["role"] == "user" and prev_assistant["role"] == "assistant":
            context_summary = (
                f"Previous question: {prev_user['content']}\n"
                f"Previous answer: {prev_assistant['content'][:200]}...\n"
            )

    # Build messages list
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant. Remember previous questions and answers in this conversation."
        }
    ]
    if context_summary:
        messages.append({"role": "system", "content": context_summary})

    # Add retrieved context
    context_text = "\n\n".join([doc["text"] for doc in retrieved_docs])
    messages.append({
        "role": "user",
        "content": f"Context:\n{context_text}\n\nQuestion: {current_query}"
    })
    return messages
Error Handling and User Feedback
Users need to understand what went wrong:
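import logging

logger = logging.getLogger(__name__)


async def stream_with_fallback(query: str, conversation_id: str, retriever):
    """Turn errors raised during streaming into user-facing messages.

    The generator only runs after the response has started, so a try/except
    around StreamingResponse(...) itself would never see these errors.
    """
    try:
        async for chunk in retrieve_and_stream(query, conversation_id, retriever):
            yield chunk
    except ValueError:
        # Retrieval error (assumes the retriever raises ValueError on failure)
        yield (
            "No relevant documents found for your query. "
            "Try rephrasing your question or check if the knowledge base contains relevant content."
        )
    except Exception:
        # Unexpected error: log the details, show the user a generic message
        logger.exception("Chat request failed")
        yield "An unexpected error occurred. Please try again."


async def chat_with_fallback(message: dict):
    """Chat with graceful error handling."""
    return StreamingResponse(
        stream_with_fallback(message["message"], "default", your_retriever),
        media_type="text/plain"
    )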
Key Takeaways
- Stream responses from the backend so users see text arriving in real time, which improves perceived performance.
- Manage conversation history in the backend to enable context-aware follow-ups.
- Display citations alongside answers so users can verify sources.
- Handle errors gracefully, giving users actionable feedback.
Frequently Asked Questions
Should I store full conversation history or summarize?
For short conversations (under 10 turns), store everything. For longer conversations, summarize older turns to save tokens and LLM context. A hybrid approach: store full history, but only send the last 5 turns plus a summary of earlier turns to the LLM.
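The hybrid approach can be sketched as follows. The function names are hypothetical, and truncate_summary is only a placeholder that joins and truncates older messages; a real system would more likely ask an LLM to write the summary.

```python
from typing import Callable


def truncate_summary(messages: list[dict], max_chars: int = 300) -> str:
    """Placeholder summarizer: joins older messages and truncates the result."""
    joined = " ".join(f"{m['role']}: {m['content']}" for m in messages)
    return joined[:max_chars]


def build_llm_history(
    messages: list[dict],
    max_turns: int = 5,
    summarize: Callable[[list[dict]], str] = truncate_summary,
) -> list[dict]:
    """Return a summary of older messages followed by the last N turns."""
    keep = max_turns * 2  # one turn = one user message + one assistant reply
    split = max(len(messages) - keep, 0)
    older, recent = messages[:split], messages[split:]
    if not older:
        # Short conversation: send everything, no summary needed.
        return list(recent)
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(older)}",
    }
    return [summary, *recent]


# Seven turns of history: the first two turns are summarized, the last five kept.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"m{i}"}
    for i in range(14)
]
llm_history = build_llm_history(history, max_turns=5)
```

The full history stays in storage; only the list returned here is sent to the model.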
How do I handle multiple users without sharing context?
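Use session tokens or user IDs to partition conversations. Store conversations in a database or cache (PostgreSQL, Redis) keyed by user ID, and never let the client choose a bare conversation ID that is not checked against the authenticated user. Load the user's history on login, and persist each turn as it completes rather than waiting for logout, so a crash or closed tab does not lose the conversation.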
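As a rough illustration of partitioning by user, the sketch below stores messages in SQLite from the standard library, standing in for PostgreSQL or Redis. ConversationStore and its schema are assumptions made for this example, not part of any framework.

```python
import sqlite3


class ConversationStore:
    """Conversation history partitioned by user (hypothetical helper)."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS messages (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   user_id TEXT NOT NULL,
                   conversation_id TEXT NOT NULL,
                   role TEXT NOT NULL,
                   content TEXT NOT NULL
               )"""
        )

    def add_message(self, user_id: str, conversation_id: str, role: str, content: str) -> None:
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO messages (user_id, conversation_id, role, content) "
                "VALUES (?, ?, ?, ?)",
                (user_id, conversation_id, role, content),
            )

    def get_messages(self, user_id: str, conversation_id: str) -> list[dict]:
        # Filtering on user_id as well as conversation_id keeps users isolated
        # even when two of them reuse the same conversation ID.
        rows = self.db.execute(
            "SELECT role, content FROM messages "
            "WHERE user_id = ? AND conversation_id = ? ORDER BY id",
            (user_id, conversation_id),
        ).fetchall()
        return [{"role": role, "content": content} for role, content in rows]


store = ConversationStore()
store.add_message("alice", "default", "user", "What is the refund policy?")
store.add_message("bob", "default", "user", "How long does shipping take?")
```

In a real deployment the user_id would come from the authenticated session rather than from the request body.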
Can I use a different streaming protocol than HTTP?
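Yes. WebSocket is ideal for low-latency bidirectional communication. One option is python-socketio (backend) with the Socket.IO client (frontend); note that Socket.IO runs its own protocol on top of WebSocket, so both ends must use Socket.IO libraries. Server-Sent Events (SSE), which runs over plain HTTP, is simpler and sufficient for most use cases, because chat streaming is mostly one-directional from server to client.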
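For SSE, each message is a block of "field: value" lines terminated by a blank line. The sketch below shows that framing; format_sse and citations_event are hypothetical helper names, and sending citations as a named "citations" event is one possible design, offered here as an alternative to the trailing comment marker.

```python
import json
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Events message."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload needs its own "data:" field.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


def citations_event(citations: list[dict]) -> str:
    """Send citations as a separate named event instead of an inline marker."""
    return format_sse(json.dumps(citations), event="citations")


token_message = format_sse("Refunds take 14 days [1].")
citation_message = citations_event([{"index": 1}])
```

With FastAPI, a generator yielding these strings can be returned through StreamingResponse with media_type="text/event-stream". The browser's EventSource API only issues GET requests, so a POST endpoint like /chat would need fetch-based parsing or a GET variant.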
How do I handle conversation turns that time out?
Set a timeout on your API (e.g., 30s). If retrieval or LLM generation exceeds it, return a partial response or error message. For slow queries, show a loading state and suggest the user refine their question.
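One way to enforce such a deadline on a streamed response is to wrap the generator and bound each wait with asyncio.wait_for. The sketch below assumes a single overall deadline and returns whatever was streamed so far plus a short notice. The names stream_with_timeout, slow_source, and collect are illustrative only.

```python
import asyncio
from typing import AsyncIterator


async def stream_with_timeout(
    source: AsyncIterator[str],
    timeout_s: float = 30.0,
    timeout_message: str = "\n\n[The response timed out. Try a narrower question.]",
) -> AsyncIterator[str]:
    """Yield chunks from source until it finishes or the deadline passes."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    iterator = source.__aiter__()
    while True:
        remaining = max(deadline - loop.time(), 0)
        try:
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=remaining)
        except StopAsyncIteration:
            return  # Source finished normally
        except asyncio.TimeoutError:
            yield timeout_message  # Partial response plus a notice
            return
        yield chunk


async def slow_source() -> AsyncIterator[str]:
    """Demo source whose second chunk arrives after the deadline."""
    yield "partial answer"
    await asyncio.sleep(1)
    yield " (never sent)"


async def collect(stream: AsyncIterator[str]) -> list[str]:
    return [chunk async for chunk in stream]


chunks = asyncio.run(
    collect(stream_with_timeout(slow_source(), timeout_s=0.05, timeout_message="[timeout]"))
)
```

In the backend, the same wrapper could be placed around retrieve_and_stream before handing it to StreamingResponse.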
Should citations be clickable links or just text?
Clickable links are better UX. Link each citation to the source document's URL. For internal documents, link to a preview or detail page. For external sources (web pages), link directly with target="_blank".
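Before rendering links, it can help to drop sources the answer never referenced, so users only see citations that back a specific claim. The sketch below assumes the answer uses [n] markers and that each citation dict carries an "index" key, as in the backend code above; cited_sources is a hypothetical helper name.

```python
import re


def cited_sources(answer: str, citations: list[dict]) -> list[dict]:
    """Keep only citations whose [n] marker appears in the answer text."""
    used = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return [c for c in citations if c["index"] in used]


citations = [
    {"index": 1, "source": "refund-policy.md", "url": "/docs/refund-policy"},
    {"index": 2, "source": "shipping.md", "url": "/docs/shipping"},
]
shown = cited_sources("Refunds are issued within 14 days [1].", citations)
```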
Further Reading
- Building a Production Chat Application — Anthropic's guide to chat UX.
- Streaming HTTP Responses — MDN guide to fetch streaming.
- WebSocket for Real-Time Chat — Socket.IO documentation.
- LangChain Chat Memory — LangChain's conversation memory implementations.