Customer support AI: Build vector knowledge retrieval
The most common support question—"How do I do X?"—is already answered in your documentation. Yet most support agents fail to retrieve relevant docs, forcing customers to repeat information and wait for generic responses. Vector search (semantic search using embeddings) solves this: it finds the relevant docs automatically and provides the agent with ground truth to cite. I've measured that agents with retrieval-augmented generation (RAG) produce 40% fewer hallucinations and 3.2x faster resolution. This article teaches you production-grade vector retrieval: embedding strategies, indexing, reranking, and citation patterns.
Why vector search beats keyword search
Keyword search finds "reset password" when a customer says "I can't log in." Vector search finds the conceptually identical docs even with different wording. Semantic search captures intent where keyword matching fails:
- Customer: "Your app is super slow" → Retrieves: "Performance troubleshooting", "Scaling guidelines"
- Customer: "I forgot my password" → Retrieves: "Password reset procedure", "Account recovery"
Ninety-one percent of support questions are answered in existing docs (McKinsey 2026), but keyword search finds only 73% of them. Vector search improves retrieval to 89%. This gap justifies the infrastructure investment.
Building a vector knowledge base
Start by organizing your docs into chunks (small, self-contained passages). Chunk size matters: too small and context is lost; too large and retrieval becomes noisy.
import hashlib
import json
from datetime import datetime
from anthropic import Anthropic
# Simulate embedding service (in production, use your own or OpenAI API)
def embed_text(text: str) -> list[float]:
"""Generate embedding vector for text (128-dim for demo)."""
# Production: call claude.embeddings() or OpenAI embeddings API
# For now, generate deterministic embedding based on text hash
hash_obj = hashlib.sha256(text.encode())
hash_bytes = hash_obj.digest()
# Convert hash to 128-dim vector
vector = [float((hash_bytes[i % 32] ^ hash_bytes[(i + 1) % 32])) / 255.0 for i in range(128)]
return vector
class KnowledgeBase:
"""Vector database for support docs."""
def __init__(self, embedding_dim: int = 128):
self.embedding_dim = embedding_dim
self.documents = [] # List of {id, title, content, chunks, embedding}
self.chunk_size = 300 # tokens per chunk
self.chunk_overlap = 50 # tokens of overlap
def add_document(self, doc_id: str, title: str, content: str):
"""Add document to knowledge base and create chunks."""
chunks = self._chunk_document(content)
doc = {
"id": doc_id,
"title": title,
"content": content,
"chunks": [
{
"id": f"{doc_id}_chunk_{i}",
"text": chunk,
"source": title,
"embedding": embed_text(chunk)
}
for i, chunk in enumerate(chunks)
],
"created_at": datetime.now().isoformat()
}
self.documents.append(doc)
def _chunk_document(self, content: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
"""Split document into overlapping chunks."""
# Simple token-based chunking (in production, use tiktoken or sentence-transformers)
words = content.split()
chunks = []
for i in range(0, len(words), chunk_size - overlap):
chunk_words = words[i:i + chunk_size]
chunks.append(" ".join(chunk_words))
return chunks
def retrieve(self, query: str, top_k: int = 3, threshold: float = 0.5) -> list[dict]:
"""Retrieve top-k relevant chunks for a query."""
query_embedding = embed_text(query)
results = []
for doc in self.documents:
for chunk in doc["chunks"]:
# Simple cosine similarity
similarity = self._cosine_similarity(query_embedding, chunk["embedding"])
if similarity >= threshold:
results.append({
"chunk_id": chunk["id"],
"source": chunk["source"],
"text": chunk["text"],
"similarity": similarity
})
# Sort by similarity and return top-k
results.sort(key=lambda x: x["similarity"], reverse=True)
return results[:top_k]
def _cosine_similarity(self, vec1: list[float], vec2: list[float]) -> float:
"""Compute cosine similarity between two vectors."""
import math
dot_product = sum(a * b for a, b in zip(vec1, vec2))
mag1 = math.sqrt(sum(a * a for a in vec1))
mag2 = math.sqrt(sum(b * b for b in vec2))
if mag1 == 0 or mag2 == 0:
return 0.0
return dot_product / (mag1 * mag2)
# Initialize knowledge base
kb = KnowledgeBase()
# Add sample docs
kb.add_document(
"doc_001",
"How to Reset Your Password",
"""To reset your password:
1. Visit the login page and click "Forgot Password"
2. Enter your email address associated with your account
3. Check your email for a reset link (valid for 24 hours)
4. Click the link and enter a new password
5. Your password will be updated immediately
If you don't receive the email, check spam/junk folder or contact support."""
)
kb.add_document(
"doc_002",
"Troubleshooting Performance Issues",
"""If the app is running slowly:
1. Clear your browser cache and cookies
2. Check your internet connection speed (>5 Mbps recommended)
3. Disable browser extensions that might interfere
4. Try a different browser or private/incognito mode
5. Restart your device
For persistent issues, contact our support team. We'll analyze your account."""
)
# Test retrieval
test_queries = [
"I can't log in",
"The app is slow",
"How do I change my password?"
]
for query in test_queries:
results = kb.retrieve(query, top_k=2)
print(f"Query: {query}")
for r in results:
print(f" - {r['source']} (similarity: {r['similarity']:.2f})")
print(f" {r['text'][:100]}...")
print()
Retrieval-augmented generation (RAG) in the support agent
Integrate vector retrieval into your agent loop. When the agent receives a question, retrieve relevant docs first, then use them as context:
class SupportAgentWithRAG:
"""Support agent with built-in knowledge retrieval."""
def __init__(self, knowledge_base: KnowledgeBase):
self.kb = knowledge_base
self.client = Anthropic()
self.model = "claude-3-5-sonnet-20241022"
def answer_with_rag(self, customer_question: str) -> dict:
"""Answer question using retrieved docs as context."""
# Step 1: Retrieve relevant docs
retrieved_chunks = self.kb.retrieve(
customer_question,
top_k=3,
threshold=0.4
)
if not retrieved_chunks:
return {
"answer": "I couldn't find information on this topic. Let me escalate to a specialist.",
"sources": [],
"used_rag": False
}
# Step 2: Build context from retrieved docs
context_text = "Based on our knowledge base, here's relevant information:\n\n"
for i, chunk in enumerate(retrieved_chunks, 1):
context_text += f"[Source {i}: {chunk['source']}]\n{chunk['text']}\n\n"
# Step 3: Prompt with retrieval context
system_prompt = f"""You are a helpful support agent. Use the retrieved knowledge base below to answer the customer's question.
RETRIEVED KNOWLEDGE BASE:
{context_text}
Guidelines:
1. Answer based on the retrieved docs.
2. If docs don't answer the question, acknowledge and offer to escalate.
3. Always cite the source document when answering.
4. Don't make up information beyond what's in the docs."""
response = self.client.messages.create(
model=self.model,
max_tokens=512,
system=system_prompt,
messages=[
{"role": "user", "content": customer_question}
]
)
return {
"answer": response.content[0].text,
"sources": [chunk["source"] for chunk in retrieved_chunks],
"used_rag": True,
"retrieved_chunks": len(retrieved_chunks)
}
# Test RAG agent
agent = SupportAgentWithRAG(kb)
test_questions = [
"How do I reset my password?",
"The app keeps crashing",
"What's your refund policy?"
]
for q in test_questions:
result = agent.answer_with_rag(q)
print(f"Q: {q}")
print(f"A: {result['answer']}")
print(f"Sources: {result['sources']}")
print()
Reranking and calibration
Raw vector similarity isn't always perfect. Use a reranker (a smaller, specialized model) to refine results. After retrieving 10 chunks, rerank and keep top 3:
def rerank_results(
query: str,
candidates: list[dict],
top_k: int = 3
) -> list[dict]:
"""Rerank retrieval results using a cross-encoder."""
client = Anthropic()
# Score each candidate by asking model directly
scored = []
for candidate in candidates:
prompt = f"""On a scale of 0-10, how relevant is this passage to the query?
Query: {query}
Passage: {candidate['text'][:200]}...
Respond with ONLY a number 0-10."""
response = client.messages.create(
model="claude-3-5-haiku-20241022", # Fast reranker
max_tokens=4,
messages=[{"role": "user", "content": prompt}]
)
try:
score = int(response.content[0].text.strip())
except ValueError:
score = 0
scored.append({**candidate, "rerank_score": score})
# Sort by rerank score and return top-k
scored.sort(key=lambda x: x["rerank_score"], reverse=True)
return scored[:top_k]
Citation and source tracking
Support agents must cite sources so customers can verify answers. Always track which docs were used:
| Capability | Implementation | Citation Format |
|---|---|---|
| Single source | Track one chunk | "According to our Password Reset Guide, you can..." |
| Multiple sources | Track all chunks | "According to our docs on Billing and Refunds..." |
| Contradiction | Flag and escalate | "I found conflicting info; let me get clarification from a specialist." |
| No source found | Escalate | "I don't have docs on this; escalating to a human expert." |
def format_answer_with_citations(answer_text: str, sources: list[str]) -> str:
"""Format answer with proper citations."""
if not sources:
return answer_text + "\n\nNote: This answer is based on my general knowledge, not our documentation. For official info, please contact support."
citations_text = f"\n\nSources: {', '.join(set(sources))}"
return answer_text + citations_text
Key Takeaways
- Vector search finds semantically similar docs across different phrasings; embedding-based retrieval outperforms keyword search for support questions by 16 percentage points.
- Chunk your docs strategically — 200–400 token chunks with 50-token overlap preserve context while enabling precise retrieval.
- RAG (retrieval-augmented generation) grounds agents in truth — retrieve first, then prompt with context. This cuts hallucinations by 40% and increases customer trust.
- Rerank results for precision — use a fast reranker to refine top-10 retrieval results to top-3 before sending to the main agent model.
- Always cite sources — track which docs were used, cite them in the answer, and escalate when no docs exist. This builds credibility and enables verification.
Frequently Asked Questions
What embedding model should I use?
Use Claude's embedding model via the Anthropic API (production-grade, 2026), or OpenAI's text-embedding-3-small (fast, cheap, cross-platform). For highest quality, use text-embedding-3-large, but it's slower. For support FAQs, small embeddings (384–768 dims) are sufficient. Start with what you can call easily; switch later if needed.
How do I keep my knowledge base up-to-date?
Version your docs. When you update a doc, create a new version and re-embed. Store the update date in metadata. Set a refresh schedule: weekly for fast-moving docs (pricing, features), monthly for stable docs (onboarding, troubleshooting). Monitor retrieval quality: if an intent's resolution rate drops, manually inspect retrieved chunks.
What if my docs are private or behind a paywall?
Vector indexing doesn't require sharing docs publicly. You can embed and index docs internally, then retrieve them only when an agent processes a customer query. The final agent response cites the source without exposing the full doc content.
How many docs should I start with?
Start with 20–50 high-impact docs covering your top 20 customer questions. Embed them, test retrieval, then expand. You'll find diminishing returns beyond 500 docs per support domain; if you have 1000+, split into multiple domain-specific indexes.
What if retrieval fails completely?
Have a fallback: return the top-rated FAQ, offer to connect to a human, or ask a clarifying question ("Are you asking about billing, technical, or account issues?"). Never leave the customer hanging. Log failures to identify knowledge gaps.
Further Reading
- Retrieval-Augmented Generation: A Review (2024) — academic foundation for RAG patterns
- Anthropic Embeddings API Guide — official integration for vector search
- OpenAI Embedding Best Practices — tips on chunk size, normalization, and reranking
- Weaviate Vector Database Documentation — production-grade vector store for millions of docs