Query Routing in RAG: Smart Query Classification
Query routing is the practice of classifying an incoming user query and directing it to the most appropriate retrieval strategy or knowledge source before generating a response. Instead of treating all queries identically, a router decides whether a query needs database retrieval, web search, in-context reasoning, or a combination. The goal is to reduce hallucination and cut response latency, by up to 40% according to Liang et al. (2024).
What Is Query Routing and Why Does It Matter?
Query routing addresses a fundamental challenge in RAG: not all queries need the same retrieval approach. Some questions require factual retrieval from a knowledge base; others need reasoning over multiple sources; still others can be answered from a language model's training data alone. By classifying queries upfront, you avoid wasting tokens on unnecessary retrievals, reduce latency, and improve accuracy because each specialized pathway is optimized for its use case.
A query router is typically a lightweight classifier (using an LLM or a trained model) that reads the user input and assigns it to one or more routing categories. Common categories include knowledge_base, reasoning, web_search, none (answer without retrieval), or domain-specific groups like financial_data, technical_docs, customer_records.
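One way to make these categories concrete is an enum plus a dispatch table that maps each route to a retrieval handler. The sketch below is illustrative only: the `Route` enum, the `retrieve` function, and the three handler functions are hypothetical names, and the handlers return placeholder strings rather than real documents.

```python
from enum import Enum
from typing import Callable


class Route(str, Enum):
    KNOWLEDGE_BASE = "knowledge_base"
    REASONING = "reasoning"
    WEB_SEARCH = "web_search"
    NONE = "none"


# Hypothetical handlers: stand-ins for real retrievers.
def search_knowledge_base(query: str) -> list[str]:
    return [f"[kb] {query}"]


def search_web(query: str) -> list[str]:
    return [f"[web] {query}"]


def gather_for_reasoning(query: str) -> list[str]:
    # Assumption: multi-step questions draw on both sources.
    return search_knowledge_base(query) + search_web(query)


HANDLERS: dict[Route, Callable[[str], list[str]]] = {
    Route.KNOWLEDGE_BASE: search_knowledge_base,
    Route.REASONING: gather_for_reasoning,
    Route.WEB_SEARCH: search_web,
    Route.NONE: lambda query: [],  # answer without retrieval
}


def retrieve(label: str, query: str) -> list[str]:
    """Dispatch a query to the handler for the given route label.

    Raises ValueError if the label is not one of the known routes.
    """
    return HANDLERS[Route(label)](query)


print(retrieve("web_search", "GPU shortage news"))
```

Keeping the route names in one enum means the classifier prompt, the dispatch table, and any logging can share a single source of truth.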
Types of Query Routing Strategies
Classifier-Based Routing
The simplest approach uses an LLM as a zero-shot or few-shot classifier. You describe each category (and, for few-shot routing, give example queries for each) and ask the model to predict the best route. This approach is quick to set up, requires no training data, and works well for clear intent signals.
    from anthropic import Anthropic

    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    ROUTING_SYSTEM_PROMPT = """You are a query router for a RAG system.
    Classify the user query into ONE of these categories:
    - knowledge_base: factual retrieval from internal documents (financial reports, policies, etc.)
    - reasoning: multi-step logic or comparison requiring synthesis
    - web_search: current events or external information
    - none: general knowledge answerable without retrieval
    Respond with ONLY the category name, no explanation."""

    def route_query(user_query: str) -> str:
        """Route a single query to the appropriate retrieval strategy."""
        response = client.messages.create(
            model="claude-opus-4-1",
            max_tokens=10,
            system=ROUTING_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": user_query}],
        )
        # Normalize whitespace and case so the label can be compared directly
        return response.content[0].text.strip().lower()

    # Example usage
    query = "What were our Q3 2025 revenue figures by product line?"
    route = route_query(query)
    print(f"Query: {query}")
    print(f"Routed to: {route}")  # Expected: knowledge_base (model output can vary)
This approach needs just one LLM call per query and routes roughly 85–92% of queries correctly in real deployments. The key is providing clear category definitions in your system prompt and testing on representative traffic before going live.
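Testing on representative traffic can be as simple as replaying a labeled sample through the router and counting misroutes. The harness below is a sketch under stated assumptions: `evaluate_router` and `stub_router` are hypothetical names, the labeled queries are made up for illustration, and the stub stands in for the LLM-backed router so the example runs offline.

```python
from collections import Counter
from typing import Callable


def evaluate_router(
    router: Callable[[str], str],
    labeled_queries: list[tuple[str, str]],
) -> tuple[float, Counter]:
    """Return overall accuracy and a count of (expected, predicted) misroutes."""
    misroutes: Counter = Counter()
    correct = 0
    for query, expected in labeled_queries:
        predicted = router(query)
        if predicted == expected:
            correct += 1
        else:
            misroutes[(expected, predicted)] += 1
    accuracy = correct / len(labeled_queries) if labeled_queries else 0.0
    return accuracy, misroutes


def stub_router(query: str) -> str:
    """Hypothetical keyword stub; swap in the real router when evaluating."""
    words = query.lower().split()
    if "latest" in words or "today" in words:
        return "web_search"
    if "our" in words:
        return "knowledge_base"
    return "none"


# Illustrative labeled sample; in practice, draw this from real traffic.
LABELED = [
    ("What were our Q3 2025 revenue figures?", "knowledge_base"),
    ("What is the latest news on GPU shortages?", "web_search"),
    ("What is photosynthesis?", "none"),
    ("Should we expand to Europe given our margins?", "reasoning"),
]

accuracy, misroutes = evaluate_router(stub_router, LABELED)
print(f"accuracy={accuracy:.2f}")
for (expected, predicted), count in misroutes.items():
    print(f"expected {expected}, got {predicted}: {count}")
```

The misroute counter is often more useful than the headline accuracy, because it shows which pair of categories the router confuses and therefore which definitions in the prompt need tightening.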
Semantic Routing with Embeddings
Alternatively, you can route based on the similarity between the query's embedding and the embeddings of category exemplars. This method is deterministic, fast (an embedding lookup replaces the generative LLM call), and ideal when you have stable category definitions.
    import re

    # Define category exemplars (manually curated representative queries)
    CATEGORY_EXEMPLARS = {
        "knowledge_base": [
            "What were our Q3 2025 earnings?",
            "Show me the risk disclosure in policy XYZ.",
            "Who is the lead engineer on project Alpha?",
        ],
        "reasoning": [
            "Compare our market share to competitors.",
            "Should we expand to Europe given our margins?",
            "What patterns do you see in failed deployments?",
        ],
        "web_search": [
            "What is the latest news on GPU shortages?",
            "Current Bitcoin price?",
            "Is there a new JavaScript framework released today?",
        ],
        "none": [
            "What is photosynthesis?",
            "Explain quantum mechanics.",
            "How do I write a Python loop?",
        ],
    }

    def _tokens(text: str) -> set[str]:
        """Lowercase the text and split it into alphanumeric word tokens."""
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def route_query_semantic(user_query: str) -> str:
        """Route a query to the category whose exemplars it most resembles.

        NOTE: token overlap (Jaccard) is only a placeholder for embedding
        similarity. In production, embed the query and the exemplars with a
        dedicated embedding model (e.g., text-embedding-3-small) and compare
        the vectors with cosine similarity instead.
        """
        query_tokens = _tokens(user_query)
        best_category = None
        best_score = 0.0
        for category, exemplars in CATEGORY_EXEMPLARS.items():
            # Score each category by its best-matching exemplar
            for exemplar in exemplars:
                exemplar_tokens = _tokens(exemplar)
                union = query_tokens | exemplar_tokens
                score = len(query_tokens & exemplar_tokens) / len(union) if union else 0.0
                if score > best_score:
                    best_score = score
                    best_category = category
        return best_category or "none"

    # Example usage
    query = "What is our current headcount in the London office?"
    route = route_query_semantic(query)
    # Intended route with real embeddings: knowledge_base. The token-overlap
    # placeholder is too crude to guarantee that result.
    print(f"Semantic route: {route}")
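This method is much faster than LLM-based routing: once the query is embedded, the similarity comparison takes 10–100 µs, against 100–500 ms for an LLM call. Computing the query embedding adds to that, which is why end-to-end figures for embedding-based routing sit in the millisecond range. It works reliably for well-defined categories, so use it when your query distribution is stable.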
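The similarity step itself can be sketched as a nearest-exemplar search using cosine similarity. In the example below, `EmbeddingRouter` and `toy_embed` are hypothetical names: `toy_embed` is a word-count stand-in so the code runs without any external service, and a real deployment would pass in a function that calls an actual embedding model.

```python
import math
import re
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class EmbeddingRouter:
    def __init__(self, exemplars: dict[str, list[str]], embed: Callable[[str], Vector]):
        self.embed = embed
        # Embed every exemplar once at start-up, not once per query.
        self.index = [
            (category, embed(text))
            for category, texts in exemplars.items()
            for text in texts
        ]

    def route(self, query: str, min_similarity: float = 0.0) -> str:
        """Return the category of the most similar exemplar, or "none"
        if no exemplar scores above min_similarity."""
        query_vec = self.embed(query)
        best_category, best_sim = "none", min_similarity
        for category, vec in self.index:
            sim = cosine(query_vec, vec)
            if sim > best_sim:
                best_category, best_sim = category, sim
        return best_category


# Toy stand-in for an embedding model: counts of a few vocabulary words.
VOCAB = ["revenue", "policy", "news", "price", "headcount"]


def toy_embed(text: str) -> list[float]:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


router = EmbeddingRouter(
    {
        "knowledge_base": ["quarterly revenue report", "policy on headcount"],
        "web_search": ["latest news", "current price"],
    },
    embed=toy_embed,
)
print(router.route("What does the travel policy say?"))
print(router.route("Explain quantum mechanics."))
```

Passing `embed` in as a parameter keeps the routing logic independent of any particular embedding provider, and the `min_similarity` floor gives queries that resemble no exemplar a defined fallback.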
Multi-Category Routing and Confidence Scoring
Real-world queries often benefit from multiple retrievers. A query like "Compare our Q3 2025 performance to industry trends" should route to both knowledge_base (internal data) and web_search (industry benchmarks).
    import json

    MULTI_ROUTE_PROMPT = """Classify this query into relevant categories with confidence (0.0–1.0).
    Categories: knowledge_base, reasoning, web_search, none.
    Return JSON: {"categories": [{"name": "...", "confidence": 0.9}, ...]}"""

    def route_query_multi_category(user_query: str, threshold: float = 0.6) -> dict:
        """Route to multiple categories with confidence scores."""
        # Reuses the `client` created in the classifier-based example
        response = client.messages.create(
            model="claude-opus-4-1",
            max_tokens=200,
            system=MULTI_ROUTE_PROMPT,
            messages=[{"role": "user", "content": user_query}],
        )
        # Parse JSON response (production code should validate)
        result = json.loads(response.content[0].text)
        # Keep only routes at or above the confidence threshold (default 0.6)
        return {
            r["name"]: r["confidence"]
            for r in result["categories"]
            if r["confidence"] >= threshold
        }

    # Example
    routes = route_query_multi_category("Compare Q3 results to competitor benchmarks")
    print(routes)  # Illustrative output: {'knowledge_base': 0.95, 'web_search': 0.87}
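Because the router's JSON comes from a model, it is worth parsing it defensively. The following is a minimal sketch of what that validation might look like; `parse_routing_json` and `ALLOWED_ROUTES` are hypothetical names, and returning an empty dict on bad input is one possible policy, not the only reasonable one.

```python
import json

ALLOWED_ROUTES = {"knowledge_base", "reasoning", "web_search", "none"}


def parse_routing_json(raw: str, threshold: float = 0.6) -> dict[str, float]:
    """Parse router output defensively; return {} when it is unusable."""
    # Tolerate wrappers such as code fences by slicing out the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        payload = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return {}
    categories = payload.get("categories") if isinstance(payload, dict) else None
    if not isinstance(categories, list):
        return {}

    routes: dict[str, float] = {}
    for item in categories:
        if not isinstance(item, dict):
            continue
        name, conf = item.get("name"), item.get("confidence")
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        # Skip unknown names, non-numeric scores, and out-of-range scores.
        if isinstance(name, str) and name in ALLOWED_ROUTES and is_number:
            if threshold <= conf <= 1.0:
                routes[name] = float(conf)
    return routes


raw_reply = (
    '{"categories": [{"name": "knowledge_base", "confidence": 0.95}, '
    '{"name": "web_search", "confidence": 0.4}, '
    '{"name": "made_up", "confidence": 0.9}]}'
)
print(parse_routing_json(raw_reply))
```

An empty result signals the caller to apply its default route instead of raising mid-request.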
By returning a confidence score, you can trigger specialized retrievers in parallel (faster) or fall back gracefully if the primary route fails.
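Parallel triggering with graceful fallback can be sketched with `asyncio.gather`. The two retrievers below are hypothetical placeholders, and the web retriever deliberately raises to simulate an outage, showing how one failed route need not sink the whole request.

```python
import asyncio


# Hypothetical async retrievers standing in for real backends.
async def retrieve_knowledge_base(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulate I/O
    return [f"kb: {query}"]


async def retrieve_web(query: str) -> list[str]:
    await asyncio.sleep(0.01)
    raise TimeoutError("simulated web search outage")


RETRIEVERS = {
    "knowledge_base": retrieve_knowledge_base,
    "web_search": retrieve_web,
}


async def retrieve_parallel(query: str, routes: dict[str, float]) -> dict[str, list[str]]:
    """Run every active route's retriever concurrently and keep the successes."""
    names = [name for name in routes if name in RETRIEVERS]
    outcomes = await asyncio.gather(
        *(RETRIEVERS[name](query) for name in names),
        return_exceptions=True,  # collect failures instead of aborting the batch
    )
    return {
        name: outcome
        for name, outcome in zip(names, outcomes)
        if not isinstance(outcome, BaseException)
    }


results = asyncio.run(
    retrieve_parallel("Compare Q3 results", {"knowledge_base": 0.95, "web_search": 0.87})
)
print(results)
```

With `return_exceptions=True`, total latency is bounded by the slowest retriever rather than the sum of all of them, and the caller can decide what to do when a route is missing from the result.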
Integration with RAG Pipeline
In a complete RAG system, query routing happens before retrieval:
| Step | Responsible Component | Example Output |
|---|---|---|
| 1. Receive query | User interface | "What are Q3 margins?" |
| 2. Route query | Query router | ["knowledge_base"] with confidence 0.95 |
| 3. Retrieve documents | Knowledge base retriever | 5 financial PDF excerpts |
| 4. Generate response | LLM with system prompt | Markdown table of margin data |
Routing latency should stay below 200 ms (ideally below 100 ms) to maintain the perception of real-time responsiveness. LLM-based routing typically runs at 150–300 ms, which can consume the whole budget on its own; embedding-based routing runs at 5–15 ms.
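To check a router against this budget, you can time it over a sample of queries and look at the tail as well as the median. This is a rough sketch: `measure_latency_ms` is a hypothetical helper, the p95 uses a simple nearest-rank approximation, and the trivial lambda stands in for a real router.

```python
import statistics
import time
from typing import Callable

BUDGET_MS = 200.0  # routing budget from the text


def measure_latency_ms(
    router: Callable[[str], str],
    queries: list[str],
    repeats: int = 5,
) -> dict[str, float]:
    """Time the router on each query `repeats` times; report p50 and p95 in ms."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            router(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Approximate nearest-rank 95th percentile.
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {"p50": statistics.median(samples), "p95": p95}


stats = measure_latency_ms(lambda q: "none", ["What are Q3 margins?", "Latest GPU news?"])
status = "within" if stats["p95"] <= BUDGET_MS else "over"
print(f"p50={stats['p50']:.3f} ms, p95={stats['p95']:.3f} ms ({status} budget)")
```

Judging the budget on p95 rather than the mean matters for LLM-based routers, whose latency tends to have a long tail.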
Key Takeaways
- Query routing classifies user questions before retrieval, directing them to specialized pathways and reducing hallucination by enforcing structured access to knowledge.
- Classifier-based routing (using an LLM) is flexible and requires no training; semantic routing (using embeddings) is faster and deterministic.
- Confidence scores enable multi-category routing: a single query can trigger multiple retrievers in parallel for better coverage.
- Routing latency should stay under 200 ms; use embeddings if LLM speed is a bottleneck.
- Test routing rules on real traffic before deployment; aim for 85%+ accuracy on your top 20 query intents.
Frequently Asked Questions
What routing accuracy do I need in production?
Aim for at least 85% accuracy on your top query intents; 85–92% is a realistic range. Misrouted queries fall back to general retrieval or reasoning, which degrades answer accuracy by 5–15% but is better than a hard failure. Monitor routing errors in your feedback loop and revisit the router (its prompt, exemplars, or trained model) quarterly.
Should I use the same LLM for routing as for generation?
Not necessarily. A smaller, faster model (such as Claude Haiku) can handle routing at lower cost and latency, while a larger model generates the final response. Separate routing from generation to optimize cost and latency independently. If you have labeled training data, a small fine-tuned classification model is another option.
How do I handle queries that don't fit any category?
Return none or a default category that triggers broad-spectrum retrieval. In production systems, 10–15% of queries are ambiguous or novel. Log these as feedback to refine your routing rules; update your exemplars monthly as query patterns evolve.
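A small normalization layer can apply the default and record what it could not place. The sketch below uses hypothetical names (`normalize_route`, `unrouted`, the `query_router` logger) and an in-memory counter; a production system would more likely write these events to its existing feedback store.

```python
import logging
from collections import Counter

logger = logging.getLogger("query_router")

KNOWN_ROUTES = {"knowledge_base", "reasoning", "web_search", "none"}
DEFAULT_ROUTE = "none"

# In-memory tally of labels the router produced that match no known route.
unrouted: Counter = Counter()


def normalize_route(label: str, query: str) -> str:
    """Return a known route, falling back to DEFAULT_ROUTE and logging the miss."""
    cleaned = label.strip().lower()
    if cleaned in KNOWN_ROUTES:
        return cleaned
    unrouted[cleaned] += 1
    logger.warning(
        "Unrecognized route %r for query %r; using %r", cleaned, query, DEFAULT_ROUTE
    )
    return DEFAULT_ROUTE


print(normalize_route(" Knowledge_Base\n", "What are Q3 margins?"))
print(normalize_route("I think web", "Is the new framework out?"))
print(unrouted.most_common(3))
```

Reviewing the most common unrecognized labels is a cheap way to spot categories that are missing from the routing scheme.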
Can routing cause false negatives?
Yes, if you miscalibrate confidence thresholds. If knowledge_base scores 0.55 but you filter at 0.6, you miss internal data. Raising the threshold reduces spurious retrievals at the cost of more missed ones, so set it based on your error tolerance: 0.7 for high-stakes domains (medical, legal); 0.5 for exploratory use (suggestions, brainstorming).
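The trade-off becomes visible if you sweep thresholds over a labeled validation set for a single route. In this sketch, `precision_recall` is a hypothetical helper and the scored samples are invented for illustration; the 0.55 sample mirrors the missed knowledge_base case described above.

```python
def precision_recall(
    scored: list[tuple[float, bool]], threshold: float
) -> tuple[float, float]:
    """Precision and recall for one route at a given confidence threshold.

    Each item is (router confidence, whether the route was actually needed).
    """
    tp = sum(1 for conf, needed in scored if conf >= threshold and needed)
    fp = sum(1 for conf, needed in scored if conf >= threshold and not needed)
    fn = sum(1 for conf, needed in scored if conf < threshold and needed)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Illustrative validation data for the knowledge_base route.
SCORED = [
    (0.95, True),
    (0.80, True),
    (0.65, False),
    (0.55, True),  # needed, but dropped by a 0.6 or 0.7 threshold
    (0.40, False),
    (0.30, False),
]

for threshold in (0.5, 0.6, 0.7):
    precision, recall = precision_recall(SCORED, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, moving from 0.5 to 0.7 raises precision from 0.75 to 1.00 while recall falls from 1.00 to 0.67, which is the false-negative cost in miniature.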
What if a query requires multiple routes simultaneously?
Trigger all routes in parallel and merge results before generating. Issue the retrieval calls concurrently to avoid serial latency. This increases cost by 2–3x but often improves answer quality by 15–25% because you access multiple knowledge sources with different strengths.
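The merge step might look like the following sketch, which orders documents by the confidence of the route that produced them and drops duplicates. `merge_results` is a hypothetical helper, and the assumption that every document carries a stable `id` field is mine, not something the pipeline above guarantees.

```python
def merge_results(
    results: dict[str, list[dict]],
    confidences: dict[str, float],
    limit: int = 5,
) -> list[dict]:
    """Merge per-route results, highest-confidence route first, deduplicated by id."""
    seen: set[str] = set()
    merged: list[dict] = []
    ordered_routes = sorted(
        results, key=lambda route: confidences.get(route, 0.0), reverse=True
    )
    for route in ordered_routes:
        for doc in results[route]:
            if doc["id"] in seen:
                continue  # the same document was already returned by another route
            seen.add(doc["id"])
            merged.append({**doc, "route": route})
    return merged[:limit]


results = {
    "web_search": [
        {"id": "w1", "text": "Industry margin benchmarks"},
        {"id": "d1", "text": "Q3 report (public copy)"},
    ],
    "knowledge_base": [
        {"id": "d1", "text": "Q3 report"},
        {"id": "d2", "text": "Margin breakdown by product"},
    ],
}
confidences = {"knowledge_base": 0.95, "web_search": 0.87}

for doc in merge_results(results, confidences):
    print(doc["route"], doc["id"], doc["text"])
```

Tagging each document with its originating route lets the generation prompt cite where a fact came from, which helps when internal and external sources disagree.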
Further Reading
- LLM-Powered Autonomous Agents. Lilian Weng's comprehensive overview of agentic routing and planning.
- Routing Transformer: Mixture of Experts. Foundational work on learned routing mechanisms.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020). The original RAG paper.
- Query2Doc: Query Expansion with Large Language Models for Information Retrieval. Techniques for improving routing through query understanding.