Production Agentic RAG: Enterprise Architecture
Production agentic RAG systems elevate RAG from a simple retrieval-generation pipeline to an autonomous agent that can access multiple tools (search, databases, APIs), reason about when to use each, and adapt based on feedback. Agentic RAG integrates query routing, multi-hop retrieval, self-grading, fallback mechanisms, and external tools into a robust system capable of handling complex, real-world queries at enterprise scale. These systems reduce hallucination by 50–60% and enable users to trust AI for critical decisions.
The Agentic RAG Architecture
An agentic RAG system has these core components:
- Query understanding: Parse intent, constraints, and context.
- Tool registry: Search APIs, knowledge bases, databases, calculators, web browsers.
- Planning agent: Decide which tools to invoke and in what order.
- Execution engine: Execute tool calls with proper error handling and fallback.
- Refinement loop: Grade outputs, re-plan if needed, adapt to feedback.
- Response synthesis: Combine results into a coherent natural language answer.
Unlike static pipelines, agentic systems dynamically decide what to do based on the query and intermediate results.
Designing the Tool Registry
Define all available tools that the agent can use:
from anthropic import Anthropic
from typing import Callable, Any
import json

client = Anthropic()


class Tool:
    """Represents a tool the RAG agent can invoke."""

    def __init__(self, name: str, description: str, parameters: dict,
                 fn: Callable):
        self.name = name
        self.description = description
        self.parameters = parameters  # JSON Schema
        self.fn = fn

    def invoke(self, **kwargs) -> str:
        """Execute the tool with the given parameters."""
        try:
            result = self.fn(**kwargs)
            return json.dumps(result)
        except Exception as e:
            return json.dumps({"error": str(e)})
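# Define tools
import ast
import operator

def search_knowledge_base(query: str, top_k: int = 5) -> dict:
    """Search the company knowledge base (stubbed with placeholder results)."""
    return {
        "results": [
            {"title": "Document 1", "score": 0.92, "snippet": "..."},
            {"title": "Document 2", "score": 0.85, "snippet": "..."}
        ]
    }

def query_database(sql: str) -> dict:
    """Execute a SQL query on the data warehouse (stubbed)."""
    return {
        "rows": [{"revenue": 100000, "region": "US"}],
        "execution_time_ms": 245
    }

# eval() would run arbitrary model-generated code, so only plain arithmetic
# is accepted: the expression is parsed with ast and walked node by node.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.USub: operator.neg}

def _eval_node(node: ast.AST) -> float:
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("Unsupported expression")

def calculate(expression: str) -> dict:
    """Evaluate an arithmetic expression (+, -, *, /) without eval()."""
    return {"result": _eval_node(ast.parse(expression, mode="eval").body)}

def web_search(query: str) -> dict:
    """Search the web for current information (stubbed)."""
    return {
        "results": [
            {"url": "https://...", "title": "...", "snippet": "..."}
        ]
    }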
# Build tool registry
tools = [
    Tool(
        name="search_kb",
        description="Search the company knowledge base for documents, policies, and FAQs",
        parameters={
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "top_k": {"type": "integer", "description": "Number of results"}
            },
            "required": ["query"]
        },
        fn=search_knowledge_base
    ),
    Tool(
        name="query_db",
        description="Execute a SQL query on the data warehouse",
        parameters={
            "type": "object",
            "properties": {
                "sql": {"type": "string", "description": "SQL query"}
            },
            "required": ["sql"]
        },
        fn=query_database
    ),
    Tool(
        name="calculate",
        description="Perform mathematical calculations",
        parameters={
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression"}
            },
            "required": ["expression"]
        },
        fn=calculate
    ),
    Tool(
        name="web_search",
        description="Search the web for current events and information",
        parameters={
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Web search query"}
            },
            "required": ["query"]
        },
        fn=web_search
    )
]

print(f"Registered {len(tools)} tools")
A good tool registry holds 5–15 tools covering database and API access, search, calculation, and external services. Beyond that range, each extra tool adds complexity and decision overhead for the planner.
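One way to make such a registry usable by the model is to convert each entry into the shape the Anthropic Messages API accepts in its `tools` parameter (`name`, `description`, `input_schema`) and to index tools by name for dispatch. The sketch below is an illustration under assumptions: `ToolSpec` is a stand-in for the `Tool` class so the block runs on its own, and `to_api_schema`, `build_index`, and `missing_required` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolSpec:
    """Stand-in for the article's Tool class (same four fields)."""
    name: str
    description: str
    parameters: dict[str, Any]  # JSON Schema for the tool's arguments
    fn: Callable[..., Any]


def to_api_schema(tool: ToolSpec) -> dict[str, Any]:
    """Shape one tool the way the Messages API `tools` parameter expects."""
    return {
        "name": tool.name,
        "description": tool.description,
        "input_schema": tool.parameters,
    }


def build_index(tools: list[ToolSpec]) -> dict[str, ToolSpec]:
    """Index tools by name; duplicate names would make dispatch ambiguous."""
    index: dict[str, ToolSpec] = {}
    for tool in tools:
        if tool.name in index:
            raise ValueError(f"Duplicate tool name: {tool.name}")
        index[tool.name] = tool
    return index


def missing_required(tool: ToolSpec, params: dict[str, Any]) -> list[str]:
    """Return required parameters (per the JSON Schema) that the caller omitted."""
    return [key for key in tool.parameters.get("required", []) if key not in params]


# Illustrative tool with a stubbed implementation.
search_kb = ToolSpec(
    name="search_kb",
    description="Search the company knowledge base",
    parameters={
        "type": "object",
        "properties": {"query": {"type": "string"}, "top_k": {"type": "integer"}},
        "required": ["query"],
    },
    fn=lambda query, top_k=5: {"results": []},
)

index = build_index([search_kb])
api_tools = [to_api_schema(t) for t in index.values()]
print(api_tools[0]["name"], missing_required(search_kb, {"top_k": 3}))
```

The resulting `api_tools` list can be passed as `tools=api_tools` to `client.messages.create`, so the model returns structured tool calls rather than a free-form JSON plan that has to be parsed out of text.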
Agentic Planning and Execution
The agent plans which tools to invoke:
def agentic_rag_loop(user_query: str, tools: list[Tool], max_iterations: int = 5) -> dict:
    """Run one understand -> plan -> execute -> synthesize pass over the query.

    At most max_iterations planned tool calls are executed.
    """
    # Step 1: Understand the query
    understanding_prompt = f"""Analyze this user query and extract key information:

Query: {user_query}

Extract:
1. Primary intent (what does the user want?)
2. Required information (what facts/data do we need?)
3. Constraints (time, domain, format)
4. Follow-up questions (what might we need to ask?)

Respond concisely."""

    understanding = client.messages.create(
        model="claude-haiku-4-5",  # assumed alias for a small, fast model; check the current model list
        max_tokens=200,
        messages=[{"role": "user", "content": understanding_prompt}]
    )
    print(f"Understanding: {understanding.content[0].text}\n")

    # Step 2: Plan tool usage
    tool_descriptions = "\n".join([
        f"- {t.name}: {t.description}" for t in tools
    ])

    planning_prompt = f"""Based on this query analysis, plan which tools to invoke.

Query: {user_query}
Analysis: {understanding.content[0].text}

Available tools:
{tool_descriptions}

Plan (in order):
1. Which tool to invoke first?
2. What parameters?
3. Based on that result, what's the next step?

Return JSON: {{
  "steps": [
    {{"tool": "...", "parameters": {{}}, "rationale": "..."}}
  ]
}}"""

    planning = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=400,
        messages=[{"role": "user", "content": planning_prompt}]
    )

    # Extract the outermost JSON object from the reply; json.loads raises
    # if the model returned no valid JSON.
    text = planning.content[0].text
    start = text.find('{')
    end = text.rfind('}') + 1
    plan = json.loads(text[start:end])
    print(f"Plan: {len(plan['steps'])} steps\n")

    # Step 3: Execute the plan
    execution_log = []
    for i, step in enumerate(plan['steps'][:max_iterations]):
        tool_name = step['tool']
        params = step.get('parameters', {})  # tolerate steps without parameters

        # Find the tool
        tool = next((t for t in tools if t.name == tool_name), None)
        if not tool:
            print(f"Tool {tool_name} not found")
            continue

        print(f"Step {i+1}: Invoking {tool_name} with {params}")
        result = tool.invoke(**params)
        execution_log.append({
            "step": i + 1,
            "tool": tool_name,
            "result": result
        })
        print(f"  Result: {result[:100]}...\n")

    # Step 4: Synthesize response
    synthesis_prompt = f"""Based on the execution log, synthesize a final answer to the user query.

Query: {user_query}

Execution Log:
{json.dumps(execution_log, indent=2)}

Generate a clear, concise answer that:
1. Directly addresses the query
2. Cites the data sources used
3. Acknowledges any limitations or uncertainties"""

    synthesis = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=400,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )

    return {
        "query": user_query,
        "understanding": understanding.content[0].text,
        "plan": plan,
        "execution_log": execution_log,
        "response": synthesis.content[0].text
    }


# Example
result = agentic_rag_loop("What were Microsoft's Q3 2025 revenues by region?", tools)
print(f"Final response:\n{result['response']}")
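As written, the loop executes a single plan from start to finish and does not yet react to intermediate results. To make it iterative, check each tool result and re-plan when a step fails or returns nothing: if the first knowledge-base search comes back empty, for example, the agent pivots to the database or web search.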
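The pivot from one tool to another can be sketched as a fallback chain that tries tools in priority order and stops at the first usable result. This is a minimal illustration, not the article's implementation: `run_with_fallback`, `is_usable`, and the stub tools are hypothetical names, and "usable" is assumed to mean no `error` key and at least one non-empty value.

```python
from typing import Any, Callable

ToolFn = Callable[[str], dict[str, Any]]


def is_usable(result: dict[str, Any]) -> bool:
    """Assumed success criterion: no error key and at least one non-empty value."""
    if "error" in result:
        return False
    return any(bool(value) for value in result.values())


def run_with_fallback(query: str, chain: list[tuple[str, ToolFn]]) -> dict[str, Any]:
    """Try each tool in order; return the first usable result plus an attempt log."""
    attempts: list[dict[str, Any]] = []
    for name, fn in chain:
        try:
            result = fn(query)
        except Exception as exc:  # a crashing tool should not end the whole chain
            result = {"error": str(exc)}
        ok = is_usable(result)
        attempts.append({"tool": name, "ok": ok})
        if ok:
            return {"tool": name, "result": result, "attempts": attempts}
    return {"tool": None, "result": None, "attempts": attempts}


# Stub tools for illustration only.
def empty_kb(query: str) -> dict[str, Any]:
    return {"results": []}


def broken_db(query: str) -> dict[str, Any]:
    raise ConnectionError("warehouse unreachable")


def stub_web(query: str) -> dict[str, Any]:
    return {"results": [{"title": "stub hit", "snippet": query}]}


outcome = run_with_fallback(
    "Q3 revenue by region",
    [("search_kb", empty_kb), ("query_db", broken_db), ("web_search", stub_web)],
)
print(outcome["tool"], [a["tool"] for a in outcome["attempts"]])
```

The attempt log records which tools were skipped and why the chain moved on, which is useful input for later review of the planner's choices.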
Quality Assurance and Feedback Loop
Add explicit quality checks to the agentic loop:
def evaluate_response_quality(query: str, response: str, sources: list[str]) -> dict:
    """Evaluate the quality of the generated response."""
    qa_prompt = f"""Evaluate this response on multiple dimensions:

Query: {query}
Response: {response}
Sources used: {sources}

Rate on:
1. Relevance (0-100): Does it answer the query?
2. Accuracy (0-100): Is the information correct and well-sourced?
3. Completeness (0-100): Does it address all aspects?
4. Clarity (0-100): Is it easy to understand?
5. Confidence (0-100): How confident should the user be?

Return JSON: {{
  "scores": {{"relevance": X, "accuracy": Y, ...}},
  "overall_quality": Z,
  "issues": ["issue1", "issue2"],
  "recommendation": "ACCEPT|REFINE|ESCALATE"
}}"""

    qa = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=300,
        messages=[{"role": "user", "content": qa_prompt}]
    )

    text = qa.content[0].text
    start = text.find('{')
    end = text.rfind('}') + 1
    return json.loads(text[start:end])


def feedback_loop(query: str, response: str, user_feedback: str) -> dict:
    """Learn from user feedback and improve."""
    improvement_prompt = f"""The user gave this feedback on an AI response:

Query: {query}
AI Response: {response}
User Feedback: {user_feedback}

Analyze:
1. What went wrong?
2. Which tool/strategy would have worked better?
3. What should we remember for similar queries in the future?

Return JSON: {{
  "root_cause": "...",
  "better_strategy": "...",
  "learning": "..."
}}"""

    improvement = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=300,
        messages=[{"role": "user", "content": improvement_prompt}]
    )

    text = improvement.content[0].text
    start = text.find('{')
    end = text.rfind('}') + 1
    return json.loads(text[start:end])


# Example
quality = evaluate_response_quality(
    "What were Q3 revenues?",
    "Microsoft Q3 2025 revenues were $67.2B",
    ["https://investor.microsoft.com"]
)
print(f"Quality: {quality['overall_quality']}/100")
if quality['recommendation'] == "REFINE":
    print(f"Issues: {quality['issues']}")
Quality checks catch incomplete or inaccurate responses before they reach users. Feedback loops enable continuous improvement.
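A grader's verdict only helps if something acts on it. The sketch below shows one possible gate that regenerates the answer while the grader says REFINE and stops on ACCEPT or ESCALATE. The `generate` and `grade` callables, the `gate` name, and the limit of two refinements are assumptions for illustration; in practice `grade` could wrap `evaluate_response_quality`.

```python
from typing import Callable

Grade = dict  # expected keys: "overall_quality", "recommendation", "issues"


def gate(
    query: str,
    generate: Callable[[str, list[str]], str],
    grade: Callable[[str, str], Grade],
    max_refinements: int = 2,
) -> dict:
    """Regenerate while the grader asks for REFINE, up to max_refinements times."""
    issues: list[str] = []
    for attempt in range(max_refinements + 1):
        response = generate(query, issues)
        verdict = grade(query, response)
        # A missing recommendation is treated as ESCALATE (assumed safe default).
        recommendation = verdict.get("recommendation", "ESCALATE")
        if recommendation == "ACCEPT":
            return {"status": "accepted", "response": response, "attempts": attempt + 1}
        if recommendation == "ESCALATE":
            break
        # REFINE: feed the grader's issues into the next generation attempt.
        issues = list(verdict.get("issues", []))
    return {"status": "escalated", "response": response, "attempts": attempt + 1}


# Stubs standing in for the LLM calls, for illustration only.
def stub_generate(query: str, issues: list[str]) -> str:
    return "Revenue was $X (source: warehouse)" if issues else "Revenue was $X"


def stub_grade(query: str, response: str) -> Grade:
    if "source" in response:
        return {"overall_quality": 90, "recommendation": "ACCEPT", "issues": []}
    return {"overall_quality": 55, "recommendation": "REFINE",
            "issues": ["missing citation"]}


result = gate("What were Q3 revenues?", stub_generate, stub_grade)
print(result["status"], result["attempts"])  # accepted 2
```

Capping the number of refinements keeps latency bounded; anything still failing after the cap is handed to a human rather than retried indefinitely.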
Deployment and Monitoring
Deploy agentic RAG with proper observability:
import time


class ProductionAgenticRAG:
    """Production-ready agentic RAG system with monitoring."""

    def __init__(self, tools: list[Tool]):
        self.tools = tools
        self.metrics = {
            "total_queries": 0,
            "successful": 0,
            "errors": 0,
            "hallucinations": 0,  # needs a groundedness check; not incremented here
            "avg_latency_ms": 0
        }

    def process_query(self, query: str) -> dict:
        """Process a query with monitoring."""
        start = time.time()
        self.metrics["total_queries"] += 1  # count failed queries too
        try:
            # Execute agentic loop
            result = agentic_rag_loop(query, self.tools, max_iterations=5)

            # Quality check, passing the names of the tools that were used
            quality = evaluate_response_quality(
                query, result['response'],
                [entry["tool"] for entry in result['execution_log']]
            )
            latency_ms = (time.time() - start) * 1000

            # Update metrics
            if quality['overall_quality'] >= 70:
                self.metrics["successful"] += 1
            # Exponential moving average, seeded with the first observation
            previous = self.metrics["avg_latency_ms"]
            self.metrics["avg_latency_ms"] = (
                latency_ms if previous == 0 else previous * 0.9 + latency_ms * 0.1
            )

            # Log for monitoring (send to Datadog, CloudWatch, etc.)
            self.log_event({
                "query": query,
                "response_length": len(result['response']),
                "tools_used": len(result['execution_log']),
                "quality_score": quality['overall_quality'],
                "latency_ms": latency_ms,
                "timestamp": time.time()
            })

            return {
                **result,
                "quality": quality,
                "latency_ms": latency_ms
            }
        except Exception as e:
            # A raised exception is a system error, not a hallucination
            self.metrics["errors"] += 1
            print(f"Error: {e}")
            return {
                "error": str(e),
                "latency_ms": (time.time() - start) * 1000
            }

    def log_event(self, event: dict):
        """Send event to monitoring service."""
        # In production: send to Datadog, Prometheus, CloudWatch
        # metrics["query_latency"] = event["latency_ms"]
        # metrics["quality_score"] = event["quality_score"]
        pass

    def health_check(self) -> dict:
        """Return system health metrics."""
        success_rate = (self.metrics["successful"] /
                        max(1, self.metrics["total_queries"])) * 100
        return {
            "total_queries": self.metrics["total_queries"],
            "success_rate": f"{success_rate:.1f}%",
            "avg_latency_ms": f"{self.metrics['avg_latency_ms']:.0f}",
            "error_count": self.metrics["errors"],
            "hallucination_count": self.metrics["hallucinations"],
            "status": "HEALTHY" if success_rate > 85 else "DEGRADED"
        }


# Deploy
rag_system = ProductionAgenticRAG(tools)
health = rag_system.health_check()
print(f"System Health: {health['status']}")
Production systems need monitoring to catch degradation (latency spikes, quality drops, hallucination increases) in real time.
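As a rough sketch of real-time degradation detection, the class below keeps a sliding window of recent requests and flags latency or quality regressions. `RollingMonitor`, the window size, and both thresholds are illustrative assumptions, not values taken from any particular monitoring product.

```python
from collections import deque
from statistics import mean, quantiles


class RollingMonitor:
    """Track the most recent requests and flag latency or quality regressions."""

    def __init__(self, window: int = 100, p95_limit_ms: float = 3000.0,
                 min_mean_quality: float = 70.0):
        self.latencies = deque(maxlen=window)   # oldest samples drop off automatically
        self.qualities = deque(maxlen=window)
        self.p95_limit_ms = p95_limit_ms
        self.min_mean_quality = min_mean_quality

    def record(self, latency_ms: float, quality: float) -> None:
        self.latencies.append(latency_ms)
        self.qualities.append(quality)

    def p95_latency_ms(self) -> float:
        if not self.latencies:
            return 0.0
        if len(self.latencies) == 1:
            return self.latencies[0]
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        return quantiles(self.latencies, n=20)[-1]

    def alerts(self) -> list[str]:
        """Return the names of the checks that currently fail."""
        found = []
        if self.p95_latency_ms() > self.p95_limit_ms:
            found.append("latency")
        if self.qualities and mean(self.qualities) < self.min_mean_quality:
            found.append("quality")
        return found


monitor = RollingMonitor(window=20)
for _ in range(20):
    monitor.record(latency_ms=800.0, quality=90.0)
healthy = monitor.alerts()

for _ in range(20):  # a burst of slow, low-quality answers fills the window
    monitor.record(latency_ms=5000.0, quality=40.0)
degraded = monitor.alerts()

print(healthy, degraded)
```

A windowed percentile reacts to a recent spike, whereas a lifetime average would be dominated by older, healthy traffic.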
Architecture Best Practices
| Component | Best Practice |
|---|---|
| Query parsing | Use an LLM for intent understanding; fall back to heuristics |
| Tool selection | Use weighted scoring; prefer tools with lower latency/cost |
| Error handling | Degrade gracefully; always have a fallback tool |
| Caching | Cache query understanding, tool results (5–15 min TTL) |
| Latency budget | Target 1–2 sec total; break down per component |
| Monitoring | Track latency, quality, tool usage, error rates |
| Feedback loop | Log 1% of queries for human review weekly |
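The "weighted scoring" row can be made concrete with a small scoring function. In this sketch the relevance, latency, and cost figures, the weights, the normalisation caps, and the names `ToolProfile`, `score`, and `rank_tools` are all made-up assumptions; real values would come from your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    name: str
    relevance: float   # 0-1 estimate of fit for the query (assumed to be given)
    latency_ms: float  # typical latency
    cost_usd: float    # typical cost per call


def score(tool: ToolProfile, *, w_relevance: float = 0.6, w_latency: float = 0.25,
          w_cost: float = 0.15, max_latency_ms: float = 2000.0,
          max_cost_usd: float = 0.05) -> float:
    """Higher is better: reward relevance, penalise latency and cost."""
    latency_penalty = min(tool.latency_ms / max_latency_ms, 1.0)
    cost_penalty = min(tool.cost_usd / max_cost_usd, 1.0)
    return (w_relevance * tool.relevance
            + w_latency * (1.0 - latency_penalty)
            + w_cost * (1.0 - cost_penalty))


def rank_tools(tools: list[ToolProfile]) -> list[ToolProfile]:
    """Order candidate tools from best to worst score."""
    return sorted(tools, key=score, reverse=True)


candidates = [
    ToolProfile("web_search", relevance=0.75, latency_ms=1500, cost_usd=0.01),
    ToolProfile("query_db", relevance=0.30, latency_ms=400, cost_usd=0.005),
    ToolProfile("search_kb", relevance=0.70, latency_ms=150, cost_usd=0.001),
]
ranked = rank_tools(candidates)
print([t.name for t in ranked])  # ['search_kb', 'web_search', 'query_db']
```

Here the knowledge base outranks web search despite slightly lower relevance because it is much faster and cheaper, which is the trade-off the table row describes.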
Key Takeaways
- Agentic RAG combines planning, multi-tool execution, self-grading, and feedback loops into a resilient system that reduces hallucination by 50–60%.
- Define a tool registry (5–15 tools) covering databases, search, APIs, and external services. Let the agent dynamically choose tools based on the query.
- Use iterative planning and execution: analyze the query, plan tool usage, execute, check quality, refine if needed.
- Add quality checks and feedback loops to catch errors and learn from users.
- Deploy with comprehensive monitoring: track latency, success rate, quality scores, and tool usage to catch degradation early.
Frequently Asked Questions
How many tools is too many?
Having more than 15 tools increases decision overhead and latency, because the agent spends longer choosing. Start with 5–7 core tools (search, database, APIs) and add more only when a specific use case needs them.
What if the agent makes a bad tool choice?
Design fallback chains. If the agent queries the database and gets no results, it should try web search. Log bad choices and revise the planning prompt monthly, adding examples of good and bad decisions.
How do I handle real-time data consistency?
Use caching strategically: cache query understanding (stable for days), cache static knowledge base searches (stable for weeks), but query databases fresh (or with short TTL of 1–5 min for fact tables). Mark cached results as such in the response.
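A per-entry TTL cache is one way to apply these different freshness windows. The sketch below is a minimal in-memory version; `TTLCache`, the injectable clock, the key naming, and the example TTL values are illustrative assumptions.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """In-memory cache where every entry carries its own expiry time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[dict]:
        """Return the value flagged as cached, or None if missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: force a fresh lookup
            return None
        return {"value": value, "cached": True}


# A fake clock makes expiry observable without sleeping.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.set("kb:refund policy", ["doc-1"], ttl_seconds=7 * 24 * 3600)  # static KB search
cache.set("db:revenue", 100000, ttl_seconds=300)                      # fact table, 5 min

now[0] = 600.0  # ten minutes later
print(cache.get("kb:refund policy"), cache.get("db:revenue"))
```

The `cached` flag on the returned entry is what lets the response layer tell the user that a figure may not be live.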
Should I scale agentic RAG to 1000+ queries per second?
Not recommended. Agentic systems have higher latency (1–3 sec per query). For high-volume workloads, serve simple queries through a fast path (vector search only) and route only complex queries to agentic RAG. A typical split is 80% simple (fast path) and 20% complex (agentic).
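The fast-path split can be sketched as a cheap router placed in front of both pipelines. The keyword heuristic, the word-count threshold, and the names `route` and `COMPLEX_HINTS` below are assumptions for illustration; a trained classifier or a small LLM call could take the heuristic's place.

```python
# Phrases that, in this sketch, suggest multi-step or multi-source work.
COMPLEX_HINTS = ("compare", "trend", "by region", "calculate", "versus", " and ")


def route(query: str, max_simple_words: int = 12) -> str:
    """Return "fast_path" for short lookup-style queries, else "agentic"."""
    text = f" {query.lower().strip()} "
    if len(text.split()) > max_simple_words:
        return "agentic"
    if any(hint in text for hint in COMPLEX_HINTS):
        return "agentic"
    return "fast_path"


queries = [
    "What is the refund policy?",
    "Compare Q3 revenue by region and explain the trend",
]
print([route(q) for q in queries])  # ['fast_path', 'agentic']
```

Because the router itself must be far cheaper than the agentic loop, a heuristic or small model is a reasonable starting point; misrouted queries can be logged and used to tune it.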
How do I handle tool failures (e.g., database downtime)?
Implement circuit breakers. After 3 consecutive failures, mark the tool as unavailable for 5 minutes, redirect queries to alternative tools, and alert on-call ops. Degrade gracefully: if the database is down, search the knowledge base instead.
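The rule of 3 consecutive failures followed by 5 minutes of unavailability can be sketched as a small state holder per tool. `CircuitBreaker` and its injectable clock are illustrative names; alerting and the choice of alternative tool are left out of this sketch.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after N consecutive failures; close again once the cooldown elapses."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def available(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._cooldown:
            # Cooldown elapsed: allow traffic again and start counting afresh.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0  # only consecutive failures count

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


# A fake clock makes the cooldown observable without waiting five minutes.
now = [0.0]
db_breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    db_breaker.record_failure()
tripped = db_breaker.available()

now[0] = 301.0
recovered = db_breaker.available()
print(tripped, recovered)  # False True
```

Before each call the agent would check `available()` and, when it returns `False`, skip straight to the next tool in its fallback order.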