Tool Error Handling: Retries and Graceful Fallbacks
Tool calls fail. Network timeouts, rate limits, permission errors, and server outages are inevitable. A robust agent does not crash on the first failure: it retries transient errors, falls back to alternatives, and recovers gracefully. Proper error handling turns a brittle system into a resilient one. Exponential backoff and circuit breakers are two widely used patterns for reducing the number of workflows that fail outright.
Types of Tool Errors
Tools fail in different ways, and each requires a different response.
Transient errors (retry-worthy): Timeout, rate limit (429), temporary service unavailable (503), network blip. These usually pass on retry.
Permanent errors (do not retry): Invalid arguments, permission denied (403), not found (404), invalid schema. Retrying will not help.
Partial failures (recover): Tool returns a partial result or a degraded response. Use what you got and adjust downstream.
def classify_error(exception, status_code=None):
    """Classify an error as "transient", "permanent", or "unknown"."""
    transient_codes = {408, 429, 500, 502, 503, 504}  # Timeout, rate limit, server errors
    permanent_codes = {400, 401, 403, 404}  # Bad request, auth, permission, not found
    # Assumption: some HTTP client exceptions expose the code as `status_code`
    if status_code is None:
        status_code = getattr(exception, "status_code", None)
    if isinstance(exception, TimeoutError):
        return "transient"
    if isinstance(exception, PermissionError):
        return "permanent"
    if status_code in transient_codes:
        return "transient"
    if status_code in permanent_codes:
        return "permanent"
    if "rate limit" in str(exception).lower():
        return "transient"
    return "unknown"  # Callers in this article retry unknown errors like transient ones
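Partial failures fall outside the transient/permanent split, so they need their own handling. The following is a minimal sketch of one way to do that, assuming a hypothetical tool response shaped like {"items": [...], "failed_ids": [...]}; the function name and both keys are illustrative and not part of any real API.

```python
def summarize_partial_result(response):
    """Split a tool response into usable data and a note about what is missing.

    Assumes a hypothetical response shape: {"items": [...], "failed_ids": [...]}.
    """
    items = response.get("items", [])
    failed_ids = response.get("failed_ids", [])
    if not failed_ids:
        status = "success"
    elif items:
        status = "partial"  # Some data came back; downstream steps can use it
    else:
        status = "error"  # Nothing usable came back
    return {"status": status, "data": items, "missing": failed_ids}


result = summarize_partial_result(
    {"items": [{"id": 1}, {"id": 2}], "failed_ids": [3]}
)
print(result["status"], result["missing"])
```

Reporting the missing pieces explicitly lets the model decide whether the partial data is enough or whether it should fetch the rest another way.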
Strategy 1: Exponential Backoff
When a tool fails with a transient error, retry with increasing delays: roughly 1s before the first retry, 2s before the second, 4s before the third, and so on, with a little random jitter added so that many clients do not retry in lockstep. This prevents hammering a struggling service.
import time
import random

def call_tool_with_backoff(tool_func, *args, max_retries=3, **kwargs):
    """Call a tool with exponential backoff on transient failure."""
    for attempt in range(max_retries):
        try:
            return tool_func(*args, **kwargs)
        except Exception as e:
            error_type = classify_error(e)
            if error_type == "permanent":
                # No point retrying
                raise
            if attempt < max_retries - 1:
                # Calculate backoff: 2^attempt seconds plus up to 1s of jitter
                backoff = (2 ** attempt) + random.uniform(0, 1)
                print(f"Attempt {attempt + 1} failed, retrying in {backoff:.1f}s...")
                time.sleep(backoff)
            else:
                # Out of attempts: let the caller see the last error
                raise
    return None  # Only reached if max_retries is 0

# Usage
def fetch_user_from_db(user_id):
    """Simulated database call."""
    if random.random() < 0.3:  # 30% failure rate
        raise TimeoutError("Database timeout")
    return {"id": user_id, "name": "Alice"}

result = call_tool_with_backoff(fetch_user_from_db, 123)
print(result)
Exponential backoff is simple and effective: quick glitches clear after a short wait, while sustained outages get progressively longer pauses. After max_retries attempts, the last error propagates to the caller, which can pass it to the model so that it can report the failure to the user or try an alternative tool.
Strategy 2: Circuit Breaker
A circuit breaker prevents cascading failures. If a tool fails repeatedly, the circuit breaker "opens" and stops sending requests for a period, allowing the backend to recover.
import random
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"  # Normal operation
    OPEN = "open"  # Failing; reject requests
    HALF_OPEN = "half_open"  # Testing if recovered

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.state = CircuitState.CLOSED
        self.failures = 0  # Consecutive failures
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        """Execute func with circuit breaker logic."""
        if self.state == CircuitState.OPEN:
            # After the timeout, let a trial request through (HALF_OPEN)
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception(f"Circuit open (too many failures). Retry after {self.timeout}s.")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure_time = time.monotonic()
            # A failed trial request reopens the circuit immediately
            if self.state == CircuitState.HALF_OPEN or self.failures >= self.failure_threshold:
                self.state = CircuitState.OPEN
            raise
        # Success: close the circuit and reset the consecutive-failure count
        self.state = CircuitState.CLOSED
        self.failures = 0
        return result

# Usage
breaker = CircuitBreaker(failure_threshold=3, timeout=30)

def fetch_external_api(endpoint):
    """Call an external API (simulated with a 40% failure rate)."""
    if random.random() < 0.4:
        raise Exception("API error")
    return {"data": "success"}

for i in range(10):
    # Catch inside the loop so that one failure does not end the demo
    try:
        result = breaker.call(fetch_external_api, "users")
        print(f"Attempt {i + 1}: ✓")
    except Exception as e:
        print(f"Attempt {i + 1}: ✗ ({e})")
A circuit breaker is crucial for agent systems that call external services. Without one, a failing API can waste tokens and slow down the agent; with one, the agent quickly learns to stop hammering it and tries alternatives.
Strategy 3: Fallback Tools
Some tools have backup alternatives. If the primary tool fails, try a secondary.
def fetch_user_data(user_id, source_order=("api", "database", "cache")):
    """Fetch user data, trying each source in order until one succeeds."""
    # fetch_from_api, fetch_from_db, and fetch_from_cache are placeholders
    # for your own tool implementations.
    sources = {
        "api": lambda uid: call_tool_with_backoff(fetch_from_api, uid),
        "database": lambda uid: call_tool_with_backoff(fetch_from_db, uid),
        "cache": lambda uid: call_tool_with_backoff(fetch_from_cache, uid)
    }
    last_error = None
    for source in source_order:
        try:
            return sources[source](user_id)
        except Exception as e:
            print(f"{source} failed: {e}")
            last_error = e
    # Chain the last underlying error so the cause is not lost
    raise RuntimeError("All fallbacks exhausted") from last_error
Fallback chains let agents degrade gracefully. If the API times out, use the database. If the database is down, use the cache (even if stale). This ensures the agent can often complete its task even when one path fails.
Strategy 4: Informative Error Messages
When a tool fails, return an error message that tells the model what happened and suggests next steps.
def execute_tool_safely(tool_name, arguments, tool_impl, backoff_retries=3):
    """Execute a tool and return a result or error."""
    try:
        result = call_tool_with_backoff(
            tool_impl,
            **arguments,
            max_retries=backoff_retries
        )
        return {
            "status": "success",
            "data": result
        }
    except TimeoutError:
        return {
            "status": "error",
            "error_type": "timeout",
            "message": "Tool timed out. Try a simpler query or try again.",
            "recovery_suggestions": ["Simplify the request", "Wait 30 seconds", "Use an alternative tool"]
        }
    except PermissionError as e:
        return {
            "status": "error",
            "error_type": "permission",
            "message": f"Permission denied: {str(e)}. You do not have access.",
            "recovery_suggestions": ["Request access", "Use a different tool"]
        }
    except Exception as e:
        return {
            "status": "error",
            "error_type": "unknown",
            "message": f"Tool failed: {str(e)}",
            "recovery_suggestions": ["Retry", "Check arguments", "Contact support"]
        }

# The model sees:
# {
#     "status": "error",
#     "error_type": "timeout",
#     "message": "Tool timed out. Try a simpler query or try again.",
#     "recovery_suggestions": ["Simplify the request", "Wait 30 seconds", "Use an alternative tool"]
# }
# The model can then choose to retry, simplify, or try a different tool.
Informative errors are critical. A vague error ("error") tells the model nothing. A detailed error ("timeout after 10 seconds; suggests: wait 30s, simplify query, try cache") guides the model toward recovery.
Integration: Complete Error Handling Pattern
def orchestrated_tool_call(tool_name, arguments, tools_config, max_agent_retries=2):
    """Orchestrate tool execution with full error handling."""
    # Note: max_agent_retries is accepted but not used in this function.
    tool_cfg = tools_config[tool_name]
    circuit_breaker = tool_cfg.get("circuit_breaker")
    max_backoff_retries = tool_cfg.get("max_retries", 3)
    fallback_tool = tool_cfg.get("fallback")
    # Primary attempt with circuit breaker + backoff
    try:
        if circuit_breaker:
            result = circuit_breaker.call(
                call_tool_with_backoff,
                tool_cfg["impl"],
                **arguments,
                max_retries=max_backoff_retries
            )
        else:
            result = call_tool_with_backoff(
                tool_cfg["impl"],
                **arguments,
                max_retries=max_backoff_retries
            )
        return {"status": "success", "data": result}
    except Exception as e:
        primary_error = e
    error_type = classify_error(primary_error)
    message = str(primary_error)
    # Try the fallback unless the error is permanent (e.g. bad arguments).
    # This also covers a "circuit open" rejection, which classifies as "unknown".
    if fallback_tool and error_type != "permanent":
        try:
            result = call_tool_with_backoff(
                tools_config[fallback_tool]["impl"],
                **arguments,
                max_retries=max_backoff_retries
            )
            return {"status": "success_via_fallback", "fallback_tool": fallback_tool, "data": result}
        except Exception as fallback_e:
            # Both failed: report the fallback error alongside the primary one
            message += f" (fallback {fallback_tool} also failed: {fallback_e})"
    # Return informative error
    return {
        "status": "error",
        "error_type": error_type,
        "message": message,
        "recovery_suggestions": get_recovery_suggestions(error_type, tool_name, fallback_tool)
    }

def get_recovery_suggestions(error_type, tool_name, fallback_tool):
    suggestions = []
    if error_type == "transient":
        suggestions.append("Retry after a short delay")
    if error_type == "permanent":
        suggestions.append("Check that arguments are valid")
    if fallback_tool:
        suggestions.append(f"Try {fallback_tool} as alternative")
    suggestions.append("Contact support if issue persists")
    return suggestions
Key Takeaways
- Classify errors as transient (retry-worthy) or permanent (do not retry).
- Use exponential backoff to retry transient failures without hammering the service.
- Implement circuit breakers to prevent cascading failures.
- Define fallback tools for redundancy.
- Return informative error messages that guide the model toward recovery.
Frequently Asked Questions
How long should I retry?
Start with 3 retries and a maximum backoff of 30–60 seconds. For critical tools, extend to 5 retries and 120s. For tools with published SLAs, match retry timeouts to the SLA.
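As a rough illustration of a maximum backoff, the sketch below computes the wait before each retry using the same 2^attempt-plus-jitter formula as the earlier example, with a ceiling applied. The helper name backoff_delays and its base, cap, and jitter parameters are hypothetical, chosen only for this example.

```python
import random


def backoff_delays(max_retries=3, base=1.0, cap=30.0, jitter=True):
    """Return the wait (in seconds) before each retry, never exceeding cap."""
    delays = []
    for attempt in range(max_retries):
        delay = base * (2 ** attempt)
        if jitter:
            # Up to 1s of random jitter, as in call_tool_with_backoff
            delay += random.uniform(0, 1)
        delays.append(min(cap, delay))
    return delays


# Jitter disabled here only to make the schedule easy to read
print(backoff_delays(max_retries=7, cap=30.0, jitter=False))
# → [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

Without a cap, the seventh retry would wait over a minute; with it, the delay levels off at the ceiling.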
Should I retry on rate limit (429)?
Yes. Rate limits are transient; the service is asking you to slow down, and exponential backoff does exactly that. If the response includes a Retry-After header, wait at least that long rather than relying only on the computed backoff.
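A Retry-After value can be either a number of seconds or an HTTP date. The sketch below shows one way to turn it into a delay, assuming the response headers are available as a plain dict keyed by "Retry-After"; the function name retry_after_seconds and its parameters are illustrative, and a real HTTP client may expose headers differently.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default_delay, now=None):
    """Return how long to wait, preferring a Retry-After header if present."""
    value = headers.get("Retry-After")
    if value is None:
        return default_delay
    value = value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "Retry-After: 120"
        return float(value)
    try:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default_delay  # Unparseable header: fall back to our own backoff
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())


print(retry_after_seconds({"Retry-After": "12"}, default_delay=2.0))
# → 12.0
```

The default_delay argument is where the computed exponential backoff would be passed in, so the header overrides it only when the server actually sends one.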
What is a good circuit breaker threshold?
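Start with 5 consecutive failures or 10 failures in a 5-minute window. Tune based on your tool's failure mode: a tool that is routinely flaky needs a higher threshold so the circuit does not open on normal noise, while a tool on the critical path needs a lower one so the agent switches to a fallback sooner.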
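The "failures in a time window" variant needs a little more bookkeeping than a consecutive-failure counter. Here is a minimal sketch of such a counter; the class name FailureWindow and its parameters are hypothetical, and the injectable clock exists only to make the behavior easy to test.

```python
import time
from collections import deque


class FailureWindow:
    """Track failures inside a sliding time window."""

    def __init__(self, max_failures=10, window_seconds=300, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self.failure_times = deque()

    def record_failure(self):
        self.failure_times.append(self.clock())

    def should_open(self):
        """Return True if enough failures occurred within the window."""
        cutoff = self.clock() - self.window_seconds
        # Drop failures that have aged out of the window
        while self.failure_times and self.failure_times[0] < cutoff:
            self.failure_times.popleft()
        return len(self.failure_times) >= self.max_failures


window = FailureWindow(max_failures=10, window_seconds=300)
window.record_failure()
print(window.should_open())
# → False
```

A breaker could call record_failure() in its exception handler and open the circuit when should_open() returns True, instead of comparing a simple counter to the threshold.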
How do I know which tool is the fallback?
Define fallbacks in your tool configuration explicitly. Example: { "name": "fetch_api", "fallback": "fetch_cache" }. Avoid chaining too many fallbacks (A → B → C → D); keep it to 2–3 levels.
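One way to keep fallback chains explicit and bounded is to resolve them from the configuration before calling anything. The sketch below assumes a config dict keyed by tool name, similar to the tools_config used earlier; the tool names, the fallback_chain helper, and the max_depth parameter are all illustrative.

```python
# Hypothetical configuration: each tool names at most one fallback
TOOLS_CONFIG = {
    "fetch_api": {"fallback": "fetch_db"},
    "fetch_db": {"fallback": "fetch_cache"},
    "fetch_cache": {"fallback": None},
}


def fallback_chain(tool_name, tools_config, max_depth=3):
    """Return the ordered list of tools to try, starting with tool_name."""
    chain = []
    current = tool_name
    while current is not None and len(chain) < max_depth:
        if current in chain:
            # A -> B -> A would otherwise loop until max_depth
            raise ValueError(f"Fallback cycle detected at {current!r}")
        chain.append(current)
        current = tools_config[current].get("fallback")
    return chain


print(fallback_chain("fetch_api", TOOLS_CONFIG))
# → ['fetch_api', 'fetch_db', 'fetch_cache']
```

Capping the depth enforces the 2–3 level guideline in code rather than leaving it to convention.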
Can the model learn to use fallbacks?
Yes. If you include fallback information in tool descriptions or system prompts, the model can choose fallbacks proactively. Example: "search_web (or use search_cache if the web is slow)".