Error Handling: Research Agent Robustness
Error handling separates a prototype research agent from a production system. In the wild, networks fail, APIs rate-limit, websites serve corrupted HTML, and LLMs occasionally return unparseable JSON. A robust agent gracefully degrades rather than crashing: it retries with backoff, skips failed sources, falls back to cached data, and completes the best report it can with available information. This article teaches you to build resilience into every layer of the research pipeline.
The goal is not to prevent all errors—that's impossible—but to detect errors early, communicate their severity clearly, and continue executing without compromising the final report. A well-engineered agent can lose 20–30% of its sources to failures and still produce a credible, citable report.
Implementing Exponential Backoff for Transient Failures
Network and API errors are often transient. Exponential backoff doubles the wait after each failed attempt, giving services time to recover; random jitter spreads out retries from many clients so they don't all hit the recovering service at the same moment (the thundering herd effect):
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar('T')

def retry_with_backoff(
    func: Callable[..., T],
    *args,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    jitter: bool = True,
    **kwargs
) -> tuple[Optional[T], bool]:
    """
    Retry a function with exponential backoff.
    Returns: (result, success)
    """
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs), True
        except Exception as e:
            if attempt == max_retries - 1:
                # Last attempt: report the failure instead of raising
                print(f"Failed after {max_retries} attempts: {e}")
                return None, False
            # Calculate backoff: base_delay * 2^attempt, capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Add jitter (±25%) to spread retries
            if jitter:
                delay *= (0.75 + 0.5 * random.random())
            print(f"Attempt {attempt + 1} failed. Retrying in {delay:.1f}s...")
            time.sleep(delay)
    # Only reached when max_retries <= 0
    return None, False

# Example: retry a search
def search_api_call():
    # Simulate an API that fails 50% of the time
    if random.random() < 0.5:
        raise ConnectionError("Network timeout")
    return ["result1", "result2"]

results, success = retry_with_backoff(
    search_api_call,
    max_retries=3,
    base_delay=1.0
)
if success:
    print(f"Success: {results}")
else:
    print("Failed after retries; moving on")
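The helper above retries on any exception, including ones that will never succeed on a second try (a ValueError from bad input, for example). The following is a minimal sketch of a variant that retries only a caller-supplied set of exception types and lets everything else propagate. The name `retry_transient`, the `retry_on` parameter, and the injectable `sleep` argument are illustrative assumptions, not part of the original helper:

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")

def retry_transient(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> Tuple[Optional[T], bool]:
    """Retry func only when it raises one of the retry_on exception types."""
    for attempt in range(max_retries):
        try:
            return func(), True
        except retry_on:
            if attempt == max_retries - 1:
                return None, False
            # Same schedule as above: doubling delay, capped, with ±25% jitter
            delay = min(base_delay * (2 ** attempt), max_delay)
            delay *= 0.75 + 0.5 * random.random()
            sleep(delay)
    return None, False
```

Exceptions outside `retry_on` are not caught, so programming errors surface immediately instead of being hidden behind three slow retries.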
Validating and Sanitizing LLM Output
LLMs occasionally return malformed JSON or invalid data. Validate before using:
import json
from typing import Optional

def parse_json_response(
    text: str,
    expected_keys: Optional[list[str]] = None,
    fallback: Optional[dict] = None
) -> dict:
    """
    Safely parse JSON from an LLM response with validation.
    Returns: the parsed dict, or the fallback (an empty dict by default)
    """
    if fallback is None:
        fallback = {}
    # Attempt 1: Direct JSON parsing
    try:
        data = json.loads(text)
        if validate_json_keys(data, expected_keys):
            return data
    except json.JSONDecodeError:
        pass
    # Attempt 2: Parse the span from the first '{' to the last '}'
    try:
        start = text.find('{')
        end = text.rfind('}') + 1
        if start >= 0 and end > start:
            data = json.loads(text[start:end])
            if validate_json_keys(data, expected_keys):
                return data
    except json.JSONDecodeError:
        pass
    # Attempt 3: Extract from ```json fence
    try:
        start = text.find('```json')
        if start >= 0:
            start += 7
            end = text.find('```', start)
            if end > start:
                data = json.loads(text[start:end].strip())
                if validate_json_keys(data, expected_keys):
                    return data
    except json.JSONDecodeError:
        pass
    # All parsing attempts failed
    print("Could not parse valid JSON. Using fallback.")
    return fallback

def validate_json_keys(data, expected_keys: Optional[list[str]]) -> bool:
    """Check that parsed JSON is a dict containing the expected keys."""
    # json.loads can return a list, string, or number; only a dict is acceptable
    if not isinstance(data, dict):
        return False
    if expected_keys is None:
        return True
    return all(key in data for key in expected_keys)

# Example
llm_response = """
Here's the structured data:
{
"claims": ["Claim 1", "Claim 2"],
"certainty": "high"
}
"""
parsed = parse_json_response(
    llm_response,
    expected_keys=["claims", "certainty"],
    fallback={"claims": [], "certainty": "low"}
)
print(parsed)
Handling Fetch and Parsing Failures Gracefully
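Web pages fail to fetch for many reasons: timeouts, client and server errors, paywalls. Classify each failure so the caller can log it and decide whether to retry, then continue with the remaining sources: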
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import requests

class FetchErrorType(Enum):
    TIMEOUT = "timeout"
    HTTP_4XX = "http_4xx"
    HTTP_5XX = "http_5xx"
    PARSING_ERROR = "parsing_error"
    PAYWALL = "paywall"
    UNKNOWN = "unknown"

@dataclass
class FetchResult:
    """Result of a fetch operation (success or failure)."""
    success: bool
    url: str
    text: Optional[str] = None
    error_type: Optional[FetchErrorType] = None
    error_message: Optional[str] = None

    def is_retryable(self) -> bool:
        """Can this error be resolved by retrying?"""
        return self.error_type in {
            FetchErrorType.TIMEOUT,
            FetchErrorType.HTTP_5XX
        }

def fetch_with_fallbacks(
    url: str,
    timeout: int = 10,
    cache: Optional[dict] = None
) -> FetchResult:
    """
    Fetch a URL, serving from the cache when possible and classifying failures.
    Pass a shared dict as `cache` to reuse results across calls.
    """
    if cache is None:
        cache = {}
    # Check cache first
    if url in cache:
        print(f"Cache hit: {url}")
        return FetchResult(success=True, url=url, text=cache[url])
    # Attempt 1: Simple HTTP fetch
    try:
        response = requests.get(
            url,
            headers={"User-Agent": "ResearchAgent/1.0"},
            timeout=timeout
        )
        if response.status_code == 200:
            cache[url] = response.text
            return FetchResult(success=True, url=url, text=response.text)
        elif response.status_code == 403:
            return FetchResult(
                success=False,
                url=url,
                error_type=FetchErrorType.PAYWALL,
                error_message="Access denied (likely paywalled)"
            )
        elif 400 <= response.status_code < 500:
            return FetchResult(
                success=False,
                url=url,
                error_type=FetchErrorType.HTTP_4XX,
                error_message=f"HTTP {response.status_code}"
            )
        elif response.status_code >= 500:
            return FetchResult(
                success=False,
                url=url,
                error_type=FetchErrorType.HTTP_5XX,
                error_message=f"HTTP {response.status_code} (retryable)"
            )
        else:
            # Any other non-200 status (e.g. 204) is not a server error
            return FetchResult(
                success=False,
                url=url,
                error_type=FetchErrorType.UNKNOWN,
                error_message=f"Unexpected HTTP {response.status_code}"
            )
    except requests.exceptions.Timeout:
        return FetchResult(
            success=False,
            url=url,
            error_type=FetchErrorType.TIMEOUT,
            error_message="Request timeout (retryable)"
        )
    except Exception as e:
        return FetchResult(
            success=False,
            url=url,
            error_type=FetchErrorType.UNKNOWN,
            error_message=str(e)
        )

# Usage
result = fetch_with_fallbacks("https://example.com/article")
if result.success:
    print(f"Fetched {len(result.text)} chars")
else:
    print(f"Fetch failed: {result.error_message}")
    if result.is_retryable():
        print("This error is retryable.")
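To show how the retryable/terminal distinction lets the agent continue with partial data, here is a minimal sketch of a loop over several sources. It assumes a simplified stand-in for FetchResult (`Outcome`) and a caller-supplied `fetch` function so the block runs without network access; `collect_sources`, `Outcome`, and the `sleep` parameter are hypothetical names introduced for this example:

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Outcome:
    """Simplified stand-in for FetchResult (hypothetical, for this sketch)."""
    url: str
    success: bool
    text: Optional[str] = None
    retryable: bool = False

def collect_sources(
    urls: list[str],
    fetch: Callable[[str], Outcome],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[list[Outcome], list[Outcome]]:
    """Fetch every URL; retry retryable failures, skip terminal ones."""
    fetched, failed = [], []
    for url in urls:
        outcome = fetch(url)
        attempt = 1
        # Retry only while the failure is marked retryable and attempts remain
        while not outcome.success and outcome.retryable and attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
            outcome = fetch(url)
            attempt += 1
        (fetched if outcome.success else failed).append(outcome)
    return fetched, failed
```

The report is then written from `fetched`, while `failed` can be logged or listed as unavailable sources. A single bad URL never aborts the run.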
Logging and Monitoring for Post-Mortem Analysis
Log every significant event so you can debug failures later:
import json
from datetime import datetime
from typing import Optional

class ResearchLog:
    """Simple file-based logging for research operations (one JSON object per line)."""

    def __init__(self, log_file: str = "research_agent.log"):
        self.log_file = log_file

    def log_event(
        self,
        event_type: str,
        details: dict,
        severity: str = "info"
    ):
        """Log an event (search, fetch, extract, verify, error)."""
        entry = {
            "timestamp": datetime.now().isoformat(),
            "event_type": event_type,
            "severity": severity,
            "details": details
        }
        with open(self.log_file, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def log_search(self, query: str, result_count: int):
        self.log_event("search", {"query": query, "results": result_count})

    def log_fetch(self, url: str, success: bool, error: Optional[str] = None):
        details = {"url": url, "success": success}
        if error:
            details["error"] = error
        severity = "warning" if not success else "info"
        self.log_event("fetch", details, severity=severity)

    def log_extraction(self, url: str, claim_count: int):
        self.log_event("extraction", {"url": url, "claims": claim_count})

    def log_error(self, step: str, error: str):
        self.log_event("error", {"step": step, "error": error}, severity="error")

    def summary(self) -> dict:
        """Generate a summary from the log (covers everything in the file, including earlier runs)."""
        counts = {}
        errors = []
        try:
            with open(self.log_file, "r") as f:
                for line in f:
                    if not line.strip():
                        continue  # Skip blank lines
                    entry = json.loads(line)
                    event_type = entry["event_type"]
                    counts[event_type] = counts.get(event_type, 0) + 1
                    if entry["severity"] == "error":
                        errors.append(entry["details"])
        except FileNotFoundError:
            pass  # Nothing logged yet: return an empty summary
        return {
            "event_counts": counts,
            "errors": errors,
            "total_events": sum(counts.values())
        }

# Example usage
logger = ResearchLog()
logger.log_search("AI chip manufacturing", 45)
logger.log_fetch("https://example.com", success=False, error="Timeout")
logger.log_error("verification", "JSON parse failed")
summary = logger.summary()
print(f"Processed {summary['total_events']} events")
print(f"Errors encountered: {len(summary['errors'])}")
Key Takeaways
- Implement exponential backoff with jitter for transient failures (timeouts, 5XX errors); max_retries = 3 and base_delay = 1 second are reasonable starting values.
- Validate LLM output by checking JSON structure and expected keys; provide sensible fallbacks when parsing fails.
- Classify fetch errors (retryable vs. terminal) and continue processing with partial data rather than stopping.
- Log all significant events (search, fetch, extract, error) to a file for post-mortem analysis and monitoring.
Frequently Asked Questions
What's the difference between retrying and failing gracefully?
Retrying: "The API returned 503. Let me try again in 2 seconds." Failing gracefully: "The API returned 403. I can't access this page, but I'll keep going with other sources instead of crashing."
Should I retry 403 (Forbidden) errors?
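No. Treat 403 as terminal: it usually means a paywall or an access block, and repeating the request won't change the answer. Retry only 408 (request timeout), 429 (rate limit, after a longer wait), and 500–599 (server errors). Note that is_retryable() as written above covers only timeouts and 5XX responses; 408 and 429 would need their own handling.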
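The status-code rule in this answer can be sketched as a small classifier. The function names `should_retry` and `retry_wait` are hypothetical; the only external fact relied on is that a 429 response may carry a standard Retry-After header, which holds either a number of seconds or an HTTP date (this sketch handles only the numeric form):

```python
from typing import Optional

# Client-error codes worth retrying, per the answer above
RETRYABLE_CLIENT_CODES = {408, 429}

def should_retry(status_code: int) -> bool:
    """True for 408, 429, and any 5XX; False for everything else (including 403)."""
    return status_code in RETRYABLE_CLIENT_CODES or 500 <= status_code <= 599

def retry_wait(status_code: int, retry_after: Optional[str], default_delay: float) -> float:
    """Pick a wait: honor a numeric Retry-After on 429, otherwise use the default."""
    if status_code == 429 and retry_after is not None:
        try:
            return max(float(int(retry_after)), default_delay)
        except ValueError:
            pass  # Retry-After was an HTTP date; not parsed in this sketch
    return default_delay

print(should_retry(403))            # False
print(retry_wait(429, "30", 2.0))   # 30.0
```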
How long should I wait between retries?
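Start with 1 second and double each time: 1s, 2s, 4s, 8s, capped at 60s. The number of waits is one less than the number of attempts, so with max_retries = 3 the agent waits about 1s and then 2s (before jitter). If a source still fails after the final attempt, it is usually not worth pursuing; skip it and move on.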
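To see the waits a given configuration produces, here is a small sketch that computes the schedule without jitter, using the same formula as retry_with_backoff (delay = min(base_delay * 2^attempt, max_delay)). The function name `backoff_schedule` is illustrative:

```python
def backoff_schedule(
    max_retries: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> list[float]:
    """Waits between attempts, without jitter: one fewer than the attempt count."""
    return [min(base_delay * 2 ** i, max_delay) for i in range(max_retries - 1)]

print(backoff_schedule(3))  # [1.0, 2.0]
print(backoff_schedule(5))  # [1.0, 2.0, 4.0, 8.0]
print(backoff_schedule(9))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With jitter enabled, each value is scaled by a random factor between 0.75 and 1.25.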
What if the LLM returns valid JSON but with unexpected values?
Log the issue as a warning, normalize the value where a safe mapping exists (otherwise use the fallback), and continue. Example: if certainty should be "high"/"medium"/"low" but the LLM returned "very high", normalize it to "high" and log the oddity.
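A minimal sketch of that normalization, assuming the three allowed certainty values from the example. The alias table and the name `normalize_certainty` are hypothetical; in practice the aliases would come from values actually seen in your logs:

```python
import logging

logger = logging.getLogger("research_agent")

ALLOWED = {"high", "medium", "low"}
# Hypothetical synonym map: extend it with values observed in real responses
ALIASES = {
    "very high": "high",
    "certain": "high",
    "moderate": "medium",
    "very low": "low",
}

def normalize_certainty(value, default: str = "low") -> str:
    """Map an LLM-supplied certainty onto the allowed set, logging any oddity."""
    cleaned = str(value).strip().lower()
    if cleaned in ALLOWED:
        return cleaned
    mapped = ALIASES.get(cleaned)
    if mapped is not None:
        logger.warning("Normalized certainty %r to %r", value, mapped)
        return mapped
    logger.warning("Unknown certainty %r; using default %r", value, default)
    return default
```

Defaulting unknown values to "low" is a conservative choice: an unrecognized label never inflates the confidence shown in the final report.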