Error Tracking & Root-Cause Debugging for LLM Failures
Error tracking for LLM applications requires categorizing failures into distinct buckets (API errors, parsing errors, data errors, model errors, system errors), correlating them across distributed systems via trace IDs, and building recovery logic that retries or degrades gracefully. Traditional software errors are usually deterministic: a database constraint violation fails the same way every time. LLM failures are often non-deterministic: the same prompt can produce unparseable output on one call and valid output on the next, because sampling is stochastic and model behavior can shift between versions. Effective error tracking therefore captures context (the exact prompt, model version, sampling parameters, and preceding operations) so you can reproduce and fix failures.
Error Categories in LLM Applications
LLM failures fall into several categories:
| Category | Example | Typical Cause | Recovery |
|---|---|---|---|
| API error | Rate limit (429), auth failure (401), timeout | Provider saturation, exhausted quota, or credential issue | Retry 429s and timeouts with exponential backoff; fix credentials for 401 (do not retry) |
| Parsing error | JSON output invalid, expected field missing | Model output doesn't match expected schema | Retry with stricter prompt; use response format constraints |
| Data error | Missing embedding, corrupted database record | Upstream data pipeline failure | Fallback to cache; skip operation |
| Model error | Output truncated at max_tokens; prompt exceeds the context window | Prompt too long or generation parameters misconfigured | Truncate or summarize the prompt; raise max_tokens |
| System error | Out of memory, file not found | Host resource exhaustion or misconfiguration | Scale up; fix configuration |
Each category demands a different recovery strategy. An API rate-limit error justifies a retry; a parsing error requires prompt refinement or fallback.
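The category-to-recovery mapping in the table can be encoded as data so that recovery code looks up a policy instead of hard-coding it. The following is a minimal sketch: `ErrorCategory`, `RecoveryPolicy`, `POLICIES`, and `should_retry` are hypothetical names introduced for illustration, and the attempt counts are assumed defaults, not recommendations from any provider.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCategory(str, Enum):
    API = "api_error"
    PARSING = "parsing_error"
    DATA = "data_error"
    MODEL = "model_error"
    SYSTEM = "system_error"


@dataclass(frozen=True)
class RecoveryPolicy:
    retry: bool        # retry the same operation?
    max_attempts: int  # total attempts, including the first
    fallback: bool     # may a cached or default result be served instead?


# Assumed defaults for illustration; tune them per application.
POLICIES = {
    ErrorCategory.API: RecoveryPolicy(retry=True, max_attempts=3, fallback=True),
    ErrorCategory.PARSING: RecoveryPolicy(retry=True, max_attempts=2, fallback=True),
    ErrorCategory.DATA: RecoveryPolicy(retry=False, max_attempts=1, fallback=True),
    ErrorCategory.MODEL: RecoveryPolicy(retry=False, max_attempts=1, fallback=False),
    ErrorCategory.SYSTEM: RecoveryPolicy(retry=False, max_attempts=1, fallback=False),
}


def should_retry(category: ErrorCategory, attempts_made: int) -> bool:
    """Return True if another attempt is allowed after `attempts_made` failed tries."""
    policy = POLICIES[category]
    return policy.retry and attempts_made < policy.max_attempts
```

Because `ErrorCategory` subclasses `str`, its values serialize directly into the `error_category` field of a JSON log record.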
Structured Error Logging
Log errors with sufficient context for root-cause analysis:
import json
import logging
import traceback
from datetime import datetime, timezone
from typing import Optional

from anthropic import Anthropic, APIError, RateLimitError

logger = logging.getLogger(__name__)

# Example model ID; substitute the model you actually deploy.
MODEL = "claude-3-opus-20240229"


def log_error(error_category: str, message: str, context: dict,
              exception: Optional[Exception] = None):
    """Log a structured error with full context."""
    error_log = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "error",
        "error_category": error_category,  # api_error, parsing_error, etc.
        "message": message,
        "context": context,
        # Format the exception that was passed in, so the trace is correct
        # even when log_error is called outside an except block.
        "stack_trace": "".join(traceback.format_exception(
            type(exception), exception, exception.__traceback__
        )) if exception else None,
    }
    # default=str keeps logging from failing on non-serializable context values
    logger.error(json.dumps(error_log, default=str))


def llm_call_with_error_tracking(user_message: str):
    """LLM call with categorized error tracking."""
    try:
        client = Anthropic()
        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=[{"role": "user", "content": user_message}],
        )
        raw_output = response.content[0].text

        # Attempt to parse response (if expecting JSON)
        try:
            return json.loads(raw_output)
        except json.JSONDecodeError as e:
            log_error(
                error_category="parsing_error",
                message="Failed to parse LLM output as JSON",
                context={
                    # Truncated to bound log size; truncation alone does not remove PII
                    "raw_output": raw_output[:500],
                    "prompt_length": len(user_message),
                    "model": MODEL,
                },
                exception=e,
            )
            raise
    except RateLimitError as e:
        log_error(
            error_category="api_error",
            message="Rate limit exceeded",
            context={
                "api_provider": "anthropic",
                "retry_after": e.response.headers.get("retry-after", "unknown"),
            },
            exception=e,
        )
        raise
    except APIError as e:
        # Connection errors carry no HTTP status, so read it defensively
        status_code = getattr(e, "status_code", None)
        log_error(
            error_category="api_error",
            message=f"API error: {status_code}",
            context={
                "api_provider": "anthropic",
                "status_code": status_code,
                "error_type": type(e).__name__,
            },
            exception=e,
        )
        raise
    except json.JSONDecodeError:
        raise  # Already logged as a parsing_error above
    except Exception as e:
        log_error(
            error_category="system_error",
            message="Unexpected error",
            context={"error_type": type(e).__name__},
            exception=e,
        )
        raise
Each error log includes a category, allowing you to query: "Show me all parsing_errors in the last 24 hours" to identify a systematic issue with output formatting.
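If the logs are not yet in an aggregation platform, the same query can be approximated locally. The sketch below assumes one JSON record per line with the `timestamp` and `error_category` fields used in the logging example; `count_errors_by_category` is a hypothetical helper, not part of any logging library.

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone


def count_errors_by_category(log_lines, now, window=timedelta(hours=24)):
    """Count error records per category among JSON log lines within `window` of `now`."""
    counts = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
            # Accept both "...Z" and "...+00:00" style UTC timestamps
            ts = datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))
            in_window = timedelta(0) <= now - ts <= window
        except (ValueError, KeyError, TypeError, AttributeError):
            continue  # Skip lines that are not structured error records
        if in_window:
            counts[record.get("error_category", "unknown")] += 1
    return counts


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
lines = [
    '{"timestamp": "2024-01-02T11:00:00Z", "error_category": "parsing_error"}',
    '{"timestamp": "2024-01-02T09:30:00+00:00", "error_category": "parsing_error"}',
    '{"timestamp": "2024-01-02T10:00:00Z", "error_category": "api_error"}',
    '{"timestamp": "2023-12-30T10:00:00Z", "error_category": "parsing_error"}',
    "not a JSON line",
]
counts = count_errors_by_category(lines, now)
```

A sudden rise in one category relative to the others is usually the first hint of a systematic issue rather than random noise.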
Root-Cause Analysis via Distributed Traces
When an error occurs, distributed traces reveal the exact preceding operations and their durations. Combined with logs, traces enable root-cause debugging:
from anthropic import Anthropic, APIError
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

# log_error is the structured logging helper from the previous example.

tracer = trace.get_tracer(__name__)

MODEL = "claude-3-opus-20240229"  # Example model ID; substitute your own


def mark_span_failed(span, exc: Exception) -> None:
    """Flag the span as failed and attach the exception to it."""
    span.set_status(Status(StatusCode.ERROR, str(exc)))
    span.set_attribute("error_type", type(exc).__name__)
    span.record_exception(exc)


def context_retrieval_pipeline(user_query: str):
    """Multi-step pipeline with error context in traces."""
    with tracer.start_as_current_span("context_pipeline") as root_span:
        root_span.set_attribute("user_query_length", len(user_query))

        # Step 1: Retrieve documents
        with tracer.start_as_current_span("retrieve_docs") as span:
            try:
                docs = retrieve_documents(user_query)
                span.set_attribute("doc_count", len(docs))
            except Exception as e:
                mark_span_failed(span, e)
                # Log the query length, not the raw query, which may contain PII
                log_error("data_error", "Document retrieval failed",
                          {"query_length": len(user_query)}, e)
                raise

        # Step 2: Embed query
        with tracer.start_as_current_span("embed_query") as span:
            try:
                embedding = embed_query(user_query)
                span.set_attribute("embedding_dim", len(embedding))
            except Exception as e:
                mark_span_failed(span, e)
                log_error("api_error", "Embedding API failed",
                          {"embedding_model": "text-embedding-3-small"}, e)
                raise

        # Step 3: Vector search
        with tracer.start_as_current_span("vector_search") as span:
            try:
                results = vector_search(embedding)
                span.set_attribute("results_count", len(results))
            except Exception as e:
                mark_span_failed(span, e)
                log_error("data_error", "Vector search failed",
                          {"db": "pinecone", "embedding_dim": len(embedding)}, e)
                raise

        # Step 4: LLM call
        with tracer.start_as_current_span("llm_call") as span:
            prompt = assemble_prompt(user_query, docs, results)
            span.set_attribute("prompt_length", len(prompt))
            try:
                client = Anthropic()
                response = client.messages.create(
                    model=MODEL,
                    max_tokens=1024,
                    messages=[{"role": "user", "content": prompt}],
                )
                span.set_attribute("stop_reason", response.stop_reason)
                return response.content[0].text
            except APIError as e:
                mark_span_failed(span, e)
                # Connection errors carry no HTTP status
                status_code = getattr(e, "status_code", None)
                if status_code is not None:
                    span.set_attribute("error_code", status_code)
                log_error("api_error", f"LLM API error {status_code}",
                          {"model": MODEL, "prompt_length": len(prompt)}, e)
                raise


def retrieve_documents(query):
    # Simulate document retrieval
    return [{"id": 1, "text": "sample doc"}]


def embed_query(query):
    # Simulate embedding
    return [0.1] * 1536


def vector_search(embedding):
    # Simulate vector search
    return [{"id": 1, "score": 0.95}]


def assemble_prompt(query, docs, results):
    return f"Query: {query}\nDocs: {docs}\nResults: {results}"
In the trace, if the LLM call fails after successful document retrieval and embedding, you immediately know: the error is in the LLM step, not upstream. If embedding fails, the problem is the embedding API, not the model.
Error Recovery Strategies
Implement graceful error recovery based on error category:
import time

from anthropic import Anthropic, APIConnectionError, RateLimitError

# log_error is the structured logging helper from the logging example.

MODEL = "claude-3-opus-20240229"  # Example model ID; substitute your own


def llm_call_with_retry(user_message: str, max_retries: int = 3):
    """LLM call with exponential backoff retry logic."""
    # The SDK retries some failures on its own by default; max_retries=0
    # disables that so this loop is the only retry layer.
    client = Anthropic(max_retries=0)
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model=MODEL,
                max_tokens=1024,
                messages=[{"role": "user", "content": user_message}],
            )
            return response.content[0].text
        except (RateLimitError, APIConnectionError) as e:
            # Transient (rate limit, timeout, connection failure): wait and retry
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s, ...
                log_error("api_error",
                          f"{type(e).__name__}. Retrying after {wait_time}s",
                          {"attempt": attempt + 1, "max_retries": max_retries}, e)
                time.sleep(wait_time)
            else:
                log_error("api_error", "Transient error persisted after max retries",
                          {"max_retries": max_retries}, e)
                raise
        except Exception as e:
            # Non-retryable error: fail fast
            log_error("system_error", f"Non-retryable error: {type(e).__name__}",
                      {"attempt": attempt + 1}, e)
            raise
    raise RuntimeError("Failed after max retries")


def llm_call_with_fallback(user_message: str, fallback_response: str = None):
    """LLM call with fallback to cached response on failure."""
    try:
        client = Anthropic()
        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=[{"role": "user", "content": user_message}],
        )
        return response.content[0].text
    except Exception as e:
        # Fall back to cached response if available
        if fallback_response is not None:
            log_error("system_error", "LLM call failed, using fallback",
                      {"fallback_used": True}, e)
            return fallback_response
        log_error("system_error", "LLM call failed, no fallback available",
                  {"fallback_used": False}, e)
        raise
Retry is appropriate for transient API errors. Fallback (returning a cached or default response) is appropriate for non-critical operations where some response is better than none.
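Parsing errors call for a different recovery path than API errors: retrying the identical request may simply reproduce the bad output, so the retry tightens the prompt first. The sketch below is one way to do that under stated assumptions. `generate` stands in for any function that sends a prompt to a model and returns text, and `STRICT_SUFFIX` and `parse_with_retry` are hypothetical names chosen for illustration.

```python
import json
from typing import Callable, Optional

# Assumed wording; adapt the instruction to your own prompt style.
STRICT_SUFFIX = "\n\nRespond with a single valid JSON object and nothing else."


def parse_with_retry(generate: Callable[[str], str], prompt: str,
                     max_attempts: int = 2,
                     fallback: Optional[dict] = None) -> dict:
    """Parse model output as a JSON object, retrying with a stricter prompt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    current_prompt = prompt
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            return parsed
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = exc
            current_prompt = prompt + STRICT_SUFFIX  # tighten the instruction
    if fallback is not None:
        return fallback  # degrade gracefully instead of failing
    raise last_error
```

Passing the model call in as a function keeps the recovery logic testable without network access, and the `fallback` argument covers the non-critical case where a default result is acceptable.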
Alerting on Error Spikes
Define alert rules that trigger when error rates exceed thresholds:
# Prometheus alert rules for LLM error monitoring.
# Metric and label names must match what your application exports.
groups:
  - name: llm_errors
    rules:
      - alert: HighParsingErrorRate
        # sum() drops the category label so both sides of the division match
        expr: |
          sum(increase(llm_errors_total{category="parsing_error"}[5m]))
            / sum(increase(llm_calls_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "Parsing error rate exceeds 5%"
          description: "{{ $value | humanizePercentage }} of LLM calls are failing to parse output"
      - alert: RateLimitExceeded
        expr: increase(llm_errors_total{category="api_error", error_code="429"}[1m]) > 10
        for: 1m
        annotations:
          summary: "Rate limit errors detected"
          description: "{{ $value }} rate-limit errors in the last minute"
      - alert: DataRetrievalFailure
        expr: increase(llm_errors_total{category="data_error"}[5m]) > 5
        for: 5m
        annotations:
          summary: "Data retrieval errors increasing"
          description: "{{ $value }} data errors in the last 5 minutes"
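Prometheus evaluates these rules continuously. Once a condition has held for the full "for" duration, the alert fires and Alertmanager can route it to Slack or PagerDuty. The thresholds shown are static values, so tune them against your normal error rate.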
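Before committing to a threshold, it can help to reproduce the ratio logic in plain code and replay recorded traffic through it. The class below is a rough stand-in for a rate-over-window check, not a replacement for Prometheus; `ErrorRateWindow` is a hypothetical name, and it assumes events are recorded in non-decreasing timestamp order.

```python
from collections import deque


class ErrorRateWindow:
    """Track calls in a sliding time window and report the error rate."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, is_error), oldest first

    def record(self, timestamp: float, is_error: bool) -> None:
        self._events.append((timestamp, is_error))

    def rate(self, now: float) -> float:
        """Fraction of calls in the window that were errors (0.0 if no calls)."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()  # drop events older than the window
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def breached(self, now: float, threshold: float = 0.05) -> bool:
        """True when the windowed error rate is strictly above the threshold."""
        return self.rate(now) > threshold
```

Returning 0.0 for an empty window avoids a division by zero during quiet periods, which is also a case worth checking in the real alert expression.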
Key Takeaways
- Categorize errors (API, parsing, data, model, system) to enable targeted recovery strategies.
- Log errors with full context: the request, preceding operations, and stack trace.
- Distributed traces reveal error scope: did the error occur in data retrieval, embedding, vector search, or LLM inference?
- Implement category-specific recovery: retry for transient API errors; fallback for non-critical operations; fail fast for non-retryable errors.
- Alert on error rates and error spikes to catch systematic issues early.
Frequently Asked Questions
How do I distinguish between a transient API error and a bug in my code?
Transient errors (rate limits, timeouts) appear intermittently across unrelated requests and usually succeed on retry. Bugs reproduce with specific inputs. Use distributed traces to check: if the same span fails consistently for the same input, it is a bug. If different spans fail at random, the cause is likely transient.
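That heuristic can be applied mechanically to a history of attempts. The sketch below groups attempts by a hash of the input and labels each failing input; `classify_failures`, the labels, and the `min_attempts` cutoff are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def input_key(text: str) -> str:
    """Short, stable identifier for an input that does not expose its content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def classify_failures(attempts, min_attempts: int = 3):
    """Label each failing input from an iterable of (input_text, succeeded) pairs."""
    stats = defaultdict(lambda: [0, 0])  # key -> [failures, total]
    for text, succeeded in attempts:
        entry = stats[input_key(text)]
        entry[1] += 1
        if not succeeded:
            entry[0] += 1

    labels = {}
    for key, (failures, total) in stats.items():
        if failures == 0:
            continue  # never failed, nothing to classify
        if total < min_attempts:
            labels[key] = "insufficient_data"
        elif failures == total:
            labels[key] = "likely_bug"  # fails every time for this input
        else:
            labels[key] = "likely_transient"  # fails only some of the time
    return labels
```

The labels are hints rather than verdicts: a provider outage that spans every attempt for one input would look like a bug until more attempts arrive.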
Should I log the full prompt if it contains sensitive data?
No. Truncate prompts in logs and use a separate PII redaction rule in your log aggregation platform (Datadog, Splunk) to mask emails, credit card numbers, etc. Alternatively, hash the prompt and log only the hash for debugging without exposing content.
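A minimal sketch of that approach follows. The two regular expressions are deliberately simple assumptions and will miss many kinds of PII, so treat them as a placeholder for a proper redaction service; `loggable_prompt` is a hypothetical helper name.

```python
import hashlib
import re

# Simplified patterns for illustration only; real PII detection needs more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits


def loggable_prompt(prompt: str, preview_chars: int = 80) -> dict:
    """Return log-safe fields describing a prompt: hash, length, redacted preview."""
    # Redact before truncating so a pattern cut off at the boundary is not leaked
    redacted = EMAIL_RE.sub("[EMAIL]", prompt)
    redacted = CARD_RE.sub("[CARD]", redacted)
    return {
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_length": len(prompt),
        "prompt_preview": redacted[:preview_chars],
    }
```

The hash lets you confirm that two failures involved the same prompt, and look the prompt up in a separately secured store, without the log line ever containing the text.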
What is a good error recovery timeout?
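For rate-limit errors, honor the retry-after header when the provider sends one; otherwise use exponential backoff starting at 1 second and capped at 1 minute, with random jitter so that clients do not retry in lockstep. For other transient errors, retry up to 3 times with a maximum wait of 30 seconds. For non-transient errors, fail immediately (do not retry).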
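The delay schedule described here can be computed by a small pure function. This sketch adds random jitter on top of the capped exponential schedule, which is an assumption beyond the numbers above, and prefers a server-provided retry-after value when one is available. `backoff_delay` and its parameters are hypothetical names.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  retry_after: Optional[float] = None,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # The server said how long to wait; retrying sooner would be wasted
        return max(0.0, float(retry_after))
    ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... capped at `cap`
    source = rng if rng is not None else random
    # Pick uniformly between 0 and the ceiling to spread out concurrent clients
    return source.uniform(0.0, ceiling)
```

Accepting an optional `random.Random` instance makes the function deterministic in tests while defaulting to the shared module-level generator in production.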
How do I correlate errors across services in a microservices architecture?
Pass the trace ID with every request (for example in the W3C traceparent HTTP header, or in message metadata on queues) and include it as a field in every log line. Log aggregation tools can then group logs that share a trace ID, showing you the full error flow across services.
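The mechanics can be sketched with the standard library alone. The example below stores the trace ID in a context variable, stamps it onto every log record through a logging filter, and copies it into outgoing headers. The header name `X-Trace-Id` and the helper names are assumptions for illustration; services instrumented with OpenTelemetry typically propagate the W3C traceparent header through the library's own propagators instead.

```python
import contextvars
import logging
import uuid

# Hypothetical header name, used here only to keep the sketch simple.
TRACE_HEADER = "X-Trace-Id"

_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record as `record.trace_id`."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = _trace_id.get()
        return True  # never drop the record, only annotate it


def start_request(incoming_headers: dict) -> str:
    """Adopt the caller's trace ID, or mint one at the edge of the system."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers() -> dict:
    """Headers to attach to downstream calls so they join the same trace."""
    return {TRACE_HEADER: _trace_id.get()}


handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
```

Using a context variable rather than a global means concurrent requests in threads or asyncio tasks each see their own trace ID.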