Fallback Strategies: When Validation Fails
Not every LLM output can be repaired. Sometimes the model is confused, the schema is wrong, or the LLM has hit a limit. Fallback strategies ensure your system degrades gracefully instead of crashing. A well-designed fallback hierarchy can maintain 99%+ uptime even when the primary LLM fails 10% of the time (Google, 2024).
This article teaches you fallback patterns, from cached responses to secondary models to human escalation, and how to decide which is right for your use case.
Fallback Hierarchy
Design a fallback hierarchy from most to least preferred:
Primary LLM
↓ [fail]
Cache
↓ [miss]
Secondary LLM (cheaper or faster)
↓ [fail]
Template / Default Response
↓ [fail]
Human Escalation
↓ [fail]
Graceful Error to User
At each level, you try to provide valid output. If all else fails, you gracefully explain the failure to the user rather than crashing.
Level 1: Cached Responses
Maintain a cache of previous valid outputs. If the LLM fails, use a cached response for similar inputs.
import json
import hashlib
from functools import lru_cache
class OutputCache:
def __init__(self, max_size: int = 10000):
self.cache = {}
self.max_size = max_size
def _hash_input(self, text: str) -> str:
"""Hash input for cache key."""
return hashlib.md5(text.encode()).hexdigest()
def get(self, input_text: str) -> Optional[dict]:
"""Retrieve cached output for input."""
key = self._hash_input(input_text)
return self.cache.get(key)
def set(self, input_text: str, output: dict) -> None:
"""Cache validated output."""
if len(self.cache) >= self.max_size:
# Evict oldest entry (simple FIFO)
self.cache.pop(next(iter(self.cache)))
key = self._hash_input(input_text)
self.cache[key] = output
def fallback(self, input_text: str) -> Optional[dict]:
"""Use cached output as fallback."""
return self.get(input_text)
# Usage
cache = OutputCache()
def extract_with_fallback(text: str, schema: dict) -> Optional[dict]:
# Try cache first
cached = cache.fallback(text)
if cached:
print("Using cached response")
return cached
# Try primary LLM
output = llm_call(text, schema)
try:
parsed = json.loads(output)
jsonschema.validate(parsed, schema)
cache.set(text, parsed) # Cache successful output
return parsed
except (json.JSONDecodeError, jsonschema.ValidationError):
# Fallback: return cached response for similar input
similar = find_similar_cached(text, cache)
if similar:
return similar
return None
Cache is most effective for high-volume, high-repetition tasks (e.g., customer service, common questions).
Level 2: Secondary LLM
If the primary model fails, use a secondary model (cheaper, faster, or more reliable).
import anthropic
def extract_with_secondary_fallback(text: str, schema: dict) -> Optional[dict]:
"""Try primary model, fall back to secondary."""
models = [
("claude-3-5-sonnet-20241022", 2), # (model, max_retries)
("claude-3-opus-20250219", 1), # Fallback: smaller/faster
]
for model, max_retries in models:
client = anthropic.Anthropic()
for attempt in range(max_retries):
response = client.messages.create(
model=model,
max_tokens=500,
messages=[{
"role": "user",
"content": f"Extract: {text}\nSchema: {json.dumps(schema)}"
}]
)
output = response.content[0].text
try:
parsed = json.loads(output)
jsonschema.validate(parsed, schema)
return parsed
except (json.JSONDecodeError, jsonschema.ValidationError):
continue
return None
Secondary fallbacks are useful for reliability (two models are less likely to both fail) but add cost.
Level 3: Template / Default Response
When the LLM fails entirely, return a sensible default or template response.
def extract_with_template_fallback(text: str, schema: dict) -> dict:
"""Extract with fallback to template."""
# Try primary LLM
output = llm_call(text, schema)
try:
parsed = json.loads(output)
jsonschema.validate(parsed, schema)
return parsed
except (json.JSONDecodeError, jsonschema.ValidationError):
pass
# Fallback to template
template = {
"sentiment": "unknown",
"confidence": 0.0,
"summary": "Unable to analyze. Please try again."
}
return template
Templates should be sensible defaults that won't break downstream code. Use them when real responses aren't critical (e.g., optional metadata).
Level 4: Partial Response
Extract and return whatever valid data you can, even if incomplete.
def extract_with_partial_fallback(text: str, schema: dict) -> dict:
"""Extract with fallback to partial response."""
output = llm_call(text, schema)
try:
parsed = json.loads(output)
except json.JSONDecodeError:
# Fallback: extract valid fields manually
parsed = extract_partial(text)
# Validate and extract only required fields
validated = {}
for field in schema.get("required", []):
if field in parsed:
validated[field] = parsed[field]
else:
validated[field] = infer_default(field, schema)
return validated
def extract_partial(text: str) -> dict:
"""Manually extract partial data from text."""
result = {}
# Simple heuristics for common fields
if "positive" in text.lower():
result["sentiment"] = "positive"
elif "negative" in text.lower():
result["sentiment"] = "negative"
# Extract numbers (prices, ratings)
import re
numbers = re.findall(r"\d+\.?\d*", text)
if numbers:
result["rating"] = float(numbers[0])
return result
Partial responses are invaluable for extraction tasks where some data is better than none.
Level 5: Human Escalation
If all automatic strategies fail, escalate to a human.
def extract_with_escalation(
text: str,
schema: dict,
user_id: str,
escalation_queue: Queue
) -> Optional[dict]:
"""Extract with human escalation fallback."""
# Try automatic extraction
output = attempt_automatic_extraction(text, schema)
if output:
return output
# Escalate to human
escalation_queue.put({
"user_id": user_id,
"task": "manual_extraction",
"input": text,
"schema": schema,
"timestamp": datetime.now()
})
# Notify user
send_notification(
user_id,
"Your request requires human review. We'll get back to you within 2 hours."
)
return None
Escalation is your safety net. Log all escalations to improve your primary system.
Comparison: Fallback Strategies
| Strategy | Success Rate | Cost | Latency | Complexity |
|---|---|---|---|---|
| Cache | 70–90% (high-repetition) | Free | 1ms | Low |
| Secondary LLM | 90–95% | Medium (2x cost) | +500ms | Medium |
| Template | 95%+ | Free | 1ms | Low |
| Partial response | 80–90% | Low | Variable | Medium |
| Human escalation | 98%+ | High | Hours | High |
Implementing a Complete Fallback System
class RobustExtractor:
def __init__(self, schema: dict):
self.schema = schema
self.cache = OutputCache()
self.escalation_queue = Queue()
def extract(self, text: str, user_id: str) -> dict:
"""Extract with complete fallback hierarchy."""
# Level 1: Cache
cached = self.cache.fallback(text)
if cached:
return cached
# Level 2: Primary LLM
try:
output = llm_call(text, self.schema)
parsed = json.loads(output)
jsonschema.validate(parsed, self.schema)
self.cache.set(text, parsed)
return parsed
except Exception:
pass
# Level 3: Secondary LLM
try:
output = secondary_llm_call(text, self.schema)
parsed = json.loads(output)
jsonschema.validate(parsed, self.schema)
return parsed
except Exception:
pass
# Level 4: Partial response
partial = extract_partial(text)
if partial:
return partial
# Level 5: Escalate
self.escalation_queue.put({
"user_id": user_id,
"input": text,
"timestamp": datetime.now()
})
return None
Key Takeaways
- Fallback hierarchies ensure graceful degradation: cache → secondary → template → partial → escalate.
- Caching is the cheapest fallback, effective for repetitive tasks.
- Secondary models add redundancy but increase cost; use for high-value tasks.
- Partial responses are valuable for extraction; some data beats no data.
- Always escalate unrecoverable failures rather than failing silently.
- Log all fallback invocations to improve your primary system.
Frequently Asked Questions
How much should I invest in fallbacks?
For critical systems (e.g., payment processing), invest heavily: cache + secondary + escalation. For nice-to-have features (e.g., metadata), a simple template is enough.
Should I retry the primary LLM before falling back?
Yes, 1–3 retries (with corrective feedback) make sense. Beyond that, move to fallback. Retrying indefinitely wastes time and money.
How do I prevent stale cached responses?
Add TTL (time-to-live) to cached entries: cache[key] = (value, timestamp). Refresh cache entries periodically or when upstream data changes.
What if all fallbacks fail?
Return a clear error message to the user explaining what happened and next steps (e.g., "Please try again later"). Never silently fail.