Fallback Strategies: When Validation Fails

Not every LLM output can be repaired. Sometimes the model is confused, the schema is wrong, or the request has hit a token or rate limit. Fallback strategies let your system degrade gracefully instead of crashing. A well-designed fallback hierarchy can maintain 99%+ uptime even when the primary LLM fails 10% of the time (Google, 2024).

This article teaches you fallback patterns, from cached responses to secondary models to human escalation, and how to decide which is right for your use case.

Fallback Hierarchy

Design a fallback hierarchy from most to least preferred:

Primary LLM
↓ [fail]
Cache
↓ [miss]
Secondary LLM (cheaper or faster)
↓ [fail]
Template / Default Response
↓ [fail]
Partial Response
↓ [fail]
Human Escalation
↓ [fail]
Graceful Error to User

At each level, you try to provide valid output. If all else fails, you gracefully explain the failure to the user rather than crashing.
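
The hierarchy can be expressed as an ordered list of strategies that are tried in turn. The following is a minimal sketch rather than a definitive implementation: `run_fallback_chain`, the `Strategy` type, and the shape of the returned dictionary are all illustrative names that do not come from any library.

```python
from typing import Callable, List, Optional, Tuple

# Each strategy takes the input text and returns a validated dict,
# or None (or raises) when it cannot produce one.
Strategy = Callable[[str], Optional[dict]]


def run_fallback_chain(text: str, strategies: List[Tuple[str, Strategy]]) -> dict:
    """Try each strategy in order and return the first usable result."""
    for name, strategy in strategies:
        try:
            result = strategy(text)
        except Exception:
            # Treat any exception as a failure of this level and move on.
            result = None
        if result is not None:
            return {"ok": True, "source": name, "data": result}
    # Every level failed: return an explicit error instead of raising.
    return {"ok": False, "source": None, "error": "All fallback levels failed."}


def failing_primary(text: str) -> Optional[dict]:
    raise RuntimeError("model unavailable")


def template(text: str) -> Optional[dict]:
    return {"sentiment": "unknown"}


result = run_fallback_chain(
    "great product",
    [("primary", failing_primary), ("template", template)],
)
```

Recording which level produced the answer (`source` here) makes it possible to measure later how often each fallback is actually used.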

Level 1: Cached Responses

Maintain a cache of previous valid outputs. If the LLM fails, use a cached response for similar inputs.

import hashlib
import json
from typing import Optional

import jsonschema  # third-party package


class OutputCache:
    def __init__(self, max_size: int = 10000):
        self.cache = {}
        self.max_size = max_size

    def _hash_input(self, text: str) -> str:
        """Hash input for cache key."""
        return hashlib.md5(text.encode()).hexdigest()

    def get(self, input_text: str) -> Optional[dict]:
        """Retrieve cached output for input."""
        key = self._hash_input(input_text)
        return self.cache.get(key)

    def set(self, input_text: str, output: dict) -> None:
        """Cache validated output."""
        key = self._hash_input(input_text)
        if key not in self.cache and len(self.cache) >= self.max_size:
            # Evict oldest entry (simple FIFO; dicts preserve insertion order)
            self.cache.pop(next(iter(self.cache)))
        self.cache[key] = output

    def fallback(self, input_text: str) -> Optional[dict]:
        """Use cached output as fallback."""
        return self.get(input_text)


# Usage
# llm_call and find_similar_cached are assumed to be defined elsewhere.
cache = OutputCache()


def extract_with_fallback(text: str, schema: dict) -> Optional[dict]:
    # Try cache first
    cached = cache.fallback(text)
    if cached is not None:
        print("Using cached response")
        return cached

    # Try primary LLM
    output = llm_call(text, schema)
    try:
        parsed = json.loads(output)
        jsonschema.validate(parsed, schema)
        cache.set(text, parsed)  # Cache successful output
        return parsed
    except (json.JSONDecodeError, jsonschema.ValidationError):
        # Fallback: return cached response for similar input
        similar = find_similar_cached(text, cache)
        if similar:
            return similar
        return None

Cache is most effective for high-volume, high-repetition tasks (e.g., customer service, common questions).
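
The `find_similar_cached` helper used above is not defined in this article. One possible approach, sketched here under stated assumptions, is to key the cache by normalized input text instead of a hash, so that near-duplicate inputs can be matched with the standard library's `difflib.get_close_matches`. The `SimilarityCache` class and its `cutoff` value of 0.9 are hypothetical choices for illustration.

```python
import difflib
from typing import Optional


class SimilarityCache:
    """Cache keyed by normalized input text so near-duplicates can be found."""

    def __init__(self, cutoff: float = 0.9):
        self.entries: dict = {}
        self.cutoff = cutoff  # minimum similarity ratio in [0, 1]; assumed value

    @staticmethod
    def _normalize(text: str) -> str:
        # Lowercase and collapse whitespace so trivial variations match.
        return " ".join(text.lower().split())

    def set(self, input_text: str, output: dict) -> None:
        self.entries[self._normalize(input_text)] = output

    def find_similar(self, input_text: str) -> Optional[dict]:
        """Return the output cached for the closest stored input, if close enough."""
        key = self._normalize(input_text)
        matches = difflib.get_close_matches(
            key, list(self.entries), n=1, cutoff=self.cutoff
        )
        return self.entries[matches[0]] if matches else None


similar_cache = SimilarityCache()
similar_cache.set("Where is my order?", {"intent": "order_status"})
hit = similar_cache.find_similar("where is  my order")
miss = similar_cache.find_similar("cancel my subscription")
```

This lookup scans every stored key, so it only suits small caches. Character-level similarity can also match inputs that differ in meaning (for example, a single changed number), so a high cutoff is the safer default.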

Level 2: Secondary LLM

If the primary model fails, use a secondary model (cheaper, faster, or more reliable).

import json
from typing import Optional

import anthropic
import jsonschema


def extract_with_secondary_fallback(text: str, schema: dict) -> Optional[dict]:
    """Try primary model, fall back to secondary."""

    models = [
        ("claude-3-5-sonnet-20241022", 2),  # (model, max_retries)
        ("claude-3-5-haiku-20241022", 1),   # Fallback: smaller/faster
    ]
    # Create the client once; it reads ANTHROPIC_API_KEY from the environment.
    client = anthropic.Anthropic()

    for model, max_retries in models:
        for _ in range(max_retries):
            try:
                response = client.messages.create(
                    model=model,
                    max_tokens=500,
                    messages=[{
                        "role": "user",
                        "content": f"Extract: {text}\nSchema: {json.dumps(schema)}"
                    }]
                )
                output = response.content[0].text
                parsed = json.loads(output)
                jsonschema.validate(parsed, schema)
                return parsed
            except (anthropic.APIError, json.JSONDecodeError, jsonschema.ValidationError):
                # API failure or invalid output: retry, then try the next model
                continue

    return None

Secondary fallbacks are useful for reliability (two models are less likely to both fail) but add cost.

Level 3: Template / Default Response

When the LLM fails entirely, return a sensible default or template response.

import json

import jsonschema


def extract_with_template_fallback(text: str, schema: dict) -> dict:
    """Extract with fallback to template."""

    # Try primary LLM (llm_call is assumed to be defined elsewhere)
    output = llm_call(text, schema)
    try:
        parsed = json.loads(output)
        jsonschema.validate(parsed, schema)
        return parsed
    except (json.JSONDecodeError, jsonschema.ValidationError):
        pass

    # Fallback to template
    template = {
        "sentiment": "unknown",
        "confidence": 0.0,
        "summary": "Unable to analyze. Please try again."
    }

    return template

Templates should be sensible defaults that won't break downstream code. Use them when real responses aren't critical (e.g., optional metadata).

Level 4: Partial Response

Extract and return whatever valid data you can, even if incomplete.

import json
import re


def extract_with_partial_fallback(text: str, schema: dict) -> dict:
    """Extract with fallback to partial response."""

    # llm_call and infer_default are assumed to be defined elsewhere.
    output = llm_call(text, schema)

    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict):
        # Fallback: extract valid fields manually
        # (also covers valid JSON that is not an object, e.g. a list)
        parsed = extract_partial(text)

    # Validate and extract only required fields
    validated = {}
    for field in schema.get("required", []):
        if field in parsed:
            validated[field] = parsed[field]
        else:
            validated[field] = infer_default(field, schema)

    return validated


def extract_partial(text: str) -> dict:
    """Manually extract partial data from text."""
    result = {}

    # Simple heuristics for common fields
    if "positive" in text.lower():
        result["sentiment"] = "positive"
    elif "negative" in text.lower():
        result["sentiment"] = "negative"

    # Extract numbers (prices, ratings)
    numbers = re.findall(r"\d+\.?\d*", text)
    if numbers:
        result["rating"] = float(numbers[0])

    return result

Partial responses are invaluable for extraction tasks where some data is better than none.

Level 5: Human Escalation

If all automatic strategies fail, escalate to a human.

from datetime import datetime
from queue import Queue
from typing import Optional


def extract_with_escalation(
    text: str,
    schema: dict,
    user_id: str,
    escalation_queue: Queue
) -> Optional[dict]:
    """Extract with human escalation fallback."""

    # Try automatic extraction
    # (attempt_automatic_extraction and send_notification are assumed
    # to be defined elsewhere)
    output = attempt_automatic_extraction(text, schema)
    if output:
        return output

    # Escalate to human
    escalation_queue.put({
        "user_id": user_id,
        "task": "manual_extraction",
        "input": text,
        "schema": schema,
        "timestamp": datetime.now()
    })

    # Notify user
    send_notification(
        user_id,
        "Your request requires human review. We'll get back to you within 2 hours."
    )

    return None

Escalation is your safety net. Log all escalations to improve your primary system.
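
One way to log escalations and other fallback invocations is a small helper built on the standard library's `logging` and `collections.Counter`. This is a sketch under assumptions: the names `record_fallback`, `fallback_share`, and the `fallbacks` logger are hypothetical, and a production system would more likely send these counts to its metrics backend.

```python
import logging
from collections import Counter

logger = logging.getLogger("fallbacks")
fallback_counts: Counter = Counter()


def record_fallback(level: str, reason: str, input_text: str) -> None:
    """Count and log one fallback invocation."""
    fallback_counts[level] += 1
    # Log the input length rather than the text to avoid leaking user data.
    logger.warning(
        "fallback level=%s reason=%s input_chars=%d",
        level, reason, len(input_text),
    )


def fallback_share(level: str) -> float:
    """Fraction of all recorded fallbacks that were handled at this level."""
    total = sum(fallback_counts.values())
    return fallback_counts[level] / total if total else 0.0
```

Calling `record_fallback("escalation", "validation_error", text)` just before the request is queued gives a running picture of which inputs the automatic levels cannot handle.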

Comparison: Fallback Strategies

Strategy           Success Rate               Cost               Latency    Complexity
Cache              70–90% (high-repetition)   Free               1 ms       Low
Secondary LLM      90–95%                     Medium (2x cost)   +500 ms    Medium
Template           95%+                       Free               1 ms       Low
Partial response   80–90%                     Low                Variable   Medium
Human escalation   98%+                       High               Hours      High

Implementing a Complete Fallback System

import json
from datetime import datetime
from queue import Queue
from typing import Optional

import jsonschema

# OutputCache and extract_partial are defined in the sections above;
# llm_call and secondary_llm_call are assumed to be defined elsewhere.


class RobustExtractor:
    def __init__(self, schema: dict):
        self.schema = schema
        self.cache = OutputCache()
        self.escalation_queue = Queue()

    def extract(self, text: str, user_id: str) -> Optional[dict]:
        """Extract with complete fallback hierarchy."""

        # Level 1: Cache
        cached = self.cache.fallback(text)
        if cached is not None:
            return cached

        # Level 2: Primary LLM
        try:
            output = llm_call(text, self.schema)
            parsed = json.loads(output)
            jsonschema.validate(parsed, self.schema)
            self.cache.set(text, parsed)
            return parsed
        except Exception:
            pass

        # Level 3: Secondary LLM
        try:
            output = secondary_llm_call(text, self.schema)
            parsed = json.loads(output)
            jsonschema.validate(parsed, self.schema)
            return parsed
        except Exception:
            pass

        # Level 4: Partial response
        partial = extract_partial(text)
        if partial:
            return partial

        # Level 5: Escalate
        self.escalation_queue.put({
            "user_id": user_id,
            "input": text,
            "timestamp": datetime.now()
        })

        # None tells the caller the request is pending human review
        return None

Key Takeaways

  • Fallback hierarchies ensure graceful degradation: cache → secondary → template → partial → escalate.
  • Caching is the cheapest fallback, effective for repetitive tasks.
  • Secondary models add redundancy but increase cost; use for high-value tasks.
  • Partial responses are valuable for extraction; some data beats no data.
  • Always escalate unrecoverable failures rather than failing silently.
  • Log all fallback invocations to improve your primary system.

Frequently Asked Questions

How much should I invest in fallbacks?

For critical systems (e.g., payment processing), invest heavily: cache + secondary + escalation. For nice-to-have features (e.g., metadata), a simple template is enough.

Should I retry the primary LLM before falling back?

Yes, 1–3 retries (with corrective feedback) make sense. Beyond that, move to fallback. Retrying indefinitely wastes time and money.
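
A retry loop with corrective feedback might look like the following sketch. The `call_model` and `validate` parameters are hypothetical callables supplied by the caller (for example, a wrapper around your LLM client and a function that raises on schema violations), and the wording of the corrective prompt is only an example.

```python
import json
from typing import Callable, Optional


def retry_with_feedback(
    call_model: Callable[[str], str],
    prompt: str,
    validate: Callable[[dict], None],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call the model up to max_attempts times, feeding each error back."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            parsed = json.loads(raw)
            validate(parsed)  # expected to raise on invalid data
            return parsed
        except Exception as exc:
            # Append the error so the next attempt can correct it.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid: {exc}\n"
                "Reply with corrected JSON only."
            )
    return None  # caller should move on to the next fallback level
```

The corrective prompt is rebuilt from the original prompt each time instead of accumulating every past error, which keeps the prompt length bounded across attempts.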

How do I prevent stale cached responses?

Add TTL (time-to-live) to cached entries: cache[key] = (value, timestamp). Refresh cache entries periodically or when upstream data changes.
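
The `(value, timestamp)` idea can be sketched as a small wrapper class. `TTLCache` and its injectable `clock` parameter are illustrative names, and the one-hour default is an assumption, not a recommendation.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Cache whose entries expire ttl_seconds after they were stored."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.entries: dict = {}

    def set(self, key: str, value: dict) -> None:
        # Store the value together with the time it was cached.
        self.entries[key] = (value, self.clock())

    def get(self, key: str) -> Optional[dict]:
        item = self.entries.get(key)
        if item is None:
            return None
        value, stored_at = item
        if self.clock() - stored_at > self.ttl_seconds:
            del self.entries[key]  # expired: drop it and report a miss
            return None
        return value
```

Expired entries are removed lazily on lookup here; a cache that holds many rarely read keys would also need a periodic sweep to free memory.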

What if all fallbacks fail?

Return a clear error message to the user explaining what happened and next steps (e.g., "Please try again later"). Never silently fail.

Further Reading