Safe Rollback Strategies: Reverting Problematic Prompts
A rollback is the process of reverting a prompt (or other system change) to a previous known-good version when a problem is detected. Rollbacks are your insurance policy: if a new prompt causes a 10% drop in accuracy or a 2x latency spike, you revert in minutes, minimizing customer impact. Without a rollback procedure, you're stuck debugging while users suffer.
The goal of a rollback strategy is speed: from "we detected a problem" to "we're running the old version" in less than 5 minutes, ideally less than 1 minute. This requires automation, pre-planning, and clear decision rules.
Rollback Triggers: When to Revert
Define explicit criteria for when a rollback is mandatory, recommended, or optional:
| Trigger | Severity | Action | Who Decides | Time Limit |
|---|---|---|---|---|
| Error rate > 5% | Critical | Auto-rollback | System | < 2 min |
| Error rate 2–5% | High | Page on-call; prompt decision | On-call | < 5 min |
| Quality metric drops > 20% | High | Alert team; recommend rollback | Product PM | < 30 min |
| Quality metric drops 10–20% | Medium | Alert; observe for 1 hour | Product PM | < 1 hour |
| Latency p99 doubles | High | Page on-call | On-call | < 10 min |
Implement triggers in code:
class RollbackManager:
"""Detect problems and execute rollback."""
def __init__(self, registry, metrics, notifier):
self.registry = registry
self.metrics = metrics
self.notifier = notifier
def check_rollback_criteria(self, prompt_version: str) -> dict:
"""
Check if rollback criteria are met.
Return: {"should_rollback": bool, "reason": str, "severity": str}
"""
# Fetch recent metrics
metrics = self.metrics.get_recent_metrics(prompt_version, time_window_min=5)
error_rate = metrics.get("error_rate_pct")
quality_drop = metrics.get("quality_drop_pct")
latency_increase = metrics.get("latency_increase_pct")
# Check rollback criteria
if error_rate > 5.0:
return {
"should_rollback": True,
"reason": f"Error rate {error_rate:.1f}% exceeds 5% threshold",
"severity": "critical",
"auto": True
}
if error_rate > 2.0:
return {
"should_rollback": True,
"reason": f"Error rate {error_rate:.1f}% exceeds 2% threshold",
"severity": "high",
"auto": False # Requires manual approval
}
if quality_drop > 20.0:
return {
"should_rollback": True,
"reason": f"Quality dropped {quality_drop:.1f}%",
"severity": "high",
"auto": False
}
if latency_increase > 100.0: # 2x
return {
"should_rollback": True,
"reason": f"Latency increased {latency_increase:.1f}%",
"severity": "high",
"auto": False
}
return {"should_rollback": False, "reason": "All metrics healthy"}
def execute_rollback(self, prompt_name: str, previous_version: str,
reason: str, operator: str = "system"):
"""Execute immediate rollback."""
current_version = self.registry.get_active_version(prompt_name)
# Log the rollback
self.registry.log_rollback(
prompt_name=prompt_name,
from_version=current_version,
to_version=previous_version,
reason=reason,
operator=operator,
timestamp=datetime.now().isoformat()
)
# Update routing to serve the old version
self.registry.set_active_version(prompt_name, previous_version)
# Notify the team
self.notifier.alert(
title=f"ROLLBACK: {prompt_name}",
message=f"Reverted from {current_version} to {previous_version}\nReason: {reason}",
severity="critical",
channels=["#incidents", "on-call"]
)
return {
"success": True,
"from_version": current_version,
"to_version": previous_version,
"timestamp": datetime.now().isoformat()
}
# Usage: background job (runs every 30 seconds)
mgr = RollbackManager(registry, metrics, notifier)
async def rollback_check_loop():
prompt_name = "customer-support"
current_version = registry.get_active_version(prompt_name)
criteria = mgr.check_rollback_criteria(current_version)
if criteria["should_rollback"]:
if criteria["auto"]:
# Auto-rollback (critical)
previous_version = registry.get_previous_version(prompt_name)
mgr.execute_rollback(prompt_name, previous_version, criteria["reason"])
else:
# Alert on-call for manual decision
mgr.notifier.page_oncall(f"Rollback recommended: {criteria['reason']}")
Rollback Decision Tree
Create a decision tree to help operators decide whether to rollback:
START: Is error rate or latency spiking?
├─ YES: Go to Severity Check
│ ├─ Severity = CRITICAL (error rate > 5%)?
│ │ └─ → AUTO-ROLLBACK (within 2 minutes)
│ ├─ Severity = HIGH (error rate 2–5%, latency 2x)?
│ │ └─ → PAGE ON-CALL: Recommend rollback
│ │ ├─ Operator confirms? → EXECUTE ROLLBACK
│ │ └─ Operator says "investigate"? → Go to Investigation
│ └─ Severity = MEDIUM (10–20% quality drop)?
│ └─ → ALERT TEAM: Monitor for 1 hour
│ ├─ Issue persists? → EXECUTE ROLLBACK
│ └─ Issue resolves? → MONITOR
└─ NO: All metrics normal
└─ → MONITOR (no action)
Executing Rollback: Step by Step
- Confirm the problem. Check metrics; verify the issue is real.
- Identify the previous good version. Use deployment logs; test if necessary.
- Update routing. Switch traffic to the old version (should be instant).
- Verify the rollback. Check that metrics recover within 5 minutes.
- Post-mortem. Document what happened; plan preventive measures.
async def execute_rollback_with_verification(
prompt_name: str,
target_version: str,
mgr: RollbackManager
):
"""Rollback with verification steps."""
print(f"Step 1: Confirming problem...")
criteria = mgr.check_rollback_criteria(mgr.registry.get_active_version(prompt_name))
if not criteria["should_rollback"]:
print("Problem not confirmed. Aborting rollback.")
return
print(f"Step 2: Executing rollback to {target_version}...")
result = mgr.execute_rollback(
prompt_name=prompt_name,
previous_version=target_version,
reason=criteria["reason"],
operator="[email protected]"
)
print(f"Step 3: Waiting for metrics to recover...")
for i in range(12): # Check every 5 seconds for 60 seconds
await asyncio.sleep(5)
new_metrics = mgr.metrics.get_recent_metrics(target_version, time_window_min=1)
error_rate = new_metrics.get("error_rate_pct")
print(f" Check {i+1}: error_rate = {error_rate:.1f}%")
if error_rate < 1.0:
print("Step 4: Metrics recovered. Rollback successful.")
return result
print("Step 5: Metrics not recovering. Escalate to oncall.")
mgr.notifier.escalate()
Preventing Bad Rollouts (Upstream)
The best rollback is one you never need. Prevent bad rollouts:
- Test in staging for 24 hours before promoting to production.
- Use canary rollouts (1% → 5% → 25% → 100%) instead of full rollouts.
- Set strict A/B test criteria before promoting (p < 0.05, no regressions).
- Monitor new versions closely for the first hour (shorten alert thresholds).
# Stricter alert thresholds for new versions (first 1 hour)
def get_alert_thresholds(prompt_version: str, age_minutes: int) -> dict:
"""Adjust alert thresholds based on version age."""
if age_minutes < 60:
# New version: stricter thresholds
return {
"error_rate_max": 0.5, # < 0.5%
"latency_p99_increase_pct": 10, # < 10% increase
"quality_drop_pct": 5 # < 5% drop
}
else:
# Mature version: normal thresholds
return {
"error_rate_max": 2.0, # < 2%
"latency_p99_increase_pct": 20, # < 20% increase
"quality_drop_pct": 10 # < 10% drop
}
Rollback Playbooks
Create runbooks for common failure scenarios:
# Runbook: Accuracy Dropped 20% After Prompt Update
## Severity: HIGH
### 1. Confirm the Problem
- Check dashboard: Refund approval accuracy
- Expected: >= 75%
- Actual: <= 60%? → Proceed with rollback
### 2. Identify Previous Good Version
- Current: customer-support:v2.0.0 (deployed 2026-06-01 10:00 UTC)
- Previous: customer-support:v1.9.0 (deployed 2026-05-20 09:00 UTC)
- Run baseline test on v1.9.0: if accuracy >= 75%, proceed
### 3. Execute Rollback
```bash
curl -X POST http://localhost:8000/api/rollback \
-d '{"prompt": "customer-support", "target_version": "v1.9.0"}' \
-H "Content-Type: application/json"
4. Verify Recovery
- Wait 5 minutes
- Check dashboard again
- Accuracy back above 70%? → Rollback successful
- Accuracy still low? → Escalate to @oncall
5. Post-Mortem
- What changed in v2.0.0 caused the regression?
- Did it pass A/B tests? (If yes, why didn't tests catch it?)
- How do we prevent this next time?
## Gradual Rollback (for minor regressions)
For minor regressions that don't warrant instant full rollback, use a gradual rollback (reverse of canary):
```python
class GradualRollback:
def __init__(self, registry, metrics):
self.registry = registry
self.metrics = metrics
def gradual_rollback(self, prompt_name: str, current_version: str,
target_version: str):
"""Gradually shift traffic from current to target version."""
stages = [
{"percent": 75, "duration_min": 10}, # 75% old, 25% new
{"percent": 50, "duration_min": 10}, # 50/50
{"percent": 25, "duration_min": 10}, # 25% old, 75% new
{"percent": 0, "duration_min": 0} # 100% old (complete)
]
for stage in stages:
# Set traffic split
self.registry.set_traffic_split(
prompt_name,
current_version,
percent=stage["percent"]
)
# Monitor for the duration
for minute in range(stage["duration_min"]):
metrics = self.metrics.get_recent_metrics(
prompt_name, time_window_min=1
)
if metrics["error_rate_pct"] > 2.0:
# Metrics degraded; pause and escalate
return False
time.sleep(60)
print(f"Rollback stage complete: {stage['percent']}% old")
return True
Key Takeaways
- Rollback is the insurance policy; make it automatic and fast for critical failures.
- Define rollback triggers (error rate, quality drop, latency) upfront.
- Aim for sub-5-minute rollback time; use feature flags or routing changes for speed.
- Verify rollback by checking metrics; escalate if recovery fails.
- Prevent bad rollouts with staging, canary rollouts, and strict A/B test criteria.
Frequently Asked Questions
How long should I keep old versions available for rollback?
Keep the last 2–3 versions in production immediately available (zero-latency rollback). Keep 6 months of older versions archived in case post-mortem investigation reveals a regression from months ago.
Should I automate rollback or require human approval?
Automate rollback for critical failures (error rate > 5%, cascading failures). Require human approval for minor issues (quality down 10–20%, latency up 20–50%).
Can I rollback to a version other than the immediate previous?
Yes, if that version is known-good. E.g., if v2.0.0 is broken, you can rollback to v1.8.0 instead of v1.9.0. Just verify v1.8.0 works first.
What if a rollback itself causes problems?
Rare but possible. Treat a failed rollback as a critical incident; page the on-call team immediately. Have a contact list ready to reach the original developers.
How do I coordinate rollbacks across multiple prompts?
Use a central incident coordination system (e.g., Slack channel, PagerDuty incident). If one prompt rollback is needed, check if related prompts need rollback too.
Further Reading
- Incident Response and Postmortems — Google SRE guide to incident handling.
- Feature Flags for Safe Deployments — Using feature flags to enable fast rollback.
- Chaos Engineering: Reliability Testing — Proactively test failure scenarios before they happen in production.
- AWS Lambda Gradual Deployment — Real-world deployment strategies.