Regression Detection: Prevent Model Rollback Scenarios
Regression detection is the practice of automatically catching quality degradation after a deploy (new model, new prompt, new API client, etc.) and triggering rollback or investigation before widespread user impact. Unlike general monitoring (which detects issues anytime), regression detection specifically compares versions and prevents the introduction of new problems. This article covers A/B testing in production, canary deployments, rollback automation, and quality gates in CI/CD pipelines.
Why regressions are especially dangerous in LLMs
A regression in an LLM system occurs when a new version performs worse on average than the version it replaces, even if it does better on some cases. Regressions can be:
- Prompt regressions: A tweaked system prompt inadvertently breaks certain use cases.
- Model regressions: A fine-tuned model performs worse on out-of-distribution queries.
- Infrastructure regressions: An API client change introduces latency or timeout issues.
Unlike regressions in traditional software, LLM regressions are insidious: the code may be syntactically correct and the tests may pass, yet the generated outputs are subjectively worse. Without systematic regression detection, bad changes slip into production and degrade user experience until a user complains or someone reviews the dashboards.
A/B testing and canary deployments
Canary deployment: Deploy the new version to a small percentage of traffic (5-10%), measure quality, then gradually increase to 100% if metrics look good. Revert if metrics degrade.
A/B test: Run both versions in parallel on the same traffic, randomly assigning each user to the old or the new version. Measure the quality gap between the two groups. Only move forward if the new version is better (or at least not worse).
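The random assignment described above is often done with a deterministic hash of the user ID, so that a given user sees the same version on every request. The following is a minimal sketch of that idea, not a specific router's API: `assign_version`, the salt string, and the `user-N` ID format are hypothetical names chosen for illustration.

```python
import hashlib


def assign_version(user_id: str, new_version_pct: float, salt: str = "exp-1") -> str:
    """Return "new" for roughly new_version_pct percent of users, else "old".

    Hashing (salt + user_id) keeps each user's assignment stable across
    requests; changing the salt reshuffles users for the next experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a bucket in [0, 10000).
    bucket = int(digest[:8], 16) % 10_000
    return "new" if bucket < new_version_pct * 100 else "old"


# Hypothetical user IDs; a 50/50 split is an A/B test, a 5/95 split is a canary.
assignments = [assign_version(f"user-{i}", new_version_pct=50) for i in range(10_000)]
new_share = assignments.count("new") / len(assignments)
print(f"Share routed to new version: {new_share:.1%}")
```

The same function covers both patterns: only the percentage differs between a canary and an A/B test.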
# Canary deployment control
import time

import numpy as np
from scipy.stats import ttest_ind


class CanaryDeployer:
    def __init__(self, version_router, evaluator, metrics_db):
        self.version_router = version_router
        self.evaluator = evaluator
        self.metrics_db = metrics_db

    def run_canary(self, old_version, new_version, canary_percentage=5, duration_minutes=60):
        """
        Deploy new version to canary_percentage of traffic for duration_minutes.
        Measure quality and decide whether to proceed or roll back.
        """
        # Route canary_percentage of traffic to new_version
        self.version_router.set_split({
            old_version: 100 - canary_percentage,
            new_version: canary_percentage
        })

        # Collect metrics for the duration
        end_time = time.time() + duration_minutes * 60
        old_version_metrics = []
        new_version_metrics = []
        while time.time() < end_time:
            # Poll once per one-minute query window so no result is counted twice
            time.sleep(60)
            recent_results = self.metrics_db.query_recent(minutes=1)
            for result in recent_results:
                if result["version"] == old_version:
                    old_version_metrics.append(result["eval_scores"]["overall"])
                elif result["version"] == new_version:
                    new_version_metrics.append(result["eval_scores"]["overall"])

        # Analyze results. With fewer than two samples per arm there is nothing
        # to compare, so the canary is treated as failed rather than guessed at.
        enough_data = len(old_version_metrics) >= 2 and len(new_version_metrics) >= 2
        old_mean = float(np.mean(old_version_metrics)) if old_version_metrics else float("nan")
        new_mean = float(np.mean(new_version_metrics)) if new_version_metrics else float("nan")

        # Statistical test (Welch's t-test, since the two arms differ in size)
        p_value = float("nan")
        if enough_data:
            _, p_value = ttest_ind(old_version_metrics, new_version_metrics, equal_var=False)

        # Decision: roll back if the new version is significantly worse
        regression_threshold = 0.01  # 1% relative regression threshold
        relative_drop = (old_mean - new_mean) / (old_mean + 1e-8)
        regression_detected = (not enough_data) or bool(
            relative_drop > regression_threshold and p_value < 0.05
        )

        result = {
            "old_version": old_version,
            "new_version": new_version,
            "old_mean_score": old_mean,
            "new_mean_score": new_mean,
            "score_delta": new_mean - old_mean,
            "p_value": float(p_value),
            "insufficient_data": not enough_data,
            "regression_detected": regression_detected,
            "recommendation": "ROLLBACK" if regression_detected else "PROCEED"
        }

        if regression_detected:
            # Roll back to old version
            self.version_router.set_split({
                old_version: 100,
                new_version: 0
            })
            result["action"] = "ROLLED_BACK"
        else:
            # The caller is expected to gradually increase the canary to 100%
            result["action"] = "PROCEEDING_TO_FULL_ROLLOUT"
        return result


# Example (assumes version_router, evaluator, and metrics_db already exist)
deployer = CanaryDeployer(version_router, evaluator, metrics_db)
canary_result = deployer.run_canary(
    old_version="v2.1.0",
    new_version="v2.2.0-rc1",
    canary_percentage=5,
    duration_minutes=30
)
print(f"Canary result: {canary_result}")
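The canary description calls for gradually increasing traffic to 100% once metrics look good. One way to sketch that ramp is a fixed list of stages with a health check between them. This is an illustration under assumptions: `StubRouter`, `ramp_up`, the stage percentages, and the `check_healthy` callback are all hypothetical, and in practice the callback would wrap a real canary evaluation.

```python
from typing import Callable, Dict, List, Sequence


class StubRouter:
    """Hypothetical stand-in for a traffic router; records each split it is given."""

    def __init__(self) -> None:
        self.history: List[Dict[str, int]] = []

    def set_split(self, split: Dict[str, int]) -> None:
        self.history.append(dict(split))


def ramp_up(router, old_version: str, new_version: str,
            check_healthy: Callable[[int], bool],
            stages: Sequence[int] = (5, 25, 50, 100)) -> str:
    """Step traffic through `stages`; revert to the old version on the first failed check."""
    for pct in stages:
        router.set_split({old_version: 100 - pct, new_version: pct})
        # check_healthy is an assumed callback, e.g. a wrapper around a canary evaluation.
        if not check_healthy(pct):
            router.set_split({old_version: 100, new_version: 0})
            return "ROLLED_BACK"
    return "FULLY_ROLLED_OUT"


router = StubRouter()
# Simulated health check that fails once the new version reaches 50% of traffic.
outcome = ramp_up(router, "v2.1.0", "v2.2.0-rc1", check_healthy=lambda pct: pct < 50)
print(outcome, router.history[-1])
```

Keeping the stages as data makes it easy to use a slower ramp for riskier changes.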
Regression testing in CI/CD
Before deploying, run a battery of regression tests in CI. These compare the candidate version against the baseline (current production) on a fixed test set.
# CI/CD regression testing
import sys

import numpy as np
from scipy.stats import wilcoxon


class RegressionTestSuite:
    def __init__(self, baseline_model, test_set):
        self.baseline_model = baseline_model
        self.test_set = test_set

    def run_regression_tests(self, candidate_model, regression_threshold=0.02):
        """
        Run candidate version against baseline on test set.
        Return pass/fail based on regression threshold.
        """
        baseline_scores = []
        candidate_scores = []
        for prompt in self.test_set:
            # Run baseline
            baseline_output = self.baseline_model.generate(prompt)
            baseline_score = self.evaluate_output(prompt, baseline_output)
            baseline_scores.append(baseline_score)
            # Run candidate
            candidate_output = candidate_model.generate(prompt)
            candidate_score = self.evaluate_output(prompt, candidate_output)
            candidate_scores.append(candidate_score)

        # Compute statistics
        baseline_mean = float(np.mean(baseline_scores))
        candidate_mean = float(np.mean(candidate_scores))
        regression_magnitude = (baseline_mean - candidate_mean) / (baseline_mean + 1e-8)

        # Statistical test (paired, because both models answer the same prompts).
        # The Wilcoxon test is undefined when every paired difference is zero
        # (some SciPy versions raise), so identical scores count as "no difference".
        if np.allclose(baseline_scores, candidate_scores):
            p_value = 1.0
        else:
            _, p_value = wilcoxon(baseline_scores, candidate_scores)
            p_value = float(p_value)

        # Pass/fail decision: fail only if the drop exceeds the threshold
        # and the difference is statistically significant
        test_passed = bool(regression_magnitude < regression_threshold or p_value > 0.05)

        report = {
            "baseline_mean": baseline_mean,
            "candidate_mean": candidate_mean,
            "regression_magnitude": regression_magnitude,
            "p_value": p_value,
            "threshold": regression_threshold,
            "test_passed": test_passed,
            "recommendation": "APPROVE" if test_passed else "REJECT"
        }
        if not test_passed:
            report["reasons"] = [
                f"Regression of {regression_magnitude:.1%} exceeds threshold "
                f"{regression_threshold:.1%} (p={p_value:.3f})"
            ]
        return report

    def evaluate_output(self, prompt, output):
        """
        Placeholder for evaluating quality. Replace with your evaluator.
        """
        return 0.8  # Dummy score


# Example (assumes baseline_model and candidate_model expose .generate(prompt))
test_set = ["What is AI?", "Explain ML", "How does NLP work?"]
test_suite = RegressionTestSuite(baseline_model, test_set)
report = test_suite.run_regression_tests(candidate_model)
print(f"Regression test report: {report}")

# Use in CI/CD: a non-zero exit code fails the pipeline step
if not report["test_passed"]:
    print("FAILED: Candidate version introduced regression. Blocking deployment.")
    sys.exit(1)
Per-cohort regression detection
Different cohorts (user tiers, domains, languages) may be affected differently by a change. Detect regressions per cohort:
# Per-cohort regression analysis
import numpy as np


class CohortRegressionAnalyzer:
    def __init__(self, baseline_model, test_set_by_cohort):
        self.baseline_model = baseline_model
        self.test_set_by_cohort = test_set_by_cohort

    def evaluate(self, prompt, output):
        """
        Placeholder for evaluating quality. Replace with your evaluator.
        """
        return 0.8  # Dummy score

    def detect_regressions_per_cohort(self, candidate_model, regression_threshold=0.02):
        """
        Test candidate version separately for each cohort.
        Identify if regression is cohort-specific or global.
        """
        cohort_results = {}
        for cohort, test_set in self.test_set_by_cohort.items():
            baseline_scores = []
            candidate_scores = []
            for prompt in test_set:
                baseline_output = self.baseline_model.generate(prompt)
                baseline_score = self.evaluate(prompt, baseline_output)
                baseline_scores.append(baseline_score)
                candidate_output = candidate_model.generate(prompt)
                candidate_score = self.evaluate(prompt, candidate_output)
                candidate_scores.append(candidate_score)

            baseline_mean = float(np.mean(baseline_scores))
            candidate_mean = float(np.mean(candidate_scores))
            regression = (baseline_mean - candidate_mean) / (baseline_mean + 1e-8)
            cohort_results[cohort] = {
                "baseline_mean": baseline_mean,
                "candidate_mean": candidate_mean,
                "regression": regression,
                "failed": regression > regression_threshold
            }

        # A regression is global only if every cohort failed; otherwise it is
        # specific to the cohorts listed in failed_cohorts.
        failed_cohorts = [c for c, r in cohort_results.items() if r["failed"]]
        global_regression = bool(cohort_results) and len(failed_cohorts) == len(cohort_results)
        return {
            "cohort_results": cohort_results,
            "failed_cohorts": failed_cohorts,
            "global_regression": global_regression,
            # Any failing cohort blocks the deploy, as before
            "recommendation": "REJECT" if failed_cohorts else "APPROVE"
        }
Automated rollback triggers
Define automated rules for when to roll back without waiting for a human decision:
# Automated rollback trigger
from datetime import datetime, timezone


class AutomaticRollbackPolicy:
    def __init__(self, version_router):
        self.version_router = version_router

    def check_rollback_criteria(self, metrics):
        """
        metrics: dict of current production metrics
        Returns whether to roll back and the reasons.
        """
        reasons_to_rollback = []

        # Criterion 1: Quality drop exceeds threshold (in percent)
        if metrics.get("quality_drop_pct", 0) > 5:
            reasons_to_rollback.append(f"Quality dropped {metrics['quality_drop_pct']:.1f}%")

        # Criterion 2: Error rate spike (as a fraction of requests)
        if metrics.get("error_rate", 0) > 0.05:
            reasons_to_rollback.append(f"Error rate at {metrics['error_rate']:.1%}")

        # Criterion 3: Latency regression
        if metrics.get("latency_p95_ms", 0) > 1000:
            reasons_to_rollback.append(f"P95 latency at {metrics['latency_p95_ms']}ms")

        # Criterion 4: Safety regression
        if metrics.get("safety_violation_rate", 0) > 0.001:
            reasons_to_rollback.append(f"Safety violations at {metrics['safety_violation_rate']:.2%}")

        should_rollback = len(reasons_to_rollback) > 0
        return {
            "should_rollback": should_rollback,
            "reasons": reasons_to_rollback,
            "triggered_at": datetime.now(timezone.utc).isoformat()
        }

    def execute_rollback(self, current_version, previous_version):
        """
        Roll back to previous version and alert on-call.
        """
        self.version_router.set_split({
            previous_version: 100,
            current_version: 0
        })
        # Alert on-call
        # alert_service.page_oncall(f"Automatic rollback: {current_version} -> {previous_version}")
        return {
            "action": "ROLLED_BACK",
            "from_version": current_version,
            "to_version": previous_version,
            "timestamp": datetime.now(timezone.utc).isoformat()
        }


# Example (assumes version_router already exists)
policy = AutomaticRollbackPolicy(version_router)
decision = policy.check_rollback_criteria({
    "quality_drop_pct": 8.5,
    "error_rate": 0.02,
    "latency_p95_ms": 450
})
if decision["should_rollback"]:
    print(f"Triggering rollback: {decision['reasons']}")
    result = policy.execute_rollback("v2.2.0", "v2.1.0")
Regression prevention: pre-deploy validation
Catch regressions before production by running comprehensive tests in CI:
# Comprehensive pre-deploy validation pipeline
import sys


class PreDeployValidation:
    def __init__(self, evaluator, test_set):
        self.evaluator = evaluator
        self.test_set = test_set

    def validate_before_deploy(self, candidate_model, old_version):
        """
        Run full validation suite before allowing deploy.
        """
        checks = {
            "unit_tests": self.run_unit_tests(candidate_model),
            "regression_tests": self.run_regression_tests(candidate_model, old_version),
            "edge_case_tests": self.run_edge_case_tests(candidate_model),
            "safety_tests": self.run_safety_tests(candidate_model),
            "performance_tests": self.run_performance_tests(candidate_model)
        }

        # Aggregate results: a check without a "passed" key counts as failed
        all_passed = all(check.get("passed", False) for check in checks.values())
        report = {
            "candidate_version": candidate_model.version,
            "old_version": old_version,
            "all_checks_passed": all_passed,
            "checks": checks,
            "recommendation": "DEPLOY" if all_passed else "REJECT",
            "blockers": [k for k, v in checks.items() if not v.get("passed", False)]
        }
        return report

    def run_unit_tests(self, model):
        # Run unit tests (placeholder)
        return {"passed": True, "count": 42}

    def run_regression_tests(self, model, old_version):
        # Run regression tests (placeholder)
        return {"passed": True, "regression_detected": False}

    def run_edge_case_tests(self, model):
        # Test on edge cases and adversarial examples (placeholder)
        return {"passed": True, "edge_cases_handled": 99}

    def run_safety_tests(self, model):
        # Test safety and compliance (placeholder)
        return {"passed": True, "violations": 0}

    def run_performance_tests(self, model):
        # Test latency and throughput (placeholder)
        return {"passed": True, "p95_latency_ms": 250}


# Example (assumes evaluator, test_set, and candidate_model already exist,
# and that candidate_model has a .version attribute)
validator = PreDeployValidation(evaluator, test_set)
validation_report = validator.validate_before_deploy(candidate_model, "v2.1.0")
if not validation_report["all_checks_passed"]:
    print(f"Deployment blocked: {validation_report['blockers']}")
    sys.exit(1)
else:
    print("All checks passed. Proceeding with deployment.")
Key Takeaways
- Regression detection prevents new versions from degrading production quality.
- Use canary deployments (5-10% of traffic) and A/B tests (parallel versions) to measure quality delta.
- Run comprehensive regression tests in CI before allowing deploys.
- Detect regressions per cohort (user tier, domain, language) to identify cohort-specific issues.
- Automate rollbacks based on thresholds (quality, error rate, latency, safety).
Frequently Asked Questions
What's a reasonable regression threshold for deploying?
A 1-3% average quality drop is often acceptable if the change is important. Safety metrics (toxicity, refusals) require tighter thresholds (< 0.5% regression).
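Because the tolerance differs by metric, a gate can keep one threshold per metric rather than a single global number. The sketch below assumes every metric is a score where higher is better and the baseline is positive. The metric names, the example scores, and the exact values (2% for quality, 0.5% for safety) are illustrative choices within the ranges above, and `failing_metrics` is a hypothetical helper.

```python
from typing import Dict, List

# Maximum tolerated relative regression per metric (higher score = better).
# Values are illustrative: 2% sits in the 1-3% range, safety uses the tighter 0.5%.
THRESHOLDS: Dict[str, float] = {"quality": 0.02, "safety": 0.005}


def failing_metrics(baseline: Dict[str, float], candidate: Dict[str, float],
                    thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the metrics whose relative drop exceeds their threshold."""
    failures = []
    for metric, limit in thresholds.items():
        # Assumes baseline[metric] > 0; guard against zero in real use.
        drop = (baseline[metric] - candidate[metric]) / baseline[metric]
        if drop > limit:
            failures.append(metric)
    return failures


# Made-up scores: quality drops 1.25% (tolerated), safety drops about 1% (not tolerated).
baseline = {"quality": 0.80, "safety": 0.990}
candidate = {"quality": 0.79, "safety": 0.980}
print(failing_metrics(baseline, candidate))
```

Here the quality drop stays inside its budget while a similar-sized safety drop blocks the deploy.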
Should I run canary tests in parallel with CI regression tests, or sequentially?
Run CI tests first (fast, catches obvious issues). If CI passes, proceed to canary deployment. Canary is the final validation before full rollout.
How long should a canary run before making a rollback decision?
30 minutes to 1 hour is typical for detecting issues. Longer canaries (hours/days) give more statistical power but delay feedback. Start with 30 minutes.
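The trade-off between canary length and statistical power can be estimated before the canary starts. The sketch below uses the standard normal-approximation sample-size formula for comparing two means, and it assumes both arms share the same score standard deviation. The function names and the example inputs (standard deviation 0.15, a 0.02 absolute drop, 600 requests per minute) are hypothetical, not measurements.

```python
import math
from statistics import NormalDist


def canary_samples_needed(score_std: float, min_detectable_drop: float,
                          old_to_new_ratio: float = 19.0,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Canary-arm sample size for a two-sided z-test on the difference of mean scores.

    Assumes both arms share the same score standard deviation and that the
    old arm receives old_to_new_ratio times as many requests (19 for a 5% canary).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = ((z_alpha + z_power) ** 2 * score_std ** 2
         * (1 + 1 / old_to_new_ratio) / min_detectable_drop ** 2)
    return math.ceil(n)


def canary_minutes_needed(samples: int, requests_per_minute: float,
                          canary_fraction: float = 0.05) -> float:
    """Minutes until the canary arm has collected `samples` scored requests."""
    return samples / (requests_per_minute * canary_fraction)


# Illustrative inputs, not measurements: score std 0.15, detect a 0.02 absolute drop.
n_canary = canary_samples_needed(score_std=0.15, min_detectable_drop=0.02)
minutes = canary_minutes_needed(n_canary, requests_per_minute=600)
print(f"{n_canary} canary samples, about {minutes:.0f} minutes at 600 req/min")
```

Halving the detectable drop quadruples the required samples, which is why catching small regressions needs much longer canaries or more traffic.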
Can I use past A/B test results to train a predictor of "good deploys"?
Yes. Collect features (code diff size, changed files, test coverage) and outcomes (regression detected or not). Train a classifier to predict deployment risk.
What if a regression is minor but affects many users?
Take the total impact into account: regression_magnitude * affected_user_percentage. A 1% regression affecting 50% of users is more serious than a 5% regression affecting 1% of users.
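The comparison above can be checked with a few lines of arithmetic. In this small sketch, `total_impact` is a hypothetical helper and both inputs are expressed as fractions.

```python
def total_impact(regression_magnitude: float, affected_user_fraction: float) -> float:
    """Expected quality loss averaged over all users (both inputs as fractions)."""
    return regression_magnitude * affected_user_fraction


broad_minor = total_impact(0.01, 0.50)   # 1% regression, 50% of users
narrow_major = total_impact(0.05, 0.01)  # 5% regression, 1% of users
print(f"broad/minor: {broad_minor:.4f}, narrow/major: {narrow_major:.4f}")
```

By this measure, the broad 1% regression has ten times the total impact of the narrow 5% one.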