Canary releases and A/B testing for LLM apps
Canary releases and A/B testing are deployment strategies that route a percentage of production traffic to a new model or prompt variant, observe quality metrics and user behavior, and gradually increase the percentage as confidence grows. A canary deployment might send 1% of requests to a new model, observe error rates and latency for 30 minutes, then increase to 5%, 25%, and finally 100%. A/B tests run two variants simultaneously (usually 50-50 split) and measure which performs better on metrics like user engagement, accuracy, or cost. Canary and A/B strategies reduce risk by validating changes on real users before full rollout and catching regressions that testing environments miss.
When to Use Canary vs. Blue-Green
Blue-green switches all traffic to the new version at once. Canary gradually increases the traffic percentage, catching regressions on a small subset of users before they reach everyone. A/B testing compares two variants on equal traffic and measures which is better. Use canary for model upgrades or prompt updates where you want to minimize blast radius. Use A/B testing when you need statistical evidence that one variant outperforms the other. Use blue-green for fast, high-confidence rollouts. In practice, many teams run a canary first (1% to 5%) to catch obvious bugs, then an A/B test (50-50) to measure quality, and finally shift 100% of traffic for the full rollout.
Implementing Canary Deployment
Set up traffic splitting so a percentage of requests go to the canary (new model) and the rest to the stable (current) version.
import hashlib

def get_serving_version(user_id: str, canary_percentage: float) -> str:
    """Route request to canary or stable based on canary percentage."""
    # Use user ID hash for consistent routing (same user always gets same version).
    # MD5 is used only for bucketing here, not for security.
    # Buckets run 0-99, so percentages resolve to whole-percent steps.
    hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    if hash_value < canary_percentage:
        return "canary"  # new model
    else:
        return "stable"  # current production model

def serve_inference(user_id: str, query: str, canary_percentage: float):
    """Serve inference with canary routing."""
    version = get_serving_version(user_id, canary_percentage)
    if version == "canary":
        model = "claude-3-5-sonnet-v1.3.0"  # new version (placeholder ID)
    else:
        model = "claude-3-5-sonnet-v1.2.3"  # stable version (placeholder ID)
    # Call model and log which version served the request.
    # call_model and log_request are application-specific helpers not defined here.
    response = call_model(model, query)
    log_request({
        "user_id": user_id,
        "version": version,
        "model": model,
        "latency_ms": response.latency,
        "quality_score": response.quality,
        "cost_cents": response.cost
    })
    return response

# Example: gradually increase canary percentage
canary_schedule = [
    {"start_time": "2026-06-02T10:00:00Z", "percentage": 1},
    {"start_time": "2026-06-02T10:30:00Z", "percentage": 5},
    {"start_time": "2026-06-02T11:00:00Z", "percentage": 25},
    {"start_time": "2026-06-02T12:00:00Z", "percentage": 100},
]
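A schedule like the one above still needs something to turn the current time into a percentage. The following is a minimal sketch of one way to do that; the helper names parse_utc and current_canary_percentage are illustrative and not part of the original example.

```python
from datetime import datetime, timezone

def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware UTC datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def current_canary_percentage(schedule: list[dict], now: datetime) -> float:
    """Return the percentage of the latest stage whose start_time is <= now."""
    percentage = 0.0  # before the first stage, no traffic goes to the canary
    for stage in sorted(schedule, key=lambda s: parse_utc(s["start_time"])):
        if parse_utc(stage["start_time"]) <= now:
            percentage = stage["percentage"]
        else:
            break
    return percentage

schedule = [
    {"start_time": "2026-06-02T10:00:00Z", "percentage": 1},
    {"start_time": "2026-06-02T10:30:00Z", "percentage": 5},
    {"start_time": "2026-06-02T11:00:00Z", "percentage": 25},
    {"start_time": "2026-06-02T12:00:00Z", "percentage": 100},
]

now = datetime(2026, 6, 2, 10, 45, tzinfo=timezone.utc)
print(current_canary_percentage(schedule, now))  # prints 5
```

A purely time-based schedule advances whether or not the canary is healthy, so it is usually paired with metric checks that can force the percentage back to 0.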
Canary Metrics and Rollback Triggers
Monitor metrics for the canary group and compare with the stable group. Define rollback thresholds: if canary metrics diverge significantly, roll back automatically.
from dataclasses import dataclass

@dataclass
class MetricThresholds:
    error_rate_delta: float = 0.01      # canary error rate may be up to 1 percentage point higher
    latency_p99_delta_ms: float = 200   # p99 latency may be up to 200 ms higher
    quality_score_delta: float = -0.05  # quality score may drop by up to 0.05 (absolute)
    cost_delta_percent: float = 10      # cost per request may be up to 10% higher
    min_sample_size: int = 100          # need 100 canary requests before deciding

def compare_canary_and_stable(thresholds: MetricThresholds) -> dict:
    """Compare metrics between canary and stable versions."""
    # query_metrics is an application-specific helper not defined here.
    canary_metrics = query_metrics("canary", last_n_minutes=10)
    stable_metrics = query_metrics("stable", last_n_minutes=10)

    # Skip comparison if sample size too small
    if canary_metrics["sample_size"] < thresholds.min_sample_size:
        return {"status": "insufficient_data", "should_rollback": False}

    deltas = {
        "error_rate": canary_metrics["error_rate"] - stable_metrics["error_rate"],
        "latency_p99": canary_metrics["latency_p99_ms"] - stable_metrics["latency_p99_ms"],
        "quality_score": canary_metrics["quality_score"] - stable_metrics["quality_score"],
        # Relative cost change in percent; assumes stable cost_per_request is non-zero.
        "cost_percent": (
            (canary_metrics["cost_per_request"] - stable_metrics["cost_per_request"])
            / stable_metrics["cost_per_request"] * 100
        )
    }

    # Check thresholds
    violations = []
    if deltas["error_rate"] > thresholds.error_rate_delta:
        violations.append(f"error_rate_delta: {deltas['error_rate']:.3f}")
    if deltas["latency_p99"] > thresholds.latency_p99_delta_ms:
        violations.append(f"latency_p99_delta: {deltas['latency_p99']:.0f}ms")
    if deltas["quality_score"] < thresholds.quality_score_delta:
        violations.append(f"quality_score_delta: {deltas['quality_score']:.3f}")
    if deltas["cost_percent"] > thresholds.cost_delta_percent:
        violations.append(f"cost_delta: {deltas['cost_percent']:.1f}%")

    return {
        "status": "ready_to_decide",
        "should_rollback": len(violations) > 0,
        "violations": violations,
        "deltas": deltas
    }

def execute_canary_rollback(reason: str):
    """Roll back canary by setting percentage to 0."""
    # update_canary_percentage and send_alert are application-specific helpers.
    print(f"CANARY ROLLBACK: {reason}")
    update_canary_percentage(0)
    send_alert(f"Canary rollback executed: {reason}")
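The comparison and rollback functions above still need a loop that calls them. The following is a minimal sketch of such a loop, assuming a comparison callable that returns the same dictionary shape as compare_canary_and_stable; the name monitor_canary and its parameters are illustrative. The callables are injected so the loop can be exercised without a metrics backend.

```python
import time
from typing import Callable

def monitor_canary(
    compare: Callable[[], dict],
    rollback: Callable[[str], None],
    max_checks: int = 6,
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the comparison and roll back on the first violation.

    Returns "rolled_back", "healthy", or "insufficient_data".
    """
    saw_enough_data = False
    for check in range(max_checks):
        result = compare()
        if result["should_rollback"]:
            rollback("; ".join(result.get("violations", [])))
            return "rolled_back"
        if result["status"] == "ready_to_decide":
            saw_enough_data = True
        if check < max_checks - 1:
            sleep(interval_seconds)
    # Never report "healthy" if no check had enough samples to decide.
    return "healthy" if saw_enough_data else "insufficient_data"

# Usage with stand-in callables. A real deployment might pass
# lambda: compare_canary_and_stable(thresholds) and execute_canary_rollback.
results = iter([
    {"status": "insufficient_data", "should_rollback": False},
    {"status": "ready_to_decide", "should_rollback": False, "violations": []},
    {"status": "ready_to_decide", "should_rollback": True,
     "violations": ["error_rate_delta: 0.020"]},
])
reasons = []
outcome = monitor_canary(lambda: next(results), reasons.append,
                         max_checks=3, sleep=lambda s: None)
print(outcome, reasons)  # prints: rolled_back ['error_rate_delta: 0.020']
```

Returning "insufficient_data" separately from "healthy" keeps a low-traffic canary from being promoted just because nothing triggered a rollback.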
A/B Testing Setup
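A/B tests run two variants on equal traffic (50-50 split) and measure key metrics for each group. A canary mainly checks that the new version is not worse than stable on a small slice of traffic; an A/B test gives both variants equal exposure, so the comparison does not favor either one.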
import hashlib

def assign_ab_test_group(user_id: str, test_id: str) -> str:
    """Consistently assign user to A or B group based on user_id and test_id."""
    # Hash to ensure consistent group assignment
    group_hash = hashlib.md5(f"{user_id}-{test_id}".encode()).hexdigest()
    value = int(group_hash, 16) % 100
    return "A" if value < 50 else "B"

def serve_ab_test_request(user_id: str, query: str, test_id: str):
    """Serve inference in an A/B test context."""
    group = assign_ab_test_group(user_id, test_id)
    if group == "A":
        model = "claude-3-5-sonnet-v1.2.3"  # control (baseline, placeholder ID)
        temperature = 0.5
    else:
        model = "claude-3-5-sonnet-v1.3.0"  # treatment (new version, placeholder ID)
        temperature = 0.5  # same prompt, same config
    # call_model and log_ab_test_event are application-specific helpers.
    response = call_model(model, query, temperature=temperature)

    # Log for analysis
    log_ab_test_event({
        "user_id": user_id,
        "test_id": test_id,
        "group": group,
        "model": model,
        "latency_ms": response.latency,
        "accuracy": response.accuracy,
        "user_satisfied": response.user_feedback  # thumbs up/down
    })
    return response

def analyze_ab_test(test_id: str, min_duration_hours: int = 24) -> dict:
    """Analyze A/B test results after minimum duration."""
    # query_ab_test_data is an application-specific helper.
    data = query_ab_test_data(test_id, duration_hours=min_duration_hours)

    # Compute metrics per group
    group_a = {
        "sample_size": data["A"]["count"],
        "accuracy": data["A"]["accuracy_mean"],
        "latency_p99": data["A"]["latency_p99_ms"],
        "satisfaction": data["A"]["satisfaction_rate"]
    }
    group_b = {
        "sample_size": data["B"]["count"],
        "accuracy": data["B"]["accuracy_mean"],
        "latency_p99": data["B"]["latency_p99_ms"],
        "satisfaction": data["B"]["satisfaction_rate"]
    }

    # Simplified decision rule: a difference counts only if it exceeds a fixed
    # threshold (3 points for accuracy, 5 for satisfaction). This checks effect
    # size, not statistical significance; it ignores sample size and variance.
    accuracy_diff = group_b["accuracy"] - group_a["accuracy"]
    accuracy_significant = abs(accuracy_diff) > 0.03
    satisfaction_diff = group_b["satisfaction"] - group_a["satisfaction"]
    satisfaction_significant = abs(satisfaction_diff) > 0.05

    # Promote B only when it is better on both metrics, not merely different.
    b_wins = accuracy_diff > 0.03 and satisfaction_diff > 0.05
    a_wins = accuracy_diff < -0.03 and satisfaction_diff < -0.05

    return {
        "group_a": group_a,
        "group_b": group_b,
        "accuracy_winner": "B" if group_b["accuracy"] > group_a["accuracy"] else "A",
        "accuracy_significant": accuracy_significant,
        "satisfaction_winner": "B" if group_b["satisfaction"] > group_a["satisfaction"] else "A",
        "satisfaction_significant": satisfaction_significant,
        "recommendation": (
            "Promote B" if b_wins
            else "Keep A" if a_wins
            else "Inconclusive; extend test"
        )
    }
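The fixed-threshold check above is a simplification. For a rate metric such as thumbs-up satisfaction, a two-proportion z-test is one common way to ask whether the observed difference could plausibly be noise. The sketch below uses only the standard library; the function name and the counts in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(
    successes_a: int, n_a: int, successes_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both groups share the same rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that A and B do not differ.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all successes or all failures: no evidence of a difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 700 of 1,000 satisfied in A, 750 of 1,000 in B.
z, p = two_proportion_z_test(700, 1000, 750, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A positive z means B's rate is higher. A common convention is to treat p below 0.05 as significant, but the cutoff should be chosen before the test starts. Mean scores that are not simple rates call for a different test, such as a t-test.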
Combining Canary and A/B Testing
Run a canary deployment first (1-5%) to catch obvious bugs, then promote to A/B testing (50-50) if canary succeeds. This two-layer approach minimizes blast radius and gathers statistical evidence.
# Deployment pipeline: Canary → A/B → Full Rollout
stages:
  - name: "Canary (1%)"
    duration_minutes: 15
    percentage: 1
    rollback_trigger: "error_rate > 5%"
  - name: "Canary (5%)"
    duration_minutes: 15
    percentage: 5
    rollback_trigger: "error_rate > 3%"
  - name: "A/B Test (50-50)"
    duration_hours: 24
    percentage: 50
    rollback_trigger: "accuracy_diff < -0.05"
  - name: "Full Rollout (100%)"
    percentage: 100
    duration_hours: "infinity"  # monitor forever
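The rollback_trigger strings in the pipeline above have to be evaluated by something. The sketch below parses the simple "metric operator threshold" form with a regular expression rather than eval. The trigger grammar and the function name trigger_fires are assumptions made for illustration, as is the convention that metrics are stored as fractions (an 8% error rate is 0.08).

```python
import operator
import re

_OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}
_TRIGGER = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*(%?)\s*$")

def trigger_fires(trigger: str, metrics: dict[str, float]) -> bool:
    """Evaluate a trigger such as 'error_rate > 5%' against observed metrics."""
    match = _TRIGGER.match(trigger)
    if match is None:
        raise ValueError(f"Unrecognized trigger: {trigger!r}")
    name, op, number, percent = match.groups()
    # A trailing '%' means the threshold is a percentage; convert to a fraction.
    threshold = float(number) / 100 if percent else float(number)
    return _OPS[op](metrics[name], threshold)

print(trigger_fires("error_rate > 5%", {"error_rate": 0.08}))            # True
print(trigger_fires("accuracy_diff < -0.05", {"accuracy_diff": -0.02}))  # False
```

A metric that is missing from the dictionary raises KeyError, which is deliberate here: a trigger that cannot be evaluated should not silently pass.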
Key Takeaways
- Canary releases gradually shift traffic from stable to new versions, minimizing blast radius for regressions.
- A/B tests run two variants on equal traffic and measure which performs better, requiring no assumption that new is better.
- Define rollback triggers: if canary error rate or latency diverges beyond thresholds, roll back automatically.
- Combine canary (1-5%) and A/B testing (50-50) for layered validation before full rollout.
- Use consistent hashing (based on user ID) to ensure users stay in the same variant across multiple requests.
Frequently Asked Questions
How do I avoid biasing A/B test results by showing both variants to the same user?
Assign users to groups once based on a hash of their ID. They stay in the same group for the entire test duration. This ensures each user sees only variant A or only variant B, so neither group's results are contaminated by exposure to the other variant.
What is the minimum sample size for statistical significance in an A/B test?
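There is no universal minimum: the required sample size depends on the baseline rate, the smallest effect you want to detect, and the significance level and power you choose. As a rough starting point for LLM quality metrics, aim for 1,000+ samples per group (2,000 total) and a 24-hour minimum duration so that daily usage patterns are covered. Larger sample sizes reduce noise and improve statistical confidence. Before launching the test, run a power calculation: for example, decide that you want to detect a 5% improvement with 80% power, then compute the sample size that requires.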
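The power calculation can be done with the standard normal-approximation formula for comparing two proportions. The following is a minimal sketch using only the standard library; the function name and the example rates are hypothetical, and it assumes the metric is a rate such as thumbs-up satisfaction.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(
    p_baseline: float, p_treatment: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Samples needed in each group to detect p_baseline vs. p_treatment (two-sided)."""
    effect = p_treatment - p_baseline
    if effect == 0:
        raise ValueError("p_baseline and p_treatment must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical: baseline satisfaction of 70%, hoping to detect a lift to 75%.
n = sample_size_per_group(0.70, 0.75)
print(n)
```

For these inputs the result is about 1,250 per group, in line with the 1,000+ rule of thumb. Halving the detectable lift roughly quadruples the requirement, because the effect size is squared in the denominator.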
Can I run multiple A/B tests simultaneously on the same users?
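Yes, if the tests are independent (they cover different features). Include the test ID in the hash so that each test assigns groups independently of the others, for example group = hash(user_id, test_id) % 2 computed separately for each test. A user's group in one test then says nothing about their group in another. Document test interactions so you don't accidentally run tests whose effects are correlated.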
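To make the per-test hashing concrete, the sketch below assigns 10,000 synthetic users to two tests and counts how many land in each combination of groups. The user IDs and test IDs are made up for illustration. If the assignments are independent, each of the four combinations should hold roughly a quarter of the users.

```python
import hashlib
from collections import Counter

def assign_group(user_id: str, test_id: str) -> str:
    """Assign a user to A or B for one test; the test ID is part of the hash input."""
    digest = hashlib.md5(f"{user_id}-{test_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

users = [f"user-{i}" for i in range(10_000)]
cells = Counter(
    (assign_group(u, "prompt-test"), assign_group(u, "model-test")) for u in users
)
for cell, count in sorted(cells.items()):
    print(cell, count)
```

If the test ID were left out of the hash, every user would get the same group in both tests, and the two effects could not be separated.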
Should canary percentage increase automatically or manually?
Automate if you have robust monitoring. Define thresholds (e.g., error rate, latency, quality), and increase percentage every 15 minutes if all thresholds pass. Manual control is safer for high-risk deployments; automate for high-confidence releases.
What should I do if a canary or A/B test is inconclusive (no clear winner)?
Extend the test duration (more data improves signal) or analyze by subgroup (e.g., different user segments). If still inconclusive, go with business rules: pick the lower-cost variant, or stick with the stable version and refine the new variant offline.