
Canary Rollouts for Prompts: Gradual Safe Deployments

A canary rollout is a deployment strategy in which you gradually shift traffic from the old prompt (or old model) to a new one. You start with a small percentage (e.g., 1%), monitor key metrics, and increase the percentage only if the metrics stay healthy. If problems appear, you roll back immediately to the old version. Canary rollouts reduce risk by catching bugs early, within a limited blast radius.

The name comes from the "canary in the coal mine": miners carried canaries to detect toxic gas before it harmed people. A canary prompt is an early warning system in the same sense. If the canary fails, you learn about the danger while only a small share of traffic is exposed.

Canary vs. A/B Test

A/B tests run two versions side by side at a fixed traffic split for a fixed period, then pick a winner. Canaries also serve both versions during the rollout, but the new version's share starts small and grows in steps while you watch for problems. Canaries are faster and simpler (no complex statistical analysis), but they give you less statistical power. Use canaries when you're confident a change is good; use A/B tests when you're uncertain.

Canary Strategy: Stages

Design your canary rollout with explicit stages and success criteria:

# prompts/canary-plan.yaml
canary_id: "customer-support-v2.0.0-canary"
prompt_name: "customer-support"
new_version: "2.0.0"
old_version: "1.9.0"
created_at: "2026-06-01T10:00:00Z"

stages:
  - stage: "1"
    traffic_percentage: 1
    duration_minutes: 30
    success_criteria:
      error_rate_pct: { max: 1.0 }
      latency_p99_ms: { max: 3000 }
      customer_satisfaction: { min: 4.0 }

  - stage: "2"
    traffic_percentage: 5
    duration_minutes: 60
    success_criteria:
      error_rate_pct: { max: 1.0 }
      latency_p99_ms: { max: 3000 }
      customer_satisfaction: { min: 3.95 }

  - stage: "3"
    traffic_percentage: 25
    duration_minutes: 120
    success_criteria:
      error_rate_pct: { max: 1.5 }
      latency_p99_ms: { max: 3500 }
      customer_satisfaction: { min: 3.9 }

  - stage: "4"
    traffic_percentage: 100
    duration_minutes: 0
    success_criteria: {}  # No time limit; stay here forever

Each stage has a traffic percentage, a duration, and success criteria. Once the stage's duration has elapsed and metrics are still healthy, advance to the next stage. If metrics degrade at any point, roll back immediately.
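Before a plan like this goes live, it can help to check it for obvious mistakes. The following is a minimal sketch, not part of any particular registry: validate_canary_plan is a hypothetical helper, the plan is assumed to be already parsed into a Python dict with the same keys as the YAML, and the rules it enforces (increasing percentages, a final stage at 100%) are one reasonable reading of the stage design described here.

```python
from typing import Any


def validate_canary_plan(plan: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a canary plan (empty if none)."""
    problems: list[str] = []
    stages = plan.get("stages") or []
    if not stages:
        return ["plan has no stages"]

    previous_pct = 0
    for number, stage in enumerate(stages, start=1):
        pct = stage.get("traffic_percentage")
        if not isinstance(pct, (int, float)) or not 0 < pct <= 100:
            problems.append(f"stage {number}: traffic_percentage must be in (0, 100]")
            continue
        # Each stage should expose more traffic than the one before it
        if pct <= previous_pct:
            problems.append(f"stage {number}: traffic_percentage must increase")
        previous_pct = pct
        if stage.get("duration_minutes", 0) < 0:
            problems.append(f"stage {number}: duration_minutes must not be negative")

    # The rollout is only complete once the new version takes all traffic
    if stages[-1].get("traffic_percentage") != 100:
        problems.append("final stage must route 100% of traffic")
    return problems


# The same stages as the YAML plan, written as a dict (criteria omitted for brevity)
plan = {
    "stages": [
        {"stage": "1", "traffic_percentage": 1, "duration_minutes": 30},
        {"stage": "2", "traffic_percentage": 5, "duration_minutes": 60},
        {"stage": "3", "traffic_percentage": 25, "duration_minutes": 120},
        {"stage": "4", "traffic_percentage": 100, "duration_minutes": 0},
    ]
}
print(validate_canary_plan(plan))
```

Running a check like this in CI, before the plan is registered, keeps a typo in a percentage from becoming a production incident.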

Implementing Canary Logic

Route traffic based on canary stage:

import hashlib
from datetime import datetime, timedelta

class CanaryRouter:
    def __init__(self, registry, metrics_store):
        self.registry = registry
        self.metrics = metrics_store

    def get_prompt_for_user(self, user_id: str, prompt_name: str) -> dict:
        """
        Get the appropriate prompt version (canary or stable) for this user.
        Return: {"version": "2.0.0", "is_canary": True}
        """
        canary = self.registry.get_active_canary(prompt_name)

        if canary is None:
            # No active canary; use stable version
            version = self.registry.get_stable_version(prompt_name)
            return {"version": version, "is_canary": False}

        # Canary is active; decide if this user gets it
        current_stage = self.registry.get_canary_stage(canary["id"])
        traffic_percentage = current_stage["traffic_percentage"]

        # Deterministic: user is always routed the same way.
        # Python's built-in hash() is salted per process for strings,
        # so use a stable digest to get the same bucket on every server.
        digest = hashlib.sha256(f"{user_id}:{prompt_name}".encode("utf-8")).hexdigest()
        user_bucket = int(digest, 16) % 100

        if user_bucket < traffic_percentage:
            version = canary["new_version"]
            is_canary = True
        else:
            version = canary["old_version"]
            is_canary = False

        return {"version": version, "is_canary": is_canary}

    def check_stage_health(self, canary_id: str) -> dict:
        """
        Check if the current canary stage is healthy.
        Return: {"healthy": bool, "violations": [], "metrics": {}}
        """
        canary = self.registry.get_canary(canary_id)
        stage = self.registry.get_canary_stage(canary_id)

        # Fetch metrics for the canary version
        metrics = self.metrics.fetch_metrics(
            prompt_version=canary["new_version"],
            time_window_minutes=10  # Last 10 minutes
        )

        violations = []

        for criterion, threshold in stage["success_criteria"].items():
            actual = metrics.get(criterion)

            if actual is None:
                # No data for this metric in the window; skip it rather than
                # compare against None. A stricter policy could hold the stage.
                continue

            if "max" in threshold and actual > threshold["max"]:
                violations.append(f"{criterion}: {actual} exceeds {threshold['max']}")

            if "min" in threshold and actual < threshold["min"]:
                violations.append(f"{criterion}: {actual} below {threshold['min']}")

        return {
            "healthy": len(violations) == 0,
            "violations": violations,
            "metrics": metrics
        }

    def advance_or_rollback(self, canary_id: str):
        """
        Check stage health. If healthy, advance to next stage.
        If unhealthy, roll back to the old version.
        This method does not check the stage duration; the caller decides when to call it.
        """
        canary = self.registry.get_canary(canary_id)
        stage_idx = self.registry.get_canary_stage_index(canary_id)

        health = self.check_stage_health(canary_id)

        if health["healthy"]:
            if stage_idx < len(canary["stages"]) - 1:
                # Advance to next stage (indices are 0-based; stage names are 1-based)
                next_stage = stage_idx + 1
                self.registry.set_canary_stage(canary_id, next_stage)
                print(f"Canary {canary_id}: advanced to stage {next_stage + 1}")
            else:
                # Reached 100%; canary is complete
                self.registry.promote_canary_to_stable(canary_id)
                print(f"Canary {canary_id}: promotion complete")
        else:
            # Unhealthy; roll back
            self.registry.rollback_canary(canary_id)
            self._alert_team(f"Canary {canary_id} rolled back: {health['violations']}")
            print(f"Canary {canary_id}: rolled back due to: {health['violations']}")

    def _alert_team(self, message: str) -> None:
        """Placeholder alert hook; replace with your paging or chat integration."""
        print(f"[alert] {message}")

# Usage: check canary health periodically (e.g., every 5 minutes) and call
# advance_or_rollback when metrics are unhealthy or the stage duration has elapsed.
# `registry` and `metrics` are the registry and metrics-store objects from your system.
canary_mgr = CanaryRouter(registry, metrics)

canary_mgr.advance_or_rollback(canary_id="customer-support-v2.0.0-canary")
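Sticky routing depends on the bucket for a given user staying the same across processes and restarts. Python's built-in hash() is salted per process for strings unless PYTHONHASHSEED is fixed, so a stable digest is a safer basis. The sketch below assumes SHA-256 and 100 buckets; bucket_for and in_canary are hypothetical helper names. It also illustrates a useful property of threshold bucketing: users who were in the canary at 1% remain in it at 5%.

```python
import hashlib


def bucket_for(user_id: str, prompt_name: str) -> int:
    """Map a (user, prompt) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{user_id}:{prompt_name}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_canary(user_id: str, prompt_name: str, traffic_percentage: int) -> bool:
    """A user is in the canary when their bucket falls below the traffic percentage."""
    return bucket_for(user_id, prompt_name) < traffic_percentage


# Simulated user IDs, for illustration only
users = [f"user-{i}" for i in range(10_000)]
at_1 = {u for u in users if in_canary(u, "customer-support", 1)}
at_5 = {u for u in users if in_canary(u, "customer-support", 5)}

print(f"in canary at 1%: {len(at_1)}, at 5%: {len(at_5)}")
print(f"everyone from the 1% stage is still in the 5% stage: {at_1 <= at_5}")
```

Including the prompt name in the hashed key means the same user can land in different buckets for different prompts, so one unlucky group of users does not receive every canary.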

Metrics to Monitor During Canaries

Key metrics to track:

| Metric | Threshold | Why |
| --- | --- | --- |
| Error rate (%) | < 1.0% | Detects crashes, timeouts, exceptions |
| Latency p99 (ms) | < 3000 ms | Detects performance regressions |
| Customer satisfaction | > 3.9 (on a 5-point scale) | Detects output quality regressions |
| Token usage | < 110% of baseline | Detects cost spikes |
| Hallucination rate (%) | < 5% | Detects factual errors |
| Refund false-positive rate | < 5% | Domain-specific (customer support) |

A collector can compute the first three of these metrics from logged events:

import math
import time

class MetricsCollector:
    def __init__(self, event_store):
        self.event_store = event_store

    def compute_canary_metrics(self, prompt_version: str,
                               time_window_minutes: int = 10) -> dict:
        """Compute key metrics for a canary version over a recent time window."""
        now = time.time()
        start_time = now - (time_window_minutes * 60)

        events = self.event_store.fetch_events(
            prompt_version=prompt_version,
            start_time=start_time,
            end_time=now
        )

        # Calculate metrics, ignoring events that lack a given field
        total = len(events)
        errors = sum(1 for e in events if e.get("error"))
        satisfactions = [e["satisfaction"] for e in events if e.get("satisfaction") is not None]
        latencies = sorted(e["latency_ms"] for e in events if e.get("latency_ms") is not None)

        # Nearest-rank p99: the smallest value with at least 99% of samples at or below it
        p99 = latencies[math.ceil(0.99 * len(latencies)) - 1] if latencies else None

        return {
            # None means "no data in the window"; callers should not read it as a pass or a fail
            "error_rate_pct": (errors / total * 100) if total > 0 else None,
            "latency_p99_ms": p99,
            "customer_satisfaction": sum(satisfactions) / len(satisfactions) if satisfactions else None,
            "sample_size": total
        }
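The collector returns a sample_size, and at 1% traffic a 10-minute window may hold very few events. One way to avoid acting on noise is to treat a thin sample as inconclusive. The following is a sketch under assumptions: evaluate_stage is a hypothetical helper, and the default of 200 samples is an arbitrary illustration, not a recommended value.

```python
def evaluate_stage(metrics: dict, criteria: dict, min_samples: int = 200) -> str:
    """Return "insufficient_data", "healthy", or "unhealthy" for one stage."""
    # Too few events: neither advance nor roll back on this evidence
    if metrics.get("sample_size", 0) < min_samples:
        return "insufficient_data"

    for name, bounds in criteria.items():
        actual = metrics.get(name)
        if actual is None:
            # The metric was not measured in this window (e.g., no ratings yet)
            return "insufficient_data"
        if "max" in bounds and actual > bounds["max"]:
            return "unhealthy"
        if "min" in bounds and actual < bounds["min"]:
            return "unhealthy"
    return "healthy"


# Criteria copied from stage 1 of the plan
criteria = {
    "error_rate_pct": {"max": 1.0},
    "latency_p99_ms": {"max": 3000},
    "customer_satisfaction": {"min": 4.0},
}

thin = {"error_rate_pct": 0.0, "latency_p99_ms": 900,
        "customer_satisfaction": 4.5, "sample_size": 12}
solid = {"error_rate_pct": 0.4, "latency_p99_ms": 2100,
         "customer_satisfaction": 4.2, "sample_size": 850}
slow = {"error_rate_pct": 0.4, "latency_p99_ms": 4100,
        "customer_satisfaction": 4.2, "sample_size": 850}

for label, m in [("thin", thin), ("solid", solid), ("slow", slow)]:
    print(label, evaluate_stage(m, criteria))
```

Whether an inconclusive stage should hold, extend its duration, or page an operator is a policy choice this sketch leaves open.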

Automated Canary Monitoring

Run a background job that monitors canaries and automatically advances them or rolls them back:

import asyncio
import time

from apscheduler.schedulers.asyncio import AsyncIOScheduler

class CanaryMonitor:
    def __init__(self, canary_router, alert_channel):
        self.router = canary_router
        self.alert = alert_channel
        self.scheduler = None

    async def monitor_canaries(self):
        """Schedule a check of all active canaries every 5 minutes."""
        # Keep a reference so the scheduler stays alive after this method returns
        self.scheduler = AsyncIOScheduler()
        registry = self.router.registry

        async def check_all_canaries():
            for canary in registry.get_all_active_canaries():
                health = self.router.check_stage_health(canary["id"])

                if not health["healthy"]:
                    self.alert.send(
                        f"CANARY ALERT: {canary['id']}\n"
                        f"Version: {canary['new_version']}\n"
                        f"Violations: {', '.join(health['violations'])}\n"
                        f"Metrics: {health['metrics']}"
                    )
                    # Health is re-checked inside; an unhealthy canary is rolled back
                    self.router.advance_or_rollback(canary["id"])
                    continue

                # Healthy: advance only once the stage duration has elapsed.
                # started_at is assumed to be a Unix timestamp in seconds.
                stage = registry.get_canary_stage(canary["id"])
                started_at = registry.get_stage_start_time(canary["id"])

                if started_at + (stage["duration_minutes"] * 60) <= time.time():
                    self.router.advance_or_rollback(canary["id"])

        self.scheduler.add_job(check_all_canaries, "interval", minutes=5)
        self.scheduler.start()

# Usage: start the monitor, then keep the event loop alive so jobs can run.
# async def main():
#     await CanaryMonitor(canary_mgr, alert_channel).monitor_canaries()
#     await asyncio.Event().wait()
# asyncio.run(main())
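Each check has three possible outcomes: roll back, advance, or keep waiting. Pulling that decision into a pure function makes it easy to test without a scheduler or a registry. This is a sketch only; next_action is a hypothetical name, and stage_started_at is assumed to be a Unix timestamp in seconds.

```python
import time
from typing import Optional


def next_action(healthy: bool, stage_started_at: float,
                duration_minutes: int, now: Optional[float] = None) -> str:
    """Decide what to do with a canary stage: "rollback", "advance", or "hold"."""
    now = time.time() if now is None else now

    # Unhealthy metrics trigger a rollback regardless of elapsed time
    if not healthy:
        return "rollback"

    # Healthy and the stage has run for its full duration
    if now - stage_started_at >= duration_minutes * 60:
        return "advance"

    # Healthy, but the stage has not finished yet
    return "hold"


start = 1_000_000.0  # arbitrary timestamp for the demonstration
print(next_action(True, start, 30, now=start + 10 * 60))
print(next_action(True, start, 30, now=start + 30 * 60))
print(next_action(False, start, 30, now=start + 10 * 60))
```

Passing now explicitly keeps the function deterministic in tests; production code can omit it and fall back to the current time.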

Manual Overrides and Emergency Rollback

Allow operators to roll back a canary manually at any time:

from datetime import datetime, timezone

class EmergencyRollback:
    def __init__(self, registry, alert_channel):
        self.registry = registry
        self.alert = alert_channel

    def emergency_rollback(self, canary_id: str, reason: str, operator: str):
        """Immediately roll back a canary; requires 1 approver.

        The approval itself is not checked here; enforce it in the caller.
        """
        canary = self.registry.get_canary(canary_id)

        # Log the action with a timezone-aware UTC timestamp for auditing
        self.registry.log_rollback(
            canary_id=canary_id,
            reason=reason,
            operator=operator,
            timestamp=datetime.now(timezone.utc).isoformat()
        )

        # Revert traffic routing to old version
        self.registry.rollback_canary(canary_id)

        # Alert the team
        self.alert.send(
            f"EMERGENCY ROLLBACK: {canary_id}\n"
            f"Operator: {operator}\n"
            f"Reason: {reason}\n"
            f"Reverted to: {canary['old_version']}"
        )

Key Takeaways

  • Canary rollouts gradually shift traffic to a new prompt version, catching problems early.
  • Design stages with explicit traffic percentages, durations, and success criteria.
  • Monitor error rate, latency, and domain-specific metrics (satisfaction, false-positive rate).
  • Automate health checks; advance or roll back based on metrics.
  • Allow manual emergency rollback; log all actions for auditing.

Frequently Asked Questions

How fast should I increase traffic in stages?

Tailor to your traffic volume. High-traffic services can use: 1% (30 min) then 5% (1 hr) then 25% (2 hr) then 100%. Low-traffic services might use: 5% (4 hrs) then 50% (8 hrs) then 100%.
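The reason traffic volume matters is arithmetic: the number of requests the canary sees in a stage is roughly requests per minute, times the traffic share, times the stage duration. A small sketch follows; expected_canary_requests is a hypothetical helper, and the 1,000 and 10 requests-per-minute figures are illustrative assumptions, not measurements.

```python
def expected_canary_requests(requests_per_minute: float,
                             traffic_percentage: float,
                             duration_minutes: float) -> float:
    """Approximate number of requests routed to the canary during one stage."""
    return requests_per_minute * traffic_percentage * duration_minutes / 100


schedules = {
    # service label: (requests per minute, [(traffic %, duration in minutes), ...])
    "high-traffic": (1000, [(1, 30), (5, 60), (25, 120)]),
    "low-traffic": (10, [(5, 240), (50, 480)]),
}

for label, (rpm, stages) in schedules.items():
    for pct, minutes in stages:
        n = expected_canary_requests(rpm, pct, minutes)
        print(f"{label}: {pct}% for {minutes} min -> about {n:.0f} canary requests")
```

Under these assumptions the high-traffic service collects about 300 canary requests in its first stage, while the low-traffic service needs four hours at 5% to reach about 120, which is why its stages are both wider and longer.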

Can I run multiple canaries simultaneously?

Yes, on different prompt names. E.g., one canary for customer-support, another for content-moderator. Avoid overlapping canaries on the same prompt; they'll interfere.
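A guard against overlapping canaries can be as simple as counting active canaries per prompt name before a new one is registered. This is a sketch; prompts_with_overlapping_canaries is a hypothetical helper, and each canary is assumed to be a dict with a prompt_name key, as in the plan file.

```python
from collections import Counter


def prompts_with_overlapping_canaries(active_canaries: list[dict]) -> list[str]:
    """Return the prompt names that have more than one active canary."""
    counts = Counter(c["prompt_name"] for c in active_canaries)
    return sorted(name for name, n in counts.items() if n > 1)


# Illustrative canary records; the IDs other than the first are made up
active = [
    {"id": "customer-support-v2.0.0-canary", "prompt_name": "customer-support"},
    {"id": "content-moderator-v1.4.0-canary", "prompt_name": "content-moderator"},
    {"id": "customer-support-v2.1.0-canary", "prompt_name": "customer-support"},
]
print(prompts_with_overlapping_canaries(active))
```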

What if I want to roll back at 50% traffic?

You can roll back at any stage. Record a rollback reason (e.g., "metrics degraded at 50%") and revert all traffic to the old version.

Should I run canaries or A/B tests?

Use canaries when you're confident (e.g., minor refinement). Use A/B tests when you're uncertain. You can combine them: run a canary of a statistically validated variant.

How do I measure statistical significance during a canary?

Canaries don't give you statistical significance; they give you rapid detection of large regressions. For smaller effects, run a concurrent A/B test or wait until the canary is at 100% and gather data over time.
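To see why small canary samples only catch large regressions, compare error rates between the stable and canary versions with a two-proportion z-test. This is a sketch under assumptions: the counts are invented for illustration, two_proportion_z_test is a hypothetical helper, and the normal approximation it relies on is rough when error counts are very small.

```python
import math


def two_proportion_z_test(errors_a: int, total_a: int,
                          errors_b: int, total_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in error rates, b minus a."""
    p_a = errors_a / total_a
    p_b = errors_b / total_b
    pooled = (errors_a + errors_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0  # no errors at all (or all errors): no detectable difference
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Stable: 0.5% errors over 10,000 requests. Canary: 300 requests.
z_small, p_small = two_proportion_z_test(50, 10_000, 3, 300)    # canary at 1% errors
z_large, p_large = two_proportion_z_test(50, 10_000, 30, 300)   # canary at 10% errors

print(f"modest regression: z={z_small:.2f}, p={p_small:.3f}")
print(f"large regression:  z={z_large:.2f}, p={p_large:.3g}")
```

With only 300 canary requests, a doubling of the error rate from 0.5% to 1% is not distinguishable from noise, while a jump to 10% is unmistakable. Detecting the smaller effect requires far more canary traffic or a proper A/B test.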

Further Reading