Blue-green deployments for AI features

Blue-green deployment is a release strategy in which you maintain two identical production environments: blue (currently serving traffic) and green (the release candidate). When you release a new model or prompt, you deploy it to the inactive environment (green), validate it, then switch all traffic to it in seconds. If something goes wrong, you switch back to blue just as quickly, with no rebuild or redeploy. Blue-green deployments minimize downtime, reduce blast radius (only your feature is affected, not other services), and enable rapid recovery. The pattern suits LLM applications because you can A/B test the two versions on live traffic before fully switching.

Architecture: Blue and Green Environments

Set up two separate deployments with identical configuration but different model versions or prompts. Both environments connect to the same database and external APIs, but are isolated at the inference layer.

┌─────────────────────────────────────────────────────────┐
│                      Load Balancer                      │
│             (routes traffic to active env)              │
└────────┬───────────────────────────────────────┬────────┘
         │                                       │
    ┌────▼─────┐                            ┌────▼──────┐
    │   BLUE   │                            │   GREEN   │
    │ (active) │                            │ (inactive)│
    └────┬─────┘                            └────┬──────┘
         │                                       │
    Model v1.2.3                            Model v1.3.0
    Prompt v4                               Prompt v4

    Instance: API Server + Model            Instance: API Server + Model
    Database: Shared                        Database: Shared

Route traffic through a load balancer or reverse proxy that directs requests to the active environment. To release a new version, deploy it to green, warm it up, validate quality, then switch the load balancer to send traffic to green. At that point the roles swap: green is the active environment, and the old blue keeps running as the standby for quick rollback. The next release deploys to whichever environment is on standby.
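
One way to keep the naming straight is to track roles rather than colors. The sketch below is illustrative only: `EnvironmentPair` and its `switch` method are hypothetical names, not part of any load balancer API.

```python
from dataclasses import dataclass


@dataclass
class EnvironmentPair:
    """Tracks which color is serving traffic and which is on standby."""
    active: str = "blue"
    standby: str = "green"

    def switch(self) -> None:
        """Swap roles: the standby starts serving, the old active becomes the rollback target."""
        self.active, self.standby = self.standby, self.active


pair = EnvironmentPair()
pair.switch()  # release: green now serves traffic, blue stays warm
after_release = (pair.active, pair.standby)
pair.switch()  # rollback is the same operation in reverse
after_rollback = (pair.active, pair.standby)
```

Because release and rollback are the same swap, the rollback path gets exercised on every release rather than only during incidents.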

Deployment Process: Five Steps

Step 1: Deploy to Inactive Environment

Deploy the new model and prompt to the green environment. This is non-disruptive; existing users are unaffected.

# Deploy to green (currently inactive)
kubectl apply -f k8s/green-deployment.yaml

# Wait for green to be ready
kubectl rollout status deployment/llm-green -n production --timeout=5m

Step 2: Run Smoke Tests

Run a quick validation suite on green to ensure the service is up and responding.

import requests

def smoke_test_green() -> bool:
    """Quick health check on the green environment."""
    checks = [
        ("health", "GET", "https://green.api.company.com/health"),
        ("ready", "GET", "https://green.api.company.com/ready"),
    ]

    for check_name, method, url in checks:
        try:
            response = requests.request(method, url, timeout=2)
        except requests.RequestException as e:
            # Connection errors and timeouts count as failures
            print(f"✗ {check_name}: {e}")
            return False

        if response.status_code != 200:
            print(f"✗ {check_name}: {response.status_code}")
            return False
        print(f"✓ {check_name}")

    return True
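
A freshly deployed environment may need a few attempts before its probes pass. The following is a minimal sketch of a bounded retry around any smoke check; `wait_until_healthy` and the `probe` callable are hypothetical names, and the attempt count and delay are placeholder values.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 1.0,
) -> bool:
    """Call `probe` up to `attempts` times, sleeping between failed attempts."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False
```

Passing the check in as a callable keeps the retry logic independent of how the probe is implemented (HTTP request, gRPC health check, and so on).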

Step 3: Validate Output Quality

Run your evaluation suite (from earlier articles) on green to confirm quality metrics meet thresholds.

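# Run eval suite against green.
# --endpoint, --model-version, and --threshold are project-specific options,
# assumed to be registered in conftest.py via pytest_addoption.
python -m pytest tests/eval_suite.py \
  --endpoint https://green.api.company.com \
  --model-version claude-3-5-sonnet-v1.3.0 \
  --threshold 0.85

# Capture results (pytest has no built-in --json flag; --junitxml is built in,
# and JSON output needs a plugin such as pytest-json-report)
pytest ... --junitxml=eval_result.xml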
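
The threshold check itself can be as small as the sketch below. It assumes the eval suite produces one score per example in the range 0 to 1 and that the gate is the mean score; `passes_quality_gate` is a hypothetical helper, and your suite may aggregate differently.

```python
from statistics import mean


def passes_quality_gate(scores: list[float], threshold: float = 0.85) -> bool:
    """True if the mean eval score meets the threshold; an empty run never passes."""
    if not scores:
        return False
    return mean(scores) >= threshold
```

Treating an empty result set as a failure guards against an eval run that silently skipped every case.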

Step 4: A/B Test with Canary Traffic

Before switching all traffic, route a small percentage of requests to green and compare metrics with blue.

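import hashlib

def get_active_environment(request_id: str, canary_percentage: float = 1.0) -> str:
    """Route request to blue or green based on canary percentage (0-100)."""
    # Use a stable hash: Python's built-in hash() is salted per process for
    # strings, so it would route the same request ID differently across workers.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    hash_val = int.from_bytes(digest[:8], "big") % 100

    if hash_val < canary_percentage:
        return "green"  # new version
    else:
        return "blue"  # current version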

# query_metrics() is assumed to be provided by your metrics backend and to
# return a dict with latency_p99_ms, accuracy, and error_rate.
def compare_ab_results_before_switch(duration_minutes: int = 5):
    """Compare metrics between blue and green after canary test."""
    # Query metrics for duration
    blue_metrics = query_metrics("blue", duration_minutes)
    green_metrics = query_metrics("green", duration_minutes)

    comparison = {
        "latency": {
            "blue_p99": blue_metrics["latency_p99_ms"],
            "green_p99": green_metrics["latency_p99_ms"],
            "delta": green_metrics["latency_p99_ms"] - blue_metrics["latency_p99_ms"],
        },
        "accuracy": {
            "blue": blue_metrics["accuracy"],
            "green": green_metrics["accuracy"],
            # Difference in percentage points
            "improvement": (green_metrics["accuracy"] - blue_metrics["accuracy"]) * 100,
        },
        "error_rate": {
            "blue": blue_metrics["error_rate"],
            "green": green_metrics["error_rate"],
        },
    }

    # Decide: safe to switch?
    safe = (
        green_metrics["latency_p99_ms"] < blue_metrics["latency_p99_ms"] + 200
        and green_metrics["error_rate"] < 0.01
        and green_metrics["accuracy"] >= blue_metrics["accuracy"] - 0.02  # small regression OK
    )

    return comparison, safe
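
Before relying on hash-based routing, it can help to confirm that the split is both sticky and roughly the size you asked for. The sketch below is one way to check that offline, using synthetic request IDs and a SHA-256 bucket; `canary_bucket` and `routes_to_green` are hypothetical names, and the 10,000-bucket granularity is an arbitrary choice.

```python
import hashlib


def canary_bucket(request_id: str, buckets: int = 10_000) -> int:
    """Map a request ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def routes_to_green(request_id: str, canary_percentage: float) -> bool:
    """True if this request falls inside the canary slice (0-100 percent)."""
    # 10,000 buckets means each bucket is 0.01% of traffic.
    return canary_bucket(request_id) < canary_percentage * 100


# Synthetic request IDs stand in for real traffic.
request_ids = [f"req-{i}" for i in range(10_000)]
green_share = sum(routes_to_green(r, 5.0) for r in request_ids) / len(request_ids)
print(f"share routed to green: {green_share:.2%}")
```

The printed share should land near 5% but will not match it exactly, since the hash only approximates a uniform split over a finite sample.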

Step 5: Switch Traffic to Green

If validation passes, update the load balancer to send all traffic to green. This is a single configuration change and takes seconds.

# Reverse proxy configuration (Nginx syntax; managed load balancers such as AWS NLB are configured differently)
upstream llm_api {
    # Before switch: send to blue. Nginx rejects weight=0, so the
    # inactive server is marked "down" instead.
    server blue-api.internal:8080;
    server green-api.internal:8080 down;

    # After switch: send to green
    # server blue-api.internal:8080 down;
    # server green-api.internal:8080;
}

server {
    listen 443 ssl;
    server_name api.company.com;
    # ssl_certificate and ssl_certificate_key directives omitted

    location / {
        proxy_pass http://llm_api;
        proxy_set_header X-Forwarded-For $remote_addr;
        add_header X-Deployment $upstream_addr always;  # report which upstream served the request
    }
}

On Kubernetes, the equivalent switch is a patch to the Service selector:

# Switch traffic to green
kubectl patch service llm-api -n production -p \
  '{"spec":{"selector":{"deployment":"green"}}}'

# Verify traffic is flowing to green (use the same label the Service selects on)
kubectl logs -l deployment=green -n production --tail=20 | grep "request_id"
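
If the switch is driven from a deployment script rather than typed by hand, the patch can be built and validated in code first. This is a sketch under the assumptions used in this article (a Service named `llm-api` in the `production` namespace, selecting on a `deployment` label); `build_selector_patch` and `switch_service` are hypothetical helpers.

```python
import json
import subprocess


def build_selector_patch(target: str) -> str:
    """Return the JSON patch body that points the Service at `target`."""
    if target not in {"blue", "green"}:
        raise ValueError(f"unknown environment: {target!r}")
    return json.dumps({"spec": {"selector": {"deployment": target}}})


def switch_service(target: str, service: str = "llm-api", namespace: str = "production") -> None:
    """Apply the patch with kubectl; raises CalledProcessError if kubectl fails."""
    cmd = ["kubectl", "patch", "service", service, "-n", namespace,
           "-p", build_selector_patch(target)]
    subprocess.run(cmd, check=True)
```

Rejecting unknown targets before calling kubectl avoids patching the selector to a label that matches no pods, which would drop all traffic.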

Rollback: Fast Recovery

If metrics degrade after switching, roll back in seconds by reverting the load balancer configuration.

# Rollback: switch traffic back to blue
kubectl patch service llm-api -n production -p \
  '{"spec":{"selector":{"deployment":"blue"}}}'

# Verify traffic returned to blue (use the same label the Service selects on)
sleep 5
kubectl logs -l deployment=blue -n production --tail=20 | grep "request_id"

# Scale down green (it's now inactive)
kubectl scale deployment llm-green --replicas=0 -n production

Automated Rollback on Metric Threshold Breach

Configure automatic rollback if key metrics degrade unexpectedly.

import time

# get_deployment_metrics() is assumed to be provided by your monitoring stack
# and to return a dict with error_rate, latency_p99_ms, and quality_score.

class AutomaticRollback:
    def __init__(self, check_interval_seconds: int = 30, active_deployment: str = "green"):
        self.check_interval = check_interval_seconds
        # The environment that just received traffic (green after a switch)
        self.active_deployment = active_deployment
        self.monitoring = True

    def monitor_and_rollback(self):
        """Monitor active deployment; roll back if metrics fail."""
        while self.monitoring:
            metrics = get_deployment_metrics(self.active_deployment)

            if self._should_rollback(metrics):
                print("ROLLBACK: metrics degraded, reverting to previous version")
                self._execute_rollback()
                break

            time.sleep(self.check_interval)

    def _should_rollback(self, metrics: dict) -> bool:
        """Determine if metrics warrant a rollback."""
        checks = [
            metrics.get("error_rate", 0) > 0.05,        # 5% error rate
            metrics.get("latency_p99_ms", 0) > 5000,    # 5s p99 latency
            metrics.get("quality_score", 1.0) < 0.70,   # quality dropped
        ]
        return any(checks)

    def _execute_rollback(self):
        """Execute rollback to previous deployment."""
        failed = self.active_deployment
        previous = "blue" if failed == "green" else "green"
        print(f"Rolling back from {failed} to {previous}")

        # Switch load balancer
        self._switch_traffic(previous)

        # Scale down failed deployment
        self._scale_deployment(failed, 0)

        self.active_deployment = previous

        # Alert ops team (use the saved name; active_deployment has already changed)
        self._send_alert(f"Auto-rollback executed: {failed} -> {previous}")

    def _switch_traffic(self, target: str):
        """Switch load balancer to target deployment."""
        # Placeholder: call your load balancer or Kubernetes API here
        pass

    def _scale_deployment(self, name: str, replicas: int):
        # Placeholder: call your orchestrator here
        pass

    def _send_alert(self, message: str):
        # Placeholder: send to your paging or chat system
        print(f"ALERT: {message}")
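
An alert is more useful when it names the metric that tripped. The sketch below applies the same three thresholds as the rollback check but returns which ones were breached; `breached_thresholds` is a hypothetical helper, and the sample metric values are made up.

```python
def breached_thresholds(metrics: dict) -> list[str]:
    """Return the names of metrics that cross the rollback thresholds."""
    breaches = []
    if metrics.get("error_rate", 0) > 0.05:       # 5% error rate
        breaches.append("error_rate")
    if metrics.get("latency_p99_ms", 0) > 5000:   # 5s p99 latency
        breaches.append("latency_p99_ms")
    if metrics.get("quality_score", 1.0) < 0.70:  # quality dropped
        breaches.append("quality_score")
    return breaches


# Made-up samples for illustration
healthy = {"error_rate": 0.01, "latency_p99_ms": 1800, "quality_score": 0.91}
degraded = {"error_rate": 0.08, "latency_p99_ms": 1800, "quality_score": 0.64}
```

A non-empty list means a rollback is warranted, and the list itself can go straight into the alert message.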

Key Takeaways

  • Blue-green deployment maintains two identical production environments and switches traffic between them in seconds.
  • Deployment steps: deploy to inactive (green), smoke test, validate quality, A/B test with canary traffic, then switch.
  • Rollback is instant by reverting the load balancer configuration; no rebuilding or redeployment needed.
  • Monitor key metrics after switching and automatically roll back if error rate, latency, or quality thresholds are breached.
  • Use A/B testing before switching to compare new and old versions on live traffic, reducing deployment risk.

Frequently Asked Questions

Can I run more than two environments (blue, green, red)?

Yes. You can keep a third environment, for example blue for production, green as the next candidate, and red as an emergency fallback. Setups that run 3-5 instances and shift traffic gradually between them are closer to canary releases than to blue-green. Two environments are simpler and sufficient for most teams.

What if my new version has a bug that only appears under load?

Smoke tests will not catch these, and a small canary catches only some of them. Shift traffic gradually: 1% to green, then 5%, 25%, 50%, and 100%, monitoring closely at each step. If issues emerge at 5%, only 5% of requests were exposed. Automated rollback triggers help here.
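
A staged ramp can be sketched as a loop over percentages with a health check after each step. Everything here is illustrative: `ramp_traffic` is a hypothetical helper, `set_green_percentage` stands in for whatever updates your router, and `is_healthy` is assumed to block for the observation window before returning.

```python
from typing import Callable, Sequence


def ramp_traffic(
    set_green_percentage: Callable[[int], None],
    is_healthy: Callable[[], bool],
    stages: Sequence[int] = (1, 5, 25, 50, 100),
) -> bool:
    """Step through the stages; return all traffic to blue on the first unhealthy check."""
    for percentage in stages:
        set_green_percentage(percentage)
        if not is_healthy():
            set_green_percentage(0)  # roll back
            return False
    return True


# Example with stand-in callbacks: health degrades once green serves 25%.
applied: list[int] = []
completed = ramp_traffic(applied.append, lambda: applied[-1] < 25)
```

In this example the ramp stops at the 25% stage and resets green to 0%, so `applied` records the stages that were attempted plus the rollback.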

How do I minimize the warm-up time for green before switching?

Pre-warm green by sending a portion of traffic during the A/B test phase (5-10 minutes before full switch). This loads model weights into cache, stabilizes response times, and triggers any initialization code. Measure p99 latency before and after warm-up to confirm readiness.
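
The readiness check on p99 latency might look like the sketch below. It assumes you can collect a list of recent request latencies in milliseconds; `p99`, `is_warm`, the 1000 ms budget in the usage note, and the sample data are all made up for illustration.

```python
from statistics import quantiles


def p99(latencies_ms: list[float]) -> float:
    """99th percentile of the samples (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    return quantiles(latencies_ms, n=100)[98]


def is_warm(latencies_ms: list[float], p99_budget_ms: float) -> bool:
    """True if the observed p99 is within the latency budget."""
    return p99(latencies_ms) <= p99_budget_ms


# Made-up samples: a handful of slow requests before warm-up, steady state after.
cold = [200.0] * 95 + [4000.0] * 5
warm = [200.0] * 100
```

With a 1000 ms budget, the `cold` sample fails the check and the `warm` sample passes, which is the signal to proceed with the full switch.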

What if switching traffic is not instantaneous (requests in flight)?

Connection draining (graceful shutdown) handles in-flight requests. When switching load balancer rules, allow existing connections to finish (up to a timeout, e.g., 30 seconds) before tearing down the old environment. This minimizes dropped requests.
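
Most load balancers implement draining for you, but the logic is simple enough to sketch. The example assumes you can read the old environment's in-flight request count through some callable; `drain`, `in_flight`, and the polling interval are hypothetical.

```python
import time
from typing import Callable


def drain(
    in_flight: Callable[[], int],
    timeout_seconds: float = 30.0,
    poll_seconds: float = 0.5,
) -> bool:
    """Wait until no requests are in flight or the timeout expires.

    Returns True if the environment drained cleanly, False if requests
    were still running when the timeout was reached.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if in_flight() == 0:
            return True
        time.sleep(poll_seconds)
    return in_flight() == 0
```

A False result means tearing down the old environment now would cut off live requests, so the caller can decide whether to extend the timeout or accept the drops.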

Can I use blue-green with serverless (Lambda, Cloud Functions)?

Yes. Deploy new code to a new function version, route a percentage of traffic via alias/traffic shift, then promote the alias to the new version. Rollback by reverting the alias. Serverless blue-green is similar but uses platform-specific traffic shifting instead of load balancer rules.
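
As one concrete case, AWS Lambda aliases support weighted routing between two versions. The sketch below only builds the arguments for Lambda's `update_alias` call, so it runs without AWS credentials; the helper names, the function name `llm-handler`, and the alias `live` are made up, and other platforms use different mechanisms.

```python
def canary_alias_update(function_name: str, alias: str, stable_version: str,
                        new_version: str, new_weight: float) -> dict:
    """Arguments that keep the alias on the stable version and send
    `new_weight` (0.0-1.0) of traffic to the new version."""
    if not 0.0 <= new_weight <= 1.0:
        raise ValueError("new_weight must be between 0.0 and 1.0")
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": stable_version,
        "RoutingConfig": {"AdditionalVersionWeights": {new_version: new_weight}},
    }


def promote_alias_update(function_name: str, alias: str, new_version: str) -> dict:
    """Arguments that point the alias fully at the new version and clear the split."""
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": new_version,
        "RoutingConfig": {"AdditionalVersionWeights": {}},
    }


canary_args = canary_alias_update("llm-handler", "live", "1", "2", 0.1)
promote_args = promote_alias_update("llm-handler", "live", "2")
# With boto3 installed and credentials configured, these would be applied as
#   boto3.client("lambda").update_alias(**canary_args)
```

Rolling back is a third call of the same shape: point the alias at the stable version again with an empty weight map.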

Further Reading