Load Balancing Across LLM Providers: Multi-Provider Strategies
Relying on a single LLM provider is risky. OpenAI's API can degrade, Anthropic's quota can saturate, and costs can climb without warning. Multi-provider load balancing routes requests across OpenAI, Anthropic, Google Vertex AI, and local models based on cost, latency, availability, and model capability. With intelligent routing, you can raise availability from 99.9% to 99.99% (roughly 10x less downtime), reduce costs by 30–50% by sending simpler requests to cheaper models, and maintain service quality during provider incidents. This is a common pattern for LLM products that need production reliability.
Single Provider vs. Multi-Provider Architecture
| Metric | Single Provider | Multi-Provider |
|---|---|---|
| Uptime | 99.9% (8.7 hours downtime/year) | 99.99% (52 minutes downtime/year) |
| Cost per token | Fixed by one vendor's price list | Optimized: roughly 30–50% cheaper with cost-aware routing |
| Latency p99 | Bound to one provider (200–2000ms) | Fastest healthy provider (toward the low end of that range) |
| Rate limits | Hit one provider's ceiling | Aggregate across providers |
| Model diversity | One architecture | Mix of models for different strengths |
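With three independent providers (OpenAI 99.5%, Anthropic 99.8%, Google 99.9%; illustrative figures, not published SLAs), the pool is down only when all three are down at once: 1 - (0.005 × 0.002 × 0.001) = 99.999999%. Real outages are partly correlated (shared cloud regions, DNS, your own router and network), so the table uses 99.99% as a more realistic target.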
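The availability arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes provider failures are fully independent, which is optimistic; the function names are illustrative and the availability figures are the example values from the text, not measured SLAs.

```python
from math import prod


def combined_availability(availabilities):
    """Availability of a failover pool, assuming independent failures.

    The pool is down only when every provider is down at once, so the
    combined unavailability is the product of the individual ones.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


def downtime_hours_per_year(availability):
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


pool = [0.995, 0.998, 0.999]  # example figures from the text
combined = combined_availability(pool)
print(f"combined availability: {combined:.6%}")
print(f"99.9%  -> {downtime_hours_per_year(0.999):.2f} hours/year")
print(f"99.99% -> {downtime_hours_per_year(0.9999) * 60:.1f} minutes/year")
```

Under the independence assumption the product comes out far better than 99.99%. Treat it as an upper bound, since correlated outages and your own routing layer usually dominate in practice.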
Provider Abstraction Layer
Build a unified interface that hides provider-specific details:
from abc import ABC, abstractmethod
from typing import List, Dict, Optional
import asyncio


class LLMProvider(ABC):
    """Abstract base for LLM providers."""

    @abstractmethod
    async def generate(
        self,
        prompt: str,
        model: str,
        max_tokens: int = 1000,
        temperature: float = 0.7,
    ) -> Dict:
        """
        Generate text. Return: {
            "text": str,
            "tokens_input": int,
            "tokens_output": int,
            "latency_ms": float,
            "cost": float,
        }
        """
        pass

    @abstractmethod
    async def health_check(self) -> bool:
        """Return True if provider is healthy."""
        pass

    @abstractmethod
    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        """Calculate cost for this generation."""
        pass
import os
import time

import aiohttp

# Assumption: the API key is supplied via the OPENAI_API_KEY environment variable.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")


class OpenAIProvider(LLMProvider):
    """OpenAI API wrapper."""

    async def generate(self, prompt: str, model: str, **kwargs) -> Dict:
        model = model or "gpt-4"
        start = time.monotonic()
        async with aiohttp.ClientSession() as session:
            async with session.post(
                "https://api.openai.com/v1/chat/completions",
                json={
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": kwargs.get("max_tokens", 1000),
                    "temperature": kwargs.get("temperature", 0.7),
                },
                headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
                timeout=aiohttp.ClientTimeout(total=30),
            ) as resp:
                if resp.status != 200:
                    raise RuntimeError(f"OpenAI API error {resp.status}")
                data = await resp.json()
        latency_ms = (time.monotonic() - start) * 1000
        usage = data["usage"]
        return {
            "text": data["choices"][0]["message"]["content"],
            "tokens_input": usage["prompt_tokens"],
            "tokens_output": usage["completion_tokens"],
            "latency_ms": latency_ms,
            "cost": self.get_cost(usage["prompt_tokens"], usage["completion_tokens"]),
            "provider": "openai",
            "model": model,
        }

    async def health_check(self) -> bool:
        """Ping OpenAI to check if available.

        This sends a real (billed) request, so keep it tiny and call it sparingly.
        """
        try:
            await self.generate("test", "gpt-3.5-turbo", max_tokens=1)
            return True
        except Exception:
            # Any network, timeout, or API error counts as unhealthy.
            return False

    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        # GPT-4 pricing ($0.03 input / $0.06 output per 1K tokens).
        # Applied to every model here; use per-model rates in production.
        return input_tokens * 0.00003 + output_tokens * 0.00006
import os
import time

import httpx

# Assumption: the API key is supplied via the ANTHROPIC_API_KEY environment variable.
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")


class AnthropicProvider(LLMProvider):
    """Anthropic Claude API wrapper."""

    async def generate(self, prompt: str, model: str, **kwargs) -> Dict:
        # Model names follow the article; check the provider's model list for exact IDs.
        model = model or "claude-3-sonnet"
        start = time.monotonic()
        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.anthropic.com/v1/messages",
                json={
                    "model": model,
                    "max_tokens": kwargs.get("max_tokens", 1000),
                    "messages": [{"role": "user", "content": prompt}],
                },
                headers={
                    "x-api-key": ANTHROPIC_API_KEY,
                    "anthropic-version": "2023-06-01",
                },
                timeout=30,
            )
        if response.status_code != 200:
            raise RuntimeError(f"Anthropic API error {response.status_code}")
        data = response.json()
        latency_ms = (time.monotonic() - start) * 1000
        usage = data["usage"]
        return {
            "text": data["content"][0]["text"],
            "tokens_input": usage["input_tokens"],
            "tokens_output": usage["output_tokens"],
            "latency_ms": latency_ms,
            "cost": self.get_cost(usage["input_tokens"], usage["output_tokens"]),
            "provider": "anthropic",
            "model": model,
        }

    async def health_check(self) -> bool:
        """Send a tiny (billed) request to confirm the API responds."""
        try:
            await self.generate("test", "claude-3-sonnet", max_tokens=1)
            return True
        except Exception:
            # Any network, timeout, or API error counts as unhealthy.
            return False

    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        # Claude 3 Sonnet pricing ($0.003 input / $0.015 output per 1K tokens).
        return input_tokens * 0.000003 + output_tokens * 0.000015
Router: Cost-Aware and Latency-Aware Load Balancing
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProviderStats:
    """Track provider performance."""
    name: str
    success_count: int = 0
    failure_count: int = 0
    total_latency_ms: float = 0
    total_cost: float = 0

    @property
    def success_rate(self) -> float:
        total = self.success_count + self.failure_count
        return self.success_count / total if total > 0 else 0

    @property
    def avg_latency_ms(self) -> float:
        return self.total_latency_ms / max(1, self.success_count)

    @property
    def avg_cost(self) -> float:
        return self.total_cost / max(1, self.success_count)
class LLMRouter:
    """
    Route requests across multiple providers.
    Strategy: try the cheapest observed provider first, then fall back,
    enforcing optional cost and latency limits on each response.
    """

    def __init__(self, providers: List[LLMProvider]):
        self.providers = providers
        self.stats = {
            p.__class__.__name__: ProviderStats(p.__class__.__name__)
            for p in providers
        }

    async def generate_with_fallback(
        self,
        prompt: str,
        model_map: Dict[str, str],  # {provider_name: model_id}
        max_cost: Optional[float] = None,
        max_latency_ms: Optional[float] = None,
    ) -> Dict:
        """
        Try providers in order of observed cost.
        Fall back to the next provider on failure.
        """
        # Sort providers by observed average cost (ascending). A provider with
        # no successful requests yet reports 0, so it is tried first.
        sorted_providers = sorted(
            self.providers,
            key=lambda p: self.stats[p.__class__.__name__].avg_cost,
        )
        errors = []
        for provider in sorted_providers:
            provider_name = provider.__class__.__name__
            # Skip providers that have no model configured for this request.
            if provider_name not in model_map:
                errors.append(f"{provider_name}: no model configured")
                continue
            # Skip if health check fails. The providers above implement
            # health_check() as a real request, so cache it in production.
            if not await provider.health_check():
                errors.append(f"{provider_name}: unhealthy")
                continue
            try:
                result = await provider.generate(prompt, model_map[provider_name])
                # Check constraints. These run after the call, so a rejected
                # response has already been paid for.
                if max_cost is not None and result["cost"] > max_cost:
                    errors.append(f"{provider_name}: cost ${result['cost']:.6f} > ${max_cost:.6f}")
                    continue
                if max_latency_ms is not None and result["latency_ms"] > max_latency_ms:
                    errors.append(f"{provider_name}: latency {result['latency_ms']:.0f}ms > {max_latency_ms:.0f}ms")
                    continue
                # Success: update stats.
                self.stats[provider_name].success_count += 1
                self.stats[provider_name].total_latency_ms += result["latency_ms"]
                self.stats[provider_name].total_cost += result["cost"]
                return result
            except Exception as e:
                self.stats[provider_name].failure_count += 1
                errors.append(f"{provider_name}: {e}")
        # All providers failed or were rejected.
        raise RuntimeError(f"All LLM providers failed: {errors}")

    def get_stats(self) -> Dict:
        """Return performance stats for monitoring."""
        return {
            name: {
                "success_rate": stats.success_rate,
                "avg_latency_ms": stats.avg_latency_ms,
                "avg_cost": stats.avg_cost,
            }
            for name, stats in self.stats.items()
        }
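# Example usage
import asyncio


async def main() -> None:
    router = LLMRouter([
        OpenAIProvider(),
        AnthropicProvider(),
    ])
    result = await router.generate_with_fallback(
        prompt="Explain quantum computing",
        model_map={
            "OpenAIProvider": "gpt-4",
            "AnthropicProvider": "claude-3-sonnet",
        },
        # At GPT-4 output rates a $0.0001 cap allows under two tokens,
        # so use a budget that a real answer can fit into.
        max_cost=0.05,        # Max $0.05 per request
        max_latency_ms=2000,  # Responses slower than 2 seconds are rejected
    )
    print(f"Response from {result['provider']}: {result['text'][:100]}")
    print(f"Cost: ${result['cost']:.6f}, Latency: {result['latency_ms']:.0f}ms")
    print(f"\nProvider stats: {router.get_stats()}")


# `await` is only valid inside a coroutine, so run the example with asyncio.run().
asyncio.run(main())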
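The router above orders providers by observed cost alone. One way to bring latency into the ordering is a weighted score over normalized metrics. The following is a sketch under stated assumptions, not part of the router itself: `rank_providers`, the weights, and the sample numbers are all hypothetical, and the input is assumed to have the same shape as the output of `get_stats()`.

```python
from typing import Dict, List


def rank_providers(
    stats: Dict[str, Dict[str, float]],
    cost_weight: float = 0.7,
    latency_weight: float = 0.3,
) -> List[str]:
    """Return provider names ordered best-first by a weighted score.

    Each metric is min-max normalized across providers so that dollars and
    milliseconds are comparable. Lower scores are better.
    """
    if not stats:
        return []

    def normalize(values: Dict[str, float]) -> Dict[str, float]:
        lo, hi = min(values.values()), max(values.values())
        span = hi - lo
        # When all providers tie on a metric, it should not affect the ranking.
        return {k: (v - lo) / span if span > 0 else 0.0 for k, v in values.items()}

    costs = normalize({n: s["avg_cost"] for n, s in stats.items()})
    latencies = normalize({n: s["avg_latency_ms"] for n, s in stats.items()})
    scores = {
        n: cost_weight * costs[n] + latency_weight * latencies[n] for n in stats
    }
    return sorted(scores, key=lambda n: scores[n])


# Hypothetical observations: one provider is cheaper, the other is faster.
observed = {
    "OpenAIProvider": {"avg_cost": 0.0300, "avg_latency_ms": 900.0},
    "AnthropicProvider": {"avg_cost": 0.0060, "avg_latency_ms": 1200.0},
}
print(rank_providers(observed))                                        # cost-heavy weights
print(rank_providers(observed, cost_weight=0.1, latency_weight=0.9))  # latency-heavy weights
```

With cost-heavy weights the cheaper provider ranks first; shifting the weights toward latency flips the order. Min-max normalization is only one option, and it is sensitive to outliers when the pool is small.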
Intelligent Routing: Capability-Based Selection
Different providers excel at different tasks. Route based on task type:
class CapabilityRouter:
    """Route by task: reasoning → GPT-4, fast → GPT-3.5, cost-sensitive → Claude Haiku."""

    def __init__(self, router: LLMRouter):
        # The underlying router performs the actual calls and fallback.
        self.router = router

    async def generate(self, prompt: str, task_type: str) -> Dict:
        """
        Route to provider by task characteristics.
        """
        if task_type == "reasoning":
            # Complex logic: use GPT-4
            return await self.router.generate_with_fallback(
                prompt,
                model_map={"OpenAIProvider": "gpt-4"},
            )
        elif task_type == "fast":
            # Speed critical: use a fast model and cap latency
            return await self.router.generate_with_fallback(
                prompt,
                model_map={"OpenAIProvider": "gpt-3.5-turbo"},
                max_latency_ms=500,
            )
        elif task_type == "summarize":
            # Cost-critical: use the cheapest models
            return await self.router.generate_with_fallback(
                prompt,
                model_map={
                    "AnthropicProvider": "claude-3-haiku",
                    "OpenAIProvider": "gpt-3.5-turbo",
                },
                max_cost=0.00001,  # Very cheap; tune to your prompt sizes
            )
        # Fail loudly instead of silently returning None.
        raise ValueError(f"Unknown task_type: {task_type!r}")
Task classification can be rule-based (prompt length, keywords) or ML-based (classifier model).
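As an illustration of the rule-based option, here is a minimal keyword-and-length classifier that emits the three task types used above ("reasoning", "fast", "summarize"). The keyword lists, the length threshold, and the `classify_task` name are assumptions made for this sketch; a real deployment would tune them against logged prompts.

```python
import re

# Hypothetical keyword hints; extend these from your own traffic.
REASONING_HINTS = re.compile(
    r"\b(prove|derive|step[- ]by[- ]step|why|debug|analy[sz]e|compare)\b",
    re.IGNORECASE,
)
SUMMARY_HINTS = re.compile(
    r"\b(summari[sz]e|tl;?dr|condense|shorten)\b",
    re.IGNORECASE,
)


def classify_task(prompt: str, long_prompt_chars: int = 4000) -> str:
    """Map a prompt to 'summarize', 'reasoning', or 'fast' using simple rules."""
    # Long inputs and explicit summary requests go to the cheap path.
    if SUMMARY_HINTS.search(prompt) or len(prompt) > long_prompt_chars:
        return "summarize"
    # Prompts that ask for analysis or derivation go to the strongest model.
    if REASONING_HINTS.search(prompt):
        return "reasoning"
    # Everything else takes the low-latency path.
    return "fast"


for example in (
    "Summarize this article in three bullet points.",
    "Prove that the square root of 2 is irrational.",
    "What is the capital of France?",
):
    print(f"{classify_task(example):10s} <- {example}")
```

Rules like these are cheap and transparent but brittle. A small classifier model can replace `classify_task` later without changing the routing code that consumes its output.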
Circuit Breaker Pattern for Provider Failure Recovery
When a provider fails repeatedly, stop sending traffic to it for a cooldown period:
import time
from enum import Enum
from typing import Optional


class CircuitState(Enum):
    CLOSED = "closed"        # Normal, send traffic
    OPEN = "open"            # Failing, skip this provider
    HALF_OPEN = "half_open"  # Cooldown elapsed, allow trial requests


class CircuitBreaker:
    """
    Circuit breaker: detect and isolate failing providers.
    """

    def __init__(self, failure_threshold: int = 5, cooldown_sec: int = 60):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None
        self.failure_threshold = failure_threshold
        self.cooldown_sec = cooldown_sec

    def record_success(self) -> None:
        """Reset on success."""
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        """Count a failure; open at the threshold, or at once if a trial request fails."""
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if (
            self.state == CircuitState.HALF_OPEN
            or self.failure_count >= self.failure_threshold
        ):
            self.state = CircuitState.OPEN

    def should_allow_request(self) -> bool:
        """Check if request should be allowed."""
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            # Check if cooldown expired; move to HALF_OPEN.
            if time.monotonic() - self.last_failure_time > self.cooldown_sec:
                self.state = CircuitState.HALF_OPEN
                self.failure_count = 0  # Reset for testing
                return True
            return False
        # HALF_OPEN: allow request to test if provider recovered.
        return True


# Wrap provider with circuit breaker.
openai_provider = OpenAIProvider()
openai_breaker = CircuitBreaker(failure_threshold=5, cooldown_sec=60)


async def call_openai_safe(prompt: str) -> Dict:
    if not openai_breaker.should_allow_request():
        raise RuntimeError("OpenAI provider circuit is OPEN. Retrying later.")
    try:
        result = await openai_provider.generate(prompt, "gpt-4")
        openai_breaker.record_success()
        return result
    except Exception:
        openai_breaker.record_failure()
        raise
Cost Tracking Across Providers
Monitor and alert on per-provider costs:
class CostTracker:
    """Track costs by provider for optimization."""

    def __init__(self):
        self.costs_by_provider = {}

    def record(self, provider: str, cost: float) -> None:
        """Record cost for a generation."""
        if provider not in self.costs_by_provider:
            self.costs_by_provider[provider] = []
        self.costs_by_provider[provider].append(cost)

    def get_summary(self) -> Dict:
        """Get cost breakdown as {"providers": {...}, "total": float}."""
        providers = {}
        total_cost = 0
        for provider, costs in self.costs_by_provider.items():
            provider_total = sum(costs)
            providers[provider] = {
                "requests": len(costs),
                "total": provider_total,
                "avg_per_request": provider_total / len(costs),
            }
            total_cost += provider_total
        # Calculate each provider's share of total spend.
        for row in providers.values():
            row["percentage"] = (
                row["total"] / total_cost * 100
            ) if total_cost > 0 else 0
        # Keep the grand total separate so it cannot collide with a provider name.
        return {"providers": providers, "total": total_cost}


tracker = CostTracker()
# After each request (`result` is the dict returned by a provider or the router):
tracker.record(result["provider"], result["cost"])
# Analyze monthly. Keys match result["provider"], e.g. "openai" and "anthropic".
summary = tracker.get_summary()
for provider, row in summary["providers"].items():
    print(f"{provider}: ${row['total']:.2f} ({row['percentage']:.1f}%)")
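Tracking covers the monitoring half; alerting needs a threshold to compare against. A minimal sketch follows, assuming you keep a per-provider monthly budget. The `check_budgets` function, the 80% warning ratio, and the dollar figures are hypothetical and would normally feed a pager or chat webhook rather than `print`.

```python
from typing import Dict, List


def check_budgets(
    spend_by_provider: Dict[str, float],
    budgets: Dict[str, float],
    warn_ratio: float = 0.8,
) -> List[str]:
    """Return one alert message per provider that is near or over its budget."""
    alerts = []
    for provider, spend in sorted(spend_by_provider.items()):
        budget = budgets.get(provider)
        if budget is None or budget <= 0:
            # Spend without a budget is worth flagging on its own.
            alerts.append(f"{provider}: no budget configured (spent ${spend:.2f})")
        elif spend >= budget:
            alerts.append(f"{provider}: OVER budget (${spend:.2f} of ${budget:.2f})")
        elif spend >= warn_ratio * budget:
            alerts.append(f"{provider}: WARNING, ${spend:.2f} of ${budget:.2f} used")
    return alerts


# Hypothetical month-to-date totals, keyed like result["provider"].
spend = {"openai": 412.50, "anthropic": 95.10}
budgets = {"openai": 500.00, "anthropic": 300.00}
for message in check_budgets(spend, budgets):
    print(message)
```

Running the check on a schedule, for example hourly, catches a runaway prompt loop long before the monthly invoice does.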
Key Takeaways
- Multi-provider routing cuts downtime by 10x or more: Failover across independent providers can move you from 99.9% toward 99.99% availability.
- Cost-aware routing cuts costs 30–50%: Choose cheaper models for simple tasks, expensive models only for complex reasoning.
- Capability routing matches task to model: Fast tasks → GPT-3.5-turbo, reasoning → GPT-4, cost-sensitive → Claude Haiku.
- Circuit breakers isolate failures: Stop sending traffic to failing providers; recover gracefully when healthy again.
- Track costs per provider: Optimize vendor mix to minimize total spend.
Frequently Asked Questions
How many providers do I need?
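Two is the minimum for failover: if failures are independent, two 99.9% providers are both down only about 0.0001% of the time, though correlated outages shrink that gain in practice. Three or more enable cost optimization. We recommend: one premium (GPT-4), one mid-tier (Claude Sonnet), one budget (GPT-3.5-turbo or Claude Haiku).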
Should I use local LLMs in the provider mix?
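Yes, for certain use cases: simple classification, summarization, data extraction. Local models (Llama 2, Mistral) have no per-token API fee and no network round trip, but you pay for hardware and operations, inference speed depends on that hardware, and output quality is generally lower. Use them as a fallback for cost-critical, quality-tolerant tasks.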
How do I handle token compatibility across providers?
Token counts vary: OpenAI's GPT-4 uses a different tokenizer than Claude, so the same prompt yields different counts. Before a request, estimate conservatively (use the highest estimate across providers); after it, record the actual token counts from each provider's API response and store them in your database.
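A conservative pre-request estimate can be as simple as taking the largest estimate across the pool. The sketch below uses characters-per-token ratios as a stand-in for real tokenizers; the ratios are placeholder assumptions, not published figures, and should be calibrated against the actual usage numbers your providers return.

```python
import math
from typing import Dict, Optional

# Hypothetical characters-per-token ratios. Calibrate from real responses.
CHARS_PER_TOKEN: Dict[str, float] = {
    "openai": 4.0,
    "anthropic": 3.5,
}


def estimate_tokens_conservative(
    text: str, ratios: Optional[Dict[str, float]] = None
) -> int:
    """Upper-bound token estimate: take the largest estimate across providers."""
    table = CHARS_PER_TOKEN if ratios is None else ratios
    if not text:
        return 0
    return max(math.ceil(len(text) / ratio) for ratio in table.values())


prompt = "Explain quantum computing in simple terms."
print(f"budget for at most {estimate_tokens_conservative(prompt)} input tokens")
```

The estimate is only for budgeting before the call. Once the response arrives, the provider's reported usage is the number to store and bill against.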
What if a provider raises prices?
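Update the cost function in the provider class. Because the router sorts providers by observed average cost, traffic shifts as new requests are recorded: if OpenAI gets expensive, Claude moves ahead of it in the fallback order. The averages are cumulative, so reset or window the stats if you need the shift to happen quickly.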
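Price changes are easier to apply when rates live in one table instead of inside each provider class. The following sketch assumes a per-model price table; the `ModelPrice` and `cost_for` names are hypothetical, and the rates are the per-1K-token equivalents of the figures used in the provider classes above, which may be out of date.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelPrice:
    """USD per 1,000 tokens."""
    input_per_1k: float
    output_per_1k: float


# Rates mirror the article's get_cost() methods; verify against current price lists.
PRICES: Dict[str, ModelPrice] = {
    "gpt-4": ModelPrice(input_per_1k=0.03, output_per_1k=0.06),
    "claude-3-sonnet": ModelPrice(input_per_1k=0.003, output_per_1k=0.015),
}


def cost_for(
    model: str,
    input_tokens: int,
    output_tokens: int,
    prices: Optional[Dict[str, ModelPrice]] = None,
) -> float:
    """Cost of one generation, failing loudly for models without a price."""
    table = PRICES if prices is None else prices
    if model not in table:
        raise KeyError(f"No price configured for model {model!r}")
    price = table[model]
    return (
        input_tokens / 1000 * price.input_per_1k
        + output_tokens / 1000 * price.output_per_1k
    )


print(f"gpt-4:           ${cost_for('gpt-4', 1000, 1000):.4f}")
print(f"claude-3-sonnet: ${cost_for('claude-3-sonnet', 1000, 1000):.4f}")
```

A provider's `get_cost` could delegate to a lookup like this, so a price change becomes a one-line edit to the table, or a config reload, instead of a code change in each class.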
Can I use multi-provider load balancing for fine-tuning?
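Not directly: fine-tuned models are provider-specific, so a request that needs one cannot fail over to another vendor's base model without a quality change. You can still use the abstraction layer to manage fine-tuned versions: the router selects the base model, then you map it to the fine-tuned endpoint for that provider.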
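One way to do that mapping is a lookup applied to the `model_map` just before it reaches the router. This is a sketch only: the `resolve_model_map` helper is hypothetical and the fine-tuned ID is a placeholder, not a real deployment.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping from (provider, base model) to a fine-tuned model ID.
# The ID below is a placeholder for illustration.
FINE_TUNED_MODELS: Dict[Tuple[str, str], str] = {
    ("OpenAIProvider", "gpt-3.5-turbo"): "ft:gpt-3.5-turbo:example-org::placeholder",
}


def resolve_model_map(
    model_map: Dict[str, str],
    overrides: Optional[Dict[Tuple[str, str], str]] = None,
) -> Dict[str, str]:
    """Swap base models for fine-tuned variants where one is registered."""
    table = FINE_TUNED_MODELS if overrides is None else overrides
    return {
        provider: table.get((provider, model), model)
        for provider, model in model_map.items()
    }


resolved = resolve_model_map({
    "OpenAIProvider": "gpt-3.5-turbo",
    "AnthropicProvider": "claude-3-haiku",
})
print(resolved)
```

Providers without a registered fine-tune keep their base model, so fallback still works, with the caveat that the fallback answer comes from a model that has not seen your fine-tuning data.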
Further Reading
- Building Reliable Systems with Fallbacks — circuit breaker theory.
- OpenAI API Documentation — rate limits and failover strategies.
- Anthropic Claude API — alternative provider.
- Google Vertex AI — another major provider for comparison.