
Load Balancing Across LLM Providers: Multi-Provider Strategies

Relying on a single LLM provider is risky. OpenAI's API can degrade, Anthropic's quota can saturate, and costs can climb quickly as traffic grows. Multi-provider load balancing routes requests across OpenAI, Anthropic, Google Vertex, and local models based on cost, latency, availability, and model capability. With intelligent routing, you can raise uptime from 99.9% to 99.99% (roughly 10x less downtime), reduce costs by 30–50% through cheaper model fallbacks, and maintain service quality during provider incidents. Many production LLM products rely on this pattern for reliability.

Single Provider vs. Multi-Provider Architecture

| Metric | Single Provider | Multi-Provider |
|---|---|---|
| Uptime | 99.9% (about 8.8 hours downtime/year) | 99.99% (about 53 minutes downtime/year) |
| Cost per token | Fixed | Optimized: 30–50% cheaper on average |
| Latency p99 | Single SLA (200–2000ms) | Fastest available provider (typically 200ms) |
| Rate limits | Hit one provider's ceiling | Aggregate across providers |
| Model diversity | One architecture | Mix of models for different strengths |

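With three providers whose failures are fully independent (OpenAI 99.5%, Anthropic 99.8%, Google 99.9%), the probability that all three are down at the same time is 0.005 × 0.002 × 0.001 = 0.00000001, so the theoretical combined uptime is 1 - 0.00000001 = 99.999999%. Real outages are rarely fully independent (shared cloud regions, your own network, the router itself), so 99.99% is a more realistic target than the theoretical figure.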
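The availability arithmetic can be checked with a short sketch. It assumes provider failures are statistically independent, which is an idealization; the function names `combined_uptime` and `downtime_hours_per_year` are illustrative and not part of any library.

```python
from math import prod

HOURS_PER_YEAR = 8760.0  # 365 days, ignoring leap years


def combined_uptime(uptimes):
    """Probability that at least one provider is up, assuming independent failures."""
    return 1.0 - prod(1.0 - u for u in uptimes)


def downtime_hours_per_year(uptime):
    """Expected yearly downtime in hours for a given availability."""
    return (1.0 - uptime) * HOURS_PER_YEAR


# Availability figures quoted in the article.
providers = {"openai": 0.995, "anthropic": 0.998, "google": 0.999}
combined = combined_uptime(providers.values())

print(f"Theoretical combined uptime: {combined:.8%}")
print(f"99.9% uptime:  {downtime_hours_per_year(0.999):.2f} hours/year")
print(f"99.99% uptime: {downtime_hours_per_year(0.9999) * 60:.1f} minutes/year")
```

Correlated failures (for example, two providers hosted in the same cloud region) push the real figure below the theoretical one, so treat the result as an upper bound.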

Provider Abstraction Layer

Build a unified interface that hides provider-specific details:

from abc import ABC, abstractmethod
from typing import List, Dict, Optional
import asyncio


class LLMProvider(ABC):
    """Abstract base for LLM providers."""

    @abstractmethod
    async def generate(
        self,
        prompt: str,
        model: str,
        max_tokens: int = 1000,
        temperature: float = 0.7,
    ) -> Dict:
        """
        Generate text. Return: {
            "text": str,
            "tokens_input": int,
            "tokens_output": int,
            "latency_ms": float,
            "cost": float,
        }
        """
        pass

    @abstractmethod
    async def health_check(self) -> bool:
        """Return True if provider is healthy."""
        pass

    @abstractmethod
    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        """Calculate cost for this generation."""
        pass


import os
import time

import aiohttp

# Read the key from the environment rather than hard-coding it.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")


class OpenAIProvider(LLMProvider):
    """OpenAI API wrapper."""

    async def generate(self, prompt: str, model: str, **kwargs) -> Dict:
        model = model or "gpt-4"
        start = time.perf_counter()  # monotonic clock for latency measurement

        async with aiohttp.ClientSession() as session:
            async with session.post(
                "https://api.openai.com/v1/chat/completions",
                json={
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": kwargs.get("max_tokens", 1000),
                    "temperature": kwargs.get("temperature", 0.7),
                },
                headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
                timeout=aiohttp.ClientTimeout(total=30),
            ) as resp:
                if resp.status != 200:
                    raise RuntimeError(f"OpenAI API error {resp.status}")

                data = await resp.json()
                latency_ms = (time.perf_counter() - start) * 1000

        usage = data["usage"]
        return {
            "text": data["choices"][0]["message"]["content"],
            "tokens_input": usage["prompt_tokens"],
            "tokens_output": usage["completion_tokens"],
            "latency_ms": latency_ms,
            "cost": self.get_cost(usage["prompt_tokens"], usage["completion_tokens"]),
            "provider": "openai",
            "model": model,
        }

    async def health_check(self) -> bool:
        """Probe OpenAI with a 1-token request. Note: this is a billed API call."""
        try:
            await self.generate("test", "gpt-3.5-turbo", max_tokens=1)
            return True
        except Exception:  # avoid a bare except, which also swallows KeyboardInterrupt
            return False

    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        # GPT-4 pricing ($0.03 / $0.06 per 1K tokens). It is applied to every
        # model here, so it overstates the cost of cheaper models.
        return input_tokens * 0.00003 + output_tokens * 0.00006


import os
import time

import httpx

# Read the key from the environment rather than hard-coding it.
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")


class AnthropicProvider(LLMProvider):
    """Anthropic Claude API wrapper."""

    async def generate(self, prompt: str, model: str, **kwargs) -> Dict:
        # "claude-3-sonnet" is shorthand; the API expects a full, dated model ID,
        # so substitute a current one from Anthropic's model list.
        model = model or "claude-3-sonnet"
        start = time.perf_counter()  # monotonic clock for latency measurement

        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.anthropic.com/v1/messages",
                json={
                    "model": model,
                    "max_tokens": kwargs.get("max_tokens", 1000),
                    "messages": [{"role": "user", "content": prompt}],
                },
                headers={
                    "x-api-key": ANTHROPIC_API_KEY,
                    "anthropic-version": "2023-06-01",
                },
                timeout=30,
            )

        if response.status_code != 200:
            raise RuntimeError(f"Anthropic API error {response.status_code}")

        data = response.json()
        latency_ms = (time.perf_counter() - start) * 1000

        usage = data["usage"]
        return {
            "text": data["content"][0]["text"],
            "tokens_input": usage["input_tokens"],
            "tokens_output": usage["output_tokens"],
            "latency_ms": latency_ms,
            "cost": self.get_cost(usage["input_tokens"], usage["output_tokens"]),
            "provider": "anthropic",
            "model": model,
        }

    async def health_check(self) -> bool:
        """Probe Anthropic with a 1-token request. Note: this is a billed API call."""
        try:
            await self.generate("test", "claude-3-sonnet", max_tokens=1)
            return True
        except Exception:  # avoid a bare except, which also swallows KeyboardInterrupt
            return False

    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        # Claude 3 Sonnet pricing ($3 / $15 per million tokens)
        return input_tokens * 0.000003 + output_tokens * 0.000015

Router: Cost-Aware and Latency-Aware Load Balancing

import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProviderStats:
    """Track provider performance."""
    name: str
    success_count: int = 0
    failure_count: int = 0
    total_latency_ms: float = 0
    total_cost: float = 0

    @property
    def success_rate(self) -> float:
        total = self.success_count + self.failure_count
        return self.success_count / total if total > 0 else 0

    @property
    def avg_latency_ms(self) -> float:
        return self.total_latency_ms / max(1, self.success_count)

    @property
    def avg_cost(self) -> float:
        return self.total_cost / max(1, self.success_count)


class LLMRouter:
    """
    Route requests across multiple providers.
    Strategy: cheapest provider first, with optional cost and latency ceilings.
    """

    def __init__(self, providers: List[LLMProvider]):
        self.providers = providers
        self.stats = {p.__class__.__name__: ProviderStats(p.__class__.__name__) for p in providers}

    async def generate_with_fallback(
        self,
        prompt: str,
        model_map: Dict[str, str],  # {provider_name: model_id}
        max_cost: Optional[float] = None,
        max_latency_ms: Optional[float] = None,
    ) -> Dict:
        """
        Try providers in ascending order of observed average cost.
        Fall back to the next provider on failure or constraint violation.
        """
        # Sort providers by observed average cost (ascending). Providers with
        # no successful calls yet report an average of 0 and are tried first.
        sorted_providers = sorted(
            self.providers,
            key=lambda p: self.stats[p.__class__.__name__].avg_cost,
        )

        errors = []

        for provider in sorted_providers:
            provider_name = provider.__class__.__name__

            # Only use providers the caller mapped to a model; sending a
            # placeholder model ID would just produce an API error.
            if provider_name not in model_map:
                continue

            # Skip if health check fails. The providers above implement
            # health_check() as a real completion, so calling it per request
            # adds cost and latency; cache the result in production.
            if not await provider.health_check():
                errors.append(f"{provider_name}: unhealthy")
                continue

            try:
                result = await provider.generate(prompt, model_map[provider_name])

                # Check constraints. The call has already been billed at this
                # point; these checks only decide whether to accept the result.
                if max_cost is not None and result["cost"] > max_cost:
                    errors.append(f"{provider_name}: cost ${result['cost']:.6f} > ${max_cost:.6f}")
                    continue

                if max_latency_ms is not None and result["latency_ms"] > max_latency_ms:
                    errors.append(f"{provider_name}: latency {result['latency_ms']:.0f}ms > {max_latency_ms:.0f}ms")
                    continue

                # Success! Update stats.
                self.stats[provider_name].success_count += 1
                self.stats[provider_name].total_latency_ms += result["latency_ms"]
                self.stats[provider_name].total_cost += result["cost"]

                return result

            except Exception as e:
                self.stats[provider_name].failure_count += 1
                errors.append(f"{provider_name}: {e}")

        # All providers failed.
        raise RuntimeError(f"All LLM providers failed: {errors}")

    def get_stats(self) -> Dict:
        """Return performance stats for monitoring."""
        return {
            name: {
                "success_rate": stats.success_rate,
                "avg_latency_ms": stats.avg_latency_ms,
                "avg_cost": stats.avg_cost,
            }
            for name, stats in self.stats.items()
        }


# Example usage
import asyncio


async def main() -> None:
    router = LLMRouter([
        OpenAIProvider(),
        AnthropicProvider(),
    ])

    result = await router.generate_with_fallback(
        prompt="Explain quantum computing",
        model_map={
            "OpenAIProvider": "gpt-4",
            "AnthropicProvider": "claude-3-sonnet",
        },
        # Max $0.05 per request. At the per-token prices above, a cap of
        # $0.0001 would reject any answer longer than a few tokens.
        max_cost=0.05,
        max_latency_ms=2000,  # Max 2 second latency
    )

    print(f"Response from {result['provider']}: {result['text'][:100]}")
    print(f"Cost: ${result['cost']:.6f}, Latency: {result['latency_ms']:.0f}ms")
    print(f"\nProvider stats: {router.get_stats()}")


# "await" is only valid inside a coroutine, so run the example via asyncio.
asyncio.run(main())
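The router above orders providers by average cost only. One way to make the ordering latency-aware as well is to blend normalized cost and latency into a single score. The sketch below is a possible approach, not the article's implementation: the `Candidate` type, the 0.6/0.4 weights, and the sample figures are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Observed averages for one provider (hypothetical values for illustration)."""
    name: str
    avg_cost: float        # dollars per request
    avg_latency_ms: float  # milliseconds per request
    success_rate: float    # 0.0 to 1.0


def score(c: Candidate, max_cost: float, max_latency_ms: float,
          cost_weight: float = 0.6, latency_weight: float = 0.4) -> float:
    """Lower is better. Cost and latency are scaled to [0, 1], blended, then
    divided by success rate so unreliable providers are penalized."""
    cost_term = c.avg_cost / max_cost if max_cost > 0 else 0.0
    latency_term = c.avg_latency_ms / max_latency_ms if max_latency_ms > 0 else 0.0
    blended = cost_weight * cost_term + latency_weight * latency_term
    return blended / max(c.success_rate, 0.01)  # floor avoids division by zero


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Sort candidates from best (lowest score) to worst."""
    max_cost = max(c.avg_cost for c in candidates)
    max_latency = max(c.avg_latency_ms for c in candidates)
    return sorted(candidates, key=lambda c: score(c, max_cost, max_latency))


candidates = [
    Candidate("premium", avg_cost=0.030, avg_latency_ms=1800, success_rate=0.99),
    Candidate("mid", avg_cost=0.004, avg_latency_ms=900, success_rate=0.98),
    Candidate("budget", avg_cost=0.001, avg_latency_ms=400, success_rate=0.70),
]
ranked = rank(candidates)
print([c.name for c in ranked])
```

In the router, the same score could replace the `avg_cost` sort key. The weights are a tuning decision: a latency-sensitive product would raise `latency_weight`.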

Intelligent Routing: Capability-Based Selection

Different providers excel at different tasks. Route based on task type:

class CapabilityRouter:
    """Route by task: reasoning → GPT-4, fast → GPT-3.5, cost-sensitive → Claude Haiku."""

    def __init__(self, router: LLMRouter):
        # The underlying cost-aware router that performs the actual calls.
        self.router = router

    async def generate(self, prompt: str, task_type: str) -> Dict:
        """
        Route to provider by task characteristics.
        """
        if task_type == "reasoning":
            # Complex logic: use GPT-4
            return await self.router.generate_with_fallback(
                prompt,
                model_map={"OpenAIProvider": "gpt-4"},
            )
        elif task_type == "fast":
            # Speed critical: use fastest available
            return await self.router.generate_with_fallback(
                prompt,
                model_map={"OpenAIProvider": "gpt-3.5-turbo"},
                max_latency_ms=500,
            )
        elif task_type == "summarize":
            # Cost-critical: use cheapest
            return await self.router.generate_with_fallback(
                prompt,
                model_map={
                    "AnthropicProvider": "claude-3-haiku",
                    "OpenAIProvider": "gpt-3.5-turbo",
                },
                max_cost=0.00001,  # Very cheap
            )

        # Fail loudly instead of silently returning None for unknown tasks.
        raise ValueError(f"Unknown task_type: {task_type!r}")

Task classification can be rule-based (prompt length, keywords) or ML-based (classifier model).
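As a rough illustration of the rule-based option, the sketch below maps a prompt to one of the three task types used by `CapabilityRouter` ("reasoning", "fast", "summarize"). The keyword lists and the 4000-character threshold are assumptions picked for the example, not tuned values.

```python
import re

# Hypothetical keyword hints; a real deployment would tune these on its own traffic.
SUMMARY_HINTS = re.compile(r"\b(summari[sz]e|condense|shorten)\b", re.IGNORECASE)
REASONING_HINTS = re.compile(
    r"\b(prove|derive|debug|analy[sz]e|step[- ]by[- ]step)\b", re.IGNORECASE
)


def classify_task(prompt: str, long_prompt_chars: int = 4000) -> str:
    """Return "summarize", "reasoning", or "fast" based on simple rules."""
    # Long inputs and explicit summary requests go to the cheap tier.
    if SUMMARY_HINTS.search(prompt) or len(prompt) > long_prompt_chars:
        return "summarize"
    # Prompts that ask for proofs, derivations, or debugging go to the premium tier.
    if REASONING_HINTS.search(prompt):
        return "reasoning"
    # Everything else is treated as a short, latency-sensitive request.
    return "fast"


for example in (
    "Summarize this article in three bullet points.",
    "Prove that the square root of 2 is irrational.",
    "What is the capital of France?",
):
    print(f"{classify_task(example):>10}: {example}")
```

Rules like these are cheap and predictable but misroute prompts that do not contain the expected keywords; a small classifier model trades some latency for better coverage.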

Circuit Breaker Pattern for Provider Failure Recovery

When a provider fails repeatedly, stop sending traffic to it for a cooldown period:

import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # Normal, send traffic
    OPEN = "open"            # Failing, skip this provider
    HALF_OPEN = "half_open"  # Testing, allow trial requests


class CircuitBreaker:
    """
    Circuit breaker: detect and isolate failing providers.
    """

    def __init__(self, failure_threshold: int = 5, cooldown_sec: int = 60):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.failure_threshold = failure_threshold
        self.cooldown_sec = cooldown_sec

    def record_success(self) -> None:
        """Reset on success."""
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        """Increment failure count; open the circuit at the threshold, or
        immediately if a trial request fails while HALF_OPEN."""
        self.failure_count += 1
        # Monotonic clock: unaffected by system clock adjustments.
        self.last_failure_time = time.monotonic()

        if (
            self.state == CircuitState.HALF_OPEN
            or self.failure_count >= self.failure_threshold
        ):
            self.state = CircuitState.OPEN

    def should_allow_request(self) -> bool:
        """Check if request should be allowed."""
        if self.state == CircuitState.CLOSED:
            return True

        if self.state == CircuitState.OPEN:
            # Check if cooldown expired; move to HALF_OPEN.
            if time.monotonic() - self.last_failure_time > self.cooldown_sec:
                self.state = CircuitState.HALF_OPEN
                self.failure_count = 0  # Reset for testing
                return True
            return False

        # HALF_OPEN: allow request to test if provider recovered.
        return True


# Wrap provider with circuit breaker.
openai_provider = OpenAIProvider()
openai_breaker = CircuitBreaker(failure_threshold=5, cooldown_sec=60)


async def call_openai_safe(prompt: str) -> Dict:
    if not openai_breaker.should_allow_request():
        raise RuntimeError("OpenAI provider circuit is OPEN. Retrying later.")

    try:
        result = await openai_provider.generate(prompt, "gpt-4")
        openai_breaker.record_success()
        return result
    except Exception:
        openai_breaker.record_failure()
        raise

Cost Tracking Across Providers

Monitor and alert on per-provider costs:

class CostTracker:
    """Track costs by provider for optimization."""

    def __init__(self):
        self.costs_by_provider = {}

    def record(self, provider: str, cost: float) -> None:
        """Record cost for a generation."""
        if provider not in self.costs_by_provider:
            self.costs_by_provider[provider] = []

        self.costs_by_provider[provider].append(cost)

    def get_summary(self) -> Dict:
        """Get cost breakdown."""
        summary = {}
        total_cost = 0

        for provider, costs in self.costs_by_provider.items():
            provider_total = sum(costs)
            summary[provider] = {
                "requests": len(costs),
                "total": provider_total,
                "avg_per_request": provider_total / len(costs),
            }
            total_cost += provider_total

        # Calculate percentage.
        for provider in summary:
            summary[provider]["percentage"] = (
                summary[provider]["total"] / total_cost * 100
            ) if total_cost > 0 else 0

        summary["total"] = total_cost
        return summary


tracker = CostTracker()

# After each request:
tracker.record(result["provider"], result["cost"])

# Analyze monthly. Provider keys match result["provider"] ("openai",
# "anthropic"); the "total" entry is the overall sum, not a provider.
summary = tracker.get_summary()
for provider, entry in summary.items():
    if provider == "total":
        continue
    print(f"{provider}: ${entry['total']:.2f} ({entry['percentage']:.1f}%)")
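The tracker records spend but does not raise alerts. A minimal sketch of the alerting half follows, assuming a fixed dollar budget per provider and a single warning at 80% of that budget; `BudgetMonitor`, the budgets, and the threshold are hypothetical names and values for illustration.

```python
from collections import defaultdict
from typing import Dict, Optional


class BudgetMonitor:
    """Accumulate spend per provider and warn once when a budget share is crossed."""

    def __init__(self, budgets: Dict[str, float], warn_fraction: float = 0.8):
        self.budgets = dict(budgets)        # provider -> budget in dollars
        self.warn_fraction = warn_fraction  # alert at this share of the budget
        self.spend: Dict[str, float] = defaultdict(float)
        self._alerted = set()               # providers that already triggered an alert

    def record(self, provider: str, cost: float) -> Optional[str]:
        """Add a cost; return an alert message the first time the threshold is crossed."""
        self.spend[provider] += cost
        budget = self.budgets.get(provider)
        if budget is None or provider in self._alerted:
            return None
        if self.spend[provider] >= budget * self.warn_fraction:
            self._alerted.add(provider)
            return (
                f"{provider}: ${self.spend[provider]:.2f} spent, "
                f"{self.warn_fraction:.0%} of ${budget:.2f} budget reached"
            )
        return None


monitor = BudgetMonitor({"openai": 10.0, "anthropic": 5.0})
for cost in (5.0, 3.5, 1.0):
    alert = monitor.record("openai", cost)
    if alert:
        print(alert)  # in production, send to your paging or chat system instead
```

Returning the message rather than printing inside `record` keeps the monitor independent of any particular notification channel.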

Key Takeaways

  • Multi-provider routing cuts downtime by 10x or more: Aggregating independent providers moves a 99.9% single-provider SLA toward 99.99%.
  • Cost-aware routing cuts costs 30–50%: Choose cheaper models for simple tasks, expensive models only for complex reasoning.
  • Capability routing matches task to model: Fast tasks → GPT-3.5-turbo, reasoning → GPT-4, cost-sensitive → Claude Haiku.
  • Circuit breakers isolate failures: Stop sending traffic to failing providers; recover gracefully when healthy again.
  • Track costs per provider: Optimize vendor mix to minimize total spend.

Frequently Asked Questions

How many providers do I need?

Two is the minimum for failover: if the two providers fail independently, expected downtime drops by roughly 100–1000x. Three or more enable cost optimization. We recommend: one premium (GPT-4), one mid-tier (Claude Sonnet), one budget (GPT-3.5-turbo or Claude Haiku).

Should I use local LLMs in the provider mix?

Yes, for certain use cases: simple classification, summarization, data extraction. Local models (Llama 2, Mistral) avoid network round trips and per-token API fees, but you pay for hardware and operations, and output quality is generally lower. Use them as a fallback for cost-critical, quality-tolerant tasks.

How do I handle token compatibility across providers?

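Token counts vary because each provider uses its own tokenizer: the same prompt yields different counts on GPT-4 and Claude. For budgeting before a call, estimate conservatively (use the highest estimate). For accounting after a call, use the actual token counts from each provider's API response and store them in your database.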
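The "use the highest estimate" rule can be sketched as follows. The characters-per-token ratios are rough assumptions for English text, not real tokenizer behavior; swap in each provider's own token-counting tools when exact numbers matter.

```python
import math
from typing import Dict

# Hypothetical characters-per-token ratios; real tokenizers differ by model and language.
ASSUMED_CHARS_PER_TOKEN: Dict[str, float] = {
    "openai": 4.0,
    "anthropic": 3.5,
}


def estimate_tokens(text: str, chars_per_token: float) -> int:
    """Heuristic token estimate, rounded up so the budget errs on the high side."""
    return math.ceil(len(text) / chars_per_token)


def conservative_estimate(text: str, ratios: Dict[str, float] = ASSUMED_CHARS_PER_TOKEN) -> int:
    """Take the largest per-provider estimate so the prompt fits on every provider."""
    return max(estimate_tokens(text, ratio) for ratio in ratios.values())


prompt = "Explain the trade-offs between caching and recomputation in one paragraph."
print(f"Budget for about {conservative_estimate(prompt)} input tokens")
```

Once a request completes, the `tokens_input` and `tokens_output` fields returned by each provider wrapper are the authoritative figures.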

What if a provider raises prices?

Update the cost function in the provider class. Because the router sorts providers by observed average cost, it adapts as new requests are recorded: if OpenAI gets expensive, Claude moves to the front of the order.
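Hard-coding prices inside each `get_cost` method means a code change for every price update. One alternative is a shared price table keyed by provider and model, sketched below. The `Price` type and `PRICES` table are assumptions for illustration; the per-token figures are the ones used in the provider classes earlier in this article.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Price:
    """Dollars per token for one model."""
    input_per_token: float
    output_per_token: float


# Per-token prices copied from the get_cost methods earlier in the article.
PRICES: Dict[Tuple[str, str], Price] = {
    ("openai", "gpt-4"): Price(0.00003, 0.00006),
    ("anthropic", "claude-3-sonnet"): Price(0.000003, 0.000015),
}


def get_cost(provider: str, model: str, input_tokens: int, output_tokens: int,
             prices: Dict[Tuple[str, str], Price] = PRICES) -> float:
    """Look up the model's price and compute the cost of one request."""
    try:
        price = prices[(provider, model)]
    except KeyError:
        # Fail loudly: a silently wrong price would corrupt routing decisions.
        raise KeyError(f"No price configured for {provider}/{model}") from None
    return input_tokens * price.input_per_token + output_tokens * price.output_per_token


print(f"GPT-4, 1000 in / 500 out: ${get_cost('openai', 'gpt-4', 1000, 500):.4f}")

# A price change becomes a data update rather than a code change.
updated = {**PRICES, ("openai", "gpt-4"): Price(0.00006, 0.00012)}
print(f"After a 2x increase:      ${get_cost('openai', 'gpt-4', 1000, 500, updated):.4f}")
```

Loading the table from a config file or database would let you change prices without redeploying.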

Can I use multi-provider load balancing for fine-tuning?

Not directly: fine-tuned models are provider-specific. But you can use the abstraction layer to manage fine-tuned versions: the router selects the base model, and you then route the request to the corresponding fine-tuned endpoint.
