
Circuit Breaker Pattern: Fail Fast and Recover

The circuit breaker pattern is a safety mechanism borrowed from electrical systems. Just as a circuit breaker trips to prevent fires, a software circuit breaker trips to prevent cascading failures. When an API is struggling (returning many errors), a circuit breaker stops sending requests to it immediately instead of continuing to retry and waste quota. This allows the struggling service to recover without being hammered by requests.

A circuit breaker is a state machine with three states: closed (requests flow normally), open (requests fail immediately without trying), and half-open (a few test requests are allowed to check if recovery is happening). When errors exceed a threshold, the breaker opens. After a timeout, it enters half-open and tests recovery. If tests succeed, it closes and normal operation resumes.

Understanding Circuit Breaker States

Closed state: The normal operating mode. Requests flow to the API. If errors occur, they are counted. If the error count stays below the threshold, the breaker remains closed. This state is "healthy."

Open state: An error threshold has been exceeded (e.g., 5 failures in 60 seconds). The breaker opens and rejects all new requests immediately without attempting them. This is the "circuit is tripped" state. It protects the struggling API from further load and saves your quota.

Half-open state: After the open state lasts for a timeout period (e.g., 30 seconds), the breaker transitions to half-open. In this state, a limited number of requests are allowed through as "test probes" to see if the API has recovered. If these requests succeed, the breaker closes and resumes normal operation. If they fail, the breaker opens again.
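The three states and their transitions can be written down as a small lookup table. This is only an illustrative sketch: the `State`, `Event`, and `next_state` names are hypothetical, and the single `PROBE_SUCCEEDED` event stands for "enough test requests succeeded" rather than one individual success.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):
    THRESHOLD_EXCEEDED = "threshold_exceeded"  # too many errors while closed
    TIMEOUT_EXPIRED = "timeout_expired"        # recovery timeout elapsed while open
    PROBE_SUCCEEDED = "probe_succeeded"        # enough test requests succeeded
    PROBE_FAILED = "probe_failed"              # a test request failed


# (current state, event) -> next state.
# Any pair that is not listed leaves the state unchanged.
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_EXCEEDED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_EXPIRED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.PROBE_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.PROBE_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the state that follows `state` when `event` occurs."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in one table makes it easy to check that every state has a way out and that no unexpected transition exists.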

The diagram below illustrates the transitions:

        threshold exceeded         timeout expires
┌─────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│   CLOSED    │────→│       OPEN       │────→│      HALF-OPEN      │
│  (requests  │     │  (requests fail  │     │    (limited test    │
│  flowing)   │     │   immediately)   │     │  requests allowed)  │
└─────────────┘     └──────────────────┘     └─────────────────────┘
       ↑                      ↑                          │
       │                      └── test request fails ────┤
       └────────────── test requests succeed ────────────┘

Implementing a Circuit Breaker in Python

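Here is a thread-safe implementation. Note that it counts consecutive failures (any success resets the count) rather than failures within a time window: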

import time
from enum import Enum
from threading import Lock
from typing import Callable, Optional, Type, TypeVar

T = TypeVar('T')

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing fast
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreakerOpen(Exception):
    """Raised when circuit breaker is open."""
    pass

class CircuitBreaker:
    """Thread-safe circuit breaker for API calls."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        expected_exception: Type[Exception] = Exception
    ):
        """
        Args:
            failure_threshold: Number of consecutive failures before opening circuit
            recovery_timeout: Seconds to wait before entering half-open
            expected_exception: Exception type to catch (default: all)
        """
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception

        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time: Optional[float] = None
        self.lock = Lock()

    def call(self, func: Callable[..., T], *args, **kwargs) -> T:
        """
        Execute func through the circuit breaker.

        Raises:
            CircuitBreakerOpen: If the circuit is open
            The original exception if the call fails
        """
        with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    raise CircuitBreakerOpen(
                        f"Circuit is open. Requests resume "
                        f"{self.recovery_timeout}s after the last failure."
                    )

        # The lock is released while func runs so slow calls do not block
        # other threads. As a result, concurrent callers can all pass
        # through during half-open; this class does not cap the probes.
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except self.expected_exception:
            self._on_failure()
            raise

    def _on_success(self):
        """Handle successful request."""
        with self.lock:
            self.failure_count = 0

            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                # Close circuit after 2 successful half-open requests
                if self.success_count >= 2:
                    self.state = CircuitState.CLOSED
                    print("Circuit closed. API recovered.")

    def _on_failure(self):
        """Handle failed request."""
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.state == CircuitState.HALF_OPEN:
                # A single failure in half-open reopens the circuit
                self.state = CircuitState.OPEN
                print("Circuit reopened. API still struggling.")
            elif self.state == CircuitState.CLOSED:
                if self.failure_count >= self.failure_threshold:
                    self.state = CircuitState.OPEN
                    print(f"Circuit opened. Failures: {self.failure_count}.")

    def _should_attempt_reset(self) -> bool:
        """Check if recovery timeout has elapsed. Call with the lock held."""
        if self.last_failure_time is None:
            return False
        elapsed = time.time() - self.last_failure_time
        return elapsed >= self.recovery_timeout

    def state_info(self) -> dict:
        """Return current state info for monitoring."""
        with self.lock:
            return {
                "state": self.state.value,
                "failure_count": self.failure_count,
                "last_failure_time": self.last_failure_time
            }

Usage with an LLM API:

import os

import requests

# Create a circuit breaker for the OpenAI API
breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=30.0,
    expected_exception=requests.exceptions.RequestException
)

def call_openai():
    """Make an OpenAI API call."""
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        # Assumes the API key is stored in the OPENAI_API_KEY environment variable
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "gpt-4", "messages": [...]},
        timeout=60
    )
    # raise_for_status() raises HTTPError (a RequestException) for 4xx and
    # 5xx responses, so both count as failures for the breaker
    response.raise_for_status()
    return response.json()

try:
    result = breaker.call(call_openai)
except CircuitBreakerOpen as e:
    print(f"Circuit is open: {e}")
    # Handle gracefully: return cached result, use fallback API, etc.
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
    # Implement exponential backoff here
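One way to handle a rejected call gracefully is to serve the last known good result. The sketch below is a hypothetical helper, not part of the breaker itself: `call_with_fallback` and its parameters are assumed names, and `CircuitBreakerOpen` is redefined here only so the example runs on its own.

```python
from typing import Any, Callable, Dict, Optional


class CircuitBreakerOpen(Exception):
    """Stand-in for the exception raised by the breaker when it rejects a call."""


def call_with_fallback(
    primary: Callable[[], Any],
    cache: Dict[str, Any],
    key: str,
    default: Optional[Any] = None,
) -> Any:
    """Run `primary`; if the breaker rejects it, return the cached value for `key`.

    `primary` is expected to go through the breaker, for example
    `lambda: breaker.call(call_openai)`.
    """
    try:
        result = primary()
    except CircuitBreakerOpen:
        # Fail fast: do not wait or retry, serve stale data (or the default).
        return cache.get(key, default)
    # Remember the fresh result so it can be served during a later outage.
    cache[key] = result
    return result
```

Only the breaker's rejection triggers the fallback here; other exceptions still propagate so the caller can decide whether to retry.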

Implementing in JavaScript/TypeScript

Here is an async TypeScript version:

type CircuitState = "closed" | "open" | "half-open";

interface CircuitBreakerConfig {
  failureThreshold?: number;
  recoveryTimeout?: number; // milliseconds
  successThresholdInHalfOpen?: number;
}

// Dedicated error type so callers can tell a rejection from a real failure
class CircuitBreakerOpenError extends Error {
  constructor(message = "Circuit breaker is open") {
    super(message);
    this.name = "CircuitBreakerOpenError";
  }
}

class CircuitBreaker {
  private state: CircuitState = "closed";
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime: number | null = null;

  constructor(private config: CircuitBreakerConfig = {}) {
    this.config = {
      failureThreshold: 5,
      recoveryTimeout: 30000,
      successThresholdInHalfOpen: 2,
      ...config
    };
  }

  async execute<R>(fn: () => Promise<R>): Promise<R> {
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        this.state = "half-open";
        this.successCount = 0;
      } else {
        throw new CircuitBreakerOpenError();
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;

    if (this.state === "half-open") {
      this.successCount++;
      if (this.successCount >= (this.config.successThresholdInHalfOpen ?? 2)) {
        this.state = "closed";
        console.log("Circuit closed. API recovered.");
      }
    }
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();

    if (this.state === "half-open") {
      // A single failure in half-open reopens the circuit
      this.state = "open";
      console.log("Circuit reopened. API still struggling.");
    } else if (this.state === "closed") {
      if (this.failureCount >= (this.config.failureThreshold ?? 5)) {
        this.state = "open";
        console.log(`Circuit opened. Failures: ${this.failureCount}.`);
      }
    }
  }

  private shouldAttemptReset(): boolean {
    if (this.lastFailureTime === null) return false;
    const elapsed = Date.now() - this.lastFailureTime;
    return elapsed >= (this.config.recoveryTimeout ?? 30000);
  }

  getState(): { state: CircuitState; failureCount: number } {
    return { state: this.state, failureCount: this.failureCount };
  }
}

// Usage
const breaker = new CircuitBreaker({
  failureThreshold: 5,
  recoveryTimeout: 30000
});

// Assumes Node.js with the key stored in the ANTHROPIC_API_KEY environment variable
const apiKey = process.env.ANTHROPIC_API_KEY ?? "";

async function callAnthropicAPI() {
  return breaker.execute(async () => {
    const response = await fetch("https://api.anthropic.com/v1/messages", {
      method: "POST",
      headers: {
        "x-api-key": apiKey,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json"
      },
      body: JSON.stringify({
        model: "claude-3-sonnet-20240229",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Hello" }]
      })
    });

    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return response.json();
  });
}

// Wrap in try/catch to handle CircuitBreakerOpenError
try {
  const result = await callAnthropicAPI();
} catch (error) {
  if (error instanceof CircuitBreakerOpenError) {
    console.log("Using fallback or cached response");
  } else {
    console.error("Request failed:", error);
  }
}

Monitoring and Observability

In production, monitor your circuit breaker states. Use metrics or logging to track:

  • How often the circuit opens (high open rate = unhealthy upstream)
  • Average recovery time (how long until the API stabilizes)
  • Request success rate during half-open (indicates partial recovery)

import logging

logger = logging.getLogger(__name__)

class MonitoredCircuitBreaker(CircuitBreaker):
    """Circuit breaker with logging."""

    def _on_failure(self):
        super()._on_failure()
        info = self.state_info()
        logger.warning(
            f"Request failed. Circuit state: {info['state']}, "
            f"Failures: {info['failure_count']}"
        )

    def call(self, func, *args, **kwargs):
        try:
            result = super().call(func, *args, **kwargs)
            logger.info("Request succeeded")
            return result
        except CircuitBreakerOpen:
            logger.error("Circuit is open; request rejected")
            raise
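Logging covers individual requests, but the first two metrics in the list (how often the circuit opens and how long recovery takes) need state transitions to be recorded. The following is a minimal sketch under assumptions: `BreakerMetrics` and `record_transition` are hypothetical names, states are passed as the plain strings "closed", "open", and "half_open", and the breaker would have to call `record_transition` whenever it changes state.

```python
import time
from typing import Callable, List, Optional


class BreakerMetrics:
    """Collects open counts and recovery durations from state transitions."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock                     # injectable for testing
        self.open_count = 0                     # closed -> open transitions
        self.recovery_durations: List[float] = []
        self._opened_at: Optional[float] = None

    def record_transition(self, old_state: str, new_state: str) -> None:
        now = self._clock()
        if old_state == "closed" and new_state == "open":
            # A new outage begins. Reopening from half-open is not counted
            # again, so one outage yields exactly one open and one duration.
            self.open_count += 1
            self._opened_at = now
        elif new_state == "closed" and self._opened_at is not None:
            # Recovery time runs from the first open until the circuit closes.
            self.recovery_durations.append(now - self._opened_at)
            self._opened_at = None

    def average_recovery_time(self) -> Optional[float]:
        """Mean seconds from opening to closing, or None if never recovered."""
        if not self.recovery_durations:
            return None
        return sum(self.recovery_durations) / len(self.recovery_durations)
```

In a real deployment these numbers would typically be exported to whatever metrics system is already in use rather than kept in memory.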

Key Takeaways

  • Circuit breakers prevent cascading failures by rejecting requests when an API is struggling.
  • States are closed (normal), open (rejecting), and half-open (testing recovery).
  • Circuit opens when errors exceed a threshold; it half-opens after a timeout; it closes when tests succeed.
  • Use a separate circuit breaker for each critical external dependency (OpenAI, Anthropic, database).
  • Monitor circuit state to detect and alert on persistent outages.

Frequently Asked Questions

When should I increase the failure threshold?

If your threshold is too low, the breaker opens too quickly and users see "API unavailable" errors. If it is too high, you waste quota hammering a struggling service. A threshold of 5 failures in 60 seconds is reasonable for most APIs. Increase it if your API is naturally flaky; decrease it if you can afford stricter rejection.
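A rule such as "5 failures in 60 seconds" needs timestamps, not just a counter; the Python class shown earlier counts consecutive failures instead. The sketch below shows one possible way to count failures inside a sliding time window. `SlidingWindowFailureCounter` is a hypothetical helper, not part of the implementation above.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowFailureCounter:
    """Tracks failure timestamps and reports when too many fall in the window."""

    def __init__(
        self,
        threshold: int = 5,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._clock = clock                      # injectable for testing
        self._failures: Deque[float] = deque()   # oldest timestamp first

    def record_failure(self) -> bool:
        """Record a failure; return True if the threshold is now reached."""
        now = self._clock()
        self._failures.append(now)
        self._evict_expired(now)
        return len(self._failures) >= self.threshold

    def _evict_expired(self, now: float) -> None:
        # Drop failures that are window_seconds old or older.
        cutoff = now - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
```

With a window, a handful of failures spread over several hours never trips the breaker, while a short burst of the same size does.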

Should each API provider have its own circuit breaker?

Yes. Each provider (OpenAI, Anthropic) should have a separate breaker. If OpenAI is down but Anthropic is healthy, you want to reject OpenAI requests but continue trying Anthropic.
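The failover this enables can be sketched as follows. Everything here is illustrative: `SimpleBreaker` is a deliberately reduced stand-in (it opens after consecutive failures and never recovers) so the example runs on its own, and `call_first_available` is a hypothetical helper name.

```python
from typing import Any, Callable, List, Tuple


class CircuitBreakerOpen(Exception):
    """Stand-in for the exception raised when a breaker rejects a call."""


class SimpleBreaker:
    """Reduced stand-in: opens after N consecutive failures, no half-open."""

    def __init__(self, failure_threshold: int = 2):
        self.failure_threshold = failure_threshold
        self.failure_count = 0

    def call(self, func: Callable[[], Any]) -> Any:
        if self.failure_count >= self.failure_threshold:
            raise CircuitBreakerOpen("circuit is open")
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            raise
        self.failure_count = 0
        return result


Provider = Tuple[str, SimpleBreaker, Callable[[], Any]]


def call_first_available(providers: List[Provider]) -> Tuple[str, Any]:
    """Try providers in order, each through its own breaker.

    Returns (provider name, result) from the first provider that succeeds.
    """
    errors = []
    for name, breaker, func in providers:
        try:
            return name, breaker.call(func)
        except Exception as exc:  # rejected by the breaker or failed outright
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Because each provider has its own failure count, an outage at the first provider never affects whether the second one is tried.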

Can I have nested circuit breakers?

Yes, but it is complex. If you have a circuit breaker for your API and another for the upstream LLM API, they can interact in unexpected ways. Prefer flat, independent breakers per dependency.

What is the difference between circuit breaker and exponential backoff?

Exponential backoff spaces out retries to give a service time to recover. A circuit breaker stops sending requests altogether when conditions are bad. Use both: exponential backoff for occasional errors, a circuit breaker for systemic failures.
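A rough sketch of combining the two: retry ordinary failures with exponential backoff, but give up at once when the breaker rejects the call. `retry_with_backoff` and its parameters are hypothetical, the jitter strategy (a random delay between zero and the current cap) is one common choice among several, and `CircuitBreakerOpen` is redefined only to keep the example self-contained.

```python
import random
import time
from typing import Any, Callable


class CircuitBreakerOpen(Exception):
    """Stand-in for the exception raised when the breaker rejects a call."""


def retry_with_backoff(
    call: Callable[[], Any],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `call` up to max_attempts times, backing off between failures.

    `call` is expected to go through the breaker, for example
    `lambda: breaker.call(call_openai)`.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except CircuitBreakerOpen:
            # The breaker is open: retrying now would defeat failing fast.
            raise
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Delay cap doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Each failed attempt still passes through the breaker, so a run of failed retries counts toward the failure threshold like any other failures.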

Further Reading