
Handling Rate Limits: Best Practices for LLM APIs

Rate limiting is how an API provider enforces fair resource usage. Every LLM API has limits, and the exact numbers depend on the provider, the model, and your account tier: a plan might allow, for example, 3,500 requests per minute and 450,000 tokens per minute. OpenAI, Anthropic, and other providers may enforce several quotas at once, such as per-minute and per-day limits. When you exceed any of them, the API returns a 429 Too Many Requests response, telling you to slow down. Handling rate limits gracefully means respecting these signals, spreading requests over time, and avoiding bans or throttling.

The key insight is that rate limits are not failures; they are information. The headers on a 429 response, and on ordinary successful responses, tell you how much capacity you have left and how long to wait. By parsing these headers and respecting them, you avoid cascading failures, keep your API key in good standing, and maximize your effective throughput.

Understanding Rate Limit Headers

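Most API providers expose rate limit details via HTTP response headers. Header names and value formats differ between providers (Anthropic, for example, prefixes its headers with anthropic-ratelimit-), so check your provider's documentation. The examples below use OpenAI-style x-ratelimit-* names: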

  • x-ratelimit-limit-requests: Your request quota for this period (e.g., 3,500 requests/minute)
  • x-ratelimit-limit-tokens: Your token quota for this period (e.g., 450,000 tokens/minute)
  • x-ratelimit-remaining-requests: Requests still available before hitting the limit
  • x-ratelimit-remaining-tokens: Tokens still available before hitting the limit
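  • x-ratelimit-reset-requests: When the request quota resets. The format varies by provider: OpenAI, for example, reports the time remaining as a duration such as 1s or 6m0s, while other providers send a timestamp
  • x-ratelimit-reset-tokens: When the token quota resets, in the same format as the request reset value
  • retry-after: How long to wait before retrying (sent on 429), given either as a number of seconds or as an HTTP-date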

Parse these headers on every response, not just 429 errors. They give you real-time visibility into your consumption:

import requests
from datetime import datetime, timezone

def parse_rate_limit_headers(response: requests.Response) -> dict:
    """Extract and return rate limit details from response headers."""
    h = response.headers
    return {
        "requests_limit": int(h.get("x-ratelimit-limit-requests", 0)),
        "requests_remaining": int(h.get("x-ratelimit-remaining-requests", 0)),
        # Reset values stay as raw strings: the format varies by provider
        # (a duration such as "6m0s" or a timestamp), so int() is not safe.
        "requests_reset": h.get("x-ratelimit-reset-requests"),
        "tokens_limit": int(h.get("x-ratelimit-limit-tokens", 0)),
        "tokens_remaining": int(h.get("x-ratelimit-remaining-tokens", 0)),
        "tokens_reset": h.get("x-ratelimit-reset-tokens"),
        "retry_after": h.get("retry-after"),
    }

def make_request_with_rate_limit_tracking(url: str, api_key: str, payload: dict):
    """Make request and track rate limit status."""
    headers = {"Authorization": f"Bearer {api_key}"}

    try:
        response = requests.post(url, headers=headers, json=payload, timeout=30)
        response.raise_for_status()

        # Parse and log rate limits on success
        limits = parse_rate_limit_headers(response)
        print(f"Requests remaining: {limits['requests_remaining']} / {limits['requests_limit']}")
        print(f"Tokens remaining: {limits['tokens_remaining']} / {limits['tokens_limit']}")

        return response.json()

    except requests.exceptions.HTTPError as e:
        if e.response is not None and e.response.status_code == 429:
            limits = parse_rate_limit_headers(e.response)
            # The key is always present, but its value is None if the header is missing
            retry_after = limits["retry_after"] or "unknown"
            print(f"Rate limited! Retry after: {retry_after}")
            print(f"Requests reset: {limits['requests_reset']}")
        raise
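
Reset values are not formatted the same way everywhere, so it can help to normalize them into "seconds from now" before acting on them. The following is a minimal sketch, not a provider-endorsed parser: the helper name seconds_until_reset is hypothetical, and it assumes the value is either a Go-style duration string (such as 6m0s or 250ms) or an ISO 8601 / RFC 3339 timestamp. Anything else is reported as unknown rather than guessed.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches one number+unit pair, e.g. "6m", "0s", "250ms", "1.5s"
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}

def seconds_until_reset(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Best-effort conversion of a reset header value to seconds from now.

    Hypothetical helper. Assumes the value is a duration string or an
    ISO 8601 timestamp; returns None for anything it cannot interpret.
    """
    value = value.strip()
    if not value:
        return None

    # Case 1: duration string such as "6m0s". Only accept it if the
    # matched parts account for the entire value.
    parts = _DURATION_PART.findall(value)
    if parts and "".join(number + unit for number, unit in parts) == value:
        return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)

    # Case 2: ISO 8601 timestamp. "Z" is rewritten because older Python
    # versions of fromisoformat() do not accept it.
    try:
        reset_at = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if reset_at.tzinfo is None:
        reset_at = reset_at.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())
```

Returning None for unrecognized input keeps the caller in control: it can fall back to a fixed delay instead of sleeping for a misread duration.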

Token Bucket Algorithm for Client-Side Rate Limiting

Rather than waiting for the server to send 429, proactive clients implement client-side rate limiting using the token bucket algorithm. This algorithm allows you to track your quota locally, smooth out bursty traffic, and fail fast before hitting the server limit.

The algorithm works like this: imagine a bucket that holds tokens. A bucket token is a unit of quota; it can stand for one request or for some number of LLM tokens, depending on which limit you are tracking. Tokens fill the bucket at a constant rate (determined by your quota) up to a maximum capacity. When you make a request, you remove tokens from the bucket. If the bucket does not hold enough tokens, you wait until it refills. This naturally throttles traffic to match your quota.
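
The arithmetic behind the bucket is small enough to show on its own. This sketch uses two hypothetical helper functions (refill and wait_needed are illustrative names, not part of any library) and an assumed quota of 60 requests per minute, which works out to one token per second.

```python
def refill(tokens: float, capacity: float, refill_rate: float, elapsed: float) -> float:
    """Bucket level after `elapsed` seconds, capped at `capacity`."""
    return min(capacity, tokens + elapsed * refill_rate)

def wait_needed(tokens: float, needed: float, refill_rate: float) -> float:
    """Seconds to wait until `needed` tokens are available (0 if already there)."""
    return max(0.0, (needed - tokens) / refill_rate)

# Assumed quota: 60 requests/minute, i.e. 1 token per second
CAPACITY = 60.0
RATE = 60.0 / 60.0

# An empty bucket gains 5 tokens in 5 seconds
level = refill(tokens=0.0, capacity=CAPACITY, refill_rate=RATE, elapsed=5.0)

# A nearly full bucket cannot exceed its capacity, however long it idles
capped = refill(tokens=58.0, capacity=CAPACITY, refill_rate=RATE, elapsed=10.0)

# With half a token in the bucket, one request must wait half a second
delay = wait_needed(tokens=0.5, needed=1.0, refill_rate=RATE)
```

The cap in refill is what bounds bursts: idle time never banks more than one bucket's worth of requests.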

Here is a thread-safe implementation:

import time
from threading import Lock

class TokenBucket:
    """Thread-safe token bucket for rate limiting."""

    def __init__(self, capacity: int, refill_rate: float):
        """
        Args:
            capacity: Maximum tokens in the bucket (e.g., 3500 for 3500 req/min)
            refill_rate: Tokens per second (e.g., 3500/60 = 58.33 for 3500 req/min)
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        # Monotonic clock: not affected by system clock adjustments
        self.last_refill_time = time.monotonic()
        self.lock = Lock()

    def _refill(self) -> None:
        """Add tokens for the elapsed time. The caller must hold the lock."""
        now = time.monotonic()
        elapsed = now - self.last_refill_time
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill_time = now

    def consume(self, tokens: int = 1, wait: bool = True) -> bool:
        """
        Attempt to consume tokens. If wait=True, block until tokens are available.

        Returns:
            True if tokens were consumed, False if not enough tokens and wait=False.
        """
        if tokens > self.capacity:
            # The bucket can never hold this many tokens, so waiting would never end
            raise ValueError("Requested more tokens than the bucket capacity")

        while True:
            with self.lock:
                self._refill()

                # Check if we can consume
                if self.tokens >= tokens:
                    self.tokens -= tokens
                    return True

                if not wait:
                    return False

                # Time until the missing tokens have been refilled
                wait_time = (tokens - self.tokens) / self.refill_rate

            # Sleep outside the lock so other threads are not blocked,
            # then loop to re-check (another thread may have consumed first)
            time.sleep(wait_time)

    def available(self) -> int:
        """Return the current number of available tokens without consuming."""
        with self.lock:
            self._refill()
            return int(self.tokens)

Usage with OpenAI API:

import os

# Example rate limit: 3,500 requests per minute.
# A full bucket allows a burst of up to `capacity` requests at once;
# use a smaller capacity if you want smoother traffic.
bucket = TokenBucket(capacity=3500, refill_rate=3500 / 60)

def call_openai_with_rate_limiting(prompt: str) -> dict:
    """Make OpenAI request with client-side rate limiting."""
    # Read the key from the environment rather than hard-coding it
    api_key = os.environ["OPENAI_API_KEY"]

    # Wait until we have capacity
    bucket.consume(1, wait=True)

    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

Token buckets are especially useful for batch workloads. Instead of guessing delays, you let the algorithm handle throttling automatically.

Handling 429 Responses in Production

Even with client-side rate limiting, you may hit the server limit (other clients may consume quota, or your estimate may be off). When you receive a 429:

  1. Always parse the Retry-After header if present. It is the authoritative signal.
  2. Back off exponentially if Retry-After is not present.
  3. Do not retry aggressively. Repeated 429s indicate you have exceeded your quota; more retries will not help.

Here is a robust handler:

from email.utils import parsedate_to_datetime

def call_with_429_handling(
    func,
    max_retries: int = 3,
    base_delay: float = 1.0
):
    """Retry on 429 with exponential backoff, respecting Retry-After."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except requests.exceptions.HTTPError as e:
            is_429 = e.response is not None and e.response.status_code == 429
            # Re-raise other HTTP errors, and the 429 from the final attempt
            if not is_429 or attempt == max_retries:
                raise

            # Default: exponential backoff, used when Retry-After is
            # missing or cannot be parsed
            retry_after = base_delay * (2 ** attempt)

            retry_after_str = e.response.headers.get("retry-after")
            if retry_after_str:
                try:
                    # Retry-After can be seconds or HTTP-date
                    retry_after = float(retry_after_str)
                except ValueError:
                    try:
                        # Assume it's an HTTP-date; parse it
                        retry_time = parsedate_to_datetime(retry_after_str)
                        retry_after = (retry_time - datetime.now(timezone.utc)).total_seconds()
                    except (TypeError, ValueError):
                        pass  # Unparseable header: keep the backoff delay

            # A date in the past gives a negative delay, which time.sleep rejects
            retry_after = max(0.0, retry_after)
            print(f"Rate limited. Waiting {retry_after:.1f}s...")
            time.sleep(retry_after)
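
When many workers hit a 429 at the same moment, plain exponential backoff makes them all retry at the same moment too. A common refinement is to randomize the delay ("full jitter"). The sketch below is one way to compute such a delay; the function name backoff_delay and the max_delay cap of 60 seconds are assumptions for illustration, not part of the handler above.

```python
import random

def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    rng=random,
) -> float:
    """Full-jitter backoff: a random delay in [0, min(max_delay, base * 2**attempt)].

    `rng` only needs a uniform(a, b) method, so a seeded random.Random
    instance can be passed in for reproducible tests.
    """
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

# Example: delays for the first five attempts with a seeded generator
seeded = random.Random(42)
delays = [backoff_delay(attempt, rng=seeded) for attempt in range(5)]
```

Jitter is meant for the fallback path only. When the server sends Retry-After, that value is still the one to honor.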

Multi-Tier Rate Limiting

Advanced applications manage different rate limit buckets for different resources. For example, you might have separate limits for chat completions, embeddings, and fine-tuning. Track each separately:

class RateLimitManager:
    """Manage multiple token buckets for different API endpoints."""

    def __init__(self):
        self.buckets = {
            "chat": TokenBucket(capacity=3500, refill_rate=3500 / 60),
            "embeddings": TokenBucket(capacity=1000, refill_rate=1000 / 60),
            "fine-tune": TokenBucket(capacity=100, refill_rate=100 / 60),
        }

    def consume(self, resource: str, tokens: int = 1) -> bool:
        """Consume tokens from the appropriate bucket."""
        if resource not in self.buckets:
            raise ValueError(f"Unknown resource: {resource}")
        return self.buckets[resource].consume(tokens, wait=True)
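
The same idea applies when one endpoint is metered on two axes at once, requests and LLM tokens. The sketch below gates each call on both quotas. It is deliberately simplified and makes several assumptions: SimpleBucket and DualLimiter are hypothetical names, the code is single-threaded, the caller supplies its own estimate of the tokens a request will use, and the clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time

class SimpleBucket:
    """Minimal single-threaded token bucket (hypothetical helper for this sketch)."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.clock = clock
        self.level = float(capacity)
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.level = min(self.capacity, self.level + (now - self.updated) * self.refill_rate)
        self.updated = now

    def wait_for(self, amount: float) -> float:
        """Seconds until `amount` is available (0 if it already is)."""
        self._refill()
        return max(0.0, (amount - self.level) / self.refill_rate)

    def take(self, amount: float) -> None:
        """Deduct `amount`; the level may dip below zero for oversized requests."""
        self._refill()
        self.level -= amount

class DualLimiter:
    """Gate each call on both a request quota and a token quota."""

    def __init__(self, requests_per_min: float, tokens_per_min: float, clock=time.monotonic):
        self.requests = SimpleBucket(requests_per_min, requests_per_min / 60, clock)
        self.tokens = SimpleBucket(tokens_per_min, tokens_per_min / 60, clock)

    def delay_for(self, estimated_tokens: float) -> float:
        # The slower of the two quotas decides how long to wait
        return max(self.requests.wait_for(1), self.tokens.wait_for(estimated_tokens))

    def acquire(self, estimated_tokens: float, sleep=time.sleep) -> None:
        delay = self.delay_for(estimated_tokens)
        if delay > 0:
            sleep(delay)
        self.requests.take(1)
        self.tokens.take(estimated_tokens)
```

A caller would run limiter.acquire(estimated_tokens) before each request. Because the estimate is only a guess, the provider's remaining-token headers are still the better source of truth when they are available.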

Key Takeaways

  • Rate limits are quotas enforced by the API provider; a 429 response signals you have exceeded yours.
  • Always parse rate limit headers (x-ratelimit-remaining-requests, retry-after) to understand your current consumption.
  • Implement client-side rate limiting with token bucket algorithms to proactively throttle requests and avoid 429 errors.
  • Respect the Retry-After header when you do encounter 429; never retry aggressively on rate limit errors.
  • For critical workloads, use separate buckets for different resource types so that heavy traffic to one endpoint cannot use up the capacity reserved for another.

Frequently Asked Questions

Is it better to use token buckets or just rely on exponential backoff?

Token buckets are better for steady-state workloads because they spread requests evenly and avoid clustering. Exponential backoff is good for occasional errors. Use both: token buckets for normal operation, exponential backoff for unexpected failures like 503.

What if my rate limit resets between requests I am processing?

Token buckets handle this automatically by tracking elapsed time and refilling tokens. Your effective quota increases as time passes, even within a single batch of requests.

Can I increase my rate limit?

Yes. Most API providers allow you to request higher limits based on your usage and account tier. Check your current limits in the provider's account dashboard or documentation, then contact support. In the meantime, work within your current limits and batch requests efficiently to maximize throughput.

Should I implement rate limiting if I am the only user of my API?

Yes. Rate limiting is good practice even for single users because it prevents accidental overages (e.g., infinite loops hammering the API). It is cheap insurance.

Further Reading