Rate Limiting AI Requests: Prevent Abuse
Without rate limiting, a malicious user can exhaust your LLM budget in minutes. A single loop making 1,000 requests per second can run up thousands of dollars in charges and degrade service for legitimate users. Rate limiting caps request volume per user or API key, which keeps service quality and costs predictable. This article covers the token bucket algorithm, distributed rate limiting across multiple servers, and how to communicate limits to users.
What is Rate Limiting?
Rate limiting is a traffic control mechanism that restricts the number of requests a user or API key can make within a time window (e.g., 100 requests per minute). When the limit is exceeded, the server rejects further requests with a 429 (Too Many Requests) HTTP status and includes a Retry-After header telling the client when to retry. Rate limiting prevents both accidental abuse (a client with a buggy loop) and malicious attacks (a bot trying to enumerate valid prompts or steal credentials).
Token Bucket Algorithm
The token bucket algorithm is one of the most widely used rate-limiting strategies for APIs:
- Each user/API key gets a bucket that can hold N tokens (capacity).
- Tokens are added at a fixed rate (e.g., 10 tokens per second).
- Each request costs 1 token. If no tokens are available, the request is denied.
- Unused tokens accumulate up to the bucket capacity, which is what allows bursts.
This allows users to make occasional bursts of requests (up to the bucket capacity) while maintaining an average rate.
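The rules above can be sketched as a single-process bucket before bringing Redis into the picture. This is a minimal illustration, not a production limiter: the class name `InMemoryTokenBucket` and the injectable `clock` parameter are assumptions made here so the behavior can be tested without real waiting.

```python
import time
from typing import Callable


class InMemoryTokenBucket:
    """Single-process token bucket (illustrative; state is not shared across servers)."""

    def __init__(
        self,
        rate_per_second: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.rate_per_second = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # a new bucket starts full
        self.last_refill = clock()

    def allow(self) -> bool:
        """Refill based on elapsed time, then try to consume one token."""
        now = self.clock()
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate_per_second=10` and `capacity=100`, a fresh bucket admits 100 back-to-back requests, after which requests are admitted only as tokens trickle back in at 10 per second.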
Redis-Based Token Bucket
# Rate limiting with a Redis-backed token bucket
import time

import redis.asyncio as redis  # async client, so is_allowed() does not block the event loop

# Lua script: refill and consume atomically inside Redis.
# This prevents race conditions when several app servers share one bucket.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens based on elapsed time
local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + elapsed * rate)

-- Try to consume one token
if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)  -- Expire idle buckets after 1 hour
    return 1  -- Allowed
else
    return 0  -- Denied
end
"""


class TokenBucketRateLimiter:
    def __init__(
        self,
        redis_url: str,
        rate_per_second: float = 10.0,
        bucket_capacity: int = 100,
    ):
        self.redis = redis.from_url(redis_url)
        self.rate_per_second = rate_per_second
        self.bucket_capacity = bucket_capacity
        # Register the script once instead of reloading it on every request
        self._script = self.redis.register_script(TOKEN_BUCKET_LUA)

    async def is_allowed(self, key: str) -> bool:
        """Check if a request is allowed and consume a token if so."""
        # time.time() is a true Unix timestamp; calling .timestamp() on the naive
        # datetime from utcnow() is skewed by the local UTC offset.
        # Assumes app server clocks are kept in sync (e.g., via NTP).
        now = time.time()
        result = await self._script(
            keys=[key],
            args=[self.bucket_capacity, self.rate_per_second, now],
        )
        return bool(result)
FastAPI Middleware for Rate Limiting
# Integrate the token bucket into your API
import os

import jwt  # PyJWT
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

# Assumption: the signing secret comes from the environment.
# validate_api_key() is your application's own API key lookup, defined elsewhere.
JWT_SECRET = os.environ["JWT_SECRET"]

app = FastAPI()
limiter = TokenBucketRateLimiter(
    redis_url=os.environ["REDIS_URL"],
    rate_per_second=10.0,
    bucket_capacity=100,
)


@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    """Check rate limits before processing any request."""
    # Extract the user/API key identifier
    # For authenticated users, use organization_id; for API keys, use key_id
    identifier = get_rate_limit_identifier(request)
    if not identifier:
        return JSONResponse(
            status_code=401,
            content={"error": "Unauthorized"},
        )

    # Check rate limit (this consumes one token when the request is allowed)
    allowed = await limiter.is_allowed(f"ratelimit:{identifier}")
    limit_header = str(limiter.bucket_capacity)
    if not allowed:
        return JSONResponse(
            status_code=429,
            content={"error": "Rate limit exceeded"},
            headers={
                "Retry-After": "1",  # Try again in 1 second
                "X-RateLimit-Limit": limit_header,
                "X-RateLimit-Remaining": "0",
            },
        )

    response = await call_next(request)
    # Do not call is_allowed() again here: it would consume a second token.
    # An accurate X-RateLimit-Remaining needs the limiter to return the
    # remaining token count, so only the limit is advertised on success.
    response.headers["X-RateLimit-Limit"] = limit_header
    return response


def get_rate_limit_identifier(request: Request) -> str | None:
    """Extract rate limit key from request."""
    # Check for JWT token
    auth_header = request.headers.get("Authorization", "")
    if auth_header.startswith("Bearer "):
        token = auth_header[len("Bearer "):]
        try:
            payload = jwt.decode(token, JWT_SECRET, algorithms=["HS256"])
            return f"user:{payload['organization_id']}"
        except (jwt.InvalidTokenError, KeyError):
            pass  # Invalid token or missing claim; fall through to the other checks

    # Check for API key
    api_key = request.headers.get("X-API-Key", "")
    if api_key:
        key_info = validate_api_key(api_key)
        if key_info:
            return f"key:{key_info['key_id']}"

    # Fall back to IP address (less reliable, but better than nothing)
    if request.client is None:
        return None
    return f"ip:{request.client.host}"
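The middleware hard-codes `Retry-After: 1`. If the limiter exposed the bucket's current (fractional) token count, the wait could be derived from the refill rate instead. The helper below is a sketch under that assumption; `retry_after_seconds` is a hypothetical name, and the value is rounded up because `Retry-After` carries whole seconds.

```python
import math


def retry_after_seconds(tokens: float, rate_per_second: float, cost: float = 1.0) -> int:
    """Whole seconds until the bucket holds `cost` tokens; 0 if it already does."""
    if rate_per_second <= 0:
        raise ValueError("rate_per_second must be positive")
    deficit = cost - tokens
    if deficit <= 0:
        return 0
    # Round up, and never advertise less than 1 second when a wait is required
    return max(1, math.ceil(deficit / rate_per_second))
```

At 10 tokens per second an empty bucket needs only 0.1 s for its next token, so the header still reads 1. The computed value starts to matter for slow refill rates, such as a free tier refilling at a fraction of a token per second.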
Tiered Rate Limits
Different plans should have different limits. Free tier users get lower limits; premium users get higher limits.
# Tiered rate limiting based on subscription plan
from dataclasses import dataclass


@dataclass
class RateLimitTier:
    name: str
    requests_per_minute: int
    requests_per_hour: int
    requests_per_day: int


RATE_LIMIT_TIERS = {
    "free": RateLimitTier("Free", 10, 500, 5000),
    "pro": RateLimitTier("Pro", 100, 10000, 100000),
    "enterprise": RateLimitTier("Enterprise", 1000, 100000, 1000000),
}


# `db` (a SQLAlchemy session) and the `Organization` model come from your application.
async def get_rate_limit_for_user(organization_id: str) -> RateLimitTier:
    """Fetch the user's subscription plan and return its rate limit tier."""
    # Note: this query is synchronous; in an async app, consider an async
    # session or caching the plan so the event loop is not blocked.
    org = db.query(Organization).filter(
        Organization.id == organization_id
    ).first()
    if not org:
        return RATE_LIMIT_TIERS["free"]
    plan = org.subscription_plan or "free"
    # Unknown plan names fall back to the free tier
    return RATE_LIMIT_TIERS.get(plan, RATE_LIMIT_TIERS["free"])
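The tiers are expressed per minute, hour, and day, while the token bucket limiter takes a per-second rate and a capacity. One possible mapping for the per-minute figure is sketched below. `BucketParams`, `bucket_params_for`, and the `burst_seconds` knob are hypothetical names introduced for illustration; the hourly and daily limits would need their own buckets or counters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketParams:
    rate_per_second: float
    bucket_capacity: int


def bucket_params_for(requests_per_minute: int, burst_seconds: float = 10.0) -> BucketParams:
    """Map a per-minute limit to token bucket settings.

    The refill rate is the per-minute limit spread evenly over 60 seconds.
    The capacity is sized to hold `burst_seconds` worth of refill (an assumed policy).
    """
    rate = requests_per_minute / 60
    capacity = max(1, round(rate * burst_seconds))
    return BucketParams(rate_per_second=rate, bucket_capacity=capacity)
```

Under this mapping, 600 requests per minute becomes 10 tokens per second with a 100-token bucket. A larger `burst_seconds` permits bigger bursts, at the cost of letting a client briefly exceed the nominal per-minute figure.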
Handling Rate Limit Errors on the Client
Clients should respect rate limit headers and implement exponential backoff.
// Client-side retry: honor Retry-After, otherwise back off exponentially
async function makeRequestWithRetry(url, options, maxRetries = 3) {
  const baseDelayMs = 1000;
  let retryCount = 0;
  while (retryCount < maxRetries) {
    try {
      const response = await fetch(url, options);
      if (response.status === 429) {
        // Rate limited; prefer the server's Retry-After (in seconds).
        // If it is missing or not a number, fall back to 1s, 2s, 4s, ...
        const retryAfter = Number.parseInt(response.headers.get("Retry-After"), 10);
        const waitMs = Number.isNaN(retryAfter)
          ? baseDelayMs * 2 ** retryCount
          : retryAfter * 1000;
        console.warn(`Rate limited. Retrying in ${waitMs}ms`);
        await new Promise(resolve => setTimeout(resolve, waitMs));
        retryCount++;
        continue;
      }
      return response;
    } catch (error) {
      // Network errors are not retried here; surface them to the caller
      console.error(`Request failed: ${error}`);
      throw error;
    }
  }
  throw new Error("Max retries exceeded");
}
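For Python clients, the delay calculation can be isolated in a small function. This is a sketch rather than a prescribed policy: `backoff_delay`, its `base` and `cap` defaults, and the use of full jitter (a random delay between zero and the exponential ceiling) are assumptions chosen for illustration.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # The server's instruction wins when it is a plain number of seconds
            return float(max(0, int(retry_after)))
        except ValueError:
            pass  # e.g. an HTTP-date; fall back to exponential backoff
    # Exponential ceiling (base, 2*base, 4*base, ...) limited by `cap`,
    # scaled by a random factor so clients do not retry in lockstep
    ceiling = min(cap, base * 2 ** attempt)
    return ceiling * rng()
```

A caller would sleep for `backoff_delay(attempt, response.headers.get("Retry-After"))` seconds after each 429 and give up after a fixed number of attempts.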
Rate Limiting Strategies Comparison
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Per-second | Simple, fast decision | Too strict for bursty traffic | Protecting from DoS |
| Token bucket | Allows bursts, fair | Requires distributed state | API SaaS (recommended) |
| Sliding window | Accurate rate | Complex, memory-intensive | Financial APIs |
| Fixed window | Simple, low overhead | Allows boundary spikes | Non-critical services |
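The "boundary spikes" entry for fixed windows is easiest to see side by side with a sliding window. The two in-memory classes below are minimal sketches written for this comparison (the names `FixedWindowCounter` and `SlidingWindowLog` are illustrative); both take the current time as an argument so the behavior is reproducible.

```python
from collections import deque
from typing import Deque, Optional


class FixedWindowCounter:
    """Counts requests per aligned window (0-60s, 60-120s, ...)."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_index: Optional[int] = None
        self.count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window_seconds)
        if index != self.window_index:
            # A new window has started: the counter resets to zero
            self.window_index = index
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLog:
    """Keeps a timestamp per accepted request; memory grows with the limit."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        cutoff = now - self.window_seconds
        # Drop requests that have aged out of the trailing window
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

With a limit of 5 per 60 seconds, five requests at t=59 followed by five at t=61 all pass the fixed window, because the counter resets at t=60. The sliding log rejects the second batch, since the first five are still inside the trailing 60 seconds.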
Key Takeaways
- Implement token bucket rate limiting with Redis to allow bursts while maintaining average rate.
- Use a Lua script for atomic read-modify-write to avoid race conditions in distributed systems.
- Different subscription tiers should have different rate limits (free: 10 req/min, pro: 100 req/min).
- Always return Retry-After headers with 429 responses so clients know when to retry.
- Monitor rate limit metrics to detect unusual traffic patterns (possible abuse).
Frequently Asked Questions
What happens if a user hits their rate limit mid-request?
Deny the request immediately with 429. Do not start processing the LLM call; it will waste compute and cost money. The rate limiting middleware should check limits before the request reaches your core logic.
How do I handle legitimate burst traffic?
Use a token bucket whose capacity is larger than its per-second refill rate. For example, 10 tokens/sec with a 100-token capacity lets a user make up to 100 requests instantly; after that, the bucket refills at 10 tokens/sec, so sustained traffic is capped at 10 requests per second. This handles legitimate bursts (e.g., a user loading multiple pages at once) while preventing sustained abuse.
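The burst arithmetic can be checked with two small helpers. They are a sketch that assumes the bucket starts full and each request costs one token; the function names are illustrative.

```python
import math


def max_requests(capacity: int, rate_per_second: float, window_seconds: float) -> int:
    """Upper bound on requests admitted in a window that starts with a full bucket."""
    # The initial burst drains the bucket; everything after that is paced by the refill
    return capacity + math.floor(rate_per_second * window_seconds)


def seconds_to_refill(capacity: int, rate_per_second: float) -> float:
    """Time for an empty bucket to become full again."""
    return capacity / rate_per_second
```

For a 100-token bucket refilling at 10 tokens/sec, `max_requests(100, 10, 0)` gives the 100-request burst, `max_requests(100, 10, 60)` gives 700 requests over the first minute, and `seconds_to_refill(100, 10)` shows that a drained bucket needs 10 seconds to recover its full burst allowance.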
Should I rate limit by IP or by user?
By user/API key, preferably. Rate limiting by IP is unreliable (multiple users behind a NAT will share an IP) and prone to false positives. Always authenticate requests and limit by authenticated identity.
How do I detect and respond to rate limit abuse?
Monitor users who consistently hit their limits, and log these events. If a user exceeds their limit more than 50 times per day, email them an offer to upgrade to a higher tier. If abuse continues, consider temporarily disabling their API key.
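The daily threshold described above can be tracked with a simple counter keyed by identifier and day. The sketch below keeps counts in memory for illustration only; `RateLimitEventLog` and `record_rejection` are hypothetical names, and a real deployment would more likely persist the counts (for example in Redis or a metrics system).

```python
from collections import Counter
from datetime import date
from typing import Tuple


class RateLimitEventLog:
    """Counts 429 responses per identifier per day (in-memory, illustrative)."""

    def __init__(self, daily_threshold: int = 50):
        self.daily_threshold = daily_threshold
        self.counts: Counter = Counter()

    def record_rejection(self, identifier: str, day: date) -> bool:
        """Record one 429; return True only when the threshold is first exceeded that day."""
        key: Tuple[str, date] = (identifier, day)
        self.counts[key] += 1
        # Fire once per identifier per day, on the first rejection past the threshold
        return self.counts[key] == self.daily_threshold + 1
```

The middleware would call `record_rejection` whenever it returns a 429, and a `True` result is the point at which to trigger the upgrade email or flag the key for review.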