Rate Limiting LLM APIs: Tokens, Costs, and Quotas
Rate limiting LLM APIs is critical for three reasons: cost control, API compliance, and fairness. LLM APIs like OpenAI, Anthropic, and Google charge per token (not per request). For the models priced later in this article, that ranges from about $0.50 to $30 per 1M input tokens and from $1.50 to $60 per 1M output tokens. A single careless loop can burn through a monthly budget quickly. Additionally, LLM APIs enforce rate limits (tokens per minute, requests per minute, concurrent connections), and exceeding them causes HTTP 429 errors, degraded service, and, for repeated violations, possible account restrictions. This article covers token-aware rate limiting, per-user quotas, cost tracking, 429 retry handling, and strategies to minimize API spending.
Understanding LLM API Rate Limits and Pricing
Most LLM APIs enforce limits along several dimensions and bill per token on top of them:
| Dimension | Typical Limit | Why It Matters |
|---|---|---|
| Tokens per minute (TPM) | 10k–1M TPM | Total throughput ceiling. Varies by model and account tier; the examples in this article assume a 90k TPM limit. |
| Requests per minute (RPM) | 3–10k RPM | Request-count ceiling. Applies independently of token count, so many small requests can hit it first. |
| Concurrent requests | 10–100 | Max simultaneous connections. Prevents connection pool exhaustion. |
| Cost per token | $0.50–$60 per 1M | Input tokens are cheaper than output tokens. GPT-4 input: $30 per 1M ($0.03 per 1K). |
Exceeding any limit triggers rate-limit errors (HTTP 429), and repeated violations can lead to account restrictions. Within a given model, pricing is linear in tokens: 2x tokens = 2x cost, with input and output tokens priced separately.
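To make the linear pricing concrete, the sketch below computes the cost of a single request from its token counts. The function name `request_cost` is hypothetical, and the prices are illustrative assumptions that match the GPT-4 per-token figures used in the cost tracker later in this article; they are not current provider pricing.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Return the USD cost of one request; prices are quoted per 1M tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_1m
    output_cost = output_tokens / 1_000_000 * output_price_per_1m
    return input_cost + output_cost


# Assumed illustrative prices: $30 per 1M input tokens, $60 per 1M output tokens.
small = request_cost(1_000, 500, 30.0, 60.0)
double = request_cost(2_000, 1_000, 30.0, 60.0)
print(f"1k in / 500 out: ${small:.4f}")
print(f"2k in / 1k out:  ${double:.4f}")
```

Doubling both token counts doubles the cost, which is why trimming prompts and capping output length translate directly into savings.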
Token-Aware Rate Limiting
To stay within API limits, count tokens before sending requests and throttle on tokens per minute, not just requests per minute:
import asyncio
import os
import time

import aiohttp
import tiktoken

# Assumes the key is in the OPENAI_API_KEY environment variable; adjust to your setup.
YOUR_API_KEY = os.environ.get("OPENAI_API_KEY", "")


class TokenBucket:
    """Token bucket for rate limiting based on tokens, not requests."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.tokens = float(tokens_per_minute)
        # time.monotonic() needs no running event loop and never goes backwards.
        self.last_refill = time.monotonic()
        self.tokens_per_sec = tokens_per_minute / 60.0

    def _refill(self) -> None:
        """Credit tokens for the time elapsed since the last refill, capped at capacity."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.tokens_per_sec)
        self.last_refill = now

    async def consume(self, num_tokens: int) -> None:
        """
        Wait until num_tokens are available, then consume them.
        Suspends the caller (without blocking the event loop) while the bucket is too empty.
        """
        if num_tokens > self.capacity:
            # A request larger than the bucket could never be satisfied.
            raise ValueError(f"Request needs {num_tokens} tokens; capacity is {self.capacity}.")
        while True:
            # Refill on every pass so idle time before a large consume is not credited twice.
            self._refill()
            if self.tokens >= num_tokens:
                self.tokens -= num_tokens
                return
            # Sleep roughly until the deficit has been refilled (at least 10 ms).
            deficit = num_tokens - self.tokens
            await asyncio.sleep(max(deficit / self.tokens_per_sec, 0.01))


# Example limit: 90k tokens/minute. Check your account's actual limit.
bucket = TokenBucket(tokens_per_minute=90000)


async def fetch_llm_with_token_limit(prompt: str, model: str = "gpt-4") -> str:
    """Fetch LLM response while respecting token rate limit."""
    # Count tokens before requesting.
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = len(enc.encode(prompt))

    # Estimate output tokens (rough guess: 1.2x input).
    estimated_output = int(prompt_tokens * 1.2)
    total_tokens = prompt_tokens + estimated_output

    # Wait for tokens to be available.
    await bucket.consume(total_tokens)

    # Send request (within the estimated budget).
    return await aiohttp_llm_call(prompt, model)


async def aiohttp_llm_call(prompt: str, model: str) -> str:
    """Actual API call (simplified)."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.openai.com/v1/chat/completions",
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
            headers={"Authorization": f"Bearer {YOUR_API_KEY}"},
        ) as resp:
            # Surface 429s and other errors here instead of a KeyError below.
            resp.raise_for_status()
            data = await resp.json()
            return data["choices"][0]["message"]["content"]
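This pattern keeps you under the API's token limit as long as the output estimate holds, which avoids most 429 errors and billing surprises. The estimate can be wrong, so still handle 429 responses as described in the retry section later in this article.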
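Because the table above lists requests per minute as a limit independent of tokens, a client may need to enforce both budgets at once. The following is a minimal sketch of that idea using a sliding window rather than a token bucket. The class name `DualWindowLimiter` and its parameters are hypothetical, and the `now` argument exists only to make the behaviour easy to check.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class DualWindowLimiter:
    """Admit a request only if both the request and token budgets allow it."""

    def __init__(self, max_requests: int, max_tokens: int, window_sec: float = 60.0):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window_sec = window_sec
        self._events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self._tokens_in_window = 0

    def try_acquire(self, tokens: int, now: Optional[float] = None) -> bool:
        """Record the request and return True if it fits in the current window."""
        now = time.monotonic() if now is None else now
        # Drop events that have aged out of the sliding window.
        while self._events and now - self._events[0][0] >= self.window_sec:
            _, old_tokens = self._events.popleft()
            self._tokens_in_window -= old_tokens
        if len(self._events) + 1 > self.max_requests:
            return False  # request budget exhausted
        if self._tokens_in_window + tokens > self.max_tokens:
            return False  # token budget exhausted
        self._events.append((now, tokens))
        self._tokens_in_window += tokens
        return True


limiter = DualWindowLimiter(max_requests=3, max_tokens=1_000)
print(limiter.try_acquire(400, now=0.0))  # fits both budgets
print(limiter.try_acquire(700, now=1.0))  # rejected: 400 + 700 exceeds 1,000 tokens
```

A rejected call records nothing, so the caller can sleep briefly and try again.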
Per-User Quotas for Cost Control
In multi-tenant systems, enforce per-user quotas to prevent single users from exhausting your API budget:
import sqlite3
from datetime import date

import tiktoken


class UserQuotaManager:
    """Track and enforce per-user monthly token budgets."""

    def __init__(self, db_path: str = "quotas.db"):
        self.db_path = db_path
        self._init_db()

    def _init_db(self) -> None:
        """Create quota table if missing."""
        conn = sqlite3.connect(self.db_path)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS user_quotas (
                user_id TEXT PRIMARY KEY,
                tokens_used INTEGER DEFAULT 0,
                tokens_limit INTEGER DEFAULT 1000000,  -- 1M tokens/month
                reset_date DATE DEFAULT CURRENT_DATE   -- date of last reset (ISO string)
            )
        """)
        conn.commit()
        conn.close()

    async def check_and_consume(
        self,
        user_id: str,
        tokens_needed: int,
    ) -> bool:
        """
        Check if user has quota; consume if yes, return False if no.
        Note: sqlite3 calls block the event loop, which is acceptable for a sketch.
        """
        today = date.today()
        conn = sqlite3.connect(self.db_path)
        try:
            cursor = conn.cursor()
            cursor.execute(
                "SELECT tokens_used, tokens_limit, reset_date FROM user_quotas WHERE user_id = ?",
                (user_id,),
            )
            row = cursor.fetchone()

            if not row:
                # New user: create quota.
                cursor.execute(
                    "INSERT INTO user_quotas (user_id, tokens_limit, reset_date) VALUES (?, ?, ?)",
                    (user_id, 1000000, today.isoformat()),
                )
                tokens_used, tokens_limit = 0, 1000000
            else:
                tokens_used, tokens_limit, reset_date = row
                # SQLite returns the date as an ISO string. Compare the YYYY-MM prefix
                # so usage resets once per calendar month, not on every request.
                if reset_date[:7] != today.isoformat()[:7]:
                    tokens_used = 0
                    cursor.execute(
                        "UPDATE user_quotas SET tokens_used = 0, reset_date = ? WHERE user_id = ?",
                        (today.isoformat(), user_id),
                    )

            # Check if budget available; consume only if it is.
            allowed = tokens_used + tokens_needed <= tokens_limit
            if allowed:
                cursor.execute(
                    "UPDATE user_quotas SET tokens_used = tokens_used + ? WHERE user_id = ?",
                    (tokens_needed, user_id),
                )
            # Commit on both paths so new rows and resets persist even on rejection.
            conn.commit()
            return allowed
        finally:
            conn.close()


quota_manager = UserQuotaManager()


async def process_with_quota(user_id: str, prompt: str) -> dict:
    """Process only if user has quota."""
    enc = tiktoken.encoding_for_model("gpt-4")
    prompt_tokens = len(enc.encode(prompt))
    estimated_total = int(prompt_tokens * 1.3)  # Rough estimate of prompt + output

    has_quota = await quota_manager.check_and_consume(user_id, estimated_total)
    if not has_quota:
        return {
            "status": 429,
            "error": "Monthly token quota exceeded. Upgrade your plan.",
        }

    # Process request.
    response = await fetch_llm_with_token_limit(prompt)
    return {"status": 200, "response": response}
Per-user quotas prevent runaway costs and incentivize efficient prompt usage.
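When several workers or processes share the quota database, a separate read followed by a write can let two requests both pass the check. One possible mitigation, sketched here under the assumption that the table has the `tokens_used` and `tokens_limit` columns shown above, is to fold the check into a single conditional UPDATE. The helper name `consume_atomically` is hypothetical.

```python
import sqlite3


def consume_atomically(conn: sqlite3.Connection, user_id: str, tokens: int) -> bool:
    """Consume tokens only if the user stays within tokens_limit."""
    cursor = conn.execute(
        """
        UPDATE user_quotas
        SET tokens_used = tokens_used + ?
        WHERE user_id = ? AND tokens_used + ? <= tokens_limit
        """,
        (tokens, user_id, tokens),
    )
    conn.commit()
    # rowcount is 1 if the row was updated, 0 if the quota check failed
    # or the user does not exist.
    return cursor.rowcount == 1


# In-memory database with the columns this sketch relies on.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_quotas ("
    "user_id TEXT PRIMARY KEY, tokens_used INTEGER DEFAULT 0, tokens_limit INTEGER)"
)
conn.execute("INSERT INTO user_quotas (user_id, tokens_limit) VALUES ('alice', 1000)")
conn.commit()

print(consume_atomically(conn, "alice", 800))  # within quota
print(consume_atomically(conn, "alice", 300))  # rejected: would exceed 1,000
```

The database evaluates the condition and the increment in one statement, so there is no window between checking and consuming.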
Cost Tracking and Alerting
Track API costs in real-time and alert when spending spikes:
import logging
from datetime import datetime
from typing import Optional

import tiktoken

logger = logging.getLogger(__name__)


class CostTracker:
    """Track LLM API costs by model, user, and date."""

    def __init__(self):
        self.costs_by_date = {}
        self.costs_by_model = {}
        self.costs_by_user = {}
        self.daily_budget = 100.0  # $100/day budget

    def add_cost(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        user_id: str = "unknown",
    ) -> None:
        """Record a cost and check budget."""
        # Illustrative USD prices per token; verify against each provider's current pricing page.
        pricing = {
            "gpt-4": {"input": 0.00003, "output": 0.00006},
            "gpt-4-turbo": {"input": 0.00001, "output": 0.00003},
            "gpt-3.5-turbo": {"input": 0.0000005, "output": 0.0000015},
            "claude-3-sonnet": {"input": 0.000003, "output": 0.000015},
        }

        if model not in pricing:
            # This fallback underestimates pricier models; add real prices for every model you use.
            logger.warning(f"Unknown model {model}. Assuming gpt-3.5-turbo pricing.")
            model = "gpt-3.5-turbo"

        # Calculate cost.
        input_cost = input_tokens * pricing[model]["input"]
        output_cost = output_tokens * pricing[model]["output"]
        total_cost = input_cost + output_cost

        # Track by date, model, and user.
        today = datetime.now().date().isoformat()
        self.costs_by_date[today] = self.costs_by_date.get(today, 0) + total_cost
        self.costs_by_model[model] = self.costs_by_model.get(model, 0) + total_cost
        self.costs_by_user[user_id] = self.costs_by_user.get(user_id, 0) + total_cost

        # Alert if daily budget exceeded; callers enforce it via is_budget_ok().
        if self.costs_by_date[today] > self.daily_budget:
            logger.critical(
                f"Daily budget exceeded: ${self.costs_by_date[today]:.2f} vs ${self.daily_budget:.2f}. "
                f"Rejecting new requests."
            )

    def get_daily_cost(self, date: Optional[str] = None) -> float:
        """Get total cost for a date (ISO format); defaults to today."""
        if not date:
            date = datetime.now().date().isoformat()
        return self.costs_by_date.get(date, 0.0)

    def is_budget_ok(self) -> bool:
        """Check if daily budget is still OK."""
        today = datetime.now().date().isoformat()
        return self.costs_by_date.get(today, 0) < self.daily_budget


tracker = CostTracker()


async def process_with_cost_limit(prompt: str, model: str = "gpt-4") -> dict:
    """Process only if within budget."""
    if not tracker.is_budget_ok():
        return {
            "status": 429,
            "error": "Daily API budget exhausted. Retry tomorrow.",
        }

    # Fetch response.
    response = await fetch_llm_with_token_limit(prompt, model)

    # Track cost (would use actual token counts from API response).
    enc = tiktoken.encoding_for_model(model)
    input_tokens = len(enc.encode(prompt))
    output_tokens = len(enc.encode(response))
    tracker.add_cost(model, input_tokens, output_tokens)

    today_cost = tracker.get_daily_cost()
    return {
        "status": 200,
        "response": response,
        "cost_today": today_cost,
    }
Cost tracking prevents budget overruns and exposes inefficient prompts.
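Re-encoding the prompt and response with tiktoken only approximates what was billed. OpenAI chat completions responses include a `usage` object with `prompt_tokens` and `completion_tokens`, so a tracker can read the billed counts directly. The sketch below assumes the caller has the parsed JSON body; the helper name `tokens_from_usage` and the sample body are made up for illustration.

```python
from typing import Tuple


def tokens_from_usage(response_json: dict) -> Tuple[int, int]:
    """Return (input_tokens, output_tokens) from a chat completions response body."""
    usage = response_json.get("usage") or {}
    return int(usage.get("prompt_tokens", 0)), int(usage.get("completion_tokens", 0))


# Hand-written sample body for illustration, not real API output.
sample = {
    "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}
input_tokens, output_tokens = tokens_from_usage(sample)
print(input_tokens, output_tokens)
```

These two values can be passed to `tracker.add_cost(model, input_tokens, output_tokens)` in place of the locally encoded counts.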
Request Retry with 429 Handling
LLM APIs return HTTP 429 when you exceed a rate limit. Handle it gracefully: honor the Retry-After header when it is present, and fall back to exponential backoff with jitter otherwise:
import asyncio
import logging
import os
import random

import aiohttp

logger = logging.getLogger(__name__)
# Assumes the key is in the OPENAI_API_KEY environment variable; adjust to your setup.
YOUR_API_KEY = os.environ.get("OPENAI_API_KEY", "")


async def fetch_with_429_retry(prompt: str, max_retries: int = 5) -> dict:
    """
    Fetch with automatic 429 handling.
    Respects the Retry-After header when the API sends one.
    """
    async with aiohttp.ClientSession() as session:
        for attempt in range(max_retries):
            # Exponential backoff with jitter; used when Retry-After is absent or unusable.
            delay = (2 ** attempt) + random.uniform(0, 1)
            try:
                async with session.post(
                    "https://api.openai.com/v1/chat/completions",
                    json={
                        "model": "gpt-4",
                        "messages": [{"role": "user", "content": prompt}],
                    },
                    headers={"Authorization": f"Bearer {YOUR_API_KEY}"},
                ) as resp:
                    if resp.status == 200:
                        data = await resp.json()
                        return {"status": 200, "response": data["choices"][0]["message"]["content"]}
                    if resp.status != 429:
                        raise RuntimeError(f"API error {resp.status}")
                    # Rate limited. Retry-After may be missing or not a plain number.
                    try:
                        delay = float(resp.headers.get("Retry-After", ""))
                    except ValueError:
                        pass
                    logger.warning(f"Rate limited. Waiting {delay:.1f}s before retry.")
            except (aiohttp.ClientError, asyncio.TimeoutError):
                if attempt == max_retries - 1:
                    raise
                logger.info(f"Retry {attempt + 1}/{max_retries} after {delay:.1f}s")
            # Sleep outside the response context so the connection is released first.
            if attempt < max_retries - 1:
                await asyncio.sleep(delay)
    # Every attempt was rate limited: fail loudly instead of returning None.
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")
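Respecting Retry-After avoids retrying before the server is ready to accept the request, and jittered backoff keeps many clients from retrying in lockstep (the thundering herd problem). Together they increase success rates.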
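The HTTP specification allows Retry-After to carry either a number of seconds or an HTTP date, so a plain `int()` conversion can fail. A small parser along these lines handles both forms. The function name `parse_retry_after` is hypothetical, and the `now` parameter is there only so the date branch can be checked deterministically.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header is missing or unparseable."""
    if not value:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))  # delay-seconds form, e.g. "120"
    except ValueError:
        pass
    try:
        target = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        # Treat a date without zone information as UTC.
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (target - now).total_seconds())


reference = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(parse_retry_after("120"))
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=reference))
```

When the function returns None, the caller can fall back to its own backoff delay.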
Key Takeaways
- Token limits are per-API; count tokens before requesting: Use tiktoken to estimate tokens and respect TPM budgets.
- Per-user quotas prevent runaway costs: Track tokens/month per user; reject requests exceeding quota.
- Cost tracking alerts on budget overruns: Monitor daily costs; reject new work if budget exhausted.
- Handle 429 errors with exponential backoff: Respect Retry-After headers; scale down concurrency on rate limits.
- Model choice dramatically affects costs: At the prices used in this article, GPT-3.5-turbo is roughly 40–60x cheaper than GPT-4; consider hybrid strategies (GPT-3.5 for simple, GPT-4 for complex).
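A hybrid strategy needs some rule for deciding which requests count as simple. The sketch below shows one crude possibility based on prompt length and a few keywords. The function name `choose_model`, the marker list, and the 500-character threshold are all assumptions for illustration and would need tuning against real traffic.

```python
def choose_model(prompt: str, max_simple_chars: int = 500) -> str:
    """Pick a model tier from crude prompt features (illustrative heuristic only)."""
    reasoning_markers = ("prove", "step by step", "analyze", "debug", "refactor")
    lowered = prompt.lower()
    needs_reasoning = any(marker in lowered for marker in reasoning_markers)
    if needs_reasoning or len(prompt) > max_simple_chars:
        return "gpt-4"
    return "gpt-3.5-turbo"


print(choose_model("Translate 'hello' to French."))
print(choose_model("Debug this stack trace and explain the root cause."))
```

A heuristic like this can misroute requests, so it is worth logging the chosen model alongside quality feedback before relying on it.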
Frequently Asked Questions
How do I estimate output tokens before the API call?
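There is no reliable formula: output length depends on the task more than on the input length, and the 1.2–1.3x multipliers used in this article are rough placeholders. For a hard upper bound, set max_tokens on the request and budget for prompt tokens plus max_tokens; the API response reports the actual tokens used, so you can reconcile afterwards. You can also measure typical output lengths on a sample of real traffic, or with a cheaper model (GPT-3.5), and use that as the estimate for your target model.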
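One way to turn max_tokens into a budget is to reserve the worst case before the call and hand back the unused part once the API reports actual usage. This is a minimal arithmetic sketch; the helper names `reserve_tokens` and `unused_tokens` are hypothetical.

```python
def reserve_tokens(prompt_tokens: int, max_tokens: int) -> int:
    """Worst-case token usage: the prompt plus the output cap."""
    return prompt_tokens + max_tokens


def unused_tokens(reserved: int, actual_total: int) -> int:
    """Tokens to hand back to a quota or bucket once real usage is known."""
    return max(0, reserved - actual_total)


reserved = reserve_tokens(prompt_tokens=200, max_tokens=300)
refund = unused_tokens(reserved, actual_total=260)
print(reserved, refund)
```

Reserving the worst case is conservative: it never overspends a quota, at the price of temporarily holding tokens that are returned after the call.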
Should I share one API key across users or create separate keys per user?
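Separate keys per user or team give the strongest isolation: if everyone shares a key and one user exhausts the provider-side limit, all users are blocked. If you do share a key, you have to attribute usage yourself, for example with a per-user quota table like the one above, because the provider sees only one caller.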
What's the cheapest way to use LLMs at scale?
Use GPT-3.5-turbo for straightforward tasks (roughly 40–60x cheaper than GPT-4 at the prices used in this article). Reserve GPT-4 for complex reasoning. Use batching (OpenAI Batch API) for non-urgent work (50% discount, results within 24 hours).
How do I handle quota reset (monthly/weekly)?
Store reset_date (the date of the last reset) in the quota table. When the current month differs from the month in reset_date, set tokens_used to 0 and update reset_date to today. A cron job or a check at request time both work.
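The request-time check can be as small as the sketch below. It assumes reset_date is stored as an ISO date string recording the last reset, and that quotas reset once per calendar month; the function name `needs_monthly_reset` is hypothetical.

```python
from datetime import date


def needs_monthly_reset(reset_date_iso: str, today: date) -> bool:
    """True if the stored reset date falls in an earlier (or different) calendar month."""
    last_reset = date.fromisoformat(reset_date_iso)
    return (last_reset.year, last_reset.month) != (today.year, today.month)


print(needs_monthly_reset("2024-01-31", date(2024, 2, 1)))   # month changed
print(needs_monthly_reset("2024-02-01", date(2024, 2, 28)))  # same month
```

Comparing year and month together avoids treating January of one year and January of the next as the same period.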
Can I use multiple API keys to bypass rate limits?
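Technically yes, but be cautious: limits are often enforced per organization or account rather than per key, so extra keys may not help, and deliberately circumventing limits can violate a provider's terms. Better: request a limit increase, or spread load across multiple models or providers (see article 8).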
Further Reading
- OpenAI Rate Limits Documentation — official limits and quotas.
- Anthropic Claude API Pricing — claude-3 models and costs.
- tiktoken: Token Encoder — official token counter.
- Cost Tracking Tools for LLMs — LangSmith cost monitoring and debugging.