LLM Cost Optimization: Understand Token Economics
LLM cost optimization begins with understanding how language model APIs charge you: they measure usage in tokens, not words or requests, and bill input (prompt) and output (completion) separately at different rates. A token is a unit of text, roughly 3–4 characters or about three-quarters of an English word, that the model's tokenizer breaks a request into. When you send a prompt to Claude, GPT-4, or Mistral, the API counts input tokens (everything you send) and output tokens (everything the model generates), multiplies each count by its published per-million rate, and bills you the sum. Across the industry as of June 2026, input tokens cost $0.50–$5 per million depending on model tier, while output tokens run 2–10× more expensive, making output length the dominant cost lever in most applications.
What Are Tokens and Why They Matter
A token is the atomic unit for LLM cost and latency. The phrase "Hello, world!" tokenizes to about four tokens in common BPE tokenizers (Hello, the comma, world, and the exclamation mark), while a JSON array of 100 objects might tokenize to 500 tokens or more. Token count does not scale uniformly with word or character count: special characters, punctuation, and certain language patterns affect token density. A typical English paragraph (100 words) consumes 130–150 tokens, meaning a million tokens represents roughly 670,000–770,000 words. For Anthropic's Claude 3.5 Sonnet (as of June 2026), input tokens cost $3 per million while output tokens cost $15 per million, a 5× multiplier on generation. This asymmetry means that a request generating 1,000 output tokens (about 750 words) incurs $0.015 in generation cost alone, while a 1,000–2,000-token input prompt costs only $0.003–$0.006. Understanding this is foundational: you optimize output length first, then input length, then model selection.
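The words-to-tokens ratio above can be turned into a quick estimator. This is a minimal sketch: the function name `estimate_tokens` is hypothetical, and the 1.3–1.5 tokens-per-word range is an assumption taken from the 100-word example, so a real tokenizer for your model will give different counts.

```python
def estimate_tokens(text: str, low: float = 1.3, high: float = 1.5) -> tuple[int, int]:
    """Return a (low, high) token estimate from a whitespace word count.

    Heuristic only: assumes 1.3-1.5 tokens per English word.
    """
    words = len(text.split())
    return round(words * low), round(words * high)


paragraph = " ".join(["word"] * 100)  # stand-in for a 100-word paragraph
print(estimate_tokens(paragraph))
```

Use an estimate like this only for early sizing; measure real payloads with the provider's tokenizer before committing to a budget.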
Input vs Output Token Costs
Input tokens are the tokens in your system prompt, context documents, and user query combined. Output tokens are whatever the model generates in response. Nearly every LLM provider charges less for input than output (typically a 1:2 to 1:10 ratio), and many discount cached input further, which rewards caching long context rather than resending it at full price on every call. For example, with Claude 3.5 Sonnet, a 10,000-token context document with a 100-token query costs about $0.030 in input; if the model outputs 500 tokens, that adds $0.0075 (total about $0.038). Running the same query five times costs about $0.038 × 5 = $0.19. But if you cache the context (Anthropic, for example, bills cached reads at roughly 90% below the input rate), the second and subsequent calls reuse that context at cache rates, dropping the per-call cost to about $0.011, roughly a 70% savings on each repeat call. Output length is the cost lever you control most directly via prompt engineering: asking for "a three-sentence summary" instead of a five-paragraph essay saves money immediately.
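To make the caching arithmetic concrete, here is a small sketch that prices the same 10,000-token context, 100-token query, and 500-token output with and without a cached context. The 90% cached-read discount is an assumption, the helper `call_cost` is hypothetical, and any surcharge a provider applies when first writing the cache is ignored.

```python
INPUT_PRICE = 3.00    # USD per million input tokens (Sonnet figure from the text)
OUTPUT_PRICE = 15.00  # USD per million output tokens
CACHE_READ_DISCOUNT = 0.90  # assumption: cached input billed at 10% of the input rate


def call_cost(context_tokens: int, query_tokens: int, output_tokens: int,
              context_cached: bool = False) -> float:
    """Cost in USD of one call; the context is billed at the cached rate when reused."""
    context_rate = INPUT_PRICE * (1 - CACHE_READ_DISCOUNT) if context_cached else INPUT_PRICE
    return (context_tokens * context_rate
            + query_tokens * INPUT_PRICE
            + output_tokens * OUTPUT_PRICE) / 1_000_000


uncached = call_cost(10_000, 100, 500)
cached = call_cost(10_000, 100, 500, context_cached=True)
five_calls_uncached = 5 * uncached
# The first call pays full price; the next four read the context from cache.
five_calls_cached = uncached + 4 * cached

print(f"uncached call: ${uncached:.4f}")
print(f"cached call:   ${cached:.4f}")
print(f"five calls:    ${five_calls_uncached:.4f} vs ${five_calls_cached:.4f}")
```

Note that the output tokens are never discounted, so the savings from caching shrink as the output grows relative to the context.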
The Six Cost Drivers in LLM Systems
Cost in production LLM systems comes from six levers: model choice (tier), prompt length (input tokens), output length (completion tokens), request frequency, latency tolerance, and cache utilization. Model choice is a tiering decision: use a cheaper model for simple tasks and a more capable (and expensive) model for hard reasoning. Claude 3.5 Sonnet costs several times as much as Haiku (roughly 4× against Claude 3.5 Haiku at list prices) but handles complex multimodal tasks that Haiku fails. A support chatbot answering FAQ questions might route 80% of traffic to Haiku and 20% to Sonnet (for edge cases requiring reasoning), cutting average cost per request by roughly 60% compared with sending everything to Sonnet. Prompt length is variable: you control whether you include 100 or 10,000 tokens of context. Output length is negotiable: you can ask the model to return summaries, bullet points, or JSON instead of prose, cutting token count by 50–70%. Request frequency is load: at the same cost per request, a system handling 1,000 daily requests costs 1,000× less than one handling 1 million. Latency tolerance determines whether work can be deferred to discounted batch processing instead of being answered in real time. And cache hits, when a large prompt (or system message) is reused, drop cost by 80–90% on the cached portion, which matters most for Q&A over fixed document sets or ongoing conversations.
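The effect of routing on average cost can be checked with a weighted average. The sketch below is illustrative only: `blended_cost` is a hypothetical helper, the per-request Sonnet cost is an example figure, and it assumes Haiku costs a quarter as much per request with identical token counts on both models.

```python
def blended_cost(routes: list[tuple[float, float]]) -> float:
    """Average cost per request given (traffic_share, cost_per_request) pairs."""
    total_share = sum(share for share, _ in routes)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost for share, cost in routes)


sonnet_cost = 0.0105             # USD per request, example figure
haiku_cost = sonnet_cost * 0.25  # assumption: Haiku costs a quarter as much per request

all_sonnet = blended_cost([(1.0, sonnet_cost)])
routed = blended_cost([(0.8, haiku_cost), (0.2, sonnet_cost)])
savings = 1 - routed / all_sonnet

print(f"routed cost: ${routed:.5f} per request, savings: {savings:.0%}")
```

In practice the cheaper model may need longer prompts or occasional escalation to the larger model, so measured savings are usually lower than this idealized figure.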
The cost formula is transparent: (input_tokens × input_price_per_million) / 1_000_000 + (output_tokens × output_price_per_million) / 1_000_000 = cost_usd_per_request. For a Sonnet request with 2,000 input tokens and 300 output tokens: (2000 × 3) / 1_000_000 + (300 × 15) / 1_000_000 = $0.006 + $0.0045 = $0.0105 (about 1 cent). Scale that to 1 million requests per month: $10,500. Small optimizations compound: reducing output tokens to 200 drops the request cost to $0.009 and the monthly bill to $9,000 (14% savings). Reducing input tokens to 1,500 saves another $0.0015 per request (a further 14% of the baseline), landing you at $7,500. And routing 30% of traffic to Haiku (assumed here to cost 80% less per request) cuts the remaining bill by 24%, bottoming out at $5,700 per month, a 46% reduction from the unoptimized baseline, all from parameter tuning alone.
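The formula and the compounding steps can be reproduced in a few lines. This is a sketch under the same figures as the worked example ($3 and $15 per million tokens, 1 million requests per month); the function name `request_cost` and the assumption that the cheaper model costs 80% less per request are illustrative.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float = 3.0, output_price: float = 15.0) -> float:
    """USD cost of one request; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


MONTHLY_REQUESTS = 1_000_000

baseline = request_cost(2_000, 300) * MONTHLY_REQUESTS
shorter_output = request_cost(2_000, 200) * MONTHLY_REQUESTS
shorter_input = request_cost(1_500, 200) * MONTHLY_REQUESTS
# Assumption: 30% of traffic moves to a model that costs 80% less per request.
routed = shorter_input * (1 - 0.30 * 0.80)

steps = [("baseline", baseline), ("output 300 -> 200", shorter_output),
         ("input 2000 -> 1500", shorter_input), ("30% routed", routed)]
for label, monthly in steps:
    print(f"{label:<20} ${monthly:>9,.0f}  ({1 - monthly / baseline:.0%} below baseline)")
```

Expressing each step as a multiplier on the previous bill makes it easy to see which lever contributes most for your own token counts.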
Pricing Models and Provider Variability
LLM API pricing varies significantly across providers and model tiers as of 2026. Anthropic (Claude) bills input and output at per-million-token rates and offers prompt caching that charges about 90% less for cached reads (the cache expires after a short idle window, 5 minutes by default, refreshed on each hit). OpenAI (GPT-4) uses a similar per-token model but does not offer caching discounts at the same scale. Mistral charges the lowest rates ($0.14 input / $0.42 output per million for Mistral 7B) but sacrifices some quality on reasoning tasks. Google Gemini splits pricing by region and model size. AWS Bedrock offers on-demand and provisioned throughput models; provisioned throughput can cost substantially less than on-demand at sustained volume if you commit to fixed capacity. The takeaway: compare total monthly spend across models for your workload, not per-token rates in isolation. A 3× cheaper model that needs three attempts per usable answer because it fails more often is no savings.
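One way to compare models on total spend rather than list price is to compute the expected cost per successful response. The sketch below assumes failed attempts are retried until one succeeds and that each attempt succeeds independently; the function name and the success rates are hypothetical values chosen to mirror the example above.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per successful response when failures are retried until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # With independent attempts, the expected number of attempts is 1 / success_rate.
    return cost_per_attempt / success_rate


premium = cost_per_success(0.0105, success_rate=1.0)        # succeeds on the first try
budget = cost_per_success(0.0105 / 3, success_rate=1 / 3)   # 3x cheaper, 3 tries on average

print(f"premium: ${premium:.4f} per success")
print(f"budget:  ${budget:.4f} per success")
```

The two models come out equal here, and the cheaper one also adds latency from retries, which is why per-token price alone is a poor basis for comparison.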
Understanding Cost Baselines and Optimization Targets
A cost baseline is your current unoptimized monthly spend; optimization targets are the spend reductions you aim for. Most teams discover their baseline empirically by observing the first month of production traffic and calculating spend per active user, per feature, or per request type. A typical SaaS chatbot running Claude Opus (premium tier) might spend $1,000 per day on 50,000 daily interactions, or $0.02 per request. The optimization target might be $0.01 per request (a 50% reduction) by routing simple queries to Haiku, caching FAQs, and compressing prompts. A data-labeling service doing annotation with Claude might spend $50 per 1,000 annotated examples; the target could be $15 by batch-processing off-peak (with batch API discounts) and using smaller models for simpler labels. Setting baselines and targets upfront forces accountability and prevents cost creep as traffic grows.
Key Takeaways
- LLM APIs charge per-token for input and output separately; output tokens typically cost 2–10× more than input.
- A single optimization (shorter output, cheaper model, cache reuse) typically saves 10–30% per request; combined optimizations compound to 40–60% total savings.
- The cost formula is deterministic and measurable: (input_tokens × input_rate) + (output_tokens × output_rate) per request, so you can forecast spend accurately.
- Six levers control cost: model choice, prompt length, output length, request frequency, latency tolerance, and cache use.
- Establish a cost baseline (current unoptimized spend) and a target (desired spend) for accountability and measurement.
Frequently Asked Questions
How many tokens does a typical user query use?
A short customer-support query ("How do I reset my password?") tokenizes to 10–20 tokens. A multi-paragraph email with history might be 300–500 tokens. A full customer record (name, order history, support tickets) dumped into a context window can run 5,000–20,000 tokens. Use an online tokenizer (OpenAI's or Anthropic's estimator) to test your typical payloads and multiply by your daily request volume for a rough monthly spend projection.
Why does my output cost more than my input?
Output tokens are more expensive because generating text (autoregressive decoding) is computationally heavier than reading input. The model must run a sequential forward pass for each output token it generates, whereas input tokens are processed in parallel. Pricing also reflects demand: generation capacity is the constrained resource, so workloads dominated by long outputs (essays, code, analyses) pay a premium for it.
Can I estimate my monthly spend before launch?
Yes: estimate daily request volume, measure the average input and output tokens for a representative request (or test batch), and multiply by 30 days. Plug into the formula: daily_requests × (avg_input_tokens × input_rate + avg_output_tokens × output_rate) / 1_000_000 × 30. Add 20% buffer for variance and you have a reasonable forecast.
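The forecast formula translates directly into a small function. This is a sketch with illustrative inputs: the name `monthly_forecast`, the 10,000 daily requests, and the token averages are assumptions, while the 30-day month and 20% buffer follow the answer above.

```python
def monthly_forecast(daily_requests: int, avg_input_tokens: float, avg_output_tokens: float,
                     input_rate: float, output_rate: float,
                     days: int = 30, buffer: float = 0.20) -> float:
    """Projected monthly spend in USD; rates are per million tokens."""
    per_request = (avg_input_tokens * input_rate
                   + avg_output_tokens * output_rate) / 1_000_000
    return daily_requests * per_request * days * (1 + buffer)


# Example inputs (assumed): 10,000 requests/day, 2,000 input and 300 output tokens each.
forecast = monthly_forecast(10_000, 2_000, 300, input_rate=3.0, output_rate=15.0)
print(f"forecast: ${forecast:,.2f} per month")
```

Rerun the forecast with measured averages once real traffic is available, since the token averages dominate the result.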
Is it always worth optimizing for cost?
Not always. If your monthly spend is below $100, optimization effort (engineering time, testing, complexity) likely outweighs savings. But above $1,000/month, a 30% cost reduction ($300/month = $3,600/year) justifies 2–3 weeks of optimization work. Above $10,000/month, you should dedicate an engineer full-time to cost reduction.
Which model should I default to?
Default to the smallest, cheapest model that solves your problem reliably. For classification, summarization, and simple extraction: use Haiku (or Mistral 7B if on AWS). For reasoning, code generation, and complex analysis: use Sonnet. Reserve Opus for multi-step reasoning or high-stakes tasks where accuracy is critical and the extra cost and latency are acceptable.
Further Reading
- Anthropic Prompt Caching Guide — Official documentation on 90% discount for cached tokens.
- OpenAI Token Counter and Pricing — Token counting and per-token rate reference.
- Google Gemini Pricing and Models — Pricing across Gemini model tiers and regions.
- AWS Bedrock Pricing and Throughput — On-demand vs. provisioned throughput cost comparison.