Count LLM Tokens: How to Calculate Input & Output Costs

Token counting is the foundation of LLM cost optimization. Before you can optimize spend, you must measure how many input and output tokens each request consumes. A tokenizer is a deterministic algorithm that splits text into tokens according to a model's vocabulary. Claude, GPT-4, Mistral, and Gemini each use different tokenizers, so the same text produces different token counts on each. OpenAI publishes an open-source tokenizer, tiktoken, that lets you count input tokens offline before sending a request; Anthropic and Google instead expose token counting through their APIs and SDKs. By instrumenting your application to log input and output token counts alongside each request, you gain the data needed to calculate per-request cost, identify high-cost features, and forecast monthly spend. A typical production system logs token counts for most requests (around 80%; sampling is acceptable) to profile average cost across request types.

Official Tokenizers and APIs

Each major LLM provider offers an official way to count tokens before making an API call. Anthropic exposes token counting through its Messages API, available in the anthropic Python SDK as client.messages.count_tokens and as the equivalent method in its TypeScript SDK; it returns a token count for the full request payload (system prompt and messages), not token IDs. OpenAI publishes tiktoken, a Python library that tokenizes locally; JavaScript ports are distributed as separate packages. Google provides token-counting methods in its google-generativeai SDK. Use the official tool wherever you can: rule-of-thumb estimates (for example, dividing the character count by 3.5) give a ballpark figure but can be off by 20–30%, leading to cost surprises and inaccurate forecasts. Where the counting happens also matters: an API-based counter costs a network round trip per call, while a local tokenizer such as tiktoken can count a large batch of prompts offline, which makes it fast to audit historical messages or run cost sensitivity analyses.

For Anthropic Claude, token counting is integrated into the SDK:

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Count tokens before making a request (this is itself an API call)
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system="You are a helpful assistant.",
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ],
)

print(f"Input tokens: {count.input_tokens}")
# Prints the input token count for the request; the exact number depends on the model.

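This call returns the number of input tokens the API expects to bill for the same request. Anthropic documents the result as an estimate, so the usage reported on the real response may differ by a small amount. By preceding your production completion calls with a token-count call, you can log and forecast spend before incurring it, at the price of one extra network round trip. After a completion, the response object includes usage.input_tokens and usage.output_tokens, so you can calculate the actual cost retrospectively, with rates expressed in USD per million tokens: cost_usd = (input_tokens × input_rate + output_tokens × output_rate) / 1,000,000.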

Building a Cost Calculation Pipeline

A cost calculation pipeline integrates token counting into your request flow, logs costs, and aggregates them by feature, user, or time window. The pattern is:

  1. Before request: Call the token-counting API with your system prompt, user message, and any context.
  2. Log forecast: Store (input_tokens, model, timestamp) in a cost ledger.
  3. Make request: Send the actual completion call.
  4. Log actual: Extract usage.input_tokens and usage.output_tokens from the response, record the actual cost.
  5. Aggregate: Roll up costs by feature, user, date, or model for dashboarding and alerting.

Here is a Python example that logs costs per request:

import anthropic
import json
from datetime import datetime, timezone

client = anthropic.Anthropic()

# Pricing for Claude 3.5 Sonnet (June 2026), in USD per million tokens.
# Verify against the provider's current pricing page before relying on it.
PRICING = {
    "claude-3-5-sonnet-20241022": {
        "input": 3.0,    # per million tokens
        "output": 15.0,  # per million tokens
    }
}

def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Calculate cost in USD for a single request."""
    # Raise KeyError for an unpriced model rather than silently logging $0.
    pricing = PRICING[model]
    input_cost = (input_tokens * pricing["input"]) / 1_000_000
    output_cost = (output_tokens * pricing["output"]) / 1_000_000
    return input_cost + output_cost

def log_cost_event(feature: str, user_id: str, input_tokens: int,
                   output_tokens: int, model: str, cost: float):
    """Log a cost event for later aggregation."""
    event = {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated).
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "feature": feature,
        "user_id": user_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "model": model,
        "cost_usd": round(cost, 6),
    }
    # In production, write this to a database or log aggregator.
    print(json.dumps(event))

# Example: answering a support question
def answer_support_question(question: str, user_id: str, context: str = "") -> str:
    """Answer a customer support question with cost tracking."""
    model = "claude-3-5-sonnet-20241022"
    system_prompt = (
        "You are a customer support specialist. Answer questions concisely "
        "in 2–3 sentences. Suggest next steps if needed."
    )
    # Build the messages once so the count and the request see identical input.
    messages = [
        {
            "role": "user",
            "content": f"Customer context: {context}\n\nQuestion: {question}",
        }
    ]

    # Count tokens before making the request
    count_response = client.messages.count_tokens(
        model=model,
        system=system_prompt,
        messages=messages,
    )
    # Forecast only: store it if you want to compare it with the billed count.
    estimated_input_tokens = count_response.input_tokens

    # Make the actual request
    response = client.messages.create(
        model=model,
        max_tokens=150,
        system=system_prompt,
        messages=messages,
    )

    # Bill from the usage reported on the response, not from the estimate.
    output_tokens = response.usage.output_tokens
    actual_input_tokens = response.usage.input_tokens
    cost = calculate_cost(actual_input_tokens, output_tokens, model)

    # Log the cost event
    log_cost_event(
        feature="support_qa",
        user_id=user_id,
        input_tokens=actual_input_tokens,
        output_tokens=output_tokens,
        model=model,
        cost=cost,
    )

    return response.content[0].text

# Sample call
answer_support_question(
    question="How do I update my billing address?",
    user_id="user_12345",
    context="Customer since 2024, Premium plan, no active issues",
)

This pipeline logs one JSON event per request. In production, pipe these events to a database (PostgreSQL, BigQuery, Snowflake) or a log aggregator (Datadog, Splunk, ELK), then query them to answer questions like "What was the average cost per support interaction last week?" or "Which feature burned the most budget?"
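As a rough illustration of the aggregation step, the sketch below rolls up logged events by feature in plain Python. The event fields follow log_cost_event above; the function name summarize_by_feature and the sample events are hypothetical, and a SQL GROUP BY over your ledger table would typically do the same job in production.

```python
from collections import defaultdict

def summarize_by_feature(events):
    """Return {feature: {"requests", "total_cost_usd", "avg_cost_usd"}}."""
    totals = defaultdict(lambda: {"requests": 0, "total_cost_usd": 0.0})
    for event in events:
        bucket = totals[event["feature"]]
        bucket["requests"] += 1
        bucket["total_cost_usd"] += event["cost_usd"]

    summary = {}
    for feature, bucket in totals.items():
        summary[feature] = {
            "requests": bucket["requests"],
            "total_cost_usd": round(bucket["total_cost_usd"], 6),
            "avg_cost_usd": round(bucket["total_cost_usd"] / bucket["requests"], 6),
        }
    return summary

# Hypothetical sample events, shaped like the JSON lines logged above.
sample_events = [
    {"feature": "support_qa", "cost_usd": 0.0080},
    {"feature": "support_qa", "cost_usd": 0.0090},
    {"feature": "summarize", "cost_usd": 0.0200},
]

summary = summarize_by_feature(sample_events)
# List features from most to least expensive.
for feature, stats in sorted(summary.items(), key=lambda kv: -kv[1]["total_cost_usd"]):
    print(feature, stats)
```

Sorting by total cost puts the feature that burned the most budget first, which is usually the first question a cost dashboard has to answer.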

Forecasting Monthly Spend from Sample Data

Once you have 100–500 representative requests logged, you can forecast monthly spend by multiplying average cost per request by expected daily volume and the number of days in the month. For example, if 200 sampled support questions averaged $0.0085 per request (input plus output) and you expect 1,000 support questions per day, your monthly forecast is 0.0085 × 1,000 × 30 = $255. Add a 20% buffer for variance (more complex queries, context creep, retried requests): $255 × 1.2 = $306/month. A forecast built from measured samples is far more reliable than a guess, and it lets you size budgets, negotiate volume discounts with your provider, or decide whether to invest in optimization.
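The forecast arithmetic can be sketched as a small helper. The function name forecast_monthly_spend, its parameters, and the sample cost list are illustrative assumptions; the numbers reproduce the worked example above.

```python
from statistics import mean

def forecast_monthly_spend(sample_costs_usd, requests_per_day, days=30, buffer=0.20):
    """Return (base_forecast, buffered_forecast) in USD, rounded to cents."""
    avg_cost = mean(sample_costs_usd)
    base = avg_cost * requests_per_day * days
    return round(base, 2), round(base * (1 + buffer), 2)

# Hypothetical sample: 200 logged costs that average $0.0085 per request.
sample_costs = [0.0080, 0.0090] * 100

base, buffered = forecast_monthly_spend(sample_costs, requests_per_day=1000)
print(f"Base forecast: ${base:.2f}/month")
print(f"With 20% buffer: ${buffered:.2f}/month")
```

Passing the raw sample costs rather than a precomputed average keeps the helper usable directly on rows pulled from the cost ledger.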

Handling Multi-Turn Conversations and Context Growth

In conversational interfaces, each turn (user message plus assistant response) consumes tokens, but you must also count the conversation history that is included in subsequent requests, because the full conversation is replayed to the model for context. Take a conversation in which each user query is 10 tokens and each response is 50 tokens. Turn 1 consumes 60 tokens. On turn 2, if you replay the prior user and assistant messages, the request consumes 60 (history) + 10 (new query) + 50 (new response) = 120 tokens, double the cost of turn 1. Each later turn adds another 60 tokens of history, so the cumulative cost of a conversation grows roughly quadratically with its length: ten turns bill 3,300 tokens in total, not 600. To forecast conversation costs, estimate typical conversation length (say 4 turns), measure the average billed tokens per turn from your logs with replayed history included (e.g., 80 tokens/turn), then multiply: 4 × 80 = 320 tokens per conversation. If your system handles 500 conversations daily, that's 320 × 500 = 160,000 tokens daily, or 4.8 million tokens over a 30-day month. At $3 per million input tokens and $15 per million output tokens, and assuming a 70:30 input-to-output split, that's approximately (4.8 × 0.7 × $3) + (4.8 × 0.3 × $15) = $10.08 + $21.60 = $31.68/month. This calculation reveals the cost impact of conversation length; you can optimize by periodically summarizing old messages or using a separate "long-term memory" store instead of replaying full history.
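To see how history replay compounds, here is a minimal sketch that totals billed tokens over a conversation, using the 10-token query and 50-token response from the example above. It assumes every turn has the same size, that the full history is replayed on each request, and that there is no system prompt; the function names are hypothetical.

```python
def conversation_tokens(turns, query_tokens, response_tokens):
    """Total (input, output) tokens billed when full history is replayed each turn."""
    total_input = 0
    total_output = 0
    history = 0  # tokens of prior messages replayed as input
    for _ in range(turns):
        total_input += history + query_tokens
        total_output += response_tokens
        history += query_tokens + response_tokens
    return total_input, total_output

def conversation_cost_usd(turns, query_tokens, response_tokens,
                          input_rate=3.0, output_rate=15.0):
    """Cost in USD; rates are USD per million tokens (defaults from the example above)."""
    total_input, total_output = conversation_tokens(turns, query_tokens, response_tokens)
    return (total_input * input_rate + total_output * output_rate) / 1_000_000

for turns in (1, 2, 4, 10):
    total_input, total_output = conversation_tokens(turns, 10, 50)
    print(f"{turns} turns: {total_input + total_output} tokens billed")
```

Because replayed history is billed as input, nearly all of the growth lands on the cheaper input rate, which is why trimming or summarizing history is the main lever for long conversations.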

Auditing and Validating Token Counts

After deploying cost logging, audit your token counts monthly to catch surprises. Common issues: (1) the tokenizer library is outdated, producing counts that are off by 1–2%; (2) your system prompt is longer than you realized (embedded definitions or few-shot examples sometimes bloat it); (3) certain user inputs (emoji, code, non-English text) tokenize into more tokens than expected; (4) context documents are being included when they shouldn't be. To audit, sample 50–100 recent requests from your logs, check that their logged input and output token counts match the usage fields of the API response (by inspecting archived API responses or by re-running the same request through the token-counting API), and reconcile any discrepancies. If there is systematic bias (for example, your logs show 10% fewer tokens than the API charged), investigate and fix the cause immediately, because it means you are underestimating actual spend.
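One way to sketch the reconciliation step is shown below. The record fields (request_id, logged_tokens, api_tokens), the 2% tolerance, and the sample data are assumptions for illustration; adapt them to whatever your ledger and archived responses actually contain.

```python
def audit_token_counts(records, tolerance=0.02):
    """Compare logged token counts with API-reported usage.

    Returns (relative_bias, mismatches). relative_bias is
    (total logged - total API) / total API, so a negative value means the
    logs undercount. mismatches lists the IDs of requests whose individual
    relative error exceeds the tolerance.
    """
    total_logged = sum(r["logged_tokens"] for r in records)
    total_api = sum(r["api_tokens"] for r in records)
    relative_bias = (total_logged - total_api) / total_api
    mismatches = [
        r["request_id"]
        for r in records
        if abs(r["logged_tokens"] - r["api_tokens"]) / r["api_tokens"] > tolerance
    ]
    return relative_bias, mismatches

# Hypothetical audit sample: two requests were logged 10% low.
sample = [
    {"request_id": "req_1", "logged_tokens": 90, "api_tokens": 100},
    {"request_id": "req_2", "logged_tokens": 200, "api_tokens": 200},
    {"request_id": "req_3", "logged_tokens": 270, "api_tokens": 300},
]

bias, mismatches = audit_token_counts(sample)
print(f"Relative bias: {bias:+.1%}")
print("Mismatched requests:", mismatches)
```

A bias close to zero with a few scattered mismatches points at individual odd inputs; a consistently negative bias points at a systematic cause such as an uncounted system prompt.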

Key Takeaways

  • Use official tokenizers and token-counting APIs (Anthropic SDK, OpenAI's tiktoken, Google's SDK) to count tokens; rule-of-thumb estimates can be off by 20–30%.
  • Log input, output, model, feature, and user ID for every request to enable cost attribution and forecasting.
  • Calculate cost per request, with rates in USD per million tokens: (input_tokens × input_rate + output_tokens × output_rate) / 1_000_000.
  • Forecast monthly spend by multiplying average cost per request by daily volume, then add 20% buffer for variance.
  • Audit token counts monthly by comparing logged counts against API response usage fields to catch systematic errors.

Frequently Asked Questions

Should I count tokens for every request or sample?

For first-pass cost estimation, sample 100–200 requests across different features and user segments. Once in production, logging is cheap (microseconds per request, since every API response already carries its usage fields), so instrument at least ~80% of requests, or all of them if your traffic is low. The logged data drives alerting and optimization, so higher coverage is better.

What if my tokenizer library is outdated?

Official tokenizer libraries are updated when a provider introduces a new vocabulary or model family (every 3–6 months typically). Check your dependency manifest: pip show and npm ls report the installed version, which you can compare with the latest release on PyPI or npm. If you're more than 2 releases behind, upgrade. The token count differences are usually small (1–3%) but add up across millions of requests.

How do I handle system prompts that change frequently?

Either (1) version your system prompts and log the version/hash alongside costs, or (2) always count tokens with the full current system prompt so your logs are accurate. Don't hardcode a static token estimate for system prompts; it drifts over time.
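For option (1), a content hash is one simple way to derive a version label without maintaining version numbers by hand. This is a sketch under assumptions: the helper name prompt_version, the 12-character truncation, and the system_prompt_version field are all hypothetical choices, not part of any SDK.

```python
import hashlib

def prompt_version(system_prompt: str, length: int = 12) -> str:
    """Return a short, stable identifier derived from the prompt text."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return digest[:length]

prompt_v1 = "You are a customer support specialist."
prompt_v2 = "You are a customer support specialist. Answer in 2-3 sentences."

# Attach the version to each cost event so cost shifts can be traced to prompt edits.
event = {
    "feature": "support_qa",
    "system_prompt_version": prompt_version(prompt_v1),
}
print(event)
```

Any edit to the prompt text changes the hash, so grouping cost events by this field shows how each prompt revision moved average input tokens.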

Can I estimate output tokens before making a request?

Not reliably. Output token count depends on the model's behavior: a simple query might generate 50 tokens, a complex one 1,000. You can set a max_tokens limit to cap output cost, but you can't know the actual output token count without running the request. This is why output cost is harder to optimize than input cost: you can bound it with max_tokens, but within that bound you steer length only indirectly, through prompt engineering.
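Although the actual output length is unknown in advance, max_tokens does give a hard upper bound on cost. A minimal sketch, assuming rates in USD per million tokens and a hypothetical helper name:

```python
def worst_case_cost_usd(input_tokens, max_tokens, input_rate, output_rate):
    """Upper bound on request cost, assuming the model emits max_tokens output tokens.

    Rates are in USD per million tokens.
    """
    return (input_tokens * input_rate + max_tokens * output_rate) / 1_000_000

# Example: a 200-token prompt with max_tokens=150 at $3 input / $15 output per million.
bound = worst_case_cost_usd(input_tokens=200, max_tokens=150,
                            input_rate=3.0, output_rate=15.0)
print(f"Worst-case cost: ${bound:.6f}")
```

Comparing this bound with the average actual cost in your logs shows how much headroom the current max_tokens setting leaves.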

How do I handle API errors and retries in cost logs?

Log successes and failures as separate events. If a request fails (e.g., rate limit, model error) and you retry, log the retry as its own cost event with its own token usage. Aggregating failures helps you identify runaway retry loops (a common cost driver) and fix them.
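The pattern might look like the sketch below, which records one event per attempt. Both send_request and log_event are hypothetical stand-ins for your own client wrapper and cost ledger, and the demo uses a fake request rather than a real API call. Catching every Exception is a simplification; real code would retry only on errors known to be transient.

```python
import time

def call_with_logged_retries(send_request, log_event, max_attempts=3, backoff_seconds=0.0):
    """Call send_request() up to max_attempts times, logging every attempt.

    send_request is a zero-argument callable that returns a dict with
    "input_tokens" and "output_tokens", or raises on failure.
    log_event receives one dict per attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            usage = send_request()
        except Exception as exc:
            log_event({"attempt": attempt, "status": "error", "error": type(exc).__name__})
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(backoff_seconds * attempt)  # linear backoff between attempts
        else:
            log_event({"attempt": attempt, "status": "ok", **usage})
            return usage

# Demo with a fake request that fails once, then succeeds.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise TimeoutError("simulated transient failure")
    return {"input_tokens": 120, "output_tokens": 40}

ledger = []
result = call_with_logged_retries(flaky_request, ledger.append)
print(ledger)
```

Counting "error" events per feature over time makes a runaway retry loop visible as a spike long before it shows up on the invoice.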

Further Reading