
Token Metrics & Latency: Measure LLM Performance

Token counting and latency measurement are foundational to LLM observability. Every inference call consumes input tokens and generates output tokens, and incurs a cost proportional to both. Latency varies with model load, network conditions, and response length. Measuring both enables cost attribution, SLA enforcement, and performance optimization. Unlike traditional APIs, where latency is measured end-to-end, LLM APIs benefit from a granular latency breakdown: time-to-first-token (TTFT, the latency until the first output token arrives) reveals queue saturation, while generation time read against output length shows whether the model is slow or the response is just long.

Token Counting: Pre-Call and Post-Call

There are two approaches to token counting:

Pre-Call Estimation

Before invoking the LLM API, estimate the input and output token counts. This allows cost forecasting and quota checks before spending money.

import anthropic

# Pre-call token estimation (using Anthropic's token counting endpoint)
def estimate_tokens_and_cost(model: str, messages: list[dict], max_tokens: int) -> tuple[int, float]:
    """Estimate tokens and cost before making an API call."""

    client = anthropic.Anthropic()

    # Anthropic provides input token counting via count_tokens
    count = client.beta.messages.count_tokens(
        model=model,
        system="You are a helpful assistant.",
        messages=messages
    )

    # Estimate output tokens (rough heuristic, an upper bound rather than a prediction):
    # assume the response uses max_tokens or 1024 tokens, whichever is smaller
    estimated_output_tokens = min(max_tokens, 1024)

    # Mock per-token pricing in USD ($15 / $75 per million tokens); check current rates
    input_cost_per_token = 0.000015
    output_cost_per_token = 0.000075

    estimated_cost = (count.input_tokens * input_cost_per_token +
                      estimated_output_tokens * output_cost_per_token)

    return count.input_tokens + estimated_output_tokens, estimated_cost

# Example
messages = [{"role": "user", "content": "What is machine learning?"}]
est_tokens, est_cost = estimate_tokens_and_cost("claude-3-opus-20240229", messages, 1024)
print(f"Estimated tokens: {est_tokens}, cost: ${est_cost:.6f}")

# Quota check
if est_cost > 0.10:  # Reject if over $0.10
    print("Request exceeds budget; rejecting.")
else:
    # Proceed with API call
    pass

Pre-call estimation prevents runaway costs by rejecting expensive requests before they reach the API.
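
The single-request check above can be extended to a running budget. The following is a minimal sketch, assuming a hypothetical `BudgetGuard` helper (not part of any SDK) that reserves the estimated cost before a call and swaps in the actual cost once the provider reports usage:

```python
from dataclasses import dataclass


@dataclass
class BudgetGuard:
    """Tracks spend against a fixed budget (hypothetical helper for illustration)."""

    budget_usd: float
    spent_usd: float = 0.0

    def try_reserve(self, estimated_cost_usd: float) -> bool:
        """Reserve the estimated cost if it fits in the remaining budget."""
        if estimated_cost_usd < 0:
            raise ValueError("estimated cost must be non-negative")
        if self.spent_usd + estimated_cost_usd > self.budget_usd:
            return False  # Reject: this request would exceed the budget
        self.spent_usd += estimated_cost_usd
        return True

    def reconcile(self, estimated_cost_usd: float, actual_cost_usd: float) -> None:
        """Replace a reserved estimate with the actual cost after the call."""
        self.spent_usd += actual_cost_usd - estimated_cost_usd


guard = BudgetGuard(budget_usd=1.00)
print(guard.try_reserve(0.60))  # First request fits within the budget
print(guard.try_reserve(0.60))  # Second request would exceed it
guard.reconcile(estimated_cost_usd=0.60, actual_cost_usd=0.25)
print(guard.try_reserve(0.60))  # Fits again once the real cost is recorded
```

Reconciling matters because the output estimate is an upper bound: without it, the guard would reject requests that the real spend could still accommodate.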

Post-Call Actual Counts

After the API call completes, the provider returns the actual token counts consumed. Use these for billing and real cost tracking.

from anthropic import Anthropic

def chat_with_cost_tracking(user_message: str) -> dict:
    """Make an LLM call and track actual token usage and cost."""

    client = Anthropic()

    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}]
    )

    # Extract actual token counts from the response
    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    total_tokens = input_tokens + output_tokens

    # Calculate actual cost (mock per-token pricing in USD; check current rates)
    input_cost = input_tokens * 0.000015
    output_cost = output_tokens * 0.000075
    total_cost = input_cost + output_cost

    return {
        "response": response.content[0].text,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": total_tokens,
        "cost_usd": total_cost,
        "stop_reason": response.stop_reason
    }

# Example
result = chat_with_cost_tracking("Explain quantum computing in 100 words.")
print(f"Input: {result['input_tokens']}, Output: {result['output_tokens']}, Cost: ${result['cost_usd']:.6f}")

Actual counts from the API are ground truth for cost attribution.
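
One way to turn those actual counts into cost attribution is to log a usage record per call and aggregate by user. The sketch below assumes a hypothetical record shape (`user_id`, `model`, `input_tokens`, `output_tokens`) and a hypothetical `PRICES` table that reuses this article's mock per-token rates under a placeholder model name:

```python
from collections import defaultdict

# Hypothetical per-token prices in USD, keyed by a placeholder model name.
PRICES = {"example-model": {"input": 0.000015, "output": 0.000075}}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call from its actual token counts."""
    price = PRICES[model]
    return input_tokens * price["input"] + output_tokens * price["output"]


def attribute_costs(usage_log: list[dict]) -> dict[str, float]:
    """Sum actual cost per user from logged usage records."""
    totals: dict[str, float] = defaultdict(float)
    for record in usage_log:
        totals[record["user_id"]] += call_cost(
            record["model"], record["input_tokens"], record["output_tokens"]
        )
    return dict(totals)


usage_log = [
    {"user_id": "alice", "model": "example-model", "input_tokens": 1000, "output_tokens": 200},
    {"user_id": "bob", "model": "example-model", "input_tokens": 500, "output_tokens": 100},
    {"user_id": "alice", "model": "example-model", "input_tokens": 2000, "output_tokens": 400},
]
for user, cost in attribute_costs(usage_log).items():
    print(f"{user}: ${cost:.4f}")
```

Keeping input and output counts separate in each record lets the same log be re-priced later if rates change.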

Latency Breakdown: Total, TTFT, and Generation Time

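Latency for an LLM API call breaks down into the components below. Queue wait is contained within TTFT, so total latency equals TTFT plus generation time:

| Component | Definition | Importance |
| --- | --- | --- |
| Queue wait time | Time from request submission to API acknowledgment | Indicates API saturation; if high, the queue is backlogged |
| Time-to-first-token (TTFT) | Latency from request submission to first output token | Users perceive first token arrival as the start of the response |
| Generation time | Time to generate all remaining tokens after the first | Long responses have high generation time |

Here is how to measure TTFT, generation time, and total latency from the client. Queue wait is not directly visible to the client; it shows up as part of TTFT: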

import time
from anthropic import Anthropic

def chat_with_latency_breakdown(user_message: str) -> dict:
    """LLM call with latency breakdown (TTFT, generation, total)."""

    client = Anthropic()

    # Measure total time
    start_time = time.perf_counter()

    first_token_time = None
    chunks = []

    # Use streaming to measure time-to-first-token
    with client.messages.stream(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}]
    ) as stream:
        for text in stream.text_stream:
            if first_token_time is None:
                # Record arrival time of the first text chunk
                first_token_time = time.perf_counter()
            chunks.append(text)
        # Streamed chunks are not tokens; read the actual count from the final message
        output_tokens = stream.get_final_message().usage.output_tokens

    end_time = time.perf_counter()

    # If no text arrived, treat the whole call as time-to-first-token
    if first_token_time is None:
        first_token_time = end_time

    # Calculate durations
    ttft_ms = (first_token_time - start_time) * 1000
    generation_ms = (end_time - first_token_time) * 1000
    total_ms = (end_time - start_time) * 1000

    return {
        "response": "".join(chunks),
        "ttft_ms": round(ttft_ms, 2),
        "generation_ms": round(generation_ms, 2),
        "total_ms": round(total_ms, 2),
        "output_tokens": output_tokens
    }

# Example
result = chat_with_latency_breakdown("Write a 500-word essay on AI ethics.")
print(f"TTFT: {result['ttft_ms']} ms")
print(f"Generation: {result['generation_ms']} ms")
print(f"Total: {result['total_ms']} ms")
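
Generation time is easier to interpret once it is normalized by output length. The following sketch assumes a hypothetical `generation_stats` helper that takes the measured timings plus the output token count reported in the provider's usage data, and derives tokens per second and mean inter-token latency:

```python
def generation_stats(ttft_ms: float, total_ms: float, output_tokens: int) -> dict:
    """Derive generation time, tokens per second, and mean inter-token latency."""
    generation_ms = total_ms - ttft_ms
    # The first token arrives at TTFT, so only the remaining tokens
    # are produced during the generation phase.
    remaining = max(output_tokens - 1, 0)
    if generation_ms <= 0 or remaining == 0:
        # Too little data to compute a rate
        return {
            "generation_ms": max(generation_ms, 0.0),
            "tokens_per_second": None,
            "inter_token_ms": None,
        }
    return {
        "generation_ms": generation_ms,
        "tokens_per_second": remaining / (generation_ms / 1000),
        "inter_token_ms": generation_ms / remaining,
    }


stats = generation_stats(ttft_ms=400.0, total_ms=5400.0, output_tokens=251)
print(stats)
```

A long generation time with a healthy tokens-per-second figure points at a verbose response; a long generation time with a low figure points at slow decoding.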

Collecting Metrics with Prometheus

For production systems, expose metrics to Prometheus or a similar time-series database. Here is an example using the prometheus_client library:

from prometheus_client import Counter, Histogram, start_http_server
from anthropic import Anthropic
import time

# Define metrics
llm_calls_total = Counter(
    'llm_calls_total',
    'Total LLM API calls',
    ['model', 'status']
)

llm_input_tokens_total = Counter(
    'llm_input_tokens_total',
    'Total input tokens consumed',
    ['model']
)

llm_output_tokens_total = Counter(
    'llm_output_tokens_total',
    'Total output tokens generated',
    ['model']
)

llm_cost_usd_total = Counter(
    'llm_cost_usd_total',
    'Total cost in USD',
    ['model']
)

llm_latency_seconds = Histogram(
    'llm_latency_seconds',
    'LLM call latency in seconds',
    ['model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# Not observed by the non-streaming call below; record it from a
# streaming call that measures time-to-first-token
llm_ttft_seconds = Histogram(
    'llm_ttft_seconds',
    'Time to first token in seconds',
    ['model'],
    buckets=[0.05, 0.1, 0.2, 0.5, 1.0]
)

def chat_with_metrics(user_message: str, model: str = "claude-3-opus-20240229"):
    """LLM call with Prometheus metrics collection."""

    client = Anthropic()
    start_time = time.perf_counter()

    try:
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": user_message}]
        )

        latency = time.perf_counter() - start_time

        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens
        # Mock per-token pricing in USD; check current rates
        cost = input_tokens * 0.000015 + output_tokens * 0.000075

        # Record metrics
        llm_calls_total.labels(model=model, status='success').inc()
        llm_input_tokens_total.labels(model=model).inc(input_tokens)
        llm_output_tokens_total.labels(model=model).inc(output_tokens)
        llm_cost_usd_total.labels(model=model).inc(cost)
        llm_latency_seconds.labels(model=model).observe(latency)

        return response.content[0].text

    except Exception:
        # Record failed calls too, then re-raise for the caller to handle
        latency = time.perf_counter() - start_time

        llm_calls_total.labels(model=model, status='error').inc()
        llm_latency_seconds.labels(model=model).observe(latency)
        raise

# Example usage
start_http_server(8000)  # Expose /metrics for scraping (port 8000 is an arbitrary choice)
chat_with_metrics("What is your name?")

Prometheus scrapes these metrics at a configured interval (15 seconds is a common choice) and stores them as time series. You can then query: "What is the 95th percentile of LLM latency?" or "Total cost in the last 24 hours?" Grafana can visualize these metrics in dashboards.
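
Percentile queries over a histogram are estimates, because only cumulative bucket counts are stored, not individual observations. The sketch below is a simplified illustration of the linear interpolation commonly used for this, not Prometheus's actual implementation; `estimate_quantile` and the sample counts are hypothetical, while the bucket bounds match the latency histogram defined above:

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper bound,
    ending with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # No observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for 100 calls
latency_buckets = [
    (0.1, 0), (0.5, 10), (1.0, 40), (2.0, 80),
    (5.0, 96), (10.0, 100), (math.inf, 100),
]
print(f"Estimated p95: {estimate_quantile(0.95, latency_buckets):.2f} s")
```

The estimate can only be as precise as the bucket boundaries, so it helps to place bucket bounds near the latency thresholds you plan to alert on.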

Alerting on Metrics

Once you have metrics, define alert rules. Here is a Prometheus alert configuration (YAML) that triggers when latency or cost anomalies are detected:

groups:
  - name: llm_alerts
    rules:
      - alert: HighLLMLatencyP95
        # histogram_quantile needs per-bucket rates aggregated by the le label
        expr: histogram_quantile(0.95, sum by (le) (rate(llm_latency_seconds_bucket[5m]))) > 3
        for: 5m
        annotations:
          summary: "LLM latency p95 exceeds 3 seconds"

      - alert: HighDailyTokenCost
        # Sum across models to get the total daily cost
        expr: sum(increase(llm_cost_usd_total[24h])) > 100
        for: 1h
        annotations:
          summary: "Daily LLM cost exceeded $100"

      - alert: HighErrorRate
        # Aggregate away the status label so the ratio is errors / all calls
        expr: |
          (sum(rate(llm_calls_total{status="error"}[5m])) /
           sum(rate(llm_calls_total[5m]))) > 0.05
        for: 5m
        annotations:
          summary: "LLM error rate exceeds 5%"

These rules fire when conditions are met, triggering notifications (Slack, PagerDuty, email).
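
The `for` field means a rule does not fire the moment its expression becomes true: the alert stays pending until the condition has held continuously for that duration, and resets if the condition clears. A simplified model of that behavior, using a hypothetical `AlertState` class and timestamps in seconds, might look like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Simplified model of an alert rule with a `for` hold duration."""

    hold_seconds: float
    active_since: Optional[float] = None

    def evaluate(self, condition: bool, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if not condition:
            self.active_since = None  # Condition cleared: reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now  # Condition just became true
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


alert = AlertState(hold_seconds=300)  # Equivalent to `for: 5m`
for t, breached in [(0, True), (120, True), (300, True), (360, False), (420, True)]:
    print(t, alert.evaluate(breached, now=t))
```

The hold duration trades detection speed for fewer false alarms: a brief latency spike never leaves the pending state.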

Key Takeaways

  • Token counting enables cost attribution and quota enforcement; estimate before the call, record actual counts after.
  • Latency breakdown (TTFT, generation time, total) reveals bottlenecks: high TTFT indicates API saturation; long generation indicates verbose responses.
  • Prometheus metrics (counters for tokens and cost, histograms for latency) enable dashboards and alerting.
  • Time-to-first-token is a key UX metric: users perceive responsiveness based on TTFT, not total latency.
  • Cost tracking per inference, user, and model enables chargeback and optimization.

Frequently Asked Questions

How accurate is pre-call token estimation?

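It depends on which half of the estimate you mean. Input counts from the provider's token counting endpoint are usually close to the billed input tokens, since tokenization is deterministic, though formatting and edge cases can cause small differences. The output estimate is much rougher: assuming the response fills the max_tokens budget gives an upper bound, not a prediction. Always use actual post-call counts for billing.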

What is a good TTFT for an LLM API?

Target TTFT under 500 ms for interactive applications. If TTFT exceeds 2 seconds, users perceive the app as slow even if total generation is fast. TTFT above 5 seconds usually indicates API queue congestion.

Should I charge users based on input or output tokens?

Both. Input tokens are usually cheaper (the user provides the prompt), and output tokens are more expensive (the model generates them). Charge separately for each and expose token counts in billing so users understand the cost structure.

Can I reduce latency by batching requests?

Not per-request latency; batching improves throughput and cost instead. Batch inference (for example, sending 100 requests together) has higher throughput but higher latency per request, because you wait for the slowest request in the batch. For latency-sensitive interactive apps, single inference is better. For background batch jobs, batching reduces cost per token.

Further Reading