LLM Inference Optimization: Measure First
LLM inference optimization begins with measurement. You cannot optimize what you do not measure. The three critical metrics for inference latency are time-to-first-token (TTFT), the wall-clock time from a user's request to the first token of the response; tokens-per-second (TPS) during decoding; and end-to-end request latency. This article shows how to instrument a serving system, collect baseline metrics, and identify optimization opportunities before deploying advanced techniques.
What Are the Core Inference Metrics?
Time-to-First-Token (TTFT) is the elapsed time from when the model receives a complete prompt until the first token is generated. TTFT is dominated by the prefill phase, where the model processes the entire prompt in one pass. A typical 7B-parameter model running on a single A100 GPU achieves TTFT of 100-500ms depending on prompt length and batch size (Hugging Face, 2024). TTFT directly shapes user perception because it is the delay before any output appears; the often-quoted figure that every 100ms of additional latency reduces engagement by 2-3% is best treated as a rule of thumb rather than a measured result for your product.
Tokens-Per-Second (TPS), or decode throughput, measures the number of tokens the model generates per second during the decode phase. For a single request, decode is typically bound by memory bandwidth, so TPS is roughly constant for a given model-hardware combination. A 7B model on a single A100 GPU generates approximately 80-120 tokens per second in single-request mode. Adding GPUs does not scale TPS linearly in general: data-parallel replicas raise aggregate throughput close to linearly, while tensor parallelism speeds up an individual request sub-linearly because of inter-GPU communication.
Time-Per-Token (TPT) in decode is the inverse of TPS: at 80-120 tokens per second on a single A100, TPT is approximately 8-12 milliseconds. This metric helps you understand whether perceived latency comes from the prefill phase or the decode phase.
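The relationship between TPS, TPT, and request latency can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes the first token is emitted at the end of prefill, so only the remaining tokens are charged at the decode rate.

```python
def tpt_ms_from_tps(tps: float) -> float:
    """Convert decode throughput (tokens/sec) to time-per-token (ms)."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return 1000.0 / tps


def estimate_end_to_end_ms(ttft_ms: float, tps: float, output_tokens: int) -> float:
    """Estimate request latency as TTFT plus the remaining decode steps.

    Assumption: the first token arrives at the end of prefill, so only
    (output_tokens - 1) tokens are generated at the decode rate.
    """
    decode_steps = max(output_tokens - 1, 0)
    return ttft_ms + decode_steps * tpt_ms_from_tps(tps)


if __name__ == "__main__":
    for tps in (80, 100, 120):
        print(f"{tps} TPS -> {tpt_ms_from_tps(tps):.1f} ms per token")
    # 150 ms TTFT, 100 TPS, 128 output tokens
    print(f"{estimate_end_to_end_ms(150, 100, 128):.0f} ms end to end")
```

With these inputs, decode accounts for well over a second while TTFT is 150ms, which is the kind of split that tells you which phase to look at first.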
Batch processing overhead increases TTFT when multiple requests are batched together, because all requests wait for the slowest prompt to finish prefill. Quantifying this tradeoff (batch size vs. TTFT inflation) is essential.
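One way to reason about this tradeoff before running a benchmark is to bound the batched TTFT from standalone prefill times. The sketch below is a simplified model, not a description of any particular scheduler: the optimistic bound assumes the batch finishes prefill as fast as its slowest prompt, and the pessimistic bound assumes prefill is fully compute-bound so the prompts' costs add up. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def batched_ttft_bounds_ms(solo_prefill_ms: Sequence[float]) -> Tuple[float, float]:
    """Return (optimistic, pessimistic) TTFT for one statically batched prefill.

    optimistic:  every request waits for the slowest prompt in the batch
    pessimistic: prefill work is serialized, so the per-prompt times add up
    """
    if not solo_prefill_ms:
        raise ValueError("need at least one request")
    return max(solo_prefill_ms), sum(solo_prefill_ms)


def ttft_inflation(solo_prefill_ms: Sequence[float]) -> List[float]:
    """Optimistic inflation factor per request: batched TTFT / standalone TTFT."""
    optimistic, _ = batched_ttft_bounds_ms(solo_prefill_ms)
    return [optimistic / solo for solo in solo_prefill_ms]


if __name__ == "__main__":
    solo = [40.0, 60.0, 250.0]  # made-up standalone prefill times in ms
    print(batched_ttft_bounds_ms(solo))
    print([round(x, 2) for x in ttft_inflation(solo)])
```

Even under the optimistic bound, the shortest prompt in this made-up batch waits several times longer than it would alone, which is why short interactive requests suffer most from being batched with long prompts.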
Why Measure Before Optimizing?
Optimization without measurement is guesswork. Three common mistakes teams make:
- Optimizing the wrong bottleneck. A system with 200ms TTFT bottlenecked by prefill compute gains nothing from decode speedups; you must first identify which phase needs attention.
- Introducing regressions. Quantization, for example, can reduce latency but degrade output quality; measurement reveals the accuracy-speed tradeoff.
- Overcomplicating the stack. If your baseline TTFT is 150ms and your target is 100ms, streaming (which adds client-side buffering complexity) may not be necessary; a simpler batch-size adjustment might suffice.
Measure first to make informed architecture decisions.
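As a small illustration of the first mistake, the sketch below splits a measured request into prefill and decode shares so you can see which phase dominates. It assumes you already have a TTFT and an end-to-end latency for the same request; the function name and the 50% threshold are arbitrary choices for this example.

```python
def phase_breakdown(ttft_ms: float, end_to_end_ms: float) -> dict:
    """Report what fraction of a request was spent before vs. after the first token."""
    if end_to_end_ms <= 0 or not 0 <= ttft_ms <= end_to_end_ms:
        raise ValueError("expected 0 <= ttft_ms <= end_to_end_ms and end_to_end_ms > 0")
    prefill_share = ttft_ms / end_to_end_ms
    return {
        "prefill_share": prefill_share,
        "decode_share": 1.0 - prefill_share,
        # Arbitrary cutoff: whichever phase takes at least half the time.
        "dominant_phase": "prefill" if prefill_share >= 0.5 else "decode",
    }


if __name__ == "__main__":
    # Short prompt, long answer: decode dominates.
    print(phase_breakdown(ttft_ms=200, end_to_end_ms=1480))
    # Long prompt, short answer: prefill dominates.
    print(phase_breakdown(ttft_ms=900, end_to_end_ms=1100))
```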
Instrumenting a Baseline Inference Server
Below is a minimal Python benchmark script that uses vLLM's offline `LLM` API (vLLM is a widely used production inference framework) with custom instrumentation to record approximate TTFT, TPS, and per-request latency:
```python
import time

import numpy as np
from vllm import LLM, SamplingParams


class LatencyInstrument:
    """Capture inference metrics: TTFT, TPS, and per-request latency."""

    def __init__(self):
        self.results = []

    def record(self, prompt_id, prefill_ms, decode_tokens, decode_ms):
        """Record a single inference result."""
        ttft = prefill_ms
        # Guard against division by zero when no decode time was measured.
        tps = decode_tokens / (decode_ms / 1000.0) if decode_ms > 0 else 0.0
        end_to_end_ms = prefill_ms + decode_ms
        self.results.append({
            'prompt_id': prompt_id,
            'ttft_ms': ttft,
            'tps': tps,
            'end_to_end_ms': end_to_end_ms,
            'decode_tokens': decode_tokens,
        })

    def report(self):
        """Print summary statistics."""
        if not self.results:
            print("No results recorded.")
            return
        ttfts = [r['ttft_ms'] for r in self.results]
        tps_list = [r['tps'] for r in self.results]
        e2e = [r['end_to_end_ms'] for r in self.results]
        print(f"TTFT: p50={np.percentile(ttfts, 50):.1f}ms, "
              f"p99={np.percentile(ttfts, 99):.1f}ms, "
              f"max={max(ttfts):.1f}ms")
        print(f"TPS: mean={np.mean(tps_list):.1f}, "
              f"min={min(tps_list):.1f}, "
              f"max={max(tps_list):.1f}")
        print(f"End-to-end: p50={np.percentile(e2e, 50):.1f}ms, "
              f"p99={np.percentile(e2e, 99):.1f}ms")


# Load a model. Prefix caching is disabled so that the second pass over the
# same prompt re-runs prefill instead of reusing cached KV blocks.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,
    enable_prefix_caching=False,
)
first_token_params = SamplingParams(max_tokens=1)
sampling_params = SamplingParams(max_tokens=128)

# Run benchmark
instrument = LatencyInstrument()
prompts = [
    "Explain quantum computing in two paragraphs.",
    "What are the top 5 programming languages in 2026?",
    "Write a Python function to compute Fibonacci numbers.",
]

# Warm-up request so one-time startup costs do not land in the first sample.
llm.generate(prompts=["Warm-up."], sampling_params=first_token_params)

for i, prompt in enumerate(prompts):
    # Pass 1: generate a single token. The offline API does not stream, so this
    # wall-clock time is used as an approximation of TTFT (prefill + one step).
    t_start = time.perf_counter()
    llm.generate(prompts=[prompt], sampling_params=first_token_params)
    prefill_ms = (time.perf_counter() - t_start) * 1000

    # Pass 2: full generation, timed end to end.
    t_start = time.perf_counter()
    outputs = llm.generate(prompts=[prompt], sampling_params=sampling_params)
    elapsed_ms = (time.perf_counter() - t_start) * 1000
    output_tokens = len(outputs[0].outputs[0].token_ids)

    # Approximation: decode time is the full run minus the TTFT estimate, and
    # the first token is attributed to prefill rather than decode.
    decode_ms = max(elapsed_ms - prefill_ms, 0.0)
    decode_tokens = max(output_tokens - 1, 0)
    instrument.record(i, prefill_ms, decode_tokens, decode_ms)

instrument.report()
```
Run this benchmark with a range of prompt lengths and batch sizes to establish your baseline. Record results in a CSV for trend analysis.
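A minimal sketch of the CSV step, assuming result rows shaped like the dictionaries stored in `LatencyInstrument.results`. The `run_label` column and the helper name are additions for this example, intended to let you tag each run with its prompt length or batch size.

```python
import csv
import os
import tempfile
from pathlib import Path

FIELDS = ["prompt_id", "ttft_ms", "tps", "end_to_end_ms", "decode_tokens"]


def append_results_csv(results, path, run_label):
    """Append result dicts to a CSV file, writing the header only for a new file."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["run_label"] + FIELDS)
        if is_new:
            writer.writeheader()
        for row in results:
            record = {"run_label": run_label}
            record.update({key: row[key] for key in FIELDS})
            writer.writerow(record)


if __name__ == "__main__":
    # Made-up sample row; in practice pass instrument.results.
    sample = [{"prompt_id": 0, "ttft_ms": 120.0, "tps": 95.0,
               "end_to_end_ms": 1400.0, "decode_tokens": 127}]
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "baseline.csv")
        append_results_csv(sample, out, run_label="batch1-short-prompts")
        print(Path(out).read_text())
```

Appending rather than overwriting keeps every run in one file, so a later plot or spreadsheet can group by `run_label`.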
Key Inference Metrics Table
| Metric | Single Request | Batch of 8 | Notes |
|---|---|---|---|
| TTFT (7B LLM, A100) | 100-200ms | 300-800ms | Increases with batch size because requests wait on a shared prefill |
| TPS per request (decode phase) | ~100 tokens/sec | ~100 tokens/sec | Roughly constant per request at small batch sizes; aggregate throughput grows with batch size |
| Time-per-token (decode) | ~10ms | ~10ms | Inverse of per-request TPS; prefill overhead not included |
| End-to-end (128 tokens) | ~1.4-1.5s | ~1.6-2.1s | TTFT plus 128 tokens at ~10ms each; decode dominates, so latency is predictable if TPS is known |
The table shows that while TTFT increases with batching, per-request TPS (and thus TPT in decode) stays roughly constant, so aggregate throughput grows with batch size. This is the fundamental tradeoff: batching more improves throughput, but users see higher TTFT.
Profiling Tool Recommendations
Use these open-source tools to profile baseline inference:
- vLLM's built-in statistics: vLLM's OpenAI-compatible server logs periodic throughput statistics by default and exposes Prometheus metrics at `/metrics`, including a TTFT histogram (`vllm:time_to_first_token_seconds`). Scrape that endpoint to obtain the TTFT distribution.
- NVIDIA Nsight Systems (`nsys`): GPU-level profiling to identify compute bottlenecks during prefill vs. decode. Use `nsys profile --trace=cuda,nvtx -o inference_profile python your_inference.py`.
- Custom instrumentation: Hook into your inference library (Hugging Face Transformers, vLLM, TGI) to capture timestamps at key points: prompt arrival, prefill start, first token emission, last token emission.
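For the custom-instrumentation option, one library-agnostic approach is to wrap whatever token iterator your framework returns. The sketch below is an assumption-laden example, not any library's API: `StreamTimings` and `timed_stream` are hypothetical names, and it captures only arrival, first-token, and last-token times, because the prefill start timestamp requires a hook inside the engine itself.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTimings:
    arrival: float                      # when the prompt arrived (same clock as below)
    first_token: Optional[float] = None
    last_token: Optional[float] = None
    tokens: int = 0

    @property
    def ttft_ms(self) -> Optional[float]:
        if self.first_token is None:
            return None
        return (self.first_token - self.arrival) * 1000

    @property
    def decode_tps(self) -> Optional[float]:
        """Tokens per second after the first token; None if it cannot be computed."""
        if self.tokens < 2 or self.last_token == self.first_token:
            return None
        return (self.tokens - 1) / (self.last_token - self.first_token)


def timed_stream(token_stream: Iterable[str], timings: StreamTimings,
                 clock: Callable[[], float] = time.perf_counter) -> Iterator[str]:
    """Yield tokens unchanged while stamping first and last emission times."""
    for token in token_stream:
        now = clock()
        if timings.first_token is None:
            timings.first_token = now
        timings.last_token = now
        timings.tokens += 1
        yield token


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a real streaming generator.
        for word in ["Hello", ",", " world"]:
            time.sleep(0.01)
            yield word

    timings = StreamTimings(arrival=time.perf_counter())
    text = "".join(timed_stream(fake_stream(), timings))
    print(text, timings.tokens)
```

Because the wrapper is a generator, it adds timestamps without buffering the stream, so the measurement does not change what the client sees.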
Key Takeaways
- TTFT, TPS, and TPT are the fundamental inference metrics; measure all three to understand your bottleneck.
- Establish a baseline before implementing optimization techniques to validate that each change actually helps.
- Profile both single-request and batch scenarios because optimization strategies (batching, caching) have different effects on each.
- Use percentiles, not just means: p99 TTFT matters more to users than average TTFT.
- Automate measurement into your CI/CD pipeline so regressions are caught early.
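The last two takeaways can be combined into a small regression gate for CI. The sketch below uses a pure-Python nearest-rank percentile (which can differ slightly from NumPy's default interpolation) and a 10% tolerance; both the tolerance and the function names are assumptions for illustration, and the sample data is made up.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def ttft_regressed(baseline_ms: Sequence[float], candidate_ms: Sequence[float],
                   pct: float = 99.0, tolerance: float = 0.10) -> bool:
    """True if the candidate's percentile exceeds the baseline's by more than the tolerance."""
    base = percentile_nearest_rank(baseline_ms, pct)
    cand = percentile_nearest_rank(candidate_ms, pct)
    return cand > base * (1 + tolerance)


if __name__ == "__main__":
    baseline = [100.0] * 99 + [400.0]
    candidate = [90.0] * 98 + [600.0, 650.0]
    # The candidate has a lower mean but a much worse tail.
    print(sum(baseline) / len(baseline), sum(candidate) / len(candidate))
    print(ttft_regressed(baseline, candidate))
```

In this made-up data the mean improves while p99 gets much worse, so a gate on the mean alone would let the regression through.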
Frequently Asked Questions
What is a good baseline TTFT target for a production LLM?
TTFT targets depend on your use case. For real-time chat (e.g., OpenAI ChatGPT), 100-200ms is considered responsive. For batch summarization, 1-2 seconds is acceptable. For voice-activated assistants, sub-100ms is required. Start by measuring your current system, then set targets based on user research, not industry benchmarks.
How do I measure TTFT on a distributed system with multiple GPUs?
Use a load-testing client that sends timestamped requests to your load-balanced inference API and records when each request is sent and when the first response token arrives. Measured this way, TTFT includes network latency, load-balancer routing, and queueing in addition to prefill, which is what users actually experience. Measure end-to-end from the client perspective even when the framework (for example vLLM running across multiple GPUs) reports its own server-side TTFT.
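A client-side measurement loop might look like the sketch below. It deliberately leaves the transport abstract: `stream_fn` is a hypothetical callable that sends one prompt to your API and yields response chunks as they arrive, so you would supply your own HTTP or gRPC streaming call. The concurrency level is an arbitrary example value.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def measure_client_ttft_ms(stream_fn: Callable[[str], Iterable[str]], prompt: str) -> float:
    """Time from sending the request until the first streamed chunk arrives."""
    sent = time.perf_counter()
    first = None
    # Drain the whole stream so the server does the full amount of work.
    for _chunk in stream_fn(prompt):
        if first is None:
            first = time.perf_counter()
    if first is None:
        raise RuntimeError("stream ended before any token arrived")
    return (first - sent) * 1000


def run_load(stream_fn: Callable[[str], Iterable[str]], prompts: List[str],
             concurrency: int = 4) -> List[float]:
    """Issue prompts concurrently and return one client-side TTFT per prompt."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda p: measure_client_ttft_ms(stream_fn, p), prompts))


if __name__ == "__main__":
    def fake_stream(prompt: str):
        # Stand-in for a real streaming API call.
        time.sleep(0.02)  # pretend network + queueing + prefill
        for word in ("ok", "done"):
            yield word

    ttfts = run_load(fake_stream, ["a", "b", "c", "d"], concurrency=2)
    print([round(t, 1) for t in ttfts])
```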
Can I improve TTFT by throwing more GPU memory at the problem?
GPU memory alone does not reduce TTFT; it increases batch capacity. TTFT is limited by the time it takes to run the prefill computation on the available compute cores. To reduce TTFT, you need faster compute (a faster GPU, or more GPUs under tensor parallelism) or shorter prompts. Extra memory helps indirectly: it allows larger batches and more cached state without OOM errors, which reduces queueing delay under load rather than the prefill time itself.
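A back-of-the-envelope calculation shows why memory capacity does not appear in the TTFT bound. The sketch below uses the common approximation of about 2 FLOPs per parameter per token for a forward pass and ignores attention's quadratic term and memory traffic, so it is a floor rather than a prediction. The 312 TFLOPS figure is NVIDIA's published dense FP16/BF16 Tensor Core peak for the A100; the 50% utilization value is an assumption.

```python
def prefill_compute_floor_ms(params: float, prompt_tokens: int,
                             peak_flops: float, utilization: float = 1.0) -> float:
    """Lower bound on prefill time from arithmetic alone.

    Approximation: a forward pass costs ~2 FLOPs per parameter per token.
    Memory capacity does not enter the formula; only compute rate does.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    flops = 2.0 * params * prompt_tokens
    return flops / (peak_flops * utilization) * 1000


if __name__ == "__main__":
    a100_peak = 312e12  # dense FP16/BF16 Tensor Core peak, FLOP/s
    for tokens in (250, 1000, 4000):
        ideal = prefill_compute_floor_ms(7e9, tokens, a100_peak)
        realistic = prefill_compute_floor_ms(7e9, tokens, a100_peak, utilization=0.5)
        print(f"{tokens} tokens: {ideal:.1f} ms ideal, {realistic:.1f} ms at 50% utilization")
```

The bound scales with prompt length and inversely with compute rate, which matches the two levers named above: faster compute or shorter prompts.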
Should I optimize for TTFT or throughput first?
Optimize TTFT first if your workload is interactive (chat, search). Optimize throughput (TPS) if your workload is batch or asynchronous (document summarization, log analysis). Most production systems optimize both, but the tradeoff (batching increases TTFT) means you cannot maximize both simultaneously. Measure your actual user distribution and set targets accordingly.