Time-to-First-Token: Why It Matters (2026)
Time-to-first-token (TTFT) is the single most impactful metric for user-perceived latency in LLM applications. TTFT measures the wall-clock time from when a user hits send to when the first visible character appears on their screen. A 2-second TTFT feels sluggish; a 100-millisecond TTFT feels responsive. Yet TTFT is often misunderstood: it is dominated by the prefill phase (processing the entire prompt once), not by individual token generation. This article explains why TTFT matters, what causes it, and which architecture changes reduce it effectively.
Why Users Perceive TTFT, Not Total Latency
Human perception of responsiveness follows a well-studied pattern (Nielsen, 2009): responses under about 100ms feel instantaneous, and delays beyond about 1 second interrupt the user's flow of thought. For chat interfaces, a practical rule of thumb fills in the middle: 100-300ms feels responsive, 300ms-1s feels sluggish, and anything over 1s feels broken. When a user sends a chat prompt to an LLM, they begin watching the screen immediately. If no text appears for 1-2 seconds, they assume the system is slow or broken. This is the TTFT problem.
Importantly, once tokens start streaming in, users tolerate token-generation latency (typically tens of milliseconds per token) because they see progress. A system with 500ms TTFT followed by smooth token streaming feels much slower than a system with 100ms TTFT followed by slower token generation, even if the total response time is identical (Nielsen-based UX research, 2025).
This asymmetry is why TTFT tends to dominate optimization priorities in real product systems. Teams running interactive LLM services, including large providers such as OpenAI and Anthropic, generally treat cutting TTFT as a higher priority than optimizing per-token speed.
What Causes TTFT Latency?
TTFT is primarily determined by the prefill phase of the transformer. During prefill, the model pushes the entire input prompt through every transformer layer in order, but within each layer all prompt tokens are processed in parallel, including the attention scores between them. The key insight: prefill is compute-bound rather than memory-bound, because each weight matrix loaded from GPU memory is reused across every token in the prompt. Prefill must complete entirely before the first token can be generated.
The equation for prefill latency (in seconds) is approximately:
TTFT ≈ (prompt_tokens * model_params * operations_per_param) / (GPU_compute_tflops * 10^12 * utilization)
Here operations_per_param is about 2 (one multiply and one add per weight per token), and utilization is the fraction of peak throughput the kernels actually sustain. In practical terms, prefill latency grows roughly linearly with prompt length (a 2000-token prompt takes about twice as long as a 1000-token prompt) and is inversely proportional to GPU throughput. A 7B-parameter model with a 1000-token prompt on a single A100 (312 TFLOPS peak) typically achieves TTFT of 150-250ms; on a V100 (125 TFLOPS) it grows to 400-600ms.
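The formula above can be turned into a small calculator. The sketch below is only an estimate under stated assumptions: `estimate_prefill_ms` is a hypothetical helper name, 2 operations per parameter per token is the usual dense-transformer approximation, and the utilization values are illustrative guesses rather than measurements.

```python
def estimate_prefill_ms(prompt_tokens, model_params, peak_tflops,
                        utilization, ops_per_param=2.0):
    """Rough prefill time in milliseconds from the FLOP-count formula."""
    total_flops = prompt_tokens * model_params * ops_per_param
    # Convert TFLOPS to FLOP/s and scale by the fraction actually sustained.
    effective_flops_per_s = peak_tflops * 1e12 * utilization
    return total_flops / effective_flops_per_s * 1000.0


# 7B parameters, 1000-token prompt, A100 at 312 TFLOPS peak.
# Utilization values are assumptions chosen for illustration.
for utilization in (1.0, 0.3, 0.2):
    ms = estimate_prefill_ms(1000, 7e9, 312, utilization)
    print(f"utilization={utilization:.0%}: about {ms:.0f} ms")
```

Under this model the 150-250ms range quoted for the A100 corresponds to roughly 20-30% sustained utilization; at 100% of peak the same prompt would take about 45ms.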
Three additional factors inflate TTFT in production:
- Batch queueing: If the model is processing another request's prefill, your request waits in a queue. Even with request batching (combining multiple prompts), the batch's slowest prefill adds latency to all requests in the batch.
- Network latency: The prompt must travel from client to server. In geographically distributed systems, add 10-50ms.
- Python and framework overhead: Deserialization, tokenization, and the serving framework's own scheduling (in vLLM, for example) add 5-20ms.
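These components add up to the TTFT a user actually sees. The following sketch sums them as a simple budget; `TTFTBudget` and its field names are hypothetical, the prefill and overhead figures are the ranges quoted in this article, and queueing is set to zero on the assumption of an idle server.

```python
from dataclasses import dataclass


@dataclass
class TTFTBudget:
    """End-to-end TTFT as a sum of its components, all in milliseconds."""
    prefill_ms: float
    queue_ms: float = 0.0      # time waiting behind other requests
    network_ms: float = 0.0    # client-to-server transit
    framework_ms: float = 0.0  # tokenization, deserialization, scheduling

    def total_ms(self) -> float:
        return (self.prefill_ms + self.queue_ms
                + self.network_ms + self.framework_ms)


# Best and worst cases from the quoted ranges, with no queueing.
best = TTFTBudget(prefill_ms=150, network_ms=10, framework_ms=5)
worst = TTFTBudget(prefill_ms=250, network_ms=50, framework_ms=20)
print(f"best case:  {best.total_ms():.0f} ms")
print(f"worst case: {worst.total_ms():.0f} ms")
```

Keeping the components separate makes it easier to see which one to attack first: under load, `queue_ms` is often the term that grows fastest.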
Prefill vs. Decode: The Latency Sources
Understanding the difference between prefill and decode is critical to optimizing TTFT. During prefill, the model processes all input tokens in parallel and outputs one token. During decode, the model processes one new token, attending to the cached keys and values (the KV cache) of all earlier tokens, and outputs one token. Decode repeats until the response is complete.
| Phase | Input Size | Compute Type | Latency (per request) | Key Constraint |
|---|---|---|---|---|
| Prefill | Entire prompt (100s-10000s tokens) | Parallel computation over all inputs | 150-800ms (7B model, A100) | Compute throughput (TFLOPS) |
| Decode | 1 new token + KV cache | Sequential token generation | 8-15ms per token (7B model, A100) | Memory bandwidth (to fetch weights, KV cache) |
On the inference server, TTFT is almost entirely prefill latency; queueing, network, and framework overhead come on top of it. Once prefill completes, the first token is sampled nearly instantly (about 1-2ms), and further tokens stream out at the decode rate. Decode latency affects total response time but not TTFT.
This distinction explains why optimizations aimed at decode's memory-bandwidth bottleneck (weight-only quantization, for example) do little for TTFT, while optimizations that cut prefill compute (tensor parallelism, smaller models, shorter prompts) reduce it directly.
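To see the two phases separately in practice, it helps to time a token stream on the client side. The sketch below is a minimal, library-agnostic helper: `measure_stream` and `fake_stream` are hypothetical names, and the fake stream simply sleeps to imitate a prefill pause followed by steady decode. Any real streaming client that yields tokens could be passed in its place.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (ttft_seconds, mean_inter_token_seconds) for a token stream."""
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival of the first token marks TTFT
        last = now
        count += 1
    if first is None:
        return None, None  # the stream produced nothing
    mean_gap = (last - first) / (count - 1) if count > 1 else None
    return first - start, mean_gap


def fake_stream(prefill_s: float = 0.2, decode_s: float = 0.01, n: int = 5):
    """Stand-in for a model: one long pause, then evenly spaced tokens."""
    time.sleep(prefill_s)
    for i in range(n):
        if i > 0:
            time.sleep(decode_s)
        yield f"tok{i}"


ttft, gap = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, mean inter-token gap: {gap * 1000:.1f} ms")
```

The clock is injectable so the helper can be checked deterministically; in normal use the default `time.perf_counter` is enough.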
Why Batching Hurts TTFT
Request batching is a double-edged sword for TTFT. Batching multiple requests together (e.g., processing five prompts in a single prefill pass) increases throughput (tokens per second) but also increases TTFT for individual requests.
Consider a scenario: five users send requests with prompts of different lengths (500, 600, 700, 800, 900 tokens). At a typical A100 prefill throughput of about 3000 tokens/sec, each request processed in isolation would see a TTFT of roughly 170-300ms (prompt length divided by throughput). If the five are batched into a single prefill pass, the batch must process 500+600+700+800+900 = 3500 tokens, which takes about 3500/3000 ≈ 1.17 seconds. Every request waits for the whole batch's prefill to complete, so each experiences a TTFT of about 1.17 seconds, several times its isolated figure. The throughput gain (handling five requests in one pass instead of five separate passes) comes at the cost of TTFT inflation.
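The arithmetic for this scenario can be checked directly. The sketch below takes the 3000 tokens/sec figure at face value and assumes prefill time is simply tokens divided by throughput, which ignores any efficiency gained or lost by batching; `prefill_ms` is a hypothetical helper.

```python
PREFILL_TOKENS_PER_S = 3000  # throughput figure quoted in the text


def prefill_ms(tokens: int, throughput: float = PREFILL_TOKENS_PER_S) -> float:
    """Prefill time in milliseconds under a fixed-throughput model."""
    return tokens / throughput * 1000.0


prompts = [500, 600, 700, 800, 900]

# Each request prefilled alone on an otherwise idle GPU.
isolated = [prefill_ms(n) for n in prompts]

# All five prefilled in one pass; nobody gets a token until it finishes.
batched = prefill_ms(sum(prompts))

for n, ms in zip(prompts, isolated):
    print(f"{n:>4} tokens alone: {ms:6.0f} ms   batched: {batched:6.0f} ms")
```

Under this model the shortest prompt pays the largest penalty, since its wait is set by the combined length of the whole batch rather than its own.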
Production systems manage this tradeoff using adaptive batching (often called continuous batching): accept new requests while the current batch is in flight, and emit each request's first token as soon as its own prefill completes, even if the rest of the batch is not finished. This is non-trivial to implement and is one reason vLLM and similar frameworks are widely adopted.
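A toy model shows why emitting per request helps. The sketch below is not how vLLM or any specific scheduler works; it assumes all requests arrive at time zero, prefills run one after another in arrival order at a fixed throughput, and the function names are hypothetical.

```python
from itertools import accumulate
from typing import List


def ttft_wait_for_batch(prompt_tokens: List[int], throughput: float) -> List[float]:
    """Every request waits until the whole batch has been prefilled (ms)."""
    done_ms = sum(prompt_tokens) / throughput * 1000.0
    return [done_ms] * len(prompt_tokens)


def ttft_emit_when_ready(prompt_tokens: List[int], throughput: float) -> List[float]:
    """Each request gets its first token once its own prefill is done (ms)."""
    return [c / throughput * 1000.0 for c in accumulate(prompt_tokens)]


prompts = [500, 600, 700, 800, 900]
waiting = ttft_wait_for_batch(prompts, 3000)
ready = ttft_emit_when_ready(prompts, 3000)

for n, w, r in zip(prompts, waiting, ready):
    print(f"{n:>4} tokens: wait-for-batch {w:6.0f} ms, emit-when-ready {r:6.0f} ms")
print(f"mean TTFT: {sum(waiting) / len(waiting):.0f} ms vs {sum(ready) / len(ready):.0f} ms")
```

The last request finishes at the same time in both schemes, but the mean TTFT falls because earlier requests no longer wait for later ones.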
Techniques to Reduce TTFT
Four architectural techniques directly reduce TTFT. Later articles in this series cover these in detail, but here is an overview:
- Tensor parallelism: Split the model across multiple GPUs so prefill computation is parallelized. Two GPUs come close to halving prefill latency and four approach a 4x reduction, with inter-GPU communication overhead accounting for the shortfall.
- Prompt caching / KV cache reuse: If the same prompt prefix is used repeatedly (common in multi-turn chat and with shared system prompts), cache its keys and values so repeated requests skip prefill for that prefix and process only the new tokens. When the cached prefix covers most of the prompt, TTFT drops to a small fraction of the uncached figure.
- Smaller models or shorter prompts: Prefill latency scales roughly linearly with both model size and prompt length. Use a smaller model where quality allows, or summarize and filter prompts to reduce token count.
- Model compression (quantization): Quantized models can run faster kernels (INT8 instead of FP16), reducing prefill time by 20-40% while keeping accuracy high for most models.
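Of these, prefix caching is the easiest to illustrate in a few lines. The sketch below is a toy model only: `PrefixCache` is a hypothetical class that remembers earlier token-id sequences and reports how many tokens of a new prompt would still need prefill. Real engines typically cache at block granularity and store the actual key/value tensors, which this example does not attempt.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy cache: tracks prompts seen so far, not real KV tensors."""

    def __init__(self) -> None:
        self._seen: List[List[int]] = []

    def tokens_to_prefill(self, prompt: Sequence[int]) -> int:
        """Tokens that still need prefill after reusing the longest cached prefix."""
        cached = max((shared_prefix_len(prompt, s) for s in self._seen), default=0)
        self._seen.append(list(prompt))
        return len(prompt) - cached


system_prompt = list(range(800))        # stand-in for an 800-token system prompt
turn_1 = system_prompt + [9001, 9002]
turn_2 = turn_1 + [9003, 9004, 9005]    # same conversation, three new tokens

cache = PrefixCache()
print("turn 1 prefill tokens:", cache.tokens_to_prefill(turn_1))
print("turn 2 prefill tokens:", cache.tokens_to_prefill(turn_2))
```

Since prefill time is roughly proportional to the tokens processed, the second turn's TTFT in this model would be a small fraction of the first's.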
Token streaming (article 3) does NOT reduce TTFT; it only masks it by showing tokens to the user while more are being generated. Streaming is a UX trick, not a latency fix.
Code Example: Measuring TTFT Impact of Tensor Parallelism
Below is a simplified example for comparing single-GPU and multi-GPU TTFT. Run it once per configuration, each in a fresh process (pass 1 or 2 as the first argument):
import sys
import time
from vllm import LLM, SamplingParams

def main():
    # Tensor-parallel degree from the command line: 1 = single GPU, 2 = two GPUs.
    # Each configuration needs its own process; two engines that each reserve
    # 90% of GPU memory cannot share a device.
    tp_size = int(sys.argv[1]) if len(sys.argv) > 1 else 1
    llm = LLM(model="meta-llama/Llama-2-7b-hf",
              tensor_parallel_size=tp_size,
              gpu_memory_utilization=0.9,
              enable_prefix_caching=False)  # repeated prompts must not hit the cache

    # max_tokens=1 stops after the first generated token, so the elapsed time
    # approximates TTFT (prefill plus one sampling step), not a full response.
    sampling_params = SamplingParams(max_tokens=1)
    long_prompt = "Explain the history of artificial intelligence. " * 125  # roughly 1000 tokens

    # Warm-up request so one-time initialization is not counted.
    llm.generate(["Hello"], sampling_params)

    timings_ms = []
    for _ in range(5):
        t0 = time.perf_counter()
        llm.generate([long_prompt], sampling_params)
        timings_ms.append((time.perf_counter() - t0) * 1000)

    median_ms = sorted(timings_ms)[len(timings_ms) // 2]
    print(f"tensor_parallel_size={tp_size}: median TTFT ~ {median_ms:.1f}ms")

if __name__ == "__main__":
    main()
Running this on two A100s shows approximately 1.8-1.9x speedup (not quite 2x due to communication overhead), reducing TTFT from ~400ms to ~220ms.
Key Takeaways
- TTFT is mostly prefill latency: Users judge responsiveness by the time until the first token, not by total response time.
- Sub-100ms TTFT feels instant: Optimize toward this target for interactive applications.
- Batching trades TTFT for throughput: You cannot maximize both simultaneously without advanced scheduling.
- Tensor parallelism and caching are the primary TTFT reducers: These are more impactful than decode optimizations.
- Network + framework overhead adds 15-70ms in real systems, so target TTFT of 50-100ms on the inference server to achieve 100-200ms end-to-end.
Frequently Asked Questions
Is TTFT the same as latency?
No. Latency usually refers to end-to-end time (TTFT plus decode time for a full response). TTFT is just the time to the first token. For a 200-token response at 100 tokens/sec, decode takes 2 seconds, so total latency is TTFT + 2s. TTFT dominates perceived responsiveness.
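The relationship is simple enough to write down. This is a minimal sketch with a hypothetical helper name; like the answer above, it treats all output tokens as decoded at a constant rate.

```python
def total_latency_s(ttft_s: float, output_tokens: int,
                    decode_tokens_per_s: float) -> float:
    """End-to-end response time: time to first token plus decode time."""
    return ttft_s + output_tokens / decode_tokens_per_s


# 200-token response at 100 tokens/sec, with an assumed 200 ms TTFT.
total = total_latency_s(ttft_s=0.2, output_tokens=200, decode_tokens_per_s=100)
print(f"total latency: {total:.1f} s, of which TTFT is {0.2 / total:.0%}")
```

TTFT is a small share of the total here, which is exactly why it has to be tracked as its own metric rather than inferred from end-to-end time.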
Can I reduce TTFT by using a smaller model?
Yes. TTFT scales roughly linearly with model size: a 3B model has a TTFT of about 40-50% of a 7B model's on the same hardware. However, smaller models generally produce lower-quality outputs. The tradeoff (model size vs. TTFT vs. quality) is application-dependent.
Why does prompt length affect TTFT so much?
During prefill, the transformer must run every prompt token through every layer and compute attention scores between tokens. A 1000-token prompt therefore requires at least 10x the compute of a 100-token prompt, and somewhat more because attention cost grows quadratically with length. Reducing prompt length (via summarization, context filtering, or chunking) is one of the highest-leverage TTFT optimizations.
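A rough FLOP count makes the scaling concrete. The sketch below is an approximation, not a profiler: it uses about 2 FLOPs per parameter per token for the weight multiplications and about 4 * layers * d_model * n^2 for the attention score and value products, ignoring causal-mask savings. The default shape (32 layers, hidden size 4096) is that of a Llama-2-7B-class model, and `prefill_flops` is a hypothetical helper.

```python
def prefill_flops(n_tokens: int, n_params: float = 7e9,
                  n_layers: int = 32, d_model: int = 4096):
    """Return (linear_flops, attention_flops) for one prefill pass."""
    linear = 2.0 * n_params * n_tokens                    # grows with n
    attention = 4.0 * n_layers * d_model * n_tokens ** 2  # grows with n^2
    return linear, attention


for n in (100, 1000, 10000):
    linear, attention = prefill_flops(n)
    share = attention / (linear + attention)
    print(f"{n:>6} tokens: {linear + attention:.2e} FLOPs, attention share {share:.1%}")

ratio = sum(prefill_flops(1000)) / sum(prefill_flops(100))
print(f"1000-token vs 100-token compute: {ratio:.1f}x")
```

At a few thousand tokens the linear term dominates, so prefill looks close to linear in prompt length; the quadratic attention term only becomes a large share for very long prompts.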
Does using a quantized model always reduce TTFT?
Quantized models (INT8) typically reduce TTFT by 20-40% because the kernels are faster. However, quantization can degrade model accuracy, especially on reasoning tasks. Measure TTFT and accuracy together before deploying quantized models.