Building a Tuned LLM Inference Stack
Building a production LLM serving system means combining the optimization techniques from previous articles into a coherent architecture that balances time to first token (TTFT), throughput, cost, and reliability. Naively stacking optimizations often backfires: KV caching saves compute but consumes GPU memory, and latency variance rises when cache blocks are evicted; batching improves throughput but increases TTFT; quantization saves memory but may degrade quality. This article walks through a complete tuned serving stack, explains how the techniques interact, and provides a production-oriented configuration.
The Full Optimization Stack
A production system combines 5-6 optimization layers:
User Request
↓
[Load Balancer: Route to best server]
↓
[API Server: FastAPI/Triton, handle HTTP]
↓
[Token Streaming: SSE/WebSocket]
↓
[Request Batching: Continuous batching with token budgets]
↓
[KV Cache: In-memory attention cache]
↓
[Prompt Cache: Reuse encoded prefixes]
↓
[Quantized Model: INT8/INT4 weights]
↓
[Multi-GPU Inference: Tensor/pipeline parallelism]
↓
[Response Buffering: Return streamed tokens]
↓
Client
Each layer adds complexity and tradeoffs. A fully tuned stack can handle on the order of 100-1000x more requests than an unbatched single-GPU baseline, though the actual gain depends on the model, hardware, and workload.
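The "continuous batching with token budgets" layer in the diagram can be illustrated with a toy scheduler. This is a simplified sketch, not vLLM's actual scheduler: the `Request` class, the `schedule_step` function, and the FIFO admission policy are hypothetical, and the default limits simply mirror the 8192-token and 64-sequence values used in this article.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int  # tokens that still need prefill (0 once decoding)


def schedule_step(running, waiting, max_tokens=8192, max_seqs=64):
    """Pick the requests to run in one engine step under a token budget.

    Decoding requests cost one token each and are scheduled first; waiting
    requests are then admitted in FIFO order while their prompts still fit.
    """
    batch = []
    budget = max_tokens
    # Each running request generates exactly one token in this step.
    for req in running:
        if len(batch) >= max_seqs or budget < 1:
            break
        batch.append(req)
        budget -= 1
    # Admit new requests while the whole prompt fits in the remaining budget.
    while waiting and len(batch) < max_seqs and waiting[0].prompt_tokens <= budget:
        req = waiting.popleft()
        budget -= req.prompt_tokens
        batch.append(req)
    return batch, budget


running = [Request("a", 0), Request("b", 0)]
waiting = deque([Request("c", 3000), Request("d", 6000), Request("e", 100)])
batch, remaining = schedule_step(running, waiting)
print([r.request_id for r in batch], remaining)  # ['a', 'b', 'c'] 5190
```

Request "d" does not fit in the remaining budget, so it waits for the next step, and under strict FIFO "e" waits behind it. That queueing is the TTFT cost of batching described above.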
Architecture Decision Tree
Choosing which optimizations to implement depends on your constraints:
START: What is your request rate?
  < 100 requests/sec, TTFT-critical?
    → YES: Optimize TTFT (tensor parallelism, smaller model, prompt caching)
    → NO: Optimize throughput (batching, quantization)
  > 100 requests/sec?
    → YES: Multi-GPU + batching + quantization + caching
    → Consider: Distributed sharding (split model across GPUs/nodes)
Memory-constrained (< 40GB VRAM)?
  → Quantize weights to INT8 so the model and KV cache fit
  → Limit batch size, enable prompt caching to maximize cache hits
Quality-critical (reasoning, code)?
  → Avoid INT4 quantization
  → Use only INT8 + KV caching
  → Prefer larger models over optimization tricks
Cost-critical (minimize GPU hours)?
  → Maximize throughput: large batches, quantization, prompt caching
  → Trade TTFT for efficiency
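The tree above can also be written down as a small function, which makes the branching explicit and easy to test. This is only a sketch: the function name, its parameters, and the recommendation strings are hypothetical, and the thresholds are the ones quoted in the tree rather than measured values.

```python
def recommend_optimizations(requests_per_sec, ttft_critical=False,
                            vram_gb=80.0, quality_critical=False,
                            cost_critical=False):
    """Return suggested optimizations, mirroring the decision tree above."""
    recs = []
    # Traffic level decides the baseline strategy.
    if requests_per_sec > 100:
        recs += ["multi-GPU", "batching", "quantization", "caching"]
    elif ttft_critical:
        recs += ["tensor parallelism", "smaller model", "prompt caching"]
    else:
        recs += ["batching", "quantization"]
    # The remaining constraints add to the baseline.
    if vram_gb < 40:
        recs += ["INT8 quantization", "limited batch size", "prompt caching"]
    if quality_critical:
        recs += ["avoid INT4", "KV caching", "larger model"]
    if cost_critical:
        recs += ["large batches", "quantization", "prompt caching"]
    # Deduplicate while preserving order.
    return list(dict.fromkeys(recs))


print(recommend_optimizations(50, ttft_critical=True))
print(recommend_optimizations(500, vram_gb=24))
```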
Production Configuration: vLLM + FastAPI
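Below is a reference serving stack built on vLLM and FastAPI. Engine argument names vary between vLLM releases, so check them against your installed version: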
import json
import logging
import uuid

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configure the inference engine.
# Argument names change between vLLM releases; verify them against the
# engine arguments documented for the version you have installed.
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-2-7b-hf",
    # Multi-GPU setup
    tensor_parallel_size=2,        # Shard across 2 GPUs
    pipeline_parallel_size=1,      # No pipeline parallelism
    # Memory management
    gpu_memory_utilization=0.85,   # Let vLLM use up to 85% of each GPU's VRAM
    dtype="float16",
    # Batching and scheduling
    max_num_batched_tokens=8192,   # Max tokens scheduled per engine step
    max_num_seqs=64,               # Max sequences scheduled per engine step
    max_seq_len_to_capture=4096,   # Max sequence length covered by CUDA graphs
    # Caching strategies
    enable_prefix_caching=True,    # Reuse KV blocks of shared prompt prefixes
    block_size=16,                 # KV cache block size (tokens)
    # KV cache precision
    kv_cache_dtype="fp8",          # 8-bit KV cache
    # Preemption for bursty traffic
    preemption_mode="recompute",   # Recompute preempted requests instead of swapping
    # CPU swap space for KV cache blocks (GiB per GPU)
    swap_space=2,
)

# Start the async engine. Unlike the offline LLM class, it schedules
# concurrent requests into shared batches and yields partial outputs.
try:
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    logger.info(f"Loaded model: {engine_args.model}")
except Exception as e:
    logger.error(f"Failed to load model: {e}")
    raise

app = FastAPI(title="LLM Serving Stack", version="1.0.0")


# Request/response schemas
class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.9
    stream: bool = True


class CompletionResponse(BaseModel):
    text: str
    tokens: int


# Streaming response generator
async def stream_completion(prompt: str, params: SamplingParams):
    """Yield newly generated text as Server-Sent Events while decoding runs."""
    request_id = str(uuid.uuid4())
    sent = 0  # Number of characters already sent to the client
    # engine.generate() is an async generator; each output carries the
    # cumulative text so far, so only the new suffix is forwarded.
    async for output in engine.generate(prompt, params, request_id):
        text = output.outputs[0].text
        delta, sent = text[sent:], len(text)
        if delta:
            yield f"data: {json.dumps({'token': delta})}\n\n"
    yield "data: [DONE]\n\n"


@app.post("/v1/completions")
async def completions(request: CompletionRequest):
    """
    Endpoint for synchronous or streaming completions.
    Uses vLLM's batching and caching automatically.
    """
    if not request.prompt:
        raise HTTPException(status_code=400, detail="Prompt required")
    sampling_params = SamplingParams(
        max_tokens=request.max_tokens,
        temperature=request.temperature,
        top_p=request.top_p,
    )
    if request.stream:
        return StreamingResponse(
            stream_completion(request.prompt, sampling_params),
            media_type="text/event-stream",
        )
    # Non-streaming: drain the generator and keep only the final output.
    final = None
    async for output in engine.generate(
        request.prompt, sampling_params, str(uuid.uuid4())
    ):
        final = output
    if final is None:
        raise HTTPException(status_code=500, detail="No output generated")
    completion = final.outputs[0]
    return CompletionResponse(
        text=completion.text,
        tokens=len(completion.token_ids),
    )
@app.get("/v1/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model": engine_args.model}


@app.get("/v1/models")
async def list_models():
    """List available models."""
    return {"models": [engine_args.model]}
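if __name__ == "__main__":
    import uvicorn

    # Run a single worker: each uvicorn worker is a separate process that
    # would load its own copy of the model, and workers > 1 only works when
    # the app is passed as an import string. Concurrency comes from the async
    # event loop; scale out with more replicas behind the load balancer.
    uvicorn.run(
        app,
        host="0.0.0.0",
        port=8000,
        workers=1,
        log_level="info",
    )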
This configuration:
- Uses 2 GPUs via tensor parallelism for TTFT reduction.
- Batches requests automatically (max 8192 tokens per batch).
- Caches prompt prefixes to skip redundant prefill.
- Stores KV cache in 8-bit precision to reduce memory.
- Streams tokens to clients as Server-Sent Events.
- Reserves 2 GB of CPU swap space for KV cache blocks; because preemption is set to recompute, preempted requests are recomputed rather than swapped.
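On the client side, the stream arrives as `data: {...}` lines terminated by `data: [DONE]`. A minimal parsing sketch follows; `iter_sse_tokens` is a hypothetical helper name, and it assumes the payload format emitted by the endpoint above (a JSON object with a `token` field).

```python
import json


def iter_sse_tokens(lines):
    """Yield token strings from an iterable of SSE lines until [DONE]."""
    for raw in lines:
        # HTTP clients usually hand back bytes; accept str as well.
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data: "):
            continue  # Skip blank separator lines and non-data fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)["token"]


sample = [
    'data: {"token": "Hello"}', "",
    b'data: {"token": " world"}', "",
    "data: [DONE]", "",
]
print("".join(iter_sse_tokens(sample)))  # prints "Hello world"
```

With the `requests` library, the same function could be fed from `response.iter_lines()` on a POST made with `stream=True`.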
Tuning Guide: Which Optimizations Matter Most?
Not all optimizations help equally. Priority depends on your bottleneck:
If bottleneck is TTFT (prefill)
Priority:
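- Tensor parallelism (2-4 GPUs): Reduces prefill latency by up to 2-4x; inter-GPU communication makes the real gain sub-linear.
- Prompt caching: Eliminates prefill for repeated prefixes.
- Smaller model: Prefill cost scales roughly with parameter count, so a 3B model roughly halves TTFT relative to 7B.
- Batching: Does not help here; larger batches slightly increase TTFT, so keep batch limits moderate.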
Skip: Decode optimizations (speculative decoding, quantization).
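Before applying any of these, it helps to confirm that TTFT really is the bottleneck. The sketch below times a token stream and separates time to first token from decode throughput; `measure_stream` and its injectable `clock` parameter are hypothetical names, and the helper works on any iterator of tokens.

```python
import time


def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token iterator; return (ttft_ms, tokens_per_sec, n_tokens)."""
    start = clock()
    first = None  # Timestamp of the first token
    n = 0
    for _ in token_iter:
        now = clock()
        if first is None:
            first = now
        n += 1
    end = clock()
    if first is None:
        return None, 0.0, 0  # The stream produced no tokens
    ttft_ms = (first - start) * 1000.0
    # Decode throughput excludes the first token, which is dominated by prefill.
    decode_time = end - first
    tps = (n - 1) / decode_time if n > 1 and decode_time > 0 else 0.0
    return ttft_ms, tps, n


ttft_ms, tps, n = measure_stream(iter(["Hello", " world", "!"]))
print(f"TTFT: {ttft_ms:.3f} ms, decode: {tps:.1f} tok/s, tokens: {n}")
```

A high TTFT with healthy decode throughput points at prefill (this section); the reverse points at the throughput section.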
If bottleneck is throughput (tokens/sec)
Priority:
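- Request batching: Usually the largest gain, often 3-5x TPS.
- Quantization (INT8): Smaller weights and faster kernels, roughly a 20-30% TPS improvement where INT8 kernels are available.
- KV caching: Already on by default; it avoids recomputing attention over past tokens at every decode step (a 2.5-3x decode speedup that grows with sequence length).
- Multi-GPU: Additional replicas add TPS roughly in proportion to the hardware.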
Skip: TTFT optimizations (tensor parallelism, prompt caching).
If bottleneck is memory
Priority:
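- Quantization (INT8/INT4): Reduces weight memory by 50-75% relative to FP16.
- KV cache quantization (FP8): Halves cache memory relative to FP16.
- Prompt caching: Shares cached prefix blocks across requests instead of storing a copy per request.
- Smaller model: A 3B model needs less than half the weight memory of a 7B model.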
Skip: Techniques that increase memory (multiple GPU copies, large batch sizes).
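The percentages above follow from simple arithmetic. The sketch below estimates weight and KV cache memory; the helper names are hypothetical, the parameter count is the nominal 7B figure, and the model shape (32 layers, 32 KV heads, head dimension 128) is the one assumed here for Llama-2-7B. Activation memory and framework overhead are ignored.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Memory needed to store the model weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """KV cache size for one token of one sequence."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


N_PARAMS = 7e9  # Nominal "7B"
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name} weights: {weight_bytes(N_PARAMS, bits) / GIB:.1f} GiB")

fp16_kv = kv_cache_bytes_per_token(32, 32, 128, bytes_per_value=2)
fp8_kv = kv_cache_bytes_per_token(32, 32, 128, bytes_per_value=1)
print(fp16_kv, fp8_kv)          # 524288 262144 (bytes per token)
print(fp16_kv * 4096 / GIB)     # 2.0 GiB for one 4096-token sequence
```

At FP16 a single full-length sequence already costs 2 GiB of cache, which is why KV cache precision and batch size matter as much as weight quantization on small GPUs.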
End-to-End Tuning Example: Document Q&A System
A document Q&A system (retrieve a document, then answer questions about it) has a distinctive workload: the same documents are queried by many different users, so most prompts share a long prefix. The optimization strategy:
from vllm import EngineArgs

# Configuration for document Q&A
engine_args = EngineArgs(
    model="meta-llama/Llama-2-7b-hf",
    # Prefill optimization (shared documents)
    tensor_parallel_size=2,        # Fast prefill for initial document load
    enable_prefix_caching=True,    # Reuse cached document KV blocks across queries
    # Batching for throughput
    max_num_batched_tokens=4096,   # Moderate batching (don't increase TTFT too much)
    # Memory efficiency
    gpu_memory_utilization=0.9,
    kv_cache_dtype="fp8",          # Compact KV cache
    # No speculative decoding: prefill dominates this workload, and prompt
    # caching already addresses it
)
# Expected performance (illustrative; measure on your own hardware):
# - Document load (first query): 400ms TTFT (prefill)
# - Subsequent queries (same document): 50ms TTFT (cached)
# - 200+ queries/sec throughput (batching + cached prefill)
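Prefix caching only pays off when the shared document comes first in the prompt, because the cache is matched from the start of the sequence in whole blocks. The toy function below illustrates that block-granular matching on lists of token IDs. It is a simplified model, not vLLM's implementation, and `shared_prefix_blocks` is a hypothetical name.

```python
def shared_prefix_blocks(tokens_a, tokens_b, block_size=16):
    """Count the leading full blocks that two token sequences have in common."""
    shared = 0
    n_blocks = min(len(tokens_a), len(tokens_b)) // block_size
    for i in range(n_blocks):
        start, end = i * block_size, (i + 1) * block_size
        if tokens_a[start:end] != tokens_b[start:end]:
            break  # Matching stops at the first differing block
        shared += 1
    return shared


document = list(range(100))        # Stand-in for a tokenized document
question_1 = [1000, 1001, 1002]
question_2 = [2000, 2001]

# Document first: the shared prefix is reusable across both queries.
print(shared_prefix_blocks(document + question_1, document + question_2))  # 6
# Question first: the prompts diverge in the first block, so nothing is shared.
print(shared_prefix_blocks(question_1 + document, question_2 + document))  # 0
```

Six blocks of 16 tokens means 96 of the 100 document tokens skip prefill on the second query; the last partial block is recomputed.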
Monitoring and Auto-Tuning
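A production system monitors its metrics and flags when tuning is needed. The class below tracks a sliding window of requests and suggests adjustments; applying them is left to the operator: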
from collections import deque


class ServerMetrics:
    """Track inference metrics and suggest tuning."""

    def __init__(self, window_size: int = 100):
        # Keep only the most recent `window_size` samples of each metric
        self.ttfts = deque(maxlen=window_size)
        self.tps_list = deque(maxlen=window_size)
        self.queue_lengths = deque(maxlen=window_size)

    def record_request(self, ttft_ms: float, tokens_per_sec: float, queue_len: int):
        self.ttfts.append(ttft_ms)
        self.tps_list.append(tokens_per_sec)
        self.queue_lengths.append(queue_len)

    def suggest_tuning(self) -> str:
        """Suggest optimizations based on current metrics."""
        if not self.ttfts:
            return "Warming up..."
        p99_ttft = sorted(self.ttfts)[int(len(self.ttfts) * 0.99)]
        avg_queue = sum(self.queue_lengths) / len(self.queue_lengths)
        avg_tps = sum(self.tps_list) / len(self.tps_list)
        suggestions = []
        if p99_ttft > 500:
            suggestions.append("TTFT high: enable tensor parallelism or prompt caching")
        if avg_queue > 32:
            suggestions.append("Queue backlog: increase batch size or add GPUs")
        if avg_tps < 50:
            suggestions.append("Low throughput: quantize model to INT8")
        return "\n".join(suggestions) if suggestions else "System well-tuned"

    def report(self) -> dict:
        """Return current metrics."""
        if not self.ttfts:
            return {}
        return {
            "p50_ttft_ms": sorted(self.ttfts)[len(self.ttfts) // 2],
            "p99_ttft_ms": sorted(self.ttfts)[int(len(self.ttfts) * 0.99)],
            "avg_tps": sum(self.tps_list) / len(self.tps_list),
            "avg_queue_length": sum(self.queue_lengths) / len(self.queue_lengths),
        }


# Usage
metrics = ServerMetrics()
# After each request, record metrics
metrics.record_request(ttft_ms=150, tokens_per_sec=95, queue_len=5)
metrics.record_request(ttft_ms=180, tokens_per_sec=100, queue_len=8)
# ... more requests ...
print("Metrics:", metrics.report())
print("Suggestions:", metrics.suggest_tuning())
Key Takeaways
- Combine 5-6 optimization layers for production systems: batching, caching, quantization, streaming, multi-GPU.
- Profile first: Identify whether bottleneck is TTFT, throughput, or memory before tuning.
- Prioritize based on workload: TTFT optimizations (tensor parallelism, prompt caching) for latency; batching + quantization for throughput.
- Monitor and auto-tune: Track p99 TTFT, throughput, queue length; suggest optimizations dynamically.
- Test accuracy: Quantization (of weights or of the KV cache) can degrade quality; measure on real benchmarks before and after enabling it.
Frequently Asked Questions
Should I use all optimizations at once, or start with a subset?
Start with the highest-impact for your bottleneck (see decision tree above). Add others incrementally while monitoring accuracy and latency. A common starting point: request batching + KV caching + prompt caching. Add tensor parallelism or quantization only if needed.
How do I choose between tensor parallelism and pipeline parallelism?
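Tensor parallelism splits each layer across GPUs, which lowers per-request latency but requires an all-reduce after every layer, so it needs a fast interconnect such as NVLink. Pipeline parallelism splits the model into sequential stages; it communicates far less, which makes it better suited to crossing node boundaries, but it does not reduce single-request latency. For a single machine: use tensor parallelism. For multiple machines: use pipeline parallelism across nodes, usually combined with tensor parallelism within each node.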
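One common layout can be sketched as a helper that maps a cluster shape to parallelism sizes. `parallel_layout` is a hypothetical function; it assumes GPUs within a node share a fast interconnect, and the returned keys match the `tensor_parallel_size` and `pipeline_parallel_size` engine arguments used earlier in this article.

```python
def parallel_layout(num_nodes, gpus_per_node):
    """Suggest parallel sizes: tensor parallel within a node, pipeline across nodes."""
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return {
        "tensor_parallel_size": gpus_per_node,   # Shard layers inside each node
        "pipeline_parallel_size": num_nodes,     # One pipeline stage per node
    }


print(parallel_layout(num_nodes=1, gpus_per_node=2))
print(parallel_layout(num_nodes=2, gpus_per_node=8))
```

The tensor parallel size must also divide the model's number of attention heads evenly, so not every GPU count is usable for every model.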
Can I dynamically change batch size based on queue length?
Yes, but it is complex. vLLM's scheduler already does adaptive batching (requests join/leave batches dynamically). Changing max_num_batched_tokens or max_num_seqs at runtime requires reloading the model.
How do I debug if my fully-optimized stack is still slow?
Profile at each layer: measure TTFT before/after batching, before/after caching, before/after quantization. Use NVIDIA Nsight Systems (nsys) to identify GPU bottlenecks (compute-bound vs. memory-bound). Check that all optimizations are actually enabled (a common mistake is forgetting to set enable_prefix_caching=True).