
Building a Tuned LLM Inference Stack

Building a production LLM serving system means combining the optimization techniques from previous articles into a coherent architecture that balances time to first token (TTFT), throughput, cost, and reliability. Naively stacking optimizations often backfires: KV caching avoids recomputing attention but consumes GPU memory and increases latency variance when caches are evicted; batching improves throughput but increases TTFT; quantization saves memory but may degrade quality. This article walks through a complete tuned serving stack, explains how the techniques interact, and provides a production-oriented configuration.

The Full Optimization Stack

A production system combines 5-6 optimization layers:

User Request
↓
[Load Balancer: Route to best server]
↓
[API Server: FastAPI/Triton, handle HTTP]
↓
[Token Streaming: SSE/WebSocket]
↓
[Request Batching: Continuous batching with token budgets]
↓
[KV Cache: In-memory attention cache]
↓
[Prompt Cache: Reuse encoded prefixes]
↓
[Quantized Model: INT8/INT4 weights]
↓
[Multi-GPU Inference: Tensor/pipeline parallelism]
↓
[Response Buffering: Return streamed tokens]
↓
Client

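Each layer adds complexity and tradeoffs. A fully tuned stack can serve orders of magnitude more requests than a baseline single-GPU setup (on the order of 100-1000x in favorable workloads), though the actual gain depends on model size, prompt lengths, and hardware.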

Architecture Decision Tree

Choosing which optimizations to implement depends on your constraints:

START: What is your request rate?

< 100 requests/sec, TTFT-critical?
→ YES: Optimize TTFT (tensor parallelism, smaller model, prompt caching)
→ NO: Optimize throughput (batching, quantization)

> 100 requests/sec?
→ YES: Multi-GPU + batching + quantization + caching
→ Consider: Distributed sharding (split model across GPUs/nodes)

Memory-constrained (< 40GB VRAM)?
→ Quantize weights to INT8 when the FP16 weights plus KV cache do not fit
→ Limit batch size, enable prompt caching to maximize cache hits

Quality-critical (reasoning, code)?
→ Avoid INT4 quantization
→ Use only INT8 + KV caching
→ Prefer larger models over optimization tricks

Cost-critical (minimize GPU hours)?
→ Maximize throughput: large batches, quantization, prompt caching
→ Trade TTFT for efficiency
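
The decision tree above can be written down as a small function, which makes the branches easy to test against your own constraints. This is a minimal sketch: the `Workload` fields and the `recommend` name are illustrative, the traffic figure is treated as a per-second rate, and a rate of exactly 100 (which the tree leaves unspecified) is assumed to fall on the high-traffic branch.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a deployment's constraints."""
    requests_per_sec: float
    ttft_critical: bool
    vram_gb: float
    quality_critical: bool
    cost_critical: bool


def recommend(w: Workload) -> list[str]:
    """Walk the decision tree and collect recommendations in order."""
    recs: list[str] = []

    # Traffic branch: low traffic splits on TTFT sensitivity.
    if w.requests_per_sec < 100:
        if w.ttft_critical:
            recs += ["tensor parallelism", "smaller model", "prompt caching"]
        else:
            recs += ["batching", "quantization"]
    else:
        recs += ["multi-GPU", "batching", "quantization", "caching"]

    # Memory branch.
    if w.vram_gb < 40:
        recs += ["INT8 quantization", "limit batch size", "prompt caching"]

    # Quality and cost branches.
    if w.quality_critical:
        recs.append("avoid INT4")
    if w.cost_critical:
        recs.append("maximize throughput, trade TTFT for efficiency")

    # Remove duplicates while preserving first-seen order.
    return list(dict.fromkeys(recs))


if __name__ == "__main__":
    print(recommend(Workload(20, True, 24, False, False)))
```

Returning a flat list keeps the sketch simple; a real planner would also need to resolve conflicts between branches, for example a quality-critical workload that also lands on a quantization branch.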

Production Configuration: vLLM + FastAPI

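Below is a reference serving stack built on vLLM's offline LLM class, which keeps the example short. For heavy concurrent traffic, vLLM's AsyncLLMEngine or its bundled OpenAI-compatible server is the usual choice.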

from vllm import LLM, EngineArgs, SamplingParams
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
import asyncio
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configure the inference engine.
# Field names follow vLLM's EngineArgs; several are version-dependent,
# so check them against the vLLM release you have installed.
engine_args = EngineArgs(
    model="meta-llama/Llama-2-7b-hf",

    # Multi-GPU setup
    tensor_parallel_size=2,        # Shard across 2 GPUs
    pipeline_parallel_size=1,      # No pipeline parallelism

    # Memory management
    gpu_memory_utilization=0.85,   # Use 85% of GPU VRAM
    dtype="float16",

    # Batching and scheduling
    max_num_batched_tokens=8192,   # Max tokens per batch
    max_num_seqs=64,               # Max concurrent requests
    max_seq_len_to_capture=4096,   # Longest sequence covered by CUDA graphs

    # Caching strategies
    enable_prefix_caching=True,    # Reuse prompt prefixes
    block_size=16,                 # KV cache block size

    # KV cache precision
    kv_cache_dtype="fp8",          # 8-bit KV cache

    # Preemption for bursty traffic
    preemption_mode="recompute",   # Recompute on preemption

    # CPU fallback (for evicted KV caches)
    swap_space=2,                  # 2 GiB CPU swap for KV cache
)

# Load the model
try:
    # EngineArgs is a dataclass; vars() exposes its fields as keyword arguments.
    # If your vLLM version rejects any of them, pass the settings to LLM(...) directly.
    llm = LLM(**vars(engine_args))
    logger.info(f"Loaded model: {engine_args.model}")
except Exception as e:
    logger.error(f"Failed to load model: {e}")
    raise

app = FastAPI(title="LLM Serving Stack", version="1.0.0")

# Request/response schemas
from pydantic import BaseModel

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.9
    stream: bool = True

class CompletionResponse(BaseModel):
    text: str
    tokens: int

# Streaming response generator
async def stream_completion(prompt: str, params: SamplingParams):
    """Generate a completion, then yield its tokens as server-sent events."""

    # Run the blocking generate() call in a thread pool so the event loop stays free.
    # Note: LLM.generate() returns only after the whole completion is finished, so
    # the tokens below are replayed to the client, not emitted as they are decoded.
    loop = asyncio.get_running_loop()
    outputs = await loop.run_in_executor(
        None,
        lambda: llm.generate(prompts=[prompt], sampling_params=params)
    )

    tokenizer = llm.get_tokenizer()
    for token_id in outputs[0].outputs[0].token_ids:
        token_text = tokenizer.decode([token_id], skip_special_tokens=True)
        yield f"data: {json.dumps({'token': token_text})}\n\n"
        await asyncio.sleep(0.01)  # Pace output: one token every 10 ms

    yield "data: [DONE]\n\n"

@app.post("/v1/completions")
async def completions(request: CompletionRequest):
    """
    Endpoint for streaming or non-streaming completions.
    Uses vLLM's batching and caching automatically.
    """

    if not request.prompt:
        raise HTTPException(status_code=400, detail="Prompt required")

    sampling_params = SamplingParams(
        max_tokens=request.max_tokens,
        temperature=request.temperature,
        top_p=request.top_p,
    )

    if request.stream:
        return StreamingResponse(
            stream_completion(request.prompt, sampling_params),
            media_type="text/event-stream"
        )

    # Non-streaming: wait for the full response without blocking the event loop
    loop = asyncio.get_running_loop()
    outputs = await loop.run_in_executor(
        None,
        lambda: llm.generate(
            prompts=[request.prompt],
            sampling_params=sampling_params
        )
    )
    completion = outputs[0].outputs[0]

    return CompletionResponse(
        text=completion.text,  # Already detokenized by vLLM
        tokens=len(completion.token_ids)
    )

@app.get("/v1/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model": engine_args.model}

@app.get("/v1/models")
async def list_models():
    """List available models."""
    return {"models": [engine_args.model]}

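if __name__ == "__main__":
    import uvicorn

    # Run a single worker: each uvicorn worker is a separate process that would
    # load its own copy of the model, and workers > 1 requires passing the app
    # as an import string. Concurrency comes from asyncio and vLLM's batching.
    uvicorn.run(
        app,
        host="0.0.0.0",
        port=8000,
        workers=1,
        log_level="info"
    )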

This configuration:

  • Uses 2 GPUs via tensor parallelism for TTFT reduction.
  • Batches requests automatically (max 8192 tokens per batch).
  • Caches prompt prefixes to skip redundant prefill.
  • Stores KV cache in 8-bit precision to reduce memory.
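  • Sends tokens to clients as server-sent events, paced at one token per 10 ms. Because LLM.generate() returns only when the completion is finished, true token-by-token streaming requires vLLM's AsyncLLMEngine.
  • Reserves 2 GB of CPU swap space for KV cache blocks; with recompute preemption, preempted requests are recomputed rather than swapped out.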
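
The memory effect of the 8-bit KV cache can be estimated with a little arithmetic. The sketch below assumes Llama-2-7B's published shape (32 layers, 32 KV heads, head dimension 128) and ignores allocator and block-table overhead. The function names and the 10 GiB budget are illustrative, not part of vLLM.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies, summed over all layers."""
    # Factor of 2: one key vector and one value vector per head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(budget_gib: float, bytes_per_token: int) -> int:
    """How many cached tokens fit in a KV cache budget given in GiB."""
    return int(budget_gib * 2**30 // bytes_per_token)


# Assumed model shape for Llama-2-7B.
LLAMA2_7B = dict(num_layers=32, num_kv_heads=32, head_dim=128)

fp16_bytes = kv_bytes_per_token(**LLAMA2_7B, bytes_per_elem=2)
fp8_bytes = kv_bytes_per_token(**LLAMA2_7B, bytes_per_elem=1)

if __name__ == "__main__":
    print(f"FP16: {fp16_bytes / 1024:.0f} KiB per token")
    print(f"FP8:  {fp8_bytes / 1024:.0f} KiB per token")
    print("Tokens in 10 GiB (FP16):", tokens_that_fit(10, fp16_bytes))
    print("Tokens in 10 GiB (FP8): ", tokens_that_fit(10, fp8_bytes))
```

Under these assumptions the FP8 cache holds twice as many tokens in the same budget, which translates directly into more concurrent sequences or longer contexts. With tensor parallelism the cache is split across GPUs, so the per-GPU figure is smaller.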

Tuning Guide: Which Optimizations Matter Most?

Not all optimizations help equally. Priority depends on your bottleneck:

If bottleneck is TTFT (prefill)

Priority:

  1. Tensor parallelism (2-4 GPUs): Reduces prefill latency by up to 2-4x; inter-GPU communication keeps the scaling below linear.
  2. Prompt caching: Eliminates prefill for repeated prefixes.
  3. Smaller model: 3B instead of 7B cuts TTFT by roughly 50%.
  4. Batching: Minimal impact on TTFT (actually increases it slightly).

Skip: Decode optimizations (speculative decoding, quantization).
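
Before applying this list, it helps to confirm that prefill really is the bottleneck by measuring TTFT separately from total latency. The helper below is a minimal sketch that works on any iterable of tokens. `measure_stream` and `fake_stream` are hypothetical names, and the fake stream simply sleeps to imitate prefill and decode time.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Consume a token stream; return (ttft_seconds, total_seconds, token_count)."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            # Time until the first token arrives (dominated by prefill).
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # An empty stream has no first token; report the total wait instead.
    return (ttft if ttft is not None else total), total, count


def fake_stream(prefill_s: float, decode_s: float, n: int) -> Iterator[str]:
    """Stand-in for a model: wait for 'prefill', then emit n tokens."""
    time.sleep(prefill_s)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(decode_s)


if __name__ == "__main__":
    ttft, total, n = measure_stream(fake_stream(0.05, 0.01, 5))
    print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms, {n} tokens")
```

If TTFT is a small fraction of the total, the decode phase dominates and the throughput-oriented list applies instead.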

If bottleneck is throughput (tokens/sec)

Priority:

  1. Request batching: Increases TPS 3-5x.
  2. Quantization (INT8): Faster kernels, 20-30% TPS improvement.
  3. KV caching: Already default; maintains 2.5-3x decode speedup.
  4. Multi-GPU: More hardware = more TPS.

Skip: TTFT optimizations (tensor parallelism, prompt caching).
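
Request batching under a token budget is the main throughput lever, so it is worth seeing how the two limits interact. The following is a simplified sketch of one scheduling step, not vLLM's actual scheduler: it assumes each running sequence decodes one token per step, admits waiting prompts in FIFO order, and ignores chunked prefill and preemption. The `plan_step` name is hypothetical; the default limits mirror `max_num_batched_tokens` and `max_num_seqs` from the configuration above.

```python
from collections import deque
from typing import Deque, List


def plan_step(waiting: Deque[int], num_running: int,
              max_num_batched_tokens: int = 8192,
              max_num_seqs: int = 64) -> List[int]:
    """Admit waiting prompts (given by length in tokens) into the next step.

    Returns the prompt lengths admitted; admitted items are removed
    from the waiting queue.
    """
    # Each running sequence consumes one token of the budget for its decode step.
    budget = max_num_batched_tokens - num_running
    slots = max_num_seqs - num_running
    admitted: List[int] = []

    # FIFO admission: stop at the first prompt that does not fit.
    while waiting and slots > 0 and waiting[0] <= budget:
        prompt_len = waiting.popleft()
        admitted.append(prompt_len)
        budget -= prompt_len
        slots -= 1
    return admitted


if __name__ == "__main__":
    queue = deque([3000, 4000, 2000, 100])
    print("admitted:", plan_step(queue, num_running=10))
    print("still waiting:", list(queue))
```

In this sketch a long prompt at the head of the queue blocks shorter ones behind it. That head-of-line effect is one reason larger token budgets raise throughput while also raising TTFT for queued requests.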

If bottleneck is memory

Priority:

  1. Quantization (INT8/INT4): Reduces weight memory by 50-75% relative to FP16.
  2. KV cache quantization (FP8): Halves cache memory relative to FP16.
  3. Prompt caching: Reuse cache across requests.
  4. Smaller model: A 3B model needs a bit less than half the weight memory of a 7B model.

Skip: Techniques that increase memory (multiple GPU copies, large batch sizes).
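
The weight-memory figures above follow from bytes per parameter. The sketch below uses nominal parameter counts (7 and 3 billion) and ignores quantization metadata such as scales and zero-points, as well as any layers a quantization scheme keeps in higher precision. The `weight_gb` helper is illustrative.

```python
# Storage cost per parameter for each weight format.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB (10^9 bytes) for a model."""
    # 1 billion parameters at 1 byte each is 1 GB, so the units cancel.
    return num_params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for size in (7, 3):
        for dtype in ("fp16", "int8", "int4"):
            print(f"{size}B {dtype}: {weight_gb(size, dtype):.1f} GB")
```

These numbers cover weights only; the KV cache and activations come on top, which is why `gpu_memory_utilization` and batch limits still matter after quantizing.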

End-to-End Tuning Example: Document Q&A System

A document Q&A system (retrieve a document, then answer questions about it) has a distinctive workload: the same documents are queried by many users, so most prompts share a long prefix. The optimization strategy follows from that:

# Configuration for document Q&A
engine_args = EngineArgs(
    model="meta-llama/Llama-2-7b-hf",

    # Prefill optimization (shared documents)
    tensor_parallel_size=2,        # Fast prefill for initial document load
    enable_prefix_caching=True,    # Reuse document KV cache across queries

    # Batching for throughput
    max_num_batched_tokens=4096,   # Moderate batching (don't increase TTFT too much)

    # Memory efficiency
    gpu_memory_utilization=0.9,
    kv_cache_dtype="fp8",          # Compact KV cache

    # Speculative decoding is left off: the workload is prefill-heavy,
    # so prefix caching delivers most of the gain
)

# Expected performance (illustrative targets; measure on your own hardware):
# - Document load (first query): 400ms TTFT (prefill)
# - Subsequent queries (same document): 50ms TTFT (cached)
# - 200+ queries/sec throughput (batching + cached prefill)
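
The gap between the first and subsequent queries comes from block-level prefix reuse. The sketch below illustrates the idea under simplifying assumptions: each full block of 16 tokens is keyed by a hash of every token up to and including that block, so two prompts share a cached block only if they agree on the whole prefix. This is not vLLM's implementation, and `PrefixCache` and `block_keys` are hypothetical names.

```python
from typing import List, Set

BLOCK_SIZE = 16  # Matches block_size in the configuration above


def block_keys(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[int]:
    """One key per FULL block, derived from all tokens up to the block's end."""
    return [
        hash(tuple(token_ids[:end]))
        for end in range(block_size, len(token_ids) + 1, block_size)
    ]


class PrefixCache:
    """Tracks which prefix blocks have already been computed."""

    def __init__(self) -> None:
        self._blocks: Set[int] = set()

    def lookup_and_insert(self, token_ids: List[int]) -> int:
        """Return how many leading prompt tokens were already cached."""
        hit_blocks = 0
        matching = True
        for key in block_keys(token_ids):
            if matching and key in self._blocks:
                hit_blocks += 1
            else:
                # First miss ends the reusable prefix; cache the rest.
                matching = False
                self._blocks.add(key)
        return hit_blocks * BLOCK_SIZE


if __name__ == "__main__":
    document = list(range(100))          # Stand-in for a tokenized document
    cache = PrefixCache()
    print(cache.lookup_and_insert(document + [1000, 1001, 1002]))  # first query
    print(cache.lookup_and_insert(document + [2000, 2001]))        # second query
```

Only full blocks are reusable, so the tail of the document that does not fill a block is recomputed on every query. Placing the shared document before the user's question keeps the reusable prefix as long as possible.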

Monitoring and Auto-Tuning

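A production system monitors metrics and adjusts its configuration in response. The class below covers the first half: it tracks recent metrics and suggests tuning changes for an operator or controller to apply: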

from collections import deque

class ServerMetrics:
    """Track inference metrics and suggest tuning."""

    def __init__(self, window_size: int = 100):
        self.ttfts = deque(maxlen=window_size)
        self.tps_list = deque(maxlen=window_size)
        self.queue_lengths = deque(maxlen=window_size)

    def record_request(self, ttft_ms: float, tokens_per_sec: float, queue_len: int):
        self.ttfts.append(ttft_ms)
        self.tps_list.append(tokens_per_sec)
        self.queue_lengths.append(queue_len)

    def suggest_tuning(self) -> str:
        """Suggest optimizations based on current metrics."""

        if not self.ttfts:
            return "Warming up..."

        p99_ttft = sorted(self.ttfts)[int(len(self.ttfts) * 0.99)]
        avg_queue = sum(self.queue_lengths) / len(self.queue_lengths)
        avg_tps = sum(self.tps_list) / len(self.tps_list)

        suggestions = []

        if p99_ttft > 500:
            suggestions.append("TTFT high: enable tensor parallelism or prompt caching")

        if avg_queue > 32:
            suggestions.append("Queue backlog: increase batch size or add GPUs")

        if avg_tps < 50:
            suggestions.append("Low throughput: quantize model to INT8")

        return "\n".join(suggestions) if suggestions else "System well-tuned"

    def report(self) -> dict:
        """Return current metrics."""
        if not self.ttfts:
            return {}

        return {
            "p50_ttft_ms": sorted(self.ttfts)[len(self.ttfts) // 2],
            "p99_ttft_ms": sorted(self.ttfts)[int(len(self.ttfts) * 0.99)],
            "avg_tps": sum(self.tps_list) / len(self.tps_list),
            "avg_queue_length": sum(self.queue_lengths) / len(self.queue_lengths),
        }

# Usage
metrics = ServerMetrics()

# After each request, record metrics
metrics.record_request(ttft_ms=150, tokens_per_sec=95, queue_len=5)
metrics.record_request(ttft_ms=180, tokens_per_sec=100, queue_len=8)
# ... more requests ...

print("Metrics:", metrics.report())
print("Suggestions:", metrics.suggest_tuning())
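
Tail percentiles over a small window deserve some care: with a full 100-sample window, the index expression used for p99 above selects the largest sample. A common alternative is the nearest-rank definition, sketched below. The `percentile` helper is illustrative and is not wired into `ServerMetrics`.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < q <= 100:
        raise ValueError("q must be in the range (0, 100]")

    ordered = sorted(samples)
    # Multiply before dividing so integer q values avoid float rounding error.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    ttfts_ms = [150, 180, 165, 900, 170]
    print("p50:", percentile(ttfts_ms, 50))
    print("p99:", percentile(ttfts_ms, 99))
```

Whichever definition is used, a p99 computed from only 100 samples is driven by one or two requests, so a larger window or a longer observation period gives a steadier signal for tuning decisions.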

Key Takeaways

  • Combine 5-6 optimization layers for production systems: batching, caching, quantization, streaming, multi-GPU.
  • Profile first: Identify whether bottleneck is TTFT, throughput, or memory before tuning.
  • Prioritize based on workload: TTFT optimizations (tensor parallelism, prompt caching) for latency; batching + quantization for throughput.
  • Monitor and auto-tune: Track p99 TTFT, throughput, queue length; suggest optimizations dynamically.
  • Test accuracy: Quantization of weights or the KV cache can degrade quality; measure on real benchmarks after each change.

Frequently Asked Questions

Should I use all optimizations at once, or start with a subset?

Start with the highest-impact optimization for your bottleneck (see the decision tree above). Add others incrementally while monitoring accuracy and latency. A common starting point is request batching + KV caching + prompt caching. Add tensor parallelism or quantization only if needed.

How do I choose between tensor parallelism and pipeline parallelism?

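Tensor parallelism gives lower per-request latency because every GPU works on each layer at once, but it needs all-reduce operations in every layer, so it depends on fast intra-node links such as NVLink. Pipeline parallelism exchanges only activations between stages, which tolerates slower inter-node networks, but it does not shorten the latency of a single request. For a single machine: use tensor parallelism. For multiple machines: use pipeline parallelism across nodes, typically combined with tensor parallelism within each node.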

Can I dynamically change batch size based on queue length?

Yes, but it is complex. vLLM's scheduler already does adaptive batching (requests join/leave batches dynamically). Changing max_num_batched_tokens or max_num_seqs at runtime requires reloading the model.

How do I debug if my fully-optimized stack is still slow?

Profile at each layer: measure TTFT before and after batching, caching, and quantization, changing one setting at a time. Use NVIDIA Nsight Systems (nsys) to identify GPU bottlenecks (compute-bound vs. memory-bound). Check that all optimizations are actually enabled (a common mistake is forgetting to set enable_prefix_caching=True).

Further Reading