
Request Batching for LLM Serving

Request batching is the most effective technique for improving LLM serving throughput without buying additional hardware. By processing multiple user requests in a single forward pass, a system can increase tokens-per-second by 3-5x while keeping per-request latency acceptable. However, batching introduces a fundamental tradeoff: waiting for requests to accumulate in a batch increases latency for individual requests. Production systems solve this with continuous batching, which allows requests to join and leave a batch dynamically as generation progresses.

Why Batching Increases Throughput

Transformer models parallelize well across sequences: a GPU can process multiple requests simultaneously with little extra cost per step. Consider a single A100 GPU processing one request at a time: it might generate 100 tokens per second (TPS). Now batch eight requests together: the GPU processes eight prompts in prefill (~8x more compute, but also 8x more parallelism), then generates tokens for all eight sequences in parallel during decode. Each decode step loads the model weights once and applies them to every sequence in the batch, so throughput jumps to 600-800 TPS as GPU utilization rises.

The math: if sending one request at a time uses X% of GPU compute, batching four requests uses closer to 4X% (scaling is nearly linear for moderate batch sizes, until the GPU saturates). Underutilized GPUs are expensive GPUs: a single-request system wastes 60-80% of available compute.

Setup | Requests | Avg TTFT | Throughput (TPS) | GPU Utilization
No batching (1 request at a time) | 1 | 150ms | 100 | 20%
Static batch size 8 | 8 per pass | 800ms | 600 | 85%
Continuous batching (10ms accumulate) | ~2-3 avg | 250ms | 450 | 75%

Continuous batching (last row) balances throughput with latency: requests wait only about 10ms to be batched, so TTFT stays far below the static-batch figure (250ms versus 800ms) while throughput is 4.5x the single-request baseline (450 versus 100 TPS).

Static vs. Continuous Batching

Static batching processes a fixed number of requests per iteration. You wait for eight requests to arrive, process them together, emit all responses, then wait for the next batch. It is simple to implement but introduces bursty latency: if requests arrive slowly, the first request to arrive waits 1-2 seconds for the batch to fill before any processing starts.
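
A rough estimate shows where waits of this size come from. The sketch below assumes evenly spaced arrivals, which is a simplification (real traffic is burstier), and the helper name static_batch_first_wait_s is hypothetical. Under that assumption, the request that waits longest in a static batch is the first to arrive, because it must sit through batch_size - 1 further arrivals.

```python
def static_batch_first_wait_s(batch_size: int, arrivals_per_s: float) -> float:
    """Time the first request waits for a static batch to fill.

    Assumes evenly spaced arrivals; real traffic is burstier.
    """
    if batch_size < 1 or arrivals_per_s <= 0:
        raise ValueError("batch_size must be >= 1 and arrivals_per_s > 0")
    return (batch_size - 1) / arrivals_per_s


for rate in (4.0, 7.0, 20.0):
    wait = static_batch_first_wait_s(8, rate)
    print(f"{rate:>4.0f} req/s -> first request waits {wait:.2f}s")
```

With a batch size of 8, the fill time only drops below a second once traffic exceeds 7 requests per second, which is why static batching suits steady, high-volume workloads better than interactive ones.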

Continuous batching (also called dynamic batching or in-flight batching) is more sophisticated: requests join the batch immediately upon arrival, and requests leave the batch when their generation is complete. The batch size changes every iteration. A request does not wait for other requests; it joins an in-flight batch and receives a response as soon as its tokens are done generating.

Example timeline for continuous batching (timings are compressed for illustration):

Time=0ms:   Request A arrives, joins batch, prefill starts
Time=10ms:  Request B arrives, joins batch (batch size 2)
Time=100ms: Prefill completes, decode starts (Request A and B)
Time=110ms: Request C arrives, is prefilled, and joins the decode phase (batch size 3)
Time=150ms: Request A finishes, leaves batch (size 2)
Time=200ms: Request B finishes, leaves batch (size 1)
Time=250ms: Request C finishes, batch empty

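In this example, Request A sees a TTFT of ~100ms and Request B ~90ms (it arrived 10ms into prefill). Request C's TTFT is measured from its own arrival at 110ms: it needs only its own prefill, which runs alongside the in-flight decode steps, and it never waits for A or B to finish. Compare this with static batching at batch size 8: the first request to arrive would wait for seven more arrivals, which can take 2-3 seconds when traffic is light.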
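
The join-and-leave behaviour can be sketched as a toy simulation. This is a minimal illustration, not how any particular engine is implemented: time is counted in scheduler steps rather than milliseconds, prefill cost is ignored, and the names Request and simulate are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival_step: int   # scheduler step at which the request arrives
    tokens_needed: int  # tokens to generate before finishing
    generated: int = 0


def simulate(requests, max_batch_size=8):
    """Iteration-level scheduling: batch membership is re-evaluated every step."""
    waiting = sorted(requests, key=lambda r: r.arrival_step)
    running, finished_at, step = [], {}, 0
    while waiting or running:
        # Admit requests that have arrived, while there is room in the batch.
        while (waiting and waiting[0].arrival_step <= step
               and len(running) < max_batch_size):
            running.append(waiting.pop(0))
        # One decode step: every running request produces one token.
        for r in running:
            r.generated += 1
        # Finished requests leave immediately, freeing their slot.
        for r in [r for r in running if r.generated >= r.tokens_needed]:
            finished_at[r.name] = step
            running.remove(r)
        step += 1
    return finished_at


done = simulate([
    Request("A", arrival_step=0, tokens_needed=5),
    Request("B", arrival_step=1, tokens_needed=10),
    Request("C", arrival_step=3, tokens_needed=4),
])
print(done)
```

Request C arrives at step 3, joins the running batch without waiting for A or B, and finishes at step 6, before the longer Request B completes at step 10.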

Continuous Batching with Token Budgets

The challenge of continuous batching: how many requests should you accumulate before starting prefill? If you start immediately (batch size 1), you get low latency but low throughput. If you wait too long (hoping more requests arrive), you hurt latency for early requests.

The solution: token budget batching. Instead of waiting for a fixed number of requests, wait for a fixed number of tokens to accumulate. This is fairer because a 1000-token prompt is more expensive than a 100-token prompt.

import time

def should_start_batch(pending_requests, token_budget=4096, max_wait_s=0.05):
    """
    Decide whether to start a batch given pending requests
    and a token budget (total tokens, not request count).

    Assumes each request has .prompt and .arrival_time (from time.time()),
    and that tokenize() is your tokenizer's encode function.
    """
    # Never start an empty batch (should not happen in prod).
    if not pending_requests:
        return False

    # Start as soon as the token budget is full.
    total_tokens = sum(len(tokenize(r.prompt)) for r in pending_requests)
    if total_tokens >= token_budget:
        return True

    # Start once the oldest request has waited longer than max_wait_s
    # (10-50ms is typical; the default here is 50ms).
    oldest_wait_s = time.time() - pending_requests[0].arrival_time
    return oldest_wait_s > max_wait_s

A typical token budget is 4096-8192 tokens (enough for 8-16 requests of 512 tokens each). Adjust based on your GPU memory and latency targets.
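
Once the decision to start has been made, the scheduler still has to choose which requests fit. A minimal sketch of that step follows, assuming prompts have already been tokenized and counted; select_batch and its (request_id, token_count) input format are hypothetical, not part of any library.

```python
def select_batch(pending, token_budget=4096, max_num_seqs=64):
    """Greedily pick requests in arrival order until either limit is hit.

    `pending` is a list of (request_id, prompt_token_count) pairs.
    Returns (selected_ids, tokens_used).
    """
    selected, used = [], 0
    for request_id, n_tokens in pending:
        if len(selected) >= max_num_seqs:
            break
        if used + n_tokens > token_budget:
            break  # keep FIFO order: do not skip ahead to smaller prompts
        selected.append(request_id)
        used += n_tokens
    return selected, used


pending = [("a", 512), ("b", 1000), ("c", 100), ("d", 3000), ("e", 200)]
ids, used = select_batch(pending, token_budget=4096)
print(ids, used)
```

Stopping at the first prompt that does not fit keeps ordering fair but leaves some budget unused. Note that in this sketch a single prompt larger than the whole budget would never be scheduled, so a real scheduler needs to split or reject such prompts.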

Implementing Continuous Batching with vLLM

vLLM handles continuous batching automatically via its Scheduler. You do not need to implement batching yourself; vLLM manages request queues and dynamic batching internally. However, understanding the mechanics is important for tuning.

from vllm import LLM, SamplingParams

# Initialize LLM with batching configuration.
# LLM forwards these keyword arguments to vLLM's engine arguments.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    dtype="float16",
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=4096,  # Token budget per scheduler step
    max_num_seqs=64,              # Max sequences per scheduler step
    enable_prefix_caching=False,  # (Enable in article 6)
)
sampling_params = SamplingParams(max_tokens=256)

# Stand-ins for concurrent user requests (in real prod, these arrive via HTTP)
prompts = [
    "Explain machine learning in 50 words." * 10,  # ~100 tokens
    "What is quantum computing?" * 20,             # ~120 tokens
    "How do neural networks learn?" * 15,          # ~90 tokens
]

# Submit all prompts in one call. LLM.generate is blocking, so wrapping it
# in asyncio tasks would only run the requests one after another; passing
# the whole list lets the scheduler batch them continuously.
outputs = llm.generate(prompts, sampling_params)
for i, out in enumerate(outputs):
    print(f"[{i}] {len(out.prompt_token_ids)} prompt tokens -> "
          f"{len(out.outputs[0].token_ids)} generated tokens")

# For online serving, use vLLM's OpenAI-compatible API server or
# AsyncLLMEngine rather than the blocking LLM class.

In a production API server (FastAPI, Flask), requests arrive concurrently and vLLM's engine adds each one to a queue. At every step the scheduler decides which queued and running sequences to process, subject to max_num_batched_tokens and max_num_seqs.

Batching Trade-offs and Tuning

The core tradeoff: larger batches increase throughput but can increase latency. In an accumulate-then-dispatch scheduler, requests that arrive first wait for later requests to arrive, and in any scheduler each step takes longer as the batch grows. Tune these parameters:

  1. max_num_batched_tokens: Total token capacity per batch. Higher (e.g., 8192) = higher throughput, higher latency. Lower (e.g., 2048) = lower latency, underutilizes GPU.
  2. max_num_seqs: Max requests per batch. If you hit this limit before hitting the token budget, new requests must wait. Typical: 32-64 on a single A100.
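  3. max_wait_ms: Maximum time a request waits to be batched before being processed alone. Typical: 10-50ms. This knob belongs to an accumulate-then-dispatch scheduler such as should_start_batch above; it is not a vLLM engine argument.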

For latency-sensitive workloads (chat, search), prefer: max_num_batched_tokens=2048, max_wait_ms=20ms.

For throughput-focused workloads (batch summarization, log analysis), prefer: max_num_batched_tokens=8192, max_wait_ms=100ms.

Measuring Batch Impact on Your Workload

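Use this script to profile throughput and total batch time under different batch configurations. The offline generate call does not expose time to first token, so measure TTFT separately by streaming responses from a server and timing the first token:

import gc
import time
from vllm import LLM, SamplingParams

def benchmark_batching(batch_size: int, num_requests: int):
    """Benchmark wall time and throughput for a given max batch size."""
    llm = LLM(model="meta-llama/Llama-2-7b-hf",
              max_num_seqs=batch_size)

    # ignore_eos makes every request generate exactly max_tokens tokens
    sampling_params = SamplingParams(max_tokens=128, ignore_eos=True)
    prompts = ["Explain AI in one sentence. "] * num_requests

    # Submit all requests at once; max_num_seqs caps how many run together.
    # (Calling generate once per prompt would always run at batch size 1.)
    t0 = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - t0

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    throughput = generated / elapsed  # tokens/sec

    print(f"Batch size {batch_size}: total={elapsed * 1000:.0f}ms, "
          f"throughput={throughput:.0f} TPS")

    # Release the engine before loading the next configuration. If GPU
    # memory is not freed, run each batch size in a separate process.
    del llm
    gc.collect()

for bs in [1, 2, 4, 8, 16]:
    benchmark_batching(bs, num_requests=32)

Expected output: throughput jumps 3-5x as batch size grows, while TTFT (measured separately) increases gradually (1-2x). Find your sweet spot.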

Key Takeaways

  • Batching multiplies throughput 3-5x by parallelizing requests on a single GPU.
  • Continuous batching minimizes TTFT inflation by allowing requests to join and leave dynamically.
  • Token budgets beat request-count budgets because larger prompts cost more.
  • Max wait time is critical: 10-50ms wait is acceptable; 1+ second is not.
  • Profile with your actual workload: latency-sensitive and throughput-focused systems require different tuning.

Frequently Asked Questions

How long should I wait before starting a batch?

The right balance depends on your request arrival rate. At 10 requests per second, a 50ms wait sees ~0.5 additional arrivals on average; at 1 request per second, only ~0.05. In practice, 20-50ms is a good default. Measure your arrival rate at both peak and off-peak load, and set max_wait so that at least 50% of batches have multiple requests, provided that wait still fits your latency budget.
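
The 50% target can be turned into a number with a small calculation. The sketch below assumes Poisson arrivals (independent requests at a constant average rate), which is a modelling assumption rather than something measured here; both function names are hypothetical. Under that model, the chance that at least one more request arrives during a wait of w seconds at rate r is 1 - exp(-r * w).

```python
import math


def p_multi_request_batch(arrivals_per_s: float, max_wait_s: float) -> float:
    """P(at least one more request arrives during the wait window)."""
    return 1.0 - math.exp(-arrivals_per_s * max_wait_s)


def min_wait_for_share(arrivals_per_s: float, share: float = 0.5) -> float:
    """Smallest wait (seconds) so that `share` of batches hold 2+ requests."""
    return -math.log(1.0 - share) / arrivals_per_s


for rate in (1, 10, 50):
    p = p_multi_request_batch(rate, 0.05)
    w = min_wait_for_share(rate)
    print(f"{rate:>3} req/s: P(multi | 50ms wait)={p:.2f}, "
          f"wait for 50% multi={w * 1000:.0f}ms")
```

Under these assumptions, hitting the 50% target at 10 requests per second needs a wait of about 69ms, and at 1 request per second about 690ms. At low arrival rates the target is therefore out of reach within a 10-50ms window, and most batches will start with a single request.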

Does batching hurt response quality?

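No. The model processes each prompt the same way whether it runs in a batch or alone. Outputs can differ in low-order floating-point bits because batched kernels may sum values in a different order, but batching is purely an execution optimization and does not change quality.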

Can I batch requests with very different prompt lengths?

Yes, but inefficiently. If you batch a 100-token prompt with a 2000-token prompt, the 100-token prompt's prefill completes early, but it waits in memory for the 2000-token prompt's prefill. Minimize prompt-length variance by sorting requests before batching or using separate request queues for short/long prompts.
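
The separate-queue idea can be sketched as a bucketing step in front of the batcher. This is a minimal illustration: the function name bucket_by_length and the boundaries of 256 and 1024 tokens are arbitrary assumptions to tune against your own prompt-length distribution.

```python
from collections import defaultdict


def bucket_by_length(requests, boundaries=(256, 1024)):
    """Group (request_id, prompt_tokens) pairs into length buckets.

    boundaries=(256, 1024) yields three queues: <=256, <=1024, and longer.
    """
    buckets = defaultdict(list)
    for request_id, n_tokens in requests:
        # First boundary the prompt fits under, else the overflow bucket.
        idx = next((i for i, b in enumerate(boundaries) if n_tokens <= b),
                   len(boundaries))
        buckets[idx].append((request_id, n_tokens))
    # Sort inside each bucket so neighbours have similar lengths.
    return {i: sorted(reqs, key=lambda r: r[1])
            for i, reqs in sorted(buckets.items())}


queues = bucket_by_length(
    [("a", 100), ("b", 2000), ("c", 180), ("d", 900), ("e", 1500)]
)
for idx, reqs in queues.items():
    print(idx, reqs)
```

Each queue can then be batched on its own, so a 100-token prompt is never held up by a 2000-token one. The cost is that sparsely populated buckets fill more slowly, so keep the number of buckets small.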

What happens if I set max_num_batched_tokens too high?

You run out of GPU memory (OOM) or the model becomes too slow. Profile your GPU memory under peak load and leave 10-20% headroom.
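
A back-of-the-envelope estimate helps when sizing that headroom. The sketch below computes KV cache size per token from the model shape, using the published Llama-2-7B configuration (32 layers, 32 KV heads, head dimension 128) in float16. It is an approximation under stated assumptions: it ignores model weights, activations, and allocator overhead, and the function name is hypothetical. Resident tokens here means prompt plus generated tokens across all running sequences, which is bounded by max_num_seqs and sequence length rather than by max_num_batched_tokens alone.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: keys + values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama-2-7B: 32 layers, 32 KV heads, head dimension 128, float16 (2 bytes).
per_token = kv_cache_bytes_per_token(32, 32, 128)
for resident_tokens in (4096, 8192, 32768):
    gib = per_token * resident_tokens / 2**30
    print(f"{resident_tokens:>6} resident tokens -> {gib:.1f} GiB of KV cache")
```

At 0.5 MiB per token, 64 sequences of 512 tokens each (32768 resident tokens) already need about 16 GiB for the cache alone, on top of the model weights.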

Further Reading