Quantizing LLMs: Reduce Latency & Memory

Quantization is the process of reducing the numerical precision of model weights and activations from 32-bit floats (FP32) or 16-bit floats (FP16) to lower precisions (8-bit, 4-bit, or even 2-bit). Quantized models run faster (kernels are optimized for low-precision math), use less memory (weights occupy fewer bytes), and fit on smaller hardware (CPUs, edge devices, mobile). The tradeoff: some loss of accuracy, which is usually minimal for LLMs. Quantization is one of the easiest wins for inference optimization: a one-line configuration change often yields 20-40% latency reduction and 50% memory savings.

Why Quantization Helps Inference Speed

LLM inference is memory-bandwidth-bound, not compute-bound. During each forward pass, you must load all model weights from VRAM into cache. For a 7B-parameter model in FP16, that is 14 GB of data (2 bytes per parameter). On an A100 (2TB/sec bandwidth), loading weights takes ~7ms. Lower precision means fewer bytes:

  • FP16: 2 bytes per parameter
  • FP8 (8-bit float): 1 byte per parameter
  • INT8 (8-bit integer): 1 byte per parameter
  • INT4: 0.5 bytes per parameter

A 7B model quantized to INT8 occupies 7 GB (not 14 GB), so weight loading takes 3.5ms instead of 7ms. Combined with faster low-precision kernels, throughput improves 20-30%.
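The arithmetic above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch, assuming a 7B-parameter model and 2 TB/s of memory bandwidth as in the text; the function name `weight_load_ms` is illustrative, and real decode latency also depends on kernels, KV-cache reads, and batch size.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_load_ms(n_params: float, precision: str, bandwidth_bytes_per_s: float) -> float:
    """Milliseconds needed to read every weight once at the given memory bandwidth."""
    total_bytes = n_params * BYTES_PER_PARAM[precision]
    return total_bytes / bandwidth_bytes_per_s * 1000


# Assumed figures from the text: 7B parameters, 2 TB/s bandwidth
for precision, bytes_per_param in BYTES_PER_PARAM.items():
    size_gb = 7e9 * bytes_per_param / 1e9
    load_ms = weight_load_ms(7e9, precision, 2e12)
    print(f"{precision}: {size_gb:.1f} GB, {load_ms:.2f} ms per forward pass")
```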

Precision   Model Size (7B)   Weight Load (ms)   Decode Speed (tokens/sec)   Quality Loss
FP32        28 GB             14                 50                          Baseline
FP16        14 GB             7                  100                         Minimal
FP8         7 GB              3.5                130                         Minimal (<1% perplexity)
INT8        7 GB              3.5                130                         Minimal
INT4        3.5 GB            1.75               180                         Low (<2% perplexity on most models)

The challenge: quantization is lossy. The lower the precision, the more information is discarded. For LLMs, quantization to INT8 causes almost no measurable quality loss (less than 0.5% perplexity increase). INT4 is more aggressive and shows 1-3% perplexity increase on some models but is still usable for many applications.
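Quality loss is quoted here as a relative perplexity increase. As a rough sketch of how such a figure is computed, perplexity is the exponential of the mean negative log-likelihood per token. The log-probability lists below are hypothetical placeholders, not measurements from any model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative natural-log probability per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase_pct(baseline_ppl, quantized_ppl):
    """Perplexity increase of the quantized model, as a percentage of the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl * 100


# Hypothetical per-token log-probabilities on the same evaluation text
fp16_logprobs = [-1.0, -2.0, -1.5]
int4_logprobs = [-1.02, -2.03, -1.52]

fp16_ppl = perplexity(fp16_logprobs)
int4_ppl = perplexity(int4_logprobs)
print(f"FP16 perplexity: {fp16_ppl:.3f}")
print(f"INT4 perplexity: {int4_ppl:.3f}")
print(f"Relative increase: {relative_increase_pct(fp16_ppl, int4_ppl):.2f}%")
```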

Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT)

There are two quantization approaches:

Post-Training Quantization (PTQ): Quantize an already-trained model without retraining. Easy (minutes), but less optimal (higher accuracy loss). Typical: INT8 with PTQ causes 0.5-2% accuracy drop.

Quantization-Aware Training (QAT): Fine-tune the model during training to adapt to lower precision. Hard (requires training data, compute), but near-lossless (< 0.1% accuracy drop). Rarely necessary for inference (PTQ usually sufficient).

For production, PTQ is standard: quantize your model once (few minutes), deploy it, measure accuracy, and adjust precision if needed.

Quantization Methods

Static Quantization

Pre-compute quantization parameters (min, max, scale factor) from a calibration dataset. Fixed for all inputs.

Original weights: [1.2, 3.4, -0.8, 2.1]
Min: -0.8, Max: 3.4
Scale factor: (Max - Min) / (2^8 - 1) = 4.2 / 255 ≈ 0.0165
Quantized: round((weight - Min) / scale) = [121, 255, 0, 176] (mapped to [0, 255] range)

Inference: dequantize(quantized_weight) = quantized_weight * scale + min

Simple, fast, but less flexible: the scale factor is fixed across all inputs.
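A minimal pure-Python sketch of the min/max scheme described above, applied to the same four example weights. The helper names (`compute_params`, `quantize`, `dequantize`) are illustrative and do not come from any particular library; production toolkits typically use vectorized kernels and often symmetric or per-channel variants.

```python
def compute_params(values, bits=8):
    """Derive (scale, minimum) once, e.g. from weights or calibration data."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    return scale, lo


def quantize(values, scale, lo, bits=8):
    """Map floats to integer codes in [0, 2^bits - 1], clamping out-of-range inputs."""
    qmax = 2**bits - 1
    return [max(0, min(qmax, round((v - lo) / scale))) for v in values]


def dequantize(codes, scale, lo):
    """Recover approximate floats: code * scale + min."""
    return [q * scale + lo for q in codes]


weights = [1.2, 3.4, -0.8, 2.1]
scale, lo = compute_params(weights)   # computed once, then reused for every input
codes = quantize(weights, scale, lo)
restored = dequantize(codes, scale, lo)

print("scale:", round(scale, 4))
print("codes:", codes)
print("restored:", [round(r, 3) for r in restored])
```

The round-trip error of any in-range value is at most half the scale factor, which is why a wide min/max range (for example, one caused by a single outlier) costs precision everywhere else.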

Dynamic Quantization

Compute quantization parameters per-input (per forward pass). More flexible, but slightly slower.

For each forward pass:
Compute min/max of layer inputs
Update scale factor based on current input range
Quantize and compute
Dequantize result

Dynamic quantization adapts to input distribution, potentially preserving more accuracy.
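To make the difference concrete, the sketch below round-trips the same activations through an 8-bit grid twice: once with a range fixed ahead of time (static) and once with a range recomputed from the current input (dynamic). The calibration range and activation values are made up for illustration, and the helper names are hypothetical.

```python
def quantize_dequantize(values, lo, hi, bits=8):
    """Round-trip values through a `bits`-bit grid spanning [lo, hi]."""
    qmax = 2**bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    out = []
    for v in values:
        code = max(0, min(qmax, round((v - lo) / scale)))  # clamp to the grid
        out.append(code * scale + lo)
    return out


def static_roundtrip(values, calib_lo, calib_hi):
    """Static: the range was fixed earlier from calibration data."""
    return quantize_dequantize(values, calib_lo, calib_hi)


def dynamic_roundtrip(values):
    """Dynamic: the range is recomputed from the current input on every call."""
    return quantize_dequantize(values, min(values), max(values))


def max_abs_error(original, approx):
    return max(abs(x - y) for x, y in zip(original, approx))


calib_lo, calib_hi = -1.0, 1.0        # hypothetical calibration range
activations = [0.2, -0.5, 4.0, 0.9]   # 4.0 lies outside the calibrated range

static_err = max_abs_error(activations, static_roundtrip(activations, calib_lo, calib_hi))
dynamic_err = max_abs_error(activations, dynamic_roundtrip(activations))
print(f"static max error:  {static_err:.4f}")
print(f"dynamic max error: {dynamic_err:.4f}")
```

With these inputs the static path clips the outlier to the calibrated maximum, while the dynamic path widens its range to cover it at the cost of recomputing min/max on every call.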

Practical Quantization with AutoGPTQ and GGUF

Two popular quantization formats for LLMs:

Option 1: AutoGPTQ (GPT-Quantization)

AutoGPTQ quantizes models to INT4/INT8 and is optimized for inference on GPUs:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, pipeline

model_id = "meta-llama/Llama-2-7b-hf"  # Transformers-format checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize a model to INT4
quantization_config = GPTQConfig(
    bits=4,               # 4-bit quantization
    group_size=128,       # Grouping for quantization (one scale per 128 weights)
    desc_act=False,       # Optimized for inference
    dataset="c4",         # Calibration data; GPTQ needs samples to quantize
    tokenizer=tokenizer,  # Tokenizes the calibration samples
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",    # Automatically place on GPU
)

# Save quantized model and tokenizer so the directory can be loaded on its own
model.save_pretrained("./llama-2-7b-gptq-int4")
tokenizer.save_pretrained("./llama-2-7b-gptq-int4")

# Load and use (automatically dequantizes to FP16 during inference)
pipeline_model = pipeline(
    "text-generation",
    model="./llama-2-7b-gptq-int4",
    device=0,             # GPU 0
)

output = pipeline_model("Explain quantum computing.", max_length=100)
print(output[0]["generated_text"])

AutoGPTQ quantizes weights to INT4 and stores them in a specialized format. During inference, weights are dequantized just-in-time, adding minimal overhead. INT4 is aggressive but acceptable for most models: Llama, Mistral, etc.
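The `group_size` setting controls how many consecutive weights share one scale. The sketch below shows plain round-to-nearest group-wise quantization in pure Python to illustrate that idea only; it is not the GPTQ algorithm itself, which additionally adjusts the remaining weights to compensate for rounding error. All names and the toy weights are illustrative.

```python
def quantize_groupwise(weights, bits=4, group_size=128):
    """Round-to-nearest quantization with one (scale, minimum) pair per group."""
    qmax = 2**bits - 1
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        codes = [max(0, min(qmax, round((w - lo) / scale))) for w in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groupwise(groups):
    """Rebuild the flat weight list from per-group codes, scales, and minimums."""
    return [code * scale + lo for codes, scale, lo in groups for code in codes]


def max_abs_error(original, approx):
    return max(abs(x - y) for x, y in zip(original, approx))


# Toy row: 15 small weights plus one large outlier at the end
weights = [0.01 * i for i in range(8)] + [0.01 * i for i in range(7)] + [5.0]

per_group = dequantize_groupwise(quantize_groupwise(weights, group_size=8))
per_row = dequantize_groupwise(quantize_groupwise(weights, group_size=len(weights)))

# Compare the error on the first 8 weights, which do not contain the outlier
print("grouped error:", max_abs_error(weights[:8], per_group[:8]))
print("single-scale error:", max_abs_error(weights[:8], per_row[:8]))
```

Smaller groups isolate outliers so they only degrade the weights in their own group, at the cost of storing more scales.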

Option 2: GGUF (llama.cpp's file format, successor to GGML)

GGUF is a format for quantized models designed for CPU inference (good for edge/mobile):

from llama_cpp import Llama

# Load a GGUF-quantized model (CPU-optimized)
llm = Llama(
    model_path="./llama-2-7b.gguf",
    n_gpu_layers=-1,  # Offload all to GPU if available
    n_ctx=2048,       # Context size
    n_threads=8,      # Number of CPU threads
)

response = llm(
    "Explain reinforcement learning.",
    max_tokens=256,
    temperature=0.7,
)

print(response["choices"][0]["text"])

GGUF models are smaller (2-4 GB for a 7B model in INT4) and run on CPU. Slower than GPU quantization, but enables deployment on constrained hardware.
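File size is slightly larger than a bare 0.5 bytes per parameter because block-quantized formats also store scales. As a rough sketch, assuming a layout like llama.cpp's Q4_0 (blocks of 32 weights sharing one 16-bit scale), the effective cost is 4.5 bits per weight. The function names are illustrative, and other GGUF quantization types use different block layouts.

```python
def effective_bits_per_weight(bits, block_size, scale_bits=16):
    """Bits per weight once the per-block scale is amortized over the block."""
    return bits + scale_bits / block_size


def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed layout: 4-bit codes, 32 weights per block, one 16-bit scale per block
bpw = effective_bits_per_weight(bits=4, block_size=32)
size_gb = model_size_gb(7e9, bpw)
print(f"{bpw} bits per weight, about {size_gb:.2f} GB for 7B parameters")
```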

Quantization Impact on Quality

Measure accuracy loss with standard LLM benchmarks. Do not trust claims without data:

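from lm_eval.evaluator import simple_evaluate  # lm-evaluation-harness (assumes v0.4+)

def task_accuracy(model_path, task_name="hellaswag"):
    """
    Run one benchmark task on a Hugging Face model and return its accuracy.
    """
    results = simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_path}",
        tasks=[task_name],
    )
    metrics = results["results"][task_name]
    # Metric key names vary by harness version ("acc,none" in 0.4+, "acc" earlier)
    return metrics.get("acc,none", metrics.get("acc"))

def evaluate_quantization_impact(original_path, quantized_path, task_name="hellaswag"):
    """
    Compare accuracy of original and quantized models.
    """
    acc_original = task_accuracy(original_path, task_name)
    acc_quantized = task_accuracy(quantized_path, task_name)

    # Relative loss, as a percentage of the original accuracy
    accuracy_loss = (acc_original - acc_quantized) / acc_original * 100

    print(f"Original accuracy: {acc_original:.2%}")
    print(f"Quantized accuracy: {acc_quantized:.2%}")
    print(f"Accuracy loss: {accuracy_loss:.1f}%")

    return accuracy_loss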

# Example results (typical)
# Original (FP16): 83.5%
# INT8 quantized: 83.2% (0.4% loss)
# INT4 quantized: 82.8% (0.8% loss)

For most models, INT8 quantization causes negligible accuracy loss. INT4 is acceptable for non-reasoning tasks (summarization, classification) but may hurt reasoning benchmarks.

Quantization Strategies by Use Case

Use Case                    Recommended Precision    Rationale
Chat, summarization         INT8 or INT4             Minimal accuracy loss; large speedup
Code generation             INT8 (avoid INT4)        Reasoning requires higher precision
Math/logic reasoning        FP16 (no quantization)   Low precision harms accuracy
Long-context (>8k tokens)   INT8                     Memory savings critical
Mobile/edge deployment      INT4 + GGUF format       Size and speed critical

vLLM Quantization Configuration

vLLM supports multiple quantization backends. Enable quantization at model load:

from vllm import LLM

# Load a pre-quantized model (AutoGPTQ format)
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # Pre-quantized from HuggingFace
    quantization="gptq",               # Specify quantization type
    dtype="float16",
    gpu_memory_utilization=0.9,
)

# Alternatively, use GGUF (load one model per process; each claims GPU memory)
llm_gguf = LLM(
    model="./llama-2-7b.gguf",
    quantization="gguf",
)

vLLM automatically applies the correct quantization kernels. You do not need to manually dequantize weights.

Measuring Quantization Trade-offs

Use this script to profile speed vs. accuracy:

import time

from vllm import SamplingParams

def benchmark_quantization(models_dict: dict, max_tokens: int = 128):
    """
    Benchmark latency and accuracy for different quantizations.
    models_dict: {"precision": (model_object, accuracy)}; must include an "FP16" baseline.
    """
    prompt = "Explain machine learning. " * 5  # Short repeated prompt

    # Greedy decoding; ignore_eos forces every model to emit exactly max_tokens
    sampling_params = SamplingParams(
        temperature=0.0, max_tokens=max_tokens, ignore_eos=True
    )

    results = {}
    for precision, (model, accuracy) in models_dict.items():
        t0 = time.perf_counter()
        model.generate([prompt], sampling_params)
        latency_ms = (time.perf_counter() - t0) * 1000

        results[precision] = {
            "latency_ms": latency_ms,
            "accuracy": accuracy,
        }

    # Print summary
    print(f"{'Precision':<10} {'Latency (ms)':<15} {'Accuracy':<10} {'Speedup':<10}")
    print("-" * 50)

    baseline_latency = results["FP16"]["latency_ms"]
    for precision, data in results.items():
        speedup = baseline_latency / data["latency_ms"]
        print(f"{precision:<10} {data['latency_ms']:<15.1f} "
              f"{data['accuracy']:<10.2%} {speedup:.2f}x")

    return results

Expected results on a 7B model:

Precision   Latency (ms)   Accuracy   Speedup
FP16        120            83.5%      1.00x
INT8        85             83.2%      1.41x
INT4        68             82.8%      1.76x

Key Takeaways

  • Quantization reduces precision (FP16 → INT8/INT4) to decrease model size and improve speed.
  • 20-40% latency reduction typical for INT8; minimal accuracy loss for most models.
  • INT8 is safe; INT4 is aggressive but acceptable for non-reasoning tasks.
  • Post-Training Quantization (PTQ) is simple: quantize once, deploy everywhere.
  • Measure accuracy on your own benchmarks before deploying quantized models.

Frequently Asked Questions

Can I quantize any model, or only specific architectures?

Most modern models (Llama, Mistral, Falcon, etc.) are quantization-friendly. Models with unusual activations or custom operations may not quantize well. Test on your specific model.

Should I quantize the KV cache as well as model weights?

Yes. Quantizing the KV cache to FP8 or INT8 halves its memory footprint relative to FP16, usually with negligible accuracy loss. This is especially important for long-context applications. vLLM supports kv_cache_dtype="fp8".
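For a sense of scale, the KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below estimates its size for a single sequence, assuming the Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128) and a 4,096-token context; `kv_cache_bytes` is an illustrative helper, not a vLLM function.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value):
    """Bytes for one sequence's KV cache; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Assumed model shape: Llama-2-7B (32 layers, 32 KV heads, head_dim 128)
fp16_bytes = kv_cache_bytes(32, 32, 128, n_tokens=4096, bytes_per_value=2)
fp8_bytes = kv_cache_bytes(32, 32, 128, n_tokens=4096, bytes_per_value=1)

print(f"FP16 KV cache: {fp16_bytes / 2**30:.1f} GiB per 4096-token sequence")
print(f"FP8 KV cache:  {fp8_bytes / 2**30:.1f} GiB per 4096-token sequence")
```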

Does quantization help on CPU?

Yes, especially INT4 + GGUF. CPU inference is already slow, so any speedup helps. GGUF models with INT4 run 2-3x faster on CPU than FP16, though still much slower than GPU.

Can I combine quantization with other optimizations (batching, KV caching, prompt caching)?

Yes, and doing so is recommended. The gains from quantization, KV caching, and prompt caching compound: a 7B INT8 model with KV caching, prompt caching, and request batching can serve 5-10x more requests per GPU than baseline FP16.

Further Reading