Model Quantization: Deploy Smaller Neural Networks
Model quantization is often the final compression step in a distillation pipeline: after distilling a large model into a smaller one, you quantize the student to 8-bit, 4-bit, or even 2-bit integers, shrinking it by a further 2-8x relative to FP16. A 3B-parameter model quantized to 4-bit needs about 1.5 GB for its weights (3 billion parameters × 0.5 bytes each), small enough for many mobile phones and edge devices. Combined with distillation, quantization can reach compression ratios of 40-100x while retaining 90%+ of the original model's accuracy. This article covers the quantization techniques, tradeoffs, and best practices used in production in 2026.
Quantization Basics: From Floats to Integers
Quantization maps floating-point weights from the range [-R, R] to integer values in a smaller bit width:
quantized_value = round(weight / scale)
Where scale is a learned or computed factor that determines the quantization granularity. The inverse operation (dequantization) reconstructs the approximation:
reconstructed_weight ≈ quantized_value * scale
For example, with 8-bit quantization, weights are mapped to integers in the range [-128, 127] (signed) or [0, 255] (unsigned, which also needs a zero-point offset so that negative weights can be represented). This reduces storage by 4x (float32 is 4 bytes; int8 is 1 byte) and can accelerate computation on hardware with native integer arithmetic (most modern mobile and edge chips).
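The two formulas above can be tried out on a handful of numbers. The following is a minimal pure-Python sketch of symmetric signed quantization, assuming the scale is derived from the largest absolute weight; the function names `quantize_symmetric` and `dequantize` are illustrative and not part of any library.

```python
def quantize_symmetric(weights, num_bits=8):
    """Quantize a list of floats to signed integers using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for 8-bit
    qmin = -(2 ** (num_bits - 1))            # -128 for 8-bit
    max_abs = max(abs(w) for w in weights)   # R in the range [-R, R]
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # round(weight / scale), clamped to the representable integer range
    quantized = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Reconstruct approximate float weights: quantized_value * scale."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.003, 1.00]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)

print("integers:", q)
print("scale:", scale)
for w, r in zip(weights, restored):
    print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.4f})")
```

The reconstruction error of any weight inside the range is at most half the scale, which is why a single very large weight (which inflates the scale) hurts the precision of all the others.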
Post-Training Quantization (PTQ)
The simplest approach: quantize an already-trained model without retraining. This works surprisingly well in practice (typically 92-97% of full-precision accuracy retained) and is fast (minutes).
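import io
import torch
from torch.ao.quantization import quantize_dynamic  # torch.quantization is the older alias

# Load the pre-trained model. The file holds a full pickled module, so recent
# PyTorch versions need weights_only=False; only do this for files you trust.
model = torch.load("distilled_student.pt", weights_only=False)
model.eval()

# Post-training dynamic quantization: weights are stored as int8; activations
# stay in float between layers and are quantized on the fly inside each layer
quantized_model = quantize_dynamic(
    model,
    qconfig_spec={torch.nn.Linear},  # Quantize linear layers
    dtype=torch.qint8                # 8-bit integers
)

def serialized_size_mb(m):
    """Size of the saved state_dict in MB. Counting parameters() would miss
    the packed int8 weights, which are not registered as parameters."""
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

# Check size reduction
original_size = serialized_size_mb(model)
quantized_size = serialized_size_mb(quantized_model)
print(f"Original size: {original_size:.1f} MB")
print(f"Quantized size: {quantized_size:.1f} MB")
print(f"Size reduction: {100 * (1 - quantized_size / original_size):.1f}%")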
Post-training quantization is preferred when:
- You do not have access to training data (e.g., quantizing a downloaded model).
- Speed to deployment is critical (minutes vs. hours).
- The resulting accuracy loss is acceptable for your application.
Drawback: modest accuracy loss (3-8% on average) because the model was not trained to be quantized.
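The dynamic variant shown earlier measures activation ranges at inference time. Static post-training quantization instead fixes those ranges ahead of time by running a small calibration set through the model. The sketch below illustrates the idea with simple min/max calibration and an unsigned 8-bit range; it is pure Python, the function names are hypothetical, and real toolkits usually offer more robust range estimators than plain min/max.

```python
def calibrate_affine(samples, num_bits=8):
    """Derive a scale and zero-point from observed values (min/max calibration)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(samples), 0.0)   # include 0.0 so that zero maps to an exact integer
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize_affine(x, scale, zero_point, num_bits=8):
    """Map a float to an unsigned integer, clamping out-of-range values."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize_affine(q, scale, zero_point):
    """Map an unsigned integer back to an approximate float."""
    return (q - zero_point) * scale


# Illustrative activation values collected from a calibration batch
calibration_activations = [-1.0, 0.2, 0.8, 2.4, 3.0, 0.1]
scale, zero_point = calibrate_affine(calibration_activations)

for x in [-1.0, 0.0, 2.4, 10.0]:   # 10.0 lies outside the calibrated range
    q = quantize_affine(x, scale, zero_point)
    print(f"{x:+.2f} -> q={q:3d} -> {dequantize_affine(q, scale, zero_point):+.4f}")
```

Values outside the calibrated range are clamped, which is one reason the calibration set should resemble the data seen in production.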
Quantization-Aware Training (QAT)
For better accuracy, train the model with quantization in mind. QAT simulates quantization during training, allowing the model to learn weight distributions that are quantization-friendly:
import torch
from torch.ao.quantization import (  # torch.quantization is the older alias
    prepare_qat,
    convert,
    get_default_qat_qconfig,
)

def quantization_aware_training(
    model,
    train_loader,
    val_loader,
    num_epochs=5,
    learning_rate=1e-4
):
    """
    Fine-tune model with simulated quantization (QAT).

    Assumes an eager-mode model whose forward pass is wrapped in
    QuantStub/DeQuantStub; without them the converted model cannot
    consume float inputs. val_loader is accepted for optional per-epoch
    validation, which is omitted here for brevity.
    """
    # Configure quantization settings
    model.qconfig = get_default_qat_qconfig('fbgemm')  # For x86 CPU
    # Or for mobile (ARM): get_default_qat_qconfig('qnnpack')

    # Prepare for QAT: insert fake quantize modules.
    # prepare_qat only works on a model in training mode.
    model.train()
    model = prepare_qat(model, inplace=False)

    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    for epoch in range(num_epochs):
        model.train()
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if batch_idx % 100 == 0:
                print(f"Epoch {epoch}, Batch {batch_idx} - Loss: {loss.item():.4f}")

    # Convert to an actual int8 model (switch to eval mode first)
    model.eval()
    quantized_model = convert(model, inplace=False)
    return quantized_model

# Train with quantization awareness
# (model, train_loader and val_loader are assumed to be defined elsewhere)
quantized_model = quantization_aware_training(
    model, train_loader, val_loader, num_epochs=3
)

# Verify accuracy; evaluate() is a placeholder for your own evaluation helper
accuracy = evaluate(quantized_model, val_loader)
print(f"QAT Quantized Model Accuracy: {accuracy:.4f}")
QAT requires access to training data and takes longer (hours to days), but typically recovers much of the accuracy that PTQ loses (95-99% retention vs. 92-97% for PTQ).
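To make "simulating quantization during training" concrete, here is a toy sketch of the fake-quantize idea with a straight-through estimator: the forward pass sees the rounded weight, while the gradient update is applied to the underlying float weight as if rounding were the identity. It fits a single weight in pure Python; the function names, the fixed scale of 0.05, and the toy data are all assumptions made for illustration, not how a framework implements it internally.

```python
def fake_quantize(w, scale, num_bits=8):
    """Snap w to the nearest representable value but keep it as a float."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -(2 ** (num_bits - 1))
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_one_weight(xs, ys, w=0.0, scale=0.05, lr=0.05, epochs=200):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            w_q = fake_quantize(w, scale)   # forward pass uses the quantized weight
            error = w_q * x - y
            grad = 2 * error * x            # gradient of squared error w.r.t. w_q
            w -= lr * grad                  # straight-through: treat d(w_q)/d(w) as 1
    return w, fake_quantize(w, scale)


# Toy data generated with a true weight of 0.73, which is not on the 0.05 grid
xs = [1.0, 2.0, 3.0]
ys = [0.73 * x for x in xs]
w_float, w_quant = train_one_weight(xs, ys)
print(f"latent float weight: {w_float:.4f}, deployed quantized weight: {w_quant:.2f}")
```

The latent float weight hovers near the boundary between two grid points, and the deployed value ends up on a grid point adjacent to the true weight. In a real network the same mechanism lets neighboring weights compensate for each other's rounding.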
Bit Width Selection and Tradeoffs
Different bit widths offer different compression-accuracy tradeoffs:
| Format | Compression Ratio (vs. FP32) | Accuracy Retention | Deployment Notes |
|---|---|---|---|
| FP32 (baseline) | 1x | 100% | Baseline; no compression |
| FP16 (half-precision) | 2x | 98-99% | Works on most modern hardware; minimal loss |
| INT8 | 4x | 94-97% | Widely supported; good balance |
| INT4 | 8x | 88-95% | Less universal support; requires careful tuning |
| INT2 | 16x | 70-85% | Extreme compression; large accuracy loss |
For most practitioners, INT8 is the sweet spot: good compression (4x), high accuracy retention (around 95%), and broad hardware support. INT4 is worth exploring if INT8 still exceeds your memory or latency budget.
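The compression ratios in the table follow directly from the number of bits stored per weight. A small sketch of the arithmetic, assuming a hypothetical model with one billion parameters and counting weights only (real files are somewhat larger because they also store scales, zero-points, and any layers left unquantized):

```python
def weight_size_mb(num_params, bits_per_weight):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


formats = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4, "INT2": 2}
num_params = 1_000_000_000  # hypothetical 1B-parameter student

for name, bits in formats.items():
    size = weight_size_mb(num_params, bits)
    ratio = 32 / bits
    print(f"{name}: {size:,.0f} MB ({ratio:.0f}x smaller than FP32)")
```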
Advanced Quantization: Per-Channel and Mixed Precision
By default, quantization uses a single scale factor per layer (per-tensor quantization). More granular approaches:
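Per-Channel Quantization: Each output channel gets its own scale factor, so a channel with small weights is not forced onto a grid sized for the largest channel in the layer. The bit width stays the same for every channel. A related but separate idea, sketched here, also varies the bit width per channel according to importance (e.g., important channels stay at 8-bit, less important channels go to 4-bit), which is really mixed precision at channel granularity:
def per_channel_bit_allocation(model, bit_widths, threshold_high, threshold_low):
    """
    Assign different bit widths to different channels/heads based on importance.

    compute_channel_importance and apply_mixed_precision_quantization are
    placeholders for your own implementations, and model.layers is assumed
    to be iterable. The thresholds must be chosen by the caller.
    """
    importances = compute_channel_importance(model)
    for layer in model.layers:
        for channel_idx, importance in enumerate(importances[layer]):
            if importance > threshold_high:
                bit_widths[layer][channel_idx] = 8  # Important: keep 8-bit
            elif importance > threshold_low:
                bit_widths[layer][channel_idx] = 4  # Medium: 4-bit
            else:
                bit_widths[layer][channel_idx] = 2  # Unimportant: 2-bit
    return apply_mixed_precision_quantization(model, bit_widths)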
Mixed Precision: Use different bit widths for different layer types. For example, attention layers might stay at INT8 (critical for reasoning), while feed-forward layers might go to INT4 (less critical).
def mixed_precision_quantization(model):
    """
    Quantize different layers to different bit widths.

    apply_config_to_model is a placeholder for your own implementation.
    """
    quantization_config = {
        'attention': {'weight_bits': 8, 'activation_bits': 8},  # Critical
        'ffn': {'weight_bits': 4, 'activation_bits': 4},        # Less critical
        'embedding': {'weight_bits': 8, 'activation_bits': 8},  # Critical
    }
    return apply_config_to_model(model, quantization_config)
Per-channel scales and mixed precision typically improve accuracy retention by 1-3% but add deployment complexity (more scales to store, and kernels that must handle several bit widths). Use them when uniform quantization cannot reach your accuracy target.
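The benefit of per-channel scales is easiest to see when two output channels have very different magnitudes. The following pure-Python sketch compares the round-trip error of a single per-tensor scale with one scale per channel; the weight values and helper names are made up for illustration.

```python
def round_trip(row, scale, qmax=127):
    """Quantize a row to signed 8-bit with the given scale, then dequantize."""
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in row]


def max_abs_error(row, restored):
    """Largest absolute difference between original and restored weights."""
    return max(abs(a - b) for a, b in zip(row, restored))


# Two output channels with very different ranges (illustrative values)
weights = [
    [0.010, -0.020, 0.015],   # small-magnitude channel
    [4.000, -3.500, 2.750],   # large-magnitude channel
]

# Per-tensor: one scale derived from the largest weight in the whole tensor
tensor_scale = max(abs(w) for row in weights for w in row) / 127
# Per-channel: one scale per output channel
channel_scales = [max(abs(w) for w in row) / 127 for row in weights]

for idx, row in enumerate(weights):
    err_tensor = max_abs_error(row, round_trip(row, tensor_scale))
    err_channel = max_abs_error(row, round_trip(row, channel_scales[idx]))
    print(f"channel {idx}: per-tensor error {err_tensor:.5f}, "
          f"per-channel error {err_channel:.5f}")
```

With the shared scale, the small-magnitude channel is rounded almost entirely to zero or one grid step, while its own scale preserves it closely. The large channel is unaffected because it already determines the shared scale.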
Validating Quantized Models
After quantization, rigorously test accuracy and latency:
import time
import torch

def evaluate_quantized_model(
    quantized_model,
    original_model,
    test_loader,
    device='cpu'
):
    """
    Compare quantized vs. original model on accuracy and speed.

    Note: PyTorch's int8 kernels (fbgemm/qnnpack) run on CPU, so keep
    device='cpu' for models quantized with the APIs shown earlier.
    """
    quantized_model.to(device).eval()
    original_model.to(device).eval()

    # Accuracy and timing accumulators
    quant_correct, orig_correct = 0, 0
    quant_time, orig_time = 0.0, 0.0

    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Quantized inference
        start = time.perf_counter()
        with torch.no_grad():
            quant_out = quantized_model(inputs)
        quant_time += time.perf_counter() - start

        # Original inference
        start = time.perf_counter()
        with torch.no_grad():
            orig_out = original_model(inputs)
        orig_time += time.perf_counter() - start

        quant_correct += (quant_out.argmax(dim=1) == labels).sum().item()
        orig_correct += (orig_out.argmax(dim=1) == labels).sum().item()

    total = len(test_loader.dataset)
    quant_acc = quant_correct / total
    orig_acc = orig_correct / total
    speedup = orig_time / quant_time

    # Times are totals over the whole test set, not per-batch latencies
    print(f"Original Accuracy: {orig_acc:.4f}, Total time: {orig_time*1000:.1f} ms")
    print(f"Quantized Accuracy: {quant_acc:.4f}, Total time: {quant_time*1000:.1f} ms")
    print(f"Accuracy Retention: {100*quant_acc/orig_acc:.1f}%")
    print(f"Speedup: {speedup:.1f}x")
    return {'accuracy_retention': quant_acc / orig_acc, 'speedup': speedup}
A typical result: int8 quantization yields 95-97% accuracy retention and 1.5-3x speedup (depending on hardware support).
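Speedup figures like these are sensitive to how latency is measured: a single pass over the test set mixes cold-start cost with steady-state cost, and a few slow outliers can skew a total. One common approach, sketched here under the assumption that a median and a 95th percentile over repeated runs are what you care about, is to warm up first and then time many calls. The `benchmark` helper is hypothetical and uses only the standard library; `sum` stands in for a model call.

```python
import statistics
import time


def benchmark(fn, *args, warmup=5, runs=30):
    """Return (median, 95th percentile) latency of fn(*args) in milliseconds."""
    for _ in range(warmup):              # warm caches before measuring
        fn(*args)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return statistics.median(samples), p95


# Stand-in workload; replace with a call that runs your model on one batch
median_ms, p95_ms = benchmark(sum, range(10_000))
print(f"median: {median_ms:.3f} ms, p95: {p95_ms:.3f} ms")
```

To time a model, pass a callable such as `lambda: quantized_model(inputs)` (run under `torch.no_grad()`), and run the same measurement for the original model on the same device.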
Exporting for Deployment
After quantization, export to deployment-friendly formats:
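import numpy as np
import torch
import tensorflow as tf
import coremltools as ct

# Export to ONNX (cross-platform)
# input_ids are token indices, so the dummy input must be an integer tensor
dummy_input = torch.zeros(1, 512, dtype=torch.long)
torch.onnx.export(
    quantized_model,
    dummy_input,
    "student_quantized.onnx",
    input_names=['input_ids'],
    output_names=['logits'],
    opset_version=14
)
# Exporter support for quantized operators varies by PyTorch version. If the
# export fails, a common fallback is to export the float model and quantize
# it with ONNX Runtime's quantization tooling instead.

# Export to TensorFlow Lite (mobile)
# (Requires a TensorFlow SavedModel and a conversion pipeline; see TFLite docs)
converter = tf.lite.TFLiteConverter.from_saved_model("quantized_model_tf")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Full-integer conversion needs calibration samples: representative_dataset
# is a generator you supply that yields lists of sample inputs
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
tflite_model = converter.convert()
with open("student_quantized.tflite", "wb") as f:
    f.write(tflite_model)

# Export to Apple Core ML (iOS)
# coremltools converts a traced (TorchScript) model, not a raw nn.Module.
# Int8 PyTorch modules may not convert; if so, trace the float student and
# compress it afterwards with coremltools' own optimization tools.
traced_model = torch.jit.trace(quantized_model, dummy_input)
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 512), dtype=np.int32)]
)
# Recent coremltools versions produce an ML Program, saved as .mlpackage
mlmodel.save("student_quantized.mlpackage")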
Key Takeaways
- Quantization maps floating-point weights to integers (8-bit, 4-bit, 2-bit), reducing size by 2-8x.
- Post-training quantization (PTQ) is fast but loses 3-8% accuracy; quantization-aware training (QAT) retains 95-99% accuracy but takes longer.
- INT8 is the practical sweet spot: 4x compression, 95%+ accuracy retention, widely supported hardware.
- Per-channel and mixed precision improve accuracy by 1-3% but add complexity; use when hitting accuracy floors.
- Always validate quantized models on both accuracy and latency before deployment.
Frequently Asked Questions
Can I quantize a model to 2-bit or 1-bit?
Yes, but expect roughly 15-30% accuracy loss at 2-bit, in line with the table above. Binary (1-bit) models are more extreme still; they work for certain tasks (classification on large, clean datasets) but are impractical for general NLP/vision. Stick to 4-bit as a minimum for production.
Does quantization work with all architectures?
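Most architectures work fine (Transformers, CNNs, RNNs). Linear and convolutional weights usually quantize well. In Transformers, activations with large outliers and operations such as softmax and layer normalization tend to be more sensitive, so they are often kept at higher precision. Some custom layers may require special handling; test empirically.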
Should I quantize before or after distillation?
Distill first, then quantize. A distilled model tends to quantize more cleanly because the student has already learned compressed representations. Quantizing the teacher before distillation introduces rounding errors that degrade the teacher's outputs and, through them, the student.
How much latency improvement does quantization actually provide?
On hardware with quantization support (modern CPUs, mobile chips, specialized accelerators), a 1.5-3x speedup is typical. On unsupported hardware, the speedup is negligible because the quantized weights are converted back to float for computation. Always benchmark on target hardware.
Can I combine quantization with pruning?
Yes, and it is encouraged. Distill → quantize → prune (or distill → prune → quantize) can reach compression ratios of 50-200x with 85-95% accuracy. The order matters slightly and the better choice depends on the model, so benchmark both orderings on your own task.