
Neural Network Pruning: Reduce Model Size 5-10x

Pruning removes redundant weights and neurons from trained models, creating sparse networks that are smaller, faster, and sometimes generalize better. Unlike quantization, which reduces numerical precision, pruning eliminates weights entirely: it sets them to zero and, in the structured case, removes them from the computational graph. A heavily pruned model can reach 5-10x compression with roughly 1-5% accuracy loss. Combined with distillation and quantization, pruning is the final step in aggressive compression pipelines: distill to compress knowledge, quantize to reduce precision, prune to eliminate redundancy.

Pruning Fundamentals

Pruning works because neural networks learn redundant representations. A pre-trained model has more capacity than necessary for most tasks: some weights are nearly zero and contribute little to predictions, some neurons are correlated and duplicate computation, and entire attention heads in transformers can be bypassed without accuracy loss. By identifying and removing these redundancies, you shrink the model without degrading performance.

Two main categories of pruning:

Unstructured Pruning: Remove individual weights based on magnitude or importance scores. This creates sparse weight matrices that require special hardware to accelerate (most mobile/edge chips do not support sparse operations efficiently).

Structured Pruning: Remove entire neurons, attention heads, or layers. This reduces the number of operations and creates regular (dense) sub-networks that standard hardware can efficiently run. Structured pruning is usually preferred in practice because it is hardware-compatible and easier to deploy.
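The difference between the two categories is easiest to see on a single weight matrix. The following is a minimal sketch on a toy tensor (the sizes and the 50% ratio are arbitrary assumptions): unstructured pruning leaves the shape intact and fills it with zeros, while structured pruning returns a smaller dense matrix.

```python
import torch

torch.manual_seed(0)
weight = torch.randn(8, 16)  # toy layer: 8 output neurons, 16 inputs

# Unstructured: zero the 50% smallest-magnitude entries; the shape is unchanged.
k = weight.numel() // 2
threshold = weight.abs().flatten().kthvalue(k).values
unstructured = weight * (weight.abs() > threshold)

# Structured: drop the 4 output neurons (rows) with the smallest L2 norm.
row_norms = weight.norm(dim=1)
keep_rows = row_norms.topk(4).indices.sort().values
structured = weight[keep_rows]

print(unstructured.shape, (unstructured == 0).sum().item())  # same shape, half zeros
print(structured.shape)                                      # smaller dense matrix
```

A standard dense matrix multiply does the same amount of work on the first result as on the original, which is why only the second form speeds up ordinary hardware.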

Magnitude-Based Pruning

The simplest approach: remove weights with the smallest absolute values.

import torch
import torch.nn as nn

def magnitude_pruning(model, prune_ratio=0.5):
    """
    Zero out a fraction of the weights in each linear layer, based on magnitude.

    Args:
        model: PyTorch model
        prune_ratio: Fraction of weights to prune per layer (0.5 = 50%)

    Returns:
        The same model, with low-magnitude weights set to zero in place
        (the tensors stay dense; the zeros are stored explicitly)
    """
    total_params = 0
    pruned_params = 0

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            with torch.no_grad():
                # Compute the per-layer magnitude threshold.
                # torch.quantile can reject very large tensors; for big layers,
                # torch.kthvalue on the flattened magnitudes is an alternative.
                weight_magnitude = torch.abs(module.weight)
                threshold = torch.quantile(
                    weight_magnitude.flatten(),
                    prune_ratio  # Threshold at the prune_ratio quantile
                )

                # Create mask: 1 for weights to keep, 0 for weights to prune
                mask = (weight_magnitude > threshold).to(module.weight.dtype)

                # Apply mask in place
                module.weight.mul_(mask)

            # Track statistics
            total_params += module.weight.numel()
            pruned_params += (mask == 0).sum().item()

    sparsity = 100 * pruned_params / max(total_params, 1)
    print(f"Model sparsity: {sparsity:.1f}% (removed {pruned_params:,} weights)")

    return model

# Prune the model (model, evaluate, and val_loader are assumed to be defined elsewhere)
pruned_model = magnitude_pruning(model, prune_ratio=0.7)  # Remove 70% of weights

# Verify accuracy
accuracy = evaluate(pruned_model, val_loader)
print(f"Pruned Model Accuracy: {accuracy:.4f}")

Magnitude pruning is fast and simple, but it is unstructured: it creates sparse matrices that are not hardware-efficient. It can be applied post hoc (pruning a pre-trained model without retraining), but without fine-tuning the accuracy loss can reach 5-15%.
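PyTorch also ships a pruning utility, torch.nn.utils.prune, that covers the same ground as the hand-written function above. The sketch below applies it to a single toy layer (the layer sizes are arbitrary); it is one possible way to do magnitude pruning, not necessarily the approach the rest of this article assumes.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(64, 32)

# Zero the 50% of weights with the smallest absolute value.
# This registers `weight_orig` (a parameter) and `weight_mask` (a buffer);
# `layer.weight` is recomputed as weight_orig * weight_mask before each forward pass.
prune.l1_unstructured(layer, name="weight", amount=0.5)
sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity: {sparsity:.2f}")

# Make the pruning permanent: drops the mask and the reparametrization.
prune.remove(layer, "weight")
has_mask_after_remove = hasattr(layer, "weight_mask")
```

Because the mask is re-applied on every forward pass, pruned weights stay at zero during any fine-tuning done before prune.remove is called.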

Structured Pruning: Removing Heads and Layers

For hardware efficiency, structured pruning removes entire units:

def prune_attention_heads(model, prune_fraction=0.3):
    """
    Remove the least-important attention heads from a transformer.

    Assumes each layer exposes attention.num_heads, attention.head_weights
    (one weight tensor per head), and attention.prune_heads(indices).

    Args:
        model: Transformer model
        prune_fraction: Fraction of heads to remove per layer (0.3 = 30%)

    Returns:
        Model with pruned attention heads
    """
    for layer_idx, layer in enumerate(model.layers):
        num_heads = layer.attention.num_heads
        num_to_prune = int(num_heads * prune_fraction)
        if num_to_prune == 0:
            continue

        # Compute importance scores for each head (simplified: use weight magnitude)
        importance_scores = torch.stack([
            layer.attention.head_weights[head_idx].abs().sum()
            for head_idx in range(num_heads)
        ])

        # Identify least important heads
        prune_indices = torch.topk(
            importance_scores,
            num_to_prune,
            largest=False
        ).indices.tolist()

        # Remove these heads
        layer.attention.prune_heads(prune_indices)

        print(f"Layer {layer_idx}: Pruned {num_to_prune} / {num_heads} heads")

    return model

# Prune heads, then fine-tune briefly to recover accuracy
pruned_model = prune_attention_heads(model, prune_fraction=0.3)

pruned_model = finetune_after_pruning(
    pruned_model, train_loader, num_epochs=2, learning_rate=1e-5
)

Structured pruning is more effective for hardware: removing a full head reduces the FLOPs for that component. Pruning 30% of heads often yields 1.3-1.5x speedup and 1-3% accuracy loss (recoverable with brief fine-tuning).
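The FLOP reduction comes from physically shrinking the weight matrices, not from masking them. A minimal sketch of that step for a two-layer MLP follows; prune_hidden_neurons and the L2-norm importance score are illustrative assumptions, and the same slicing idea applies to attention heads, where each head owns a block of rows in the query/key/value projections and a block of columns in the output projection.

```python
import torch
import torch.nn as nn

def prune_hidden_neurons(fc1, fc2, keep_fraction=0.5):
    """Return smaller (fc1, fc2) keeping the hidden neurons with the largest L2 norm."""
    num_keep = max(1, int(fc1.out_features * keep_fraction))
    scores = fc1.weight.norm(dim=1)                 # one score per hidden neuron
    keep = scores.topk(num_keep).indices.sort().values

    new_fc1 = nn.Linear(fc1.in_features, num_keep)
    new_fc2 = nn.Linear(num_keep, fc2.out_features)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[keep])      # rows = kept neurons
        new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])   # matching input columns
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2, keep

torch.manual_seed(0)
fc1, fc2 = nn.Linear(16, 32), nn.Linear(32, 4)
small_fc1, small_fc2, keep = prune_hidden_neurons(fc1, fc2, keep_fraction=0.5)

x = torch.randn(3, 16)
with torch.no_grad():
    # Reference: the original network with the dropped neurons' activations zeroed.
    mask = torch.zeros(32)
    mask[keep] = 1.0
    reference = fc2(torch.relu(fc1(x)) * mask)
    pruned_out = small_fc2(torch.relu(small_fc1(x)))

print(torch.allclose(reference, pruned_out, atol=1e-5))
```

The smaller network produces the same output as the masked original while running two matrix multiplies of half the hidden width, with no sparse kernels required.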

The Lottery Ticket Hypothesis

An influential finding (Frankle & Carbin, 2019): randomly initialized neural networks contain "lottery tickets"—sparse subnetworks that, when trained in isolation, match the full network's accuracy. This suggests that pruning after training can find these high-performing subnetworks.

import copy

def iterative_magnitude_pruning(model, train_loader, val_loader,
                                num_iterations=10, prune_fraction_per_iter=0.2):
    """
    Iteratively prune and retrain: remove low-magnitude weights,
    retrain to recover, repeat.

    This gradually identifies the lottery ticket (high-performing sparse subnetwork).
    """
    # Save initial weights. A deep copy is required: state_dict().copy() is
    # shallow and would alias the live tensors. Kept for optional rewinding.
    initial_state = copy.deepcopy(model.state_dict())

    for iteration in range(num_iterations):
        # Prune a fraction of the remaining weights. magnitude_pruning takes a
        # total ratio, so the cumulative target is 1 - (1 - p)^(k + 1).
        target_sparsity = 1 - (1 - prune_fraction_per_iter) ** (iteration + 1)
        magnitude_pruning(model, prune_ratio=target_sparsity)

        # Retrain briefly (the helper should keep pruned weights at zero,
        # otherwise gradient updates regrow them)
        finetune_after_pruning(model, train_loader, num_epochs=3, learning_rate=1e-4)

        # Evaluate
        accuracy = evaluate(model, val_loader)
        sparsity = compute_sparsity(model)
        print(f"Iteration {iteration}: Sparsity {sparsity:.1f}%, Accuracy {accuracy:.4f}")

    return model

# Run iterative pruning
lottery_ticket = iterative_magnitude_pruning(model, train_loader, val_loader)

Iterative pruning is more effective than one-shot pruning: you can reach 80-90% sparsity (removing 80-90% of weights) while retaining 95%+ of the original accuracy. The trade-off is the cost of multiple retraining rounds.
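The loop above calls a helper named compute_sparsity that is not defined in this article. A minimal sketch of what it might look like is shown below, under the assumption that only nn.Linear weights are counted, together with a hypothetical cumulative_sparsity function showing how sparsity compounds when each round removes a fraction of the remaining weights.

```python
import torch
import torch.nn as nn

def compute_sparsity(model):
    """Percentage of exactly-zero entries across all nn.Linear weight matrices."""
    zeros, total = 0, 0
    for module in model.modules():
        if isinstance(module, nn.Linear):
            zeros += (module.weight == 0).sum().item()
            total += module.weight.numel()
    return 100.0 * zeros / max(total, 1)

def cumulative_sparsity(prune_fraction_per_iter, num_iterations):
    """Total sparsity after repeatedly pruning a fraction of the remaining weights."""
    return 1.0 - (1.0 - prune_fraction_per_iter) ** num_iterations

schedule = [round(cumulative_sparsity(0.2, k), 3) for k in (1, 2, 5, 10)]
print(schedule)  # [0.2, 0.36, 0.672, 0.893]
```

Ten rounds at 20% per round land at roughly 89% sparsity, which is how a modest per-round fraction reaches the 80-90% range.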

Pruning + Distillation

Combining pruning with distillation yields compounding benefits:

def pruned_distillation(
    teacher,
    student,
    train_loader,
    val_loader,
    num_epochs=10,
    prune_schedule=None,  # List of (epoch, sparsity) tuples
):
    """
    Prune student progressively during distillation training.

    Args:
        prune_schedule: e.g., [(2, 0.3), (5, 0.6), (8, 0.8)]
            prune at epoch 2 to 30%, at epoch 5 to 60%, etc.
    """
    prune_schedule = prune_schedule or []  # Avoid a mutable default argument

    for epoch in range(num_epochs):
        # Check if we should prune at this epoch
        for prune_epoch, target_sparsity in prune_schedule:
            if epoch == prune_epoch:
                magnitude_pruning(student, prune_ratio=target_sparsity)
                print(f"Epoch {epoch}: Pruned to {target_sparsity*100:.0f}% sparsity")

        # Standard distillation training (train_one_epoch should keep pruned
        # weights at zero, otherwise gradient updates regrow them)
        train_one_epoch(student, teacher, train_loader)
        accuracy = evaluate(student, val_loader)
        sparsity = compute_sparsity(student)

        print(f"Epoch {epoch}: Sparsity {sparsity:.1f}%, Accuracy {accuracy:.4f}")

    return student

# Prune gradually during distillation training
pruned_distillation(
    teacher, student, train_loader, val_loader,
    prune_schedule=[(2, 0.3), (4, 0.5), (6, 0.7), (8, 0.9)]
)

Pruning during distillation training (gradually removing weights while the student learns from the teacher) often outperforms pruning a pre-trained model. The student learns to work with the sparsity constraint from the beginning.
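The training step itself is left abstract above. As a rough sketch of what one step could involve, the code below pairs a common soft-target distillation loss with a helper that re-zeroes pruned weights after each optimizer update. Both function names, the temperature, and the alpha weighting are assumptions for illustration, not the article's own implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend soft-target KL divergence with hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # T^2 keeps the soft-target gradients on the same scale as the hard-label term.
    kd = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def reapply_masks(model, masks):
    """Zero pruned weights again after an optimizer step.

    masks maps a parameter name to a 0/1 tensor of the same shape.
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

torch.manual_seed(0)
logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss_same = distillation_loss(logits, logits, labels, alpha=1.0)  # teacher == student
loss_diff = distillation_loss(logits, torch.randn(4, 10), labels, alpha=1.0)
print(loss_same.item(), loss_diff.item())
```

Calling reapply_masks after every optimizer.step() is what keeps the sparsity level fixed between pruning events; without it, gradient updates move the zeroed weights away from zero.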

Calibrating Pruning Ratios

How much can you prune before accuracy tanks? This depends on task, architecture, and target hardware:

| Architecture | Task | Pruning Ratio | Accuracy Retention | Hardware Notes |
|---|---|---|---|---|
| BERT-base | Classification | 50% (unstructured) | 98-99% | Requires sparse ops |
| BERT-base | Classification | 30% (structured) | 96-99% | Runs on standard hardware |
| Llama-7B | Generation | 70% (unstructured) | 90-95% | Dense hardware only |
| Llama-7B | Generation | 40% (structured) | 93-98% | Standard accelerators |
| ResNet-50 | ImageNet | 80% (unstructured) | 85-92% | Mobile hardware poor |
| MobileNet | ImageNet | 50% (structured) | 95-98% | Mobile-optimized base |

General rules of thumb:

  • Start with 30-50% sparsity; validate accuracy.
  • Increase to 70-80% if accuracy stays above your threshold.
  • Beyond 90%, expect significant accuracy loss unless using iterative pruning or lottery tickets.
  • Structured pruning on mobile-optimized architectures (MobileNet, EfficientNet) is most practical.
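The first two rules amount to a sweep over sparsity levels with an accuracy budget. A small sketch of that search follows; max_safe_sparsity is a hypothetical helper, and the accuracy curve is made up purely to exercise it (in practice evaluate_at would prune a fresh copy of the model and evaluate it).

```python
def max_safe_sparsity(evaluate_at, levels, baseline_accuracy, max_drop=0.01):
    """
    Return the highest sparsity level whose accuracy stays within max_drop
    of the baseline, or None if no level qualifies.

    evaluate_at: callable mapping a sparsity level to validation accuracy.
    """
    best = None
    for level in sorted(levels):
        accuracy = evaluate_at(level)
        if baseline_accuracy - accuracy <= max_drop:
            best = level
        else:
            break  # stop at the first level that exceeds the accuracy budget
    return best

# Made-up accuracy curve, for illustration only (not measured results).
fake_curve = {0.3: 0.912, 0.5: 0.909, 0.7: 0.901, 0.8: 0.880, 0.9: 0.790}
chosen = max_safe_sparsity(
    fake_curve.get, fake_curve.keys(), baseline_accuracy=0.915, max_drop=0.015
)
print(chosen)  # 0.7
```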

Validating Pruned Models on Hardware

After pruning, test actual deployment latency and memory:

import os
import time
import torch

def benchmark_pruned_model(model, input_size=(1, 512), num_runs=100):
    """
    Measure latency, peak GPU memory, and on-disk size of a pruned model.
    """
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    model = model.to(device).eval()
    dummy_input = torch.randn(input_size, device=device)

    # Warm up
    with torch.no_grad():
        for _ in range(10):
            _ = model(dummy_input)

    # Memory profiling (CUDA only): reset after warm-up so it covers the timed runs
    if use_cuda:
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()

    # Latency profiling
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(num_runs):
            _ = model(dummy_input)
    if use_cuda:
        torch.cuda.synchronize()  # Wait for queued GPU kernels before stopping the clock
    latency = (time.perf_counter() - start) / num_runs * 1000

    peak_mem = torch.cuda.max_memory_allocated() / 1e6 if use_cuda else 0.0

    # Model size on disk
    torch.save(model.state_dict(), "temp_model.pt")
    disk_size = os.path.getsize("temp_model.pt") / 1e6
    os.remove("temp_model.pt")

    print(f"Latency: {latency:.1f} ms")
    print(f"Peak memory: {peak_mem:.1f} MB")
    print(f"Disk size: {disk_size:.1f} MB")

For structured pruning on standard hardware (CPUs, GPUs, mobile processors), you should see 1.5-3x speedup and proportional memory savings. Unstructured pruning may show little speedup without specialized sparse matrix support.
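The same caveat applies to disk size. A state dict whose weights were zeroed by a mask still stores every zero explicitly, so the file only shrinks if the weights are saved in a sparse format or compressed. The back-of-the-envelope sketch below compares dense fp32 storage with an approximate CSR layout; the 4096x4096 layer and 32-bit indices are assumptions chosen for illustration.

```python
def dense_bytes(rows, cols, bytes_per_value=4):
    """Size of a dense matrix that stores every entry, zeros included."""
    return rows * cols * bytes_per_value

def csr_bytes(rows, cols, sparsity, bytes_per_value=4, bytes_per_index=4):
    """Approximate CSR size: nonzero values + column indices + row pointers."""
    nnz = round(rows * cols * (1.0 - sparsity))
    return nnz * (bytes_per_value + bytes_per_index) + (rows + 1) * bytes_per_index

rows, cols = 4096, 4096
for sparsity in (0.5, 0.7, 0.9):
    ratio = dense_bytes(rows, cols) / csr_bytes(rows, cols, sparsity)
    print(f"sparsity {sparsity:.0%}: {ratio:.2f}x smaller than dense fp32")
```

With one 32-bit index per stored value, 50% sparsity roughly breaks even with dense storage, and it takes about 90% sparsity to approach a 5x reduction from pruning alone.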

Key Takeaways

  • Pruning removes redundant weights or neurons, achieving 5-10x compression with 1-5% accuracy loss.
  • Magnitude-based pruning is simple but unstructured; structured pruning (heads, layers) is hardware-efficient.
  • Iterative pruning (prune → retrain → repeat) achieves higher sparsity (80-90%) than one-shot pruning while retaining accuracy.
  • Pruning during distillation (progressive) often outperforms pruning pre-trained models.
  • Always benchmark on target hardware; unstructured pruning may not accelerate without special sparse ops.

Frequently Asked Questions

How much latency improvement does pruning provide?

Structured pruning on CPU/GPU/mobile: 1.5-3x speedup. Unstructured pruning without sparse ops: 0-10% speedup (weight storage is reduced, but computation is unchanged). Always benchmark on target hardware.

Should I prune the teacher or student?

Prune the student (after distillation). Pruning the teacher before distillation degrades the quality of soft targets, hurting the student. Distill first, then prune and quantize the student.

Can I achieve 90% sparsity while retaining accuracy?

Yes, but not with one-shot magnitude pruning. Use iterative pruning with retraining, or lottery ticket hunting (identify sparse subnetworks). With these techniques, 80-90% sparsity is achievable on most models with 95%+ accuracy retention.
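In the original lottery ticket procedure, the surviving weights are reset to their initial values after each pruning round before retraining. A minimal sketch of that rewinding step on a toy layer is shown here; rewind_to_ticket and the random perturbation standing in for training are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn

def rewind_to_ticket(model, initial_state, masks):
    """
    Reset surviving weights to their initial values and zero the pruned ones.
    masks maps a parameter name to a 0/1 tensor of the same shape.
    """
    model.load_state_dict(initial_state)
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
    return model

torch.manual_seed(0)
model = nn.Linear(8, 4)
initial_state = copy.deepcopy(model.state_dict())  # snapshot before training

# Stand-in for training: perturb the weights (a real run would use an optimizer).
with torch.no_grad():
    model.weight.add_(torch.randn_like(model.weight))

# Keep the 50% largest-magnitude trained weights.
k = model.weight.numel() // 2
threshold = model.weight.abs().flatten().kthvalue(k).values
masks = {"weight": (model.weight.abs() > threshold).float()}

ticket = rewind_to_ticket(model, initial_state, masks)
print((ticket.weight == 0).sum().item(), "of", ticket.weight.numel(), "weights pruned")
```

The mask is chosen from the trained magnitudes, but the values that survive are the initial ones; retraining this subnetwork is the "ticket" being tested.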

What is the difference between channel pruning and weight pruning?

Weight pruning removes individual weights (unstructured, not hardware-efficient). Channel pruning removes entire output channels of a layer (structured, hardware-efficient). For deployment, channel/structured pruning is preferable.
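For a concrete view of channel pruning, torch.nn.utils.prune provides ln_structured, which masks whole slices along a chosen dimension. The sketch below applies it to the output channels of a toy convolution; the layer sizes and the 50% amount are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

# Zero the 50% of output channels (dim=0) with the smallest L2 norm (n=2).
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)

# Each output channel is a (3, 3, 3) filter; count the ones that are entirely zero.
channel_is_zero = conv.weight.abs().sum(dim=(1, 2, 3)) == 0
num_zero_channels = int(channel_is_zero.sum())
print(num_zero_channels, "of", conv.out_channels, "channels zeroed")
```

Note that this call only masks the channels: the weight tensor keeps its original shape. To get the speedup, the zeroed channels still have to be sliced out into a smaller layer, along with the matching input channels of the next layer.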

Can I combine pruning with quantization?

Yes. Typical order: distill → quantize → prune. The student learns as an 8-bit (or 4-bit) quantized model, then structural redundancy is removed via pruning. Combined, you can achieve 50-200x compression with 85-95% accuracy retention.
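The combined figure follows from multiplying the individual savings. A rough sketch of that arithmetic is below; it assumes the three savings are independent and that pruned weights cost no storage, and the factors in the example call are hypothetical rather than measured.

```python
def combined_compression(distill_factor, bits_before, bits_after, sparsity):
    """Overall size reduction if the savings multiply and pruned weights cost nothing."""
    quant_factor = bits_before / bits_after
    prune_factor = 1.0 / (1.0 - sparsity)
    return distill_factor * quant_factor * prune_factor

# Hypothetical pipeline: 4x smaller student, fp32 -> int8, 75% of weights removed.
total = combined_compression(distill_factor=4, bits_before=32, bits_after=8, sparsity=0.75)
print(total)  # 64.0
```

Real pipelines usually fall short of the ideal product, because sparse formats add index overhead and some layers are left unpruned or unquantized.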

Further Reading