Merging LoRA Adapters into Base Models
After fine-tuning, you face a deployment choice: keep adapters separate (small, composable) or merge them into the base model (simpler, faster inference). Merging combines the adapter matrices U @ V^T with the original weights W, producing a single unified checkpoint that runs without LoRA overhead. This guide explains both approaches, when to use each, and practical merge strategies for different deployment scenarios (single-task inference, multi-adapter serving, quantized models).
Merging Fundamentals
Recall from earlier articles: during fine-tuning, LoRA adds a low-rank correction to specific weight matrices:
W_effective = W_base + (alpha / r) * U @ V^T
Merging permanently incorporates this correction:
W_merged = W_base + (alpha / r) * U @ V^T
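After merging, you discard the adapter files (U and V) and keep only W_merged. During inference, the model uses only W_merged, so it no longer computes and adds the low-rank correction on every forward pass (and, for quantized bases, no longer dequantizes around it). The speedup is slight, typically <5% in practice, because the low-rank path is cheap relative to the base matrix multiplication and is efficiently implemented.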
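To make the equivalence and the size of the overhead concrete, it may help to write out the forward pass for a single linear layer. This is a sketch under assumptions not stated above: an input vector x, a square d × d weight, and U and V both of shape d × r.

```latex
\begin{aligned}
y_{\text{separate}} &= W_{\text{base}}\,x + \frac{\alpha}{r}\,U\,\bigl(V^{\top} x\bigr) \\
y_{\text{merged}}   &= \Bigl(W_{\text{base}} + \frac{\alpha}{r}\,U V^{\top}\Bigr)\,x \;=\; y_{\text{separate}} \\[4pt]
\frac{\text{adapter FLOPs}}{\text{base FLOPs}} &\approx \frac{4rd}{2d^{2}} = \frac{2r}{d}
\qquad\text{e.g. } d = 4096,\ r = 16:\ \frac{32}{4096} \approx 0.8\%
\end{aligned}
```

The ratio counts arithmetic only; measured latency also depends on kernel launches and memory traffic, so treat it as a rough lower bound on the overhead rather than a benchmark.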
Strategy 1: Keep Adapters Separate (Recommended for Multi-Task)
Load the base model and adapter at inference time:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
# Load adapter
model = PeftModel.from_pretrained(
base_model,
"./llama2-7b-customer-support-adapter",
is_trainable=False
)
# Inference
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
input_text = "Instruction: Classify: I lost my password."
# Move inputs to the same device as the model (needed with device_map="auto")
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
Advantages:
- Small adapter files (10–500 MB vs. 13+ GB for full model).
- Easy to serve multiple adapters on the same base model (see Article 9).
- Version control-friendly; adapters change independently of base.
- No re-quantization needed if using QLoRA.
Disadvantages:
- Slight inference overhead (the low-rank correction is computed and added on every forward pass).
- Model-loading time includes both base and adapter.
- Requires PEFT library at inference time.
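The adapter and base-model sizes quoted above can be sanity-checked with simple arithmetic. The sketch below assumes a configuration the article does not specify: Llama-2-7B-like shapes (hidden size 4096, 32 layers), LoRA applied only to q_proj and v_proj, rank 16, and float16 storage. The helper names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: U is (d_out x rank), V is (d_in x rank)."""
    return rank * (d_in + d_out)


def size_mb(n_params: int, bytes_per_param: int = 2) -> float:
    """Storage in megabytes (10**6 bytes); 2 bytes per parameter for float16."""
    return n_params * bytes_per_param / 1e6


# Assumed configuration (not taken from the article).
hidden_size = 4096
num_layers = 32
modules_per_layer = 2   # q_proj and v_proj
rank = 16

adapter_params = (
    num_layers * modules_per_layer * lora_param_count(hidden_size, hidden_size, rank)
)
base_params = 6_740_000_000  # approximate parameter count of a 7B model

print(f"Adapter: {adapter_params:,} params, {size_mb(adapter_params):.1f} MB")
print(f"Base:    {size_mb(base_params) / 1000:.1f} GB")
```

Higher ranks or more target modules scale the adapter size linearly, which is how adapters can range from tens to hundreds of megabytes while the base stays fixed.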
Strategy 2: Merge Adapters (Recommended for Single-Task Deployment)
Permanently merge the adapter into the base model:
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
# Load adapter
model = PeftModel.from_pretrained(
base_model,
"./llama2-7b-customer-support-adapter",
is_trainable=False
)
# Merge adapters into base model
merged_model = model.merge_and_unload()
# Save merged model (now a standard Hugging Face model)
merged_model.save_pretrained("./llama2-7b-customer-support-merged")
# Verify: merged_model no longer has LoRA layers
print(merged_model)
The merge_and_unload() method:
- Computes W_merged = W_base + (alpha / r) * U @ V^T for each LoRA layer.
- Replaces the original weights with merged versions.
- Removes LoRA layers from the model (unloads adapters).
- Returns a standard Hugging Face model (no PEFT dependency).
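The steps above can be illustrated on a single toy layer with NumPy. This is a minimal sketch of the arithmetic, not PEFT's implementation; the shapes, seed, and variable names are assumptions chosen for illustration. In PEFT's naming, U corresponds roughly to lora_B and V^T to lora_A.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 6, 2, 4

W_base = rng.normal(size=(d_out, d_in))
U = rng.normal(size=(d_out, r))   # low-rank factor, shape (d_out, r)
V = rng.normal(size=(d_in, r))    # low-rank factor, shape (d_in, r)
x = rng.normal(size=(d_in,))

scale = alpha / r

# Separate path: base output plus the scaled low-rank correction.
y_separate = W_base @ x + scale * (U @ (V.T @ x))

# Merged path: fold the correction into the weight once; U and V are no longer needed.
W_merged = W_base + scale * (U @ V.T)
y_merged = W_merged @ x

print("Same shape as base:", W_merged.shape == W_base.shape)
print("Outputs match:", np.allclose(y_separate, y_merged))
```

The merged weight has exactly the shape of the original, which is why the result can be saved and loaded as an ordinary checkpoint.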
Advantages:
- Standard Hugging Face model; no PEFT dependency at inference.
- Slightly faster inference (no low-rank correction to compute and add on each forward pass).
- Easier deployment to production systems (TorchServe, vLLM, TensorRT).
- One checkpoint to manage (no separate adapter files).
Disadvantages:
- Large file size (about 13 GB for a 7B model in float16; about 140 GB for 70B, or roughly double in float32).
- Hard to maintain multiple task-specific variants (need separate checkpoints).
- If using QLoRA, re-quantization after merge can degrade quality.
Load Merged Model for Inference
Once merged and saved, load as a standard model:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load merged model (standard Hugging Face)
merged_model = AutoModelForCausalLM.from_pretrained(
"./llama2-7b-customer-support-merged",
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./llama2-7b-customer-support-merged")
# Inference
input_text = "Instruction: Classify: I need to reset my password."
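# Move inputs to the same device as the model (needed with device_map="auto")
inputs = tokenizer(input_text, return_tensors="pt").to(merged_model.device)
with torch.no_grad():
    outputs = merged_model.generate(**inputs, max_new_tokens=20)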
print(tokenizer.decode(outputs[0]))
No PEFT library needed. This is the standard AutoModelForCausalLM.from_pretrained() workflow.
Partial Merging: Merge Specific Adapters
If you have multiple adapters trained on the same base model, merge each one into its own freshly loaded copy of the base:
from transformers import AutoModelForCausalLM
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# Load first adapter
model = PeftModel.from_pretrained(base_model, "./adapter-task-1")
# Merge this adapter and unload it. Note: this updates base_model's weights in memory;
# only the checkpoint on disk (or on the Hub) stays unchanged
merged = model.merge_and_unload()
# merged is now a standard Hugging Face model; save it
merged.save_pretrained("./llama2-7b-task-1-merged")
# To build a variant for another adapter:
# Reload the base from the original checkpoint (the in-memory copy now contains adapter 1)
base_model_2 = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model_2 = PeftModel.from_pretrained(base_model_2, "./adapter-task-2")
merged_2 = model_2.merge_and_unload()
merged_2.save_pretrained("./llama2-7b-task-2-merged")
Quantization After Merge
If you merged from a full-precision (float32) model, you can quantize to reduce file size:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
merged_dir = "./llama2-7b-customer-support-merged"
# Option A: quantize to 4-bit at load time with bitsandbytes (requires a CUDA GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    merged_dir,
    quantization_config=bnb_config,
    device_map="auto",
)
# Note: whether bitsandbytes 4-bit weights can be saved depends on your library versions;
# for a persistent quantized checkpoint, consider GPTQ or other methods
# Option B: for production, consider saving the merged model in float16 instead
merged_fp16 = AutoModelForCausalLM.from_pretrained(merged_dir, torch_dtype=torch.float16)
merged_fp16.save_pretrained("./llama2-7b-customer-support-merged-fp16")
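Important caveat: If you fine-tuned with QLoRA (4-bit quantized base + higher-precision adapter), merging is trickier. The base is stored in 4-bit while the adapter is in 16-bit or 32-bit floating point, so you must dequantize the base, merge, and then re-quantize if you want a 4-bit model again. Re-quantization rounds the merged weights onto a grid the adapter was not trained against, which may degrade quality slightly. For QLoRA models, keeping adapters separate is usually preferable.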
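A toy experiment can show where the re-quantization loss comes from. The sketch below uses a deliberately simplified symmetric 4-bit rounding scheme as a stand-in for real formats such as NF4; the function name, matrix sizes, and scales are all assumptions for illustration, so the numbers say nothing about any real model.

```python
import numpy as np


def fake_quantize_4bit(w: np.ndarray) -> np.ndarray:
    """Round to 15 evenly spaced levels (symmetric absmax) and map back to floats.

    A simplified stand-in for real 4-bit schemes such as NF4.
    """
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale


rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8

W = rng.normal(scale=0.02, size=(d, d))
U = rng.normal(scale=0.02, size=(d, r))
V = rng.normal(scale=0.02, size=(d, r))
delta = (alpha / r) * (U @ V.T)

W_q = fake_quantize_4bit(W)                # the quantized base the adapter was trained against
target = W_q + delta                       # effective weight with the adapter kept separate
requantized = fake_quantize_4bit(target)   # merged, then rounded back to 4-bit

step = np.abs(target).max() / 7.0          # spacing of the 4-bit grid used for re-quantization
err_requant = np.abs(requantized - target).max()

print(f"largest adapter update:   {np.abs(delta).max():.2e}")
print(f"quantization step:        {step:.2e}")
print(f"max re-quantization error: {err_requant:.2e}")
```

When the adapter's update is comparable to or smaller than the quantization step, rounding can erase part of what fine-tuning learned, which is one way to read the recommendation to keep QLoRA adapters separate.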
Comparison: Merged vs. Separate
| Aspect | Merged | Separate |
|---|---|---|
| Inference speed | Slightly faster (no adapter computation per forward pass) | Slightly slower |
| File size (float16) | ~13 GB (7B) / ~140 GB (70B) | ~50 MB adapter + 13 GB base |
| PEFT dependency | No | Yes |
| Multi-adapter serving | Hard (need separate checkpoints) | Easy (one base + many adapters) |
| Quantization support (post-merge) | Requires re-quantization | Transparent with QLoRA |
| Deployment complexity | Lower (standard Hugging Face) | Higher (requires PEFT) |
Recommendation:
- Single-task production: Merge for simplicity and standard deployment.
- Multi-task or research: Keep separate for flexibility and modularity.
- QLoRA deployments: Keep separate unless quality loss is acceptable.
Merge Workflow for Production
Here's a complete pipeline from training to production deployment:
# 1. After training, save the LoRA adapter
# (This was done in Article 6)
# 2. Merge for production
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16
)
model_with_adapter = PeftModel.from_pretrained(
    base_model,
    "./llama2-7b-customer-support-adapter"
)
merged_model = model_with_adapter.merge_and_unload()
# 3. Ensure float16 for memory efficiency (a no-op here, since the base was loaded in float16)
merged_model = merged_model.to(torch.float16)
# 4. Save merged model together with the tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
merged_model.save_pretrained("./llama2-7b-customer-support-v1.0")
tokenizer.save_pretrained("./llama2-7b-customer-support-v1.0")
# 5. Deploy to production (see Article 10)
# Push to Hugging Face Hub, TorchServe, or vLLM
# 6. At inference:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"./llama2-7b-customer-support-v1.0",
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./llama2-7b-customer-support-v1.0")
# Standard inference, no PEFT
Checking Merge Success
Verify that adapters were properly merged:
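import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel
adapter_dir = "./llama2-7b-customer-support-adapter"
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16
)
# Snapshot one original weight first: merge_and_unload() updates the base model's
# weights in place, so base_model cannot serve as a reference after the merge
original_q_proj = base_model.model.layers[0].self_attn.q_proj.weight.detach().clone()
# Before merge: model has LoRA layers
model = PeftModel.from_pretrained(base_model, adapter_dir)
print("Before merge:")
print([name for name, _ in model.named_modules() if "lora" in name.lower()])
# Output: ['model.layers.0.self_attn.q_proj.lora_A.default', 'model.layers.0.self_attn.q_proj.lora_B.default', ...]
# After merge: LoRA layers removed
merged_model = model.merge_and_unload()
print("\nAfter merge:")
print([name for name, _ in merged_model.named_modules() if "lora" in name.lower()])
# Output: [] (empty)
# Verify weight shapes are unchanged
merged_q_proj = merged_model.model.layers[0].self_attn.q_proj.weight.detach()
print("\nWeight shapes match original:")
print(f"Original q_proj: {original_q_proj.shape}")
print(f"Merged q_proj: {merged_q_proj.shape}")
# Both should be (4096, 4096)
# Verify weights differ (merge added adapter); compare in float32 against the snapshot
weights_identical = torch.allclose(
    original_q_proj.float(),
    merged_q_proj.float(),
    atol=1e-5
)
print(f"\nWeights changed after merge: {not weights_identical}")
# Should print: True (assuming the adapter targets q_proj and was actually trained)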
Key Takeaways
- Separate adapters: Small, composable, easy to version. Recommended for research and multi-task deployments.
- Merged models: Simple, standard, faster inference. Recommended for single-task production.
- model.merge_and_unload() permanently incorporates the LoRA correction and removes adapter layers.
- QLoRA models should typically keep adapters separate; merging requires re-quantization.
- After merging, models are standard Hugging Face checkpoints; no PEFT dependency needed.
Frequently Asked Questions
Can I unmerge an adapter after merging?
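No, not after merge_and_unload(): the base weights have been overwritten and the LoRA layers removed, so the merged checkpoint alone cannot be split back into base and adapter. If you need to switch adapters, keep the original base model and load different adapters into it separately. If you need a reversible merge, PEFT also provides merge_adapter() and unmerge_adapter(), which fold the correction in and out while keeping the LoRA layers in the model.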
Does merging change the model's accuracy?
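No, up to floating-point rounding. Merging is mathematically exact, so the merged model should produce the same outputs as the separate base + adapter model; in float16 or bfloat16 you may see tiny numerical differences. The exception is a quantized (QLoRA) base, where dequantizing and re-quantizing can change the outputs.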
What if I merged the wrong adapter by accident?
Start over: reload the original base model and merge the correct adapter. Keep the original base model as a backup.
Can I merge multiple adapters into one model?
Not in a single step: merge_and_unload() is typically used with one adapter per base model. To combine multiple adapters you need an extra step first, such as PEFT's add_weighted_adapter() for LoRA adapters or custom code. See Article 9 on composition for alternatives.
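As an illustration of what combining adapters means at the weight level, the sketch below adds a weighted sum of two low-rank corrections to one toy weight matrix. It is plain NumPy arithmetic under assumed shapes and hypothetical mixing weights, not PEFT's API, and it says nothing about whether the blended model keeps either task's quality.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4


def random_lora_delta() -> np.ndarray:
    """Build a scaled low-rank correction (alpha / r) * U @ V^T from random factors."""
    U = rng.normal(size=(d, r))
    V = rng.normal(size=(d, r))
    return (alpha / r) * (U @ V.T)


W_base = rng.normal(size=(d, d))
delta_task_1 = random_lora_delta()
delta_task_2 = random_lora_delta()

# Hypothetical mixing weights; in practice these would be tuned on validation data.
w1, w2 = 0.5, 0.5
W_combined = W_base + w1 * delta_task_1 + w2 * delta_task_2

# The combined correction is still low-rank, but its rank can grow up to 2 * r.
combined_rank = np.linalg.matrix_rank(W_combined - W_base)
print(f"rank of combined correction: {combined_rank} (at most {2 * r})")
```

Because the two corrections simply add, the order of merging does not matter, but each additional adapter can interfere with the others.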
Further Reading
- PEFT Merging Guide — Official documentation on adapter merging strategies.
- Model Merging Techniques — Research on combining multiple fine-tuned models.
- vLLM LoRA Integration — Production-scale LoRA serving with dynamic adapter loading.
- Hugging Face Model Hub — Browse merged and adapter models from the community.