
Merging LoRA Adapters into Base Models

After fine-tuning, you face a deployment choice: keep adapters separate (small, composable) or merge them into the base model (simpler, faster inference). Merging combines the adapter matrices U @ V^T with the original weights W, producing a single unified checkpoint that runs without LoRA overhead. This guide explains both approaches, when to use each, and practical merge strategies for different deployment scenarios (single-task inference, multi-adapter serving, quantized models).

Merging Fundamentals

Recall from earlier articles: during fine-tuning, LoRA adds a low-rank correction to specific weight matrices:

W_effective = W_base + (alpha / r) * U @ V^T

Merging permanently incorporates this correction:

W_merged = W_base + (alpha / r) * U @ V^T

After merging, you can discard the adapter files (U and V) and keep only W_merged. At inference time the model uses W_merged directly, which removes the extra low-rank matrix multiplications that the adapter path performs on every forward pass. The speedup is slight (typically <5% in practice, because the adapter computation is small and usually handled by efficient fused kernels).
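To make the equivalence concrete, here is a minimal NumPy sketch for a single layer. The dimensions, the random matrices, and the names d_out, d_in, r, and alpha are illustrative assumptions, not values from a real checkpoint. It computes the layer output once with the adapter kept separate and once with the correction folded into the weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; real layers are much larger (e.g. 4096 x 4096).
d_out, d_in, r, alpha = 8, 6, 2, 4.0

W_base = rng.normal(size=(d_out, d_in))  # frozen base weight
U = rng.normal(size=(d_out, r))          # low-rank factor, d_out x r
V = rng.normal(size=(d_in, r))           # low-rank factor, d_in x r
x = rng.normal(size=(d_in,))             # one input vector

scale = alpha / r

# Separate adapter: base matmul plus two small matmuls on every forward pass.
y_separate = W_base @ x + scale * (U @ (V.T @ x))

# Merged: fold the correction into the weight once, then use a single matmul.
W_merged = W_base + scale * (U @ V.T)
y_merged = W_merged @ x

print(np.allclose(y_separate, y_merged))
```

The merged weight has the same shape as the base weight, so nothing about the model architecture changes; only the values stored in W do.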

To keep the adapter separate, load the base model and attach the adapter at inference time:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load adapter on top of the base model (inference only)
model = PeftModel.from_pretrained(
    base_model,
    "./llama2-7b-customer-support-adapter",
    is_trainable=False
)

# Inference
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
input_text = "Instruction: Classify: I lost my password."
# Move the tokenized inputs to the same device as the model
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0]))

Advantages of keeping the adapter separate:

  • Small adapter files (10–500 MB vs. 13+ GB for the full model).
  • Easy to serve multiple adapters on the same base model (see Article 9).
  • Version control-friendly; adapters change independently of the base.
  • No re-quantization needed if using QLoRA.

Disadvantages:

  • Slight inference overhead (the low-rank correction is computed and added on every forward pass).
  • Model loading involves both the base model and the adapter.
  • Requires the PEFT library at inference time.
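The size gap quoted above follows from simple parameter counting. The sketch below assumes a hypothetical setup (a 7B Llama-style model with hidden size 4096 and 32 layers, adapting q_proj and v_proj at rank 16, stored in float16); your target modules and rank may differ, and the base parameter count is approximate.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: U is d_out x r, V is d_in x r."""
    return r * (d_in + d_out)

# Assumed configuration, for illustration only.
hidden, n_layers, adapted_per_layer, r = 4096, 32, 2, 16
bytes_per_param = 2  # float16

adapter_params = n_layers * adapted_per_layer * lora_param_count(hidden, hidden, r)
adapter_mb = adapter_params * bytes_per_param / 1e6

base_params = 6.74e9  # approximate parameter count of a 7B model
base_gb = base_params * bytes_per_param / 1e9

print(f"adapter: {adapter_params:,} params, about {adapter_mb:.1f} MB")
print(f"base:    about {base_gb:.1f} GB")
```

Adapter size grows linearly with the rank and with the number of adapted matrices, which is how larger configurations reach hundreds of megabytes.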

To merge instead, fold the adapter permanently into the base model:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load adapter
model = PeftModel.from_pretrained(
    base_model,
    "./llama2-7b-customer-support-adapter",
    is_trainable=False
)

# Merge adapters into base model
merged_model = model.merge_and_unload()

# Save merged model (now a standard Hugging Face model)
merged_model.save_pretrained("./llama2-7b-customer-support-merged")

# Save the tokenizer alongside it so the directory can be loaded on its own
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.save_pretrained("./llama2-7b-customer-support-merged")

# Verify: merged_model no longer has LoRA layers
print(merged_model)

The merge_and_unload() method:

  1. Computes W_merged = W_base + (alpha / r) * U @ V^T for each LoRA layer.
  2. Replaces the original weights with merged versions.
  3. Removes LoRA layers from the model (unloads adapters).
  4. Returns a standard Hugging Face model (no PEFT dependency).
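The four steps above can be sketched on a toy "model" made of named weight matrices. This is a simplified stand-in written for illustration, not PEFT's actual implementation; the function name merge_and_strip and the dictionary layout are hypothetical.

```python
import numpy as np

def merge_and_strip(layers: dict, alpha: float, r: int) -> dict:
    """Toy version of the merge steps.

    `layers` maps a layer name to a dict holding "W" and, for adapted
    layers, also "U" and "V". Returns a plain name -> weight mapping.
    """
    merged = {}
    for name, params in layers.items():
        W = params["W"]
        if "U" in params and "V" in params:
            # Steps 1-2: compute the merged weight and use it in place of W.
            W = W + (alpha / r) * (params["U"] @ params["V"].T)
        # Step 3: the adapter factors are simply not carried over.
        merged[name] = W
    # Step 4: the result holds ordinary weights only.
    return merged

rng = np.random.default_rng(0)
r, alpha = 2, 4.0
layers = {
    "q_proj": {"W": rng.normal(size=(4, 4)),
               "U": rng.normal(size=(4, r)),
               "V": rng.normal(size=(4, r))},
    "mlp": {"W": rng.normal(size=(4, 4))},  # no adapter on this layer
}

merged = merge_and_strip(layers, alpha, r)
print({name: w.shape for name, w in merged.items()})
```

One difference worth knowing: this sketch builds new arrays, whereas PEFT updates the wrapped base model's weights in place, so the in-memory base model is no longer the original after a merge.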

Advantages of merging:

  • Standard Hugging Face model; no PEFT dependency at inference.
  • Slightly faster inference (no extra adapter computation on each forward pass).
  • Easier deployment to production systems (TorchServe, vLLM, TensorRT).
  • One checkpoint to manage (no separate adapter files).

Disadvantages:

  • Large file size (about 13 GB for a 7B model and about 140 GB for a 70B model in float16; roughly double in float32).
  • Hard to maintain multiple task-specific variants (each needs its own full checkpoint).
  • If using QLoRA, re-quantizing after the merge can degrade quality.
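The cost of maintaining several task-specific variants is easy to estimate. The helper below is a hypothetical back-of-the-envelope calculation that assumes a 13 GB float16 base and 50 MB adapters; substitute your own sizes.

```python
def storage_gb(n_tasks: int, base_gb: float, adapter_gb: float) -> dict:
    """Disk space needed to keep n_tasks fine-tuned variants, in GB."""
    return {
        "merged": n_tasks * base_gb,                 # one full checkpoint per task
        "separate": base_gb + n_tasks * adapter_gb,  # one base plus small adapters
    }

# Assumed sizes, for illustration only.
for n in (1, 5, 20):
    print(n, storage_gb(n, base_gb=13.0, adapter_gb=0.05))
```

For a single task the two options cost about the same; the gap only opens up as the number of variants grows.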

Load Merged Model for Inference

Once merged and saved, load as a standard model:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load merged model (standard Hugging Face)
merged_model = AutoModelForCausalLM.from_pretrained(
    "./llama2-7b-customer-support-merged",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Requires that the tokenizer was saved into the merged model directory
tokenizer = AutoTokenizer.from_pretrained("./llama2-7b-customer-support-merged")

# Inference
input_text = "Instruction: Classify: I need to reset my password."
# Move the tokenized inputs to the same device as the model
inputs = tokenizer(input_text, return_tensors="pt").to(merged_model.device)

with torch.no_grad():
    outputs = merged_model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0]))

No PEFT library needed. This is the standard AutoModelForCausalLM.from_pretrained() workflow.

Partial Merging: Merge Specific Adapters

If you have multiple adapters trained on the same base model, merge each one into a freshly loaded copy of the base:

from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load first adapter
model = PeftModel.from_pretrained(base_model, "./adapter-task-1")

# Merge this adapter and unload it. The merge updates the in-memory base
# weights in place; the base checkpoint on disk is unchanged.
merged = model.merge_and_unload()

# merged is now a standard Hugging Face model; save it
merged.save_pretrained("./llama2-7b-task-1-merged")

# To switch to another adapter, reload the base from the original checkpoint
# (the in-memory copy above now contains adapter 1)
base_model_2 = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model_2 = PeftModel.from_pretrained(base_model_2, "./adapter-task-2")
merged_2 = model_2.merge_and_unload()
merged_2.save_pretrained("./llama2-7b-task-2-merged")

Quantization After Merge

If you merged from a full-precision (float32) model, you can reduce the footprint afterwards, either by quantizing at load time or by saving a float16 copy:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Option A: quantize to 4-bit on the fly when loading (bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "./llama2-7b-customer-support-merged",
    quantization_config=bnb_config,
    device_map="auto",
)
# Note: whether bitsandbytes 4-bit weights can be saved to disk depends on
# your transformers/bitsandbytes versions; for a portable quantized
# checkpoint you'd need GPTQ or other methods

# Option B: for production, consider saving a float16 copy instead
merged_fp16 = AutoModelForCausalLM.from_pretrained(
    "./llama2-7b-customer-support-merged",
    torch_dtype=torch.float16,
)
merged_fp16.save_pretrained("./llama2-7b-customer-support-merged-fp16")

Important caveat: If you fine-tuned with QLoRA (quantized base + float32 adapter), merging is trickier. The base is stored in 4-bit while the adapter is in float32, so you must dequantize the base, merge, and then re-quantize. The re-quantization step rounds the merged weights a second time, which may degrade quality slightly. For QLoRA models, keeping adapters separate is usually preferable.
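A toy experiment gives some intuition for why re-quantizing can hurt. The sketch below uses a deliberately simple uniform absmax rounding to 15 levels as a stand-in for 4-bit quantization; real QLoRA uses blockwise NF4, so the numbers are not representative, and the function quantize_absmax_4bit and the random matrices are assumptions made for illustration. Here the correction is much smaller than the spacing between quantization levels, so re-quantizing tends to round much of it away.

```python
import numpy as np

def quantize_absmax_4bit(w: np.ndarray) -> np.ndarray:
    """Round to 15 evenly spaced levels in [-absmax, absmax], then dequantize."""
    scale = np.abs(w).max() / 7.0
    return np.round(w / scale).clip(-7, 7) * scale

rng = np.random.default_rng(0)
W_base = rng.normal(size=(64, 64))
delta = 0.01 * rng.normal(size=(64, 64))  # small stand-in for (alpha / r) * U @ V^T

W_q = quantize_absmax_4bit(W_base)  # the quantized base the adapter was trained against

# Adapter kept separate: quantized base plus the exact correction.
W_separate = W_q + delta

# Merge, then re-quantize for deployment.
W_requant = quantize_absmax_4bit(W_q + delta)

err = float(np.abs(W_requant - W_separate).mean())
print(f"mean |delta|: {np.abs(delta).mean():.4f}")
print(f"mean error introduced by re-quantizing: {err:.4f}")
```

Comparing the two printed values shows how much of the adapter's contribution survives the second rounding in this simplified setting.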

Comparison: Merged vs. Separate

| Aspect | Merged | Separate |
|---|---|---|
| Inference speed | Slightly faster (no adapter overhead) | Slightly slower |
| File size (float16) | 13 GB (7B) / 140 GB (70B) | 50 MB adapter + 13 GB base |
| PEFT dependency | No | Yes |
| Multi-adapter serving | Hard (need separate checkpoints) | Easy (one base + many adapters) |
| Quantization support (post-merge) | Requires re-quantization | Transparent with QLoRA |
| Deployment complexity | Lower (standard Hugging Face) | Higher (requires PEFT) |

Recommendation:

  • Single-task production: Merge for simplicity and standard deployment.
  • Multi-task or research: Keep separate for flexibility and modularity.
  • QLoRA deployments: Keep separate unless quality loss is acceptable.

Merge Workflow for Production

Here's a complete pipeline from training to production deployment:

# 1. After training, save the LoRA adapter
# (This was done in Article 6)

# 2. Merge for production
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16
)

model_with_adapter = PeftModel.from_pretrained(
    base_model,
    "./llama2-7b-customer-support-adapter"
)

merged_model = model_with_adapter.merge_and_unload()

# 3. Ensure float16 for memory efficiency
# (a no-op here, since the base was already loaded in float16)
merged_model = merged_model.to(torch.float16)

# 4. Save merged model together with the base model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
merged_model.save_pretrained("./llama2-7b-customer-support-v1.0")
tokenizer.save_pretrained("./llama2-7b-customer-support-v1.0")

# 5. Deploy to production (see Article 10)
# Push to Hugging Face Hub, TorchServe, or vLLM

# 6. At inference:
model = AutoModelForCausalLM.from_pretrained(
    "./llama2-7b-customer-support-v1.0",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./llama2-7b-customer-support-v1.0")

# Standard inference, no PEFT

Checking Merge Success

Verify that adapters were properly merged:

import torch
from peft import PeftModel

# Assumes base_model and adapter_dir are defined as in the earlier examples.
# Snapshot one weight first: merge_and_unload() updates the base model's
# weights in place, so base_model cannot serve as the "before" reference later.
original_q_proj = base_model.model.layers[0].self_attn.q_proj.weight.detach().clone()

# Before merge: model has LoRA layers
model = PeftModel.from_pretrained(base_model, adapter_dir)
print("Before merge:")
print([name for name, _ in model.named_modules() if "lora" in name.lower()])
# Output: names such as '...self_attn.q_proj.lora_A.default', '...self_attn.q_proj.lora_B.default', ...

# After merge: LoRA layers removed
merged_model = model.merge_and_unload()
print("\nAfter merge:")
print([name for name, _ in merged_model.named_modules() if "lora" in name.lower()])
# Output: [] (empty)

# Verify weight shapes are unchanged
merged_q_proj = merged_model.model.layers[0].self_attn.q_proj.weight
print("\nWeight shapes match original:")
print(f"Original q_proj: {original_q_proj.shape}")
print(f"Merged q_proj: {merged_q_proj.shape}")
# Both should be (4096, 4096)

# Verify weights differ (merge added adapter)
weights_identical = torch.allclose(original_q_proj, merged_q_proj, atol=1e-5)
print(f"\nWeights changed after merge: {not weights_identical}")
# Should print: True (assuming q_proj is one of the adapter's target modules)

Key Takeaways

  • Separate adapters: Small, composable, easy to version. Recommended for research and multi-task deployments.
  • Merged models: Simple, standard, slightly faster inference. Recommended for single-task production.
  • model.merge_and_unload() permanently incorporates the LoRA correction and removes adapter layers.
  • QLoRA models should typically keep adapters separate; merging requires re-quantization.
  • After merging, models are standard Hugging Face checkpoints; no PEFT dependency needed.

Frequently Asked Questions

Can I unmerge an adapter after merging?

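Not once you have called merge_and_unload() and kept only the merged weights: the base weights have been overwritten and the adapter layers removed. If you need to switch adapters, keep the original base model and load different adapters into it separately. While an adapter is still loaded, PEFT's merge_adapter() and unmerge_adapter() can fold it in and out of the weights without discarding it.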

Does merging change the model's accuracy?

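Not in any meaningful way. Merging is mathematically exact: the merged weights compute the same function as base plus adapter. In practice the merged weights are rounded to the storage dtype (for example float16), so outputs match the separate base + adapter model up to small floating-point differences rather than bit for bit.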
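The size of that floating-point effect can be estimated with a small NumPy experiment. This is a sketch under stated assumptions: random matrices with made-up dimensions, weights stored in float16, and arithmetic carried out in float32. It isolates the rounding of the merged weight only; real inference also rounds activations, so actual differences will vary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 256, 8, 16.0

W = rng.normal(scale=0.02, size=(d, d)).astype(np.float16)
U = rng.normal(scale=0.02, size=(d, r)).astype(np.float16)
V = rng.normal(scale=0.02, size=(d, r)).astype(np.float16)
x = rng.normal(size=(d,)).astype(np.float32)

def f32(a: np.ndarray) -> np.ndarray:
    """Upcast stored float16 values for computation."""
    return a.astype(np.float32)

scale = alpha / r

# Separate path: float16 storage, arithmetic in float32.
y_separate = f32(W) @ x + scale * (f32(U) @ (f32(V).T @ x))

# Merged path: the folded weight is rounded back to float16 before use.
W_merged = (f32(W) + scale * (f32(U) @ f32(V).T)).astype(np.float16)
y_merged = f32(W_merged) @ x

max_diff = float(np.abs(y_separate - y_merged).max())
print(f"max |difference| in layer output: {max_diff:.2e}")
```

When validating a real merge, comparing outputs with a tolerance (rather than expecting exact equality) is therefore the more realistic check.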

What if I merged the wrong adapter by accident?

Start over: reload the original base model and merge the correct adapter. Keep the original base model as a backup.

Can I merge multiple adapters into one model?

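Not in a single step with the workflow shown here, where merge_and_unload() folds one loaded adapter into one base model. You can merge adapters one after another, but their corrections simply add up, and adapters trained independently may interfere with each other. Anything more controlled needs custom code. See Article 9 on composition for alternatives.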
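At the weight level, folding two adapters into the same base amounts to adding both corrections. The NumPy sketch below illustrates this with random stand-in matrices; lora_delta is a hypothetical helper, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2

W_base = rng.normal(size=(d, d))

def lora_delta(alpha: float) -> np.ndarray:
    """Random stand-in for one adapter's (alpha / r) * U @ V^T."""
    U = rng.normal(size=(d, r))
    V = rng.normal(size=(d, r))
    return (alpha / r) * (U @ V.T)

delta_task_1 = lora_delta(alpha=4.0)
delta_task_2 = lora_delta(alpha=4.0)

# Merging one adapter after the other...
W_seq = (W_base + delta_task_1) + delta_task_2
# ...gives the same weights as adding the summed correction in one go.
W_sum = W_base + (delta_task_1 + delta_task_2)

print(np.allclose(W_seq, W_sum))
```

Because each adapter was trained against the unmodified base, there is no guarantee that the summed weights perform well on either task, so evaluate any combined model before relying on it.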

Further Reading