Setting Up LoRA with Hugging Face Transformers
Setting up LoRA for your first fine-tuning job takes three core steps: load a pre-trained model, define LoRA hyperparameters, and inject adapters into target layers. The Hugging Face PEFT library wraps these steps in a small API, so once the base model weights are downloaded you can be training-ready within a few minutes. This guide covers the complete setup workflow, from installation through saving and reloading adapters, with code examples for both CPU and GPU environments.
Installation and Requirements
Install the PEFT library alongside Transformers and PyTorch. PEFT (Parameter-Efficient Fine-Tuning) is the official Hugging Face library for LoRA, adapter, and prefix-tuning methods.
# Install PEFT (includes LoRA support); the quotes stop the shell from treating >= as a redirect
pip install "peft>=0.4.0"
# Install Transformers (if not already installed)
pip install "transformers>=4.31.0"
# Install torch (if not already installed; this example uses GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Verify installation:
import peft
import transformers
import torch
print(f"PEFT version: {peft.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"GPU available: {torch.cuda.is_available()}")
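The printed versions still have to be compared against the minimums from the install commands. A minimal sketch of that comparison, using only the standard library, is below. The helper names (`parse_version`, `meets_minimum`, `check_environment`) are made up for this illustration, and the parsing is deliberately simple: it looks only at the leading numeric components, so a pre-release such as `0.4.0rc1` is treated like `0.4.0`.

```python
from importlib import metadata

# Minimum versions taken from the pip commands in this guide.
MINIMUM_VERSIONS = {"peft": "0.4.0", "transformers": "4.31.0"}


def parse_version(text):
    """Return the leading numeric components, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Compare numerically, so that '0.10.0' counts as newer than '0.4.0'."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUM_VERSIONS):
    """Map each package to 'ok', 'missing', or 'too old (<version>)'."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        ok = meets_minimum(installed, minimum)
        report[package] = "ok" if ok else f"too old ({installed})"
    return report


print(check_environment())
```

Comparing version strings directly would rank `0.10.0` below `0.4.0`, which is why the sketch converts them to integer tuples first.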
Step 1: Load a Pre-trained Model
Choose a base model from Hugging Face Hub. For examples in this series, we'll use Llama 2 7B (open-source, widely used). You'll need to accept the model license on Hugging Face first.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Model ID from Hugging Face Hub
model_id = "meta-llama/Llama-2-7b-hf"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # For batch processing
# Load model (float16 for memory efficiency)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"  # Automatically places model on GPU if available
)
print(f"Model loaded: {model.config.model_type}")
print(f"Total parameters: {model.num_parameters():,}")
Key options:
- torch_dtype=torch.float16: Use 16-bit precision to halve weight memory (about 28 GB → 14 GB for a 7B model).
- device_map="auto": Automatically splits the model across the available GPUs, spilling over to CPU if it does not fit.
- token="hf_...": Pass your Hugging Face API token if the model requires authentication.
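The memory figures follow from multiplying the parameter count by the bytes each parameter occupies. The sketch below works through that arithmetic; the function name `weight_memory_gb` is hypothetical, and the estimate covers weights only (activations, gradients, and optimizer state come on top during training).

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

# Parameter count of the Llama 2 7B base model.
LLAMA2_7B_PARAMS = 6_738_415_616


def weight_memory_gb(num_params, dtype):
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16", "int8"):
    print(f"{dtype:>8}: {weight_memory_gb(LLAMA2_7B_PARAMS, dtype):.1f} GB")
```

With the exact parameter count the float32 and float16 figures come out slightly below the rounded 28 GB and 14 GB, which assume an even 7 billion parameters.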
Step 2: Define LoRA Configuration
A LoraConfig object specifies which layers get adapters and their hyperparameters:
from peft import LoraConfig
lora_config = LoraConfig(
    r=16,  # Rank of the low-rank matrices
    lora_alpha=32,  # Scaling factor (typically 2 * r)
    target_modules=["q_proj", "v_proj"],  # Which attention projections to apply LoRA to
    lora_dropout=0.05,  # Dropout for regularization
    bias="none",  # Which bias parameters to train ("none" keeps every bias frozen)
    task_type="CAUSAL_LM"  # Task type (CAUSAL_LM for GPT-style models)
)
print(lora_config)
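The two numbers r and lora_alpha interact: in standard LoRA the adapter update is multiplied by lora_alpha / r before it is added to the frozen layer's output. The short sketch below (the helper `lora_scaling` is a made-up name) shows why tying alpha to the rank keeps that multiplier stable, while a fixed alpha shrinks it as the rank grows.

```python
def lora_scaling(lora_alpha, r):
    """Multiplier applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


print("rank | alpha = 2r | alpha fixed at 32")
for r in (8, 16, 64):
    tied = lora_scaling(2 * r, r)   # alpha follows the rank
    fixed = lora_scaling(32, r)     # alpha stays at 32
    print(f"{r:>4} | {tied:>10.2f} | {fixed:>17.2f}")
```

If you change the rank while experimenting, it is worth deciding deliberately whether lora_alpha should move with it, since the effective update size changes otherwise.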
Hyperparameter explanations:
- r (rank): 8–16 for instruction-tuning, 16–64 for domain adaptation. Start with 16.
- lora_alpha: Scaling factor. Set to 2 × r (so rank 16 → alpha 32). Controls the magnitude of the adapter update, which is scaled by lora_alpha / r.
- target_modules: Which weight matrices get adapters. Common choices:
  - Llama 2: ["q_proj", "v_proj", "k_proj", "o_proj"] (all attention) or just ["q_proj", "v_proj"].
  - GPT-2: ["c_attn"] (a single fused projection that computes queries, keys, and values for all heads).
  - BERT: ["query", "value"] (BERT naming convention).
- lora_dropout: Dropout applied to LoRA inputs (0.05–0.1 typical).
- bias: Which bias parameters are trained: "none" (no biases), "lora_only" (only the biases of layers that receive LoRA), or "all" (every bias in the model). "none" is most common.
- task_type: "CAUSAL_LM" (GPT-style language modeling), "SEQ_2_SEQ_LM" (encoder-decoder), "TOKEN_CLS" (token classification).
Step 3: Inject LoRA into the Model
Use get_peft_model to wrap your model and inject adapters:
from peft import get_peft_model
# Inject LoRA adapters
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
# Expected output for the 7B model with rank 16 (exact formatting varies by PEFT version):
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
You've now created a LoRA-enabled model where:
- The original weights are frozen.
- Only the LoRA adapter matrices (U and V) are trainable.
- Trainable parameters: ~8.4M (about 0.12% of 7B).
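The trainable-parameter count can be checked by hand: each adapted projection adds one r × d_in matrix and one d_out × r matrix. The sketch below does the arithmetic under assumed Llama 2 7B dimensions (hidden size 4096, 32 decoder layers, square attention projections); the function names are illustrative and not part of PEFT. For rank 16 on q_proj and v_proj it gives 8,388,608 parameters, and halving the rank to 8 gives 4,194,304.

```python
# Assumed Llama 2 7B dimensions; other models need their own values.
HIDDEN_SIZE = 4096            # q/k/v/o projections are HIDDEN_SIZE x HIDDEN_SIZE
NUM_LAYERS = 32               # number of decoder layers
BASE_PARAMS = 6_738_415_616   # parameter count of the base model


def lora_trainable_params(r, modules_per_layer,
                          d_in=HIDDEN_SIZE, d_out=HIDDEN_SIZE,
                          num_layers=NUM_LAYERS):
    """Adapter parameters: (r x d_in) + (d_out x r) per adapted module."""
    per_module = r * d_in + d_out * r
    return per_module * modules_per_layer * num_layers


def trainable_percent(trainable, base=BASE_PARAMS):
    """Share of trainable parameters once the adapters are added to the model."""
    return 100 * trainable / (base + trainable)


for r, modules in [(8, 2), (16, 2), (16, 4)]:
    count = lora_trainable_params(r, modules)
    print(f"r={r:<2} modules/layer={modules}: {count:>10,} "
          f"trainable ({trainable_percent(count):.4f}%)")
```

The last row shows that adapting all four attention projections doubles the adapter size relative to q_proj and v_proj alone.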
Step 4: Inspect the Model
Understand what you've created:
# Examine model structure
for name, module in model.named_modules():
    if "lora" in name.lower():
        print(f"{name}: {module}")
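# Sample output (abridged; other LoRA submodules such as the dropout layers are printed too):
# base_model.model.model.layers.0.self_attn.q_proj.lora_A.default: Linear(in_features=4096, out_features=16, bias=False)
# base_model.model.model.layers.0.self_attn.q_proj.lora_B.default: Linear(in_features=16, out_features=4096, bias=False)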
Each target module now has lora_A (the U matrix, projects to rank) and lora_B (the V matrix, projects back to original dimension).
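To make the two projections concrete, here is a toy forward pass in plain Python with tiny hand-picked matrices. It is a sketch of the standard LoRA computation, output = W x + (lora_alpha / r) · B (A x), not PEFT's actual implementation, and it leaves out dropout. It also illustrates a property of the usual initialization, where lora_B starts at zero: a freshly wrapped layer produces exactly the same output as the base layer.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen path W x plus the scaled low-rank path B (A x); dropout omitted."""
    base = matvec(W, x)     # frozen weight, shape (d_out, d_in)
    down = matvec(A, x)     # lora_A: projects d_in -> r
    up = matvec(B, down)    # lora_B: projects r -> d_out
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, up)]


W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]          # d_out=2, d_in=3
A = [[1.0, 1.0, 0.0]]           # r=1, d_in=3
x = [1.0, 2.0, 3.0]

B_zero = [[0.0], [0.0]]         # lora_B at initialization
B_trained = [[0.5], [-1.0]]     # stand-in for lora_B after some training

print(lora_forward(W, A, B_zero, x, lora_alpha=2, r=1))     # [7.0, -1.0], same as W x
print(lora_forward(W, A, B_trained, x, lora_alpha=2, r=1))  # [10.0, -7.0]
```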
Step 5: Prepare for Training
Enable gradient checkpointing to reduce memory, and verify the model is trainable:
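# Enable gradient checkpointing to save memory during training
model.gradient_checkpointing_enable()
# With a frozen base model, make the embedding outputs require gradients so that
# backpropagation can reach the adapters through the checkpointed blocks
model.enable_input_require_grads()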
# Verify only LoRA parameters have requires_grad=True
trainable_count = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen_count = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"Trainable parameters: {trainable_count:,}")
print(f"Frozen parameters: {frozen_count:,}")
print(f"Total parameters: {trainable_count + frozen_count:,}")
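Counting parameters confirms the totals but not which tensors are trainable. A small name-based check can catch a base weight that was left unfrozen by mistake. The helper below is a hypothetical sketch: it assumes adapter parameters carry "lora_" in their names, as in the module names printed in Step 4, and it runs on stand-in objects so it can be tried without loading a model.

```python
from types import SimpleNamespace


def unexpected_trainables(named_parameters, marker="lora_"):
    """Return names of trainable parameters whose name lacks the LoRA marker."""
    return [
        name
        for name, param in named_parameters
        if param.requires_grad and marker not in name
    ]


# Stand-in parameters so the sketch runs without a model.
# With a real model, pass model.named_parameters() instead.
fake_parameters = [
    ("layers.0.self_attn.q_proj.weight", SimpleNamespace(requires_grad=False)),
    ("layers.0.self_attn.q_proj.lora_A.default.weight", SimpleNamespace(requires_grad=True)),
    ("layers.0.self_attn.q_proj.lora_B.default.weight", SimpleNamespace(requires_grad=True)),
    ("lm_head.weight", SimpleNamespace(requires_grad=True)),  # deliberately left unfrozen
]

print(unexpected_trainables(fake_parameters))  # ['lm_head.weight']
```

An empty list is the expected result for the configuration in this guide. If you train biases (bias="lora_only" or "all") or list extra modules to train in full, some non-LoRA names are expected and the check needs adjusting.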
Step 6: Save and Load Adapters
After training, save only the adapters (not the base model):
# Save adapter to local directory
output_dir = "./llama2-7b-custom-adapter"
model.save_pretrained(output_dir)
# This saves only the LoRA weights (roughly 17 MB in float16 or 34 MB in float32
# for the rank-16 q_proj/v_proj adapter above)
# File structure:
# llama2-7b-custom-adapter/
# ├─ adapter_config.json (hyperparameter metadata)
# ├─ adapter_model.safetensors (serialized LoRA matrices; adapter_model.bin in older PEFT versions)
# └─ README.md (documentation)
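Because adapter_config.json is plain JSON, a saved adapter can be inspected without loading any model. The sketch below reads the config and lists file sizes; `summarize_adapter` is a hypothetical helper, and the keys it reads (`base_model_name_or_path`, `r`, `lora_alpha`, `target_modules`) are the ones PEFT writes for LoRA adapters to the best of my knowledge, so check them against your own file. The demo runs on a stand-in directory rather than a real adapter.

```python
import json
import tempfile
from pathlib import Path


def summarize_adapter(adapter_dir):
    """Collect key hyperparameters and file sizes from a saved adapter directory."""
    adapter_dir = Path(adapter_dir)
    config = json.loads((adapter_dir / "adapter_config.json").read_text())
    files = {
        path.name: path.stat().st_size
        for path in sorted(adapter_dir.iterdir())
        if path.is_file()
    }
    return {
        "base_model": config.get("base_model_name_or_path"),
        "r": config.get("r"),
        "lora_alpha": config.get("lora_alpha"),
        "target_modules": sorted(config.get("target_modules") or []),
        "files": files,
        "total_bytes": sum(files.values()),
    }


# Demo on a stand-in directory; point the function at your real output_dir instead.
with tempfile.TemporaryDirectory() as tmp:
    stand_in_config = {
        "base_model_name_or_path": "meta-llama/Llama-2-7b-hf",
        "r": 16,
        "lora_alpha": 32,
        "target_modules": ["v_proj", "q_proj"],
    }
    (Path(tmp) / "adapter_config.json").write_text(json.dumps(stand_in_config))
    summary = summarize_adapter(tmp)

print(summary["base_model"], summary["r"], summary["target_modules"])
```

Checking the recorded base model name before loading helps avoid pairing an adapter with the wrong base checkpoint.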
To load the adapter later:
from peft import PeftModel
# Start with base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
# Load the adapter on top of the base model (the weights stay separate;
# call merge_and_unload() afterwards if you want them merged)
model_with_adapter = PeftModel.from_pretrained(
    base_model,
    "./llama2-7b-custom-adapter"
)
print(model_with_adapter)
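A loaded adapter runs as a separate low-rank path next to each frozen weight. Merging folds that path into the weight itself, W_merged = W + (lora_alpha / r) · B A, so inference needs no extra matrix multiplications; PEFT provides merge_and_unload() for this on LoRA models. The toy example below is a plain-Python sketch of the arithmetic only, with made-up matrices, and shows that both routes give the same output.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*right)]
        for row in left
    ]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def merge_lora(W, A, B, lora_alpha, r):
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight
A = [[1.0, -1.0]]              # lora_A, shape (r=1, d_in=2)
B = [[2.0], [0.5]]             # lora_B, shape (d_out=2, r=1)
x = [1.0, 2.0]

# Route 1: base path plus separate adapter path.
adapter_out = [
    base + 2.0 * up            # scale = lora_alpha / r = 2 / 1
    for base, up in zip(matvec(W, x), matvec(B, matvec(A, x)))
]

# Route 2: a single multiplication with the merged weight.
W_merged = merge_lora(W, A, B, lora_alpha=2, r=1)
merged_out = matvec(W_merged, x)

print(W_merged)                 # [[5.0, -2.0], [4.0, 3.0]]
print(adapter_out, merged_out)  # [1.0, 10.0] [1.0, 10.0]
```

Keeping the adapter separate lets you swap or disable it later; merging is mainly useful for deployment.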
Complete Setup Example
Here's a complete, runnable example that downloads a model, injects LoRA, and prepares it for training:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
import torch
# 1. Load base model and tokenizer
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
# 3. Inject LoRA
model = get_peft_model(model, lora_config)
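# 4. Enable gradient checkpointing
model.gradient_checkpointing_enable()
model.enable_input_require_grads()  # lets gradients reach the adapters when the base model is frozen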
# 5. Verify setup
model.print_trainable_parameters()
# 6. Save for later
model.save_pretrained("./llama2-7b-lora")
print("Setup complete! Model ready for training.")
Choosing Target Modules by Model Family
Different model architectures use different layer names. Here's a reference:
| Model | Target Modules |
|---|---|
| Llama 2 / Llama 3 | ["q_proj", "v_proj"] or ["q_proj", "v_proj", "k_proj", "o_proj"] |
| Mistral 7B | ["q_proj", "v_proj"] |
| Falcon | ["query_key_value"] (single projection) |
| GPT-2 | ["c_attn"] |
| BERT | ["query", "value"] |
| T5 (encoder-decoder) | ["q", "v"] |
To find the correct names for your model, inspect the architecture:
for name, _ in model.named_modules():
    if "linear" in name.lower() or "proj" in name.lower():
        print(name)
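Filtering by name misses layers whose names contain neither "linear" nor "proj", such as c_attn in GPT-2 or query in BERT. An alternative sketch is to look at the modules themselves: collect leaf modules that hold a 2-D weight and group them by the last part of their name, which is the part target_modules is matched against. The helper `candidate_target_names` is hypothetical, and the demo uses stand-in objects so it runs without a model; with a real model you would pass the base model's named_modules() before wrapping it with PEFT.

```python
from collections import Counter
from types import SimpleNamespace


def candidate_target_names(named_modules):
    """Count leaf modules with a 2-D weight, keyed by the last part of their name."""
    counts = Counter()
    for name, module in named_modules:
        is_leaf = not list(module.children())
        weight = getattr(module, "weight", None)
        if is_leaf and getattr(weight, "ndim", 0) == 2:
            counts[name.rsplit(".", 1)[-1]] += 1
    return dict(counts)


def fake_module(weight_ndim=None, children=()):
    """Stand-in exposing only what the helper reads: .weight.ndim and .children()."""
    weight = SimpleNamespace(ndim=weight_ndim) if weight_ndim else None
    return SimpleNamespace(weight=weight, children=lambda: iter(children))


fake_named_modules = [
    ("", fake_module(children=["model"])),
    ("model.layers.0.self_attn", fake_module(children=["q_proj", "v_proj"])),
    ("model.layers.0.self_attn.q_proj", fake_module(weight_ndim=2)),
    ("model.layers.0.self_attn.v_proj", fake_module(weight_ndim=2)),
    ("model.layers.0.input_layernorm", fake_module(weight_ndim=1)),
    ("model.layers.1.self_attn.q_proj", fake_module(weight_ndim=2)),
    ("model.layers.1.self_attn.v_proj", fake_module(weight_ndim=2)),
]

print(candidate_target_names(fake_named_modules))  # {'q_proj': 2, 'v_proj': 2}
```

On a real model the result also includes embedding layers and the output head, since they hold 2-D weights too, so treat the list as candidates and not as a recommendation.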
Key Takeaways
- Install PEFT and Transformers, then load a pre-trained model using AutoModelForCausalLM.
- Define a LoraConfig with rank, alpha, target modules, and dropout.
- Inject adapters with get_peft_model(model, config) to add trainable LoRA layers.
- Enable gradient checkpointing to reduce memory usage during training.
- Save adapters separately from the base model using model.save_pretrained().
- Load adapters later on top of a freshly loaded base model using PeftModel.from_pretrained().
Frequently Asked Questions
What if I get an out-of-memory error during setup?
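Load the model in 8-bit precision (quantization_config=BitsAndBytesConfig(load_in_8bit=True), which requires bitsandbytes), or enable 4-bit quantization via QLoRA (Article 3). Smaller batch sizes and gradient checkpointing reduce memory during training, but they do not help if the model fails to load in the first place.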
Can I apply LoRA to embedding layers?
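LoRA is usually applied to attention and feed-forward projections. Embedding layers are rarely adapted because they hold a small share of the weights (about 131M parameters for Llama 2 7B's 32,000-token vocabulary, roughly 2% of the model). If you do need it, you can add the embedding module (embed_tokens in Llama 2) to target_modules to give it a LoRA adapter, or list it in modules_to_save to train and save it in full, but both are uncommon.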
How do I choose between ["q_proj", "v_proj"] and all four attention projections?
["q_proj", "v_proj"] is the standard starting point and usually gets close to the quality of adapting all four projections. Using all four (["q_proj", "k_proj", "v_proj", "o_proj"]) doubles the number of adapter parameters for Llama 2 7B but rarely improves downstream performance noticeably. Start with two; expand if validation metrics plateau.
What does bias="lora_only" do?
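Makes the bias parameters of the LoRA-wrapped layers trainable while the base weight matrices stay frozen. This adds very few parameters and can improve expressiveness, but it has no effect on layers without a bias term, such as the attention projections in Llama 2. Most practitioners use "none" for maximum efficiency.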
Further Reading
- Hugging Face PEFT Library — Full API documentation and examples.
- PEFT GitHub Repository — Source code and additional adapters (prefix-tuning, prompt-tuning).
- Transformers Documentation: Load Models — Detailed guide to loading and configuring models.
- Llama 2 Model Card — Model-specific configuration and requirements.