
Setting Up LoRA with Hugging Face Transformers

Setting up LoRA for your first fine-tuning job takes three core steps: load a pre-trained model, define LoRA hyperparameters, and inject adapters into target layers. The Hugging Face PEFT library wraps this in a small API, so once the base model is downloaded the setup itself is only a few lines of code. This guide covers the complete workflow, from installation through inspecting, saving, and reloading adapters, with code examples for both CPU and GPU environments.

Installation and Requirements

Install the PEFT library alongside Transformers and PyTorch. PEFT (Parameter-Efficient Fine-Tuning) is the official Hugging Face library for LoRA, adapter, and prefix-tuning methods.

# Install PEFT (includes LoRA support)
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "peft>=0.4.0"

# Install Transformers (if not already installed)
pip install "transformers>=4.31.0"

# Install torch (if not already installed; this example uses GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Verify installation:

import peft
import transformers
import torch

print(f"PEFT version: {peft.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"GPU available: {torch.cuda.is_available()}")

Step 1: Load a Pre-trained Model

Choose a base model from the Hugging Face Hub. For examples in this series, we'll use Llama 2 7B (openly available weights, widely used). You'll need to accept the model license on Hugging Face first.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Model ID from Hugging Face Hub
model_id = "meta-llama/Llama-2-7b-hf"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # For batch processing

# Load model (float16 for memory efficiency)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"  # Automatically places model on GPU if available
)

print(f"Model loaded: {model.config.model_type}")
print(f"Total parameters: {model.num_parameters():,}")

Key options:

  • torch_dtype=torch.float16: Use 16-bit precision to halve weight memory (roughly 28 GB → 14 GB for a 7B model).
  • device_map="auto": Uses the Accelerate library to place the model on the available GPUs, spilling to CPU if it does not fit.
  • token="hf_...": Pass your Hugging Face API token if the model requires authentication.
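
The memory figures above follow from multiplying the parameter count by the bytes each parameter occupies. A minimal sketch of that arithmetic, assuming the weights dominate memory (activations, optimizer state, and framework overhead are ignored); the helper name weight_memory_gb is hypothetical:

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate memory needed for the weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Parameter count that PEFT reports for the Llama 2 7B base model.
llama2_7b_params = 6_738_415_616

for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: {weight_memory_gb(llama2_7b_params, dtype):.1f} GB")
```

Treat the result as a lower bound: training adds gradients, optimizer state, and activations on top of the weights.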

Step 2: Define LoRA Configuration

A LoraConfig object specifies which layers get adapters and their hyperparameters:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                 # Rank of the low-rank matrices
    lora_alpha=32,                        # Scaling factor (typically 2 * r)
    target_modules=["q_proj", "v_proj"],  # Which attention projections to apply LoRA to
    lora_dropout=0.05,                    # Dropout for regularization
    bias="none",                          # Which bias terms to train ("none" keeps all biases frozen)
    task_type="CAUSAL_LM"                 # Task type (CAUSAL_LM for GPT-style models)
)

print(lora_config)

Hyperparameter explanations:

  • r (rank): 8–16 for instruction-tuning, 16–64 for domain adaptation. Start with 16.
  • lora_alpha: Scaling factor. Set to 2 × r (so rank 16 → alpha 32). Controls the magnitude of the adapter update.
  • target_modules: Which weight matrices get adapters. Common choices:
    • Llama 2: ["q_proj", "v_proj", "k_proj", "o_proj"] (all attention) or just ["q_proj", "v_proj"].
    • GPT-2: ["c_attn"] (a single fused query/key/value projection; GPT-2 stores it as a Conv1D layer, so also set fan_in_fan_out=True).
    • BERT: ["query", "value"] (BERT naming convention).
  • lora_dropout: Dropout applied to LoRA inputs (0.05–0.1 typical).
  • bias: Which bias terms are trained: "none" (no biases), "lora_only" (only the biases of the modules that receive LoRA), or "all" (every bias in the model). "none" is most common.
  • task_type: "CAUSAL_LM" (GPT-style language modeling), "SEQ_2_SEQ_LM" (encoder-decoder), "TOKEN_CLS" (token classification).
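
To make the roles of r and lora_alpha concrete, here is a minimal pure-Python sketch of the LoRA forward pass, h = W x + (lora_alpha / r) · B (A x). It uses toy 2×2 matrices and hypothetical helper names (matvec, lora_forward); it is an illustration of the arithmetic, not PEFT's implementation, and it omits dropout.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    scaling = lora_alpha / r           # e.g. alpha 32 with rank 16 gives 2.0
    base = matvec(W, x)                # frozen path: W x
    update = matvec(B, matvec(A, x))   # adapter path: B (A x)
    return [b + scaling * u for b, u in zip(base, update)]

# Toy example: d_in = d_out = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity here)
A = [[1.0, 1.0]]               # r x d_in, projects down to the rank
B_init = [[0.0], [0.0]]        # d_out x r, zero at initialization
B_trained = [[0.5], [0.0]]     # made-up values standing in for a trained B
x = [1.0, 2.0]

print(lora_forward(W, A, B_init, x, lora_alpha=2, r=1))     # [1.0, 2.0]
print(lora_forward(W, A, B_trained, x, lora_alpha=2, r=1))  # [4.0, 2.0]
```

Because B starts at zero, a freshly wrapped model produces the same outputs as the base model. At a fixed rank, doubling lora_alpha doubles the adapter's contribution.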

Step 3: Inject LoRA into the Model

Use get_peft_model to wrap your model and inject adapters:

from peft import get_peft_model

# Inject LoRA adapters
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
# Expected output for Llama 2 7B with rank 16 on q_proj and v_proj
# (exact formatting varies by PEFT version):
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243

You've now created a LoRA-enabled model where:

  • The original weights are frozen.
  • Only the LoRA adapter matrices (U and V, named lora_A and lora_B in PEFT) are trainable.
  • Trainable parameters: ~8.4M (about 0.12% of 7B).
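
The trainable-parameter count can be checked by hand: each adapted matrix gains an r × d_in matrix and a d_out × r matrix. A short sketch, assuming the Llama 2 7B shapes (hidden size 4096, 32 decoder layers, square attention projections); the function name lora_param_count is hypothetical:

```python
def lora_param_count(r, d_in, d_out, modules_per_layer, num_layers):
    """Each adapted matrix adds A (r x d_in) and B (d_out x r)."""
    per_module = r * (d_in + d_out)
    return per_module * modules_per_layer * num_layers

# Rank 16 on q_proj and v_proj, then on all four attention projections.
qv_only = lora_param_count(r=16, d_in=4096, d_out=4096, modules_per_layer=2, num_layers=32)
all_four = lora_param_count(r=16, d_in=4096, d_out=4096, modules_per_layer=4, num_layers=32)
# Rank 8 on q_proj and v_proj, for comparison.
qv_rank8 = lora_param_count(r=8, d_in=4096, d_out=4096, modules_per_layer=2, num_layers=32)

print(f"r=16, q_proj + v_proj: {qv_only:,}")    # 8,388,608
print(f"r=16, all four:        {all_four:,}")   # 16,777,216
print(f"r=8,  q_proj + v_proj: {qv_rank8:,}")   # 4,194,304
```

The count scales linearly with both the rank and the number of targeted modules, so halving r or dropping half the target modules halves the adapter size.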

Step 4: Inspect the Model

Understand what you've created:

# Examine model structure
for name, module in model.named_modules():
    if "lora" in name.lower():
        print(f"{name}: {module}")

# Sample output (excerpt):
# base_model.model.model.layers.0.self_attn.q_proj.lora_A.default: Linear(in_features=4096, out_features=16, bias=False)
# base_model.model.model.layers.0.self_attn.q_proj.lora_B.default: Linear(in_features=16, out_features=4096, bias=False)

Each target module now has lora_A (the U matrix, projects to rank) and lora_B (the V matrix, projects back to original dimension).

Step 5: Prepare for Training

Enable gradient checkpointing to reduce memory, and verify the model is trainable:

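# Enable gradient checkpointing to save memory during training
model.gradient_checkpointing_enable()
# With a frozen base model, make the embedding outputs require gradients
# so the checkpointed layers can backpropagate into the adapters
model.enable_input_require_grads()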

# Verify only LoRA parameters have requires_grad=True
trainable_count = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen_count = sum(p.numel() for p in model.parameters() if not p.requires_grad)

print(f"Trainable parameters: {trainable_count:,}")
print(f"Frozen parameters: {frozen_count:,}")
print(f"Total parameters: {trainable_count + frozen_count:,}")

Step 6: Save and Load Adapters

After training, save only the adapters (not the base model):

# Save adapter to local directory
output_dir = "./llama2-7b-custom-adapter"
model.save_pretrained(output_dir)

# This saves only the LoRA weights (tens of MB for a 7B model with rank 16 on q_proj and v_proj)
# File structure:
# llama2-7b-custom-adapter/
# ├─ adapter_config.json (hyperparameter metadata)
# ├─ adapter_model.safetensors (serialized LoRA matrices; adapter_model.bin in older PEFT versions)
# └─ README.md (documentation)
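
A saved adapter can be inspected without loading any model, because the hyperparameters live in a plain JSON file. The sketch below reads adapter_config.json and lists file sizes using only the standard library; describe_adapter is a hypothetical helper, and the fields it reads (peft_type, r, lora_alpha, target_modules) are assumed to be present, as they are in configs written by LoraConfig.

```python
import json
from pathlib import Path

def describe_adapter(adapter_dir):
    """Summarize a saved adapter directory: key config fields and file sizes."""
    adapter_dir = Path(adapter_dir)
    config = json.loads((adapter_dir / "adapter_config.json").read_text())
    # .get() returns None for any field a given PEFT version does not write.
    summary = {key: config.get(key)
               for key in ("peft_type", "r", "lora_alpha", "target_modules")}
    summary["files_mb"] = {
        path.name: round(path.stat().st_size / 1e6, 2)
        for path in sorted(adapter_dir.iterdir())
        if path.is_file()
    }
    return summary

# Usage (assumes the directory from the save step exists):
# print(describe_adapter("./llama2-7b-custom-adapter"))
```

This is a quick way to confirm which rank and target modules an adapter was trained with before attaching it to a base model.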

To load the adapter later:

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Start with base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the adapter on top of the base model (the weights stay separate, not merged)
model_with_adapter = PeftModel.from_pretrained(
    base_model,
    "./llama2-7b-custom-adapter"
)

print(model_with_adapter)
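
PeftModel.from_pretrained keeps the adapter as separate layers on top of the frozen base weights. If you want a single set of weights for inference, PEFT's LoRA models provide merge_and_unload(), which folds the update into the base matrices. The arithmetic behind merging can be sketched in pure Python with toy matrices; the helper names matmul and merge_lora are hypothetical, and this is not PEFT's code:

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A), the merged weight matrix."""
    scaling = lora_alpha / r
    delta = matmul(B, A)  # d_out x d_in, same shape as W
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0, 1.0]]               # r x d_in with r = 1
B = [[0.5], [0.0]]             # d_out x r, made-up "trained" values

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)  # [[2.0, 1.0], [0.0, 1.0]]
```

After merging, the adapter adds no extra computation at inference time. Keep the unmerged adapter files if you plan to continue training or to swap adapters on the same base model.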

Complete Setup Example

Here's a complete, runnable example that downloads a model, injects LoRA, and prepares it for training:

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
import torch

# 1. Load base model and tokenizer
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# 3. Inject LoRA
model = get_peft_model(model, lora_config)

# 4. Enable gradient checkpointing (and let gradients flow through the frozen embeddings)
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

# 5. Verify setup
model.print_trainable_parameters()

# 6. Save for later
model.save_pretrained("./llama2-7b-lora")
print("Setup complete! Model ready for training.")

Choosing Target Modules by Model Family

Different model architectures use different layer names. Here's a reference:

| Model | Target Modules |
| --- | --- |
| Llama 2 / Llama 3 | ["q_proj", "v_proj"] or ["q_proj", "v_proj", "k_proj", "o_proj"] |
| Mistral 7B | ["q_proj", "v_proj"] |
| Falcon | ["query_key_value"] (single projection) |
| GPT-2 | ["c_attn"] |
| BERT | ["query", "value"] |
| T5 (encoder-decoder) | ["q", "v"] |

To find the correct names for your model, inspect the architecture:

# Print the layers LoRA can target; the last component of each name goes in target_modules
for name, module in model.named_modules():
    if type(module).__name__ in ("Linear", "Conv1D"):
        print(name)
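
PEFT matches each entry in target_modules against the end of a module's dotted name, so the short suffix (for example "q_proj") is what belongs in the list. The sketch below reduces a list of full module names to candidate suffixes. The names are hypothetical examples in the style that named_modules() prints for a Llama model, and candidate_target_modules is an illustrative helper, not part of PEFT.

```python
from collections import Counter

def candidate_target_modules(module_names):
    """Count the final component of each dotted module name."""
    leaves = (name.rsplit(".", 1)[-1] for name in module_names if name)
    return Counter(leaves)

# Hypothetical module names for illustration.
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.self_attn.v_proj",
    "model.layers.1.mlp.down_proj",
]

counts = candidate_target_modules(names)
for suffix, count in counts.most_common():
    print(f"{suffix}: appears in {count} module(s)")
```

Suffixes that repeat once per layer are the usual candidates for target_modules.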

Key Takeaways

  • Install PEFT and Transformers, then load a pre-trained model using AutoModelForCausalLM.
  • Define a LoraConfig with rank, alpha, target modules, and dropout.
  • Inject adapters with get_peft_model(model, config) to add trainable LoRA layers.
  • Enable gradient checkpointing to reduce memory usage during training.
  • Save adapters separately from the base model using model.save_pretrained().
  • Load adapters later on top of the base model using PeftModel.from_pretrained(). The base model is still required, but each adapter is only a small separate set of files.

Frequently Asked Questions

What if I get an out-of-memory error during setup?

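Load the model in 8-bit or 4-bit precision instead of float16 (both require bitsandbytes; 4-bit loading is the basis of QLoRA, covered in Article 3). Smaller batch sizes and gradient checkpointing reduce memory during training, but they do not help if the model fails to fit while loading.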

Can I apply LoRA to embedding layers?

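By default, LoRA targets attention and feed-forward projections. Embedding layers are rarely adapted, even though they are not small (about 131M parameters for Llama 2 7B's 32,000-token vocabulary at hidden size 4096). You can apply LoRA to embeddings by adding the embedding module's name (embed_tokens in Llama) to target_modules, or train them in full by listing them in modules_to_save, which is mainly useful when you add new tokens to the vocabulary. Both are uncommon.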

How do I choose between ["q_proj", "v_proj"] and all four attention projections?

["q_proj", "v_proj"] is the standard starting point and is usually close in quality to adapting all four projections. Using all four (["q_proj", "k_proj", "v_proj", "o_proj"]) doubles the adapter parameter count in Llama 2 7B, since all four projections are 4096 × 4096, and often improves downstream performance only modestly. Start with two; expand if validation metrics plateau.

What does bias="lora_only" do?

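It makes the existing bias terms of the LoRA-adapted modules trainable; it does not add new biases, and the rest of the base model stays frozen. The extra parameter count is small, and for models whose target projections have no bias (such as Llama 2's attention projections) it has no effect. Once biases are trained, disabling the adapter no longer reproduces the base model's outputs exactly. Most practitioners use "none" for maximum efficiency.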

Further Reading