Training LoRA Adapters on Consumer GPUs: Complete Workflow
Training a LoRA adapter involves preparing your dataset, setting up the Hugging Face Trainer, monitoring training progress, and iterating based on validation metrics. This guide walks through a complete workflow: loading data, configuring training hyperparameters, tracking metrics with Weights & Biases, handling GPU memory constraints, and saving the final adapter. By the end, you'll have a production-ready fine-tuned model checkpoint on your consumer GPU.
Prepare Your Dataset
LoRA training expects data in a standard format. For instruction-following tasks, the format is typically:
{"text": "Instruction: Classify this customer message.\n\nMessage: I can't log into my account.\n\nResponse: Issue"}
{"text": "Instruction: Classify this customer message.\n\nMessage: How do I change my password?\n\nResponse: Request"}
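If your raw data is not yet in this shape, the conversion is a few lines of Python. The sketch below assumes the source data is a list of (message, label) pairs; `format_example` and `write_jsonl` are hypothetical helper names, not part of any library.

```python
import json

# Same layout as the two sample records above.
TEMPLATE = (
    "Instruction: Classify this customer message.\n\n"
    "Message: {message}\n\n"
    "Response: {label}"
)

def format_example(message: str, label: str) -> dict:
    """Wrap one (message, label) pair in a single-field {"text": ...} record."""
    return {"text": TEMPLATE.format(message=message, label=label)}

def write_jsonl(records, path):
    """Write one JSON object per line, the layout load_dataset("json", ...) reads."""
    with open(path, "w", encoding="utf-8") as f:
        for message, label in records:
            f.write(json.dumps(format_example(message, label)) + "\n")

# Example call (hypothetical data):
# write_jsonl([("I can't log into my account.", "Issue")], "train.jsonl")
```

Keeping the template in one constant makes it easier to reuse the exact same wording when building prompts at inference time.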
Load the dataset using the Hugging Face datasets library:
from datasets import load_dataset
# Load from local JSON file
dataset = load_dataset("json", data_files="train.jsonl")
# Split into train and validation (90/10)
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
val_dataset = dataset["test"]
print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
# Inspect a sample
print(train_dataset[0])
Tokenize the Dataset
Prepare text for the model by tokenizing:
from transformers import AutoTokenizer
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
    """Tokenize and truncate texts."""
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        max_length=512, # Adjust based on your GPU and dataset
        padding="max_length"
    )
    tokenized["labels"] = tokenized["input_ids"].copy() # For language modeling
    return tokenized
# Tokenize all data (may take a few minutes for large datasets)
tokenized_train = train_dataset.map(
tokenize_function,
batched=True,
remove_columns=["text"]
)
tokenized_val = val_dataset.map(
tokenize_function,
batched=True,
remove_columns=["text"]
)
print(f"Tokenized training set: {len(tokenized_train)} examples")
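One caveat with the tokenization above: the pad token is the EOS token and the labels are a straight copy of input_ids, so padded positions also count toward the loss. A common remedy is to set the labels at padded positions to -100, the value the PyTorch cross-entropy loss ignores by default. The helper below is a sketch; `mask_padding_labels` is a hypothetical name, and it works on the plain Python lists that a batched tokenizer call returns.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_padding_labels(batch: dict) -> dict:
    """Build labels from input_ids, replacing padded positions with IGNORE_INDEX.

    Expects `input_ids` and `attention_mask` as lists of equal-length lists.
    Positions where attention_mask is 0 are padding and are excluded from the loss.
    """
    batch["labels"] = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch
```

Inside `tokenize_function`, this would replace the `.copy()` line, for example `tokenized = mask_padding_labels(tokenized)`. Because the mask comes from attention_mask rather than from the token id, a genuine EOS token at the end of an example would still be learned.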
Set Up the LoRA Model
Load the base model and inject LoRA adapters (from Article 4):
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
import torch
# Load base model
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
# Configure LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Inject adapters
model = get_peft_model(model, lora_config)
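# Enable gradient checkpointing for memory efficiency
model.gradient_checkpointing_enable()
# With a frozen base model, some versions need the inputs to require grads
# so that checkpointed layers can backpropagate into the adapters
model.enable_input_require_grads()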
# Print trainable parameters
model.print_trainable_parameters()
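The number that `print_trainable_parameters()` reports can be checked by hand. Each adapted linear layer gains two matrices, A of shape (r × in_features) and B of shape (out_features × r). The sketch below assumes the Llama-2-7B shapes (32 decoder layers, q_proj and v_proj both mapping 4096 to 4096); `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r: int, in_features: int, out_features: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

# Assumed Llama-2-7B shapes; adjust for other base models.
NUM_LAYERS = 32
HIDDEN = 4096
TARGETS_PER_LAYER = 2  # q_proj and v_proj

trainable = NUM_LAYERS * TARGETS_PER_LAYER * lora_param_count(16, HIDDEN, HIDDEN)
print(f"{trainable:,} trainable parameters")  # 8,388,608
```

That is roughly 0.12% of the base model's parameters, which is why the adapter's gradients and optimizer state are so small compared with full fine-tuning.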
Configure the Trainer
The Hugging Face Trainer handles the training loop, gradient accumulation, validation, and checkpointing:
import transformers # needed for transformers.default_data_collator below
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./lora-checkpoints", # Where to save checkpoints
# Learning and optimization
learning_rate=5e-4,
num_train_epochs=3,
per_device_train_batch_size=8, # Adjust based on GPU VRAM
per_device_eval_batch_size=16,
gradient_accumulation_steps=4, # Effective batch size = 8 * 4 = 32
# Evaluation and saving
eval_strategy="steps", # Evaluate every N steps
eval_steps=100,
save_strategy="steps",
save_steps=100,
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
greater_is_better=False,
# Logging and monitoring
logging_steps=10,
logging_dir="./logs",
report_to=["wandb"], # Log to Weights & Biases (optional; use [] to disable)
# Hardware
optim="adamw_8bit", # Use 8-bit AdamW for memory efficiency
warmup_steps=100,
weight_decay=0.01,
max_grad_norm=1.0, # Gradient clipping
# Advanced
seed=42,
dataloader_pin_memory=True, # Faster data loading
dataloader_num_workers=4
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_train,
eval_dataset=tokenized_val,
data_collator=transformers.default_data_collator, # Default padding collator
)
print(f"Training with effective batch size: {8 * 4} (per_device * accumulation_steps)")
Key parameters explained:
- per_device_train_batch_size: Batch size per GPU. Start low (4–8) if you have VRAM constraints.
- gradient_accumulation_steps: Accumulate gradients over N steps before updating. Effective batch size = per_device_batch_size × accumulation_steps. Useful if your GPU can't fit larger batches.
- eval_steps: Evaluate every N steps. For 5,000 training examples and batch size 32, ~150 steps = 1 epoch. Set to ~150–200 for frequent validation.
- optim="adamw_8bit": Use 8-bit AdamW to reduce optimizer memory (bitsandbytes required).
- warmup_steps: Gradually increase LR from 0 to peak over N steps. Prevents training instability. Use 5–10% of total steps.
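The arithmetic behind these settings is easy to script. The sketch below derives the effective batch size, steps per epoch, total steps, and a warmup length from a chosen fraction; `schedule_summary` is a hypothetical helper, and the Trainer's own rounding may differ by a step.

```python
import math

def schedule_summary(num_examples, per_device_batch, accumulation_steps,
                     num_epochs, warmup_fraction=0.05, num_gpus=1):
    """Approximate optimizer-step counts for a training run."""
    effective_batch = per_device_batch * accumulation_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_fraction)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# 5,000 examples, batch 8 x accumulation 4, 3 epochs, 5% warmup
summary = schedule_summary(5000, 8, 4, 3)
print(summary)
```

For this example the run is about 471 optimizer steps in total, so a fixed warmup_steps=100 would be about 21% of the run, well above the 5–10% guideline; on small datasets it may be worth deriving the warmup from the total rather than hard-coding it.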
Estimate Memory Usage
Before training, check if your setup fits on your GPU:
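# Rough memory calculator for LoRA training (order-of-magnitude only)
per_device_batch_size = training_args.per_device_train_batch_size
seq_length = 512 # max_length used during tokenization
hidden_dim = model.config.hidden_size
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
model_size_gb = total_params * 2 / (1024**3) # float16 = 2 bytes per param
gradient_size_gb = trainable_params * 4 / (1024**3) # only adapter weights get gradients (assumes float32 adapters)
optimizer_state_gb = trainable_params * 2 / (1024**3) # 8-bit Adam (2 states × 1 byte)
activation_size_gb = 2 * per_device_batch_size * seq_length * hidden_dim * 2 / (1024**3) # crude lower bound
total_gb = model_size_gb + gradient_size_gb + optimizer_state_gb + activation_size_gb
print(f"Model: {model_size_gb:.1f} GB")
print(f"Gradients: {gradient_size_gb:.1f} GB")
print(f"Optimizer: {optimizer_state_gb:.1f} GB")
print(f"Activations: {activation_size_gb:.1f} GB")
print(f"Total: {total_gb:.1f} GB")
# Compare to your GPU VRAM
gpu_vram_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
print(f"GPU VRAM: {gpu_vram_gb:.1f} GB")
if total_gb > gpu_vram_gb * 0.9:
    print("WARNING: May OOM. Reduce batch size, seq length, or enable gradient checkpointing.")
For a 7B model on a 24 GB GPU, full fine-tuning does not fit:
- Model: 14 GB (float16)
- Gradients: 14 GB
- Optimizer: 28 GB at 2 bytes per Adam state, 56 GB with float32 states (too much!)
- Solution: Train LoRA adapters. Gradients and optimizer states then exist only for the adapter weights (about 8.4M parameters for r=16 on q_proj and v_proj), which takes tens of megabytes instead of tens of gigabytes. The 8-bit optimizer shrinks the remaining optimizer state further. Gradient accumulation helps separately, by letting you lower the per-device batch size and with it the activation memory.
With 8-bit Adam, LoRA adapters, and gradient checkpointing:
- Effective memory: ~20 GB (fits on 24 GB GPU).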
Start Training
Launch the training process:
# Start training
train_result = trainer.train()
# Save the trained adapter
model.save_pretrained("./llama2-7b-customer-support-adapter")
tokenizer.save_pretrained("./llama2-7b-customer-support-adapter")
print("Training complete!")
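# trainer.train() returns training metrics only, so run a final evaluation for eval_loss
eval_metrics = trainer.evaluate()
print(f"Final eval loss: {eval_metrics['eval_loss']:.4f}")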
The Trainer will:
- Train for 3 epochs on the training dataset.
- Evaluate on the validation set every 100 steps.
- Save a checkpoint every 100 steps (save_steps=100).
- Load the checkpoint with the lowest validation loss once training finishes.
- Log metrics to Weights & Biases (if configured).
Monitor Training with Weights & Biases
For detailed metrics tracking, integrate Weights & Biases (free tier available):
# Install WandB
pip install wandb
# Authenticate
wandb login
Training metrics (loss, learning rate, gradient norm) are automatically logged. View them at https://wandb.ai/your-username/your-project:
# In TrainingArguments, set report_to=["wandb"]
# After training, print the run URL (wandb.run is None if no run is active):
import wandb
if wandb.run is not None:
    print(wandb.run.url)
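The project a run lands in can be controlled through environment variables that the Hugging Face W&B integration reads. The sketch below sets them from Python; the project name is a hypothetical placeholder, and the variables should be set before the Trainer is created.

```python
import os

# Set these before constructing the Trainer; the W&B integration reads them at setup.
os.environ["WANDB_PROJECT"] = "lora-customer-support"  # hypothetical project name
os.environ["WANDB_LOG_MODEL"] = "false"                # do not upload checkpoints as artifacts
```

Without WANDB_PROJECT, runs are typically grouped under a default project name, which makes them harder to find once you have several experiments. A per-run label can be set with the `run_name` argument of TrainingArguments.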
Key metrics to monitor:
- Training loss: Should decrease smoothly; sudden spikes indicate instability.
- Validation loss: Should decrease then plateau; if it increases, you're overfitting.
- Learning rate: Should follow the configured schedule.
- Gradient norm: Should stay <5; if much larger, reduce learning rate.
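The overfitting signal in the list above (validation loss rising after a minimum) can be checked programmatically on a list of logged eval losses. This is a sketch under the assumption that you have collected the eval_loss values in order; `validation_is_rising` is a hypothetical helper.

```python
def validation_is_rising(eval_losses, patience=3):
    """Return True if none of the last `patience` evaluations beat the best earlier loss."""
    if len(eval_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(eval_losses[:-patience])
    return all(loss >= best_before for loss in eval_losses[-patience:])

history = [1.00, 0.80, 0.70, 0.72, 0.75, 0.80]
print(validation_is_rising(history))  # True: 0.70 has not been beaten in the last 3 evals
```

The transformers library also ships an `EarlyStoppingCallback` that applies a similar patience rule inside the Trainer and stops the run automatically; it relies on `load_best_model_at_end=True` and `metric_for_best_model`, both of which are already set in the configuration above.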
Resume Training from Checkpoint
If training is interrupted, resume from the last checkpoint:
# Resume from last checkpoint
trainer.train(resume_from_checkpoint=True)
# Or, resume from a specific checkpoint:
trainer.train(resume_from_checkpoint="./lora-checkpoints/checkpoint-500")
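Note that `resume_from_checkpoint=True` raises an error when the output directory holds no checkpoint yet, which is awkward for a script that should work on both the first run and a restart. One way around this is to look for the newest checkpoint yourself and pass either its path or None. The helper below is a sketch; `find_latest_checkpoint` is a hypothetical name, and it assumes the Trainer's `checkpoint-<step>` directory naming.

```python
import os
import re

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def find_latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Compare step numbers numerically so checkpoint-1000 beats checkpoint-900.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Usage: starts fresh when no checkpoint exists, resumes otherwise.
# trainer.train(resume_from_checkpoint=find_latest_checkpoint("./lora-checkpoints"))
```

The transformers library provides a comparable utility, `get_last_checkpoint` in `transformers.trainer_utils`, which may be preferable if you do not need custom logic.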
Save and Load the Final Adapter
Save the trained adapter:
# Save adapter (small, tens of MB for rank 16 on q_proj and v_proj)
adapter_dir = "./llama2-7b-customer-support-adapter"
model.save_pretrained(adapter_dir)
# Also save tokenizer
tokenizer.save_pretrained(adapter_dir)
print(f"Adapter saved to {adapter_dir}")
Load it later:
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
# Load adapter
model = PeftModel.from_pretrained(
base_model,
adapter_dir,
is_trainable=False # For inference only
)
# Run inference
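# The prompt follows the training format and stops after "Response:"
input_text = "Instruction: Classify this customer message.\n\nMessage: I lost my password.\n\nResponse:"
# Move the inputs to the same device as the model
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))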
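The decoded output contains the prompt followed by the completion, so a little string handling is needed to recover the predicted class. The sketch below assumes the prompt layout from the dataset section; `build_prompt` and `extract_label` are hypothetical helpers.

```python
PROMPT_TEMPLATE = (
    "Instruction: Classify this customer message.\n\n"
    "Message: {message}\n\n"
    "Response:"
)

def build_prompt(message: str) -> str:
    """Format a message exactly like the training examples, minus the label."""
    return PROMPT_TEMPLATE.format(message=message.strip())

def extract_label(generated_text: str):
    """Return the first word after the last 'Response:' marker, or None if absent."""
    _, sep, tail = generated_text.rpartition("Response:")
    if not sep:
        return None
    words = tail.split()
    return words[0].strip(".,") if words else None

decoded = build_prompt("I lost my password.") + " Request\n\nMessage: ..."
print(extract_label(decoded))  # Request
```

Taking only the first word after the marker guards against the model continuing to generate further text once it has emitted the label.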
Complete Training Script
Here's a runnable, end-to-end example:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
# 1. Load and prepare data
dataset = load_dataset("json", data_files="train.jsonl")
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)
# 2. Tokenize
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
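def tokenize(examples):
    # Pad to a fixed length so the Trainer's default collator can stack examples into batches
    tok = tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")
    tok["labels"] = tok["input_ids"].copy()
    return tok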
train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
val_data = dataset["test"].map(tokenize, batched=True, remove_columns=["text"])
# 3. Setup model and LoRA
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
lora_config = LoraConfig(
r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.gradient_checkpointing_enable()
# 4. Train
args = TrainingArguments(
output_dir="./checkpoints",
learning_rate=5e-4,
num_train_epochs=3,
per_device_train_batch_size=8,
gradient_accumulation_steps=4,
eval_strategy="steps",
eval_steps=100,
save_steps=100,
load_best_model_at_end=True,
optim="adamw_8bit",
report_to=[]
)
trainer = Trainer(
model=model,
args=args,
train_dataset=train_data,
eval_dataset=val_data
)
trainer.train()
# 5. Save
model.save_pretrained("./final-adapter")
tokenizer.save_pretrained("./final-adapter")
print("Done!")
Key Takeaways
- Prepare datasets in a standard format (JSON Lines with a text field).
- Tokenize and preprocess data using the tokenizer from the base model.
- Configure the Trainer with appropriate batch size, learning rate, and validation frequency.
- Monitor training loss and validation loss for signs of divergence or overfitting.
- Use gradient accumulation and 8-bit optimizers to fit on consumer GPUs.
- Save adapters separately from the base model; they're small and portable.
Frequently Asked Questions
My training OOMs (out of memory). What can I do?
- Reduce per_device_train_batch_size (try 4 instead of 8).
- Increase gradient_accumulation_steps to maintain effective batch size.
- Reduce max_length in tokenization (try 256 or 384).
- Enable optim="adamw_8bit" if not already.
- Ensure gradient_checkpointing_enable() is called.
Why does validation loss increase while training loss decreases?
You're overfitting—the model is memorizing the training set. Reduce training time (num_train_epochs), increase dropout (lora_dropout), or add weight decay (weight_decay=0.1).
How do I continue training from a checkpoint?
Call trainer.train(resume_from_checkpoint=True) to resume from the latest checkpoint, or specify a checkpoint path.
Can I use multiple GPUs?
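Yes, but device_map="auto" is not what distributes batches. With several GPUs it splits the model's layers across them, so the model fits but only one GPU computes at a time. For data-parallel training, where each GPU holds a full copy of the model and processes its own batches, launch the script with torchrun or accelerate launch and load the model without device_map="auto".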
Further Reading
- Hugging Face Trainer API — Full Trainer documentation and advanced options.
- Efficient Training on a Single GPU — Memory optimization techniques.
- Weights & Biases Integration — Detailed logging and visualization guide.
- Training Your Own Model — Hugging Face course on fine-tuning and training.