Training LoRA Adapters on Consumer GPUs: Complete Workflow

Training a LoRA adapter involves preparing your dataset, setting up the Hugging Face Trainer, monitoring training progress, and iterating based on validation metrics. This guide walks through a complete workflow: loading data, configuring training hyperparameters, tracking metrics with Weights & Biases, handling GPU memory constraints, and saving the final adapter. By the end, you'll have a production-ready fine-tuned model checkpoint on your consumer GPU.

Prepare Your Dataset

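The workflow in this guide reads training data as JSON Lines: one JSON object per line, with a single text field that holds the full prompt and the expected response. For instruction-following tasks, a typical layout is: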

{"text": "Instruction: Classify this customer message.\n\nMessage: I can't log into my account.\n\nResponse: Issue"}
{"text": "Instruction: Classify this customer message.\n\nMessage: How do I change my password?\n\nResponse: Request"}
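If your raw data is a list of messages and labels, records in this layout can be generated and checked with the standard library. The sketch below is illustrative only: the helper names format_example and validate_jsonl, and the sample rows, are assumptions rather than part of any library.

```python
import io
import json

def format_example(message, label,
                   instruction="Classify this customer message."):
    """Build one training record in the Instruction/Message/Response layout."""
    text = f"Instruction: {instruction}\n\nMessage: {message}\n\nResponse: {label}"
    return {"text": text}

def validate_jsonl(lines):
    """Return the number of valid records; raise on a malformed line."""
    count = 0
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict) or not isinstance(record.get("text"), str) or not record["text"]:
            raise ValueError(f"line {line_no}: missing or empty 'text' field")
        count += 1
    return count

# Hypothetical sample rows; replace with your own data
rows = [("I can't log into my account.", "Issue"),
        ("How do I change my password?", "Request")]

buffer = io.StringIO()  # swap for open("train.jsonl", "w") to write a real file
for message, label in rows:
    buffer.write(json.dumps(format_example(message, label)) + "\n")

num_records = validate_jsonl(buffer.getvalue().splitlines())
print(f"{num_records} valid records")
```

Running a check like this before training catches empty or malformed lines early, when they are cheap to fix.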

Load the dataset using the Hugging Face datasets library:

from datasets import load_dataset

# Load from local JSON file
dataset = load_dataset("json", data_files="train.jsonl")

# Split into train and validation (90/10)
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
val_dataset = dataset["test"]

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")

# Inspect a sample
print(train_dataset[0])

Tokenize the Dataset

Prepare text for the model by tokenizing:

from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    """Tokenize and truncate texts."""
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,  # Adjust based on your GPU and dataset
        padding="max_length"
    )
    tokenized["labels"] = tokenized["input_ids"].copy()  # For language modeling
    return tokenized

# Tokenize all data (may take a few minutes for large datasets)
tokenized_train = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

tokenized_val = val_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

print(f"Tokenized training set: {len(tokenized_train)} examples")
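One refinement to consider: because pad_token is set to eos_token and every example is padded to max_length, copying input_ids into labels also asks the model to predict the padding. A common convention is to set padded label positions to -100, which PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists shaped like the tokenizer's batched output; mask_padding_labels is a hypothetical helper, and the token IDs are made up for illustration.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_padding_labels(batch):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX.

    `batch` mimics the tokenizer's batched output: a dict holding
    `input_ids` and `attention_mask` as lists of equal-length lists.
    """
    labels = []
    for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
        labels.append([tok if keep == 1 else IGNORE_INDEX
                       for tok, keep in zip(ids, mask)])
    batch["labels"] = labels
    return batch

# Made-up token IDs: 2 plays the role of the pad/eos token
example = {
    "input_ids": [[1, 415, 902, 2, 2], [1, 88, 17, 65, 9]],
    "attention_mask": [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]],
}
masked = mask_padding_labels(example)
print(masked["labels"])
```

Inside tokenize_function, a call like this could take the place of the plain .copy() line, since the tokenizer's output supports the same dict-style access.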

Set Up the LoRA Model

Load the base model and inject LoRA adapters (from Article 4):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
import torch

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Inject adapters
model = get_peft_model(model, lora_config)

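# Enable gradient checkpointing for memory efficiency
model.gradient_checkpointing_enable()
# The base weights are frozen, so make the embedding outputs require gradients;
# otherwise checkpointed layers may not backpropagate into the adapters
model.enable_input_require_grads()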

# Print trainable parameters
model.print_trainable_parameters()
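As a sanity check on what print_trainable_parameters() reports, the adapter size can be worked out by hand: each adapted linear layer of shape d_in × d_out gains two matrices, A (r × d_in) and B (d_out × r). The sketch below assumes Llama-2-7B's published dimensions (hidden size 4096, 32 decoder layers, square q_proj and v_proj) instead of reading them from the model; lora_param_count is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumed Llama-2-7B dimensions (not read from the model here)
hidden_size = 4096
num_layers = 32
rank = 16
target_modules = ["q_proj", "v_proj"]  # both map hidden_size -> hidden_size

per_layer = sum(lora_param_count(hidden_size, hidden_size, rank)
                for _ in target_modules)
total_trainable = per_layer * num_layers

print(f"Per layer: {per_layer:,}")              # 262,144
print(f"Total trainable: {total_trainable:,}")  # 8,388,608
print(f"Size at 4 bytes/param: {total_trainable * 4 / 1e6:.1f} MB")  # 33.6 MB
```

Under these assumptions the adapters add roughly 0.1% to the base model's parameter count, which is why gradients and optimizer state stay small.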

Configure the Trainer

The Hugging Face Trainer handles the training loop, gradient accumulation, validation, and checkpointing:

from transformers import Trainer, TrainingArguments, default_data_collator

training_args = TrainingArguments(
    output_dir="./lora-checkpoints",  # Where to save checkpoints

    # Learning and optimization
    learning_rate=5e-4,
    num_train_epochs=3,
    per_device_train_batch_size=8,  # Adjust based on GPU VRAM
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,  # Effective batch size = 8 * 4 = 32

    # Evaluation and saving
    eval_strategy="steps",  # Evaluate every N steps
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,

    # Logging and monitoring
    logging_steps=10,
    logging_dir="./logs",
    report_to=["wandb"],  # Log to Weights & Biases (optional; use [] to disable)

    # Hardware
    optim="adamw_8bit",  # Use 8-bit AdamW for memory efficiency
    warmup_steps=100,
    weight_decay=0.01,
    max_grad_norm=1.0,  # Gradient clipping

    # Advanced
    seed=42,
    dataloader_pin_memory=True,  # Faster data loading
    dataloader_num_workers=4
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    # Stacks the already-padded examples into tensors; it does not pad by itself
    data_collator=default_data_collator,
)

print(f"Training with effective batch size: {8 * 4} (per_device * accumulation_steps)")

Key parameters explained:

  • per_device_train_batch_size: Batch size per GPU. Start low (4–8) if you have VRAM constraints.
  • gradient_accumulation_steps: Accumulate gradients over N steps before updating. Effective batch size = per_device_batch_size × accumulation_steps. Useful if your GPU can't fit larger batches.
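  • eval_steps: Evaluate every N optimizer steps. With 5,000 training examples and an effective batch size of 32, one epoch is about 156 steps, so a value of ~150–200 evaluates roughly once per epoch; use a smaller value (such as the 100 shown above) for more frequent validation.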
  • optim="adamw_8bit": Use 8-bit AdamW to reduce optimizer memory (bitsandbytes required).
  • warmup_steps: Gradually increase LR from 0 to peak over N steps. Prevents training instability. Use 5–10% of total steps.
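The arithmetic behind these settings can be sketched in a few lines. The dataset size of 5,000 is the example figure used above, and schedule_summary is a hypothetical helper; the Trainer's own step count may differ by one depending on how it handles the last partial batch.

```python
import math

def schedule_summary(num_examples, per_device_batch_size,
                     accumulation_steps, num_epochs, num_gpus=1):
    """Estimate optimizer steps and a 5-10% warmup range for a run."""
    effective_batch = per_device_batch_size * accumulation_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_range = (math.ceil(total_steps * 0.05),
                    math.floor(total_steps * 0.10))
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_range": warmup_range,
    }

summary = schedule_summary(num_examples=5000, per_device_batch_size=8,
                           accumulation_steps=4, num_epochs=3)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For this example the 5–10% guideline works out to about 24–47 warmup steps, so a fixed warmup_steps=100 would suit a noticeably larger dataset.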

Estimate Memory Usage

Before training, check if your setup fits on your GPU:

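# Rough memory calculator (order-of-magnitude estimate, not a guarantee)
per_device_batch_size = training_args.per_device_train_batch_size
seq_length = 512  # max_length used during tokenization
hidden_dim = model.config.hidden_size
num_layers = model.config.num_hidden_layers

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

model_size_gb = total_params * 2 / (1024**3)  # float16 = 2 bytes per param
gradient_size_gb = trainable_params * 2 / (1024**3)  # gradients exist only for LoRA weights
optimizer_state_gb = trainable_params * 4 / (1024**3)  # Adam (2 states × 2 bytes), LoRA weights only
# Very rough: a couple of float16 hidden states per layer; real usage is higher
activation_size_gb = 2 * per_device_batch_size * seq_length * hidden_dim * num_layers * 2 / (1024**3)

total_gb = model_size_gb + gradient_size_gb + optimizer_state_gb + activation_size_gb

print(f"Model: {model_size_gb:.1f} GB")
print(f"Gradients: {gradient_size_gb:.1f} GB")
print(f"Optimizer: {optimizer_state_gb:.1f} GB")
print(f"Activations: {activation_size_gb:.1f} GB")
print(f"Total: {total_gb:.1f} GB")

# Compare to your GPU VRAM
gpu_vram_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
print(f"GPU VRAM: {gpu_vram_gb:.1f} GB")

if total_gb > gpu_vram_gb * 0.9:
    print("WARNING: May OOM. Reduce batch size, seq length, or enable gradient checkpointing.")

For a 7B model on a 24 GB GPU, full fine-tuning (every parameter trainable) does not fit:

  • Model: 14 GB (float16)
  • Gradients: 14 GB
  • Optimizer: 28 GB (too much!)
  • Solution: Train only LoRA adapters, so gradients and optimizer state are needed for the adapter weights alone (under ten million parameters with the configuration above, well below 1 GB in total). An 8-bit optimizer shrinks that state further. Gradient accumulation does not reduce these terms; it lowers activation memory by allowing a smaller per-device batch.

With 8-bit Adam, LoRA adapters, and gradient checkpointing:

  • Effective memory: ~20 GB (fits on 24 GB GPU). Most of this is the 14 GB of frozen float16 weights; the rest is activations, temporary buffers, and CUDA overhead, and it varies with batch size and sequence length.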

Start Training

Launch the training process:

# Start training
train_result = trainer.train()

# Save the trained adapter
model.save_pretrained("./llama2-7b-customer-support-adapter")
tokenizer.save_pretrained("./llama2-7b-customer-support-adapter")

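print("Training complete!")
print(f"Final train loss: {train_result.metrics['train_loss']:.4f}")

# train_result.metrics holds training statistics only, so run a final evaluation
eval_metrics = trainer.evaluate()
print(f"Final eval loss: {eval_metrics['eval_loss']:.4f}")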

The Trainer will:

  1. Train for 3 epochs on the training dataset.
  2. Evaluate on the validation set every 100 steps.
  3. Save a checkpoint every 100 steps and track which one has the lowest validation loss.
  4. Load the best checkpoint automatically when training ends.
  5. Log metrics to Weights & Biases (if configured).
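To make the checkpoint selection concrete, the following sketch mimics how a best checkpoint is chosen when metric_for_best_model="eval_loss" and greater_is_better=False. It is a simplified illustration with made-up loss values and a hypothetical helper, pick_best_checkpoint, not the Trainer's actual implementation.

```python
def pick_best_checkpoint(eval_history, greater_is_better=False):
    """Return (step, metric) of the best evaluation in the history.

    `eval_history` is a list of (step, metric_value) pairs, one per evaluation.
    """
    if not eval_history:
        raise ValueError("no evaluations recorded")
    chooser = max if greater_is_better else min
    return chooser(eval_history, key=lambda pair: pair[1])

# Made-up validation losses recorded every 100 steps
history = [(100, 1.92), (200, 1.41), (300, 1.18), (400, 1.23), (500, 1.31)]

best_step, best_loss = pick_best_checkpoint(history)
print(f"Best checkpoint: checkpoint-{best_step} (eval_loss={best_loss})")
```

In this example training continued to step 500, but the weights restored at the end would be those from step 300, where validation loss was lowest.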

Monitor Training with Weights & Biases

For detailed metrics tracking, integrate Weights & Biases (free tier available):

# Install WandB
pip install wandb

# Authenticate
wandb login

Training metrics (loss, learning rate, gradient norm) are automatically logged. View them at https://wandb.ai/your-username/your-project:

# In TrainingArguments, set report_to=["wandb"]
# After training, view the run URL (wandb.run is None when no run is active):
import wandb

if wandb.run is not None:
    print(wandb.run.url)

Key metrics to monitor:

  • Training loss: Should decrease smoothly; sudden spikes indicate instability.
  • Validation loss: Should decrease then plateau; if it increases, you're overfitting.
  • Learning rate: Should follow the configured schedule.
  • Gradient norm: Should stay <5; if much larger, reduce learning rate.
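These checks can also be automated on logged values. The sketch below flags two of the patterns above on plain lists of numbers: validation loss rising for several consecutive evaluations, and a training-loss spike relative to the recent average. The helper names, the thresholds (patience=2, window=5, spike_factor=1.5), and the sample values are illustrative assumptions, not recommendations from any library.

```python
def is_overfitting(val_losses, patience=2):
    """True if validation loss rose in each of the last `patience` evaluations."""
    if len(val_losses) < patience + 1:
        return False
    recent = val_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))

def find_spikes(train_losses, window=5, spike_factor=1.5):
    """Return indices where loss exceeds spike_factor times the mean of the previous window."""
    spikes = []
    for i in range(window, len(train_losses)):
        baseline = sum(train_losses[i - window:i]) / window
        if train_losses[i] > spike_factor * baseline:
            spikes.append(i)
    return spikes

# Made-up histories for illustration
val_history = [1.90, 1.40, 1.20, 1.25, 1.33]
train_history = [2.1, 1.9, 1.8, 1.7, 1.6, 1.5, 3.4, 1.5, 1.4]

print("Overfitting:", is_overfitting(val_history))
print("Spike indices:", find_spikes(train_history))
```

A single spike that recovers is often harmless; repeated spikes or a sustained rise in validation loss are stronger signals to lower the learning rate or stop early.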

Resume Training from Checkpoint

If training is interrupted, resume from the last checkpoint:

# Resume from last checkpoint
trainer.train(resume_from_checkpoint=True)

# Or, resume from a specific checkpoint:
trainer.train(resume_from_checkpoint="./lora-checkpoints/checkpoint-500")
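resume_from_checkpoint=True looks for the most recent checkpoint-N directory in output_dir and raises an error if none exists, so a script that may also run on a fresh directory can check first. The sketch below does that lookup with the standard library; find_latest_checkpoint is a hypothetical helper (transformers ships a similar utility, get_last_checkpoint, in transformers.trainer_utils).

```python
import os
import re
import tempfile

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")

def find_latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.match(name)
        path = os.path.join(output_dir, name)
        # Compare step numbers as integers so checkpoint-1000 beats checkpoint-500
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demonstration on a temporary directory with fake checkpoint folders
with tempfile.TemporaryDirectory() as tmp:
    for step in (100, 500, 1000):
        os.mkdir(os.path.join(tmp, f"checkpoint-{step}"))
    latest_name = os.path.basename(find_latest_checkpoint(tmp))
    missing = find_latest_checkpoint(os.path.join(tmp, "does-not-exist"))

print(latest_name)
print(missing)
```

In a training script, passing the helper's result as resume_from_checkpoint would resume when a checkpoint exists and start from scratch when it returns None.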

Save and Load the Final Adapter

Save the trained adapter:

# Save adapter (small: tens of MB for rank 16 on q_proj and v_proj)
adapter_dir = "./llama2-7b-customer-support-adapter"
model.save_pretrained(adapter_dir)

# Also save tokenizer
tokenizer.save_pretrained(adapter_dir)

print(f"Adapter saved to {adapter_dir}")

Load it later:

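import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_dir = "./llama2-7b-customer-support-adapter"

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load adapter
model = PeftModel.from_pretrained(
    base_model,
    adapter_dir,
    is_trainable=False  # For inference only
)
model.eval()

# Load the tokenizer that was saved next to the adapter
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

# Run inference; the prompt mirrors the training format and stops at "Response:"
input_text = "Instruction: Classify this customer message.\n\nMessage: I lost my password.\n\nResponse:"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))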
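Because the adapter was trained on the Instruction/Message/Response layout, inference prompts should follow it, and the predicted label has to be cut out of the decoded string. Below is a minimal sketch with hypothetical helpers build_prompt and extract_label; the decoded string is a made-up example of what generate might return, not real model output.

```python
def build_prompt(message, instruction="Classify this customer message."):
    """Format a message like the training data, stopping at 'Response:'."""
    return f"Instruction: {instruction}\n\nMessage: {message}\n\nResponse:"

def extract_label(decoded, prompt):
    """Return the first line the model produced after the prompt."""
    completion = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    stripped = completion.strip()
    return stripped.splitlines()[0].strip() if stripped else ""

prompt = build_prompt("I lost my password.")

# Made-up decoded output: the prompt, the label, then text the model kept generating
decoded = prompt + " Issue\n\nInstruction: Classify"
label = extract_label(decoded, prompt)
print(label)
```

If decoding does not reproduce the prompt character for character, slicing the generated token IDs past the input length before decoding is a more robust way to isolate the completion.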

Complete Training Script

Here's a runnable, end-to-end example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# 1. Load and prepare data
dataset = load_dataset("json", data_files="train.jsonl")
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)

# 2. Tokenize
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

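def tokenize(examples):
    # Pad to a fixed length so the Trainer's default collator can stack examples into a batch
    tok = tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")
    tok["labels"] = tok["input_ids"].copy()
    return tok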

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
val_data = dataset["test"].map(tokenize, batched=True, remove_columns=["text"])

# 3. Setup model and LoRA
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.gradient_checkpointing_enable()
# Needed with frozen base weights so checkpointed layers can backpropagate into the adapters
model.enable_input_require_grads()

# 4. Train
args = TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=5e-4,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,
    optim="adamw_8bit",
    report_to=[]
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=val_data
)

trainer.train()

# 5. Save
model.save_pretrained("./final-adapter")
tokenizer.save_pretrained("./final-adapter")
print("Done!")

Key Takeaways

  • Prepare datasets in a standard format (JSON lines with text field).
  • Tokenize and preprocess data using the tokenizer from the base model.
  • Configure the Trainer with appropriate batch size, learning rate, and validation frequency.
  • Monitor training loss and validation loss for signs of divergence or overfitting.
  • Use gradient accumulation and 8-bit optimizers to fit on consumer GPUs.
  • Save adapters separately from the base model; they're small and portable.

Frequently Asked Questions

My training OOMs (out of memory). What can I do?

  1. Reduce per_device_train_batch_size (try 4 instead of 8).
  2. Increase gradient_accumulation_steps to maintain effective batch size.
  3. Reduce max_length in tokenization (try 256 or 384).
  4. Enable optim="adamw_8bit" if not already.
  5. Ensure gradient_checkpointing_enable() is called.
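The first two steps work together: halving the per-device batch while doubling accumulation keeps the effective batch size constant. Here is a small sketch that lists the combinations to try in order; batch_size_fallbacks is a hypothetical helper.

```python
def batch_size_fallbacks(per_device_batch_size, accumulation_steps):
    """Yield (batch_size, accumulation_steps) pairs, halving the batch each time.

    The product stays equal to the original effective batch size when the
    batch size divides it evenly, and just below it otherwise.
    """
    effective = per_device_batch_size * accumulation_steps
    batch = per_device_batch_size
    while batch >= 1:
        yield batch, effective // batch
        if batch == 1:
            break
        batch //= 2

options = list(batch_size_fallbacks(8, 4))
for batch, accum in options:
    print(f"per_device_train_batch_size={batch}, gradient_accumulation_steps={accum}")
```

Each step down roughly halves activation memory per forward pass at the cost of more passes per optimizer update, so training gets slower but the optimization behaves similarly.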

Why does validation loss increase while training loss decreases?

You're overfitting: the model is memorizing the training set. Reduce training time (num_train_epochs), increase dropout (lora_dropout), or raise weight decay (for example, weight_decay=0.1 instead of the 0.01 used above).

How do I continue training from a checkpoint?

Call trainer.train(resume_from_checkpoint=True) to resume from the latest checkpoint, or specify a checkpoint path.

Can I use multiple GPUs?

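Yes, but device_map="auto" (as shown) does not parallelize batches: it splits the model's layers across the visible GPUs, which helps when the model does not fit on one card, while each batch still passes through the GPUs one after another. For data-parallel training, where every GPU holds a full copy of the model and processes its own batches, load the model without device_map="auto" and launch the script with torchrun or accelerate launch; the Trainer then distributes batches across the processes automatically.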

Further Reading