Choosing LoRA Rank and Alpha: Hyperparameter Tuning

LoRA's effectiveness depends on careful hyperparameter choices. The rank r controls the expressiveness of the adapter; the alpha α scales the update magnitude; learning rate and batch size affect training dynamics. Rather than guessing, you can empirically validate these choices on a small held-out validation set, selecting hyperparameters that maximize downstream task performance. This guide provides strategies for grid search, analyzes the sensitivity of each parameter, and offers practical rules of thumb for common scenarios, enabling you to avoid expensive trial-and-error.

The Key Hyperparameters and Their Roles

LoRA-specific:

  • Rank r: Size of the low-rank matrices. Higher rank increases expressiveness and parameters; typical range 4–64.
  • Alpha α (lora_alpha): Scales the update: ΔW = (α / r) × U Vᵀ, where U and V are the two low-rank matrices. Usually set to 2 × r; it acts like a multiplier on the adapter's effective learning rate.

Training-wide:

  • Learning rate lr: Step size for gradient descent. Typical range 1e-4 to 5e-3.
  • Batch size B: Number of examples per gradient step. Typical range 4–64 (limited by GPU VRAM).
  • Weight decay: L2 regularization on trainable parameters; typical range 0.01–0.1.
  • Dropout (in LoRA): Regularization within adapters; typical 0.05–0.1.

Why it matters: A rank that's too low yields poor downstream performance; too high wastes GPU memory and training time. An alpha that's too large causes divergence; too small prevents effective adaptation. Misaligned learning rate amplifies both problems.
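To make the scaling term concrete, here is a minimal pure-Python sketch of the update formula above, the product of the two low-rank matrices scaled by α / r, on toy matrices. The function name lora_delta and the tiny matrices are illustrative assumptions, not part of any library; real adapters use tensors with thousands of rows.

```python
def lora_delta(U, V, alpha):
    """Compute (alpha / r) * U @ V^T for small list-of-lists matrices.

    U has shape (d_out, r) and V has shape (d_in, r), so the result
    has shape (d_out, d_in).
    """
    r = len(U[0])
    scale = alpha / r
    return [
        [scale * sum(u_row[k] * v_row[k] for k in range(r)) for v_row in V]
        for u_row in U
    ]


# Toy example: d_out = 3, d_in = 2, r = 2 (illustrative values only)
U = [[1, 0], [0, 1], [1, 1]]
V = [[1, 2], [3, 4]]

delta_default = lora_delta(U, V, alpha=4)  # alpha = 2 * r, scale = 2.0
delta_small = lora_delta(U, V, alpha=2)    # alpha = r, scale = 1.0

print(delta_default)
print(delta_small)
```

Doubling alpha at a fixed rank doubles every entry of the update, which is why alpha behaves like a multiplier on the adapter's step size.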

Sensitivity Analysis: Which Parameters Matter Most?

Research from Meta and Microsoft empirically measures each parameter's impact:

Parameter     | Impact on Quality                  | Impact on Memory | Impact on Speed
Rank r        | Very high                          | Very high        | High
Alpha α       | Medium (affects learning dynamics) | None             | None
Learning rate | Very high                          | None             | None
Batch size    | High (stability + compute)         | High             | High
Dropout       | Low                                | None             | None
Weight decay  | Low                                | None             | None

The order of importance: Rank, learning rate, batch size, alpha, then dropout and weight decay. Tune in that priority order.

Step 1: Determine Baseline Rank

Start by choosing a rank based on task complexity (from Article 2):

  • Instruction-tuning or simple classification: Rank 4–8
  • Domain adaptation or moderate complexity: Rank 8–16
  • Complex linguistic tasks (translation, reasoning): Rank 16–32

Why start here? Task complexity determines the intrinsic dimensionality of the update. Instruction-tuning (teaching a model to follow diverse prompts) mostly redirects behavior the model already has rather than teaching new language patterns, so a low-rank update is usually enough. Domain adaptation (e.g., legal documents) requires learning domain-specific word distributions, which is higher-dimensional but still low-rank. Complex linguistic tasks (e.g., multilingual translation) require learning structural mappings and have the highest intrinsic dimension.

Example: Suppose you're fine-tuning Llama 2 7B on customer support classification (binary: issue or inquiry). This is a simple task: you're adapting a well-trained language model to recognize support patterns. Start with rank 4 or 8.
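For a rough sense of what each candidate rank costs, the sketch below counts LoRA parameters under stated assumptions: a hidden size of 4096 and 32 transformer layers (the published Llama 2 7B dimensions), with adapters on q_proj and v_proj only, each treated as a square 4096 × 4096 projection. The helper name lora_trainable_params is hypothetical.

```python
def lora_trainable_params(rank, hidden_size=4096, num_layers=32, modules_per_layer=2):
    """Estimate the LoRA parameter count for square projection matrices.

    Each adapted matrix gets two low-rank factors of shape
    (hidden_size, rank), i.e. 2 * hidden_size * rank parameters.
    """
    per_matrix = 2 * hidden_size * rank
    return per_matrix * modules_per_layer * num_layers


for rank in [4, 8, 16, 32, 64]:
    print(f"rank {rank:>2}: {lora_trainable_params(rank):,} trainable parameters")
```

Under these assumptions the count grows linearly with rank, from about 2.1 million parameters at rank 4 to about 33.6 million at rank 64, which is still a small fraction of the 7 billion base parameters.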

Step 2: Validate Rank with a Small Grid Search

Train models with different ranks on a small subset of your training data (20–50% of the full data, or ~1,000 examples). Evaluate each on a validation set, then plot the downstream metric (accuracy, F1, perplexity) against rank:

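import copy
import json

from peft import LoraConfig, get_peft_model
from sklearn.metrics import accuracy_score

ranks = [4, 8, 16, 32, 64]
results = {}

for r in ranks:
    # 1. Create LoRA config with this rank
    lora_config = LoraConfig(
        r=r,
        lora_alpha=2 * r,  # Scale alpha with rank
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    # 2. Train on subset (e.g., 1000 examples)
    # get_peft_model modifies the model it receives, so wrap a fresh copy
    # (or reload the base model from disk) for every rank
    model = get_peft_model(copy.deepcopy(base_model), lora_config)
    # ... train for N epochs on subset ...

    # 3. Evaluate on validation set
    # predict_labels is a placeholder for your own routine that maps model
    # outputs to class labels; raw model.generate() output is token IDs
    predictions = predict_labels(model, val_inputs)
    accuracy = float(accuracy_score(val_labels, predictions))

    # Count only the parameters that are actually trained (the adapters)
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    results[r] = {"accuracy": accuracy, "params": n_params}
    print(f"Rank {r}: Accuracy {accuracy:.4f}, Params {n_params:,}")

# Dump results for plotting
print(json.dumps(results, indent=2))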

What to expect: Accuracy typically rises quickly at low ranks and then flattens, roughly logarithmically. Illustrative numbers for a task like this:

  • Rank 4: 85% accuracy
  • Rank 8: 91% accuracy
  • Rank 16: 94% accuracy
  • Rank 32: 95% accuracy
  • Rank 64: 95.5% accuracy

The curve flattens around rank 16–32. Choose the rank at the "elbow" where improvements plateau. For customer support classification, you'd likely pick rank 8 or 16.
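One way to turn the elbow rule into code is to pick the smallest rank whose next step up gains less than a chosen threshold. This is a sketch under that assumption; the function name pick_elbow_rank and the 0.02 threshold are illustrative choices, and the accuracies are the example figures listed above.

```python
def pick_elbow_rank(accuracy_by_rank, min_gain=0.02):
    """Return the smallest rank whose next step up gains less than min_gain."""
    ranks = sorted(accuracy_by_rank)
    for current, larger in zip(ranks, ranks[1:]):
        gain = accuracy_by_rank[larger] - accuracy_by_rank[current]
        if gain < min_gain:
            return current
    return ranks[-1]  # No plateau found: keep the largest rank tested


# Illustrative accuracies from the list above
accuracy_by_rank = {4: 0.85, 8: 0.91, 16: 0.94, 32: 0.95, 64: 0.955}

chosen = pick_elbow_rank(accuracy_by_rank)
print(f"Chosen rank: {chosen}")
```

With a 2-point threshold this selects rank 16; raising min_gain to 0.05 selects rank 8, so the threshold encodes how much accuracy you are willing to trade for a smaller adapter.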

Step 3: Optimize Learning Rate and Batch Size

Once you've fixed rank, tune learning rate and batch size jointly (they interact). Common values:

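import copy

import torch
from peft import get_peft_model

learning_rates = [1e-4, 5e-4, 1e-3, 5e-3]
batch_sizes = [4, 8, 16, 32]

best_lr, best_bs, best_loss = None, None, float("inf")

for lr in learning_rates:
    for bs in batch_sizes:
        # Check memory: does this config fit on your GPU?
        # estimate_memory and gpu_vram are placeholders for your own estimate
        if estimate_memory(base_model, bs) > gpu_vram:
            continue

        # Train a small validation run (e.g., 500 steps) on a fresh adapter;
        # lora_config uses the rank chosen in the previous step
        model = get_peft_model(copy.deepcopy(base_model), lora_config)
        trainable = [p for p in model.parameters() if p.requires_grad]
        optimizer = torch.optim.AdamW(trainable, lr=lr)

        # ... train for 500 steps with batch size bs ...

        # evaluate is a placeholder that returns the validation loss
        val_loss = evaluate(model, val_dataset)
        print(f"LR {lr}, BS {bs}: val_loss {val_loss:.4f}")

        if val_loss < best_loss:
            best_lr, best_bs, best_loss = lr, bs, val_loss

print(f"Best: LR {best_lr}, BS {best_bs}, loss {best_loss:.4f}")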
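One thing to keep in mind when reading the grid results: with a fixed number of steps, each batch size sees a different amount of data. The sketch below works through that arithmetic for 500-step runs, assuming the 5,000-example dataset from the concrete example later in this guide; the helper names are illustrative.

```python
import math


def examples_seen(steps, batch_size):
    """Number of training examples consumed in a fixed-step run."""
    return steps * batch_size


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps needed for one pass over the data (no accumulation)."""
    return math.ceil(num_examples / batch_size)


for bs in [4, 8, 16, 32]:
    seen = examples_seen(steps=500, batch_size=bs)
    epochs = seen / 5000  # passes over a 5,000-example dataset
    print(f"BS {bs:>2}: {seen:>6,} examples in 500 steps ({epochs:.1f} epochs), "
          f"{steps_per_epoch(5000, bs)} steps per epoch")
```

A 500-step run at batch size 32 covers eight times as much data as one at batch size 4, so comparing the two on validation loss alone tends to favor the larger batch. Matching runs on examples seen, rather than on steps, is one way to make the comparison fairer.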

Practical rules of thumb (2026):

For instruction-tuning:

  • Learning rate: 1e-4 to 5e-4
  • Batch size: 16–32 (stability benefits from larger batches)

For domain adaptation:

  • Learning rate: 5e-4 to 1e-3
  • Batch size: 8–16 (domain shifts can be sensitive)

How do these compare with full fine-tuning? LoRA updates only the small adapter matrices while the pretrained weights stay frozen, and in practice its best learning rate is usually higher than the one used for full fine-tuning of the same model, where values around 1e-5 to 5e-5 are common. Do not reuse a full fine-tuning learning rate unchanged; sweep the ranges above.

Step 4: Tune Alpha

Alpha is less sensitive than rank or learning rate, but still matters. Set alpha = 2 × r as a default, then test nearby values:

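import copy

from peft import LoraConfig, get_peft_model

r = 16  # Rank chosen in the rank search
alphas = [r, int(1.5 * r), 2 * r, 3 * r]  # e.g., for rank 16: [16, 24, 32, 48]

for alpha in alphas:
    lora_config = LoraConfig(
        r=r,
        lora_alpha=alpha,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    # Train and evaluate on a fresh copy of the base model
    # train_and_evaluate is a placeholder for your own training loop
    model = get_peft_model(copy.deepcopy(base_model), lora_config)
    val_loss = train_and_evaluate(model, train_data, val_data)

    print(f"Alpha {alpha}: val_loss {val_loss:.4f}")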

What to expect: Alpha has a relatively flat landscape. Values in the range [1.5r, 3r] typically give similar results. Alpha controls the magnitude of updates: too low and the adapters have little effect, too high and training can diverge. Because alpha multiplies the update much as the learning rate does, re-check the learning rate if you move alpha far from 2r. The default 2r is a solid starting point for most tasks.
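Expressing the candidates as multiples of r keeps the α / r scale the same when the rank changes, which is the reason for scaling alpha with rank. A small sketch, with hypothetical helper names:

```python
def alpha_candidates(rank, multipliers=(1.0, 1.5, 2.0, 3.0)):
    """Build integer alpha values as multiples of the rank."""
    return [int(round(m * rank)) for m in multipliers]


def scaling_factor(alpha, rank):
    """The alpha / r factor applied to the low-rank update."""
    return alpha / rank


for rank in [8, 16, 32]:
    alphas = alpha_candidates(rank)
    scales = [scaling_factor(a, rank) for a in alphas]
    print(f"rank {rank}: alphas {alphas}, scales {scales}")

# Keeping alpha fixed while raising the rank shrinks the scale instead
fixed_alpha_scales = [scaling_factor(32, rank) for rank in [8, 16, 32]]
print(fixed_alpha_scales)
```

If you change the rank after tuning alpha, carrying over the multiplier (for example 2 × r) rather than the raw alpha value keeps the update scale where you tuned it.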

A Concrete Example: Customer Support Classifier

You're fine-tuning Llama 2 7B on a 5,000-example customer support dataset (binary: issue or inquiry). Here's the hyperparameter search workflow:

Stage 1: Rank Search (1 hour)

  • Train 5 models (rank 4, 8, 16, 32, 64) on 1,000 examples, 1 epoch.
  • Evaluate on 500 validation examples.
  • Plot accuracy vs. rank. Elbow at rank 8, modest gain to rank 16.
  • Decision: Use rank 8 as baseline.

Stage 2: Learning Rate & Batch Size (3 hours)

  • Grid: [1e-4, 5e-4, 1e-3] × [8, 16, 32] = 9 configs.
  • Train each on full 5,000 examples, 3 epochs, evaluate on validation.
  • Best: LR 5e-4, batch size 16 (val_loss 0.35).

Stage 3: Final Tuning (4 hours)

  • Train the final configuration (rank 8, LR 5e-4, batch size 16, alpha 16) on the full dataset with early stopping.
  • Monitor validation loss; stop if it doesn't improve for 2 epochs.
  • Final model: 99.2% accuracy on test set.

Total time: 8 hours on a single A100 GPU, vs. 40+ hours for full fine-tuning on the same hardware.
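The data splits behind these stages can be sketched as follows. This assumes the 500 validation examples are held out of the 5,000 (the text does not say whether they are separate), which leaves 4,500 for training; make_splits and the integer stand-in data are illustrative.

```python
import random


def make_splits(examples, val_size=500, search_size=1000, seed=0):
    """Shuffle once, then carve out validation, rank-search, and training sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    val = shuffled[:val_size]
    train = shuffled[val_size:]
    search_subset = train[:search_size]  # Small subset for the rank search
    return train, search_subset, val


# Stand-in for 5,000 labeled support tickets (IDs only, for illustration)
examples = list(range(5000))
train, search_subset, val = make_splits(examples)

print(len(train), len(search_subset), len(val))
```

Fixing the seed keeps the validation set identical across all three stages, so losses from different configurations stay comparable.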

Early Stopping and Validation Strategy

Always hold out a validation set (10–20% of data) to detect overfitting:

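from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    eval_strategy="steps",  # Evaluate every N steps (evaluation_strategy in older releases)
    eval_steps=100,
    logging_steps=50,
    learning_rate=5e-4,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,  # Restore the best checkpoint after training
    metric_for_best_model="eval_loss",  # Metric used to rank checkpoints
    greater_is_better=False,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # Stop after 3 consecutive evaluations without an eval_loss improvement
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()

Note that metric_for_best_model alone does not stop training; it only decides which checkpoint counts as best. With EarlyStoppingCallback attached, the Trainer stops once validation loss has not improved for the configured number of evaluations, and load_best_model_at_end restores the best checkpoint. This limits overfitting and saves GPU hours.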
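The stopping rule itself is simple. As a library-independent sketch (the function name and the loss values are illustrative), patience-based early stopping can be written as:

```python
def early_stopping_index(val_losses, patience=2):
    """Return (stop_index, best_index) for a sequence of validation losses.

    Training stops at the first evaluation where the loss has failed to
    improve for `patience` consecutive evaluations; if that never happens,
    stop_index is the last evaluation.
    """
    best_loss = float("inf")
    best_index = -1
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_index, bad_evals = loss, i, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i, best_index
    return len(val_losses) - 1, best_index


# Illustrative validation losses, one per evaluation
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.50]
stop_at, best_at = early_stopping_index(losses, patience=2)
print(f"Stopped at evaluation {stop_at}, best checkpoint from evaluation {best_at}")
```

In this example training halts before the later drop to 0.50 is ever observed, which shows the trade-off: a small patience saves compute but can stop during a temporary plateau.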

Key Takeaways

  • Rank selection is the highest-impact hyperparameter choice; empirically tune via grid search on 20–50% of training data.
  • Learning rate is critical; for LoRA it typically lands in the 1e-4 to 1e-3 range, so sweep it instead of reusing a full fine-tuning value.
  • Batch size affects stability; larger batches (16–32) are preferable if VRAM allows.
  • Alpha scales the update magnitude; set to 2 × r by default, tune nearby values if needed.
  • Always use early stopping on a validation set to prevent overfitting and save compute.

Frequently Asked Questions

How do I know if my rank is too high?

If training is slow or you run out of VRAM with rank r, you can reduce it. If validation accuracy is the same for rank 8 and rank 32, rank 8 was sufficient. Use the elbow method: plot accuracy vs. rank and pick the elbow (steepest bend), not the asymptote.

My training diverges (loss goes to NaN). What went wrong?

Usually, the learning rate is too high for your rank and batch size. Halve the learning rate and retrain. Alternatively, your alpha is too large; reduce it from 2r to 1.5r. Finally, check that your training data has no corrupted examples (NaN, Inf values).
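Two of these checks are easy to automate. The sketch below scans numeric data for NaN or Inf values and applies the halve-on-divergence rule; both helper names are hypothetical, and the rows stand in for whatever numeric fields your dataset carries.

```python
import math


def find_non_finite_rows(rows):
    """Return indices of rows containing NaN or infinite values."""
    return [
        i for i, row in enumerate(rows)
        if not all(math.isfinite(x) for x in row)
    ]


def next_learning_rate(lr, loss):
    """Halve the learning rate if the last run diverged, else keep it."""
    return lr if math.isfinite(loss) else lr / 2


# Illustrative numeric features or targets, one row per example
rows = [[0.1, 0.2], [float("nan"), 1.0], [3.0, float("inf")], [0.5, 0.5]]
bad_rows = find_non_finite_rows(rows)
print(bad_rows)

print(next_learning_rate(1e-3, float("nan")))
print(next_learning_rate(1e-3, 0.42))
```

Running the data check first is cheap and rules out corrupted examples before you spend GPU time retraining at a lower learning rate.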

Should I tune dropout and weight decay?

These have lower impact than rank, LR, and batch size. Start with defaults (dropout 0.05, weight decay 0.01) and only tune if you have time. If your model overfits (training loss much lower than val loss), increase dropout to 0.1.

Can I use the same hyperparameters across different base models?

Partially. If you're fine-tuning Llama 2 7B and Llama 2 70B on the same task, the optimal rank might differ slightly (larger models can support higher rank), but learning rate and batch size are often transferable. Test a few configs on the new model before running a full search.

Further Reading