Skip to main content

LoRA Mathematics: Low-Rank Decomposition Theory

LoRA's effectiveness stems from a mathematical insight: the weight updates during fine-tuning are intrinsically low-rank, allowing you to express a large update matrix as a product of two much smaller matrices. This section derives the key equations, explains why low-rank parameterization works, and provides the intuition behind hyperparameter choices, enabling you to tune LoRA correctly for your domain.

The Weight Update Decomposition

In full fine-tuning, you update a weight matrix W (shape d_in × d_out) by computing gradients and performing gradient descent:

W_fine_tuned = W_pretrained + Delta_W

where Delta_W is the learned update. LoRA assumes that Delta_W is low-rank:

Delta_W = U @ V^T

Here, U has shape (d_in, r) and V has shape (d_out, r), where r is the rank and r ≪ min(d_in, d_out). During training, only U and V are learnable; W remains frozen. The total trainable parameters drop from d_in × d_out to (d_in + d_out) × r.

Parameter Reduction Example

For the query projection in a 7B-parameter Llama model:

  • Shape: 4096 × 4096 (hidden dimension squared)
  • Full fine-tuning: 4096 × 4096 = 16.8 million parameters
  • LoRA (rank 16): (4096 + 4096) × 16 = 131,072 parameters
  • Reduction ratio: 131,072 / 16,777,216 = 0.78% (99.2% fewer parameters)

The parameterization works because the model's foundational knowledge (stored in W) is frozen, and only task-specific information (stored in U @ V^T) is learned. Task-specific information is typically lower-dimensional than the full weight matrix.

Initialization: The Key to Stability

The initialization of U and V is crucial. The LoRA paper recommends:

  • U: Initialize to random Gaussian (mean 0, std dev proportional to 1 / sqrt(r))
  • V: Initialize to zero

This ensures that at the start of training, Delta_W = U @ 0 = 0, so the model begins with the pre-trained behavior and gradually adapts. Empirically, this initialization prevents training instability and divergence.

# Pseudo-code initialization
import torch
r = 16 # rank
d_in, d_out = 4096, 4096

# U: random Gaussian, scaled by 1/sqrt(r)
U = torch.randn(d_in, r) * (1.0 / math.sqrt(r))

# V: zero initialization
V = torch.zeros(d_out, r)

# At step 0: Delta_W = U @ V^T = all zeros
delta_W = U @ V.T # shape (4096, 4096), all zeros

Scaling and the Learning Rate

A critical detail: the learning rate for LoRA must be calibrated carefully. In early experiments, researchers found that a global learning rate optimal for full fine-tuning was too large for LoRA. The solution is to scale the update by alpha / r:

Delta_W = (alpha / r) * U @ V^T

where alpha is a tuning parameter (often set to 16 or 32). This scaling ensures that the magnitude of the weight update is independent of the rank, allowing consistent learning rates across different LoRA configurations.

Intuition: If you increase rank r without scaling, the update magnitude grows, and you'd need to reduce the learning rate. By scaling by alpha / r, you decouple rank from the effective learning rate. Typically, you set alpha = 2 × r (e.g., rank 8 → alpha 16, rank 16 → alpha 32).

alpha = 32
r = 16
# Effective scaling factor
scaling_factor = alpha / r # = 2.0

# Forward pass
delta_W_scaled = (scaling_factor) * U @ V.T
W_effective = W + delta_W_scaled

Why Low-Rank Works: The Intrinsic Dimension Hypothesis

The theoretical foundation rests on empirical observation: fine-tuning updates are low-rank. To understand why, consider a pre-trained model as having learned general language patterns (syntax, semantics, reasoning). Fine-tuning on a specific task tweaks only the final-layer mappings and a few intermediate representations. This adjustment is orthogonal to most of the pre-trained knowledge—you're not relearning everything, just adapting the output layer to the new task.

Singular Value Decomposition (SVD) provides a formal tool. If you compute the full fine-tuning update Delta_W_full and decompose it:

Delta_W_full = U_SVD @ Sigma @ V_SVD^T

where Sigma = diag(σ_1, σ_2, ...) and σ_1 ≥ σ_2 ≥ ... ≥ σ_n, you can truncate to rank r and recover:

Delta_W_approx = U_SVD[:, :r] @ Sigma[:r, :r] @ V_SVD[:, :r]^T

In practice, for instruction-tuning and domain adaptation tasks, the tail singular values decay rapidly: keeping the top 8–16 singular values captures 99%+ of the matrix's norm. This is why LoRA with small ranks works so well—you're not arbitrarily constraining the model; you're exploiting the natural low-rank structure.

Toy Example: Illustrating Low-Rank Approximation

Suppose the full fine-tuning update on a 100×100 weight matrix is computed (imagine this is a small projection layer). You compute its SVD and find:

Singular values: [50, 20, 5, 2, 1, 0.5, ...]

The top singular value is 50; the sum of all singular values is roughly 100. By keeping only rank 3:

delta_W_rank3_squared_norm = (50^2 + 20^2 + 5^2) = 2925
delta_W_full_squared_norm = (50^2 + 20^2 + 5^2 + 2^2 + 1^2 + ...) ≈ 2934
ratio = 2925 / 2934 ≈ 0.997 (99.7% of the update magnitude)

By using rank 3 (instead of rank 100), you've captured the essential adaptation with only 3% of the parameters. This is the empirical evidence behind LoRA.

Backpropagation Through LoRA Adapters

During the backward pass, gradients flow only through U and V:

dL/dU = (dL/d(Delta_W)) @ V  (shape: d_in x r)
dL/dV = (Delta_W)^T @ (dL/d(Delta_W)) (shape: d_out x r)

These gradient computations are efficient: even though Delta_W has shape d_in × d_out, you never explicitly form it if you rearrange the computation. Many optimized implementations (PyTorch, bitsandbytes) fuse these operations.

# Forward pass: output = (W + (alpha/r) * U @ V^T) @ input
# Simplified (ignoring alpha/r for clarity):
# output = W @ input + U @ (V^T @ input)

# Backward pass (for a loss L):
# dL/dV = U^T @ (dL/d(U @ V^T))
# dL/dU = (dL/d(U @ V^T)) @ V^T

Since you're only storing gradients for U and V (not W), memory usage is drastically reduced.

Scaling Law: Rank vs. Task Complexity

Empirically, the rank you need depends on task complexity:

TaskTypical RankReason
Instruction-following4–8High-level task-agnostic adaptation; pre-trained model already knows instructions.
Domain classification8–16Low-dimensional feature shift; same language structure, different distributions.
Code generation (domain-specific)16–32Requires learning syntax/idioms; moderate complexity.
Machine translation (low-resource)32–64Requires learning structural mappings between languages; higher intrinsic dimension.

As task complexity grows, intrinsic dimension increases. The empirical formula from Meta's research: r ≈ log2(task_complexity). In practice, you validate by training a few ranks and evaluating on a held-out set.

Multi-Head Attention and LoRA Application

In transformers, the query, key, and value projections are typically shaped (hidden_dim, hidden_dim). LoRA is applied per projection:

Q' = (W_Q + (alpha/r) * U_Q @ V_Q^T) @ input
K' = (W_K + (alpha/r) * U_K @ V_K^T) @ input
V' = (W_V + (alpha/r) * U_V @ V_V^T) @ input

Each projection gets its own adapter matrices, so total trainable parameters for attention in one layer is 3 × (d_hidden + d_hidden) × r = 6 × d_hidden × r.

Key Takeaways

  • LoRA parameterizes weight updates as a product of two low-rank matrices: Delta_W = U @ V^T, reducing parameters by 99%+.
  • Initialization of U (random) and V (zero) ensures the model starts at the pre-trained state and gradually adapts.
  • Scaling by alpha / r decouples rank from effective learning rate, ensuring consistent training dynamics across different rank choices.
  • The low-rank approximation works because task-specific updates are intrinsically low-dimensional, as shown by SVD analysis of full fine-tuning updates.
  • Rank selection depends on task complexity: rank 8 for instruction-tuning, rank 16–32 for domain adaptation, rank 32–64 for complex linguistic tasks.

Frequently Asked Questions

Why is V initialized to zero?

Initializing V = 0 ensures Delta_W = U @ 0 = 0 at the start. This means the model begins with pre-trained behavior and gradually adapts. If both U and V were random, the initial update would be noisy and potentially destabilize training. The zero-initialization trick lets the model fine-tune from a stable baseline.

What does the alpha parameter do?

Alpha scales the update magnitude by alpha / r. Without scaling, increasing rank would increase the update magnitude, requiring you to lower the learning rate. Alpha decouples rank from learning rate, so you can change rank without retuning the learning rate. Typical values: alpha = 2 × r.

How do I know if my rank is too low?

Train your model and evaluate on a validation set. If the validation loss plateaus prematurely or accuracy is significantly lower than full fine-tuning (more than 2–3%), your rank is likely too low. Increase rank and retrain.

Can I use different ranks for different layers?

Yes. Some implementations support per-layer rank configuration. However, for simplicity, most practitioners use the same rank across all LoRA layers. If you want to experiment with non-uniform ranks, you'd need to profile which layers are most task-specific (often attention layers matter more than feed-forward layers) and allocate more rank to those.

Further Reading