Skip to main content

QLoRA: Quantized LoRA for 70B Models on Consumer GPUs

QLoRA (Quantized LoRA) extends LoRA by combining it with 4-bit quantization, enabling you to fine-tune 70-billion-parameter models on a single 24 GB GPU—something infeasible even with LoRA alone. QLoRA quantizes the base model to 4-bit precision, keeping high-precision adapters, and uses gradient checkpointing and paged optimizers to fit training within consumer hardware. This combination achieves 97–99% of full fine-tuning quality while reducing memory from 780 GB (full) to 20 GB, making open-source model customization accessible to everyone.

Why Quantization Matters for LoRA

LoRA alone reduces trainable parameters to 0.2%, but the base model W still occupies significant memory. A 7B-parameter model in float32 is 28 GB; Llama 2–70B is 280 GB. Even though you're not training W, you must load it into memory for inference during the forward pass. Quantization reduces the precision of W from float32 (32 bits per weight) to lower precision formats, shrinking its footprint by 8–16× without substantially degrading task performance.

Quantization concept: Instead of storing every weight as a 32-bit float, represent it as a lower-precision integer (e.g., 4-bit, 2–16 range). To reconstruct the original value during inference, multiply by a scaling factor. Properly calibrated quantization preserves 98–99% of model quality because neural networks are surprisingly robust to precision loss in static layers (non-trainable).

4-Bit Quantization: The Sweet Spot

The LoRA+Quantization combination uses 4-bit integer quantization. A 4-bit weight can represent 16 values (0–15). The quantization formula is:

W_quantized = round(W_float32 / scale_factor)
W_reconstructed = W_quantized.astype(float32) * scale_factor

The scale factor is computed per block (e.g., per 64 weights) to minimize quantization error:

scale_factor = max(abs(W)) / 15  (since 4-bit max value is 15)

For a 70B model in 4-bit, memory drops from 280 GB (float32) to 35 GB (4-bit), and with LoRA adapters taking ~500 MB, total memory is ~36 GB. During training, you also need space for:

  • Optimizer states (8–10 GB for Adam)
  • Activations and gradients (4–8 GB)
  • LoRA adapter gradients (0.5 GB)

With gradient checkpointing (recomputing activations instead of storing them) and paged optimizers (offloading optimizer states to CPU RAM), the total fits in 24 GB.

Symmetric vs. Asymmetric Quantization

Two approaches exist:

Symmetric 4-bit: Map the weight range [-max_val, max_val] linearly to 4-bit integers [-8, 7]. Simple, fast, but slightly less precise if weights are skewed.

Asymmetric quantization: Map the true range [min_val, max_val] to [0, 15] (unsigned 4-bit). Slightly more accurate, commonly used in production.

For QLoRA, bitsandbytes uses a hybrid approach: 4-bit quantization of weights with a learned 32-bit scaling factor per group, balancing efficiency and accuracy.

Double Quantization: Quantizing the Quantizer

QLoRA introduces an innovation: the scale factors themselves are quantized (to 8-bit). Here's why: for a 70B model, storing one float32 scale factor per 64 weights adds significant overhead. By quantizing the scale factors to 8-bit, you further reduce memory with minimal accuracy loss:

W_quantized = round(W / scale_factor)
scale_quantized = quantize_8bit(scale_factor) # 8-bit representation
W_reconstructed = W_quantized * dequantize_8bit(scale_quantized)

This is called double quantization and saves ~0.5 GB more memory on a 70B model.

The QLoRA Training Recipe

Here's the end-to-end workflow:

  1. Load the base model in 4-bit quantization (via bitsandbytes).
  2. Inject LoRA adapter layers into attention and feed-forward projections.
  3. Keep adapters in float32 (trainable, high precision).
  4. Use gradient checkpointing to avoid storing activations.
  5. Use a paged optimizer to offload optimizer states to CPU RAM as needed.
  6. Train using standard PyTorch gradient descent, backpropagating only through adapters.
# Pseudocode (full code in Article 4)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig

# 1. Load model in 4-bit
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True, # Double quantization
bnb_4bit_quant_type="nf4" # 4-bit NormalFloat
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
device_map="auto"
)

# 2. Apply LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)

# 3. Prepare for training (gradient checkpointing, paged optimizer)
model.gradient_checkpointing_enable()

# 4. Train normally (gradients flow only through LoRA)
trainer = Trainer(model=model, args=training_args, train_dataset=train_data)
trainer.train()

Quantization Formats: NF4 vs. INT4

Bitsandbytes supports two 4-bit formats:

NF4 (NormalFloat): A custom 4-bit format optimized for weights drawn from a normal distribution (true for trained neural networks). It allocates bits non-uniformly, placing more representable values near zero (where weight distributions peak). Slightly better quality than int4.

INT4: Standard 4-bit two's complement integers. Simpler, slightly faster, but less accurate for skewed weight distributions.

For QLoRA, NF4 is recommended (and the default in bitsandbytes). The quality difference is ~0.3% on downstream tasks.

Memory Breakdown: 70B Model with QLoRA

ComponentSize (float32)Size (QLoRA)
Base model (70B, 2 tok/param)280 GB35 GB (4-bit)
LoRA adapters (rank 16, ~4 layers)0.5 GB
Optimizer states (Adam)560 GB8 GB (paged to CPU)
Gradients (LoRA only)0.5 GB
Activations (batch 4)120 GB2 GB (gradient checkpointing)
Total960 GB46 GB / 24 GB (with paging)

With aggressive paging, training fits in 24 GB VRAM + 16 GB host RAM.

Practical Trade-offs

SettingSpeedMemoryQuality
LoRA (rank 16, no quantization)1.0x (baseline)45 GB99%
QLoRA (4-bit, rank 16, no gradient checkpointing)0.95x28 GB98.5%
QLoRA (4-bit, rank 16, gradient checkpointing)0.85x20 GB98.5%
Full fine-tuning (baseline)1.5x800 GB100%

Speed loss is due to dequantizing weights on-the-fly during the forward pass. Modern GPUs (RTX 4090, H100) mitigate this with fused kernels, bringing overhead to 5–15%.

Inference: When to Merge Quantization

After training, you have choices for inference:

  1. Keep model quantized + load adapter: Store the 35 GB quantized base + 0.5 GB adapter; load both at inference. Slower inference (dequantization overhead), but compact.
  2. Merge adapter into quantized model: Merge the float32 adapter into the quantized base, keeping the model quantized. Risk of slight accuracy loss from re-quantizing the merged weights.
  3. Dequantize and merge: Dequantize the base to float32, merge the adapter, and save the full 70B model. Fast inference, but 280 GB storage.

For most applications, option 1 (keep separate) or option 2 (merge in 4-bit) is preferred, trading inference speed for storage and loading efficiency.

Key Takeaways

  • QLoRA combines 4-bit quantization with LoRA to fit 70B-parameter model training in 24 GB VRAM.
  • Double quantization quantizes both weights and scaling factors, further reducing memory overhead.
  • NF4 (NormalFloat 4-bit) is the recommended format, achieving 98–99% of full fine-tuning quality.
  • Gradient checkpointing and paged optimizers reduce peak memory usage by offloading non-critical data to CPU RAM.
  • Training speed is 5–15% slower than float32 LoRA due to weight dequantization, but the hardware accessibility trade-off is net-positive.

Frequently Asked Questions

Is 4-bit quantization lossy?

Yes, quantization is lossy by design. However, neural networks are robust to precision loss in static (frozen) layers. Empirically, 4-bit quantization of the base model loses only 1–2% of downstream task quality. The key is that you keep adapters in float32, so the learned task-specific updates are high-precision.

Why not use 8-bit or 2-bit quantization instead of 4-bit?

8-bit reduces memory less (less efficiency gain). 2-bit is more aggressive but loses significant quality (96–97% of full fine-tuning), making the trade-off less appealing. 4-bit is the sweet spot: strong efficiency gain (8×) with minimal quality loss (1–2%).

Can I use QLoRA with other model families (BERT, vision transformers)?

Yes, QLoRA is agnostic to model architecture. It works with any transformer-based model. Bitsandbytes supports quantization for LLaMA, Mistral, GPT, BERT, and others. However, the memory savings are most dramatic for decoder-only LLMs (Llama, GPT-style) because they have large activation footprints during autoregressive generation.

What hardware do I need for QLoRA?

Minimum: a single 24 GB GPU (RTX 4090, RTX 6000 Ada, L40, A5000). Recommended: 40+ GB for stable training with larger batch sizes and longer sequences. CPU RAM requirement is typically 16–32 GB for paged optimizer offloading.

Further Reading