
Multi-LoRA Adapters and Mixture-of-Experts Composition

While individual LoRA adapters are powerful, combining multiple task-specific adapters into a single model enables multi-task learning with minimal extra parameters. Mixture-of-Experts (MoE) routing selects which adapter to use based on input features, and sequential composition stacks adapters for complex adaptations. This approach lets you serve customer support, code generation, and retrieval tasks simultaneously on one base model, improving resource utilization while maintaining task-specific quality. This guide covers composition strategies, routing algorithms, and practical deployment patterns for 2026 production systems.

Adapter Composition Fundamentals

In a multi-adapter setup, a single base model has multiple task-specific adapters. During inference, you must decide which adapter(s) to activate. Three strategies exist:

1. Sequential composition: Stack adapters so that the output of adapter 1 feeds into adapter 2.

2. Parallel composition (weighted): Combine adapter outputs with learned weights.

3. Mixture-of-Experts (MoE): Route to one or more adapters based on input features.

Strategy | Use Case | Latency | Parameters
Sequential | Cascading tasks (e.g., retrieve, then rerank) | Multiplicative | Additive
Parallel weighted | Blend multiple tasks (e.g., sentiment + intent) | Additive | Additive
MoE routing | Input-conditional task selection | Similar to single | Additive

Let's explore each.

Sequential Composition: Stacking Adapters

Apply one adapter, then feed its output to another:

z_0 = input
z_1 = base_model(z_0) # Pre-trained knowledge
z_2 = adapter_1(z_1) # Task 1 (e.g., retrieve)
z_3 = adapter_2(z_2) # Task 2 (e.g., rerank)
output = z_3

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load first adapter (e.g., retrieval optimization) under an explicit name
model = PeftModel.from_pretrained(base_model, "./retrieval-adapter", adapter_name="retrieval")

# Important: don't call get_peft_model for the second adapter;
# attach it to the same PeftModel with .load_adapter()
model.load_adapter("rerank-adapter", adapter_name="rerank")

# At inference, specify which adapter to use with .set_adapter()
# (Note: PEFT's sequential composition requires manual implementation)
# Here's a simplified approach: run stage 1, then feed its tokens to stage 2.

def sequential_forward(model, input_ids, adapter_1_name, adapter_2_name, max_new_tokens=64):
    """Apply adapters sequentially: adapter 2 consumes adapter 1's output."""

    # Stage 1: generate with the first adapter active
    model.set_adapter(adapter_1_name)
    with torch.no_grad():
        stage_1_ids = model.generate(input_ids=input_ids, max_new_tokens=max_new_tokens)

    # Stage 2: generate with the second adapter, conditioned on the
    # prompt plus the tokens produced in stage 1
    model.set_adapter(adapter_2_name)
    with torch.no_grad():
        stage_2_ids = model.generate(input_ids=stage_1_ids, max_new_tokens=max_new_tokens)

    # This is simplified; a real pipeline also handles padding and prompt templates
    return stage_2_ids

Use case: Retrieval-augmented generation where you (1) retrieve relevant documents (adapter 1), then (2) generate answer conditioned on retrieved context (adapter 2).

Parallel Composition: Weighted Ensemble

Blend multiple adapters with learned or fixed weights:

output = base_model(input) + w_1 * adapter_1(input) + w_2 * adapter_2(input)

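Here adapter_i(input) denotes the change adapter i makes relative to the base output, and the weights are either normalized so that w_1 + w_2 = 1 or left unnormalized.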

import torch
import torch.nn as nn
from peft import PeftModel

class ParallelLoRAEnsemble(nn.Module):
    """Ensemble multiple LoRA adapters with learned weights."""

    def __init__(self, base_model, adapter_paths):
        super().__init__()
        num_adapters = len(adapter_paths)
        self.adapter_names = [f"adapter_{i}" for i in range(num_adapters)]
        self.weights = nn.Parameter(torch.ones(num_adapters) / num_adapters)

        # Load every adapter into ONE PeftModel, each under its own name.
        # Calling PeftModel.from_pretrained once per adapter would wrap the
        # same base model repeatedly and leave no clean base output.
        self.model = PeftModel.from_pretrained(
            base_model, adapter_paths[0], adapter_name=self.adapter_names[0]
        )
        for name, path in zip(self.adapter_names[1:], adapter_paths[1:]):
            self.model.load_adapter(path, adapter_name=name)

    def forward(self, input_ids, attention_mask=None):
        # Get base model output with all adapters switched off
        with self.model.disable_adapter():
            base_output = self.model(input_ids, attention_mask=attention_mask)
        base_logits = base_output.logits

        # Normalize weights (softmax)
        normalized_weights = torch.softmax(self.weights, dim=0)

        # Blend: base + weighted sum of each adapter's change to the logits
        blended_logits = base_logits.clone()
        for i, name in enumerate(self.adapter_names):
            self.model.set_adapter(name)  # activate one adapter at a time
            adapter_logits = self.model(input_ids, attention_mask=attention_mask).logits
            blended_logits = blended_logits + normalized_weights[i] * (adapter_logits - base_logits)

        return type(base_output)(logits=blended_logits, hidden_states=base_output.hidden_states)

# Usage (base_model and input_ids are assumed to be defined already)
ensemble = ParallelLoRAEnsemble(
    base_model,
    adapter_paths=["./customer-support-adapter", "./code-generation-adapter"],
)

# Forward pass blends both adapters
output = ensemble(input_ids)

# After training, weights settle to task importance
print(f"Adapter weights: {torch.softmax(ensemble.weights, dim=0)}")
# Output might be: [0.6, 0.4] (first adapter more important)

Use case: Multi-intent classification where you want to blend sentiment and intent predictions.
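
To make the blending formula concrete, here is a minimal sketch on toy tensors. The three vectors are made-up stand-ins for logits, not outputs of a real model. It shows that when the weights sum to 1, adding the weighted adapter deltas to the base output is the same as taking a weighted average of the adapter outputs.

```python
import torch

# Made-up logits for illustration only
base = torch.tensor([1.0, 2.0, 3.0])       # base-model logits
adapter_a = torch.tensor([2.0, 2.0, 2.0])  # logits with adapter A active
adapter_b = torch.tensor([0.0, 4.0, 6.0])  # logits with adapter B active

raw_weights = torch.tensor([0.0, 0.0])     # equal raw weights
w = torch.softmax(raw_weights, dim=0)      # normalized to [0.5, 0.5]

# base + sum_i w_i * (adapter_i - base)
blended = base + w[0] * (adapter_a - base) + w[1] * (adapter_b - base)

# With weights summing to 1 this equals the weighted average of adapter outputs
average = w[0] * adapter_a + w[1] * adapter_b

print(blended)  # tensor([1., 3., 4.])
print(torch.allclose(blended, average))  # True
```

With unnormalized weights the two expressions differ, because some share of the base output remains in (or is subtracted from) the blend.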

Mixture-of-Experts Routing: Dynamic Adapter Selection

A router network selects which adapter to activate based on input:

router_logits = router_network(input)
adapter_idx = argmax(router_logits)
output = adapter[adapter_idx](base_model_output)

This is the most efficient for large adapter sets because only one adapter activates per example.

from dataclasses import dataclass

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM
from peft import PeftModel

@dataclass
class MoEOutput:
    """Output container; the stock causal-LM output class has no routing fields."""
    logits: torch.Tensor
    hidden_states: tuple
    router_weights: torch.Tensor
    expert_indices: torch.Tensor

class MoELoRA(nn.Module):
    """Mixture-of-Experts LoRA router."""

    def __init__(self, base_model, adapter_paths, hidden_dim=768, num_experts=3):
        super().__init__()
        self.num_experts = num_experts

        # Load adapters into one PeftModel, each under its own name
        paths = adapter_paths[:num_experts]
        self.adapter_names = [f"expert_{i}" for i in range(len(paths))]
        self.model = PeftModel.from_pretrained(
            base_model, paths[0], adapter_name=self.adapter_names[0], is_trainable=False
        )
        for name, path in zip(self.adapter_names[1:], paths[1:]):
            self.model.load_adapter(path, adapter_name=name, is_trainable=False)

        # Router: small neural network that assigns examples to adapters
        # Takes a pooled hidden state from the model, outputs logits over adapters
        self.router = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_experts)
        )

    def forward(self, input_ids, attention_mask=None):
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)

        # Get base model output (without any adapter)
        with self.model.disable_adapter():
            base_output = self.model(
                input_ids,
                attention_mask=attention_mask,
                output_hidden_states=True
            )

        # Use last hidden state to route
        last_hidden_state = base_output.hidden_states[-1]  # (batch_size, seq_len, hidden_dim)

        # Mean-pool over non-padding tokens. A causal LM has no [CLS] token, and
        # its first position attends only to itself, so it carries no routing signal.
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
        pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        router_logits = self.router(pooled.float())  # (batch_size, num_experts)
        router_weights = torch.softmax(router_logits, dim=-1)  # Soft routing

        # Hard routing (pick top adapter)
        expert_indices = torch.argmax(router_logits, dim=-1)  # (batch_size,)

        # Apply the selected adapter to each group of examples.
        # Depending on the PEFT version, set_adapter may re-enable gradients on
        # the active adapter; pass only the router's parameters to the optimizer
        # if the adapters should stay frozen.
        output_logits = None
        for expert_idx, name in enumerate(self.adapter_names):
            selected = expert_indices == expert_idx
            if not selected.any():
                continue
            self.model.set_adapter(name)
            expert_logits = self.model(
                input_ids[selected], attention_mask=attention_mask[selected]
            ).logits
            if output_logits is None:
                out_shape = (input_ids.shape[0],) + tuple(expert_logits.shape[1:])
                output_logits = expert_logits.new_empty(out_shape)
            output_logits[selected] = expert_logits

        return MoEOutput(
            logits=output_logits,
            hidden_states=base_output.hidden_states,
            router_weights=router_weights,
            expert_indices=expert_indices
        )

# Usage (base_model, input_ids and attention_mask are assumed to be defined already)
model = MoELoRA(
    base_model,
    adapter_paths=[
        "./customer-support-adapter",
        "./code-generation-adapter",
        "./question-answering-adapter"
    ],
    hidden_dim=4096,  # Llama 2 7B hidden dimension
    num_experts=3
)

# Forward pass: router selects adapter per example
output = model(input_ids, attention_mask=attention_mask)

print(f"Expert assignments: {output.expert_indices}")
# Output might be: tensor([0, 2, 1])  Example 0 → adapter 0, Example 1 → adapter 2, etc.

How routing works:

  1. Extract a representation from the model's hidden states (e.g., a pooled hidden state, or the [CLS] token in encoder models).
  2. Pass through router network (small MLP) to get logits over adapters.
  3. Select adapter with highest logit (or use soft weighting).
  4. Apply selected adapter(s) to the example.
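
The four steps above can be sketched without loading a model. In this minimal example the hidden states are random tensors standing in for a base model's output, and the sizes and variable names are illustrative assumptions rather than values from a real deployment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, hidden_dim, num_experts = 4, 6, 32, 3

# Stand-in for the base model's last hidden state (random, for illustration)
hidden = torch.randn(batch_size, seq_len, hidden_dim)
# 1 = real token, 0 = padding; example 0 ends with two padding tokens
attention_mask = torch.ones(batch_size, seq_len)
attention_mask[0, -2:] = 0

# Step 1: pool hidden states over real tokens only
mask = attention_mask.unsqueeze(-1)
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

# Step 2: a small MLP maps the pooled state to one logit per adapter
router = nn.Sequential(nn.Linear(hidden_dim, 16), nn.ReLU(), nn.Linear(16, num_experts))
router_logits = router(pooled)                         # (batch_size, num_experts)

# Step 3: hard top-1 choice, plus soft weights if blending is preferred
expert_indices = router_logits.argmax(dim=-1)          # (batch_size,)
router_weights = torch.softmax(router_logits, dim=-1)  # rows sum to 1

# Step 4: group examples by chosen expert so each adapter runs once per batch
groups = {
    e: (expert_indices == e).nonzero(as_tuple=True)[0].tolist()
    for e in range(num_experts)
}
print(groups)
```

Grouping by expert means each adapter runs once per batch instead of once per example, which matters when switching adapters is expensive.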

Load Balancing in MoE

A naive router might assign all examples to the same adapter (especially early in training), wasting capacity. Add a load-balancing auxiliary loss:

def load_balancing_loss(router_logits, num_experts):
    """
    Encourage balanced load across experts.
    Loss increases if examples cluster on one expert.
    """
    batch_size = router_logits.shape[0]

    # Compute a soft count of examples per expert. Counting argmax picks with
    # torch.bincount is not differentiable, so the router would get no gradient.
    router_probs = torch.softmax(router_logits, dim=-1)
    expert_load = router_probs.sum(dim=0)  # (num_experts,)

    # Ideal load: uniform distribution
    ideal_load = batch_size / num_experts

    # Load balancing loss: penalize deviation from ideal
    load_loss = torch.mean((expert_load - ideal_load) ** 2)

    return load_loss

# In training loop (moe_model, hidden_state, output_logits, labels, optimizer
# and cross_entropy_loss are assumed to be defined by the surrounding code):
optimizer.zero_grad()
router_logits = moe_model.router(hidden_state)

task_loss = cross_entropy_loss(output_logits, labels)
load_loss = load_balancing_loss(router_logits, num_experts=3)

total_loss = task_loss + 0.01 * load_loss  # Weight auxiliary loss

total_loss.backward()
optimizer.step()

The load-balancing loss encourages the router to distribute examples evenly, avoiding the collapse where all examples route to one expert.
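
A commonly used alternative formulation, popularized by the Switch Transformer, multiplies the fraction of examples dispatched to each expert by the mean router probability for that expert. The sketch below is an illustration of that idea and not this guide's own loss. The function name and the toy logits are assumptions. The loss equals 1.0 under perfectly uniform routing and grows as routing concentrates on one expert.

```python
import torch

def switch_style_balance_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Compute num_experts * sum_i f_i * P_i over a batch of router logits."""
    num_examples, num_experts = router_logits.shape
    probs = torch.softmax(router_logits, dim=-1)
    # f_i: fraction of examples whose top-1 choice is expert i (no gradient)
    top1 = probs.argmax(dim=-1)
    f = torch.bincount(top1, minlength=num_experts).to(probs.dtype) / num_examples
    # P_i: mean router probability for expert i (this term carries the gradient)
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)

# Collapsed router: every example strongly prefers expert 0
collapsed = torch.tensor([[5.0, 0.0, 0.0]] * 6, requires_grad=True)
# Balanced router: preferences spread evenly over three experts
balanced = torch.tensor(
    [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]] * 2, requires_grad=True
)

collapsed_loss = switch_style_balance_loss(collapsed)
balanced_loss = switch_style_balance_loss(balanced)
collapsed_loss.backward()  # gradient reaches the logits through P_i

print(f"collapsed: {collapsed_loss.item():.3f}, balanced: {balanced_loss.item():.3f}")
```

In practice this term is scaled by a small coefficient, in the same way the 0.01 weight is applied to the auxiliary loss above.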

Inference Optimization: Speculative Execution

For latency-critical deployments, use speculative execution: run all adapters in parallel and discard non-selected outputs:

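class ParallelMoE(nn.Module):
    """Run every expert on the full batch, then keep only the routed outputs (higher memory)."""

    # Sketch only: assumes __init__ sets self.router and a list self.adapters
    # of adapter-wrapped models.

    def forward(self, input_ids, attention_mask=None):
        # Route
        router_logits = self.router(...)
        expert_indices = torch.argmax(router_logits, dim=-1)

        # Run all adapters on the whole batch. This Python loop launches the
        # forward passes one after another; overlapping them on the GPU needs
        # CUDA streams or batched adapter kernels.
        all_outputs = [adapter(input_ids, attention_mask=attention_mask) for adapter in self.adapters]
        all_logits = [out.logits for out in all_outputs]  # List of (batch, seq, vocab)

        # Stack: (num_experts, batch, seq, vocab)
        stacked_logits = torch.stack(all_logits, dim=0)

        # Select based on routing: pick expert_indices[b] for batch element b
        batch_size = input_ids.shape[0]
        batch_positions = torch.arange(batch_size, device=stacked_logits.device)
        selected_logits = stacked_logits[expert_indices, batch_positions]

        return selected_logits

This trades memory and compute (every adapter runs on every example) for an execution path that does not branch on the routing decision. It lowers latency only when the adapters' forward passes actually overlap on the GPU, so it is mainly useful on GPUs with abundant VRAM and spare compute.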
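
The selection step relies on PyTorch advanced indexing, which is easy to get wrong. A minimal sketch with fake logits, where expert e outputs the constant value e so the choice is visible, shows how pairing the expert index with the batch position picks one expert per example. All sizes here are illustrative.

```python
import torch

num_experts, batch_size, seq_len, vocab = 3, 4, 2, 5

# Fake per-expert logits: expert e fills its output with the value e
stacked_logits = torch.stack(
    [torch.full((batch_size, seq_len, vocab), float(e)) for e in range(num_experts)],
    dim=0,
)  # (num_experts, batch, seq, vocab)

expert_indices = torch.tensor([2, 0, 1, 2])  # routing decision per example

# Index tensors on dims 0 and 1 are paired elementwise:
# result[b] = stacked_logits[expert_indices[b], b]
selected = stacked_logits[expert_indices, torch.arange(batch_size)]

print(selected.shape)     # torch.Size([4, 2, 5])
print(selected[:, 0, 0])  # tensor([2., 0., 1., 2.])
```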

Practical Multi-Task Example

A customer support system serving three tasks: intent classification, sentiment analysis, and entity extraction.

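from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Train three adapters (one per task)
tasks = ["intent", "sentiment", "entity"]

for task in tasks:
    # Load base model
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Configure LoRA
    lora_config = LoraConfig(
        r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
    )

    # Inject and train
    model = get_peft_model(base, lora_config)

    # Load task-specific training data (assumed to be tokenized JSON Lines)
    task_data = load_dataset("json", data_files=f"{task}_data.jsonl", split="train")

    # Train (output_dir is a placeholder; set other hyperparameters as needed)
    training_args = TrainingArguments(output_dir=f"./{task}-checkpoints")
    trainer = Trainer(model=model, args=training_args, train_dataset=task_data)
    trainer.train()

    # Save
    model.save_pretrained(f"./{task}-adapter")

# Compose into MoE
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
moe_model = MoELoRA(
    base_model,
    adapter_paths=["./intent-adapter", "./sentiment-adapter", "./entity-adapter"],
    hidden_dim=4096,  # Llama 2 7B hidden dimension
    num_experts=3
)

# At inference, router selects which adapter based on input
# (train the router first; an untrained router routes arbitrarily)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
input_text = "I need to reset my password. This is frustrating!"
inputs = tokenizer(input_text, return_tensors="pt")

# Output goes through selected adapter
output = moe_model(inputs["input_ids"], attention_mask=inputs["attention_mask"])
expert_id = output.expert_indices[0].item()
print(f"Selected adapter: {tasks[expert_id]}")  # e.g., "entity"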

Key Takeaways

  • Sequential composition: Stack adapters for cascading tasks; output of one feeds into the next.
  • Parallel composition: Blend multiple adapters with learned weights for multi-task prediction.
  • Mixture-of-Experts routing: Use a router network to select adapters dynamically; most efficient for many tasks.
  • Load balancing loss prevents all examples from routing to the same adapter.
  • Speculative execution (run all adapters in parallel) trades memory for inference latency.

Frequently Asked Questions

How many adapters can I combine before memory becomes an issue?

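Depends on adapter size. A rank-16 adapter on a 7B model is small: targeting q_proj and v_proj on Llama 2 7B adds about 8.4M parameters (roughly 17 MB in fp16 or 34 MB in fp32), and targeting more modules increases this proportionally. You can therefore load 100+ adapters in CPU memory while keeping only one or two in VRAM for active computation. Inference remains efficient even with hundreds of adapters if you use selective routing (MoE).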
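
Adapter size can be estimated with simple arithmetic. The sketch below assumes Llama 2 7B shapes (hidden size 4096, 32 layers) and the q_proj/v_proj targets used in this guide. The helper name is hypothetical, and the result counts only LoRA weights, not file-format overhead.

```python
def lora_param_count(d_in, d_out, rank, modules_per_layer, num_layers):
    """Each adapted module adds A (rank x d_in) and B (d_out x rank)."""
    per_module = rank * (d_in + d_out)
    return per_module * modules_per_layer * num_layers

# Assumed shapes: hidden size 4096, 32 layers, q_proj and v_proj adapted
params = lora_param_count(d_in=4096, d_out=4096, rank=16, modules_per_layer=2, num_layers=32)
size_fp16_mb = params * 2 / 1e6  # 2 bytes per parameter
size_fp32_mb = params * 4 / 1e6  # 4 bytes per parameter

print(params)                  # 8388608
print(round(size_fp16_mb, 1))  # 16.8
print(round(size_fp32_mb, 1))  # 33.6
```

At these sizes, 100 adapters occupy roughly 1.7 GB in fp16, which fits comfortably in CPU memory.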

What if my router collapses to one expert?

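Add load-balancing auxiliary loss. If it still collapses, check that the router is actually receiving gradients and monitor how many examples each expert receives per batch. Hard top-1 routing is the setting most prone to collapse, so adding noise to the router logits during training, or starting with soft routing before switching to hard routing, can help keep all experts in use.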

Can I train the router jointly with adapters?

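Yes. During multi-task training, backpropagate through both the router and the selected adapter(s), so the router learns which adapter is best for each example. A hard argmax selection is not differentiable on its own, so the router needs a gradient path, for example by scaling the selected adapter's output by its router probability.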
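
One way to give the router a gradient under hard top-1 selection is to scale the chosen expert's output by its router probability. The following is a minimal sketch on random stand-in tensors. The toy router, shapes, and variable names are assumptions for illustration and do not come from a specific library.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

router = nn.Linear(8, 3)                # toy router over 3 experts
pooled = torch.randn(4, 8)              # stand-in for pooled hidden states
expert_outputs = torch.randn(3, 4, 10)  # stand-in logits: (experts, batch, vocab)
labels = torch.randint(0, 10, (4,))

probs = torch.softmax(router(pooled), dim=-1)  # (batch, experts)
top1 = probs.argmax(dim=-1)                    # hard choice, carries no gradient
batch_idx = torch.arange(pooled.shape[0])

selected = expert_outputs[top1, batch_idx]     # (batch, vocab)
gate = probs[batch_idx, top1].unsqueeze(-1)    # probability of the chosen expert
scaled = gate * selected                       # gradient reaches the router via gate

loss = nn.functional.cross_entropy(scaled, labels)
loss.backward()

print(router.weight.grad is not None)
```

Without the gate multiplication, the loss would depend on the router only through argmax, and the router's weights would receive no gradient from the task loss.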

Is sequential composition slower than single adapters?

Yes. Each stage is a separate pass, so latency is roughly time_per_adapter × num_adapters. MoE routing adds only the small router network on top of a single adapter pass, so it stays close to single-adapter speed. Parallel composition evaluates every adapter, so its cost grows with the number of adapters unless the passes are batched or overlapped.

Further Reading