
Serving Fine-Tuned Models in Production: Deployment Guide

Deploying a fine-tuned LoRA model to production requires choosing the right inference framework, optimizing for latency and throughput, and handling multi-user concurrent requests. Modern frameworks like vLLM and TGI (Text Generation Inference) natively support LoRA adapters, enabling dynamic adapter loading without reloading the base model. This guide covers deployment options, containerization with Docker, horizontal scaling, monitoring, and cost optimization strategies for production systems at scale in 2026.

Deployment Options Comparison

| Framework | LoRA Support | Latency | Throughput | Ease | Cost |
|---|---|---|---|---|---|
| vLLM | Native (v0.3+) | Sub-100ms | Very high (batched) | Medium | Low (efficient scheduling) |
| TGI (HuggingFace) | Built-in | 50–200ms | High (continuous batching) | Easy (Docker) | Low |
| Ollama | Yes (local) | 200–500ms | Low (single-threaded) | Very easy | Low (CPU) |
| TorchServe | Manual | 100–500ms | Medium | Medium | Medium (depends on hardware) |
| FastAPI + Transformers | Manual | 500–2000ms | Low | Easy | High (no optimization) |
| AWS SageMaker | Manual setup | 100–500ms | Medium–High | Hard | High (managed service) |
| Replicate | Manual packaging | 1–10s | Low (cold start) | Easy (API) | Medium |

Recommendation for 2026:

  • Development & testing: Ollama (local, GPU-optional).
  • Small-to-medium production: TGI (Docker, easy, efficient).
  • Large-scale production: vLLM (highest throughput, native LoRA).
  • Edge deployment: ONNX Runtime or Ollama.

Production Setup: vLLM

vLLM is widely used for high-throughput LLM serving and supports LoRA adapters natively (v0.3+):

# Install vLLM (quote the requirement so the shell does not treat ">" as a redirect)
pip install "vllm>=0.3.0"

# Install PEFT for LoRA
pip install peft

Serving a LoRA Model with vLLM

# serve.py
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Initialize the base model once; enable_lora=True is required to use lora_request
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    tensor_parallel_size=1,  # Use 1 GPU; scale up for multi-GPU
    dtype="float16",  # Memory efficiency
    gpu_memory_utilization=0.9,  # Use 90% of GPU VRAM
)

# Prepare LoRA adapters
# Assumes adapters are at ./adapters/customer-support and ./adapters/code-generation
# Positional arguments: adapter name, unique integer id, path on disk
adapter_requests = {
    "customer-support": LoRARequest("customer-support", 1, "./adapters/customer-support"),
    "code-generation": LoRARequest("code-generation", 2, "./adapters/code-generation"),
}

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

def generate(prompt, adapter_name="customer-support"):
    """Generate text using a specific adapter."""
    lora_request = adapter_requests[adapter_name]

    # Generate with LoRA
    outputs = llm.generate(
        [prompt],
        sampling_params=sampling_params,
        lora_request=lora_request,
    )

    return outputs[0].outputs[0].text

# Test
prompt = "Instruction: Help the customer reset their password."
result = generate(prompt, adapter_name="customer-support")
print(result)

REST API with vLLM

Wrap vLLM in a web server (FastAPI):

# api.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

app = FastAPI()

# Initialized once at startup
llm = None
adapters = {}

@app.on_event("startup")
def startup():
    global llm, adapters

    # Load base model; enable_lora=True is required to pass lora_request later
    llm = LLM(
        model="meta-llama/Llama-2-7b-hf",
        enable_lora=True,
        tensor_parallel_size=1,
        dtype="float16",
        gpu_memory_utilization=0.9,
    )

    # Register adapters: LoRARequest(name, unique integer id, path on disk)
    adapters = {
        "support": LoRARequest("support", 1, "./adapters/customer-support"),
        "code": LoRARequest("code", 2, "./adapters/code-generation"),
    }

class GenerateRequest(BaseModel):
    prompt: str
    adapter: str = "support"
    max_tokens: int = 512
    temperature: float = 0.7

@app.post("/generate")
async def generate(request: GenerateRequest):
    """Generate text using a specific adapter."""
    # Note: llm.generate is a blocking call, so this handler serves one request at a time.

    if request.adapter not in adapters:
        return JSONResponse(
            {"error": f"Adapter '{request.adapter}' not found"},
            status_code=400,
        )

    sampling_params = SamplingParams(
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )

    try:
        outputs = llm.generate(
            [request.prompt],
            sampling_params=sampling_params,
            lora_request=adapters[request.adapter],
        )

        return {
            "prompt": request.prompt,
            "generated_text": outputs[0].outputs[0].text,
            "adapter_used": request.adapter,
        }

    except Exception as e:
        return JSONResponse(
            {"error": str(e)},
            status_code=500,
        )

# Run: uvicorn api:app --host 0.0.0.0 --port 8000

Deploy with Docker:

# Dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and pip (Ubuntu 22.04's python3 is 3.10, which is what python3-pip targets)
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python packages first so this layer stays cached when the code changes
# (vllm pulls in a compatible torch build)
RUN pip install vllm peft fastapi uvicorn

# Copy app and adapters
COPY api.py .
COPY adapters ./adapters

# Expose port
EXPOSE 8000

# Run API
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run:

docker build -t lora-api:latest .

docker run --gpus all \
  -p 8000:8000 \
  -v "$(pwd)/adapters:/app/adapters" \
  lora-api:latest
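
Once the container is running, a quick smoke test confirms that the API answers and routes to the requested adapter. The following is a minimal sketch using only the standard library. It assumes the /generate endpoint and request fields from api.py above and the 8000 port mapping from the docker run command; build_payload and call_generate are illustrative helper names, not part of any library.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/generate"  # assumption: port mapping from docker run


def build_payload(prompt, adapter="support", max_tokens=64, temperature=0.7):
    """Build the JSON body expected by the GenerateRequest model in api.py."""
    return {
        "prompt": prompt,
        "adapter": adapter,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def call_generate(payload, url=API_URL, timeout=60):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling call_generate(build_payload("Instruction: Help the customer reset their password.")) should return a dictionary whose adapter_used field is "support" if the service is healthy.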

Production Setup: Text Generation Inference (TGI)

TGI (Hugging Face's inference server) is simpler than vLLM but still highly optimized:

# Run TGI via Docker (recommended)
docker run --gpus all \
  -p 8080:80 \
  -v "$(pwd)/adapters:/models" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-hf \
  --dtype float16 \
  --max-batch-size 32 \
  --max-total-tokens 4096

Client code:

# client.py
import requests

HF_API_URL = "http://localhost:8080"

def generate_with_tgi(prompt, adapter_name=None, max_tokens=512):
    """Query a TGI server; adapter_name is only sent when given."""
    parameters = {
        "max_new_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.95,
    }
    if adapter_name is not None:
        # Assumption: the server runs a TGI version with LoRA support and was
        # started with this adapter loaded; it is then selected via "adapter_id".
        parameters["adapter_id"] = adapter_name

    response = requests.post(
        f"{HF_API_URL}/generate",
        json={"inputs": prompt, "parameters": parameters},
        timeout=60,
    )
    response.raise_for_status()

    # /generate returns a single JSON object, not a list
    return response.json()["generated_text"]

# Test
prompt = "Instruction: Reset password for user."
text = generate_with_tgi(prompt)
print(text)

Edge Deployment: Ollama

For local or edge deployments (CPU or small GPU):

# Install Ollama from https://ollama.ai

# Pull and run a model
ollama pull llama2:7b-chat

# Start Ollama server
ollama serve

# In another terminal, query
curl http://localhost:11434/api/generate -d '{
"model": "llama2:7b-chat",
"prompt": "Why is the sky blue?",
"stream": false
}'

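Ollama + LoRA: Ollama attaches an adapter through the ADAPTER instruction in a Modelfile, which ollama create builds into a named model. The client below queries the base model; substitute the name of the model you built: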

# client.py using Ollama
import requests

def generate_with_ollama(prompt):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama2:7b-chat",
            "prompt": prompt,
            "stream": False,
        },
    )
    return response.json()["response"]

result = generate_with_ollama("Classify: I lost my password.")
print(result)

Monitoring and Metrics

Track production performance:

# monitoring.py
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics
request_count = Counter("lora_requests_total", "Total requests", ["adapter"])
error_count = Counter("lora_request_errors_total", "Failed requests", ["adapter"])
request_latency = Histogram("lora_request_latency_seconds", "Request latency", ["adapter"])
token_count = Counter("lora_tokens_generated_total", "Tokens generated", ["adapter"])

def monitored_generate(prompt, adapter_name, llm, adapters, sampling_params=None):
    """Generate with monitoring; adapters maps adapter names to LoRARequest objects."""
    start = time.perf_counter()

    try:
        outputs = llm.generate(
            [prompt],
            sampling_params=sampling_params,
            lora_request=adapters[adapter_name],
        )
    except Exception:
        # Count failures per adapter instead of under a fake "error" adapter label
        error_count.labels(adapter=adapter_name).inc()
        raise

    latency = time.perf_counter() - start

    # Record metrics
    request_count.labels(adapter=adapter_name).inc()
    request_latency.labels(adapter=adapter_name).observe(latency)
    token_count.labels(adapter=adapter_name).inc(len(outputs[0].outputs[0].token_ids))

    return outputs[0].outputs[0].text

# Start Prometheus metrics server
start_http_server(8001)

Scrape with Prometheus:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'lora-api'
    static_configs:
      - targets: ['localhost:8001']
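
For quick load tests it can be handy to summarize latency without a full Prometheus setup. The sketch below assumes you have collected per-request latencies (in seconds) and generated-token counts yourself, and that the requests ran one at a time, so total time is the sum of the latencies. The summarize helper is illustrative.

```python
import statistics


def summarize(latencies_s, tokens_generated):
    """Return p50/p95 latency and tokens per second for sequential requests."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    total_time = sum(latencies_s)
    return {
        "p50_s": cuts[49],
        "p95_s": cuts[94],
        "tokens_per_s": sum(tokens_generated) / total_time,
    }
```

Tail percentiles such as p95 are usually more informative than the mean, because a few slow requests dominate the user experience.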

Cost Optimization

Reduce inference costs:

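  1. Batch requests: Accumulate and batch queries for higher throughput.
  2. Quantization: Serve the base model with 8-bit or 4-bit weights (for example AWQ or GPTQ). Note that QLoRA is a fine-tuning method, not an inference mode.
  3. Caching: Cache responses to frequent queries and reuse the KV cache for shared prompt prefixes (prefix caching).
  4. Adaptive batching: Adjust batch size based on latency SLA.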

# Batch inference for cost-efficiency
# Assumes llm and adapter_requests are defined as in serve.py above
from vllm import SamplingParams

prompts = [
    "Classify: I need to reset my password.",
    "Classify: How do I update my profile?",
    "Classify: I lost access to my account.",
]

# Generate all at once (batched)
sampling_params = SamplingParams(temperature=0.7, max_tokens=20)
outputs = llm.generate(
    prompts,
    sampling_params=sampling_params,
    lora_request=adapter_requests["customer-support"],
)

results = [out.outputs[0].text for out in outputs]
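
The caching item in the list above can be sketched as a small response cache placed in front of the model. This is only appropriate for deterministic decoding (for example temperature 0), since sampled outputs differ between calls. ResponseCache and generate_fn are illustrative names; in practice generate_fn would wrap llm.generate.

```python
from collections import OrderedDict


class ResponseCache:
    """LRU cache of generated text keyed by (adapter, prompt)."""

    def __init__(self, generate_fn, max_entries=1024):
        self._generate_fn = generate_fn  # callable(prompt, adapter) -> str
        self._max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def generate(self, prompt, adapter):
        key = (adapter, prompt)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        text = self._generate_fn(prompt, adapter)
        self._entries[key] = text
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return text
```

Including the adapter name in the key matters: the same prompt can yield different text under different adapters.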

Scaling: Horizontal & Vertical

Vertical scaling (single machine):

  • Use tensor parallelism: tensor_parallel_size=4 on a 4-GPU node.
  • Increase batch size to saturate GPUs.
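
Larger batches raise throughput but also latency, so batch size is usually tuned against a latency target. Below is a minimal sketch of one possible policy (additive increase, multiplicative decrease), assuming you measure batch latency yourself. The function next_batch_size and its parameters are illustrative and are not a vLLM or TGI setting.

```python
def next_batch_size(current, observed_latency_s, sla_s, min_size=1, max_size=64):
    """Additive-increase / multiplicative-decrease batch sizing.

    Grow by one while latency stays within the SLA; halve when it is exceeded.
    """
    if observed_latency_s > sla_s:
        proposed = current // 2
    else:
        proposed = current + 1
    return max(min_size, min(max_size, proposed))
```

vLLM and TGI schedule batches internally, so a controller like this applies to how many requests a client or gateway submits at once rather than to the engine itself.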

Horizontal scaling (multiple machines):

  • Deploy multiple vLLM or TGI instances.
  • Use a load balancer (NGINX, AWS ALB) to distribute requests.
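
# nginx.conf (these blocks belong inside the http context)
# Ports are examples; 8001 is also the metrics port used in monitoring.py.
upstream lora_backend {
    server localhost:8000;
    server localhost:8001;
    server localhost:8002;
}

server {
    listen 80;
    location /generate {
        proxy_pass http://lora_backend;
    }
}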

Key Takeaways

  • vLLM is the strongest option for large-scale LoRA serving: native adapter support and the highest throughput in the comparison above.
  • TGI (Text Generation Inference) is simpler and excellent for moderate scale; Docker-based deployment.
  • Ollama is ideal for local/edge deployments; minimal setup.
  • Monitor latency, throughput, and token costs; use batching and caching to optimize.
  • Scale horizontally by deploying multiple instances behind a load balancer.

Frequently Asked Questions

How do I add a new adapter to production without restarting?

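With vLLM, adapter weights are loaded on demand the first time a request references them, so the base model never has to be reloaded. Copy the adapter directory to disk and pass a LoRARequest with a new, unique lora_int_id. In the api.py example above the adapters dictionary is built at startup, so the service also needs a way to register the new entry at runtime.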
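
One way to track adapters at runtime is a small registry that hands out unique integer ids. This is a sketch under stated assumptions rather than a vLLM API: AdapterRegistry is a hypothetical helper, and the service would turn each entry into a LoRARequest when handling a request.

```python
import threading


class AdapterRegistry:
    """Maps adapter names to (unique integer id, path on disk)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}
        self._next_id = 1

    def register(self, name, path):
        """Add an adapter, or return the existing entry if the name is already known."""
        with self._lock:
            if name not in self._entries:
                self._entries[name] = (self._next_id, path)
                self._next_id += 1
            return self._entries[name]

    def get(self, name):
        """Return (adapter_id, path); raises KeyError for unknown names."""
        with self._lock:
            return self._entries[name]
```

A registration endpoint could call register(name, path) after the adapter files are in place, and the generate handler could then look up the id and path with get(name).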

What's the typical latency for LoRA inference?

  • vLLM (batched): 50–100ms time-to-first-token, 10–50ms per additional token.
  • TGI: 100–200ms first-token, 15–50ms per token.
  • TorchServe: 200–500ms first-token.
  • Single example (non-batched): 2–5s first-token.

These figures are indicative. Actual latency depends on model size, prompt length, adapter rank, batch size, and hardware.
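
To turn the per-token figures above into an end-to-end estimate, add the first-token time to the per-token time for every additional token. A minimal sketch of that arithmetic; estimate_latency_s is an illustrative helper and the inputs are the indicative ranges quoted above, not measurements.

```python
def estimate_latency_s(first_token_ms, per_token_ms, output_tokens):
    """Approximate end-to-end latency: first token plus each additional token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return (first_token_ms + per_token_ms * (output_tokens - 1)) / 1000.0


# vLLM range from the list above, for a 512-token completion
best_case = estimate_latency_s(50, 10, 512)    # 50 ms + 511 * 10 ms
worst_case = estimate_latency_s(100, 50, 512)  # 100 ms + 511 * 50 ms
```

With these inputs the estimate spans roughly 5 to 26 seconds, which shows that per-token time, not time-to-first-token, dominates long completions.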

Can I serve both merged and separate adapters?

Yes. Merged models are standard Hugging Face checkpoints (deploy normally). Separate adapters require vLLM or TGI. You can run both on different endpoints if needed.

How do I handle cold starts?

vLLM and TGI load the model at startup, so a running instance has no per-request cold start. On serverless platforms that scale to zero, a cold start can take 20–60 seconds while the container starts and the weights load. Pre-warm instances or use dedicated containers for consistent latency.
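
Pre-warming usually means holding traffic until the instance answers a test request. Here is a minimal sketch of such a readiness loop. wait_until_ready is an illustrative helper, and probe is any zero-argument callable you supply, for example one that sends a tiny generation request and returns True on success.

```python
import time


def wait_until_ready(probe, timeout_s=120.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True or timeout_s elapses.

    Exceptions from probe (such as refused connections) count as "not ready yet".
    Returns True if the probe succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # server still starting; try again
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The sleep and clock parameters exist only so the loop can be tested without real waiting; in normal use the defaults are fine.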

What's the maximum number of adapters I can serve simultaneously?

It depends on adapter size and available GPU memory. For rank-16 adapters (~50 MB each), you can keep 100+ in CPU RAM while only 1–2 sit in VRAM for active computation. vLLM swaps adapters between CPU and GPU as requests arrive; its max_loras and max_cpu_loras engine arguments bound how many are held in each.
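
The adapter size can be checked with simple arithmetic: each adapted weight matrix of shape d_in × d_out adds rank × (d_in + d_out) parameters. The sketch below uses the Llama-2-7B dimensions (hidden size 4096, MLP size 11008, 32 layers) and assumes fp16 storage; which projections are adapted is an assumption that depends on how the adapter was trained.

```python
def lora_param_count(rank, shapes, num_layers):
    """Parameters in the LoRA A and B matrices: rank * (d_in + d_out) per adapted weight."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Llama-2-7B dimensions
HIDDEN, MLP, LAYERS = 4096, 11008, 32
attention = [(HIDDEN, HIDDEN)] * 4                   # q, k, v, o projections
mlp = [(HIDDEN, MLP), (HIDDEN, MLP), (MLP, HIDDEN)]  # gate, up, down projections

attn_only = lora_param_count(16, attention, LAYERS)
all_linear = lora_param_count(16, attention + mlp, LAYERS)

BYTES_FP16 = 2
print(f"attention only: {attn_only * BYTES_FP16 / 1e6:.1f} MB")
print(f"all linear layers: {all_linear * BYTES_FP16 / 1e6:.1f} MB")
```

At rank 16 this gives about 34 MB when only the attention projections are adapted and about 80 MB when the MLP projections are included, so the ~50 MB figure sits between the two.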

Further Reading