Horizontal Autoscaling of LLM Workloads on Kubernetes
Horizontal autoscaling is how an LLM service on Kubernetes follows demand, from tens of requests per second to many thousands, without manual intervention. The Kubernetes Horizontal Pod Autoscaler (HPA) monitors a metric (CPU, memory, or a custom metric such as queue depth) and adds replicas automatically when demand spikes. With well-chosen metrics and scaling policies, a deployment can absorb a large traffic surge within minutes, provided the cluster has or can add node capacity, and it scales back down to save cost when demand drops. This pattern is common across cloud-scale LLM services.
How Kubernetes HPA Works
HPA operates on three inputs: the current metric value, the desired (target) metric value, and the current replica count. It calculates: new_replicas = ceil(current_replicas * (current_metric / desired_metric)), rounding up to a whole number of pods. For example:
- Desired CPU: 70% (per-pod target).
- Current CPU: 140% of the requested CPU (pods overloaded).
- Current replicas: 5.
- New replicas: 5 * (140% / 70%) = 10 pods.
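HPA re-evaluates every 15 seconds by default. To avoid oscillation, it ignores deviations within a 10% tolerance of the target and, by default, applies a 5-minute stabilization window before scaling down.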
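The replica calculation above can be sketched in a few lines of Python. This is a simplified illustration for a single metric, not the controller's actual code: the function name desired_replicas is hypothetical, and the 10% tolerance is the assumed Kubernetes default.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA formula for one metric."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Round up so the per-pod average ends at or below the target.
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    print(desired_replicas(5, 140, 70))   # the worked example: 5 pods at 140% vs. a 70% target
    print(desired_replicas(5, 73, 70))    # within tolerance, so no change
```

The real controller also clamps the result to minReplicas and maxReplicas and applies the behavior policies, which this sketch leaves out.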
| Metric Source | Use Case | Scaling Speed |
|---|---|---|
| CPU/Memory | General workloads | 15–30 seconds |
| Custom metrics (Prometheus) | LLM-specific (queue depth, latency) | 15–30 seconds |
| External metrics (KEDA) | Queue systems (RabbitMQ, Kafka, Redis) | 5–30 seconds |
CPU-Based Autoscaling (Simplest)
CPU-based HPA is the easiest to set up but less accurate for I/O-bound LLM workloads. Here's a basic deployment with HPA:
---
# Deployment: LLM worker pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-worker
spec:
  replicas: 3 # Start with 3; HPA scales up/down from here
  selector:
    matchLabels:
      app: llm-worker
  template:
    metadata:
      labels:
        app: llm-worker
    spec:
      containers:
        - name: worker
          image: myregistry.azurecr.io/llm-worker:latest
          resources:
            requests:
              cpu: "1" # Reserve 1 CPU per pod
              memory: "1Gi"
            limits:
              cpu: "2" # Max 2 CPUs per pod
              memory: "2Gi"
          env:
            - name: WORKER_POOL_SIZE
              value: "4" # 4 concurrent LLM requests per pod
---
# HPA: Scale pods based on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-worker
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # Scale up if avg CPU > 70%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30 # Quick scale-up
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
Deploy with:
kubectl apply -f llm-worker-deployment.yaml
kubectl apply -f llm-worker-hpa.yaml
# Monitor scaling.
kubectl get hpa -w
kubectl top pods -l app=llm-worker
CPU-based scaling works for CPU-bound batch processing but is a poor signal for bursty, I/O-bound workloads such as workers calling LLM APIs, where pods sit mostly idle waiting on network calls while requests pile up.
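A small experiment makes the point concrete. The sketch below uses asyncio.sleep as a stand-in for a network call to an LLM API (fake_llm_call and handle_requests are hypothetical names) and compares wall-clock time with the CPU time the process actually consumed.

```python
import asyncio
import time


async def fake_llm_call(delay_s: float) -> str:
    """Stand-in for a network call to an LLM API (hypothetical)."""
    await asyncio.sleep(delay_s)  # the worker is idle while it waits
    return "response"


async def handle_requests(concurrency: int, delay_s: float) -> tuple[float, float]:
    """Run concurrent fake calls and return (wall seconds, CPU seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    await asyncio.gather(*(fake_llm_call(delay_s) for _ in range(concurrency)))
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


if __name__ == "__main__":
    wall_s, cpu_s = asyncio.run(handle_requests(concurrency=4, delay_s=0.5))
    print(f"wall: {wall_s:.3f}s  cpu: {cpu_s:.3f}s  utilization: {cpu_s / wall_s:.1%}")
```

A worker like this can be saturated with in-flight requests while reporting CPU utilization far below a 70% target, so a CPU-based HPA would see no reason to add replicas.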
Queue-Depth Based Autoscaling with KEDA
For LLM workloads, queue depth is a better metric. KEDA (Kubernetes Event-Driven Autoscaling) scales based on queue length:
---
# Install KEDA (one-time):
# helm repo add kedacore https://kedacore.github.io/charts
# helm install keda kedacore/keda --namespace keda --create-namespace
---
# ScaledObject: scale based on Redis queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-worker-scaled
spec:
  scaleTargetRef:
    name: llm-worker # Scale this Deployment
  minReplicaCount: 3
  maxReplicaCount: 100
  triggers:
    - type: redis
      metadata:
        address: redis-service:6379 # Redis host:port
        listName: llm_requests # Key name
        listLength: "30" # Target 30 items per replica
        databaseIndex: "0"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-service:9090
        # The query must return a single value, so aggregate across series
        query: |
          sum(rate(llm_requests_total[1m]))
        threshold: "100" # Target 100 req/sec per replica
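With this setup, KEDA creates and manages an HPA for the Deployment (so do not apply a separate HPA to the same target) and scales on whichever trigger asks for more replicas. For the Redis trigger:
- Queue length 300 items, 30 target per replica: scale to 10 pods.
- Queue empties and request rate drops: scale back to 3 pods (minReplicaCount).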
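The arithmetic behind those two cases can be sketched as follows. It is a steady-state approximation of the target that an average-value queue metric produces, with a hypothetical function name; it ignores polling intervals, cooldown, and any other triggers.

```python
import math


def replicas_for_queue(queue_length: int, target_per_replica: int,
                       min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each one holds at most target_per_replica items."""
    wanted = math.ceil(queue_length / target_per_replica)
    # Clamp to the configured bounds (minReplicaCount / maxReplicaCount).
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 300, 301, 10_000):
        print(depth, replicas_for_queue(depth, 30, 3, 100))
```

Note that one extra item past a multiple of the target (301 instead of 300) already asks for one more replica, and very deep queues are capped at the maximum.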
Custom Prometheus Metrics for Advanced Scaling
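For fine-grained control, export custom metrics (request latency, queue depth, error rate) and scale on them. HPA reads Pods metrics through the custom metrics API, so the cluster also needs a metrics adapter (for example, Prometheus Adapter) that exposes them: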
---
# Deployment fragment: expose a metrics port (merge into the llm-worker Deployment)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-worker
spec:
  template:
    spec:
      containers:
        - name: worker
          image: myregistry.azurecr.io/llm-worker:latest
          ports:
            - name: metrics
              containerPort: 8000 # Prometheus metrics endpoint
---
# ServiceMonitor: tell Prometheus to scrape this app
# (selects a Service labeled app: llm-worker that exposes the "metrics" port)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-worker
spec:
  selector:
    matchLabels:
      app: llm-worker
  endpoints:
    - port: metrics
      interval: 15s
---
# HPA: scale based on custom metrics (P99 latency, queue depth)
# Both metric names must be exposed through the custom metrics API by an adapter
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-worker-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-worker
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: llm_request_latency_p99_ms
        target:
          type: AverageValue
          averageValue: "1000" # Keep P99 latency < 1 second
    - type: Pods
      pods:
        metric:
          name: llm_queue_depth_per_pod
        target:
          type: AverageValue
          averageValue: "50" # Keep 50 pending requests per pod
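In Python, export the underlying metrics from your worker. The histogram is named llm_request_latency_ms, so the P99 series the HPA references has to be derived from it (for example, with a recording rule or an adapter query):
import asyncio
import time

from prometheus_client import Counter, Gauge, Histogram

request_count = Counter('llm_requests_total', 'Total requests')
queue_depth = Gauge('llm_queue_depth_per_pod', 'Queue depth')
latency = Histogram(
    'llm_request_latency_ms',
    'Request latency',
    buckets=[10, 50, 100, 500, 1000, 5000],
)

# fetch_llm_response() and worker_queue are assumed to be defined elsewhere in the worker.

async def process_request(prompt: str) -> str:
    start = time.perf_counter()
    try:
        result = await fetch_llm_response(prompt)
        request_count.inc()  # Counts successful requests only
        return result
    finally:
        # Record latency in milliseconds, including failed requests
        latency.observe((time.perf_counter() - start) * 1000)

# Update queue depth periodically.
async def monitor_queue():
    while True:
        queue_depth.set(len(worker_queue))
        await asyncio.sleep(5)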
With custom metrics, HPA can act on a leading indicator: queue depth rises before latency degrades, so scaling on it adds capacity before latency explodes.
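A latency percentile is usually not exported directly; it is estimated from histogram buckets, which is what PromQL's histogram_quantile does. The sketch below shows the interpolation idea in plain Python. It is a simplified illustration rather than Prometheus's exact implementation, estimate_quantile is a hypothetical name, and the sample counts are made up.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies past the last finite bound; report that bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Cumulative counts for buckets 10, 50, 100, 500, 1000, 5000 ms and +inf.
    sample = [(10, 0), (50, 5), (100, 40), (500, 90),
              (1000, 98), (5000, 100), (math.inf, 100)]
    print(estimate_quantile(0.99, sample))
```

Because the estimate is interpolated within a bucket, its precision depends on bucket boundaries: with a wide 1000 to 5000 ms bucket, a P99 target of 1000 ms is only coarsely resolved.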
Cost-Aware Autoscaling with Spot Instances
Reduce costs by pairing node autoscaling with spot instances:
---
# NodePool: use cheap spot instances (interruptible, ~70% discount)
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: llm-worker-pool
spec:
  # Also set location to the cluster's zone or region
  clusterRef:
    name: my-cluster
  initialNodeCount: 0
  autoscaling:
    minNodeCount: 1
    maxNodeCount: 50
  nodeConfig:
    machineType: n2-standard-4 # 4 vCPUs, 16 GB RAM
    preemptible: true # Preemptible (interruptible) VMs, cheaper than on-demand
    labels:
      workload: llm-worker
  management:
    autoRepair: true
    autoUpgrade: true
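With spot instances at a discount of about 70%, a node that costs $200/month on demand costs about $60/month. When a spot node is reclaimed, its pods are evicted and rescheduled onto other nodes; requests in flight on those pods are lost unless the client or queue retries them.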
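The cost arithmetic is easy to check, including the effect of keeping some on-demand nodes for stability. The sketch below uses the illustrative figures from this section ($200/month on demand, 70% discount); the 90/10 split and the function name are assumptions, and real prices vary by provider, region, and machine type.

```python
def blended_node_cost(on_demand_cost: float, spot_discount: float,
                      spot_fraction: float) -> float:
    """Average monthly cost per node for a mix of spot and on-demand nodes."""
    spot_cost = on_demand_cost * (1 - spot_discount)
    return spot_fraction * spot_cost + (1 - spot_fraction) * on_demand_cost


if __name__ == "__main__":
    all_spot = blended_node_cost(200, 0.70, 1.0)
    mixed = blended_node_cost(200, 0.70, 0.9)
    print(f"all spot: ${all_spot:.0f}/node, savings {1 - all_spot / 200:.0%}")
    print(f"90/10 mix: ${mixed:.0f}/node, savings {1 - mixed / 200:.0%}")
```

Under these assumptions, moving from an all-spot pool to a 90/10 mix gives up only a few points of savings in exchange for a baseline of nodes that cannot be reclaimed.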
Vertical Autoscaling for Right-Sizing
VPA (Vertical Pod Autoscaler) adjusts CPU/memory requests based on actual usage, preventing over-provisioning:
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-worker-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: llm-worker
  updatePolicy:
    updateMode: "Auto" # Automatically update requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: "500m"
          memory: "512Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
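VPA observes actual usage over time (allow 2–3 days of representative traffic) and recommends new requests. In Auto mode it evicts pods and recreates them with the updated requests. Do not let VPA and HPA act on the same CPU or memory metric for one workload; if VPA manages CPU and memory, drive HPA from custom or external metrics such as queue depth.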
Monitoring and Alerts for Scaling
Monitor scaling events and alert on failures:
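---
# Alerts on HPA health (metric names from kube-state-metrics v2+;
# v1.x used kube_hpa_* with an hpa label)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-hpa-alerts
spec:
  groups:
    - name: autoscaling
      interval: 30s
      rules:
        # Fires when current replicas have sat at the configured maximum
        - alert: HPAMaxedOut
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="llm-worker-hpa"}
              == kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="llm-worker-hpa"}
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "HPA {{ $labels.horizontalpodautoscaler }} at max replicas ({{ $value }}). Traffic spike unhandled."
        # Fires when the HPA reports that it cannot scale its target
        - alert: HPAScalingErrors
          expr: kube_horizontalpodautoscaler_status_condition{horizontalpodautoscaler="llm-worker-hpa", condition="AbleToScale", status="false"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "HPA {{ $labels.horizontalpodautoscaler }} failing to scale. Check HPA events and node capacity."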
Key Takeaways
- HPA scales automatically based on metrics: CPU-based for compute; queue depth for I/O.
- KEDA enables external metrics (Redis, Kafka, RabbitMQ): Ideal for LLM queue-driven workloads.
- Custom Prometheus metrics enable earlier scaling: queue depth is a leading indicator, so capacity arrives before latency spikes, not after.
- Spot instances + node autoscaling can reduce node costs by roughly 60–70%: pods on reclaimed nodes are rescheduled automatically.
- VPA right-sizes resource requests: Prevent over-provisioning and wasted spending.
Frequently Asked Questions
Should I use HPA with CPU or queue depth?
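Queue depth is more accurate for I/O-bound LLM work. CPU-based HPA works for compute-bound tasks (batch embeddings). You can use both: HPA evaluates every configured metric and takes the largest resulting replica count, so CPU acts as a safety net while queue depth is the primary scaling signal.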
How long does HPA take to scale up?
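Typically 15–30 seconds from metric spike to new pods being created. Add 30–60 seconds for image pull and startup, for a total of roughly 1–2 minutes, and longer if new nodes must be provisioned first. KEDA with external metrics can shorten detection (5–30 seconds, depending on its polling interval).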
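Scale-up policies also cap how fast replicas can grow. As a rough sketch, assume a Percent policy of 100% per 15 seconds (the values used in the CPU-based HPA earlier in this article) and ignore the stabilization window, metric lag, and pod startup time. The function name is hypothetical.

```python
import math


def seconds_to_reach(start: int, target: int, max_replicas: int,
                     percent: int = 100, period_s: int = 15) -> int:
    """Lower bound on time to grow from start to target replicas."""
    if start < 1 or percent <= 0:
        raise ValueError("start must be >= 1 and percent must be > 0")
    goal = min(target, max_replicas)
    replicas, elapsed = start, 0
    while replicas < goal:
        # Each period may add at most `percent` of the current replica count.
        allowed = replicas + math.ceil(replicas * percent / 100)
        replicas = min(goal, allowed)
        elapsed += period_s
    return elapsed


if __name__ == "__main__":
    print(seconds_to_reach(start=3, target=50, max_replicas=50))
```

Starting from 3 replicas, doubling each period gives 6, 12, 24, 48, and then 50, so the policy alone accounts for five periods before any pod startup time is added.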
What if nodes are full and HPA can't scale?
The Cluster Autoscaler should add nodes automatically when pods are stuck in Pending. If it doesn't, check: (1) max node count limit reached, (2) insufficient cloud quota, (3) scheduling constraints (mismatched taints and tolerations).
Can I use HPA with spot instances safely?
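Yes. When a spot node is reclaimed, its pods are evicted and the Deployment controller (not HPA) recreates them on the remaining nodes. But if you're already at max nodes when spot capacity is reclaimed, requests queue up. Use a mix of spot and on-demand nodes (for example, 90% spot and 10% on-demand) for stability.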
What's the cost of autoscaling infrastructure (HPA, KEDA, monitoring)?
Negligible: HPA is built in; KEDA adds roughly 200 MB of memory; Prometheus monitoring adds roughly 500 MB, growing with the number of series it stores. Focus on pod costs, not infrastructure.
Further Reading
- Kubernetes HPA Documentation — official guide to scaling policies.
- KEDA Scalers — list of supported queue types (Redis, RabbitMQ, Kafka, AWS SQS).
- Vertical Pod Autoscaler — automatic resource optimization.
- Spot Instance Best Practices — cost-effective scaling.