LLM Inference on Edge & Constrained Hardware
Deploying LLMs on edge devices—smartphones, IoT gateways, embedded systems, and personal computers without GPUs—requires a different optimization approach than data-center serving. Edge inference trades latency for reduced resource consumption: a 7B model on a mobile phone generates 5-10 tokens per second (vs. 100+ on a GPU) but enables offline, private inference with sub-second TTFT for short queries. This article covers quantization strategies, efficient runtimes, and architectural patterns for edge LLM deployment.
Edge Inference Constraints and Tradeoffs
Edge devices are tightly constrained: 4-16 GB of RAM on phones and laptops, memory bandwidth in the tens of GB/s (versus 1 TB/s or more on data-center GPUs), and no guaranteed internet connection. The fundamental constraint is that model weights must fit in RAM, alongside the operating system, the app, and the KV cache. A 7B-parameter model in FP16 weighs 14 GB, while most flagship phones have 8-12 GB in total, so edge inference requires aggressive quantization.
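The 14 GB figure is simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below reproduces it and shows what INT8 and INT4 would give for the same model. The helper name is illustrative, and the result ignores quantization metadata and runtime buffers, so real files are somewhat larger.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B @ {label}: {weight_footprint_gb(7e9, bits):.1f} GB")
```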
| Device | RAM | Storage | Target Model Size | Precision |
|---|---|---|---|---|
| iPhone 15 Pro | 8 GB | 256 GB | < 5 GB | INT4 (~4x smaller than FP16) |
| Android flagship | 12 GB | 256 GB | < 8 GB | INT4 or INT8 |
| Raspberry Pi 5 | 4-8 GB | 32 GB microSD | < 3 GB | INT4 (1B-3B model) |
| Laptop (no GPU) | 16 GB | 512 GB | < 12 GB | INT8 or FP16 |
| Embedded gateway | 2-4 GB | 4-32 GB eMMC | < 2 GB | INT4 (< 1B model) |
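The target sizes in the table have to leave room for more than the weights. As a rough sketch, the KV cache grows with context length, and the OS and app need their own share of RAM. The layer and head counts below are the published Llama 2 7B values; the 2 GB reservation is purely an assumption for illustration and should be measured on the target device.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_ctx: int, bytes_per_elem: int = 2) -> float:
    """Size of the key and value caches for one sequence, in decimal GB."""
    # Factor 2: one tensor for keys and one for values per layer
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9


def fits_in_ram(weights_gb: float, kv_gb: float,
                device_ram_gb: float, reserved_gb: float) -> bool:
    """True if weights plus KV cache fit in what the OS and app leave free."""
    return weights_gb + kv_gb <= device_ram_gb - reserved_gb


# Llama 2 7B: 32 layers, 32 KV heads, head dimension 128, FP16 cache, 2K context
kv = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, n_ctx=2048)
print(f"KV cache at 2K context: {kv:.2f} GB")

# reserved_gb=2.0 is an assumed allowance for the OS and the app itself
print("4.5 GB weights on 8 GB device:", fits_in_ram(4.5, kv, 8.0, reserved_gb=2.0))
print("14 GB weights on 8 GB device: ", fits_in_ram(14.0, kv, 8.0, reserved_gb=2.0))
```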
The speed tradeoff: edge inference is roughly 10-30x slower than data-center GPU inference. A 7B model generates 5-10 tokens/sec on a phone or laptop CPU vs. 100-150 tokens/sec on an A100 GPU. This is acceptable for streamed responses (the user reads tokens as they arrive) but a poor fit for very long-form generation.
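Streaming is what makes 5-10 tokens/sec tolerable, so it is worth seeing what consuming a stream looks like. The sketch below assumes llama-cpp-python, where calling the model with `stream=True` yields chunks shaped like `{"choices": [{"text": ...}]}`. The `print_stream` helper and the stand-in chunks are illustrative so the example runs without a model file.

```python
import sys
from typing import Iterable, Mapping


def print_stream(chunks: Iterable[Mapping]) -> str:
    """Print each text piece as soon as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        piece = chunk["choices"][0]["text"]
        sys.stdout.write(piece)
        sys.stdout.flush()  # Show the token immediately instead of buffering
        parts.append(piece)
    sys.stdout.write("\n")
    return "".join(parts)


# With llama-cpp-python this would be driven by something like:
#   print_stream(llm("What is machine learning?", max_tokens=256, stream=True))
# Stand-in chunks so the sketch runs on its own:
fake_chunks = [{"choices": [{"text": t}]} for t in ["Machine", " learning", " is", " ..."]]
text = print_stream(fake_chunks)
```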
Quantization Strategies for Edge
INT4 Quantization with GGUF Format
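GGUF (commonly expanded as GPT-Generated Unified Format) is the single-file model format used by llama.cpp and the de facto standard for edge inference. A GGUF file bundles weights, tokenizer, and metadata; edge builds are typically quantized to 4-8 bits, which keeps them compact and well suited to CPU inference.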
from llama_cpp import Llama

# Load a GGUF-quantized model (roughly 4-4.5 GB for a 7B model at Q4_K_M)
llm = Llama(
    model_path="./models/llama-2-7b-q4_k_m.gguf",  # 4-bit, ~4.5 GB
    n_gpu_layers=-1,  # Offload all layers if built with GPU support; use 0 for CPU-only
    n_threads=8,      # Use 8 CPU threads
    n_batch=512,      # Process up to 512 prompt tokens per batch
    n_ctx=2048,       # 2K context window
    verbose=False,    # Reduce logging
)

# The call interface is the same whether layers run on CPU or GPU
response = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    echo=False,  # Do not repeat the prompt in the output
)
print(response["choices"][0]["text"])
GGUF quantization variants (largest and slowest first; sizes are approximate for a 7B model):
- Q8_0: 8-bit, nearly indistinguishable from FP16, ~7 GB for 7B (highest quality, slowest).
- Q6_K: 6-bit, very small quality loss, ~5.5 GB for 7B (good balance on desktops).
- Q5_K_M: 5-bit, small quality loss, ~5 GB for 7B (a good choice when RAM allows).
- Q4_K_M: 4-bit, perplexity increase of roughly 1% or less, ~4-4.5 GB for 7B (the usual size/quality sweet spot).
- Q3_K: 3-bit, noticeable quality loss (a few percent in perplexity), ~3-3.5 GB for 7B (extremely compact).
For mobile, Q4_K_M is the standard choice. For IoT-class devices, use Q3_K or lower, usually with a smaller model. For desktops, use Q5_K_M or Q6_K if you want negligible quality loss.
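The selection rule above amounts to "take the highest-precision variant that fits the memory budget". A minimal sketch of that rule follows. The bits-per-weight values are approximate figures that include per-block scales and are assumptions for illustration, not numbers from this article, and `pick_variant` is a hypothetical helper.

```python
from typing import Optional

# Approximate effective bits per weight, highest precision first (assumed values)
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
}


def pick_variant(n_params: float, budget_gb: float) -> Optional[str]:
    """Return the highest-precision variant whose weights fit in budget_gb."""
    # Dicts keep insertion order, so this walks from highest to lowest precision
    for name, bits in APPROX_BITS_PER_WEIGHT.items():
        size_gb = n_params * bits / 8 / 1e9
        if size_gb <= budget_gb:
            return name
    return None  # Nothing fits: choose a smaller model instead


print(pick_variant(7e9, budget_gb=8.0))
print(pick_variant(7e9, budget_gb=4.5))
print(pick_variant(7e9, budget_gb=3.0))
```

A `None` result is the signal to drop to a smaller model rather than to a lower bit width.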
Creating GGUF Models
To convert and quantize an existing model to GGUF (a one-time process that typically takes 10-30 minutes):
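# Clone llama.cpp (reference implementation)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (recent versions use CMake; older checkouts used `make`)
cmake -B build
cmake --build build --config Release

# Install the Python dependencies for the conversion script
pip install -r requirements.txt

# Convert a Hugging Face checkpoint to GGUF (older versions: convert.py)
python3 convert_hf_to_gguf.py /path/to/llama-2-7b --outfile model.gguf

# Quantize to Q4_K_M (older versions: ./quantize)
./build/bin/llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M

# Result: ~4.5 GB file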
Once quantized, distribute the single .gguf file to edge devices (no further conversion needed).
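Before shipping a file to devices, a cheap sanity check is to read its header. The sketch below assumes the GGUF v2+ header layout (4-byte magic `GGUF`, a little-endian 32-bit version, then 64-bit tensor and metadata counts). The helper name is illustrative, and the demo uses a synthetic header rather than a real model.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path) -> dict:
    """Return version and counts from a GGUF header, or raise ValueError."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count
    version, n_tensors, n_metadata = struct.unpack("<IQQ", header[4:])
    return {"version": version, "tensors": n_tensors, "metadata_entries": n_metadata}


# Demo on a synthetic header so the sketch runs without a real model file
demo_path = Path(tempfile.mkdtemp()) / "demo.gguf"
demo_path.write_bytes(struct.pack("<4sIQQ", GGUF_MAGIC, 3, 291, 19))
print(read_gguf_header(demo_path))
```

This only catches truncated or mislabeled files; it does not validate the tensors themselves.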
Edge Runtime Options
Option 1: llama.cpp (CPU + optional GPU)
llama.cpp is the most popular open-source edge LLM runtime. It supports:
- CPU inference on any OS (macOS, Linux, Windows, mobile).
- GPU offloading (CUDA, Metal, ROCm if available).
- Quantized models (GGUF format).
Usage:
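# The binary is llama-cli in current releases (under build/bin/ with CMake); older builds call it ./main
llama-cli -m model-q4_k_m.gguf -n 256 -p "What is AI?"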
Option 2: Ollama (Simplified Wrapper)
Ollama provides a simpler interface over llama.cpp:
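import requests

# Download the model first: ollama pull llama2
# Start the server (in the background): ollama serve
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "What is AI?",
        "stream": False,  # Return one JSON object instead of a token stream
    },
    timeout=120,  # Local generation can take tens of seconds on CPU
)
response.raise_for_status()
print(response.json()["response"])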
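On slow hardware it is usually better to stream. With `"stream": True`, Ollama's generate endpoint returns newline-delimited JSON objects, each carrying a `response` fragment and a `done` flag. The parser below is a sketch under that assumption. `iter_ollama_tokens` is a hypothetical helper, and the sample lines stand in for a live server.

```python
import json
from typing import Iterable, Iterator


def iter_ollama_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON response body."""
    for raw in lines:
        if not raw:
            continue  # Skip keep-alive blank lines
        event = json.loads(raw)
        fragment = event.get("response", "")
        if fragment:
            yield fragment
        if event.get("done"):
            break


# Against a live server this would be fed by something like:
#   resp = requests.post(url, json={..., "stream": True}, stream=True)
#   for token in iter_ollama_tokens(resp.iter_lines()): print(token, end="", flush=True)
sample_lines = [
    b'{"response": "AI", "done": false}',
    b'{"response": " is", "done": false}',
    b'{"response": "", "done": true}',
]
print("".join(iter_ollama_tokens(sample_lines)))
```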
Option 3: Mobile-Specific Runtimes
- Core ML (iOS): Apple's optimized runtime for on-device ML. Supports quantized models and hardware acceleration (the Neural Engine on A-series and M-series chips).
- ONNX Runtime Mobile (iOS/Android): Open standard for model inference, supports quantization and various hardware accelerators.
- TensorFlow Lite (Android, iOS; recently renamed LiteRT): TensorFlow's optimized runtime for mobile, very mature and widely used.
Performance: CPU vs. GPU Edge Inference
Approximate latencies on common devices (actual numbers vary with runtime version, thread count, and thermal throttling):
| Device | Model (precision) | TTFT | TPS | Decode Time (256 tokens) |
|---|---|---|---|---|
| iPhone 15 (A17 Pro) | Llama 2 7B (Q4_K_M) | 800ms | 7 | 37s |
| MacBook Pro M3 | Llama 2 7B (Q4_K_M) | 400ms | 12 | 21s |
| Raspberry Pi 5 | 3B-class model (Q4) | 1500ms | 3 | 85s |
| Laptop (Intel i9, CPU) | Llama 2 7B (Q5_K_M) | 600ms | 10 | 26s |
| Laptop (NVIDIA RTX 4060) | Llama 2 7B (FP16) | 200ms | 80 | 3.2s |
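For comparison, the laptop GPU in this table decodes roughly 7-11x faster than the CPU-only 7B runs (80 vs. 7-12 tokens/sec). Edge inference is still acceptable for interactive use when a time to first token of around one second is tolerable and the output is streamed.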
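The Decode Time column is tokens divided by throughput, and end-to-end time adds TTFT on top. The short sketch below recomputes both from the TTFT and TPS figures in the table. The function names are illustrative.

```python
def decode_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Time spent generating n_tokens once decoding has started."""
    return n_tokens / tokens_per_sec


def total_seconds(ttft_ms: float, n_tokens: int, tokens_per_sec: float) -> float:
    """Time to first token plus decode time."""
    return ttft_ms / 1000 + decode_seconds(n_tokens, tokens_per_sec)


# (device, TTFT in ms, tokens/sec) taken from the table above
rows = [
    ("iPhone 15 (A17 Pro)", 800, 7),
    ("MacBook Pro M3", 400, 12),
    ("Laptop (NVIDIA RTX 4060)", 200, 80),
]
for device, ttft_ms, tps in rows:
    print(f"{device}: decode {decode_seconds(256, tps):.1f}s, "
          f"total {total_seconds(ttft_ms, 256, tps):.1f}s")
```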
Architectural Pattern: Hybrid Local + Server
A common pattern for edge deployment: use a smaller local model for fast response, offload complex queries to a server:
import asyncio
from typing import Optional

import httpx
from llama_cpp import Llama


class HybridLLMRouter:
    """
    Route requests to a local edge model or a cloud server
    based on a latency budget and server availability.
    """

    def __init__(self, local_model_path: str, server_url: Optional[str] = None):
        self.local_model = Llama(model_path=local_model_path, n_threads=8, verbose=False)
        self.server_url = server_url

    async def generate(self, prompt: str, latency_budget_ms: int = 2000) -> str:
        """
        Use the local model if it is expected to meet the latency budget
        or if no server is reachable; otherwise offload to the server.
        """
        # Rough heuristic: treat each whitespace-separated word as one token
        # and assume ~2 tokens/sec of local throughput.
        estimated_local_ms = len(prompt.split()) / 2 * 1000

        if estimated_local_ms <= latency_budget_ms or not await self._server_is_available():
            print("Using local edge model...")
            return await self._generate_local(prompt)

        print("Offloading to server...")
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.post(
                f"{self.server_url}/v1/completions",
                json={"prompt": prompt, "max_tokens": 256},
            )
            resp.raise_for_status()
            # Assumes an OpenAI-style completions response body
            return resp.json()["choices"][0]["text"]

    async def _generate_local(self, prompt: str) -> str:
        # llama.cpp inference blocks, so run it in a worker thread
        response = await asyncio.to_thread(
            self.local_model, prompt, max_tokens=256, temperature=0.7
        )
        return response["choices"][0]["text"]

    async def _server_is_available(self) -> bool:
        """Check whether the server answers its health endpoint."""
        if not self.server_url:
            return False
        try:
            async with httpx.AsyncClient(timeout=1) as client:
                response = await client.get(f"{self.server_url}/v1/health")
            return response.status_code == 200
        except httpx.HTTPError:
            return False


async def main() -> None:
    router = HybridLLMRouter(
        local_model_path="./models/small-3b-q4.gguf",  # Fast 3B model locally
        server_url="https://api.example.com",  # 7B model on server
    )

    # Short prompt with a generous budget -> stays local
    print(await router.generate("What is Python?", latency_budget_ms=2000))

    # Longer prompt with a tight budget -> offloaded if the server is reachable
    print(await router.generate(
        "Explain the implications of quantum computing on cryptography.",
        latency_budget_ms=200,
    ))


asyncio.run(main())
This pattern keeps latency low for simple queries (local model) while preserving capability for complex ones (server).
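The routing decision depends on how local latency is estimated. One possible refinement, sketched here as a set of pure functions, separates prompt processing (prefill) from token generation (decode), since the two run at very different speeds. The throughput numbers in the example are assumptions and should be measured on the target device.

```python
def estimate_local_ms(prompt_tokens: int, max_new_tokens: int,
                      prefill_tps: float, decode_tps: float) -> float:
    """Estimated time for a full local completion, in milliseconds."""
    return (prompt_tokens / prefill_tps + max_new_tokens / decode_tps) * 1000


def choose_backend(estimated_ms: float, budget_ms: float, server_available: bool) -> str:
    """Stay local when the budget is met or when there is nowhere to offload to."""
    if estimated_ms <= budget_ms or not server_available:
        return "local"
    return "server"


# Assumed device throughput: 100 tokens/sec prefill, 7 tokens/sec decode
estimate = estimate_local_ms(prompt_tokens=20, max_new_tokens=64,
                             prefill_tps=100, decode_tps=7)
print(f"estimated local time: {estimate:.0f} ms")
print(choose_backend(estimate, budget_ms=2000, server_available=True))
print(choose_backend(estimate, budget_ms=2000, server_available=False))
```

If responses are streamed, the budget arguably applies to time to first token only, in which case `max_new_tokens=0` gives the prefill-only estimate.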
Mobile Deployment: iOS Example
Deploying a quantized LLM on iOS using ONNX Runtime:
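import Combine
import Foundation
import OnnxRuntimeBindings  // Swift Package Manager module; with CocoaPods, import onnxruntime_objc

final class EdgeLLMViewModel: ObservableObject {
    @Published var response: String = ""
    private let ortSession: ORTSession
    private let maxNewTokens = 256

    init(modelPath: String) throws {
        let ortEnv = try ORTEnv(loggingLevel: .warning)
        let sessionOptions = try ORTSessionOptions()
        try sessionOptions.setGraphOptimizationLevel(.extended)
        self.ortSession = try ORTSession(
            env: ortEnv,
            modelPath: modelPath,
            sessionOptions: sessionOptions
        )
    }

    func generate(prompt: String) async {
        var tokens = tokenize(prompt)
        var outputTokens: [Int32] = []
        do {
            for _ in 0..<maxNewTokens {
                // Assumes the exported model takes only input_ids of shape [1, seq_len]
                // and returns logits; many exports also need attention_mask and KV-cache inputs.
                let inputTensor = try ORTValue(
                    tensorData: NSMutableData(bytes: tokens, length: tokens.count * MemoryLayout<Int32>.size),
                    elementType: .int32,
                    shape: [1, NSNumber(value: tokens.count)]
                )
                let outputs = try ortSession.run(
                    withInputs: ["input_ids": inputTensor],
                    outputNames: ["logits"],
                    runOptions: nil
                )
                guard let logits = outputs["logits"] else { break }
                let nextToken = try argmax(logits)
                outputTokens.append(nextToken)
                tokens.append(nextToken)  // Feed the new token back in for the next step
            }
        } catch {
            print("Inference failed: \(error)")
        }
        let text = detokenize(outputTokens)
        await MainActor.run { self.response = text }
    }

    private func tokenize(_ text: String) -> [Int32] {
        // Placeholder only: a real app must use the model's own tokenizer (BPE/SentencePiece).
        // Hashes are folded into Llama 2's 32,000-entry vocabulary range to stay in bounds.
        return text.split(separator: " ").map { Int32(abs($0.hashValue % 32_000)) }
    }

    private func detokenize(_ tokens: [Int32]) -> String {
        // Placeholder: joins token IDs; a real app maps IDs back to text
        return tokens.map { String($0) }.joined(separator: " ")
    }

    private func argmax(_ logits: ORTValue) throws -> Int32 {
        // Logits have shape [1, seq_len, vocab_size]; pick the best token at the last position
        let shape = try logits.tensorTypeAndShapeInfo().shape.map { $0.intValue }
        let vocabSize = shape.last ?? 0
        let data = try logits.tensorData() as Data
        let values = data.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
        guard vocabSize > 0, values.count >= vocabSize else { return 0 }
        let lastRow = values.suffix(vocabSize)
        let best = lastRow.indices.max { lastRow[$0] < lastRow[$1] } ?? lastRow.startIndex
        return Int32(best - lastRow.startIndex)
    }
}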
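The generation loop in the example above is greedy decoding: at each step, take the highest-scoring token and append it to the input for the next step. The Python sketch below isolates that control flow with a toy stand-in for the model. `toy_logits` and its five-token vocabulary are made up for illustration, and an end-of-sequence check is added as one possible stopping rule.

```python
from typing import Callable, List, Optional, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(next_logits: Callable[[List[int]], Sequence[float]],
                  prompt_ids: Sequence[int],
                  max_new_tokens: int,
                  eos_id: Optional[int] = None) -> List[int]:
    """Repeatedly pick the most likely next token and feed it back in."""
    ids = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = argmax(next_logits(ids))
        if token == eos_id:
            break  # Stop at end-of-sequence instead of padding to the limit
        generated.append(token)
        ids.append(token)  # Without this, every step would see the same input
    return generated


def toy_logits(ids: List[int]) -> List[float]:
    """Toy model: always prefers (last token + 1) modulo a 5-token vocabulary."""
    vocab_size = 5
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(greedy_decode(toy_logits, [0], max_new_tokens=6, eos_id=4))  # [1, 2, 3]
```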
Key Takeaways
- Edge inference requires INT4 quantization: 4.5 GB for a 7B model on mobile.
- GGUF format is the standard: llama.cpp and Ollama support it natively.
- CPU inference is roughly an order of magnitude slower than GPU: ~7 tokens/sec (mobile) vs. 80-150 (GPU).
- Hybrid local + server architecture: Use edge model for fast responses, offload complex queries to server.
- Acceptable for interactive use: Sub-second TTFT and streaming tokens make slow inference feel responsive.
Frequently Asked Questions
Can I run a 70B model on a mobile phone?
No. A 70B model in INT4 is ~35-40 GB, far more than the 8-12 GB of RAM on current phones. Run a 3B model instead, which retains much of the capability for common tasks at a small fraction of the size. For 70B, use server-side inference or a hybrid setup (local 3B + server 70B).
Why is CPU inference so much slower than GPU?
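Token generation is dominated by matrix-vector products that read every weight from memory for each token, so memory bandwidth matters as much as raw compute. GPUs have thousands of cores and memory bandwidth of hundreds of GB/s to several TB/s; CPUs have 8-16 general-purpose cores and typically tens of GB/s. Quantization helps (fewer bytes to read per token, and INT4 kernels are faster), but it cannot close that gap.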
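One way to make the CPU/GPU gap concrete is a memory-bandwidth rule of thumb: if each generated token requires reading all weights once, decode throughput is bounded by bandwidth divided by model size. The bandwidth figures below are ballpark assumptions for illustration, not measurements, and real throughput lands below this bound.

```python
def max_decode_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/sec when every token reads all weights once."""
    return bandwidth_gb_s / model_gb


# Assumed bandwidths: ~50 GB/s for a phone-class SoC, ~1500 GB/s for a data-center GPU
for label, bandwidth in [("phone-class SoC", 50), ("data-center GPU", 1500)]:
    print(f"{label}: at most {max_decode_tps(bandwidth, 4.5):.0f} tokens/sec for a 4.5 GB model")
```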
Does quantization degrade output quality on edge?
Q4_K_M (4-bit) typically increases perplexity by about 1% or less on standard benchmarks, which is rarely noticeable in practice. Q3_K (3-bit) costs a few percent, which may be noticeable on reasoning tasks but is often fine for chat.
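For readers who want to check such figures on their own data, perplexity is the exponential of the mean negative log-likelihood per token, and the percentage quoted is the relative change between the baseline and the quantized model. The sketch below shows both calculations. The perplexity values in the example are hypothetical, not benchmark results.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase_pct(ppl_baseline: float, ppl_quantized: float) -> float:
    """Percentage increase of the quantized model's perplexity over the baseline."""
    return (ppl_quantized - ppl_baseline) / ppl_baseline * 100


# A model that assigns probability 0.5 to every token has perplexity 2
print(perplexity([math.log(0.5)] * 4))

# Hypothetical perplexities for an FP16 baseline and a 4-bit variant
print(f"{relative_increase_pct(5.00, 5.05):.1f}% increase")
```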
Can I use batching on edge (multiple queries simultaneously)?
Technically yes, but rarely practical. Batching assumes many requests arrive concurrently (like a server). Edge devices run one user's query at a time. Batching would increase latency without benefit.
How do I keep edge models updated (security patches, new models)?
Use over-the-air (OTA) updates: periodically download new GGUF files (4-8 GB) over Wi-Fi using the platform's background download mechanism (e.g., a background URLSession on iOS). Pin model versions and verify a checksum or signature before loading a file to reduce the risk of supply-chain attacks.
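Checksum validation can be as simple as comparing a SHA-256 digest against a value published through a trusted channel. The sketch below hashes the file in chunks so multi-gigabyte models do not need to fit in memory. The function names are illustrative, and the demo uses a small temporary file in place of a real download.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256: str) -> bool:
    """True only if the file's digest matches the pinned value."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())


# Demo with a small stand-in file; in practice the expected digest ships with
# the app or comes from a signed manifest, never from the same download.
payload = b"stand-in for model bytes"
model_path = Path(tempfile.mkdtemp()) / "model.gguf"
model_path.write_bytes(payload)
pinned = hashlib.sha256(payload).hexdigest()

print(verify_download(model_path, pinned))
print(verify_download(model_path, "0" * 64))
```

A mismatch should cause the app to discard the file and keep the previously verified model.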