LLM Inference on Edge & Constrained Hardware

Deploying LLMs on edge devices (smartphones, IoT gateways, embedded systems, and personal computers without GPUs) requires a different optimization approach than data-center serving. Edge inference trades throughput for reduced resource consumption: a 7B model on a mobile phone generates 5-10 tokens per second (vs. 100+ on a data-center GPU) but enables offline, private inference with sub-second time to first token (TTFT) for short queries. This article covers quantization strategies, efficient runtimes, and architectural patterns for edge LLM deployment.

Edge Inference Constraints and Tradeoffs

Edge devices face severe constraints: 4-16 GB of RAM on mobile, CPU memory bandwidth far below that of a data-center GPU, and no guarantee of a persistent internet connection. The fundamental constraint is that model weights must fit in RAM. A 7B-parameter model in FP16 weighs 14 GB (7 billion parameters × 2 bytes), while most phones have 8-12 GB in total, shared with the OS and other apps. Edge inference therefore requires aggressive quantization.
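The 14 GB figure follows from simple arithmetic: parameter count times bits per weight, divided by 8 bits per byte. A minimal sketch of that calculation (the helper name `weight_size_gb` is illustrative, and it counts the weights only, not the KV cache or runtime overhead):

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


print(weight_size_gb(7e9, 16))  # FP16 -> 14.0
print(weight_size_gb(7e9, 8))   # INT8 -> 7.0
print(weight_size_gb(7e9, 4))   # INT4 -> 3.5
```

Real 4-bit files come out somewhat larger than the ideal 3.5 GB because quantization formats also store per-block scale factors and keep some tensors at higher precision.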

| Device | RAM | Storage | Target Model Size | Precision |
| --- | --- | --- | --- | --- |
| iPhone 15 Pro | 8 GB | 256 GB | < 8 GB | INT4 (2x-4x compression) |
| Android flagship | 12 GB | 256 GB | < 12 GB | INT4 or INT8 |
| Raspberry Pi 5 | 4-8 GB | 32 GB microSD | < 3 GB | INT4 (1B-3B model) |
| Laptop (no GPU) | 16 GB | 512 GB | < 12 GB | INT8 or FP16 |
| Embedded gateway | 2-4 GB | 4-32 GB eMMC | < 2 GB | INT4 (< 1B model) |

The inference speed tradeoff: edge inference is roughly 10-30x slower than data-center GPU inference. A 7B model generates 5-10 tokens/sec on a mobile ARM CPU vs. 100-150 tokens/sec on an A100 GPU. This is acceptable for streaming responses (the user sees tokens arriving) but unsuitable for very long-form generation.

Quantization Strategies for Edge

INT4 Quantization with GGUF Format

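GGUF (GPT-Generated Unified Format) is the de facto standard file format for edge inference. A GGUF file packs the weights, tokenizer, and metadata into a single file. The weights are usually quantized to between 2 and 8 bits each, which keeps the file compact and well suited to CPU inference.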

from llama_cpp import Llama

# Load a GGUF-quantized model (about 4-4.5 GB for a 7B model at Q4_K_M)
llm = Llama(
    model_path="./models/llama-2-7b-q4_k_m.gguf",  # 4-bit (Q4_K_M)
    n_gpu_layers=-1,  # Offload all layers to a GPU if the build supports one; otherwise runs on CPU
    n_threads=8,      # Use 8 CPU threads
    n_batch=512,      # Process up to 512 prompt tokens per batch
    n_ctx=2048,       # 2K context window
    verbose=False,    # Reduce logging
)

# The call is the same whether layers run on the CPU or a GPU
response = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    echo=False,
)

print(response['choices'][0]['text'])
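The `n_ctx` setting also costs RAM, because the runtime keeps a key/value cache for every position in the context window. A rough sketch of that cost, assuming a 16-bit cache and the published Llama 2 7B shape (32 layers, 32 KV heads of dimension 128); the helper name is illustrative:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value=2):
    """Approximate KV cache size: keys and values (factor 2) per layer, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


# Llama 2 7B with a 2K context window and a 16-bit cache
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_ctx=2048)
print(size / 2**30)  # -> 1.0 (GiB)
```

On a device with 8 GB of RAM, an extra gigabyte on top of the weights is significant, which is one reason to keep `n_ctx` as small as the application allows.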

GGUF quantization variants for a 7B model, from largest and slowest on CPU to smallest and fastest (sizes are approximate):

  • Q8_0: 8-bit, nearly indistinguishable from FP16, ~7 GB (highest quality, slowest).
  • Q6_K: 6-bit, very small quality loss, ~5.5 GB (good balance).
  • Q5_K_M: 5-bit, small quality loss, ~5 GB (a good choice when RAM allows).
  • Q4_K_M: 4-bit, perplexity increase of roughly 1%, ~4-4.5 GB (best size/quality tradeoff).
  • Q3_K: 3-bit, noticeable quality loss (perplexity up by a few percent), ~3-3.5 GB (extremely compact).

For mobile, Q4_K_M is standard. For IoT, Q3_K or even lower. For desktops, Q5_K_M if you want negligible quality loss.
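The selection rule above amounts to "pick the highest-precision variant that fits the memory budget." A minimal sketch of that rule follows; the bits-per-weight values are rounded approximations assumed for illustration, and `pick_variant` is a hypothetical helper, not part of any library:

```python
# Approximate effective bits per weight (assumed, rounded; real files vary by model).
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}


def pick_variant(n_params, budget_gb):
    """Return the highest-precision variant whose weights fit in budget_gb, or None."""
    by_precision = sorted(APPROX_BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1])
    for name, bits in by_precision:
        size_gb = n_params * bits / 8 / 1e9
        if size_gb <= budget_gb:
            return name
    return None


print(pick_variant(7e9, budget_gb=4.5))  # -> Q4_K_M
print(pick_variant(7e9, budget_gb=2.0))  # -> None (use a smaller model instead)
```

The budget should be well below total device RAM, since the OS, the app, and the KV cache all need room as well.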

Creating GGUF Models

To quantize an existing model to GGUF format (requires a one-time 10-30 minute process):

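# Clone llama.cpp (reference implementation)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (recent versions use CMake; older checkouts used `make`)
cmake -B build
cmake --build build --config Release

# Install the Python dependencies for the conversion script
pip install -r requirements.txt

# Convert a Hugging Face model directory to GGUF (older checkouts: convert.py)
python3 convert_hf_to_gguf.py /path/to/llama-2-7b --outfile model.gguf

# Quantize to Q4_K_M (older checkouts: ./quantize)
./build/bin/llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M

# Result: ~4.5 GB file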

Once quantized, distribute the single .gguf file to edge devices (no further conversion needed).
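Before loading a distributed file on a device, it can be worth checking that it is a complete GGUF file and not a truncated download. The sketch below reads the fixed-size header, assuming the GGUF v2/v3 layout (4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count); `read_gguf_header` is an illustrative helper name:

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file header."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        header = f.read(20)
        if len(header) < 20:
            raise ValueError("truncated GGUF header")
        # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
        version, tensor_count, kv_count = struct.unpack("<IQQ", header)
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("model-q4_k_m.gguf")` on a valid file returns three integers; a `ValueError` means the file should be downloaded again before the runtime tries to load it.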

Edge Runtime Options

Option 1: llama.cpp (CPU + optional GPU)

llama.cpp is the most popular open-source edge LLM runtime. It supports:

  • CPU inference on any OS (macOS, Linux, Windows, mobile).
  • GPU offloading (CUDA, Metal, ROCm if available).
  • Quantized models (GGUF format).

Usage (recent builds name the CLI binary llama-cli; older builds call it main):

./build/bin/llama-cli -m model-q4_k_m.gguf -n 256 -p "What is AI?"

Option 2: Ollama (Simplified Wrapper)

Ollama provides a simpler interface over llama.cpp:

import requests

# Download: ollama pull llama2
# Run: ollama serve (in background)

response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama2',
        'prompt': 'What is AI?',
        'stream': False,  # Return one JSON object instead of a token stream
    },
    timeout=120,
)
response.raise_for_status()

print(response.json()['response'])
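Because edge decoding is slow, streaming matters more than on a server: showing tokens as they arrive hides much of the wait. With `'stream': True`, Ollama returns newline-delimited JSON objects, each carrying a `response` fragment and a `done` flag. A small sketch of a parser for that stream (`iter_stream_text` is an illustrative helper, not part of Ollama):

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an Ollama newline-delimited JSON stream."""
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break


# Usage against a local server (requires `ollama serve` and the `requests` package):
#   import requests
#   with requests.post("http://localhost:11434/api/generate",
#                      json={"model": "llama2", "prompt": "What is AI?", "stream": True},
#                      stream=True) as r:
#       for piece in iter_stream_text(r.iter_lines()):
#           print(piece, end="", flush=True)
```

Keeping the parser separate from the HTTP call makes it easy to test without a running server.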

Option 3: Mobile-Specific Runtimes

  • Core ML (iOS): Apple's optimized runtime for on-device ML. Supports quantized models and hardware acceleration (Neural Engine on A-series chips).
  • ONNX Runtime Mobile (iOS/Android): Open standard for model inference, supports quantization and various hardware accelerators.
  • TensorFlow Lite (Android, iOS): TensorFlow's optimized runtime for mobile, very mature and widely used.

Performance: CPU vs. GPU Edge Inference

Real-world latencies on common devices:

| Device | Model (precision) | TTFT | TPS | Decode Time (256 tokens) |
| --- | --- | --- | --- | --- |
| iPhone 15 Pro (A17 Pro) | Llama 2 7B (Q4_K_M) | 800 ms | 7 | 37 s |
| MacBook Pro M3 | Llama 2 7B (Q4_K_M) | 400 ms | 12 | 21 s |
| Raspberry Pi 5 | Llama 2 3B (Q4) | 1500 ms | 3 | 90 s |
| Laptop (Intel i9, CPU) | Llama 2 7B (Q5_K_M) | 600 ms | 10 | 26 s |
| Laptop (NVIDIA RTX 4060) | Llama 2 7B (FP16) | 200 ms | 80 | 3.2 s |

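For comparison, the laptop GPU in the last row decodes roughly 7-11x faster than the other rows running a 7B model, but edge inference remains acceptable for interactive use cases where a time to first token of around one second is tolerable.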
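The decode-time column is simply the token count divided by tokens per second; the total wait for a complete response adds TTFT on top. A minimal sketch of that arithmetic, using figures from the table (the function name is illustrative):

```python
def response_time_s(ttft_ms, tokens_per_sec, n_tokens=256):
    """Total time to a complete response: time to first token plus decode time."""
    return ttft_ms / 1000 + n_tokens / tokens_per_sec


print(round(response_time_s(800, 7), 1))   # phone row      -> 37.4
print(round(response_time_s(200, 80), 1))  # laptop GPU row -> 3.4
```

With streaming, the user starts reading after the TTFT portion only, which is why the long decode times are more tolerable than the totals suggest.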

Architectural Pattern: Hybrid Local + Server

A common pattern for edge deployment: use a smaller local model for fast response, offload complex queries to a server:

import asyncio
from typing import Optional

import httpx
from llama_cpp import Llama


class HybridLLMRouter:
    """
    Route requests to a local edge model or a cloud server
    based on a latency budget and server availability.
    """

    def __init__(self, local_model_path: str, server_url: Optional[str] = None):
        self.local_model = Llama(model_path=local_model_path, n_threads=8, verbose=False)
        self.server_url = server_url

    async def generate(self, prompt: str, latency_budget_ms: int = 2000) -> str:
        """
        Use the local model if it is expected to meet the latency budget
        or if no server is reachable; otherwise offload to the server.
        """
        # Rough heuristic: word count stands in for token count, at ~2 tokens/sec.
        estimated_local_ms = len(prompt.split()) / 2 * 1000

        if estimated_local_ms <= latency_budget_ms or not await self._server_is_available():
            print("Using local edge model...")
            return await self._generate_local(prompt)

        print("Offloading to server...")
        try:
            async with httpx.AsyncClient(timeout=30) as client:
                resp = await client.post(
                    f"{self.server_url}/v1/completions",
                    json={"prompt": prompt, "max_tokens": 256},
                )
                resp.raise_for_status()
                # Assumes an OpenAI-style completions response body.
                return resp.json()["choices"][0]["text"]
        except (httpx.HTTPError, KeyError, IndexError):
            # Server failed mid-request: fall back to slower local inference.
            return await self._generate_local(prompt)

    async def _generate_local(self, prompt: str) -> str:
        # llama-cpp-python calls block, so run them in a worker thread.
        response = await asyncio.to_thread(
            self.local_model, prompt, max_tokens=256, temperature=0.7
        )
        return response["choices"][0]["text"]

    async def _server_is_available(self) -> bool:
        """Check whether the server's health endpoint responds."""
        if not self.server_url:
            return False
        try:
            async with httpx.AsyncClient(timeout=1) as client:
                response = await client.get(f"{self.server_url}/v1/health")
            return response.status_code == 200
        except httpx.HTTPError:
            return False


async def main():
    router = HybridLLMRouter(
        local_model_path="./llama-2-3b-q4.gguf",  # Fast 3B model locally
        server_url="https://api.example.com",     # 7B model on server
    )

    # Short prompt, generous budget: the 1500 ms estimate fits, so run locally
    print(await router.generate("What is Python?", latency_budget_ms=2000))

    # Longer prompt, tight budget: offload if the server is reachable
    print(await router.generate(
        "Explain the implications of quantum computing on cryptography.",
        latency_budget_ms=200,
    ))


asyncio.run(main())

This pattern keeps latency low for simple queries (local model) while preserving capability for complex ones (server).

Mobile Deployment: iOS Example

A skeleton for running a quantized LLM on iOS using ONNX Runtime. The tokenizer and argmax are placeholders, and the input/output names ("input_ids", "logits") depend on how the model was exported:

import Combine
import Foundation
// Module name when ONNX Runtime is added via Swift Package Manager;
// with CocoaPods the module is onnxruntime_objc.
import OnnxRuntimeBindings

final class EdgeLLMViewModel: ObservableObject {
    @Published var response: String = ""

    private let ortSession: ORTSession

    init(modelPath: String) throws {
        let ortEnv = try ORTEnv(loggingLevel: .warning)
        let sessionOptions = try ORTSessionOptions()
        try sessionOptions.setGraphOptimizationLevel(.extended)

        self.ortSession = try ORTSession(
            env: ortEnv,
            modelPath: modelPath,
            sessionOptions: sessionOptions
        )
    }

    func generate(prompt: String, maxNewTokens: Int = 256) async {
        var tokens = tokenize(prompt)
        var outputTokens: [Int32] = []

        do {
            for _ in 0..<maxNewTokens {
                // Feed the full sequence so far; shape is [batch = 1, sequence length].
                // There is no KV cache here, so each step recomputes the whole sequence.
                let inputData = NSMutableData(
                    bytes: tokens,
                    length: tokens.count * MemoryLayout<Int32>.size
                )
                let inputTensor = try ORTValue(
                    tensorData: inputData,
                    elementType: .int32,
                    shape: [1, NSNumber(value: tokens.count)]
                )

                let outputs = try ortSession.run(
                    withInputs: ["input_ids": inputTensor],
                    outputNames: ["logits"],
                    runOptions: nil
                )
                guard let logits = outputs["logits"] else { break }

                // Greedy decoding: take the most likely token and append it to the input.
                let nextToken = argmax(logits)
                outputTokens.append(nextToken)
                tokens.append(nextToken)
            }
        } catch {
            print("Inference failed: \(error)")
        }

        let text = detokenize(outputTokens)
        DispatchQueue.main.async {
            self.response = text
        }
    }

    private func tokenize(_ text: String) -> [Int32] {
        // Placeholder: a real app would use the model's BPE tokenizer
        return text.split(separator: " ").map { Int32(truncatingIfNeeded: $0.hashValue) }
    }

    private func detokenize(_ tokens: [Int32]) -> String {
        // Placeholder: reverse tokenization
        return tokens.map { String($0) }.joined(separator: " ")
    }

    private func argmax(_ logits: ORTValue) -> Int32 {
        // Placeholder: return the index of the largest logit at the last position
        return 0
    }
}

Key Takeaways

  • Edge inference requires INT4 quantization: 4.5 GB for a 7B model on mobile.
  • GGUF format is the standard: llama.cpp and Ollama support it natively.
  • CPU inference is 10x or more slower than GPU: ~7 tokens/sec (mobile) vs. 100+ (data-center GPU).
  • Hybrid local + server architecture: Use edge model for fast responses, offload complex queries to server.
  • Acceptable for interactive use: Sub-second TTFT and streaming tokens make slow inference feel responsive.

Frequently Asked Questions

Can I run a 70B model on a mobile phone?

No. A 70B model in INT4 is roughly 35-40 GB, far more than any phone's RAM. You could run a 3B model instead, which is a small fraction of the size at the cost of reduced capability. For 70B, use server-side inference or a hybrid setup (local 3B + server 70B).

Why is CPU inference so much slower than GPU?

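GPUs have thousands of cores (massively parallel); CPUs have 8-16 cores. GPUs are optimized for matrix multiplication; CPUs use general-purpose silicon. Token generation is also largely memory-bandwidth-bound: every generated token requires reading the model weights from memory, and GPU memory is many times faster than the DRAM attached to a CPU. The speedup from quantization is real (fewer bytes to read per token, faster INT4 kernels) but cannot overcome these differences.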
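A common rule of thumb makes the gap concrete: if generating each token requires reading every weight from memory once, decode speed cannot exceed memory bandwidth divided by model size. The sketch below applies that bound; the bandwidth figures are illustrative assumptions, not measurements of any specific device:

```python
def max_decode_tps(model_size_gb, mem_bandwidth_gb_s):
    """Upper bound on tokens/sec when each token requires one full pass over the weights."""
    return mem_bandwidth_gb_s / model_size_gb


# 4.5 GB quantized 7B model; bandwidths below are assumed round numbers
print(round(max_decode_tps(4.5, 50), 1))    # phone-class DRAM, ~50 GB/s      -> 11.1
print(round(max_decode_tps(4.5, 1500), 1))  # data-center GPU, ~1500 GB/s     -> 333.3
```

Real throughput lands below this bound, but the ratio between the two results tracks the order-of-magnitude gap between edge and data-center decoding.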

Does quantization degrade output quality on edge?

Q4_K_M (4-bit) typically raises perplexity by about 1% or less on standard benchmarks, which is rarely noticeable in practice. Q3_K (3-bit) causes a larger increase of a few percent, which may be noticeable on reasoning tasks but is usually fine for chat.

Can I use batching on edge (multiple queries simultaneously)?

Technically yes, but rarely practical. Batching assumes many requests arrive concurrently (like a server). Edge devices run one user's query at a time. Batching would increase latency without benefit.

How do I keep edge models updated (security patches, new models)?

Over-the-air (OTA) updates: periodically download new GGUF files (4-8 GB) via WiFi and background downloads (e.g., iOS Background App Refresh). Use version pinning and checksum validation to prevent supply-chain attacks.
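Checksum validation can be done with the standard library alone. The sketch below assumes the expected SHA-256 digest is delivered separately over a trusted channel (for example, in a signed update manifest); the function names are illustrative:

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-gigabyte models never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True only if the file's digest matches the expected hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

A typical flow downloads to a temporary path, calls `verify_download`, and only then moves the file into place, so a corrupted or tampered model is never loaded.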

Further Reading