
Smart Model Routing: Route Requests by Difficulty

Smart model routing selects a model for each request based on task difficulty and performance requirements, sending simple tasks to small, cheap models (Haiku, Mistral 7B) and complex reasoning to powerful, expensive models (Sonnet, Opus). A routing system classifies each incoming request (FAQ lookup, classification, code generation) into a difficulty tier, then invokes the model for that tier. The cost impact is substantial: a chatbot that routes 70% of questions to Haiku ($0.80 per million input tokens) and 30% to Sonnet ($3 per million) pays a blended $1.46 per million input tokens, about 51% less than always using Sonnet. Routing requires two components: a lightweight classifier that predicts task difficulty, and a decision tree (or learned policy) that maps difficulty to model choice. After output length reduction, an effective router is typically one of the highest-impact cost optimizations, yielding 30–50% savings on realistic traffic mixes with minimal accuracy loss.
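The savings figure for any traffic mix can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the input-token prices quoted in this article, ignores output tokens and classifier overhead, and the function names are made up for the example.

```python
# Input-token prices in USD per million tokens, as quoted in this article.
PRICE_PER_MTOK = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}


def blended_price(mix: dict[str, float]) -> float:
    """Average input price per million tokens for a traffic mix (shares sum to 1)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * PRICE_PER_MTOK[tier] for tier, share in mix.items())


def savings_vs(baseline_tier: str, mix: dict[str, float]) -> float:
    """Fractional saving of the mix relative to sending everything to one tier."""
    return 1.0 - blended_price(mix) / PRICE_PER_MTOK[baseline_tier]


two_tier = {"haiku": 0.7, "sonnet": 0.3}
print(f"{blended_price(two_tier):.2f} USD/MTok, {savings_vs('sonnet', two_tier):.0%} saved")
```

Running the same functions on other mixes shows how sensitive the result is to the share of traffic that reaches the most expensive tier.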

Understanding Model Tiers and Their Trade-Offs

Modern LLM providers publish models across multiple tiers, each with different capability and cost profiles. As of June 2026, Anthropic publishes Claude models in three tiers:

  • Haiku (entry-level): ~$0.8 per million input tokens, optimized for low-latency, high-volume tasks. Excels at: classification, extraction, simple Q&A, summarization. Weak on: multi-step reasoning, creative writing, code generation on novel problems.
  • Sonnet (mid-tier): ~$3 per million input tokens, balanced capability and cost. Excels at: coding, reasoning, nuanced analysis, instruction-following. Good-enough for most tasks.
  • Opus (flagship): ~$15 per million input tokens, maximum capability. Excels at: complex multi-step reasoning, novel problem-solving, long-context analysis. Reserve for critical/high-value requests.

OpenAI and Google offer similar tiering. The rule of thumb: use the smallest model that solves your problem. A support chatbot answering "How do I reset my password?" does not need Sonnet; Haiku is fine. But "Design a system to reduce database latency by 50%" needs Sonnet or Opus. The cost difference is dramatic: a 100-token response costs $0.0004 in output tokens on Haiku and $0.0015 on Sonnet (implying output prices of $4 and $15 per million tokens), which is 3.75× more for the same output length.
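To compare tiers on a whole request rather than on output alone, input and output tokens can be priced together. This is a minimal sketch: the input prices are the ones listed above, the output prices are back-derived from the 100-token example in this section (an assumption, so check the provider's current price list), and `Pricing` and `request_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# Output prices are inferred from "$0.0004 vs $0.0015 per 100 output tokens".
PRICING = {
    "haiku": Pricing(input_per_mtok=0.80, output_per_mtok=4.00),
    "sonnet": Pricing(input_per_mtok=3.00, output_per_mtok=15.00),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request on the given tier."""
    p = PRICING[tier]
    return (input_tokens * p.input_per_mtok + output_tokens * p.output_per_mtok) / 1_000_000


# Example: a short support question (200 input tokens, 100 output tokens).
for tier in PRICING:
    print(tier, f"{request_cost(tier, 200, 100):.5f}")
```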

Building a Request Classifier

A request classifier is a lightweight model (often rule-based or a small ML model) that predicts task difficulty or category without calling the main LLM. The classifier runs first, assigns a difficulty tier, and the router sends the request to the matching model. There are three common approaches:

  1. Rule-based: Use heuristics—keyword matching, regex, input length—to classify. Fast and deterministic but brittle.
  2. Small ML model: Train a logistic regression or decision tree on historical request samples, labeled by actual difficulty or performance. Requires labeled data but adapts to your specific use cases.
  3. Lightweight LLM pre-check: Use Haiku to make a quick difficulty classification, then route the full request to the appropriate tier. Adds a small cost (one Haiku call) but is accurate and adapts over time.
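The second approach can be sketched without any ML library. The example below fits a tiny logistic regression by gradient descent on three hand-picked features; the features, keyword list, hyperparameters, and toy training data are all assumptions for illustration, and a production classifier would be trained on labeled routing logs with a proper library.

```python
import math

COMPLEX_HINTS = ("analyze", "compare", "design", "architecture", "optimize")


def features(query: str) -> list[float]:
    """Bias term, length scaled to [0, 1], and count of complex-task keywords."""
    q = query.lower()
    return [1.0, min(len(q), 500) / 500, float(sum(kw in q for kw in COMPLEX_HINTS))]


def predict_proba(weights: list[float], query: str) -> float:
    """Estimated probability that the query needs the larger model."""
    z = sum(w * x for w, x in zip(weights, features(query)))
    return 1.0 / (1.0 + math.exp(-z))


def train(samples: list[tuple[str, int]], epochs: int = 2000, lr: float = 0.5) -> list[float]:
    """Fit logistic regression by batch gradient descent (label 1 = needs larger model)."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for query, label in samples:
            error = predict_proba(weights, query) - label
            for i, x in enumerate(features(query)):
                grad[i] += error * x
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights


# Toy labeled history (hypothetical); real labels would come from routing logs.
HISTORY = [
    ("How do I reset my password?", 0),
    ("What are your support hours?", 0),
    ("What is the status of my order?", 0),
    ("Analyze and compare these two caching strategies for our API", 1),
    ("Design an architecture to optimize database latency under load", 1),
    ("Compare sharding approaches and design a migration plan", 1),
]

weights = train(HISTORY)
needs_larger_model = predict_proba(weights, "Where is your office address?") >= 0.5
```

The decision threshold (0.5 here) is the tuning knob: lowering it sends more traffic to the larger model and trades cost for safety.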

Here is a Python example combining rule-based classification with a Haiku pre-check for borderline cases:

import anthropic
from enum import Enum

class Difficulty(Enum):
    SIMPLE = "haiku"  # Low reasoning, high volume
    MODERATE = "sonnet"  # Balanced capability
    COMPLEX = "opus"  # Reasoning-heavy, accuracy-critical

def classify_request_difficulty(query: str) -> Difficulty:
    """
    Classify request difficulty using heuristics + keyword analysis.
    Falls back to Haiku pre-check for borderline cases.
    """

    # Rule-based heuristics
    simple_keywords = [
        "what is", "how do i", "password", "reset", "faq",
        "status", "hours", "phone number", "address",
    ]
    complex_keywords = [
        "why", "analyze", "compare", "strategy", "design",
        "complex", "reasoning", "code", "architecture", "optimize",
    ]

    query_lower = query.lower()
    simple_score = sum(1 for kw in simple_keywords if kw in query_lower)
    complex_score = sum(1 for kw in complex_keywords if kw in query_lower)

    if simple_score > 2:
        return Difficulty.SIMPLE
    elif complex_score > 2:
        return Difficulty.COMPLEX
    elif len(query) < 50:
        return Difficulty.SIMPLE
    elif len(query) > 500:
        return Difficulty.COMPLEX
    else:
        # Borderline: use Haiku to classify
        return classify_with_haiku(query)

def classify_with_haiku(query: str) -> Difficulty:
    """Ask Haiku to classify this request as simple/moderate/complex."""
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=10,
        messages=[
            {
                "role": "user",
                "content": f"Rate this request as 'simple', 'moderate', or 'complex': {query}",
            }
        ],
    )

    text = response.content[0].text.lower()
    if "simple" in text:
        return Difficulty.SIMPLE
    elif "moderate" in text:
        return Difficulty.MODERATE
    else:
        return Difficulty.COMPLEX

def route_request(query: str) -> str:
    """Route request to appropriate model based on difficulty."""
    client = anthropic.Anthropic()

    difficulty = classify_request_difficulty(query)

    if difficulty == Difficulty.SIMPLE:
        model = "claude-3-5-haiku-20241022"
    elif difficulty == Difficulty.MODERATE:
        model = "claude-3-5-sonnet-20241022"
    else:  # COMPLEX
        model = "claude-opus-4-1-20250805"

    response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=[
            {
                "role": "user",
                "content": query,
            }
        ],
    )

    # Log routing decision for analysis
    print(f"Query: {query[:50]}... => {difficulty.value} ({model})")

    return response.content[0].text

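This classifier is fast: rule-based checks run in microseconds, and Haiku pre-checks cost ~$0.0001–$0.0003 per call (negligible). The payoff, however, depends heavily on the traffic mix. At the input prices above, a split of 70% Haiku, 20% Sonnet, and 10% Opus averages $2.66 per million input tokens, only about 11% below always using Sonnet, because Opus costs 5× as much as Sonnet. A 70/30 split between Haiku and Sonnet averages $1.46, about 51% below. Reserve the Opus tier for requests that genuinely need it.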

Measuring Routing Accuracy and Performance

After deploying a router, audit its accuracy: log each request's classification, the model actually used, and whether the response was satisfactory (binary yes/no, or a quality score). Each week, compute: (1) accuracy per tier, the fraction of requests in each tier that were answered satisfactorily; (2) cost savings, comparing routed cost against an always-Sonnet baseline; (3) misclassifications, the fraction of complex requests that were incorrectly classified as simple. If accuracy on the Haiku tier drops below 85%, which usually means complex requests are leaking into it, tighten the classification rules or increase the percentage routed to Sonnet. A well-tuned router achieves 90%+ accuracy on all tiers while maintaining 40–50% cost savings.

Here is a measurement framework:

from datetime import datetime, timezone

def log_routing_decision(
    query_hash: str,
    classified_difficulty: Difficulty,
    model_used: str,
    output_tokens: int,
    cost: float,  # USD cost of this request; read back by compute_routing_accuracy
    user_satisfied: bool,  # Binary feedback: yes/no
):
    """Log routing decision and outcome for analysis."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query_hash": query_hash,
        "classified_difficulty": classified_difficulty.value,
        "model_used": model_used,
        "output_tokens": output_tokens,
        "cost": cost,
        "user_satisfied": user_satisfied,
    }
    # Append to cost_events.jsonl or database
    print(event)

def compute_routing_accuracy(days_back: int = 7):
    """
    Analyze routing accuracy over last N days.
    Returns per-tier accuracy and overall cost savings.
    """
    # Query logs: SELECT * FROM routing_events WHERE timestamp > NOW() - INTERVAL 'N days'
    # (Pseudo-code; adapt to your log store)
    # load_recent_events() and estimate_sonnet_cost() are placeholders you must
    # supply for your own log store and pricing logic.

    tiers = {"haiku": [], "sonnet": [], "opus": []}
    total_cost_routed = 0.0
    total_cost_always_sonnet = 0.0

    # In practice, iterate over log entries
    for event in load_recent_events(days=days_back):
        # Group by the classified tier ("haiku"/"sonnet"/"opus"). model_used
        # holds the full model ID and would never match these keys.
        tier = event["classified_difficulty"]
        satisfied = event["user_satisfied"]

        if tier in tiers:
            tiers[tier].append(satisfied)

        # Accumulate costs
        total_cost_routed += event["cost"]
        total_cost_always_sonnet += estimate_sonnet_cost(event)

    # Compute accuracy per tier
    accuracy = {}
    for tier, results in tiers.items():
        if results:
            accuracy[tier] = sum(results) / len(results)

    savings = total_cost_always_sonnet - total_cost_routed
    # Guard against an empty log window
    savings_pct = (savings / total_cost_always_sonnet) * 100 if total_cost_always_sonnet else 0.0

    print(f"Accuracy per tier: {accuracy}")
    print(f"Cost savings: ${savings:.2f} ({savings_pct:.1f}%)")

    return accuracy, savings_pct

Run this analysis weekly and act on results: if Haiku accuracy drops, investigate the failed cases and either (1) tighten classification rules, (2) route more of those queries to Sonnet, or (3) improve Haiku prompting (e.g., with few-shot examples). If Haiku is 98% accurate, consider routing more simple queries to it; the cost savings scale.
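The weekly decision can be written down as a small rule so that it is applied consistently. This is a sketch under assumptions: events are dicts with the `classified_difficulty` and `user_satisfied` fields logged above, the 85% and 98% thresholds are the ones mentioned in this article, and the minimum sample size and the function names are hypothetical.

```python
def tier_accuracy(events: list[dict], tier: str) -> tuple[float, int]:
    """Return (satisfaction rate, sample count) for one classified tier."""
    outcomes = [e["user_satisfied"] for e in events if e["classified_difficulty"] == tier]
    if not outcomes:
        return 0.0, 0
    return sum(outcomes) / len(outcomes), len(outcomes)


def review_tier(
    events: list[dict],
    tier: str = "haiku",
    floor: float = 0.85,      # below this, route less traffic to the tier
    ceiling: float = 0.98,    # at or above this, consider routing more to it
    min_samples: int = 100,   # assumed minimum before acting on the number
) -> str:
    """Return 'insufficient-data', 'tighten', 'expand', or 'hold'."""
    accuracy, n = tier_accuracy(events, tier)
    if n < min_samples:
        return "insufficient-data"
    if accuracy < floor:
        return "tighten"
    if accuracy >= ceiling:
        return "expand"
    return "hold"


# Example with synthetic events: 80 satisfied out of 100 on the Haiku tier.
sample = [{"classified_difficulty": "haiku", "user_satisfied": i < 80} for i in range(100)]
print(review_tier(sample))
```

Requiring a minimum sample size avoids reacting to a week in which a tier saw only a handful of requests.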

Dynamic Routing Based on Latency and Availability

Advanced routing systems add latency and availability signals. If Haiku is returning errors or exceeding latency SLOs (e.g., >2 seconds), upgrade that request to Sonnet. If a request has a strict deadline (a user is waiting for the response), route to the faster model. For batch processing with no deadline, route to the cheapest model. Here is a more sophisticated router:

import time

import anthropic

def route_with_latency_budget(
    query: str,
    max_latency_ms: int = 3000,  # User wait threshold
):
    """Route based on difficulty AND latency budget."""
    client = anthropic.Anthropic()

    difficulty = classify_request_difficulty(query)

    # Choose model based on difficulty
    if difficulty == Difficulty.SIMPLE:
        primary_model = "claude-3-5-haiku-20241022"
        fallback_model = "claude-3-5-sonnet-20241022"
    elif difficulty == Difficulty.MODERATE:
        primary_model = "claude-3-5-sonnet-20241022"
        fallback_model = "claude-opus-4-1-20250805"
    else:
        primary_model = "claude-opus-4-1-20250805"
        fallback_model = None

    # perf_counter is monotonic, so it is safer than time.time() for durations
    start = time.perf_counter()

    try:
        response = client.messages.create(
            model=primary_model,
            max_tokens=500,
            messages=[{"role": "user", "content": query}],
            # Enforce the budget: a call slower than this raises and falls through
            # to the fallback. Note the SDK retries failed calls by default, which
            # can extend total latency.
            timeout=max_latency_ms / 1000,
        )
        elapsed_ms = (time.perf_counter() - start) * 1000

        # If response is slow, log for re-routing analysis
        if elapsed_ms > max_latency_ms * 0.8:  # 80% of threshold
            print(f"Slow response on {primary_model}: {elapsed_ms:.0f}ms")

        return response.content[0].text, primary_model, elapsed_ms

    except anthropic.APIError as e:
        # If primary model fails (API error or timeout), escalate
        if fallback_model:
            print(f"Primary model failed ({e}), escalating to {fallback_model}")
            response = client.messages.create(
                model=fallback_model,
                max_tokens=500,
                messages=[{"role": "user", "content": query}],
            )
            elapsed_ms = (time.perf_counter() - start) * 1000
            return response.content[0].text, fallback_model, elapsed_ms
        else:
            raise

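This pattern keeps requests served when the primary model fails while still preferring the cheapest suitable model. Enforcing the latency budget requires a per-request timeout on the primary call, so that a slow call raises an error and triggers the fallback; timing the call after it returns only detects slowness. When Haiku is slow or erroring, the request escalates transparently: the cost is slightly higher, but availability is maintained.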
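Escalating per request still sends every request to a degraded tier first. One possible refinement, sketched here under assumptions, is to track recent outcomes per tier and skip a tier whose error rate or latency is out of bounds. The `TierHealth` class, the window size, and the 20% error threshold are hypothetical; the 2-second SLO is the example figure from this section.

```python
from collections import deque

ESCALATION = {"haiku": "sonnet", "sonnet": "opus"}  # opus has no fallback


class TierHealth:
    """Rolling window of recent call outcomes for one model tier."""

    def __init__(self, window: int = 50, max_error_rate: float = 0.2,
                 latency_slo_ms: float = 2000.0):
        self.outcomes: deque[tuple[bool, float]] = deque(maxlen=window)
        self.max_error_rate = max_error_rate
        self.latency_slo_ms = latency_slo_ms

    def record(self, ok: bool, latency_ms: float) -> None:
        self.outcomes.append((ok, latency_ms))

    def healthy(self) -> bool:
        if not self.outcomes:
            return True  # no evidence yet, so assume the tier is fine
        errors = sum(1 for ok, _ in self.outcomes if not ok)
        if errors / len(self.outcomes) > self.max_error_rate:
            return False
        latencies = sorted(ms for ok, ms in self.outcomes if ok)
        if not latencies:
            return True
        # Approximate 95th percentile of successful-call latency
        p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
        return p95 <= self.latency_slo_ms


def pick_tier(preferred: str, health: dict[str, TierHealth],
              escalation: dict[str, str] = ESCALATION) -> str:
    """Walk up the escalation chain until a healthy tier (or the top tier) is found."""
    tier = preferred
    while tier in escalation and not health[tier].healthy():
        tier = escalation[tier]
    return tier


health = {name: TierHealth(window=10) for name in ("haiku", "sonnet", "opus")}
for _ in range(10):
    health["haiku"].record(ok=False, latency_ms=0.0)  # simulate an outage
print(pick_tier("haiku", health))
```

A router would call `record` after every model call and `pick_tier` before choosing the primary model.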

Key Takeaways

  • Route requests to appropriate models: Haiku for simple tasks (roughly 27% of Sonnet's input price), Sonnet for moderate tasks, Opus for complex reasoning.
  • Build a lightweight classifier (rules + keyword matching) or use Haiku pre-check to determine task difficulty before routing.
  • Measure routing accuracy weekly: aim for 90%+ accuracy per tier while maintaining 40–50% cost savings versus always-Sonnet baseline.
  • Add latency and error handling to routers: escalate if primary model is slow or failing, ensuring quality while minimizing cost.
  • Log routing decisions and outcomes to enable continuous improvement of classification rules.

Frequently Asked Questions

How do I know which tasks are safe to route to Haiku?

Test! Historically, classification, extraction, and simple Q&A work well on Haiku. Code generation, reasoning, and creative writing often fail. Start conservative: route only obvious-simple queries (keyword-matched), measure accuracy, then expand. Aim for 95%+ accuracy per task type before automation.

Should I always route to Haiku first and escalate on failure?

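Yes, if latency permits. Send every request to Haiku with a 5-second timeout; if it succeeds, return the answer. If it times out or fails, escalate to Sonnet. This maximizes Haiku usage, and therefore savings, without hurting availability. A timeout or API error only catches availability failures, though, so protecting quality also requires some check on the Haiku response before accepting it. And if your user-facing latency budget is <2 seconds, this pattern is too slow; pre-classify instead.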
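The try-cheapest-first pattern can be sketched independently of any SDK by injecting the model call as a function. Everything here is illustrative: `cascade`, `call_model`, and `is_acceptable` are hypothetical names, and the default acceptance check (non-empty text) is a placeholder for whatever quality signal your application has.

```python
from typing import Callable


def cascade(
    query: str,
    tiers: list[str],
    call_model: Callable[[str, str, float], str],  # (tier, query, timeout_s) -> text
    is_acceptable: Callable[[str], bool] = lambda text: bool(text.strip()),
    timeout_s: float = 5.0,
) -> tuple[str, str]:
    """Try tiers from cheapest to most capable; return (text, tier that answered)."""
    last_error = None
    for tier in tiers:
        try:
            text = call_model(tier, query, timeout_s)
        except Exception as exc:  # a real client would catch its own timeout/API errors
            last_error = exc
            continue
        if is_acceptable(text):
            return text, tier
    raise RuntimeError("all tiers failed or were rejected") from last_error


# Usage with a fake client in which the cheapest tier times out.
def fake_call(tier: str, query: str, timeout_s: float) -> str:
    if tier == "haiku":
        raise TimeoutError("simulated timeout")
    return f"[{tier}] answer to: {query}"


text, used = cascade("How do I reset my password?", ["haiku", "sonnet"], fake_call)
print(used)
```

The worst-case latency of a cascade is the sum of the timeouts of every tier tried, which is why it suits loose latency budgets only.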

What if my costs are dominated by one feature? Should I special-case it?

Yes. If FAQ response generation is 40% of your budget, build a specialized FAQ route: use a retrieval system to find the canned answer, then call Haiku only to format it. Or use retrieval plus templating, bypassing the LLM entirely. Specialize aggressively; generalized routing is simple but leaves money on the table.

Can I A/B test routing strategies?

Absolutely. Route 50% of traffic to Router A (the always-Sonnet baseline) and 50% to Router B (smart routing), then measure cost and quality metrics for each. If Router B retains at least 90% of Router A's quality score at 60% of its cost, switch over. A/B testing takes 1–2 weeks but is the gold standard for validating improvements.
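For the comparison to be fair, each user should land in the same arm on every request. One common way to do this, shown here as a sketch, is to hash a stable user ID into a bucket; the experiment name, arm labels, and function name are hypothetical.

```python
import hashlib


def assign_arm(user_id: str, experiment: str = "routing-v1",
               treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'smart_routing' or 'always_sonnet'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return "smart_routing" if bucket < treatment_share * 10_000 else "always_sonnet"


arms = [assign_arm(f"user-{i}") for i in range(2000)]
share = arms.count("smart_routing") / len(arms)
print(f"{share:.1%} of users in the smart-routing arm")
```

Including the experiment name in the hash means a later experiment reshuffles users instead of reusing the same split.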

How often should I retrain my classification model?

If using a learned classifier (logistic regression), retrain weekly on new logs. If using rule-based classification, audit quarterly and adjust thresholds if accuracy drifts. Classification rules are stable; retraining is mainly about surfacing new edge cases.

Further Reading