Cost Optimization Through Observability: Reduce Token Spend

LLM costs compound rapidly at scale: for an application running millions of inferences, a 10% improvement in token efficiency can save thousands of dollars a year. Observability enables cost optimization by revealing where tokens are spent (which prompts, models, or user segments are most expensive) and by exposing the patterns that cause waste (redundant context, verbose system prompts, poor model selection). This article walks through using observability data to profile costs, identify optimization opportunities, and validate improvements.

Cost Attribution and Profiling

The first step is understanding where costs originate: which prompts, models, and user segments consume the most tokens?

Prompt-Level Cost Analysis

Track tokens and cost for each unique prompt or prompt template:

from collections import defaultdict
from datetime import datetime, timedelta, timezone

from langfuse import Langfuse

langfuse = Langfuse()

def analyze_prompt_costs(hours_back: int = 24) -> dict:
    """Analyze token consumption by prompt template."""

    # Fetch traces from the past `hours_back` hours.
    # NOTE: method and field names used here (get_traces, input_tokens,
    # output_tokens, cost) vary between Langfuse SDK versions; verify them
    # against the SDK you have installed.
    since = datetime.now(timezone.utc) - timedelta(hours=hours_back)
    traces = langfuse.get_traces(
        limit=10000,
        from_timestamp=since.isoformat()
    )

    prompt_stats = defaultdict(lambda: {
        "count": 0,
        "total_input_tokens": 0,
        "total_output_tokens": 0,
        "total_cost": 0.0
    })

    for trace in traces.data:
        # Use the first 80 characters of the input as a template identifier.
        # str() guards against structured (dict or list) inputs.
        prompt_id = str(trace.input)[:80] if trace.input else "unknown"

        # Aggregate tokens and cost
        prompt_stats[prompt_id]["count"] += 1
        prompt_stats[prompt_id]["total_input_tokens"] += trace.input_tokens or 0
        prompt_stats[prompt_id]["total_output_tokens"] += trace.output_tokens or 0
        prompt_stats[prompt_id]["total_cost"] += trace.cost or 0.0

    # Sort by cost (descending)
    sorted_prompts = sorted(
        prompt_stats.items(),
        key=lambda x: x[1]["total_cost"],
        reverse=True
    )

    # Print top 10 expensive prompts
    print(f"Top 10 expensive prompts ({hours_back}h):")
    for i, (prompt_id, stats) in enumerate(sorted_prompts[:10], 1):
        avg_cost = stats["total_cost"] / stats["count"] if stats["count"] > 0 else 0
        print(f"{i}. Prompt: {prompt_id[:60]}...")
        print(f"   Calls: {stats['count']}, Avg cost: ${avg_cost:.6f}, Total: ${stats['total_cost']:.2f}")

    return dict(sorted_prompts)

# Run analysis
prompt_costs = analyze_prompt_costs(hours_back=24)

From this analysis, identify the top 5 most expensive prompts. Then investigate: is the cost due to high input token count (verbose context)? High output token count (verbose responses)? Inefficient prompting (asking the model to do extra work)?
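One way to answer the input-versus-output question is to split each prompt's estimated spend by token type. The sketch below is illustrative only: `cost_driver`, its 60/40 thresholds, and the per-million-token prices passed in are assumptions rather than anything provided by Langfuse. It consumes a stats dictionary shaped like the ones built in `analyze_prompt_costs`.

```python
def cost_driver(stats: dict, input_price_per_mtok: float, output_price_per_mtok: float) -> str:
    """Label a prompt as input-heavy, output-heavy, or balanced by estimated spend.

    The 60/40 thresholds are an arbitrary, hypothetical choice for illustration.
    """
    input_cost = stats["total_input_tokens"] / 1_000_000 * input_price_per_mtok
    output_cost = stats["total_output_tokens"] / 1_000_000 * output_price_per_mtok
    total = input_cost + output_cost
    if total == 0:
        return "no usage recorded"

    input_share = input_cost / total
    if input_share >= 0.6:
        return "input-heavy"   # look at context size and system prompt length
    if input_share <= 0.4:
        return "output-heavy"  # look at response length and max_tokens
    return "balanced"


# Example with assumed prices of $15 (input) and $75 (output) per 1M tokens
example = {"total_input_tokens": 1_000_000, "total_output_tokens": 10_000}
print(cost_driver(example, 15.0, 75.0))
```

Input-heavy prompts are candidates for trimming or caching context; output-heavy prompts are candidates for tighter output limits.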

Cost by Model

Track which models consume the most tokens and cost the most:

def analyze_cost_by_model(hours_back: int = 24) -> dict:
    """Analyze token consumption and cost by model."""

    # NOTE: method and field names used here vary between Langfuse SDK
    # versions; verify them against the SDK you have installed.
    since = datetime.now(timezone.utc) - timedelta(hours=hours_back)
    traces = langfuse.get_traces(
        limit=10000,
        from_timestamp=since.isoformat()
    )

    model_stats = defaultdict(lambda: {
        "count": 0,
        "total_cost": 0.0,
        "total_tokens": 0
    })

    for trace in traces.data:
        # Fall back to "unknown" when metadata is missing or has no model key
        model = (trace.metadata or {}).get("model", "unknown")
        model_stats[model]["count"] += 1
        model_stats[model]["total_cost"] += trace.cost or 0.0
        model_stats[model]["total_tokens"] += (trace.input_tokens or 0) + (trace.output_tokens or 0)

    # Print cost by model, most expensive first
    print("Cost breakdown by model:")
    for model, stats in sorted(model_stats.items(), key=lambda x: x[1]["total_cost"], reverse=True):
        cost_per_ktoken = (stats["total_cost"] / (stats["total_tokens"] / 1000)) if stats["total_tokens"] > 0 else 0
        print(f"{model}: {stats['count']} calls, {stats['total_tokens']:,} tokens, ${stats['total_cost']:.2f} (${cost_per_ktoken:.4f} per 1K tokens)")

    return dict(model_stats)

model_costs = analyze_cost_by_model(hours_back=24)

If a cheaper model (e.g., Claude 3.5 Haiku, $0.80 per 1M input tokens) produces acceptable quality for a task currently using an expensive model (Claude 3 Opus, $15 per 1M input tokens), migrating that task can reduce its input-token cost by about 95%.
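The arithmetic behind that estimate can be sketched as follows. The helper names `request_cost` and `savings_pct` are hypothetical, and while the input prices come from the paragraph above, the output prices ($75 and $4 per 1M tokens) and the token counts are assumptions for illustration; check current pricing before relying on them.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Estimated USD cost of one request at per-million-token prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


def savings_pct(current_cost: float, candidate_cost: float) -> float:
    """Percentage saved by moving from the current cost to the candidate cost."""
    if current_cost <= 0:
        raise ValueError("current_cost must be positive")
    return (current_cost - candidate_cost) / current_cost * 100


# Assumed workload: 1,000 input tokens and 200 output tokens per request
expensive = request_cost(1_000, 200, 15.0, 75.0)  # input price from the text, output price assumed
cheap = request_cost(1_000, 200, 0.80, 4.0)       # input price from the text, output price assumed
print(f"Expensive: ${expensive:.4f}, cheap: ${cheap:.4f}, savings: {savings_pct(expensive, cheap):.1f}%")
```

Including output tokens matters because output is usually priced higher than input, so the blended saving can differ from the input-only figure.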

Cost Optimization Strategies

1. Prompt Optimization

Shorter prompts consume fewer input tokens, and tightening the instructions can also improve model performance:

# Example: Verbose prompt vs optimized prompt

verbose_prompt = """
You are an expert customer service representative for an e-commerce company.
Your job is to help customers with their inquiries about orders, returns, shipping,
and general product questions. You should be friendly, helpful, and professional.
You should always try to resolve the customer's issue as quickly as possible.
If you cannot resolve the issue, escalate to a human agent.
Today's date is 2026-06-02. The customer asking the question is:

{customer_info}

The customer's question is:
{customer_question}

Please respond with a helpful answer.
"""

optimized_prompt = """
You are a customer service rep. Answer briefly and helpfully.

Customer: {customer_info}
Question: {customer_question}
"""

# Rough size comparison. Whitespace-separated word counts are only a proxy
# for tokens; use your provider's tokenizer or token-counting endpoint for
# exact numbers.
verbose_words = len(verbose_prompt.split())
optimized_words = len(optimized_prompt.split())

# The optimized prompt is roughly 80% shorter
print(f"Verbose: ~{verbose_words} words, Optimized: ~{optimized_words} words")
print(f"Savings: {((verbose_words - optimized_words) / verbose_words * 100):.0f}%")

Strategies:

  • Remove redundant instructions; the model can usually infer them from the rest of the prompt.
  • Use a short example instead of a lengthy explanation.
  • Put stable instructions in the system message, where APIs that support prompt caching can reuse them.
  • Use parameterization: insert only the fields a request needs via {variable} placeholders instead of repeating full context.
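As a minimal sketch of the parameterization strategy, the function below fills the optimized template with only the customer fields a task needs. `render_prompt`, the `fields` parameter, and the sample customer record are hypothetical names introduced for illustration.

```python
PROMPT_TEMPLATE = (
    "You are a customer service rep. Answer briefly and helpfully.\n\n"
    "Customer: {customer_info}\n"
    "Question: {customer_question}\n"
)


def render_prompt(customer: dict, question: str,
                  fields: tuple[str, ...] = ("name", "tier")) -> str:
    """Fill the template with only the selected customer fields."""
    # Skip fields that are absent instead of raising, so sparse records still render
    customer_info = ", ".join(f"{key}={customer[key]}" for key in fields if key in customer)
    return PROMPT_TEMPLATE.format(customer_info=customer_info, customer_question=question)


customer = {"name": "Ada", "tier": "Premium", "order_history": "a long list of past orders"}
print(render_prompt(customer, "Where is my order?"))
```

Fields that are not requested, such as the long order history here, never reach the prompt, so they cost no input tokens.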

2. Caching Repeated Context

If the same context (e.g., product catalog, user profile) is included in many requests, cache it and reuse:

from anthropic import Anthropic

# USD per token for Claude 3 Opus ($15 per 1M input tokens). On Anthropic's API,
# cache writes are billed at 1.25x and cache reads at 0.1x the base input price
# (verify against current pricing).
INPUT_PRICE = 15 / 1_000_000
CACHE_WRITE_PRICE = INPUT_PRICE * 1.25
CACHE_READ_PRICE = INPUT_PRICE * 0.10

def query_with_cached_context(user_message: str, customer_id: str):
    """Use Claude's prompt caching to reduce cost of repeated context."""

    client = Anthropic()

    # Fetch customer context (in production, cache this in your database)
    customer_context = fetch_customer_context(customer_id)

    # Mark the repeated context as cacheable. Prompts shorter than the model's
    # minimum cacheable length (1,024 tokens for this model) are not cached,
    # so the short context returned below will not produce cache hits.
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a customer service representative."
            },
            {
                "type": "text",
                "text": f"Customer context:\n{customer_context}",
                # Everything up to and including this block is cached and reused
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )

    # The first request pays the cache-write price; later requests that hit the
    # cache pay the cheaper cache-read price for the cached tokens.
    usage = response.usage
    cache_write_tokens = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read_tokens = getattr(usage, "cache_read_input_tokens", 0) or 0
    total_cost = (
        usage.input_tokens * INPUT_PRICE
        + cache_write_tokens * CACHE_WRITE_PRICE
        + cache_read_tokens * CACHE_READ_PRICE
    )

    print(f"Input cost: ${total_cost:.6f} (cache read: {cache_read_tokens} tokens)")

    return response.content[0].text

def fetch_customer_context(customer_id: str) -> str:
    """Fetch customer profile, order history, etc."""
    return f"Customer ID: {customer_id}\nOrders: 5\nTier: Premium"

Caching can reduce repeated-context costs by roughly 70% or more after the first request, depending on the cache hit rate and the provider's cache-read pricing.
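How much caching saves depends on how often the cached context is reused. The sketch below compares cached and uncached input cost for a block of repeated context. It assumes a write multiplier of 1.25 and a read multiplier of 0.10 relative to the base input price, and that every request after the first hits the cache before it expires; `cached_vs_uncached_cost` and its parameters are hypothetical names, and engineering overhead is not modeled.

```python
def cached_vs_uncached_cost(context_tokens: int, requests: int,
                            input_price_per_mtok: float,
                            write_multiplier: float = 1.25,
                            read_multiplier: float = 0.10) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in USD for the repeated context only."""
    if requests < 1:
        raise ValueError("requests must be at least 1")
    base = context_tokens / 1_000_000 * input_price_per_mtok
    uncached = base * requests
    # One cache write, then cache reads for every later request (assumes no expiry)
    cached = base * write_multiplier + base * read_multiplier * (requests - 1)
    return uncached, cached


for n in (1, 2, 10):
    uncached, cached = cached_vs_uncached_cost(2_000, n, 15.0)
    print(f"{n} request(s): uncached ${uncached:.4f}, cached ${cached:.4f}")
```

With a single request the cached path costs more because of the write premium, which is why caching only pays off for context that is actually reused.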

3. Model Selection and Batching

For batch or other non-interactive jobs, default to the cheapest model that meets the task's quality bar:

import time

from anthropic import Anthropic

def choose_model_for_task(task_type: str) -> str:
    """Choose the most cost-effective model for a given task."""

    model_recommendations = {
        "email_summarization": "claude-3-5-haiku-20241022",  # Fast, cheap, good for simple tasks
        "code_review": "claude-3-opus-20240229",  # Expensive but needed for complex analysis
        "customer_support": "claude-3-5-sonnet-20241022",  # Balanced cost/performance
        "batch_processing": "claude-3-5-haiku-20241022",  # Use cheapest for high-volume background jobs
    }

    return model_recommendations.get(task_type, "claude-3-5-sonnet-20241022")

def batch_process_documents(documents: list[str]) -> list[str]:
    """Process documents in a background loop with the cheapest suitable model."""

    client = Anthropic()

    # Use the cheaper model for batch processing
    model = choose_model_for_task("batch_processing")

    results = []
    for i, doc in enumerate(documents):
        response = client.messages.create(
            model=model,
            max_tokens=512,
            # Only the first 500 characters of each document are sent
            messages=[{"role": "user", "content": f"Summarize: {doc[:500]}"}]
        )

        results.append(response.content[0].text)

        # Add delay to avoid rate limits (batch jobs are not latency-critical)
        if (i + 1) % 10 == 0:
            time.sleep(1)

    return results

# Example: Processing 1000 documents
# Claude 3 Opus for 1000 docs: ~100 input tokens * 1000 * $0.000015 = $1.50
# Claude 3.5 Haiku for 1000 docs: ~100 input tokens * 1000 * $0.00000080 = $0.08
# Savings: ~95% on input tokens. The trade-off is output quality, so spot-check
# the cheaper model's results before migrating the whole job.

4. Output Token Management

Control output token usage by setting max_tokens appropriately:

from anthropic import Anthropic

def llm_call_with_token_limit(user_message: str, task_type: str) -> str:
    """Limit output tokens based on task requirements."""

    max_tokens_by_task = {
        "classification": 10,  # Just output category
        "extraction": 500,  # Extract key fields
        "summarization": 300,  # Concise summary
        "generation": 2000,  # Full response
    }

    max_tokens = max_tokens_by_task.get(task_type, 1024)

    client = Anthropic()
    # max_tokens truncates the output; it does not make the model more concise.
    # Also instruct the model to be brief (e.g., "Reply with the category only").
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=max_tokens,  # Constrain output
        messages=[{"role": "user", "content": user_message}]
    )

    return response.content[0].text

# Example: Classification task (Claude 3 Opus, $75 per 1M output tokens)
# Unrestricted max_tokens=2048: model generates 150 tokens on average = $0.01125
# Restricted max_tokens=10: model generates 8 tokens on average = $0.0006
# Savings: 95% on output tokens
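Rather than guessing limits, observed output lengths from your traces can inform them. The sketch below picks a limit from a high percentile of observed output token counts plus some headroom; `suggest_max_tokens`, the 99th-percentile default, and the 1.2 headroom factor are assumptions chosen for illustration.

```python
import math


def suggest_max_tokens(observed_output_tokens: list[int],
                       percentile: float = 0.99,
                       headroom: float = 1.2) -> int:
    """Suggest a max_tokens value from observed output lengths."""
    if not observed_output_tokens:
        raise ValueError("need at least one observation")
    ordered = sorted(observed_output_tokens)
    # Nearest-rank percentile: the smallest value with at least
    # `percentile` of the observations at or below it
    rank = max(1, math.ceil(percentile * len(ordered)))
    return math.ceil(ordered[rank - 1] * headroom)


# Example with made-up observations for a classification task
print(suggest_max_tokens([4, 10, 20, 8], percentile=0.5, headroom=1.5))
```

A limit set this way rarely truncates legitimate responses while still capping runaway generations.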

Validating Cost Reductions

After implementing optimizations, measure their impact via observability:

def compare_cost_before_after(old_traces_query: str, new_traces_query: str,
                              inferences_per_day: int = 30000) -> dict:
    """Compare average cost before and after optimization."""

    def summarize(query: str) -> tuple[float, float]:
        """Return (average cost, total cost) for traces matching the query."""
        # NOTE: search_traces and the per-trace cost field vary between
        # Langfuse SDK versions; verify them against the SDK you have installed.
        traces = langfuse.search_traces(query=query, limit=1000)
        costs = [t.cost or 0.0 for t in traces.data]  # treat missing cost as 0
        total = sum(costs)
        return (total / len(costs) if costs else 0.0), total

    # Traces before and after the optimization
    old_avg_cost, old_total_cost = summarize(old_traces_query)
    new_avg_cost, new_total_cost = summarize(new_traces_query)

    # Calculate savings
    savings_pct = ((old_avg_cost - new_avg_cost) / old_avg_cost * 100) if old_avg_cost > 0 else 0

    print("Cost optimization results:")
    print(f"  Before: ${old_avg_cost:.6f} per inference, ${old_total_cost:.2f} total")
    print(f"  After: ${new_avg_cost:.6f} per inference, ${new_total_cost:.2f} total")
    print(f"  Savings: {savings_pct:.1f}% per inference")

    return {
        "old_avg_cost": old_avg_cost,
        "new_avg_cost": new_avg_cost,
        "savings_pct": savings_pct,
        # Per-inference saving * daily volume * 30 days
        "monthly_savings": (old_avg_cost - new_avg_cost) * inferences_per_day * 30
    }

# Example (assuming 30k inferences/day)
improvements = compare_cost_before_after(
    old_traces_query="metadata.version=1.0",
    new_traces_query="metadata.version=2.0"
)
print(f"Monthly savings: ${improvements['monthly_savings']:.2f}")
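A difference in averages can come from noise or a shift in traffic mix, so it helps to put an uncertainty range around the saving. The sketch below uses a simple percentile bootstrap on two lists of per-inference costs. `bootstrap_savings_ci` is a hypothetical helper, the sample costs are made up, and the approach assumes the two samples are independent and representative.

```python
import random
from statistics import mean


def bootstrap_savings_ci(old_costs: list[float], new_costs: list[float],
                         n_resamples: int = 2000, alpha: float = 0.05,
                         seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for mean(old) - mean(new), in USD per inference."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement and record the difference in means
        old_sample = rng.choices(old_costs, k=len(old_costs))
        new_sample = rng.choices(new_costs, k=len(new_costs))
        diffs.append(mean(old_sample) - mean(new_sample))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-inference costs before and after an optimization
old = [0.009, 0.010, 0.011] * 10
new = [0.005, 0.006, 0.007] * 10
low, high = bootstrap_savings_ci(old, new)
print(f"Estimated saving per inference: between ${low:.4f} and ${high:.4f}")
```

If the interval includes zero, the measured saving may not be real; collect more traces before reporting it.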

Key Takeaways

  • Profile costs by prompt, model, and user segment to identify optimization opportunities.
  • Optimize prompts: remove verbose instructions, use caching for repeated context, constrain output token limits.
  • Choose cheaper models wherever their quality is sufficient, especially for high-volume background jobs; reserve expensive models for tasks that need them.
  • Implement prompt caching to reduce repeated-context costs by roughly 70% or more.
  • Validate improvements by comparing traces before and after optimization via observability dashboards.

Frequently Asked Questions

How much can I realistically reduce costs through optimization?

10–30% reduction is achievable through prompt optimization and better model selection. 50%+ reduction is possible if you combine multiple strategies (caching, batching, cheaper models, output limits). After 50%, diminishing returns set in; further savings require architectural changes (fallback systems, user-facing choices).

Should I always use the cheapest model?

No. Cheaper models may produce lower-quality outputs, leading to user churn or resubmissions (which increases cost). Use cost vs. quality trade-off analysis: if a 10% cost reduction leads to 5% user churn, it is not worth it. Always A/B test cheaper models with a subset of users first.
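The resubmission effect can be folded into a single number: the expected cost per successful request. The sketch below assumes a failed response is retried until it succeeds, with the same independent success probability each time, so the expected number of attempts is 1 / success_rate. `cost_per_successful_request` and the example costs and success rates are hypothetical.

```python
def cost_per_successful_request(cost_per_call: float, success_rate: float) -> float:
    """Expected USD cost to obtain one acceptable response, counting retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # Expected attempts until first success is 1 / success_rate
    return cost_per_call / success_rate


# Made-up numbers: a cheap model that succeeds less often vs an expensive one
cheap = cost_per_successful_request(0.002, 0.80)
expensive = cost_per_successful_request(0.030, 0.98)
print(f"Cheap: ${cheap:.4f}, expensive: ${expensive:.4f} per successful request")
```

This captures only direct retry cost; churn and lost user trust are not modeled and still need to be measured in an A/B test.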

Is prompt caching worth implementing?

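Yes, if you have repeated context across requests (e.g., customer support with user profiles). On token pricing alone, a cache write costs slightly more than a normal input read (1.25x the base input price on Anthropic's API) and a cache hit costs much less (0.1x), so caching pays for itself from the second request that hits the cache before it expires. For one-off queries, the write premium and added complexity are not justified.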

How do I handle dynamic context that changes frequently?

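For context that updates hourly or daily, explicit invalidation is usually unnecessary: the cache is keyed on the exact prompt prefix, so updated context misses the old entry and is written as a new one, while unused entries expire on their own (Anthropic's ephemeral cache has a five-minute TTL by default, refreshed on each hit). For per-request dynamic data, embed only the essential subset in the prompt, place it after the cached blocks, and fetch the rest from a database.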
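One way to keep frequently changing data from spoiling cache hits is to order the system blocks so that stable content comes first and carries the cache marker, with volatile content after it. The sketch below only builds the `system` list for an Anthropic Messages request and makes no API call; `build_system_blocks` and its arguments are hypothetical names.

```python
def build_system_blocks(stable_context: str, volatile_context: str = "") -> list[dict]:
    """Build system blocks with the stable context cached and the volatile context uncached."""
    blocks = [
        {
            "type": "text",
            "text": stable_context,
            # The cache marker covers the prompt prefix up to and including this block
            "cache_control": {"type": "ephemeral"},
        }
    ]
    if volatile_context:
        # Placed after the marker, so changes here do not alter the cached prefix
        blocks.append({"type": "text", "text": volatile_context})
    return blocks


blocks = build_system_blocks("Product catalog: ...", "Cart: 2 items")
print([("cache_control" in block) for block in blocks])
```

The returned list can be passed as the `system` argument in the same way as the caching example earlier in the article.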

Further Reading