Cost-Aware Prompt Design: Architecture for Efficiency

Cost-aware prompt design shapes prompts to minimize token consumption while preserving or improving accuracy. A naive prompt for document classification is: "Analyze this 5,000-token document and classify it into one of ten categories, explaining your reasoning." The response might be 300 tokens of explanation, for a total of 5,300 tokens. A cost-optimized prompt for the same task is: "Classify into one of: [categories]. Respond with category name only." The response is about 5 tokens (just the label), for a total of 5,005 tokens. That is 60× fewer output tokens and roughly 6% fewer total tokens for the same label. Because output tokens are typically priced several times higher than input tokens, the cost saving is larger than the raw token count suggests. Cost-aware design applies across the entire prompt stack: system message, input format, output format, error handling. The discipline integrates cost considerations into prompt engineering workflows, so teams build high-performance systems that are cost-efficient by design, not by retrofit.
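The arithmetic in the comparison above can be checked directly. This is a minimal sketch that uses only the token counts quoted in the example and, like the example, ignores the handful of tokens in the instruction itself.

```python
# Token counts taken from the classification example above.
DOC_TOKENS = 5_000
NAIVE_OUTPUT = 300      # label plus explanation
OPTIMIZED_OUTPUT = 5    # label only

naive_total = DOC_TOKENS + NAIVE_OUTPUT          # 5,300
optimized_total = DOC_TOKENS + OPTIMIZED_OUTPUT  # 5,005

# Ratio of output tokens, and fraction of total tokens saved.
output_ratio = NAIVE_OUTPUT / OPTIMIZED_OUTPUT
total_saving = (naive_total - optimized_total) / naive_total

print(f"Output tokens: {output_ratio:.0f}x fewer")
print(f"Total tokens:  {total_saving:.1%} fewer")
```

When the input document is long, most of the spend is on input tokens, so trimming the output helps most where output pricing is much higher than input pricing or where documents are short.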

Cost-Aware Prompt Structure

A cost-aware prompt has five layers, each optimized for token efficiency:

  1. System message: 50–150 tokens, defining the assistant's role and constraints.
  2. Input structure: Minimal, well-formatted context (use JSON or XML, not prose).
  3. Task definition: Concise, specific instruction (avoid ambiguity that requires clarification).
  4. Output format: Structured (JSON, CSV, concise text), not prose narratives.
  5. Example (optional): One or two well-chosen examples, not exhaustive few-shot.
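The five layers above can be sketched as a small request builder. This is an illustrative helper, not part of any SDK: the name `build_request`, its parameters, and the layout of the user message are all assumptions made for the example.

```python
import json
from typing import List, Optional

def build_request(system: str, task: str, payload: dict, output_spec: str,
                  examples: Optional[List[str]] = None,
                  max_tokens: int = 50) -> dict:
    """Assemble the five layers into keyword arguments for a chat-style call."""
    parts = [task, f"Output: {output_spec}"]
    if examples:
        # Layer 5: keep at most two examples.
        parts.append("Examples:\n" + "\n".join(examples[:2]))
    # Layer 2: compact JSON (no spaces or indentation) keeps the input small.
    parts.append("Input: " + json.dumps(payload, separators=(",", ":")))
    return {
        "system": system,          # Layer 1
        "max_tokens": max_tokens,  # hard cap on output length
        "messages": [{"role": "user", "content": "\n".join(parts)}],
    }

request = build_request(
    system="You are a text classifier.",
    task="Classify the input.",
    payload={"title": "Q3 revenue", "body": "Revenue grew 12%."},
    output_spec="one of TECHNICAL, BUSINESS, LEGAL, OTHER",
    examples=['{"title":"API limits"} -> TECHNICAL'],
)
print(request["messages"][0]["content"])
```

Keeping the layers as separate arguments makes it easy to measure the token cost of each one and to trim the layer that contributes most.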

Here is a cost-optimized classification system:

import anthropic
import json

client = anthropic.Anthropic()

# Cost-optimized system message: a few dozen tokens
SYSTEM_PROMPT = """You are a text classifier.
Classify input into: [TECHNICAL, BUSINESS, LEGAL, OTHER].
Output only the category name, no explanation."""

def classify_document_optimized(document_text: str) -> dict:
    """Classify a document with minimal tokens."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=10,  # Hard cap on output; the system prompt is what keeps the answer to one label
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": f"Classify: {document_text}",
            }
        ],
    )

    category = response.content[0].text.strip().upper()

    return {
        "document_preview": document_text[:100],
        "category": category,
        "tokens_used": response.usage.input_tokens + response.usage.output_tokens,
    }

# Example
doc = "The API rate limit is 100 requests per second per account."
result = classify_document_optimized(doc)
print(f"Category: {result['category']}, Tokens: {result['tokens_used']}")
# Example output (token count varies): Category: TECHNICAL, Tokens: 87

Compare with an unoptimized version:

# Unoptimized system message: 150+ tokens
SYSTEM_PROMPT_VERBOSE = """You are an expert text classifier with years of experience analyzing documents.

Your task is to carefully analyze the provided document and classify it into one of the following categories:
1. TECHNICAL - documents containing technical information, code, or system details
2. BUSINESS - documents related to company operations, finance, or strategy
3. LEGAL - documents containing legal clauses, contracts, or compliance information
4. OTHER - documents that don't fit other categories

Please provide your classification along with a brief explanation of your reasoning.
Consider the context, terminology, and overall intent of the document."""

def classify_document_unoptimized(document_text: str) -> dict:
    """Classify a document with verbose prompting."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=200,  # Allows verbose output
        system=SYSTEM_PROMPT_VERBOSE,
        messages=[
            {
                "role": "user",
                "content": f"Please analyze and classify the following document:\n\n{document_text}",
            }
        ],
    )

    return {
        "document_preview": document_text[:100],
        "response": response.content[0].text,
        "tokens_used": response.usage.input_tokens + response.usage.output_tokens,
    }

# Same example document
result = classify_document_unoptimized(doc)
print(f"Tokens: {result['tokens_used']}")
# Example output: Tokens: ~250

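The unoptimized version uses roughly 3× as many tokens (about 250 vs 87) for the same classification. At 100,000 documents per month, the illustrative figures are $20 (optimized) versus $60 (unoptimized), a $40 monthly saving. The exact amounts depend on model pricing and on the split between input and output tokens, and the gap narrows for long documents, where the document text itself dominates the input.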
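A projection like the one above comes down to tokens per request times price times volume. The sketch below shows the calculation; the per-million-token prices and the input/output splits of the 87 and 250 token totals are placeholder assumptions, so the result is not meant to reproduce the dollar figures quoted above. Substitute your provider's current prices and your own measured token counts.

```python
def monthly_cost(input_tokens: int, output_tokens: int, requests: int,
                 input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Projected spend for a fixed per-request token profile."""
    per_request = (input_tokens * input_price_per_mtok
                   + output_tokens * output_price_per_mtok) / 1_000_000
    return per_request * requests

# Placeholder prices (USD per million tokens); NOT real list prices.
IN_PRICE, OUT_PRICE = 1.00, 5.00

# Assumed splits of the 87- and 250-token totals into input and output.
optimized = monthly_cost(82, 5, 100_000, IN_PRICE, OUT_PRICE)
unoptimized = monthly_cost(150, 100, 100_000, IN_PRICE, OUT_PRICE)

print(f"optimized:   ${optimized:.2f}")
print(f"unoptimized: ${unoptimized:.2f}")
```

Pricing input and output separately matters: a verbose prompt mostly adds input tokens, while an explanation in the response adds the more expensive output tokens.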

Structured Output for Token Efficiency

Structured output (JSON, CSV, XML) is more token-efficient than prose because it omits natural language fluff. Here is an extraction task:

# Cost-optimized: JSON structure, no prose
SYSTEM_PROMPT_JSON = """Extract facts into JSON.
Output only valid JSON, no explanation."""

def extract_facts_optimized(text: str) -> dict:
    """Extract facts from text into JSON."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=300,
        system=SYSTEM_PROMPT_JSON,
        messages=[
            {
                "role": "user",
                "content": f"""Extract from this text:
Text: {text}

Output format: {{"name": "...", "role": "...", "company": "..."}}""",
            }
        ],
    )

    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Return the raw text so the caller can inspect or retry
        return {"error": response.content[0].text}

# Unoptimized: prose-based response
SYSTEM_PROMPT_PROSE = """You are an information extraction assistant.
Carefully read the provided text and extract key facts about the person,
including their name, role, and company affiliation.
Provide your findings in a clear, well-formatted narrative."""

def extract_facts_unoptimized(text: str) -> str:
    """Extract facts as prose."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=300,
        system=SYSTEM_PROMPT_PROSE,
        messages=[
            {
                "role": "user",
                "content": f"Extract facts about the person in this text:\n\n{text}",
            }
        ],
    )

    return response.content[0].text

text = "Alice Chen is a software engineer at Acme Corp, specializing in backend systems."

# Compare
result_opt = extract_facts_optimized(text)
print(f"Optimized: {json.dumps(result_opt)}")
# Example output: Optimized: {"name": "Alice Chen", "role": "software engineer", "company": "Acme Corp"}

# Unoptimized response: "According to the provided text, Alice Chen is a software engineer..."

The JSON-structured version returns ~50 tokens; the prose version returns ~80 tokens. Structured output is more parseable for downstream systems and more token-efficient.

Avoiding Refusal Loops and Retry Costs

A refusal loop occurs when a prompt causes the model to refuse the task (e.g., "I can't help with that"), which triggers a retry with a rephrased request that is refused again, and so on. Every attempt is billed, so the loop wastes tokens and raises cost. Minimize refusals through careful prompt design:

# Problematic prompt (high refusal risk)
system_bad = "You are a helpful assistant. Reject any harmful requests."

# Better: frame task positively
system_good = """You are a task executor.
Complete the requested task concisely and accurately.
If a request is ambiguous, clarify briefly instead of refusing."""

# Example: code generation
def generate_code_optimized(task: str) -> str:
    """Generate code with low refusal risk."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=200,
        system="Generate Python code. Respond with code only.",
        messages=[
            {
                "role": "user",
                "content": task,
            }
        ],
    )

    # If refusal, cost is wasted (still charged for input/output)
    return response.content[0].text

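To measure refusal rate, log each response and check whether it contains phrases such as "I can't," "I shouldn't," or "I'm unable to." Phrase matching is a heuristic, so spot-check the flagged responses. Track the rate weekly; if it exceeds 5%, investigate and refine the system prompt. Each refusal that triggers a retry is a wasted API call. Because refusals are short, the saving is modest per call: as a rough guide, every 1% reduction in refusal rate saves on the order of 0.1% of your LLM budget.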
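The phrase check described above can be sketched as follows. The marker list extends the three phrases from the text with "I cannot", and limiting the check to the first 200 characters is an assumption made here to reduce false positives from code or quoted text later in a response.

```python
from typing import List

# Lowercase markers; "i cannot" is an added variant, not from the list above.
REFUSAL_MARKERS = ("i can't", "i cannot", "i shouldn't", "i'm unable to")

def is_refusal(text: str) -> bool:
    """Heuristic: does the opening of the response look like a refusal?"""
    # Normalize curly apostrophes so typographic quotes match too.
    opening = text[:200].lower().replace("\u2019", "'")
    return any(marker in opening for marker in REFUSAL_MARKERS)

def refusal_rate(responses: List[str]) -> float:
    """Fraction of logged responses flagged as refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)

logged = [
    "I can't help with that.",
    "def add(a, b):\n    return a + b",
    "I\u2019m unable to generate that code.",
    "print('hello')",
]
print(f"Refusal rate: {refusal_rate(logged):.0%}")
```

A string heuristic will miss refusals phrased differently and may flag legitimate answers, so treat the number as a trend indicator and review a sample by hand.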

Few-Shot Examples: Minimal vs Maximal

Few-shot examples improve accuracy but increase token count. Use the minimum effective number:

# Minimal: one example (50 tokens)
system_minimal = """Classify sentiment as positive/negative.
Example: "Great product!" → positive"""

# Verbose: five examples (200+ tokens)
system_verbose = """Classify sentiment as positive/negative.
Examples:
1. "Great product, highly recommend!" → positive
2. "Works as advertised." → positive
3. "Disappointing, broke after a week." → negative
4. "Would not buy again." → negative
5. "Okay, but not worth the price." → negative
[... many more examples ...]"""

# Test: is one example sufficient?
# Measure accuracy on a held-out test set.
# If accuracy(1 example) = 0.92 and accuracy(5 examples) = 0.94,
# the +2% accuracy is not worth the +150 tokens/request.
# Use the minimal set that achieves >90% accuracy.

Rule of thumb: one or two well-chosen examples are almost always sufficient. Unless a task is extremely nuanced (e.g., tone detection in ambiguous language), examples beyond the second provide diminishing returns.
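Choosing the example count can be reduced to a lookup once accuracy has been measured on a held-out set. In this sketch the accuracy table is hypothetical apart from the 0.92 and 0.94 figures used in the comments above, and `smallest_sufficient_k` is an illustrative name.

```python
from typing import Dict, Optional

def smallest_sufficient_k(accuracy_by_k: Dict[int, float],
                          target: float) -> Optional[int]:
    """Return the fewest examples whose measured accuracy meets the target."""
    for k in sorted(accuracy_by_k):
        if accuracy_by_k[k] >= target:
            return k
    return None  # no tested configuration reaches the target

# Hypothetical held-out accuracy, keyed by number of few-shot examples.
measured = {0: 0.85, 1: 0.92, 2: 0.93, 5: 0.94}

print(smallest_sufficient_k(measured, target=0.90))
```

If the function returns `None`, adding more examples is only one option; a clearer task definition or a stronger model may be cheaper than a long few-shot block sent on every request.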

Token Budget Constraints in Prompts

Set tight constraints on output length to prevent runaway generations:

def summarize_optimized(document: str, target_length: str = "3 sentences") -> str:
    """Summarize with strict output constraint."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=150,  # Hard cap as a safety net; the system prompt sets the target length
        system=f"Summarize in {target_length}. No fluff.",
        messages=[
            {
                "role": "user",
                "content": f"Summarize:\n\n{document}",
            }
        ],
    )

    return response.content[0].text

# Compare: unoptimized (no constraint)
def summarize_unoptimized(document: str) -> str:
    """Summarize without output constraint."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=1000,  # Loose budget, might use all 1000 tokens
        system="Provide a comprehensive summary with detailed explanations.",
        messages=[
            {
                "role": "user",
                "content": f"Please provide a thorough summary of the following document:\n\n{document}",
            }
        ],
    )

    return response.content[0].text

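In the optimized version, the instruction "Summarize in 3 sentences" shapes the length, and max_tokens=150 acts as a hard cap: generation stops at the limit, so a response that runs long is cut off rather than shortened. The model typically uses 80–150 tokens. In the unoptimized version, max_tokens=1000 and a verbose system prompt often result in 300–600 token outputs, roughly 4× as many tokens for similar information.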
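Because a tight cap can cut a response off mid-sentence, it is worth checking whether the cap was hit. In the Anthropic Messages API the response carries a `stop_reason` field whose value is `"max_tokens"` when generation stopped at the limit. The sketch below uses a stand-in object instead of a live API call so it runs offline; `was_truncated` is an illustrative helper name.

```python
from types import SimpleNamespace

def was_truncated(response) -> bool:
    """True when generation stopped because it reached the max_tokens cap."""
    return getattr(response, "stop_reason", None) == "max_tokens"

# Stand-ins for API responses (assumption: only stop_reason is needed here).
cut_off = SimpleNamespace(stop_reason="max_tokens")
complete = SimpleNamespace(stop_reason="end_turn")

for name, resp in [("cut_off", cut_off), ("complete", complete)]:
    print(name, "truncated" if was_truncated(resp) else "ok")
```

Logging the truncation rate alongside token usage shows whether a cap is too tight: a cap that truncates often saves tokens on paper but produces unusable output.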

Architectural Patterns: Routing Before Generation

A cost-optimized architecture routes requests to appropriate handlers before invoking an LLM:

def process_query(query: str, document: str = "") -> str:
    """Route query to cheapest appropriate handler."""

    # Route 1: Exact match (no LLM)
    faq_map = {
        "How do I reset my password?": "Go to login page and click 'Forgot Password'.",
        "What is your pricing?": "See pricing.html for current rates.",
    }
    if query in faq_map:
        return faq_map[query]  # $0 cost

    # Route 2: Simple prefix-based classification (no LLM)
    normalized = query.lower()
    if normalized.startswith("what is"):
        category = "definition"
    elif normalized.startswith("how do i"):
        category = "howto"
    else:
        category = "complex"

    # Route 3: Haiku for simple queries
    if category in ["definition", "howto"]:
        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=100,
            messages=[{"role": "user", "content": query}],
        )
        return response.content[0].text  # ~$0.0005 cost

    # Route 4: Sonnet for complex queries, with document context when provided
    content = f"Document:\n{document}\n\nQuery: {query}" if document else query
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        messages=[{"role": "user", "content": content}],
    )

    return response.content[0].text  # ~$0.002 cost

Across 100 queries, this architecture costs: FAQ (20 queries, $0) + simple (40 queries × $0.0005 = $0.02) + complex (40 queries × $0.002 = $0.08) = $0.10 total, versus $0.20 for sending all 100 queries to Sonnet. That is a 50% cost reduction from routing alone.
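The routing arithmetic above can be reproduced with a short calculation. The query mix and per-query costs are the figures from the text; the structure of the `ROUTES` table is just one way to lay them out.

```python
# (number of queries, approximate cost per query in USD), from the text above.
ROUTES = {
    "faq": (20, 0.0),
    "simple_haiku": (40, 0.0005),
    "complex_sonnet": (40, 0.002),
}
SONNET_COST_PER_QUERY = 0.002

total_queries = sum(count for count, _ in ROUTES.values())
routed_cost = sum(count * cost for count, cost in ROUTES.values())
baseline_cost = total_queries * SONNET_COST_PER_QUERY
saving = 1 - routed_cost / baseline_cost

print(f"routed:   ${routed_cost:.2f}")
print(f"baseline: ${baseline_cost:.2f}")
print(f"saving:   {saving:.0%}")
```

The saving depends heavily on the traffic mix: the more queries that resolve in the FAQ or Haiku routes, the larger the reduction.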

Key Takeaways

  • Cost-aware system messages are concise (50–150 tokens), not verbose.
  • Structured output (JSON) is more token-efficient than prose narratives.
  • Minimize token count via tight max_tokens limits and concise instructions.
  • One or two well-chosen examples are almost always sufficient; avoid example proliferation.
  • Route requests before LLM invocation: FAQ lookup, classification, templates—cheaper than general-purpose generation.

Frequently Asked Questions

Does cost-optimized prompting hurt accuracy?

No, if done right. Concise prompts are often clearer and result in better accuracy than verbose prompts (less room for misinterpretation). Test on a held-out set: compare accuracy of verbose vs concise versions. Usually, concise wins or ties.

How do I test if my prompt is cost-optimized?

Measure cost per request (average tokens) and accuracy on a benchmark. Compare against a baseline prompt for the same task. If cost is lower and accuracy is the same or higher, it's optimized.
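The comparison described above can be written as a small check. `PromptResult` and `is_cost_optimized` are illustrative names, and the accuracy values are hypothetical; the 250 and 87 token averages echo the classification example earlier in the article.

```python
from dataclasses import dataclass

@dataclass
class PromptResult:
    name: str
    avg_tokens: float   # mean input + output tokens per request
    accuracy: float     # fraction correct on the benchmark

def is_cost_optimized(candidate: PromptResult, baseline: PromptResult,
                      tolerance: float = 0.0) -> bool:
    """Cheaper than the baseline, and no less accurate (within tolerance)."""
    cheaper = candidate.avg_tokens < baseline.avg_tokens
    as_accurate = candidate.accuracy >= baseline.accuracy - tolerance
    return cheaper and as_accurate

baseline = PromptResult("verbose", avg_tokens=250, accuracy=0.93)   # hypothetical accuracy
candidate = PromptResult("concise", avg_tokens=87, accuracy=0.93)   # hypothetical accuracy

print(is_cost_optimized(candidate, baseline))
```

The `tolerance` parameter lets a team accept a small accuracy drop in exchange for a large token saving; leaving it at zero enforces the stricter rule stated above.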

Should I use token limits (max_tokens) for all tasks?

Yes. Set max_tokens to the smallest value that comfortably fits a valid answer: for classification, about 10 tokens; for summarization, about 150; for essay generation, about 1,000. A limit that is too high leaves room for verbose output, and a limit that is too low truncates the response, which counts as a failure. Find the minimum effective value by testing.

What if my structured output (JSON) is malformed?

Wrap LLM output in error handling: try json.loads(response), catch errors, and either re-prompt with format clarification or default to a fallback. Example: "Response was not valid JSON. Please return only valid JSON: {…}."
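The parse, re-prompt, and fallback flow described above might look like the following sketch. The `ask` parameter is a hypothetical callable that sends a prompt and returns the model's text; it is injected so the example runs without an API key. The function name and the shape of the fallback value are also assumptions.

```python
import json
from typing import Callable, Optional

def parse_json_with_retry(ask: Callable[[str], str], prompt: str,
                          max_retries: int = 1,
                          fallback: Optional[dict] = None) -> dict:
    """Parse a JSON object from the model, re-prompting on malformed output."""
    current = prompt
    for _ in range(max_retries + 1):
        raw = ask(current)
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass
        # Re-prompt with a short format clarification appended.
        current = (f"{prompt}\n\nResponse was not valid JSON. "
                   "Please return only valid JSON.")
    return fallback if fallback is not None else {"error": "invalid_json"}

# Fake model for demonstration: fails once, then returns valid JSON.
replies = iter(['Sure! Here it is: {"name": "Alice"}', '{"name": "Alice Chen"}'])
result = parse_json_with_retry(lambda p: next(replies), "Extract the name.")
print(result)
```

Each retry is a billed call, so keep `max_retries` small and log how often the fallback path is taken; a high rate suggests the format instruction in the prompt needs work.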

Can I combine cost optimization with prompt caching?

Yes. Caching applies to stable prompt prefixes such as system messages and large context, while prompt optimization reduces overall tokens. Combine both: cache a large FAQ database (about a 90% cost reduction on cached input tokens when they are read back) and send concise queries with no fluff. The two savings compound.
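With the Anthropic Messages API, caching is requested by marking a content block with `cache_control`. The sketch below only builds the request arguments, so it runs offline; `build_cached_request`, the instruction text, and the sample FAQ string are illustrative assumptions. Very short prefixes may not be cached, so check the provider's documentation for the minimum cacheable length.

```python
def build_cached_request(faq_text: str, query: str) -> dict:
    """Build Messages API arguments with the large, stable FAQ marked cacheable."""
    return {
        "model": "claude-3-5-haiku-20241022",
        "max_tokens": 100,
        "system": [
            # Short instruction block.
            {"type": "text", "text": "Answer using only the FAQ below. Be concise."},
            # Large stable block; the cache breakpoint covers the prefix up to here.
            {"type": "text", "text": faq_text,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Only the short, changing part is sent uncached.
        "messages": [{"role": "user", "content": query}],
    }

request = build_cached_request("Q: How do I reset my password? A: ...", "Reset password?")
print(request["system"][1]["cache_control"])
```

With a configured client, the request would be sent as `client.messages.create(**request)`. Keeping the cached block byte-for-byte identical across calls is what allows later requests to hit the cache.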

Further Reading