
Prompt Compression for Cost: Reduce Token Spend 40%+

Prompt compression reduces token count while preserving the semantic information the LLM needs to answer a query accurately. A naive document Q&A system includes the entire document (10,000 tokens) in every request; a compressed approach uses retrieval-augmented generation (RAG) to fetch only the relevant passages (1,000 tokens), cutting input tokens by 90%. Compression techniques range from simple (remove whitespace, truncate history, reuse cached context) to sophisticated (learned token pruning, semantic summarization, multi-stage routing). The cost impact scales with volume: a system processing 1,000 document-QA requests per day with 10,000-token documents sends 9 million fewer input tokens per day after compressing to 1,000-token retrieved passages, which at an assumed $3 per million input tokens is roughly $9,900 per year. Prompt compression is the second-highest-impact optimization after model routing, yielding 40–70% cost reductions with minimal quality loss when done right.
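The annual figure depends heavily on the per-token price, so it is worth recomputing for your own model. The sketch below is a minimal calculator; the function name `annual_input_savings` and the $3.00-per-million price are illustrative assumptions, not values taken from any provider's rate card.

```python
def annual_input_savings(
    requests_per_day: int,
    tokens_before: int,
    tokens_after: int,
    price_per_million_usd: float,
    days_per_year: int = 365,
) -> float:
    """Return yearly input-token savings in USD from shrinking each prompt."""
    tokens_saved_per_day = requests_per_day * (tokens_before - tokens_after)
    return tokens_saved_per_day / 1_000_000 * price_per_million_usd * days_per_year


# Assumed price: $3.00 per million input tokens (check your provider's pricing).
savings = annual_input_savings(1_000, 10_000, 1_000, price_per_million_usd=3.00)
print(f"Estimated annual savings: ${savings:,.0f}")
```

Only input tokens are counted here; output tokens are unchanged by compression and are left out of the estimate.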

Understanding Compression Trade-Offs

Compression trades off token count for retrieval accuracy. The question "What is the quarterly revenue of ACME Corp in Q2 2024?" in a 20,000-token earnings transcript requires only the Q2 2024 section (500 tokens) to answer. But if your retriever is weak and fetches the Q1 section instead, the LLM will hallucinate or say "not found." So compression quality depends entirely on your retrieval system's precision and recall. A high-precision retriever (finds the right section 95% of the time) enables aggressive compression (1,000-token budget per request). A low-precision retriever (70% precision) might need a higher budget (5,000 tokens, including multiple candidate sections) to ensure accuracy. Measure retrieval precision and recall on your document corpus; use those metrics to set compression budgets. Start conservative (high token budget, low risk of accuracy loss) and gradually tighten as you gain confidence in your retriever.
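One way to turn measured retrieval quality into a token budget is to find the smallest number of passages whose top-k hit rate meets your accuracy target, then multiply by the passage size. The sketch below assumes a small labeled set in which each query has exactly one correct passage; the helper names, the passage IDs, and the 500-token passage size are hypothetical.

```python
from typing import Optional


def hit_rate_at_k(ranked_results: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of queries whose correct passage appears in the top-k results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_results, gold_ids))
    return hits / len(gold_ids)


def smallest_sufficient_k(
    ranked_results: list[list[str]],
    gold_ids: list[str],
    target: float,
    max_k: int,
) -> Optional[int]:
    """Return the smallest k whose hit rate reaches `target`, or None if none does."""
    for k in range(1, max_k + 1):
        if hit_rate_at_k(ranked_results, gold_ids, k) >= target:
            return k
    return None


# Made-up evaluation data: ranked passage IDs per query, plus the correct ID.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"], ["p", "q", "r"]]
gold = ["a", "y", "o", "p"]

k = smallest_sufficient_k(ranked, gold, target=0.95, max_k=3)
tokens_per_passage = 500  # assumed average passage size
budget = k * tokens_per_passage if k is not None else None
print(f"k={k}, token budget={budget}")
```

A weaker retriever pushes `k` (and therefore the budget) up, which is the trade-off described above.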

Retrieval-Augmented Generation (RAG): The Primary Compression Technique

RAG is the most effective compression method: chunk your documents into passages (200–500 tokens each), embed them into a vector database, and for each query, retrieve the top K most similar passages. The LLM answers based on the retrieved passages alone, not the full document. Here is a Python example using LangChain and Anthropic:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings  # replaces the deprecated langchain_community import
import anthropic

MODEL = "claude-3-5-sonnet-20241022"

# Step 1: Load document and chunk
with open("earnings_transcript.txt", encoding="utf-8") as f:
    document_text = f.read()  # ~50,000 tokens

# chunk_size is measured in characters, not tokens;
# ~2,000 characters is roughly 500 tokens of English text.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)
chunks = splitter.split_text(document_text)  # roughly 100 chunks of ~500 tokens each
print(f"Chunked document into {len(chunks)} passages")

# Step 2: Embed and index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = FAISS.from_texts(chunks, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Step 3: Answer query using RAG
client = anthropic.Anthropic()

def answer_with_rag(query: str) -> str:
    """Answer a query using RAG: retrieve relevant passages, then call LLM."""

    # Retrieve top 3 passages
    retrieved_docs = retriever.invoke(query)
    context = "\n---\n".join(doc.page_content for doc in retrieved_docs)

    # Build the prompt once so the token count matches what is actually sent
    messages = [
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}",
        }
    ]

    # Count tokens in the compressed prompt
    count_response = client.messages.count_tokens(model=MODEL, messages=messages)
    print(f"Retrieved context: {count_response.input_tokens} tokens")

    # Call LLM with only retrieved context (compressed!)
    response = client.messages.create(
        model=MODEL,
        max_tokens=300,
        messages=messages,
    )

    return response.content[0].text

# Answer a query
answer = answer_with_rag(
    "What was ACME Corp's revenue in Q2 2024?"
)
print(answer)

Without RAG, this query would send the full 50,000-token transcript (costing $0.15 in input tokens at $3 per million). With RAG, you send ~1,500 tokens of retrieved context (costing $0.0045), a 97% cost reduction. The embedding cost for indexing (one-time, offline) is amortized across many queries; each query also incurs a small embedding call for the query text.
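To make the retrieval step itself concrete without a vector database, here is a toy sketch that ranks passages by bag-of-words cosine similarity. This is a deliberate stand-in for learned embeddings, not what FAISS or an embedding model does internally; the passages are made up for illustration and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, passages: list[str], k: int) -> list[str]:
    """Return the k passages most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(passages, key=lambda p: cosine(q, bag_of_words(p)), reverse=True)
    return ranked[:k]


# Made-up passages standing in for chunks of an earnings transcript.
passages = [
    "Q1 2024 revenue was 10 million dollars, driven by hardware sales.",
    "Q2 2024 revenue was 12 million dollars, driven by subscriptions.",
    "The board approved a new office lease in Austin.",
]
best = top_k("What was revenue in Q2 2024?", passages, k=1)
print(best[0])
```

Only the selected passage would be placed in the prompt; the other passages never reach the LLM, which is where the token savings come from.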

Context Summarization for Conversation History

In conversational systems, history grows without bound: a ten-turn conversation with 100-token exchanges carries 1,000 tokens of history, and every new request resends all of it. One remedy is to periodically condense the conversation into a bullet-point summary (50–100 tokens) and discard the original messages. Here is a pattern:

import anthropic
from typing import TypedDict

class ConversationState(TypedDict):
    turns: list[dict]
    summary: str
    summary_token_count: int

client = anthropic.Anthropic()

def add_turn_and_compress(
    state: ConversationState,
    user_message: str,
    assistant_response: str,
) -> ConversationState:
    """Add a turn to conversation; if history grows long, summarize."""

    state["turns"].append({
        "role": "user",
        "content": user_message,
    })
    state["turns"].append({
        "role": "assistant",
        "content": assistant_response,
    })

    # Count tokens in current history
    history_tokens = client.messages.count_tokens(
        model="claude-3-5-haiku-20241022",
        messages=state["turns"],
    ).input_tokens

    print(f"Conversation history: {len(state['turns'])} messages, {history_tokens} tokens")

    # If history exceeds 3,000 tokens, summarize and reset
    if history_tokens > 3000:
        print("History too large; summarizing...")

        # Flatten the conversation for the summarizer. Each message is
        # truncated to 200 characters to keep the summarization prompt small,
        # so long messages lose their tails.
        history_text = "\n".join([
            f"{turn['role'].upper()}: {turn['content'][:200]}"
            for turn in state["turns"]
        ])

        summary_response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=200,
            messages=[
                {
                    "role": "user",
                    "content": f"Summarize this conversation in 5–10 bullet points:\n{history_text}",
                }
            ],
        )

        summary = summary_response.content[0].text
        summary_tokens = client.messages.count_tokens(
            model="claude-3-5-haiku-20241022",
            messages=[{"role": "user", "content": summary}],
        ).input_tokens

        print(f"Summary: {summary_tokens} tokens (was {history_tokens} tokens)")

        # Reset history and store summary
        state["turns"] = []  # Clear old turns
        state["summary"] = summary
        state["summary_token_count"] = summary_tokens

    return state

def continue_conversation(
    state: ConversationState,
    user_message: str,
) -> tuple[str, ConversationState]:
    """Continue conversation using summary + recent turns."""

    # Build context: summary + recent turns
    messages = []
    if state["summary"]:
        messages.append({
            "role": "user",
            "content": f"[Previous conversation summary:\n{state['summary']}]\n\n",
        })

    # Add recent turns (last 5 turns = 10 messages)
    messages.extend(state["turns"][-10:])
    messages.append({
        "role": "user",
        "content": user_message,
    })

    # Generate response
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        messages=messages,
    )

    # Add to state and potentially compress
    assistant_response = response.content[0].text
    state = add_turn_and_compress(state, user_message, assistant_response)

    return assistant_response, state

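This pattern keeps conversation history under control. As an illustration, a 20-turn conversation carrying about 2,000 tokens of history could shrink to roughly 300 tokens (the summary plus the turns added since the last summarization), an 85% saving on history cost. Note that the example code only summarizes once history passes 3,000 tokens, so lower the threshold if you want compression to trigger that early. The summary is lossy but preserves intent and context for future turns.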
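Because every request resends the history so far, uncompressed history cost grows roughly quadratically with conversation length. The sketch below compares cumulative history tokens with and without a cap. It is a simplified model under stated assumptions: each turn adds a fixed number of tokens, the cap stands in for "summary plus recent turns", and the cost of the summarization calls themselves is ignored. The function names are hypothetical.

```python
def cumulative_history_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total history tokens sent over a conversation with no compression.

    The request for turn i (1-based) carries the i - 1 earlier turns.
    """
    return sum((i - 1) * tokens_per_turn for i in range(1, turns + 1))


def cumulative_with_cap(turns: int, tokens_per_turn: int, cap: int) -> int:
    """Same total when the history sent per request never exceeds `cap` tokens."""
    return sum(min((i - 1) * tokens_per_turn, cap) for i in range(1, turns + 1))


uncompressed = cumulative_history_tokens(turns=20, tokens_per_turn=100)
capped = cumulative_with_cap(turns=20, tokens_per_turn=100, cap=300)
print(f"Uncompressed: {uncompressed} tokens, capped: {capped} tokens")
print(f"Reduction: {1 - capped / uncompressed:.0%}")
```

The gap widens as conversations get longer, which is why a fixed cap matters most for long-running sessions.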

Prompt Caching for Static Context

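If your prompt includes large static context (system message, few-shot examples, reference documents) that repeats across many requests, use prompt caching. Anthropic's Claude bills cache reads at roughly 10% of the normal input price (a 90% reduction), while the request that writes the cache pays a premium (25% over the normal input rate for the default 5-minute cache), so caching pays off from the second request onward. Here is an example: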

import anthropic

client = anthropic.Anthropic()

# Large static system prompt + examples
SYSTEM_PROMPT = """You are a customer support specialist...
[1,000 words of instructions and examples]
"""

FAQ_DATABASE = """
Q: How do I reset my password?
A: Click 'Forgot Password' on the login page...
[5,000 tokens of FAQs]
"""

def answer_support_question_with_cache(question: str, user_id: str) -> str:
    """Use prompt caching to reduce cost of repeated FAQs."""

    # First request: the cached prefix (system prompt + FAQ) is written to the
    # cache and billed at the cache-write rate, slightly above the normal rate
    response_1 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
            },
            {
                "type": "text",
                "text": FAQ_DATABASE,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {
                "role": "user",
                "content": question,
            }
        ],
    )

    print(f"Request 1 usage: input={response_1.usage.input_tokens}, "
          f"cache_creation={response_1.usage.cache_creation_input_tokens}")
    # Expected on a cold cache (approx.): input_tokens = 10 (the question only),
    # cache_creation_input_tokens = 6000 (system prompt + FAQ prefix)

    # Second request (within 5 min): cache hit, cached prefix billed at the read rate
    response_2 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
            },
            {
                "type": "text",
                "text": FAQ_DATABASE,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {
                "role": "user",
                "content": question,
            }
        ],
    )

    print(f"Request 2 usage: input={response_2.usage.input_tokens}, "
          f"cache_read={response_2.usage.cache_read_input_tokens}")
    # Expected (approx.): input_tokens = 10 (new query only), cache_read_input_tokens = 6000
    # Cost at an assumed $3 per million input tokens:
    #   10 tokens at the normal rate ($0.00003) + 6,000 tokens at 10% of it ($0.0018) = ~$0.0018
    #   vs. all 6,010 tokens at the normal rate = ~$0.018. Savings: roughly 90%.

    return response_2.content[0].text

# Simulate two support questions
answer_support_question_with_cache("How do I reset my password?", "user_1")
answer_support_question_with_cache("How do I update my billing address?", "user_2")

Caching is powerful when your system processes many requests with the same large context (FAQ database, product documentation, system instructions). Suppose you run 100 questions against the same 6,000-token FAQ context at an assumed $3 per million input tokens, with every request arriving while the cache is still warm. The first request writes the cache at a 25% premium (about $0.0225), and each of the next 99 reads it at 10% of the normal rate (about $0.0018 plus a few query tokens). Total: about $0.20 vs. about $1.80 without caching, a saving of roughly 89%.
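The arithmetic for a cached workload can be checked with a short calculator. This sketch assumes a base price of $3 per million input tokens, a 1.25x multiplier for the request that writes the cache, a 0.10x multiplier for cache reads, and that every request after the first hits the cache; all of these are parameters you should replace with your provider's current figures. The function name is hypothetical.

```python
def cached_workload_cost(
    requests: int,
    cached_tokens: int,
    query_tokens: int,
    base_price_per_million: float,
    write_multiplier: float = 1.25,  # assumed cache-write premium
    read_multiplier: float = 0.10,   # assumed cache-read discount
) -> tuple[float, float]:
    """Return (cost_with_cache, cost_without_cache) in USD, input tokens only."""
    per_token = base_price_per_million / 1_000_000
    without_cache = requests * (cached_tokens + query_tokens) * per_token

    first_request = cached_tokens * per_token * write_multiplier
    later_requests = (requests - 1) * cached_tokens * per_token * read_multiplier
    queries = requests * query_tokens * per_token
    return first_request + later_requests + queries, without_cache


with_cache, without_cache = cached_workload_cost(
    requests=100, cached_tokens=6_000, query_tokens=10, base_price_per_million=3.00
)
print(f"With cache: ${with_cache:.4f}, without: ${without_cache:.4f}")
print(f"Savings: {1 - with_cache / without_cache:.0%}")
```

If requests arrive further apart than the cache lifetime, some of them pay the write rate again, so real savings depend on traffic patterns.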

Batch Compression and Preprocessing

For non-real-time workloads (data labeling, report generation, overnight analysis), compress prompts offline before sending to the API. Pre-chunk documents, run semantic deduplication (remove near-duplicate passages), and embed in vector stores before time-sensitive processing. This trades latency for cost: your batch job spends 1 hour compressing data, then processes 100 requests in 10 minutes at 80% lower cost.
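As a rough illustration of the deduplication step, the sketch below drops near-duplicate passages using Jaccard similarity over word trigrams. This is a lexical stand-in chosen so the example runs with the standard library only; a semantic version would compare embedding vectors instead. The 0.8 threshold, the sample passages, and the function names are assumptions for illustration.

```python
import re


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in the text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(passages: list[str], threshold: float = 0.8) -> list[str]:
    """Keep each passage unless it is too similar to one already kept."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for passage in passages:
        current = shingles(passage)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(passage)
            kept_shingles.append(current)
    return kept


docs = [
    "Click Forgot Password on the login page to reset your password.",
    "Click 'Forgot Password' on the login page to reset your password!",
    "Update your billing address under Account Settings.",
]
unique = deduplicate(docs)
print(f"Kept {len(unique)} of {len(docs)} passages")
```

The pairwise comparison is quadratic in the number of passages, which is acceptable for an offline batch job on a modest corpus but would need an approximate index for larger ones.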

Key Takeaways

  • RAG (retrieval-augmented generation) is the primary compression technique: fetch only relevant passages, reducing token count by 80–95%.
  • For conversations, summarize history periodically (e.g., every 6 turns) to keep the token budget bounded as conversation length grows.
  • Use prompt caching for static context (system prompts, FAQs, reference documents) that repeats across many requests, for a 90% cost reduction on cached tokens.
  • Batch preprocessing (chunking, embedding, deduplication) enables aggressive offline compression for non-real-time workloads.
  • Compression requires investment in retrieval quality: measure precision and recall, and tune the retriever to achieve 95%+ accuracy.

Frequently Asked Questions

How do I know if my retriever is good enough to compress aggressively?

Test on a held-out set of 100–500 queries with ground-truth answers. Measure: (1) retrieval precision (what fraction of retrieved passages contain the answer?), (2) retrieval recall (does the top-K retrieval include the correct answer?), (3) end-to-end accuracy (does the LLM answer correctly using retrieved context?). Aim for 95%+ precision and 90%+ recall before aggressive compression.
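The first two measurements can be computed with a few lines of code once you have labeled queries. The sketch below assumes each evaluation item is a pair of (ranked retrieved passage IDs, set of relevant passage IDs); the IDs and function names are hypothetical. When each query has exactly one relevant passage, recall at k reduces to the "did the top-K include the answer" check described above.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision and recall of the top-k retrieved passage IDs for one query."""
    top = retrieved[:k]
    hits = sum(1 for passage_id in top if passage_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_metrics(
    eval_set: list[tuple[list[str], set[str]]], k: int
) -> tuple[float, float]:
    """Average precision@k and recall@k over all queries in the evaluation set."""
    pairs = [precision_recall_at_k(retrieved, relevant, k) for retrieved, relevant in eval_set]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n


# Made-up evaluation set with two queries.
eval_set = [
    (["p7", "p2", "p9", "p4"], {"p2", "p4"}),
    (["p1", "p3", "p5"], {"p1"}),
]
mean_precision, mean_recall = mean_metrics(eval_set, k=3)
print(f"precision@3={mean_precision:.2f}, recall@3={mean_recall:.2f}")
```

End-to-end accuracy still has to be measured separately by running the LLM on the retrieved context and grading its answers.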

Should I cache the entire document or retrieve passages?

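Retrieve passages when each query needs a different part of a large corpus. Caching works well for static context (system prompts, reference docs) that repeats verbatim across requests; the query itself changes every time, so it cannot be cached. For documents, use RAG: embed once, retrieve per query, and reserve caching for the part of the prompt that stays identical.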

How often should I summarize conversation history?

Summarize every 5–10 turns or when history exceeds a token budget (e.g., 3,000 tokens). More frequent summarization preserves details but adds overhead; less frequent summarization is cheaper but risks losing context. Tune empirically based on quality metrics.

What if summarization loses important context?

Use a higher compression threshold (summarize less often) or keep a longer "recent turns" window (10–20 messages) alongside the summary. The summary covers high-level intent; recent messages preserve details for current questions. This hybrid approach balances cost and quality.
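A recent-turns window can also be sized by tokens rather than by message count. The sketch below keeps the newest messages that fit within a budget, using a rough heuristic of about 4 characters per token instead of an API call; both the heuristic and the function names are assumptions, and a real system would use its provider's token counter.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def recent_window(turns: list[dict], budget: int) -> list[dict]:
    """Keep the newest messages whose estimated tokens fit within `budget`."""
    kept: list[dict] = []
    used = 0
    for turn in reversed(turns):
        cost = estimate_tokens(turn["content"])
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    # Drop a leading assistant message so the window starts with a user turn.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


# Four made-up messages of about 10 estimated tokens each.
turns = [
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
    {"role": "assistant", "content": "d" * 40},
]
window = recent_window(turns, budget=35)
print([m["role"] for m in window])
```

The returned window would be sent after the stored summary, so the summary supplies older context and the window supplies recent detail.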

Can I combine RAG + caching + summarization?

Yes! A mature system uses all three: RAG for document Q&A, caching for static reference docs, and summarization for long conversations. The combination achieves 70–90% cost reduction versus naive approaches.

Further Reading