Tuning Similarity Thresholds for Your Use Case
The similarity threshold in semantic caching is a tunable knob that controls the trade-off between cache hit rate (cost savings) and answer accuracy. Lower thresholds (for example, 0.90) increase hit rates but raise the risk of serving a wrong cached response (a false positive). Higher thresholds (for example, 0.98) reduce false positives but miss cache opportunities (false negatives). This article shows you how to measure both, run A/B tests with your users, and choose a threshold that balances operating cost against your quality targets.
In production, the threshold is often not a single value but a policy: different thresholds for different query categories, user segments, or even per-user preferences. You will learn data-driven methods to discover the right threshold for your domain.
Understanding False Positives and False Negatives
In semantic caching terminology:
- True positive (TP): Cache hit on a semantically equivalent query. Correct, desirable.
- False positive (FP): Cache hit, but the cached response is unrelated or wrong. Bad; breaks user trust.
- True negative (TN): Cache miss, and no cached response would have answered the query correctly. Correct.
- False negative (FN): Cache miss, but a cached response would have been correct. A cost-saving opportunity lost.
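The four outcomes above can be counted mechanically once each lookup has been labeled. The following is a minimal sketch, assuming you have a per-lookup label saying whether the best cached candidate would have been a correct answer (for example, from human review). The function names `classify_lookup` and `summarize` and the sample data are hypothetical.

```python
from collections import Counter


def classify_lookup(cache_hit: bool, cached_answer_ok: bool) -> str:
    """Map one cache lookup to TP, FP, TN, or FN.

    cache_hit: whether the cache served a response at the chosen threshold.
    cached_answer_ok: whether the best cached candidate would have been a
        correct answer to the query (an assumed label, e.g. from review).
    """
    if cache_hit:
        return "TP" if cached_answer_ok else "FP"
    return "FN" if cached_answer_ok else "TN"


def summarize(lookups: list[tuple[bool, bool]]) -> dict:
    """Count outcomes and derive precision and recall from them."""
    counts = Counter(classify_lookup(hit, ok) for hit, ok in lookups)
    served = counts["TP"] + counts["FP"]       # everything the cache served
    answerable = counts["TP"] + counts["FN"]   # everything it could have served correctly
    return {
        "counts": dict(counts),
        "precision": counts["TP"] / served if served else 0.0,
        "recall": counts["TP"] / answerable if answerable else 0.0,
    }


# Hypothetical labeled lookups: (cache_hit, cached_answer_ok)
example = [(True, True), (True, True), (True, False), (False, True), (False, False)]
summary = summarize(example)
print(summary)
```

Precision here is TP / (TP + FP) and recall is TP / (TP + FN), which are the two quantities the rest of the article trades off against each other.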
Your domain determines how costly each error is:
- General Q&A, documentation: Moderate tolerance for FP. Users tolerate occasional wrong answers if they can quickly ask for clarification. Threshold 0.94–0.96.
- Medical, legal, financial: Very low tolerance for FP. An incorrect answer can cause harm. Threshold 0.97–0.99.
- High-cost LLM inference: High tolerance for FP. Recomputing is expensive; a slightly stale answer is better than re-calling GPT-4 at USD 0.03/1K tokens. Threshold 0.90–0.93.
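One way to make "how costly each error is" concrete is to attach a dollar value to a wrong cached answer and compare thresholds by expected cost per request. The sketch below is an illustration under stated assumptions: `fp_cost` is a hypothetical value you would have to estimate for your domain, the candidate hit rates and precisions are made-up numbers, and lookup overhead is ignored.

```python
def expected_cost_per_request(hit_rate: float, precision: float,
                              inference_cost: float, fp_cost: float) -> float:
    """Expected cost of one request under a given threshold.

    Misses pay for inference. Hits are treated as free, except that a wrong
    hit (probability 1 - precision) incurs fp_cost, an assumed dollar value
    for serving one wrong answer.
    """
    miss_cost = (1.0 - hit_rate) * inference_cost
    wrong_hit_cost = hit_rate * (1.0 - precision) * fp_cost
    return miss_cost + wrong_hit_cost


def cheapest_threshold(candidates: list[dict], inference_cost: float,
                       fp_cost: float) -> float:
    """Return the candidate threshold with the lowest expected cost."""
    best = min(
        candidates,
        key=lambda c: expected_cost_per_request(
            c["hit_rate"], c["precision"], inference_cost, fp_cost
        ),
    )
    return best["threshold"]


# Hypothetical offline measurements for three candidate thresholds
candidates = [
    {"threshold": 0.90, "hit_rate": 0.22, "precision": 0.86},
    {"threshold": 0.94, "hit_rate": 0.12, "precision": 0.94},
    {"threshold": 0.98, "hit_rate": 0.03, "precision": 0.99},
]

low_stakes = cheapest_threshold(candidates, inference_cost=0.003, fp_cost=0.001)
high_stakes = cheapest_threshold(candidates, inference_cost=0.003, fp_cost=1.0)
print(low_stakes, high_stakes)
```

With these made-up numbers, a cheap error (USD 0.001) favors the 0.90 threshold and an expensive error (USD 1.00) favors 0.98, which mirrors the domain guidance in the list above.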
Baseline Measurement: Null Threshold
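Start with a simple baseline: disable the cache (or set the threshold above 1.0 so that no lookup can match; a threshold of 0.00 would do the opposite and match everything) and collect data. For 1,000+ queries, log: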
- The query text.
- The LLM response.
- User feedback (was the response correct and useful? For example, thumbs up/down or a 1–5 rating).
This baseline gives you ground truth: which queries are correctly answered by your LLM.
Example: Collecting baseline correctness data
import random
from typing import Callable, Optional


def collect_baseline_data(queries: list[str], sample_size: int = 1000,
                          llm_call: Optional[Callable[[str], str]] = None) -> list[dict]:
    """
    Collect correctness labels for a sample of queries.
    In production, integrate with your user feedback system (thumbs up/down, ratings).
    """
    if llm_call is None:
        raise ValueError("llm_call is required: pass a function that maps a query to a response")

    baseline_data = []
    sample = random.sample(queries, min(sample_size, len(queries)))
    for query in sample:
        response = llm_call(query)
        # Simulate collecting user feedback.
        # In a real system, ask the user: "Was this helpful?" on a 1-5 scale.
        correctness_score = rate_response_quality(query, response)  # 0.0-1.0
        baseline_data.append({
            "query": query,
            "response": response,
            "correctness": correctness_score,
            "tokens": len(response.split()),  # Whitespace word count, a rough proxy for tokens
        })
    return baseline_data


def rate_response_quality(query: str, response: str) -> float:
    """
    Placeholder heuristic: score a response by its term overlap with the query.
    In production, use human raters or gather user signals.
    """
    query_words = set(query.lower().split())
    response_words = set(response.lower().split())
    # Rough heuristic: fraction of query terms that reappear in the response
    overlap = len(query_words & response_words) / len(query_words) if query_words else 0.0
    # Assumption: a good response reuses some query terms (20-80%) and adds new information
    if 0.2 < overlap < 0.8:
        return 0.8
    elif overlap >= 0.8:
        return 0.5  # Too much repetition, maybe not helpful
    else:
        return 0.3  # Very low overlap, possibly off-topic
Empirical Threshold Selection: Recall and Precision
Once you have baseline data, simulate different thresholds. For each threshold, measure:
- Recall: Of all correct responses in your baseline, what fraction do you serve from cache?
- Precision: Of all cached responses you serve, what fraction are correct?
Example: Threshold evaluation
import numpy as np


def evaluate_threshold(baseline_data: list[dict], cache_entries: list[dict],
                       threshold: float) -> dict:
    """
    Simulate a threshold: for each baseline query, find the best cache match.
    Measure recall and precision.

    cache_entries = [{"embedding": ..., "response": ..., "correctness": ...}, ...]
    baseline_data = [{"query": ..., "embedding": ..., "correctness": ...}, ...]

    Embeddings are assumed to be L2-normalized, so the dot product equals cosine
    similarity. The cached entry's own correctness score is used as a proxy for
    whether it is a correct answer to the baseline query.
    """
    correct_served = 0  # Cached responses served that are correct
    total_served = 0    # Total cached responses served
    # Baseline queries whose LLM answer was rated correct (the recall denominator)
    correct_available = sum(1 for b in baseline_data if b["correctness"] >= 0.7)
    correct_found = 0   # Of those, how many were served a correct cached response

    for baseline in baseline_data:
        best_match = None
        best_sim = threshold
        # Search for the most similar cache entry above the threshold
        for cache_entry in cache_entries:
            sim = np.dot(baseline["embedding"], cache_entry["embedding"])
            if sim > best_sim:
                best_sim = sim
                best_match = cache_entry

        if best_match is not None:
            total_served += 1
            if best_match["correctness"] >= 0.7:
                correct_served += 1
            # Check if we matched a correct response
            if baseline["correctness"] >= 0.7 and best_match["correctness"] >= 0.7:
                correct_found += 1

    recall = correct_found / correct_available if correct_available > 0 else 0.0
    precision = correct_served / total_served if total_served > 0 else 0.0
    cache_hit_rate = total_served / len(baseline_data) if baseline_data else 0.0
    return {
        "threshold": threshold,
        "recall": recall,
        "precision": precision,
        "cache_hit_rate": cache_hit_rate,
        "total_served": total_served,
        "total_correct_served": correct_served,
    }
# Sweep thresholds from 0.85 to 0.98.
# Each baseline_data entry must carry an "embedding" key in addition to the logged fields.
thresholds = [0.85, 0.90, 0.92, 0.94, 0.96, 0.98]
results = [evaluate_threshold(baseline_data, cache_entries, t) for t in thresholds]

# Display results
print("Threshold | Recall | Precision | Hit Rate | Served")
for r in results:
    print(f"{r['threshold']:.2f} | {r['recall']:.0%} | {r['precision']:.0%} | "
          f"{r['cache_hit_rate']:.0%} | {r['total_served']}")
Illustrative output for 1,000 baseline queries (actual numbers depend on your data):
Threshold | Recall | Precision | Hit Rate | Served
0.85 | 72% | 78% | 35% | 350
0.90 | 58% | 86% | 22% | 220
0.92 | 48% | 91% | 18% | 180
0.94 | 35% | 94% | 12% | 120
0.96 | 20% | 97% | 7% | 70
0.98 | 10% | 99% | 3% | 30
From this data, you see the recall-precision tradeoff clearly. For this domain:
- 0.85: 78% precision is low (more than 1 in 5 cached responses is wrong). Too aggressive.
- 0.90: 86% precision is reasonable, with 58% recall. Good balance.
- 0.94: 94% precision is high, but recall falls to 35%, so most caching opportunities are missed. Conservative.
Choose based on your SLO. In this illustrative data, 0.90–0.92 is a reasonable balance for a cost-sensitive general Q&A workload. For high-stakes domains, use 0.96 or higher.
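Choosing by SLO can be made mechanical: keep only the thresholds whose measured precision meets the target, then take the one with the highest hit rate. The following is a minimal sketch, assuming result dictionaries shaped like the output of `evaluate_threshold`; the helper name `pick_threshold` is hypothetical, and the sweep values are the illustrative numbers from the table above.

```python
from typing import Optional


def pick_threshold(results: list[dict], min_precision: float) -> Optional[dict]:
    """Return the result with the highest hit rate whose precision meets the SLO.

    Returns None when no candidate threshold satisfies min_precision.
    """
    eligible = [r for r in results if r["precision"] >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["cache_hit_rate"])


# Illustrative sweep results (same numbers as the table above)
sweep = [
    {"threshold": 0.85, "precision": 0.78, "cache_hit_rate": 0.35},
    {"threshold": 0.90, "precision": 0.86, "cache_hit_rate": 0.22},
    {"threshold": 0.92, "precision": 0.91, "cache_hit_rate": 0.18},
    {"threshold": 0.94, "precision": 0.94, "cache_hit_rate": 0.12},
    {"threshold": 0.96, "precision": 0.97, "cache_hit_rate": 0.07},
    {"threshold": 0.98, "precision": 0.99, "cache_hit_rate": 0.03},
]

general = pick_threshold(sweep, min_precision=0.85)
high_stakes = pick_threshold(sweep, min_precision=0.95)
print(general["threshold"], high_stakes["threshold"])
```

A precision target of 85% selects 0.90 here, and a target of 95% selects 0.96. If no threshold meets the target, the function returns None, which is a signal to improve the embeddings or the cache contents rather than to lower the bar.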
Cost-Benefit Analysis: Computing ROI at Each Threshold
Recall alone does not tell the full story; you must compute the cost impact.
Example: ROI calculation
def compute_roi(threshold_result: dict,
                inference_cost_per_request: float = 0.003,   # Illustrative; depends on model and tokens per request
                embedding_cost_per_request: float = 0.00001,
                serving_cost_per_request: float = 0.0001,    # Redis/vector DB lookup
                requests_per_month: int = 1_000_000) -> dict:
    """
    Compute cost savings for a threshold policy.
    """
    monthly_requests = requests_per_month
    cache_hits = round(monthly_requests * threshold_result["cache_hit_rate"])
    cache_misses = monthly_requests - cache_hits

    # Cost of cache hits: embedding + serving (no inference)
    hit_cost = cache_hits * (embedding_cost_per_request + serving_cost_per_request)
    # Cost of cache misses: embedding + inference
    # (lookup and storage costs on a miss are not modeled here)
    miss_cost = cache_misses * (embedding_cost_per_request + inference_cost_per_request)
    total_cost_with_cache = hit_cost + miss_cost

    # Baseline cost (no caching)
    baseline_cost = monthly_requests * inference_cost_per_request

    # Savings
    savings = baseline_cost - total_cost_with_cache
    savings_percent = savings / baseline_cost if baseline_cost > 0 else 0.0
    return {
        "threshold": threshold_result["threshold"],
        "monthly_requests": monthly_requests,
        "cache_hits": cache_hits,
        "cache_misses": cache_misses,
        "hit_cost_usd": hit_cost,
        "miss_cost_usd": miss_cost,
        "total_cost_usd": total_cost_with_cache,
        "baseline_cost_usd": baseline_cost,
        "savings_usd": savings,
        "savings_percent": savings_percent,
    }
# Example: Compute ROI for different thresholds
for result in results:
    roi = compute_roi(result, inference_cost_per_request=0.003, requests_per_month=1_000_000)
    print(f"Threshold {roi['threshold']:.2f}: {roi['cache_hits']:,} hits/month, "
          f"${roi['savings_usd']:,.2f} saved ({roi['savings_percent']:.1%})")
Sample output (for the illustrative hit rates above, net of embedding and lookup overhead):
Threshold 0.85: 350,000 hits/month, $1,005.00 saved (33.5%)
Threshold 0.90: 220,000 hits/month, $628.00 saved (20.9%)
Threshold 0.92: 180,000 hits/month, $512.00 saved (17.1%)
Threshold 0.94: 120,000 hits/month, $338.00 saved (11.3%)
Threshold 0.96: 70,000 hits/month, $193.00 saved (6.4%)
Threshold 0.98: 30,000 hits/month, $77.00 saved (2.6%)
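The same cost model also gives a break-even hit rate: the cache pays for itself once the inference cost avoided on hits exceeds the embedding cost paid on every request plus the lookup cost paid on hits. The sketch below derives that point under the cost model used in `compute_roi`; the function names are hypothetical and the default costs are the illustrative values from this section.

```python
from typing import Optional


def savings_per_request(hit_rate: float,
                        inference_cost: float = 0.003,
                        embedding_cost: float = 0.00001,
                        serving_cost: float = 0.0001) -> float:
    """Average savings per request versus no caching.

    Every request pays embedding_cost; each hit replaces inference_cost
    with serving_cost. So: savings = h * (inference - serving) - embedding.
    """
    return hit_rate * (inference_cost - serving_cost) - embedding_cost


def break_even_hit_rate(inference_cost: float = 0.003,
                        embedding_cost: float = 0.00001,
                        serving_cost: float = 0.0001) -> Optional[float]:
    """Hit rate at which savings are zero, or None if a hit never saves money.

    A returned value above 1.0 means the cache cannot pay for itself.
    """
    if inference_cost <= serving_cost:
        return None
    return embedding_cost / (inference_cost - serving_cost)


h_star = break_even_hit_rate()
print(f"Break-even hit rate: {h_star:.2%}")
print(f"Monthly savings at 35% hit rate: {savings_per_request(0.35) * 1_000_000:,.2f} USD")
```

With the illustrative costs, the break-even hit rate is well under 1%, so every threshold in the sweep above is net positive on cost. The deciding factor is therefore the quality impact of false positives, not the overhead.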
A/B Testing Thresholds in Production
Theory meets practice: run an A/B test with real users to confirm your threshold choice.
Example: Production A/B test design
import hashlib


class ThresholdABTest:
    """A/B test two thresholds with real users."""

    def __init__(self, threshold_control: float = 0.94, threshold_variant: float = 0.90):
        self.threshold_control = threshold_control
        self.threshold_variant = threshold_variant
        self.control_metrics = {"hits": 0, "likes": 0, "dislikes": 0}
        self.variant_metrics = {"hits": 0, "likes": 0, "dislikes": 0}

    def assign_user_to_cohort(self, user_id: str) -> str:
        """Consistently assign user to control or variant (50/50)."""
        # Hash-based assignment: same user always gets same cohort.
        # MD5 is used only for stable bucketing, not for security.
        hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        return "control" if hash_val % 2 == 0 else "variant"

    def get_threshold(self, user_id: str) -> float:
        """Return the threshold for this user's cohort."""
        cohort = self.assign_user_to_cohort(user_id)
        return self.threshold_control if cohort == "control" else self.threshold_variant

    def record_feedback(self, user_id: str, feedback: str, was_cached: bool):
        """Record user feedback (like/dislike) for this response."""
        cohort = self.assign_user_to_cohort(user_id)
        metrics = self.control_metrics if cohort == "control" else self.variant_metrics
        if was_cached:
            metrics["hits"] += 1
        if feedback == "like":
            metrics["likes"] += 1
        elif feedback == "dislike":
            metrics["dislikes"] += 1

    @staticmethod
    def _satisfaction(metrics: dict) -> float:
        """Share of likes among all like/dislike votes (0.0 if there are no votes)."""
        rated = metrics["likes"] + metrics["dislikes"]
        return metrics["likes"] / rated if rated > 0 else 0.0

    def compute_results(self) -> dict:
        """Compare satisfaction rates after 2-4 weeks of data.

        This compares raw rates only; it does not test statistical significance,
        so treat "winner" as provisional until you have run a significance test.
        """
        control_satisfaction = self._satisfaction(self.control_metrics)
        variant_satisfaction = self._satisfaction(self.variant_metrics)
        return {
            "control": self.control_metrics,
            "variant": self.variant_metrics,
            "control_satisfaction": control_satisfaction,
            "variant_satisfaction": variant_satisfaction,
            "winner": "variant" if variant_satisfaction > control_satisfaction else "control",
        }
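Comparing raw satisfaction rates does not say whether the difference is larger than noise. One common check is a two-proportion z-test on the like/dislike counts of the two cohorts. The sketch below is one possible implementation using only the standard library, not part of the class above; the function name and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(likes_a: int, total_a: int,
                          likes_b: int, total_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference between two satisfaction rates.

    Returns (z, p_value). A positive z means cohort B has the higher rate.
    Uses the pooled-proportion standard error and a normal approximation,
    so it is only appropriate when both cohorts have plenty of votes.
    """
    if total_a == 0 or total_b == 0:
        return 0.0, 1.0
    p_a = likes_a / total_a
    p_b = likes_b / total_b
    pooled = (likes_a + likes_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if se == 0.0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: control 450 likes of 500 votes, variant 430 of 500
z, p_value = two_proportion_z_test(450, 500, 430, 500)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

In this made-up example the variant's satisfaction is four points lower, yet the p-value sits right around the conventional 0.05 cutoff, which illustrates why a few hundred votes per cohort may not be enough to declare a winner.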
Key Takeaways
- Threshold choice is empirical: measure baseline correctness, simulate each threshold, compute recall/precision/ROI.
- Recall-precision tradeoff is fundamental: lower threshold = more hits, lower precision (more false positives).
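- Cost-benefit analysis must factor in inference costs, embedding costs, and monthly query volume. At the example's 1,000,000 requests/month and USD 0.003 per inference, each percentage point of hit rate is worth about USD 30/month before lookup overhead; the value scales linearly with volume and per-request cost.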
- A/B test your threshold choice with real users for 2–4 weeks; satisfaction metrics (user feedback) are the ground truth.
Frequently Asked Questions
Should I use the same threshold for all users?
Start with one global threshold. As you grow, consider segmenting by query category (factual vs. analytical) or user profile (power users vs. casual). Threshold 0.94 for factual, 0.90 for analytical may outperform a single global threshold.
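A per-category policy can be as simple as a lookup table with a global fallback. The sketch below uses the two example values from the answer above; the category labels, the default value, and the function names are assumptions, and how a query gets its category (a classifier, routing rules, and so on) is outside its scope.

```python
from typing import Optional

# Assumed global fallback for queries with no category or an unknown one
DEFAULT_THRESHOLD = 0.94

# Example values from the text; category names are hypothetical labels
CATEGORY_THRESHOLDS = {
    "factual": 0.94,
    "analytical": 0.90,
}


def threshold_for(category: Optional[str]) -> float:
    """Return the similarity threshold for a query category."""
    if category is None:
        return DEFAULT_THRESHOLD
    return CATEGORY_THRESHOLDS.get(category, DEFAULT_THRESHOLD)


def should_serve_from_cache(similarity: float, category: Optional[str] = None) -> bool:
    """Decide whether a cache candidate is similar enough to serve."""
    return similarity >= threshold_for(category)


print(should_serve_from_cache(0.92, "analytical"))  # served: 0.92 >= 0.90
print(should_serve_from_cache(0.92, "factual"))     # not served: 0.92 < 0.94
```

Keeping the policy in plain data makes it easy to evaluate each category separately with the same threshold sweep used for the global value.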
What if my baseline data is small (< 100 queries)?
Use caution: results may be noisy. Increase sample size to 500+ queries before finalizing thresholds. Or run a longer A/B test to gather sufficient data.
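To see how noisy a small sample is, you can put a confidence interval around the measured precision. The sketch below uses the Wilson score interval, a standard choice for proportions, to compare the same 86% precision measured on 100 and on 500 served responses. The function name and the counts are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed precision (86%), different sample sizes
small = wilson_interval(86, 100)
large = wilson_interval(430, 500)
print(f"n=100: {small[0]:.3f} to {small[1]:.3f}")
print(f"n=500: {large[0]:.3f} to {large[1]:.3f}")
```

With 100 samples the interval spans roughly 14 percentage points, which is wider than the gap between adjacent thresholds in a typical sweep. With 500 it narrows to about 6 points.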
Can I auto-tune the threshold dynamically?
Yes, using bandit algorithms (epsilon-greedy, UCB). Start with a global threshold, and gradually shift toward the variant with higher user satisfaction. Requires 2–3 weeks of data.
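As a rough illustration of the epsilon-greedy option, the sketch below treats each candidate threshold as an arm and a "like" as a reward of 1. It is a minimal sketch under stated assumptions, not a production tuner: the class name, the candidate thresholds, and the simulated like rates are all hypothetical.

```python
import random


class EpsilonGreedyThreshold:
    """Epsilon-greedy selection among candidate similarity thresholds."""

    def __init__(self, thresholds: list[float], epsilon: float = 0.1, seed=None):
        self.thresholds = list(thresholds)
        self.epsilon = epsilon
        self.pulls = {t: 0 for t in self.thresholds}
        self.likes = {t: 0 for t in self.thresholds}
        self.rng = random.Random(seed)

    def mean_reward(self, threshold: float) -> float:
        """Observed like rate for a threshold (0.0 before any feedback)."""
        pulls = self.pulls[threshold]
        return self.likes[threshold] / pulls if pulls else 0.0

    def choose(self) -> float:
        """Pick a threshold for the next request."""
        # Try every arm once before exploiting
        untried = [t for t in self.thresholds if self.pulls[t] == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best arm so far
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.thresholds)
        return max(self.thresholds, key=self.mean_reward)

    def update(self, threshold: float, liked: bool) -> None:
        """Record one piece of user feedback for the threshold that was used."""
        self.pulls[threshold] += 1
        self.likes[threshold] += int(liked)


# Simulation with made-up like rates per threshold
true_like_rate = {0.90: 0.80, 0.94: 0.90}
bandit = EpsilonGreedyThreshold([0.90, 0.94], epsilon=0.1, seed=0)
sim_rng = random.Random(1)
for _ in range(2000):
    t = bandit.choose()
    bandit.update(t, sim_rng.random() < true_like_rate[t])
print(bandit.pulls)
```

Note that a reward based only on likes tends to push toward the highest threshold, since fewer cache hits means fewer wrong answers. A practical reward would likely also credit the cost saved on each hit.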
What if I observe precision dropping below 80%?
That threshold is too aggressive for most domains. Raise it by 0.02–0.05 and re-test. When more than 20% of served cache hits are wrong, user trust and adoption will suffer.
How often should I re-tune my threshold?
Quarterly is a good cadence. Re-evaluate if you change LLM models, query distribution shifts significantly, or user feedback trends downward.