
Tolerance-Based Assertions for LLM Output Validation

Tolerance-based assertions define acceptable ranges for LLM output metrics instead of demanding exact text matches. Rather than asserting response == "exactly this text", you assert length_of(response) < 200 or similarity_to_reference > 0.85. This article teaches you to design resilient tests that validate semantic quality without brittleness.

LLM outputs are inherently variable: the same question asked with temperature 0.7 might produce three slightly different phrasings, all correct. Exact-match assertions fail on legitimate variations. Tolerance-based assertions accept variation within acceptable bounds, freeing you to test quality without over-constraining.
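To make the contrast concrete, here is a minimal sketch that needs no API call. The three candidate strings and the `within_tolerance` helper are hypothetical, invented purely for illustration; they stand in for three samples of the same correct answer.

```python
def within_tolerance(response, min_words, max_words, required_terms):
    """Return True if length and term coverage fall inside the accepted bounds."""
    word_count = len(response.split())
    text = response.lower()
    has_terms = all(term.lower() in text for term in required_terms)
    return min_words <= word_count <= max_words and has_terms

# Three hypothetical phrasings of the same correct answer.
candidates = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Paris.",
]
reference = "Paris is the capital of France."

# Exact match accepts only the phrasing that happens to equal the reference.
exact_matches = [c == reference for c in candidates]

# The tolerance check accepts any phrasing of 3-12 words naming both terms.
tolerant_matches = [
    within_tolerance(c, 3, 12, ["paris", "france"]) for c in candidates
]

print(exact_matches)
print(tolerant_matches)
```

Only the first candidate survives the exact comparison, while all three satisfy the tolerance check. The bounds (3 to 12 words) are arbitrary here; in practice you would derive them from the task.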

Core Metrics for LLM Validation

What should you measure? Common metrics depend on your use case:

| Metric | Use Case | Example |
|---|---|---|
| Length | Ensure outputs fit constraints | 100 < len(response) < 500 |
| Semantic similarity | Check if output matches intent | similarity_score(response, reference) > 0.8 |
| Keyword presence | Verify key concepts are covered | "climate change" in response.lower() |
| JSON structure | Validate structured outputs | "id" in json.loads(response) |
| Factuality | Check against ground truth (harder) | fact_checker.verify(response, facts_db) (external service) |
| Tone/Style | Ensure response matches brand voice | classifier.predict_tone(response) == "professional" |
| Token count | Monitor cost and latency | tokens < 1000 |
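The JSON structure row can be sketched with the standard library alone. The helper name `check_json_structure` and the sample payloads are assumptions made for illustration; the idea is to parse first and then check keys on the parsed object rather than on the raw string.

```python
import json

def check_json_structure(response_text, required_keys):
    """Return a list of problems; an empty list means the structure is acceptable."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return [f"expected a JSON object, got {type(data).__name__}"]
    # Report every missing key at once so a failing test is easy to diagnose.
    missing = sorted(set(required_keys) - set(data))
    return [f"missing key: {key}" for key in missing]

# Hypothetical model outputs.
print(check_json_structure('{"id": 1, "name": "widget"}', {"id", "name"}))
print(check_json_structure('{"id": 1}', {"id", "name"}))
```

Returning a list of problems instead of raising immediately keeps the check tolerant in the useful sense: extra keys are accepted, and only the properties you care about are enforced.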

Length and Structure Assertions

The simplest tolerance assertions check output properties:

import re
from anthropic import Anthropic

# Reads ANTHROPIC_API_KEY from the environment; avoid hardcoding keys in tests.
client = Anthropic()

def test_summary_length():
    """Ensure summaries fit the requested length."""

    article = "The future of AI is bright. [... 2000 words ...]"

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=400,  # headroom so a 150-word summary is not cut off mid-sentence
        temperature=0.6,
        messages=[
            {
                "role": "user",
                "content": f"Summarize in 100-150 words:\n\n{article}"
            }
        ]
    )

    summary = response.content[0].text
    word_count = len(summary.split())

    # Tolerance-based assertion: a 20-word buffer on each side of the requested
    # 100-150 range, because models hit word counts only approximately.
    assert 80 <= word_count <= 170, f"Expected 100-150 words (±20), got {word_count}"

def test_response_format():
    """Ensure response adheres to expected structure."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=0.5,
        messages=[
            {
                "role": "user",
                "content": "List 3 benefits of Python in this format:\n1. [benefit]\n2. [benefit]\n3. [benefit]"
            }
        ]
    )

    text = response.content[0].text

    # Check structure (tolerance: allow minor variations like "- " vs "1. ")
    # Blank lines are dropped so spacing between items does not inflate the count.
    lines = [line for line in text.strip().split('\n') if line.strip()]
    assert len(lines) >= 3, f"Expected at least 3 items, got {len(lines)}"

    # Check that items look like a list: "1.", "2)", "-", "*", or "•" markers
    list_item_pattern = r'^\s*(?:\d+[.)]|[-*•])\s+'
    list_items = [line for line in lines if re.match(list_item_pattern, line)]
    assert len(list_items) >= 3, f"Expected 3 list items, got {text}"

Semantic Similarity Assertions

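For comparing outputs to a reference, start with a lightweight lexical baseline. TF-IDF cosine similarity rewards shared vocabulary, so it approximates semantic similarity only when the output reuses the reference's wording: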

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_similarity(text1, text2):
    """Compute cosine similarity between two texts using TF-IDF vectors."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
    try:
        vectors = vectorizer.fit_transform([text1, text2])
    except ValueError:
        # Raised when the vocabulary is empty (e.g., both texts contain only stop words)
        return 0.0
    similarity = cosine_similarity(vectors[0:1], vectors[1:2])[0][0]
    return float(similarity)

def test_summary_similarity():
    """Ensure summary shares enough vocabulary with the original article."""

    article = """
    Photosynthesis is the process by which plants convert light energy into chemical energy.
    Chlorophyll in plant leaves absorbs sunlight, triggering a series of reactions that
    produce glucose and oxygen. This fundamental process supports most life on Earth.
    """

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        temperature=0.5,
        messages=[{"role": "user", "content": f"Summarize: {article}"}]
    )

    summary = response.content[0].text
    similarity = semantic_similarity(article, summary)

    # Tolerance: require at least 0.7 cosine similarity (identical texts score 1.0).
    # Heavily paraphrased summaries score lower, so calibrate this bound on your data.
    assert similarity > 0.7, f"Summary too dissimilar: {similarity:.2f} <= 0.7"

For more robust semantic similarity, use embedding-based comparison (requires a model like OpenAI's embeddings or Sentence Transformers):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def embedding_similarity(text1, text2):
    """Compute cosine similarity using sentence embeddings."""
    # encode() returns NumPy arrays by default, which is what np.dot and
    # np.linalg.norm expect (tensors on a GPU would not convert implicitly).
    embed1 = model.encode(text1)
    embed2 = model.encode(text2)
    return float(
        np.dot(embed1, embed2) / (np.linalg.norm(embed1) * np.linalg.norm(embed2))
    )

def test_response_captures_intent():
    """Ensure response addresses the user's intent."""

    user_question = "How do I optimize database queries?"
    reference_answer = "Use indexes, query analysis tools, and batch operations."

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=0.6,
        messages=[{"role": "user", "content": user_question}]
    )

    llm_answer = response.content[0].text

    # Embedding-based similarity: more robust to paraphrasing
    intent_similarity = embedding_similarity(user_question + " " + reference_answer, llm_answer)
    assert intent_similarity > 0.75, f"Answer doesn't address intent: similarity {intent_similarity:.2f}"

Keyword and Coverage Assertions

Verify that responses include required concepts:

import re

def test_covers_key_topics():
    """Ensure explanation covers required topics."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        temperature=0.5,
        messages=[
            {
                "role": "user",
                "content": "Explain machine learning. Include: supervised learning, unsupervised learning, and neural networks."
            }
        ]
    )

    text = response.content[0].text.lower()

    required_concepts = ["supervised", "unsupervised", "neural"]
    # Match whole words: a plain substring test would count "unsupervised"
    # as evidence of "supervised".
    missing = [c for c in required_concepts if not re.search(rf"\b{re.escape(c)}\b", text)]

    assert not missing, f"Missing topics: {missing}"

def test_code_has_comments():
    """Ensure generated code includes comments (quality assertion)."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        temperature=0.3,
        messages=[
            {
                "role": "user",
                "content": "Write a Python function to sort a list. Include comments."
            }
        ]
    )

    code = response.content[0].text

    # Check for comments, ignoring blank lines so spacing does not skew the ratio.
    # Note: the response may include prose around the code, which also counts as lines.
    lines = [line for line in code.split('\n') if line.strip()]
    comment_lines = [line for line in lines if '#' in line]
    comment_ratio = len(comment_lines) / len(lines) if lines else 0.0

    assert comment_ratio > 0.15, f"Expected >15% comment lines, got {comment_ratio:.1%}"
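Coverage checks become more tolerant when each concept accepts several phrasings. The sketch below is one way to do that, assuming a hand-written alias table; the `missing_concepts` helper and the alias lists are hypothetical and would need tuning for a real prompt. Whole-word matching also avoids a subtle trap: "supervised" is a substring of "unsupervised".

```python
import re

def missing_concepts(text, concepts):
    """Return the concepts for which no alias appears as a whole word or phrase."""
    lowered = text.lower()
    missing = []
    for name, aliases in concepts.items():
        found = any(
            re.search(rf"\b{re.escape(alias.lower())}\b", lowered)
            for alias in aliases
        )
        if not found:
            missing.append(name)
    return missing

# Hypothetical alias table: any one alias counts as covering the concept.
concepts = {
    "supervised learning": ["supervised learning", "labeled data"],
    "unsupervised learning": ["unsupervised learning", "clustering"],
    "neural networks": ["neural network", "neural networks", "deep learning"],
}

sample = (
    "Models trained on labeled data differ from clustering methods. "
    "Deep learning builds on both."
)
print(missing_concepts(sample, concepts))
print(missing_concepts("Unsupervised learning finds structure.", concepts))
```

The first sample covers all three concepts without using any of the exact prompt wording. The second mentions only unsupervised learning, and the word-boundary match correctly reports supervised learning as missing.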

Tone and Style Assertions

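For brand voice consistency, check tone with a proxy such as sentiment polarity plus a blocklist of informal words. A dedicated tone classifier is more reliable, but this is a cheap first pass: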

import re
from textblob import TextBlob

def test_professional_tone():
    """Ensure response maintains professional tone."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=0.5,
        messages=[
            {
                "role": "user",
                "content": "Write a professional email declining a meeting."
            }
        ]
    )

    text = response.content[0].text
    blob = TextBlob(text)

    # Polarity: -1 (negative) to 1 (positive)
    # A professional decline should be neutral to moderately positive
    assert -0.3 < blob.sentiment.polarity < 0.5, \
        f"Tone too negative/positive: {blob.sentiment.polarity:.2f}"

    # Check for unprofessional words (whole words only, so "lol" does not
    # match inside a longer word)
    unprofessional = ["lol", "gonna", "btw", "dunno"]
    found = [w for w in unprofessional if re.search(rf"\b{w}\b", text.lower())]
    assert not found, f"Unprofessional language detected: {found}"

Building a Tolerance Assertion Framework

Create a reusable assertion utility:

from dataclasses import dataclass
from typing import Callable

@dataclass
class ToleranceAssertion:
    """Validates LLM output against tolerance criteria."""

    name: str
    check_fn: Callable[[str], float]  # Returns a score
    min_value: float = 0.0
    max_value: float = 1.0
    required: bool = True

class LLMValidator:
    def __init__(self):
        self.assertions = []

    def add_assertion(self, assertion: ToleranceAssertion):
        self.assertions.append(assertion)

    def validate(self, response: str) -> dict:
        """Run all assertions, return results."""
        results = {}
        failures = []

        for assertion in self.assertions:
            try:
                score = assertion.check_fn(response)
                passed = assertion.min_value <= score <= assertion.max_value
                results[assertion.name] = {
                    "score": score,
                    "passed": passed,
                    "min": assertion.min_value,
                    "max": assertion.max_value
                }
                if not passed and assertion.required:
                    failures.append(
                        f"{assertion.name}: {score:.2f} outside [{assertion.min_value}, {assertion.max_value}]"
                    )
            except Exception as e:
                results[assertion.name] = {"error": str(e)}
                if assertion.required:
                    failures.append(f"{assertion.name}: {e}")

        if failures:
            raise AssertionError("Tolerance assertions failed:\n" + "\n".join(failures))

        return results

# Usage
validator = LLMValidator()
validator.add_assertion(ToleranceAssertion(
    name="length",
    check_fn=lambda text: len(text.split()),
    min_value=50,
    max_value=300,
    required=True
))
validator.add_assertion(ToleranceAssertion(
    name="similarity",
    check_fn=lambda text: embedding_similarity("Python reference", text),
    min_value=0.7,
    max_value=1.0,
    required=True
))

response = client.messages.create(...)
results = validator.validate(response.content[0].text)
print(results)

Key Takeaways

  • Tolerance-based assertions validate output ranges (e.g., length, similarity) instead of exact text matches, allowing legitimate variation.
  • Key metrics: length, semantic similarity, keyword coverage, tone, structure, and factuality. Choose metrics matching your use case.
  • Use embedding-based similarity for robust semantic comparison; TF-IDF for lightweight checking.
  • Build a reusable validator framework to standardize tolerance checks across tests.
  • Combine tolerance assertions with snapshots: snapshots for exact behavior on fixed inputs, tolerances for flexibility on varied inputs.

Frequently Asked Questions

What similarity score threshold should I use?

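It depends on your task and on the similarity method, since scores from different embedding models (or from TF-IDF) are not comparable. As rough starting points, 0.7–0.8 is common for summaries and 0.75–0.9 for Q&A. Test on your data: compute similarities between your reference answers and LLM outputs you have labeled as good or bad, then pick a threshold that passes the good responses and rejects the bad ones.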
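That calibration step can be sketched as a small search over candidate thresholds. The scores below are made-up numbers standing in for similarities you would compute on your own labeled responses, and `pick_threshold` is a hypothetical helper; it assumes a response passes when its score is greater than or equal to the threshold.

```python
def pick_threshold(good_scores, bad_scores):
    """Return the candidate threshold that classifies the most labeled scores correctly."""
    candidates = sorted(set(good_scores) | set(bad_scores))
    best_threshold, best_correct = None, -1
    for t in candidates:
        # A good response is correct if it passes; a bad one if it is rejected.
        correct = sum(s >= t for s in good_scores) + sum(s < t for s in bad_scores)
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold

# Illustrative similarity scores for responses labeled by hand.
good = [0.82, 0.78, 0.91, 0.74]
bad = [0.41, 0.66, 0.58]

threshold = pick_threshold(good, bad)
print(threshold)
```

With these sample scores the two groups are cleanly separable, so the chosen threshold sits at the lowest good score. Real data usually overlaps, in which case you would weigh false passes against false failures rather than maximizing raw accuracy.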

How do I check factuality programmatically?

This is hard without external knowledge. Options: (1) Cross-check extracted entities and claims with a knowledge-graph or entity-extraction API such as Diffbot or TextRazor, (2) Compare against a curated fact database, (3) Use a secondary LLM to verify facts (e.g., Claude verifies Claude), keeping in mind that the verifier can share the generator's blind spots, (4) Measure overlap with authoritative sources. For critical applications, human review is still the gold standard.
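Option (2) can be sketched for the narrow case of numeric claims. Everything here is an assumption for illustration: the `FACTS` table, its regex patterns, and the `check_numeric_facts` helper are hypothetical, and a real fact database would need far more careful claim extraction.

```python
import re

# Hypothetical curated facts: a regex that captures the stated number,
# the expected value, and how much deviation is tolerated.
FACTS = {
    "boiling_point_c": {
        "pattern": r"boils at\s+(\d+(?:\.\d+)?)",
        "expected": 100.0,
        "tolerance": 0.5,
    },
}

def check_numeric_facts(text, facts):
    """Return a list of stated values that contradict the curated facts."""
    problems = []
    for name, fact in facts.items():
        match = re.search(fact["pattern"], text, flags=re.IGNORECASE)
        if match is None:
            continue  # the fact is not mentioned, so there is nothing to contradict
        value = float(match.group(1))
        if abs(value - fact["expected"]) > fact["tolerance"]:
            problems.append(f"{name}: stated {value}, expected {fact['expected']}")
    return problems

print(check_numeric_facts("Water boils at 100 degrees Celsius at sea level.", FACTS))
print(check_numeric_facts("Water boils at 90 degrees.", FACTS))
```

This only catches contradictions of facts you have already encoded; it says nothing about claims outside the table, which is why the other options and human review remain necessary.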

Should I mix tolerance assertions and snapshots?

Yes. Use snapshots for exact outputs on controlled inputs (e.g., "question Q with seed S should produce output E", where the API supports a seed, or temperature 0 otherwise). Use tolerance assertions for flexible inputs where variation is acceptable. In complex systems, combine both: snapshots pin down known cases, and tolerance assertions cover everything else.
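One way to combine the two is a single comparison helper whose strictness is a parameter. This is a sketch under assumptions: `matches_snapshot` is a hypothetical name, and the standard library's `difflib.SequenceMatcher` is used as a character-level stand-in for whatever similarity measure you actually rely on (it is not a semantic metric).

```python
import difflib

def matches_snapshot(output, snapshot, min_ratio=1.0):
    """Compare output to a stored snapshot; min_ratio=1.0 demands an exact match."""
    if min_ratio >= 1.0:
        return output == snapshot
    # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
    ratio = difflib.SequenceMatcher(None, output, snapshot).ratio()
    return ratio >= min_ratio

snapshot = "The capital is Paris."
output = "The capital is Paris!"

# Strict mode for controlled inputs, relaxed mode where variation is acceptable.
print(matches_snapshot(output, snapshot))
print(matches_snapshot(output, snapshot, min_ratio=0.9))
```

The same stored snapshot then serves both purposes: strict comparison for controlled inputs and a relaxed bound where small variations are acceptable.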

How do I handle multi-lingual outputs?

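Not automatically. Lexical metrics such as TF-IDF do not transfer across languages, and English-focused embedding models such as all-MiniLM-L6-v2 lose accuracy on other languages. If your LLM can output multiple languages, test each language separately against a same-language reference, or use multilingual embeddings (e.g., sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2).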

Further Reading