Quality Filtering and Validation Techniques

Raw synthetic data from language models contains errors: hallucinated product names, malformed JSON, constraint violations, and semantic inconsistencies. Filtering removes these defects, improving dataset quality by 30–50% without regenerating data. A 2025 study by Hugging Face found that models trained on validated synthetic data achieve 8–12 percentage points higher accuracy than models trained on unfiltered synthetic data.

Automated Validation Layers

Build a multi-layer validation pipeline that checks data at different levels: format, schema, constraints, and semantic consistency.

Layer 1: Format and Schema Validation

Check that output is parseable and matches your expected schema:

import json
from typing import Any, Dict, List

import jsonschema

def validate_json_format(raw_text: str) -> tuple[bool, str, Dict[str, Any]]:
    """
    Validate that raw text parses as a JSON object.
    Returns: (is_valid, error_message, parsed_dict)
    """
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON: {e}", {}
    # json.loads also accepts arrays, strings, and numbers; require an object
    if not isinstance(parsed, dict):
        return False, f"Expected a JSON object, got {type(parsed).__name__}", {}
    return True, "", parsed

def validate_schema(example: Dict[str, Any], schema: Dict[str, Any]) -> tuple[bool, str]:
    """
    Validate that an example matches the expected JSON schema.
    """
    try:
        # "format" keywords (e.g. date-time) are only enforced when a format
        # checker is supplied; some formats also need optional dependencies.
        jsonschema.validate(
            instance=example,
            schema=schema,
            format_checker=jsonschema.FormatChecker(),
        )
        return True, ""
    except jsonschema.ValidationError as e:
        return False, f"Schema violation: {e.message}"

# Example schema for customer support tickets
ticket_schema = {
    "type": "object",
    "properties": {
        "ticket_id": {"type": "string", "pattern": "^TICKET-\\d{5}$"},
        "severity": {"type": "string", "enum": ["Low", "Medium", "High", "Critical"]},
        "description": {"type": "string", "minLength": 20, "maxLength": 500},
        "created_at": {"type": "string", "format": "date-time"}
    },
    "required": ["ticket_id", "severity", "description", "created_at"],
    "additionalProperties": False
}

# Validate a batch of examples
def validate_batch(examples: List[Dict], schema: Dict) -> tuple[List[Dict], List[str]]:
    """
    Filter already-parsed examples by schema.
    Returns: (valid_examples, error_logs)
    """
    valid_examples = []
    errors = []

    for i, example in enumerate(examples):
        is_valid, error_msg = validate_schema(example, schema)
        if is_valid:
            valid_examples.append(example)
        else:
            errors.append(f"Example {i}: {error_msg}")

    return valid_examples, errors

# Usage:
# valid, errors = validate_batch(generated_tickets, ticket_schema)
# print(f"Valid: {len(valid)}/{len(generated_tickets)} ({100*len(valid)/len(generated_tickets):.1f}%)")

Typically 85–95% of LLM-generated examples pass format validation. Failures are usually JSON parse errors or missing required fields.
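To see which of these two failure modes dominates in a given batch, a minimal standard-library sketch can bucket raw outputs before full schema validation. The function name `classify_format_failures` and the sample strings are illustrative assumptions, not part of the pipeline above; the required field names are taken from `ticket_schema`.

```python
import json
from collections import Counter

def classify_format_failures(
    raw_outputs,
    required=("ticket_id", "severity", "description", "created_at"),
):
    """Bucket raw outputs into ok / parse_error / not_object / missing_field."""
    counts = Counter()
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            counts["parse_error"] += 1
            continue
        if not isinstance(parsed, dict):
            counts["not_object"] += 1
        elif any(key not in parsed for key in required):
            counts["missing_field"] += 1
        else:
            counts["ok"] += 1
    return counts

# Illustrative inputs: one complete object, one missing fields, one truncated
samples = [
    '{"ticket_id": "TICKET-00001", "severity": "Low", '
    '"description": "x", "created_at": "2025-01-01T00:00:00"}',
    '{"ticket_id": "TICKET-00002", "severity": "High"}',
    '{"ticket_id": "TICKET-00003", "severity": ',
]
print(classify_format_failures(samples))
```

A high `parse_error` count usually points at truncated or wrapped output, while a high `missing_field` count suggests the prompt does not state the required fields clearly enough.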

Layer 2: Business Logic and Constraint Validation

Check that examples satisfy domain-specific rules:

from datetime import datetime, timezone

# IDs already accepted into the dataset. Assumed to be module-level state;
# seed it from existing data if you have any.
KNOWN_IDS: set[str] = set()

def validate_constraints(example: Dict[str, Any]) -> tuple[bool, List[str]]:
    """
    Validate that example meets business logic constraints.
    Returns: (is_valid, list_of_violations)
    Each violation is formatted "Type: detail" so callers can group by type.
    """
    violations = []

    # Constraint 1: Ticket ID must be unique (check against a known set)
    if example.get("ticket_id") in KNOWN_IDS:
        violations.append(f"Duplicate ticket ID: {example['ticket_id']}")

    # Constraint 2: Severity and description should match
    # (High/Critical severity should have detailed descriptions)
    severity = example.get("severity")
    description = example.get("description", "")

    if severity in ["High", "Critical"] and len(description) < 50:
        violations.append(
            f"Description too short: severity {severity} requires >= 50 chars, got {len(description)}"
        )

    # Constraint 3: No profanity or abusive language
    forbidden_terms = ["idiot", "useless", "garbage"]
    if any(term in description.lower() for term in forbidden_terms):
        violations.append("Forbidden language: detected in description")

    # Constraint 4: Created timestamp must parse and must not be in the future
    raw_ts = str(example.get("created_at", ""))
    try:
        # Python < 3.11 does not accept a trailing "Z" in fromisoformat
        created_at = datetime.fromisoformat(raw_ts.replace("Z", "+00:00"))
    except ValueError:
        violations.append(f"Invalid timestamp: {raw_ts!r}")
    else:
        if created_at.tzinfo is None:
            created_at = created_at.replace(tzinfo=timezone.utc)  # assume UTC
        if created_at > datetime.now(timezone.utc):
            violations.append(f"Future timestamp: {created_at}")

    return len(violations) == 0, violations

# Validate batch with constraint checking
def validate_with_constraints(examples: List[Dict]) -> tuple[List[Dict], Dict[str, int]]:
    """
    Filter examples by constraints and track violation types.
    """
    valid_examples = []
    violation_counts = {}

    for example in examples:
        is_valid, violations = validate_constraints(example)
        if is_valid:
            valid_examples.append(example)
            # Record the ID so later duplicates (in this or a later batch) are caught
            if "ticket_id" in example:
                KNOWN_IDS.add(example["ticket_id"])
        else:
            for violation in violations:
                # Track violation types for reporting (text before the first colon)
                violation_type = violation.split(":")[0]
                violation_counts[violation_type] = violation_counts.get(violation_type, 0) + 1

    return valid_examples, violation_counts

# Usage:
# valid, violation_stats = validate_with_constraints(examples)
# print(f"Violations: {violation_stats}")

Constraint validation typically passes 80–92% of examples. Common failures: profanity slipping through, timestamp errors, semantic inconsistencies.

Layer 3: Semantic Plausibility Checks

Use simple heuristics to catch nonsensical outputs:

import re

def validate_semantic_plausibility(example: Dict[str, Any]) -> tuple[bool, List[str]]:
    """
    Check semantic coherence without expensive model calls.
    """
    issues = []
    description = example.get("description", "")

    # Check 1: Generic/templated language (red flag)
    generic_phrases = [
        "as above",
        "see previous",
        "this is a test",
        "lorem ipsum",
        "placeholder"
    ]
    if any(phrase in description.lower() for phrase in generic_phrases):
        issues.append("Detected generic/template language")

    # Check 2: Repetition (high repetition suggests low quality)
    words = description.split()
    if len(words) > 0:
        unique_ratio = len(set(words)) / len(words)
        if unique_ratio < 0.6:  # <60% unique words suggests too much repetition
            issues.append(f"High word repetition (unique ratio: {unique_ratio:.2%})")

    # Check 3: Unrealistic numbers (e.g., prices that are extreme)
    prices = re.findall(r'\$(\d+(?:,\d{3})*(?:\.\d{2})?)', description)
    if prices:
        price_values = [float(p.replace(",", "")) for p in prices]
        if any(p > 1000000 for p in price_values):
            issues.append(f"Unrealistic price detected: {max(price_values)}")

    # Check 4: Leaked placeholder values. Whole-word match, so words such as
    # "nullify" are not flagged.
    if re.search(r"\b(?:undefined|null)\b", description.lower()):
        issues.append("Mentions undefined/null values (likely hallucination)")

    return len(issues) == 0, issues

# Usage:
# is_plausible, issues = validate_semantic_plausibility(example)
# if not is_plausible:
#     print(f"Semantic issues: {issues}")

Semantic validation catches 10–20% of examples that pass schema but are low-quality.
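To make the repetition heuristic concrete, here is a small worked sketch of the unique-word ratio on two invented descriptions. The helper name `unique_word_ratio` and both sample strings are assumptions for illustration; the 0.6 threshold is the one used in the check above.

```python
def unique_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated words that are distinct."""
    words = text.split()
    # Treat empty text as non-repetitive so it is handled by other checks
    return len(set(words)) / len(words) if words else 1.0

varied = "Customer cannot reset their password after the latest mobile app update"
repetitive = "error error error the app shows error error again and again"

for label, text in [("varied", varied), ("repetitive", repetitive)]:
    ratio = unique_word_ratio(text)
    flagged = ratio < 0.6  # same threshold as the repetition check
    print(f"{label}: ratio={ratio:.2f} flagged={flagged}")
```

The first description has 11 distinct words out of 11, so it passes; the second has 6 distinct words out of 11 and is flagged. Longer texts naturally repeat common words more often, so the threshold may need tuning if your descriptions are much longer than a few sentences.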

Human-in-the-Loop Validation

For high-stakes applications, sample-based human review improves confidence:

import random

def human_validation_sample(all_examples: List[Dict], sample_size: int = 100) -> tuple[float, List[Dict]]:
    """
    Get a random sample for human review.

    Args:
        all_examples: Full list of validated examples
        sample_size: Number to sample for human review

    Returns:
        approval_rate (0-1), rejected_examples list
    """
    sample = random.sample(all_examples, min(sample_size, len(all_examples)))

    # In production, these go to a human reviewer via UI/API
    # For now, simulate sampling: return the sample and await human feedback

    print(f"Sample of {len(sample)} examples for human review:")
    for i, ex in enumerate(sample[:3]):  # Show first 3
        print(f"\n{i+1}. {ex}")

    # Placeholder: in production, humans would rate each example.
    # Both values below are simulated and are not derived from each other.
    approval_rate = 0.92  # Simulated: 92% approved
    rejected_examples = sample[:3]  # Simulated rejections, drawn from the sample

    return approval_rate, rejected_examples

# Optional: Use model-based scoring instead of human review
def model_based_quality_scoring(example: Dict[str, Any], model_client) -> float:
    """
    Use a separate LLM to rate example quality.
    Assumes model_client is an Anthropic-style client exposing messages.create.
    Returns: score from 0 (bad) to 1 (excellent)
    """
    prompt = f"""Rate the quality of this customer support ticket on a scale of 0-1.
Consider: realism, clarity, specificity, and appropriateness.

Ticket:
{json.dumps(example, indent=2)}

Respond with only a decimal number between 0 and 1."""

    response = model_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )

    try:
        score = float(response.content[0].text.strip())
        return max(0.0, min(1.0, score))  # Clamp to [0, 1]
    except (ValueError, IndexError, AttributeError):
        return 0.5  # Default if the reply is missing or not a number

Human review of a random sample (100–500 examples) typically identifies systematic failures in your generation pipeline worth fixing upstream.
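How much a sample of 100 versus 500 tells you about the whole dataset can be estimated with a confidence interval on the approval rate. The sketch below uses the Wilson score interval, which is one common choice rather than the only one, and assumes the reviewed examples are a simple random sample; the function name and the example counts are illustrative.

```python
import math

def wilson_interval(approved: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = approved / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Same observed approval rate (92%) at two sample sizes
for approved, n in [(92, 100), (460, 500)]:
    low, high = wilson_interval(approved, n)
    print(f"n={n}: approval between {low:.1%} and {high:.1%}")
```

With the same observed rate, the larger sample gives a noticeably narrower interval, which is the practical argument for reviewing closer to 500 examples when the decision is high-stakes.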

Combined Validation Pipeline

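Integrate all layers into a single pipeline. Format parsing and schema validation run as separate stages here, so the pipeline reports four layers: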

def full_validation_pipeline(
    raw_examples: List[str],
    schema: Dict,
    min_confidence: float = 0.80
) -> tuple[List[Dict], Dict[str, Any]]:
    """
    Full multi-layer validation pipeline.

    Note: min_confidence is accepted but not used by any check below.

    Returns:
        valid_examples: List of examples passing all checks
        summary: Statistics on validation
    """
    summary = {
        "input_count": len(raw_examples),
        "format_pass": 0,
        "schema_pass": 0,
        "constraint_pass": 0,
        "semantic_pass": 0,
        "final_pass": 0
    }

    # Layer 1: Parse JSON
    parsed_examples = []
    for raw in raw_examples:
        is_valid, _, parsed = validate_json_format(raw)
        if is_valid:
            parsed_examples.append(parsed)
            summary["format_pass"] += 1

    # Layer 2: Schema validation
    schema_valid = []
    for parsed in parsed_examples:
        is_valid, _ = validate_schema(parsed, schema)
        if is_valid:
            schema_valid.append(parsed)
            summary["schema_pass"] += 1

    # Layer 3: Constraint validation
    constraint_valid, _ = validate_with_constraints(schema_valid)
    summary["constraint_pass"] = len(constraint_valid)

    # Layer 4: Semantic checks
    final_valid = []
    for example in constraint_valid:
        is_plausible, _ = validate_semantic_plausibility(example)
        if is_plausible:
            final_valid.append(example)
            summary["semantic_pass"] += 1

    summary["final_pass"] = len(final_valid)
    # Guard against an empty input list
    summary["pass_rate"] = (
        summary["final_pass"] / summary["input_count"] if raw_examples else 0.0
    )

    return final_valid, summary

# Usage:
# valid, stats = full_validation_pipeline(raw_outputs, ticket_schema)
# print(f"Pass rate: {stats['pass_rate']:.1%} ({stats['final_pass']}/{stats['input_count']})")

Typical pass rates through the full pipeline are 70–85% of raw outputs. To end up with 10,000 valid examples, expect to generate roughly 11,800–14,300 raw ones (10,000 ÷ 0.85 to 10,000 ÷ 0.70); budgeting ~15,000 leaves a safety margin. Account for this overhead in your cost estimates.
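The generation budget follows from dividing the target count by the expected pass rate. A minimal sketch of that arithmetic, with `required_generations` as a hypothetical helper name:

```python
import math

def required_generations(target_valid: int, pass_rate: float) -> int:
    """Raw examples to generate so that about target_valid survive filtering."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return math.ceil(target_valid / pass_rate)

# Budget for 10,000 valid examples at both ends of the 70-85% range
for rate in (0.70, 0.85):
    print(f"pass rate {rate:.0%}: generate {required_generations(10_000, rate):,}")
```

Measuring the pass rate on a small pilot batch first, then plugging it in here, gives a tighter estimate than relying on the typical range.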

Key Takeaways

  • Multi-layer validation improves dataset quality by 30–50% without regeneration.
  • Layer 1 (format): catches JSON/parsing errors (~5–15% failure rate).
  • Layer 2 (schema): catches structural violations (~5–10% failure rate).
  • Layer 3 (constraints): catches business logic violations (~8–20% failure rate).
  • Layer 4 (semantics): catches subtle quality issues (~10–20% failure rate).
  • Plan for 70–85% pass rate through full pipeline; generate accordingly.

Frequently Asked Questions

Should I fix invalid examples or just discard them?

Discard, don't fix. Fixing introduces bias (you're selectively modifying certain outputs, not others). Regenerate instead. If >15% fail, your prompt likely needs refinement—fix the prompt, not the outputs.

How do I know which validation layer is the bottleneck?

Track pass rates per layer in your summary statistics. If Layer 3 (constraints) is the bottleneck, your prompt isn't sufficiently constraining the model. If Layer 4 (semantics) is the bottleneck, the model is generating subtle hallucinations that require semantic understanding to detect.
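One way to read the bottleneck off the summary is to compute each layer's conditional pass rate, meaning examples passed divided by examples that reached that layer. The sketch below assumes a summary dictionary with the same keys as `full_validation_pipeline` returns; the counts are invented for illustration.

```python
def layer_pass_rates(summary: dict) -> dict:
    """Pass rate of each layer relative to the examples that entered it."""
    stages = ["input_count", "format_pass", "schema_pass",
              "constraint_pass", "semantic_pass"]
    rates = {}
    for previous, current in zip(stages, stages[1:]):
        entered = summary[previous]
        rates[current] = summary[current] / entered if entered else 0.0
    return rates

# Illustrative counts, not measured results
stats = {
    "input_count": 1000,
    "format_pass": 920,
    "schema_pass": 870,
    "constraint_pass": 740,
    "semantic_pass": 690,
}
rates = layer_pass_rates(stats)
bottleneck = min(rates, key=rates.get)  # layer with the lowest conditional rate
for layer, rate in rates.items():
    print(f"{layer}: {rate:.1%}")
print(f"Bottleneck: {bottleneck}")
```

Comparing conditional rates rather than raw counts matters because later layers always see fewer examples, so their absolute pass counts are lower even when they reject very little.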

Can I skip human review if my automated validation passes >95%?

For low-stakes applications (internal data, non-critical models), yes. For high-stakes applications (finance, healthcare, safety-critical systems), always do sample-based human review, even with a 95%+ automated pass rate. Humans catch systematic issues that automated checks miss.

What's the computational cost of validation?

Format/schema validation: negligible (microseconds per example). Constraint checks: milliseconds per example. Semantic checks: milliseconds per example. Model-based scoring: seconds per example (requires API call). Keep expensive checks for sampled review, not all examples.

Further Reading