Evaluating Synthetic Data Quality and Fairness
Synthetic data quality is not binary ("good" or "bad"). It is measured across several dimensions, including fidelity (does it match real distributions?), coverage (does it represent the full data space?), and fairness (is it free of spurious demographic correlations?). Without systematic evaluation, you risk training models that reach 95% accuracy on a synthetic validation set but fail catastrophically in production. A 2025 AI Now Institute study found that 68% of organizations lack formal evaluation frameworks for synthetic data, leading to undetected bias and distribution mismatch in production deployments.
Five Core Quality Dimensions
1. Distributional Fidelity (Does Synthetic Data Match Real Data?)
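Objective: Measure how closely synthetic data distributions resemble real data. This is the foundational statistical check; the train-on-synthetic, test-on-real evaluation in dimension 5 remains the final arbiter.
Kolmogorov-Smirnov (KS) Test: Compares two samples of a numeric field using the largest gap between their empirical cumulative distribution functions. The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap at all):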
import numpy as np
from scipy.stats import ks_2samp
from typing import Dict, List, Tuple


def evaluate_distribution_match(
    synthetic_values: List[float],
    real_values: List[float]
) -> Tuple[float, float]:
    """
    Evaluate distributional fidelity using the two-sample KS test.

    Args:
        synthetic_values: List of synthetic numeric values
        real_values: List of real numeric values

    Returns:
        (ks_statistic, p_value)
        - ks_statistic: 0-1, lower is better (0 = identical)
        - p_value: Statistical significance (p < 0.05 = distributions differ significantly)
    """
    ks_stat, p_val = ks_2samp(synthetic_values, real_values)
    return ks_stat, p_val


# Example: Comparing synthetic vs. real customer ticket lengths
# (10 values per sample keeps the example short; the test has little power at
# this size, so use far larger samples in practice)
synthetic_lengths = [45, 87, 62, 120, 95, 78, 105, 69, 110, 88]  # 10 synthetic examples
real_lengths = [48, 92, 58, 115, 98, 75, 108, 72, 112, 85]  # 10 real examples
ks_stat, p_val = evaluate_distribution_match(synthetic_lengths, real_lengths)
print(f"KS Statistic: {ks_stat:.4f}")  # Close to 0 is good
print(f"P-value: {p_val:.4f}")  # > 0.05 means no significant difference was detected
Interpretation:
- KS statistic < 0.1: Excellent fidelity
- KS statistic 0.1–0.2: Good fidelity
- KS statistic 0.2–0.3: Acceptable but with noticeable shift
- KS statistic > 0.3: Poor fidelity, significant distribution mismatch
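The bands above can be wrapped in a small helper so that reports label every field the same way. This is a minimal sketch: the function name `interpret_ks` and the handling of boundary values (exactly 0.1 counts as "good", exactly 0.3 as "acceptable") are assumptions of this example, not part of the original framework.

```python
def interpret_ks(ks_stat: float) -> str:
    """Map a KS statistic to the fidelity bands listed above.

    Boundary handling is an assumption of this sketch: each band
    includes its lower bound, and 0.3 itself is still "acceptable".
    """
    if not 0.0 <= ks_stat <= 1.0:
        raise ValueError("KS statistic must lie in [0, 1]")
    if ks_stat < 0.1:
        return "excellent"
    if ks_stat < 0.2:
        return "good"
    if ks_stat <= 0.3:
        return "acceptable"
    return "poor"


# Label a few illustrative values
for value in (0.05, 0.15, 0.25, 0.40):
    print(f"{value:.2f} -> {interpret_ks(value)}")
```

The bands describe effect size. The p-value answers a different question (whether a difference is detectable at the current sample size), so it is worth reporting both.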
2. Semantic Coherence (Does Data Make Sense?)
Objective: Verify that examples are semantically plausible (not nonsense, contradictions, or hallucinations).
Manual Review: Sample 50–100 examples and check each one for:
- Logical coherence (all fields make sense together)
- Realistic language and structure
- No contradictions within examples
- No fabricated entities (product names, person names that don't exist in domain)
import random
from typing import Dict, List

import numpy as np


def evaluate_semantic_coherence(
    examples: List[Dict],
    review_sample_size: int = 100
) -> Dict:
    """
    Framework for semantic coherence evaluation.
    In production, this involves human review.
    """
    sample = random.sample(examples, min(review_sample_size, len(examples)))

    criteria = {
        "logically_coherent": 0,  # All fields align logically
        "realistic_language": 0,  # Sounds like real data
        "no_contradictions": 0,   # No internal conflicts
        "no_hallucinations": 0,   # No invented entities
        "domain_appropriate": 0   # Matches domain conventions
    }

    print("Manual Semantic Evaluation (sample):")
    for i, ex in enumerate(sample[:5]):
        print(f"\nExample {i+1}:")
        print(f"  {ex}")
        # In production, human reviewers rate each criterion 0/1
        # For now, this is a placeholder structure

    # After human review, compute pass rate per criterion.
    # The counts below are placeholders that assume 50 reviewed examples;
    # replace both the counts and num_reviewed with real reviewer tallies.
    num_reviewed = 50
    criteria["logically_coherent"] = 48  # 48 of 50 passed
    criteria["realistic_language"] = 45
    criteria["no_contradictions"] = 49
    criteria["no_hallucinations"] = 47
    criteria["domain_appropriate"] = 50

    # Divide by the number actually reviewed, not the requested sample size
    avg_pass_rate = np.mean(list(criteria.values())) / num_reviewed

    return {
        "criteria": criteria,
        "avg_pass_rate": avg_pass_rate,
        "assessment": "PASS" if avg_pass_rate > 0.85 else "FAIL"
    }
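Once reviewers have actually rated the sample, the placeholder counts can be replaced by an aggregation step. The sketch below is one way to do that, assuming each reviewed example arrives as a dict of 0/1 ratings keyed by the five criteria; the names `CRITERIA` and `summarize_reviews` are hypothetical. It checks the threshold per criterion rather than on the average, so one weak criterion cannot hide behind four strong ones.

```python
from typing import Dict, List

# The five review criteria used in this section
CRITERIA = (
    "logically_coherent",
    "realistic_language",
    "no_contradictions",
    "no_hallucinations",
    "domain_appropriate",
)


def summarize_reviews(ratings: List[Dict[str, int]], threshold: float = 0.85) -> Dict:
    """Turn per-example 0/1 reviewer ratings into per-criterion pass rates.

    Each rating dict must contain every name in CRITERIA (a missing key
    raises KeyError). A criterion fails if its pass rate is not above
    `threshold`.
    """
    if not ratings:
        raise ValueError("at least one reviewed example is required")
    n = len(ratings)
    pass_rates = {name: sum(r[name] for r in ratings) / n for name in CRITERIA}
    avg = sum(pass_rates.values()) / len(pass_rates)
    failing = [name for name, rate in pass_rates.items() if rate <= threshold]
    return {
        "num_reviewed": n,
        "pass_rates": pass_rates,
        "avg_pass_rate": avg,
        "failing_criteria": failing,
        "assessment": "PASS" if not failing else "FAIL",
    }


# Illustrative ratings (made up): 10 reviewed examples, one hallucination flagged
demo = [{name: 1 for name in CRITERIA} for _ in range(10)]
demo[0]["no_hallucinations"] = 0
print(summarize_reviews(demo)["pass_rates"])
```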
3. Coverage (Does Data Represent All Scenarios?)
Objective: Ensure synthetic data covers edge cases and rare scenarios, not just common ones.
Diversity Clustering: Use k-means clustering on embeddings to measure how well examples span the space:
from typing import Dict, List

import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer


def evaluate_coverage(
    examples: List[Dict],
    text_field: str,
    num_clusters: int = 10
) -> Dict:
    """
    Evaluate coverage using clustering.

    Logic: If examples are diverse, they spread fairly evenly across clusters.
    If examples are homogeneous, most of them pile into 1-2 clusters.

    Note: k-means with a fixed k almost always places at least one example in
    every cluster, so coverage_score is usually 1.0. Read it together with
    cluster_uniformity rather than on its own.
    """
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Get embeddings
    texts = [str(ex.get(text_field, "")) for ex in examples]
    embeddings = model.encode(texts, convert_to_numpy=True)

    # Cluster (k-means needs at least as many examples as clusters)
    num_clusters = min(num_clusters, len(examples))
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
    clusters = kmeans.fit_predict(embeddings)

    # Calculate coverage score: what fraction of clusters have at least 1 example?
    unique_clusters = len(np.unique(clusters))
    coverage_score = unique_clusters / num_clusters

    # Analyze cluster sizes (good coverage = balanced sizes, not 80% in one cluster).
    # minlength keeps any empty cluster in the size vector.
    cluster_sizes = np.bincount(clusters, minlength=num_clusters)
    # 1 = perfectly balanced; the value can drop below 0 for highly skewed sizes
    cluster_uniformity = float(1 - (cluster_sizes.std() / cluster_sizes.mean()))

    return {
        "coverage_score": coverage_score,
        "num_clusters_used": unique_clusters,
        "cluster_uniformity": cluster_uniformity,
        "assessment": "GOOD" if coverage_score > 0.8 else "POOR"
    }
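Because `1 - std/mean` is unbounded below, a balance measure that stays in [0, 1] can be easier to compare across runs. The sketch below swaps in a different technique, the normalized Shannon entropy of cluster sizes, as an alternative to `cluster_uniformity`. The function name `cluster_balance` is hypothetical, and it expects plain Python labels (for example `clusters.tolist()` from the function above).

```python
import math
from collections import Counter
from typing import Sequence


def cluster_balance(labels: Sequence[int], num_clusters: int) -> float:
    """Normalized Shannon entropy of cluster sizes.

    Returns 1.0 when examples are spread evenly over all num_clusters
    clusters and 0.0 when every example falls into a single cluster.
    Empty clusters contribute nothing to the sum, which lowers the score.
    """
    if num_clusters < 2:
        raise ValueError("num_clusters must be at least 2")
    if len(labels) == 0:
        raise ValueError("labels must not be empty")
    total = len(labels)
    counts = Counter(labels)
    # Sum of p * log(1/p) over the non-empty clusters
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return entropy / math.log(num_clusters)


# Illustrative label vectors (made up)
print(cluster_balance([0, 1, 2, 3], num_clusters=4))      # evenly spread
print(cluster_balance([0] * 8 + [1, 2], num_clusters=3))  # one dominant cluster
```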
4. Fairness and Bias Detection
Objective: Verify that synthetic data doesn't encode spurious demographic correlations or group disparities.
Demographic Parity Check: Compare representation across subgroups:
from collections import Counter, defaultdict
from typing import Dict, List


def evaluate_fairness(
    examples: List[Dict],
    protected_attribute: str,  # e.g., "customer_gender"
    target_attribute: str  # e.g., "issue_severity"
) -> Dict:
    """
    Check for disparate impact: Does the distribution of target_attribute
    differ across protected_attribute groups?
    Example: Is "High severity" complaint rate equal across genders?
    """
    # Group by protected attribute
    groups = defaultdict(list)
    for ex in examples:
        groups[ex.get(protected_attribute)].append(ex)

    # For each group, compute distribution of target attribute
    target_distributions = {}
    for group_value, group_examples in groups.items():
        target_values = [ex.get(target_attribute) for ex in group_examples]
        # Count occurrences
        target_distributions[group_value] = Counter(target_values)

    # Calculate demographic parity difference
    # (max percentage-point difference between any two groups)
    # Example: If "High" severity is 25% of female examples but 30% of male examples
    # Difference = 5 percentage points
    all_targets = set()
    for dist in target_distributions.values():
        all_targets.update(dist.keys())

    max_disparity = 0
    disparities = {}
    for target_val in all_targets:
        percentages = []
        for group_value, dist in target_distributions.items():
            total = sum(dist.values())
            pct = 100 * dist.get(target_val, 0) / max(total, 1)
            percentages.append(pct)
        disparity = max(percentages) - min(percentages)
        disparities[target_val] = disparity
        max_disparity = max(max_disparity, disparity)

    return {
        "disparities_by_target": disparities,
        "max_disparity_pct": max_disparity,
        # A gap below 5 percentage points is treated as fair
        "assessment": "FAIR" if max_disparity < 5 else "BIASED"
    }


# Example:
# examples = [...]  # Your synthetic dataset
# fairness_eval = evaluate_fairness(examples, "customer_gender", "issue_severity")
# if fairness_eval["assessment"] == "BIASED":
#     print(f"Warning: {fairness_eval['max_disparity_pct']:.1f} percentage-point disparity detected")
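The 25% versus 30% case mentioned in the code comments can be reproduced on a toy dataset. This is a minimal sketch with fabricated illustration data; `parity_gap` is a hypothetical helper that computes the gap for a single target value only.

```python
from collections import Counter
from typing import Dict, List


def parity_gap(examples: List[Dict], protected: str, target: str, value: str) -> float:
    """Largest gap, in percentage points, in the rate of `value` across groups."""
    if not examples:
        raise ValueError("examples must not be empty")
    totals = Counter()  # examples per group
    hits = Counter()    # examples per group whose target equals `value`
    for ex in examples:
        group = ex.get(protected)
        totals[group] += 1
        if ex.get(target) == value:
            hits[group] += 1
    rates = [100 * hits[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


# Toy data, fabricated for illustration: "High" is 5 of 20 (25%) in one group
# and 6 of 20 (30%) in the other
toy = (
    [{"customer_gender": "F", "issue_severity": "High"}] * 5
    + [{"customer_gender": "F", "issue_severity": "Low"}] * 15
    + [{"customer_gender": "M", "issue_severity": "High"}] * 6
    + [{"customer_gender": "M", "issue_severity": "Low"}] * 14
)
gap = parity_gap(toy, "customer_gender", "issue_severity", "High")
print(f"Gap: {gap:.1f} percentage points")  # Gap: 5.0 percentage points
```

With a strict `< 5` threshold, a gap of exactly 5 points is flagged, so it helps to decide up front whether the boundary itself counts as fair.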
5. Generalization on Real Data (The True Test)
Objective: Train a model on synthetic data and measure accuracy on real held-out data.
from typing import Dict, List


def evaluate_generalization(
    synthetic_dataset: List[Dict],
    real_test_set: List[Dict],
    model_type: str = "classifier"
) -> Dict:
    """
    Train on synthetic, test on real. This is the ground truth evaluation.
    High real-data accuracy = high-quality synthetic data
    Low real-data accuracy = distribution mismatch or missing scenarios
    """
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    # Example: Classify issues by severity from description
    # (You'd adapt this to your actual task)

    # Prepare synthetic training data
    X_synthetic = [ex.get("description", "") for ex in synthetic_dataset]
    y_synthetic = [ex.get("severity", "") for ex in synthetic_dataset]

    # Prepare real test data
    X_real = [ex.get("description", "") for ex in real_test_set]
    y_real = [ex.get("severity", "") for ex in real_test_set]

    # Vectorize (fit the vocabulary on synthetic text only, then reuse it on real text)
    vectorizer = TfidfVectorizer(max_features=1000)
    X_synthetic_vec = vectorizer.fit_transform(X_synthetic)
    X_real_vec = vectorizer.transform(X_real)

    # Train on synthetic
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_synthetic_vec, y_synthetic)

    # Test on real
    y_pred = clf.predict(X_real_vec)
    accuracy = accuracy_score(y_real, y_pred)
    # zero_division=0 avoids warnings when a class is never predicted
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_real, y_pred, average="weighted", zero_division=0
    )

    return {
        "model_type": model_type,
        "train_set": "synthetic",
        "test_set": "real",
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "assessment": "GOOD" if accuracy > 0.80 else "POOR"
    }
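Overall accuracy can hide a collapse on rare classes, which is exactly where synthetic data tends to be thin. A per-class recall breakdown on the real test set makes that visible. The helper below is a sketch using only the standard library; the name `per_class_recall` and the toy labels are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Recall for each class that appears in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of real examples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / support[label] for label in support}


# Toy labels (made up): 90% overall accuracy, but half of the rare class is missed
y_true = ["low"] * 8 + ["high"] * 2
y_pred = ["low"] * 8 + ["low", "high"]
print(per_class_recall(y_true, y_pred))  # {'low': 1.0, 'high': 0.5}
```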
Complete Evaluation Framework
import json
from datetime import datetime
from typing import Dict, List


def evaluate_synthetic_dataset_comprehensive(
    synthetic_examples: List[Dict],
    real_examples: List[Dict],
    config: Dict
) -> Dict:
    """
    Run all evaluations and produce a comprehensive report.

    Note: config is currently unused; the field names and thresholds
    below are hard-coded.
    """
    results = {
        "timestamp": datetime.now().isoformat(),
        "synthetic_dataset_size": len(synthetic_examples),
        "real_dataset_size": len(real_examples),
        "evaluations": {}
    }

    # 1. Distribution matching (numeric field: description length)
    # Skipped when no real data is available, since the KS test needs two samples
    if real_examples:
        synthetic_lengths = [len(ex.get("description", "")) for ex in synthetic_examples]
        real_lengths = [len(ex.get("description", "")) for ex in real_examples]
        ks_stat, p_val = evaluate_distribution_match(synthetic_lengths, real_lengths)
        results["evaluations"]["distribution_fidelity"] = {
            "ks_statistic": ks_stat,
            "p_value": p_val,
            "assessment": "PASS" if ks_stat < 0.2 else "FAIL"
        }

    # 2. Semantic coherence
    semantic = evaluate_semantic_coherence(synthetic_examples, review_sample_size=50)
    results["evaluations"]["semantic_coherence"] = semantic

    # 3. Coverage
    coverage = evaluate_coverage(synthetic_examples, text_field="description")
    results["evaluations"]["coverage"] = coverage

    # 4. Fairness
    fairness = evaluate_fairness(synthetic_examples, "category", "severity")
    results["evaluations"]["fairness"] = fairness

    # 5. Generalization (if real test data is available)
    if real_examples:
        generalization = evaluate_generalization(synthetic_examples, real_examples)
        results["evaluations"]["generalization_on_real"] = generalization

    # Overall assessment: every evaluation must report a passing label.
    # "FAIR" is the passing label returned by evaluate_fairness.
    passing_labels = {"PASS", "GOOD", "FAIR"}
    assessments = [
        results["evaluations"].get(k, {}).get("assessment")
        for k in results["evaluations"]
    ]
    all_pass = all(a in passing_labels for a in assessments if a)
    results["overall_assessment"] = "READY FOR PRODUCTION" if all_pass else "NEEDS IMPROVEMENT"

    return results


# Usage:
# report = evaluate_synthetic_dataset_comprehensive(synthetic_data, real_data, config)
# print(json.dumps(report, indent=2))
Evaluation Checklist
Use this checklist before deploying synthetic data:
DISTRIBUTION FIDELITY
☐ KS test on 5+ numeric fields: all < 0.15?
☐ Chi-square test on categorical fields: all p > 0.05?
☐ Quantile plots visually align (synthetic vs. real)?
SEMANTIC COHERENCE
☐ Manual review of 50 random examples
☐ No hallucinated entities (names, product codes)
☐ No template/placeholder language
☐ No contradictions within examples
COVERAGE
☐ Clustering shows >80% cluster utilization
☐ Cluster sizes balanced (no single dominant cluster)
☐ Edge cases present: extreme values, empty fields, rare categories
FAIRNESS
☐ Demographic parity difference < 5 percentage points across groups
☐ No spurious correlations (e.g., gender vs. issue type)
☐ Protected attributes well-represented
GENERALIZATION
☐ Model trained on synthetic achieves >85% accuracy on real test data
☐ No massive accuracy drop on rare real-world scenarios
☐ Confusion matrix balanced across classes
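The chi-square item in the checklist has no counterpart in the code earlier in this section. The sketch below computes the Pearson chi-square statistic for one categorical field using only the standard library; the function name `chi_square_homogeneity` is hypothetical. Turning the statistic into a p-value needs the chi-square distribution, for example `scipy.stats.chi2.sf(stat, dof)`.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def chi_square_homogeneity(
    synthetic: Sequence[Hashable],
    real: Sequence[Hashable],
) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table.

    Rows are the two datasets; columns are the categories seen in either.
    No continuity correction is applied.
    """
    if len(synthetic) == 0 or len(real) == 0:
        raise ValueError("both samples must be non-empty")
    syn_counts, real_counts = Counter(synthetic), Counter(real)
    categories = set(syn_counts) | set(real_counts)
    n_syn, n_real = len(synthetic), len(real)
    total = n_syn + n_real

    stat = 0.0
    for cat in categories:
        col_total = syn_counts[cat] + real_counts[cat]
        for observed, row_total in ((syn_counts[cat], n_syn), (real_counts[cat], n_real)):
            # Expected count if both datasets shared one category distribution
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    dof = len(categories) - 1
    return stat, dof


# Toy category labels (made up): a 60/40 split against a 40/60 split
syn = ["billing"] * 60 + ["login"] * 40
real = ["billing"] * 40 + ["login"] * 60
stat, dof = chi_square_homogeneity(syn, real)
print(f"chi2 = {stat:.2f}, dof = {dof}")  # chi2 = 8.00, dof = 1
```

The approximation is usually considered unreliable when expected counts are small (a common rule of thumb is at least 5 per cell), so rare categories may need to be merged first.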
Key Takeaways
- Synthetic data quality is multidimensional: fidelity, coherence, coverage, fairness, and real-world generalization.
- The KS test quantifies distribution mismatch; aim for a KS statistic below 0.2 (the pre-deployment checklist applies a stricter 0.15 bar).
- Manual semantic review of 50–100 examples catches hallucinations and logical errors.
- Clustering-based coverage metrics identify homogeneity.
- Fairness audits prevent demographic disparities.
- Real-data generalization is the ultimate evaluation: does training on synthetic yield good real-world models?
Frequently Asked Questions
What's an acceptable pass rate for the manual review?
Aim for > 85% pass rate on each semantic criterion. If critical issues (hallucinations, contradictions) exceed 15%, regenerate with refined prompts rather than filtering.
How much real data do I need for generalization testing?
500–1000 examples per class minimum. More is better; 5% of your final training set is ideal.
Should I evaluate fairness if my domain isn't protected by law?
Yes. Fairness failures harm model robustness and increase deployment risk. Even if not legally required, audit for demographic disparities.
Can I automate semantic coherence evaluation without humans?
Partially. Use summarization models to detect incoherence, or fine-tune a classifier on human-labeled examples. But manual review of a sample is irreplaceable for catching subtle issues.