
Deduplication Strategies for Synthetic Data

Language models sometimes generate very similar examples, especially for common scenarios. A customer support dataset might contain 50 near-identical "login not working" tickets with minor paraphrasing—redundancy that wastes storage and biases model training toward common patterns. Deduplication removes these semantic near-duplicates, improving dataset diversity by 15–30% without additional generation. A 2025 benchmark by Meta AI found that deduplicated synthetic datasets improve model generalization by 3–7 percentage points on held-out real data.

Why Exact Deduplication Isn't Enough

Exact string matching (if example1 == example2) catches only identical outputs. Language models rarely generate byte-for-byte duplicates, but they do generate semantically equivalent examples with minor variations:

  • "Login failed: invalid credentials" vs. "Invalid login: bad credentials"
  • "Page takes 30 seconds to load" vs. "Site is slow, takes ~30 sec"
  • "Button is missing from dashboard" vs. "Can't find button on dashboard"

Exact matching misses these. You need semantic deduplication, which identifies examples that convey the same information despite different wording.
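To make the gap concrete, here is a minimal sketch of exact deduplication with light normalization (lowercasing and punctuation stripping). The helper names `normalize` and `exact_dedup` are illustrative and not part of any library. Normalization catches trivial formatting variants, but a paraphrase still passes through as a distinct example.

```python
import hashlib
import re
from typing import List

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_dedup(texts: List[str]) -> List[str]:
    """Keep the first occurrence of each normalized text."""
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

samples = [
    "Login failed: invalid credentials",
    "login failed - invalid credentials!",  # identical after normalization
    "Invalid login: bad credentials",       # paraphrase with different tokens
]
result = exact_dedup(samples)
# The formatting variant is dropped, but the paraphrase survives.
```

Only the second sample is removed here; the third conveys the same information but hashes differently, which is the case semantic deduplication is meant to handle.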

Embedding-Based Deduplication

Convert each example to a vector representation, compare vectors by cosine similarity, and keep only one example from each group of near-duplicates:

import numpy as np
from typing import Dict, List, Set, Tuple
from sentence_transformers import SentenceTransformer

def get_embeddings(
    texts: List[str],
    model_name: str = "all-MiniLM-L6-v2",
    batch_size: int = 100
) -> np.ndarray:
    """
    Get embedding vectors for a batch of texts.

    The Claude API does not offer a native embedding endpoint, so this uses
    a dedicated embedding model from Sentence Transformers. Hosted embedding
    APIs (e.g., OpenAI) or other Hugging Face models are alternatives.

    Returns: numpy array of shape (len(texts), embedding_dim)
    """
    model = SentenceTransformer(model_name)
    # encode() batches internally; batch_size bounds memory use per forward pass
    return model.encode(texts, batch_size=batch_size, convert_to_numpy=True)


def deduplicate_by_embedding(
    examples: List[Dict],
    text_field: str,
    similarity_threshold: float = 0.85,
    batch_size: int = 100
) -> Tuple[List[Dict], Dict]:
    """
    Deduplicate examples based on semantic embedding similarity.

    Args:
        examples: List of example dicts
        text_field: Name of text field to embed (e.g., 'description')
        similarity_threshold: Cosine similarity threshold (0-1); a higher
            value removes fewer examples (only closer matches count as duplicates)
        batch_size: Batch size passed to the embedding model

    Returns:
        deduplicated_examples: List after deduplication
        stats: Deduplication statistics
    """
    if not examples:
        return [], {"input": 0, "output": 0, "removed": 0, "removal_rate": "0.0%"}

    # Extract texts
    texts = [ex.get(text_field, "") for ex in examples]

    # Embed all texts with a lightweight Sentence Transformers model
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('all-MiniLM-L6-v2')  # Fast, lightweight
    embeddings = model.encode(texts, batch_size=batch_size, convert_to_numpy=True)

    # Full pairwise cosine similarity matrix (O(n^2) memory; fine for small datasets)
    from sklearn.metrics.pairwise import cosine_similarity
    similarity_matrix = cosine_similarity(embeddings)

    # Greedy deduplication: keep the first example seen, then drop every later
    # example that is too similar to it. The result depends on input order.
    kept_indices = []
    removed_indices = set()

    for i in range(len(examples)):
        if i in removed_indices:
            continue

        kept_indices.append(i)

        # Find all later examples too similar to this one
        for j in range(i + 1, len(examples)):
            if j not in removed_indices and similarity_matrix[i, j] > similarity_threshold:
                removed_indices.add(j)

    deduplicated = [examples[i] for i in kept_indices]

    stats = {
        "input": len(examples),
        "output": len(deduplicated),
        "removed": len(removed_indices),
        "removal_rate": f"{100 * len(removed_indices) / len(examples):.1f}%"
    }

    return deduplicated, stats


# Usage example:
# tickets = [...] # List of generated tickets
# deduplicated_tickets, stats = deduplicate_by_embedding(
# tickets,
# text_field='description',
# similarity_threshold=0.87
# )
# print(f"Removed {stats['removed']} near-duplicates ({stats['removal_rate']})")

Similarity threshold tuning:

  • 0.95–1.0: Only removes near-identical examples (very conservative)
  • 0.85–0.90: Removes paraphrased near-duplicates (recommended for most use cases)
  • 0.70–0.85: Removes semantically similar examples even if worded differently (aggressive, may over-deduplicate)

Typical deduplication rates: 5–20% of examples removed depending on threshold.
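One way to choose a threshold is to sweep several candidates over a sample and compare how many examples each would remove. The sketch below assumes embeddings have already been computed and uses tiny hand-written 2-D vectors as stand-ins for real embeddings. `greedy_removed` and `sweep_thresholds` are hypothetical helper names, and the greedy rule mirrors the keep-first strategy described above.

```python
import math
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_removed(vectors: List[Sequence[float]], threshold: float) -> int:
    """Count examples dropped by keep-first greedy deduplication."""
    kept: List[Sequence[float]] = []
    removed = 0
    for vec in vectors:
        if any(cosine(vec, k) > threshold for k in kept):
            removed += 1
        else:
            kept.append(vec)
    return removed

def sweep_thresholds(
    vectors: List[Sequence[float]], thresholds: List[float]
) -> Dict[float, int]:
    """Map each candidate threshold to the number of removed examples."""
    return {t: greedy_removed(vectors, t) for t in thresholds}

# Toy vectors standing in for embeddings (illustrative values only)
toy_vectors = [[1.0, 0.0], [0.99, 0.14], [0.8, 0.6], [0.0, 1.0]]
report = sweep_thresholds(toy_vectors, [0.95, 0.85, 0.70])
# report == {0.95: 1, 0.85: 1, 0.70: 2}
```

On real data, inspecting a few removed pairs at each threshold is usually more informative than the counts alone.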

Fuzzy String Matching for Structured Fields

For structured data (customer names, product IDs, addresses), use fuzzy matching:

import numpy as np
from fuzzywuzzy import fuzz  # thefuzz is the renamed successor of this package
from typing import Dict, List, Set, Tuple

def fuzzy_match_deduplication(
    examples: List[Dict],
    matching_fields: List[str],
    similarity_threshold: int = 90
) -> Tuple[List[Dict], Dict]:
    """
    Deduplicate based on fuzzy string matching on specific fields.
    Useful for structured data with typos or slight variations.

    Args:
        examples: List of examples
        matching_fields: Fields to consider for matching (e.g., ['customer_name', 'issue_type'])
        similarity_threshold: Fuzz threshold (0-100); a higher value removes
            fewer examples

    Returns:
        deduplicated_examples, statistics
    """
    if not matching_fields:
        raise ValueError("matching_fields must contain at least one field")

    kept_indices = []
    removed_indices: Set[int] = set()

    for i in range(len(examples)):
        if i in removed_indices:
            continue

        kept_indices.append(i)

        # Compare to all subsequent examples
        for j in range(i + 1, len(examples)):
            if j in removed_indices:
                continue

            # Compute fuzzy similarity for each matching field
            field_similarities = []
            for field in matching_fields:
                val_i = str(examples[i].get(field, ""))
                val_j = str(examples[j].get(field, ""))

                # token_sort_ratio ignores word order ("Jane Smith" vs "Smith Jane")
                similarity = fuzz.token_sort_ratio(val_i, val_j)
                field_similarities.append(similarity)

            # Average similarity across fields
            avg_similarity = np.mean(field_similarities)

            if avg_similarity >= similarity_threshold:
                removed_indices.add(j)

    deduplicated = [examples[i] for i in kept_indices]

    return deduplicated, {
        "input": len(examples),
        "output": len(deduplicated),
        "removed": len(removed_indices)
    }

# Usage:
# tickets = [...]
# deduplicated, stats = fuzzy_match_deduplication(
# tickets,
# matching_fields=['customer_name', 'issue_type'],
# similarity_threshold=88
# )
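To show why a token-sorted comparison suits fields such as names, the sketch below approximates the idea with the standard library's `difflib`. It is an illustration of the technique, not the library's implementation. `token_sort_similarity` and `plain_similarity` are hypothetical names, and scores may differ slightly from those returned by `fuzz.token_sort_ratio`.

```python
from difflib import SequenceMatcher

def plain_similarity(a: str, b: str) -> int:
    """Character-level similarity on a 0-100 scale, sensitive to word order."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def token_sort_similarity(a: str, b: str) -> int:
    """Sort the tokens of each string before comparing, so word order is ignored."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())

# Reordered tokens: identical once sorted
reordered = token_sort_similarity("Jane Smith", "Smith Jane")
unsorted = plain_similarity("Jane Smith", "Smith Jane")

# A one-letter typo still scores high
typo = token_sort_similarity("Jon Smith", "John Smith")
```

Here `reordered` reaches the maximum score while `unsorted` is penalized for the swapped word order, and the typo pair stays well above a threshold of 90.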

Contextual Deduplication for Multi-Field Examples

For complex examples with multiple fields (tickets with description + metadata), use weighted field matching:

def weighted_deduplication(
    examples: List[Dict],
    field_weights: Dict[str, float],
    similarity_threshold: float = 0.85
) -> Tuple[List[Dict], Dict]:
    """
    Deduplicate considering weighted importance of different fields.

    Args:
        examples: List of examples
        field_weights: Dict like {'description': 0.6, 'severity': 0.3, 'category': 0.1};
            weights should sum to 1 so the combined score stays in the 0-1 range
        similarity_threshold: Weighted similarity threshold (0-1)

    Returns:
        deduplicated_examples, statistics
    """
    from sentence_transformers import SentenceTransformer

    if not examples:
        return [], {"input": 0, "output": 0, "removed": 0}

    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Encode each long-text field once up front. Calling model.encode()
    # inside the pair loop would re-run the model O(n^2) times.
    texts = {f: [str(ex.get(f, "")) for ex in examples] for f in field_weights}
    embeddings = {
        f: model.encode(vals, convert_to_numpy=True, normalize_embeddings=True)
        for f, vals in texts.items()
        if any(len(v) >= 30 for v in vals)
    }

    kept_indices = []
    removed_indices: Set[int] = set()

    for i in range(len(examples)):
        if i in removed_indices:
            continue

        kept_indices.append(i)

        for j in range(i + 1, len(examples)):
            if j in removed_indices:
                continue

            weighted_similarity = 0.0

            for field, weight in field_weights.items():
                text_i = texts[field][i]
                text_j = texts[field][j]

                # For short fields (category, severity), use exact match
                if len(text_i) < 30:
                    field_sim = 1.0 if text_i.lower() == text_j.lower() else 0.0
                # For long fields, use embedding similarity
                else:
                    # Embeddings are unit-normalized, so the dot product is the cosine
                    field_sim = float(np.dot(embeddings[field][i], embeddings[field][j]))

                weighted_similarity += weight * field_sim

            if weighted_similarity >= similarity_threshold:
                removed_indices.add(j)

    deduplicated = [examples[i] for i in kept_indices]

    return deduplicated, {
        "input": len(examples),
        "output": len(deduplicated),
        "removed": len(removed_indices)
    }

# Usage:
# field_weights = {
# 'description': 0.5, # Most important
# 'severity': 0.25,
# 'category': 0.15,
# 'created_at': 0.1
# }
# deduplicated, stats = weighted_deduplication(tickets, field_weights)

Deduplication at Scale

For datasets with millions of examples, full pairwise comparison is expensive. Use approximation:

def scalable_deduplication(
    examples: List[Dict],
    text_field: str,
    num_hashes: int = 128,
    threshold: float = 0.5
) -> Tuple[List[Dict], Dict]:
    """
    Use locality-sensitive hashing (LSH) for fast deduplication at scale.
    Approximate but much faster than full pairwise comparison.

    Note: MinHash estimates Jaccard similarity of token sets, so it catches
    lexical overlap rather than paraphrases with different wording.

    Args:
        examples: List of examples (millions ok)
        text_field: Field to deduplicate on
        num_hashes: Number of MinHash permutations; more is slower but more accurate
        threshold: Jaccard similarity threshold (0-1) for treating two examples
            as duplicates

    Returns:
        deduplicated_examples, statistics
    """
    from datasketch import MinHash, MinHashLSH

    # Create LSH index; datasketch picks the band layout from threshold and num_perm
    lsh = MinHashLSH(threshold=threshold, num_perm=num_hashes)

    # Create MinHash signatures for each example
    minhashes = {}
    for i, example in enumerate(examples):
        text = str(example.get(text_field, ""))

        # Create MinHash from the set of whitespace-separated tokens
        m = MinHash(num_perm=num_hashes)
        for token in set(text.split()):
            m.update(token.encode('utf8'))

        minhashes[i] = m
        lsh.insert(str(i), m)

    # Find duplicates
    duplicates: Set[int] = set()
    for i in range(len(examples)):
        if i in duplicates:
            continue

        # Query for candidate near-duplicates, then confirm with the
        # MinHash Jaccard estimate to filter out LSH false positives
        for similar_id in lsh.query(minhashes[i]):
            j = int(similar_id)
            if j > i and minhashes[i].jaccard(minhashes[j]) >= threshold:
                duplicates.add(j)

    kept_indices = [i for i in range(len(examples)) if i not in duplicates]
    deduplicated = [examples[i] for i in kept_indices]

    return deduplicated, {
        "input": len(examples),
        "output": len(deduplicated),
        "removed": len(duplicates)
    }

# Usage for 1 million examples:
# large_dataset = [...]
# deduplicated, stats = scalable_deduplication(large_dataset, 'description')
# print(f"Processed {stats['input']} examples, removed {stats['removed']} duplicates")
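For intuition about what the LSH index approximates, the following standard-library sketch builds MinHash signatures by hand and compares the estimate with the exact Jaccard similarity. It assumes non-empty token sets and uses salted SHA-1 hashes as stand-ins for random permutations. The function names are illustrative and do not belong to `datasketch`.

```python
import hashlib
from typing import List, Set

def minhash_signature(tokens: Set[str], num_perm: int = 64) -> List[int]:
    """For each salted hash function, record the minimum hash over all tokens."""
    signature = []
    for seed in range(num_perm):
        salt = str(seed).encode("utf-8") + b":"
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + tok.encode("utf-8")).digest()[:8], "big")
            for tok in tokens
        ))
    return signature

def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def exact_jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b)

tokens_a = set("page takes 30 seconds to load".split())
tokens_b = set("page takes 30 seconds to load fully".split())
tokens_c = set("button missing from dashboard".split())

near_estimate = estimate_jaccard(minhash_signature(tokens_a), minhash_signature(tokens_b))
far_estimate = estimate_jaccard(minhash_signature(tokens_a), minhash_signature(tokens_c))
near_exact = exact_jaccard(tokens_a, tokens_b)  # 6 shared tokens out of 7
```

The estimate tightens as `num_perm` grows, which is why a very small number of hash functions tends to give noisy duplicate detection.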

Integration into Pipeline

Add deduplication as a post-validation step:

def full_dedup_pipeline(
    raw_examples: List[str],
    schema: Dict,
    dedup_threshold: float = 0.87
) -> Tuple[List[Dict], Dict]:
    """
    Full pipeline: validate → deduplicate → return final dataset
    """
    # Step 1: Validate (as in article 5)
    validated, val_stats = full_validation_pipeline(raw_examples, schema)

    # Step 2: Deduplicate
    deduplicated, dedup_stats = deduplicate_by_embedding(
        validated,
        text_field='description',
        similarity_threshold=dedup_threshold
    )

    # Merge statistics; on a key collision the dedup_stats value wins
    combined_stats = {
        **val_stats,
        **dedup_stats,
        "dedup_rate": dedup_stats.get("removal_rate", "N/A"),
        "final_count": len(deduplicated)
    }

    return deduplicated, combined_stats

# Example:
# final_dataset, summary = full_dedup_pipeline(raw_outputs, schema)
# print(f"Validation pass rate: {summary['pass_rate']:.1%}")
# print(f"Dedup removal rate: {summary['dedup_rate']}")
# print(f"Final dataset size: {summary['final_count']}")

Key Takeaways

  • Semantic deduplication removes 5–20% of examples, improving diversity without regeneration.
  • Embedding-based deduplication (cosine similarity) works best for text; threshold 0.85–0.90 is typical.
  • Fuzzy matching handles structured fields with typos (names, IDs).
  • Weighted deduplication balances importance of different fields in multi-field examples.
  • For millions of examples, use locality-sensitive hashing for speed.

Frequently Asked Questions

What similarity threshold should I use?

Start with 0.87 (reasonable default). If your validation shows excessive diversity loss, increase to 0.90+. If you suspect many near-duplicates slip through, decrease to 0.83–0.85. Tune empirically on a sample.

Should I deduplicate before or after validation?

After validation. Deduplicate only on validated examples; don't waste effort deduplicating invalid data that you'll discard anyway.

Can I combine embedding and fuzzy matching?

Yes. Use fuzzy matching for structured fields (category, product name) and embedding similarity for text fields, then combine the results: mark two examples as duplicates only if their categories match and their text similarity is high.
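A minimal sketch of that combination follows. To stay self-contained it uses `difflib` character similarity as a stand-in for both the fuzzy matcher and the embedding model; in practice you would pass an embedding-based function as `text_sim`. The field names `category` and `description`, the thresholds, and the function names are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

def string_similarity(a: str, b: str) -> float:
    """Case-insensitive character similarity in [0, 1] (stand-in scorer)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(
    ex_a: Dict,
    ex_b: Dict,
    text_sim: Callable[[str, str], float] = string_similarity,
    category_threshold: float = 0.9,
    text_threshold: float = 0.87,
) -> bool:
    """Duplicate only if the structured field matches AND the text is similar."""
    cat_a = str(ex_a.get("category", ""))
    cat_b = str(ex_b.get("category", ""))
    # Cheap structured-field check first; skip the text comparison on mismatch
    if string_similarity(cat_a, cat_b) < category_threshold:
        return False
    desc_a = str(ex_a.get("description", ""))
    desc_b = str(ex_b.get("description", ""))
    return text_sim(desc_a, desc_b) >= text_threshold

ticket_a = {"category": "Billing", "description": "Customer was charged twice for the same order"}
ticket_b = {"category": "billing", "description": "Customer was charged twice for the same order."}
ticket_c = {"category": "Login", "description": "Customer was charged twice for the same order"}

same_category = is_duplicate(ticket_a, ticket_b)       # categories match, text nearly identical
different_category = is_duplicate(ticket_a, ticket_c)  # category mismatch blocks the match
```

Checking the structured field first also avoids computing text similarity for pairs that could never be duplicates.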

What embedding model should I use?

For production: all-MiniLM-L6-v2 (fast, 384 dimensions) or all-mpnet-base-v2 (slower, higher quality). Both are free and lightweight. For state-of-the-art quality, use OpenAI's text-embedding-3-large, but this adds API cost.

Further Reading