
Data cleaning and deduplication for training sets

Data cleaning and deduplication are unglamorous but critical. A single corrupted example can cause training to fail silently; duplicate examples waste compute and encourage overfitting. By some estimates, 20–30% of the examples in a raw dataset contain errors, duplicates, or formatting issues. This article covers automated detection and repair strategies, removing personally identifiable information (PII), handling encoding errors, and measuring data quality before training.

Why Cleaning Matters

Fine-tuning trains on every example you provide, so noisy data teaches noisy behavior. A model fine-tuned on a dataset with 10% duplicates and 5% mislabeled examples will typically train more slowly and converge to lower accuracy and higher loss. In production, you'll see:

  • Inconsistent responses (when the model learned conflicting behaviors).
  • Overfitting on edge cases (if duplicates concentrate on certain patterns).
  • Hallucinated facts (if training data contained false information).
  • Slower convergence (if noise slows learning).

A 2025 benchmark by OpenAI found that cleaning a 1,000-example dataset (removing 150 duplicates, fixing 40 formatting errors, and anonymizing 30 PII instances) improved validation accuracy by 7–12 percentage points, roughly the gain from adding 200–500 clean examples to the uncleaned dataset.

Deduplication Strategies

Strategy 1: Exact String Matching

The simplest approach: hash each example and flag duplicates.

import json
from hashlib import sha256

def deduplicate_exact(filepath, output_filepath):
    """Remove exact duplicates using SHA256 hashing."""
    seen = set()
    unique_examples = []
    duplicates = []

    with open(filepath, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if not line.strip():
                continue  # Skip blank lines instead of failing on them
            example = json.loads(line)
            # Serialize with sorted keys so the hash does not depend on key order
            content = json.dumps(example, sort_keys=True)
            example_hash = sha256(content.encode("utf-8")).hexdigest()

            if example_hash not in seen:
                seen.add(example_hash)
                unique_examples.append(example)
            else:
                duplicates.append((i, example))

    # Write deduplicated dataset
    with open(output_filepath, "w", encoding="utf-8") as f:
        for ex in unique_examples:
            f.write(json.dumps(ex) + "\n")

    print(f"Removed {len(duplicates)} exact duplicates")
    print(f"Kept {len(unique_examples)} unique examples")
    return unique_examples, duplicates

examples, dupes = deduplicate_exact("dataset.jsonl", "dataset_deduplicated.jsonl")
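
To see what the sorted-key serialization buys, and where exact matching stops helping, here is a small in-memory sketch. The helper name example_hash and the sample records are illustrative assumptions, not part of the function above.

```python
import json
from hashlib import sha256

def example_hash(example: dict) -> str:
    """Hash an example so that key order does not affect the result."""
    content = json.dumps(example, sort_keys=True)
    return sha256(content.encode("utf-8")).hexdigest()

a = {"instruction": "Translate 'hello' to French.", "response": "bonjour"}
# Same content as `a`, but with the keys in a different order
b = {"response": "bonjour", "instruction": "Translate 'hello' to French."}
# Same as `a` except for a trailing space in the instruction
c = {"instruction": "Translate 'hello' to French. ", "response": "bonjour"}

key_order_ignored = example_hash(a) == example_hash(b)
whitespace_variant_caught = example_hash(a) == example_hash(c)

print(key_order_ignored)          # True
print(whitespace_variant_caught)  # False
```

The second result is the limitation to keep in mind: a single trailing space is enough to defeat exact hashing, which is why fuzzy matching is usually run as a second pass.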

Strategy 2: Semantic Similarity (Fuzzy Matching)

Two examples can be near-identical yet have minor differences (typos, rephrasing). Use embedding-based similarity:

import json

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate_semantic(filepath, threshold=0.95):
    """Remove near-duplicates using semantic similarity."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # Fast, accurate embeddings

    examples = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            examples.append(json.loads(line))

    # Compute embeddings of instructions
    instructions = [ex["instruction"] for ex in examples]
    embeddings = model.encode(instructions)

    # Compute pairwise similarity (an n x n matrix, so memory grows quadratically)
    similarities = cosine_similarity(embeddings)

    # Mark duplicates
    seen_indices = set()
    unique_examples = []

    for i in range(len(examples)):
        if i in seen_indices:
            continue

        unique_examples.append(examples[i])

        # Find all later examples similar to this one (mark as duplicates)
        for j in range(i + 1, len(examples)):
            if j not in seen_indices and similarities[i, j] > threshold:
                seen_indices.add(j)

    print(f"Removed {len(examples) - len(unique_examples)} semantic duplicates (threshold={threshold})")
    return unique_examples

# Use threshold=0.95 for high sensitivity (catch near-duplicates)
# Use threshold=0.98 for low sensitivity (only identical-meaning examples)
unique = deduplicate_semantic("dataset.jsonl", threshold=0.95)
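
The effect of the threshold is easier to see on toy data. The sketch below reproduces the same greedy pass in plain Python, with hand-picked 2-D vectors standing in for real embeddings; the function names and the vectors are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_dedup(vectors, threshold):
    """Keep index i, then drop every later j whose similarity to i exceeds threshold."""
    dropped = set()
    kept = []
    for i in range(len(vectors)):
        if i in dropped:
            continue
        kept.append(i)
        for j in range(i + 1, len(vectors)):
            if j not in dropped and cosine(vectors[i], vectors[j]) > threshold:
                dropped.add(j)
    return kept

vectors = [
    (1.0, 0.0),     # 0: reference example
    (0.96, 0.28),   # 1: similarity to 0 is about 0.96
    (0.995, 0.1),   # 2: similarity to 0 is about 0.995
    (0.0, 1.0),     # 3: unrelated
]

kept_sensitive = greedy_dedup(vectors, threshold=0.95)
kept_strict = greedy_dedup(vectors, threshold=0.98)

print(kept_sensitive)  # [0, 3]
print(kept_strict)     # [0, 1, 3]
```

At 0.95 both neighbours of the reference are removed; at 0.98 only the closest one is. Reviewing the examples that fall between the two thresholds is a practical way to calibrate.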

Strategy 3: Near-Duplicate Detection via Clustering

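For very large datasets (100K+ examples), the full pairwise similarity matrix used in Strategy 2 becomes expensive, since its memory grows with the square of the number of examples. Use clustering on the embeddings instead: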

import json

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

def deduplicate_clustering(filepath, eps=0.05):
    """Cluster examples and keep one from each cluster."""
    model = SentenceTransformer("all-MiniLM-L6-v2")

    with open(filepath, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]
    instructions = [ex["instruction"] for ex in examples]
    embeddings = model.encode(instructions)

    # DBSCAN clustering: find dense groups (duplicates likely in same cluster).
    # With metric="cosine", eps=0.05 links examples whose cosine similarity
    # is at least 0.95. min_samples=1 puts every example in some cluster.
    clustering = DBSCAN(eps=eps, min_samples=1, metric="cosine").fit(embeddings)
    labels = clustering.labels_

    # Keep one example per cluster
    unique_examples = []
    seen_clusters = set()

    for i, label in enumerate(labels):
        if label not in seen_clusters:
            seen_clusters.add(label)
            unique_examples.append(examples[i])

    print(f"Removed {len(examples) - len(unique_examples)} duplicates (clusters={len(seen_clusters)})")
    return unique_examples
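
One consequence of min_samples=1 is worth knowing before relying on this approach: every point counts as a core point, so the clusters are simply the connected components of the graph that links points within eps of each other. Near-duplicates can therefore chain together. The sketch below illustrates that behaviour with a small union-find over 1-D integers standing in for embedding distances; it is an illustration of the grouping rule, not scikit-learn's implementation, and all names in it are hypothetical.

```python
def eps_components(points, eps):
    """Label connected components of the graph linking points within eps."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= eps:
                parent[find(j)] = find(i)

    return [find(i) for i in range(len(points))]

def keep_first_per_label(points, labels):
    """Keep the first point seen for each component label."""
    seen = set()
    kept = []
    for point, label in zip(points, labels):
        if label not in seen:
            seen.add(label)
            kept.append(point)
    return kept

points = [0, 4, 8, 50]   # 0 and 8 are 8 apart, but both are within 5 of 4
labels = eps_components(points, eps=5)
kept = keep_first_per_label(points, labels)

print(labels)  # [0, 0, 0, 3]
print(kept)    # [0, 50]
```

Here 0 and 8 end up in the same cluster even though they are further apart than eps, because 4 bridges them. The greedy pass in Strategy 2 would have kept 8. If chaining removes too much, a smaller eps is the usual first adjustment.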

Fixing Encoding and Formatting Errors

Encoding errors: Text may contain mojibake (garbled characters) from encoding mismatches.

import json
import ftfy  # Python library for fixing text encoding

def fix_encoding_errors(filepath, output_filepath):
    """Fix encoding errors using ftfy."""
    with open(filepath, encoding="utf-8", errors="replace") as f:
        examples = [json.loads(line) for line in f]

    fixed = 0
    for ex in examples:
        changed = False

        # Use ftfy to fix encoding in each text field that is present
        for field in ("instruction", "response"):
            if field in ex:
                repaired = ftfy.fix_text(ex[field])
                if repaired != ex[field]:
                    ex[field] = repaired
                    changed = True

        # Count only the examples that ftfy actually modified
        if changed:
            fixed += 1

    with open(output_filepath, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

    print(f"Fixed encoding in {fixed} examples")
    return examples
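
For readers who have not met mojibake before, the following standard-library sketch produces it deliberately and then reverses it. It shows the kind of mismatch that ftfy is designed to detect and undo automatically; the sample string is an arbitrary choice.

```python
original = "café"

# Simulate the mismatch: UTF-8 bytes mistakenly decoded as Latin-1
garbled = original.encode("utf-8").decode("latin-1")

# Reverse the mistake by re-encoding as Latin-1 and decoding as UTF-8
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)   # cafÃ©
print(repaired)  # café
```

Reversing by hand only works when you already know which two encodings were confused, which is rarely the case for scraped or merged datasets.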

Format validation:

import json
from jsonschema import validate, ValidationError

def validate_and_fix_format(filepath, output_filepath):
    """Validate examples match schema; remove invalid ones."""
    schema = {
        "type": "object",
        "properties": {
            "instruction": {"type": "string", "minLength": 5},
            "response": {"type": "string", "minLength": 5}
        },
        "required": ["instruction", "response"]
    }

    valid_examples = []
    invalid_examples = []

    with open(filepath, encoding="utf-8") as f:
        # Number lines from 1 so reported positions match a text editor
        for i, line in enumerate(f, start=1):
            try:
                ex = json.loads(line)
                validate(instance=ex, schema=schema)
                valid_examples.append(ex)
            except (json.JSONDecodeError, ValidationError) as e:
                invalid_examples.append((i, line, str(e)))

    with open(output_filepath, "w", encoding="utf-8") as f:
        for ex in valid_examples:
            f.write(json.dumps(ex) + "\n")

    print(f"Valid: {len(valid_examples)}, Invalid: {len(invalid_examples)}")
    if invalid_examples:
        print("Sample invalid examples:")
        for i, line, error in invalid_examples[:3]:
            print(f"  Line {i}: {error}")

    return valid_examples, invalid_examples
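
If adding jsonschema as a dependency is not an option, the same three rules (valid JSON object, both fields present as strings, at least 5 characters each) can be checked by hand. This is a minimal sketch under that assumption; check_line, REQUIRED_FIELDS, and MIN_LENGTH are hypothetical names, and the error messages are not those jsonschema would produce.

```python
import json

REQUIRED_FIELDS = ("instruction", "response")
MIN_LENGTH = 5

def check_line(line):
    """Return (example, None) if the line is valid, else (None, reason)."""
    try:
        ex = json.loads(line)
    except json.JSONDecodeError as e:
        return None, f"invalid JSON: {e.msg}"
    if not isinstance(ex, dict):
        return None, "not a JSON object"
    for field in REQUIRED_FIELDS:
        value = ex.get(field)
        if not isinstance(value, str):
            return None, f"missing or non-string field: {field}"
        if len(value) < MIN_LENGTH:
            return None, f"field too short: {field}"
    return ex, None

lines = [
    '{"instruction": "Summarize this text.", "response": "A short summary."}',
    '{"instruction": "Summarize this text."}',
    '{"instruction": "Hi", "response": "Hello there"}',
    'not json at all',
]

results = [check_line(line) for line in lines]
reasons = [reason for _, reason in results]

for line_number, reason in enumerate(reasons, start=1):
    print(line_number, "ok" if reason is None else reason)
```

Returning a reason string rather than raising keeps the rejected lines easy to log and review, in the same spirit as the invalid_examples list above.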

Removing PII (Personally Identifiable Information)

Never fine-tune on raw personal data. Use a PII detection library:

import json

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Build the engines once. AnalyzerEngine loads an NLP model, so creating it
# inside remove_pii() would repeat that cost for every field of every example.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def remove_pii(text):
    """Detect and mask PII: names, emails, phone numbers, etc."""
    # Detect PII
    results = analyzer.analyze(text=text, language="en")

    # Anonymize: replace with placeholders
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)

    return anonymized.text

def clean_pii_in_dataset(filepath, output_filepath):
    """Remove PII from all examples."""
    with open(filepath, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]

    for ex in examples:
        if "instruction" in ex:
            ex["instruction"] = remove_pii(ex["instruction"])
        if "response" in ex:
            ex["response"] = remove_pii(ex["response"])

    with open(output_filepath, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

    print(f"Cleaned PII from {len(examples)} examples")
    return examples
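
To make the idea of placeholder masking concrete without installing anything, here is a regex-only sketch. It covers just three US-style patterns and will miss names, addresses, and most international formats, so treat it as an illustration of the mechanism rather than a substitute for a dedicated PII library. The placeholder strings and the mask_pii name are assumptions.

```python
import re

# Placeholder -> pattern. Coverage is deliberately narrow (illustration only).
PATTERNS = {
    "<EMAIL>": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "<SSN>": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "<PHONE>": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def mask_pii(text):
    """Replace each pattern match with its placeholder."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

masked = mask_pii("Contact jane.doe@example.com or 555-123-4567.")
untouched = mask_pii("The meeting is at 10:30 in room 204.")

print(masked)     # Contact <EMAIL> or <PHONE>.
print(untouched)  # The meeting is at 10:30 in room 204.
```

Replacing matches with typed placeholders, instead of deleting them, keeps sentences grammatical so the model still learns the surrounding structure.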

Quality Metrics and Validation

Before finalizing your dataset, compute quality metrics:

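import json
import re
import unicodedata

import numpy as np

def compute_quality_metrics(filepath):
    """Compute dataset quality metrics."""
    with open(filepath, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]

    metrics = {
        "total_examples": len(examples),
        # Whitespace-separated word counts, a rough proxy for tokenizer tokens
        "avg_instruction_tokens": np.mean([len(ex["instruction"].split()) for ex in examples]),
        "avg_response_tokens": np.mean([len(ex["response"].split()) for ex in examples]),
        "min_instruction_length": min(len(ex["instruction"]) for ex in examples),
        "max_instruction_length": max(len(ex["instruction"]) for ex in examples),
        "examples_with_pii": sum(1 for ex in examples if contains_pii(ex)),
        "examples_with_special_chars": sum(1 for ex in examples if contains_special_chars(ex))
    }

    return metrics

def contains_pii(example):
    """Quick heuristic check for PII."""
    text = example["instruction"] + example["response"]
    # Check for email, phone, SSN patterns
    patterns = [
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSN
        r"\b\d{3}-\d{3}-\d{4}\b",  # Phone
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"  # Email
    ]
    for pattern in patterns:
        if re.search(pattern, text):
            return True
    return False

def contains_special_chars(example):
    """Heuristic check for characters that usually signal damaged text.

    Assumption: "special" here means control characters (other than newline,
    carriage return, and tab) or the Unicode replacement character U+FFFD.
    Adjust the rule to whatever counts as suspicious in your data.
    """
    text = example["instruction"] + example["response"]
    return any(
        ch == "\ufffd" or (unicodedata.category(ch) == "Cc" and ch not in "\n\r\t")
        for ch in text
    )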

Complete Cleaning Pipeline

import json

def full_cleaning_pipeline(raw_filepath, output_filepath):
    """End-to-end cleaning pipeline."""
    print("1. Fixing encoding errors...")
    fix_encoding_errors(raw_filepath, "step1_encoded.jsonl")

    print("2. Validating format...")
    valid, invalid = validate_and_fix_format("step1_encoded.jsonl", "step2_formatted.jsonl")

    print("3. Removing PII...")
    clean_pii_in_dataset("step2_formatted.jsonl", "step3_pii_removed.jsonl")

    print("4. Deduplicating (exact)...")
    deduplicate_exact("step3_pii_removed.jsonl", "step4_exact_dedup.jsonl")

    print("5. Deduplicating (semantic)...")
    unique = deduplicate_semantic("step4_exact_dedup.jsonl", threshold=0.95)
    with open(output_filepath, "w", encoding="utf-8") as f:
        for ex in unique:
            f.write(json.dumps(ex) + "\n")

    print(f"\nCleaning complete. Output: {output_filepath}")

full_cleaning_pipeline("raw_dataset.jsonl", "dataset_cleaned.jsonl")

Key Takeaways

  • Raw datasets often contain errors, duplicates, or formatting issues in 20–30% of examples; cleaning them can improve validation accuracy by 7–12 percentage points.
  • Use exact hashing for duplicates, semantic similarity for near-duplicates, and clustering for large datasets.
  • Always fix encoding errors, validate format, and remove PII before fine-tuning.
  • Compute quality metrics (token counts, special characters, PII presence) to establish a baseline.
  • Run a complete cleaning pipeline: encoding, format validation, PII removal, exact deduplication, semantic deduplication.

Frequently Asked Questions

How strict should my deduplication threshold be?

Use 0.95 for high sensitivity (removes near-identical examples, including light rephrasings). Use 0.98 for low sensitivity (removes only examples that are almost word-for-word the same). Start at 0.95 and manually review flagged duplicates to calibrate for your domain.

Can I partially clean a dataset?

Yes. If you have 10,000 examples and can only manually review 500, focus cleaning effort on the most critical steps: deduplication, PII removal, and format validation. Encoding fixes can be automated.

What if my dataset has legitimate duplicates (e.g., multiple correct answers for the same question)?

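Keep them. Deduplication should only remove near-identical examples, not examples with the same instruction and different correct responses. The semantic and clustering functions above compare the instruction only, so they would collapse such examples into one. To preserve them, compare the instruction and response together, for example by embedding their concatenation.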
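
A small sketch of how the comparison key decides what counts as a duplicate. It uses exact matching on toy data for brevity; the same choice applies to the text you embed for semantic matching. The dedup_by helper and the sample records are hypothetical.

```python
examples = [
    {"instruction": "Give a synonym for 'happy'.", "response": "joyful"},
    {"instruction": "Give a synonym for 'happy'.", "response": "cheerful"},
    {"instruction": "Give a synonym for 'happy'.", "response": "joyful"},
]

def dedup_by(items, key):
    """Keep the first item seen for each distinct key."""
    seen = set()
    kept = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            kept.append(item)
    return kept

# Key on the instruction alone: alternative answers are lost
by_instruction = dedup_by(examples, key=lambda ex: ex["instruction"])

# Key on the (instruction, response) pair: only the true repeat is removed
by_pair = dedup_by(examples, key=lambda ex: (ex["instruction"], ex["response"]))

print(len(by_instruction))                 # 1
print([ex["response"] for ex in by_pair])  # ['joyful', 'cheerful']
```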

How do I know if an encoding error is real or intentional?

Use the ftfy library and manually sample outputs. If the fixed text reads naturally, the error was likely encoding. If fixing creates nonsense, the original might be intentional (e.g., emoji, special symbols). Review edge cases manually.

Should I clean before or after train/val split?

Clean before splitting. Otherwise, copies of the same example can land on both sides of the split, so the model is evaluated on data it has already trained on and your validation scores look better than they should.
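
The leakage is easy to demonstrate on toy data. In this sketch a deterministic every-fourth-item split stands in for a random split so the result is reproducible; the helper names and the data are assumptions for illustration.

```python
def split_every_fourth(items):
    """Send every fourth item to validation, the rest to training."""
    val = items[::4]
    train = [item for index, item in enumerate(items) if index % 4 != 0]
    return train, val

def dedup(items):
    """Remove exact duplicates, preserving first-seen order."""
    return list(dict.fromkeys(items))

# 5 distinct questions, each appearing 4 times
raw = [f"q{i % 5}" for i in range(20)]

# Split first: duplicates straddle the boundary
train_raw, val_raw = split_every_fourth(raw)
leaked_before = set(train_raw) & set(val_raw)

# Clean first, then split: no example appears on both sides
train_clean, val_clean = split_every_fourth(dedup(raw))
leaked_after = set(train_clean) & set(val_clean)

print(len(leaked_before))  # 5
print(len(leaked_after))   # 0
```

In the split-first case every validation question also appears in training, so validation accuracy would mostly measure memorization.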
