Privacy Protection and PII Removal

Language models trained on real-world text sometimes generate realistic-looking personally identifiable information (PII) such as names, emails, phone numbers, and addresses. Even though this data is synthetic, it resembles real PII and may inadvertently match actual individuals. Removing or masking PII makes synthetic datasets safer to share and supports compliance with regulations such as GDPR, although masking alone does not guarantee compliance. A 2025 analysis by Privacy International found that 12% of unfiltered LLM-generated text contains realistic PII-like data that could enable de-anonymization attacks.

Categories of PII to Detect

| PII Type | Examples | Detection Method | Masking Strategy |
|---|---|---|---|
| Names | John Smith, Maria Garcia, Liu Wei | NER + name databases | Replace with [NAME] or generate fake names |
| Email addresses | [email protected], [email protected] | Regex + validation | Replace with [EMAIL] or fake domain |
| Phone numbers | (555) 123-4567, +44 20 1234 5678 | Regex + digit patterns | Replace with [PHONE] |
| Social Security Numbers | XXX-XX-1234, variations | Specific regex pattern | Replace with [SSN] or masked version |
| Home addresses | 123 Main St, Boston MA 02115 | NER + address validation | Replace with [ADDRESS] or city only |
| Credit card numbers | 4532-1234-5678-9010 | Luhn algorithm validation | Replace with [CARD] or last 4 digits |
| Dates of birth | 1985-03-22, March 22, 1985 | Date parsing + age inference | Replace with [DOB] or age range |
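
The table lists Luhn validation as the detection method for credit card numbers. The following is a minimal sketch of how that check might look, using only the standard library. The names `luhn_valid`, `find_card_numbers`, and `CARD_PATTERN` are illustrative assumptions, and the pattern only covers 16-digit numbers written in groups of four.

```python
import re
from typing import List

# Assumed format: 16 digits in groups of four, optionally separated by "-" or " ".
CARD_PATTERN = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not 13 <= len(digits) <= 19:  # typical card-number lengths
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return regex matches that also pass the Luhn checksum."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]
```

Filtering regex hits through the checksum should cut false positives from arbitrary 16-digit strings such as order or tracking numbers, at the cost of missing card-like numbers that a model invented without a valid check digit.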

Named Entity Recognition (NER) for PII Detection

Use pre-trained NER models to identify PII patterns:

import spacy
from typing import List, Tuple, Dict
import re

# Load pre-trained NER model
nlp = spacy.load("en_core_web_sm")

# Add specialized PII recognition
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Initialize Presidio (specialized PII detection)
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def detect_pii_with_nlp(text: str) -> List[Tuple[str, str, int, int]]:
    """
    Detect PII entities in text using spaCy NER.

    Returns: List of (entity_type, value, start_char, end_char)
    """

    doc = nlp(text)
    pii_entities = []

    # Extract named entities (PERSON, ORG, etc.)
    for ent in doc.ents:
        if ent.label_ in ["PERSON", "ORG", "GPE"]:  # GPE = geopolitical entity
            pii_entities.append((
                ent.label_,
                ent.text,
                ent.start_char,
                ent.end_char
            ))

    return pii_entities

def detect_pii_with_presidio(text: str) -> List[Dict]:
    """
    Detect PII using Presidio, which recognizes:
    - Email, phone, credit card, SSN, names, locations, dates

    Findings carry Presidio's own entity labels (e.g. PERSON, EMAIL_ADDRESS,
    PHONE_NUMBER, US_SSN, CREDIT_CARD), which differ from the regex labels.
    """

    try:
        results = analyzer.analyze(text=text, language="en")

        pii_findings = []
        for finding in results:
            pii_findings.append({
                "entity_type": finding.entity_type,
                "value": text[finding.start:finding.end],
                "start": finding.start,
                "end": finding.end,
                "score": finding.score  # Confidence 0-1
            })

        return pii_findings
    except Exception as e:
        # Failing open: an analyzer error yields no findings for this text
        print(f"Presidio error: {e}")
        return []

def detect_pii_regex(text: str) -> List[Dict]:
    """
    Detect PII using regex patterns for common formats.
    Good for specific high-confidence patterns.
    """

    patterns = {
        # [A-Za-z] for the TLD: a "|" inside a character class is a literal pipe
        "email": (r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', "EMAIL"),
        # \s in the separators so "(555) 123-4567" also matches (US numbers only)
        "phone": (r'(\+1[-.\s]?)?\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})', "PHONE"),
        "ssn": (r'\b\d{3}-\d{2}-\d{4}\b', "SSN"),
        "credit_card": (r'\b\d{4}[-]?\d{4}[-]?\d{4}[-]?\d{4}\b', "CREDIT_CARD"),
        # Matches any standalone 5-digit number, so expect false positives
        "zipcode": (r'\b\d{5}(?:-\d{4})?\b', "ZIPCODE"),
    }

    findings = []
    for pattern_name, (pattern, pii_type) in patterns.items():
        for match in re.finditer(pattern, text):
            findings.append({
                "entity_type": pii_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
                "score": 0.95  # Fixed score assigned to every regex hit
            })

    return findings

# Usage:
# text = "John Smith's email is [email protected] and phone is (555) 123-4567"
# presidio_findings = detect_pii_with_presidio(text)
# regex_findings = detect_pii_regex(text)
# all_findings = presidio_findings + regex_findings
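
Concatenating the two lists means the same span can be reported twice (for example, an email found by both detectors), and spans can partially overlap (a phone pattern matching inside a card number). Index-based masking can corrupt the text when that happens. Below is one possible way to resolve overlaps before masking; `resolve_overlaps` is a hypothetical helper, and the greedy "highest score, then longest span" rule is an assumption rather than the only reasonable policy.

```python
from typing import Dict, List


def resolve_overlaps(findings: List[Dict]) -> List[Dict]:
    """Keep one finding per group of overlapping spans.

    Preference order: higher score first, then longer span.
    """
    ranked = sorted(
        findings,
        key=lambda f: (f["score"], f["end"] - f["start"]),
        reverse=True,
    )
    kept: List[Dict] = []
    for candidate in ranked:
        # Two half-open spans [start, end) overlap if each starts before the other ends.
        overlaps = any(
            candidate["start"] < k["end"] and k["start"] < candidate["end"]
            for k in kept
        )
        if not overlaps:
            kept.append(candidate)
    # Return in document order for readability.
    return sorted(kept, key=lambda f: f["start"])
```

With this helper, the usage above would become `all_findings = resolve_overlaps(presidio_findings + regex_findings)`.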

Masking and Anonymization Strategies

Strategy 1: Simple Replacement with Placeholders

Replace detected PII with generic placeholders:

def mask_pii_simple(text: str, pii_findings: List[Dict]) -> str:
    """
    Replace PII with [ENTITY_TYPE] placeholders.

    The placeholder is the finding's entity_type, so the label depends on the
    detector: Presidio reports PERSON and PHONE_NUMBER, the regex detector PHONE.

    Example: "John Smith (555-123-4567) called"
    becomes "[PERSON] ([PHONE_NUMBER]) called" with Presidio labels.

    Assumes findings do not overlap; overlapping spans would corrupt the output.
    """

    # Sort findings by position (reverse order to maintain indices)
    sorted_findings = sorted(pii_findings, key=lambda x: x['start'], reverse=True)

    result = text
    for finding in sorted_findings:
        placeholder = f"[{finding['entity_type']}]"
        start, end = finding['start'], finding['end']
        result = result[:start] + placeholder + result[end:]

    return result

# Example:
# original = "Contact Sarah Jones at [email protected] or 415-555-0123"
# pii = detect_pii_with_presidio(original)
# masked = mask_pii_simple(original, pii)
# print(masked) # e.g. "Contact [PERSON] at [EMAIL_ADDRESS] or [PHONE_NUMBER]" (Presidio labels)

Strategy 2: Synthetic PII Generation (Masking with Plausible Values)

Replace PII with generated synthetic equivalents:

import random
import faker

fake = faker.Faker()

def mask_pii_synthetic(text: str, pii_findings: List[Dict]) -> str:
    """
    Replace PII with plausible synthetic values (fake names, emails, etc).

    Preserves realism: a model trained on [NAME] tokens learns nothing about
    names, but training on fake names helps the model understand name patterns.

    Accepts both the regex labels (EMAIL, PHONE) and Presidio's labels
    (EMAIL_ADDRESS, PHONE_NUMBER, LOCATION). Assumes findings do not overlap.
    """

    pii_mapping = {}  # Map original → synthetic for consistency within this text

    sorted_findings = sorted(pii_findings, key=lambda x: x['start'], reverse=True)
    result = text

    for finding in sorted_findings:
        entity_type = finding['entity_type']
        original_value = finding['value']

        # Reuse same synthetic value if we've seen this PII before
        if original_value not in pii_mapping:
            if entity_type == "PERSON":
                pii_mapping[original_value] = fake.name()
            elif entity_type in ("EMAIL", "EMAIL_ADDRESS"):
                pii_mapping[original_value] = fake.email()
            elif entity_type in ("PHONE", "PHONE_NUMBER"):
                pii_mapping[original_value] = fake.phone_number()
            elif entity_type == "ADDRESS":
                pii_mapping[original_value] = fake.address().replace('\n', ', ')
            elif entity_type == "LOCATION":
                pii_mapping[original_value] = fake.city()
            elif entity_type == "CREDIT_CARD":
                pii_mapping[original_value] = f"****-****-****-{random.randint(1000, 9999)}"
            else:
                # Unknown types fall back to a placeholder
                pii_mapping[original_value] = f"[{entity_type}]"

        synthetic_value = pii_mapping[original_value]
        start, end = finding['start'], finding['end']
        result = result[:start] + synthetic_value + result[end:]

    return result

# Example:
# original = "John Doe emailed [email protected] about order #12345"
# pii = detect_pii_with_presidio(original)
# masked = mask_pii_synthetic(original, pii)
# print(masked) # e.g. "Robert Smith emailed [email protected] about order #12345" (values are random)

Synthetic PII masking preserves semantic information for model training while removing linkability to real individuals.
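
The mapping in `mask_pii_synthetic` is rebuilt on every call, so the same person can receive different fake names in different examples. If cross-example consistency matters, one option is to derive the replacement deterministically from a keyed hash of the original value. The sketch below uses only the standard library; `stable_pseudonym`, `REPLACEMENT_POOLS`, and the tiny pools are hypothetical and for illustration only.

```python
import hashlib
import hmac
from typing import Dict, List

# Hypothetical replacement pools; a real pipeline would need far larger lists
# to keep different originals from colliding on the same replacement.
REPLACEMENT_POOLS: Dict[str, List[str]] = {
    "PERSON": ["Avery Quinn", "Jordan Lee", "Sam Rivera", "Casey Kim"],
    "EMAIL": ["user1@example.com", "user2@example.com", "user3@example.com"],
}


def stable_pseudonym(value: str, entity_type: str, secret: bytes) -> str:
    """Map the same original value to the same replacement on every call."""
    pool = REPLACEMENT_POOLS.get(entity_type)
    if not pool:
        return f"[{entity_type}]"  # fall back to a placeholder
    # Normalize so "John Doe" and "john doe " map to the same replacement.
    normalized = value.strip().lower().encode("utf-8")
    # Keyed hash: without the secret, the mapping cannot be recomputed.
    digest = hmac.new(secret, normalized, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]
```

The secret should be kept out of the shared dataset; anyone holding it can test whether a guessed original maps to a given replacement.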

Complete PII Removal Pipeline

def remove_pii_pipeline(
    examples: List[Dict],
    text_fields: List[str],
    masking_strategy: str = "synthetic"
) -> Tuple[List[Dict], Dict]:
    """
    Full pipeline: detect PII across all text fields → mask → return cleaned examples

    Args:
        examples: List of example dicts
        text_fields: Fields containing text to check (e.g., ['description', 'feedback'])
        masking_strategy: 'placeholder' or 'synthetic'

    Returns:
        cleaned_examples, statistics
    """

    stats = {
        "input": len(examples),
        "pii_findings": 0,
        "examples_with_pii": 0,
        "pii_by_type": {}
    }

    cleaned_examples = []

    for example in examples:
        example_has_pii = False
        cleaned_example = example.copy()

        for field in text_fields:
            if field not in example:
                continue

            text = str(example[field])

            # Detect PII (combine multiple detection methods)
            # NOTE: the two detectors can report the same or overlapping spans;
            # resolve overlaps before masking, or index-based replacement can
            # corrupt the text and inflate the counts below.
            pii_findings = detect_pii_with_presidio(text)
            pii_findings += detect_pii_regex(text)

            if pii_findings:
                example_has_pii = True
                stats["pii_findings"] += len(pii_findings)

                for finding in pii_findings:
                    pii_type = finding['entity_type']
                    stats["pii_by_type"][pii_type] = stats["pii_by_type"].get(pii_type, 0) + 1

                # Mask PII
                if masking_strategy == "synthetic":
                    cleaned_example[field] = mask_pii_synthetic(text, pii_findings)
                else:
                    cleaned_example[field] = mask_pii_simple(text, pii_findings)

        if example_has_pii:
            stats["examples_with_pii"] += 1

        cleaned_examples.append(cleaned_example)

    # Guard against division by zero when no examples are passed in
    rate = 100 * stats["examples_with_pii"] / stats["input"] if stats["input"] else 0.0
    stats["pii_rate"] = f"{rate:.1f}%"

    return cleaned_examples, stats

# Usage:
# examples = [...]
# cleaned, stats = remove_pii_pipeline(examples, text_fields=['description', 'feedback'])
# print(f"Found PII in {stats['examples_with_pii']} examples ({stats['pii_rate']})")
# print(f"PII breakdown: {stats['pii_by_type']}")

Validation: Ensure PII is Actually Removed

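After masking, scan the cleaned text again for residual PII. This check is most meaningful with placeholder masking: with synthetic masking, the detector may also flag the generated replacement values, since they are designed to look like real PII.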

def validate_pii_removal(examples: List[Dict], text_fields: List[str]) -> bool:
    """
    Scan cleaned examples to ensure PII is truly gone.
    """

    for example in examples:
        for field in text_fields:
            text = str(example.get(field, ""))

            # Strict detection (high confidence only)
            findings = analyzer.analyze(text=text, language="en")
            high_confidence = [f for f in findings if f.score > 0.9]

            if high_confidence:
                print(f"WARNING: Possible residual PII found: {high_confidence}")
                return False

    print("PII removal validated: no residual PII detected.")
    return True
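
Re-running a detector only catches what that detector recognizes. A complementary check, sketched below under the assumption that the original findings are still available, is to confirm that none of the original PII strings survive verbatim in the cleaned text. `find_leaked_values` is a hypothetical helper, not part of Presidio.

```python
from typing import Dict, List


def find_leaked_values(original_findings: List[Dict], cleaned_text: str) -> List[str]:
    """Return original PII strings that still appear verbatim in the cleaned text."""
    lowered = cleaned_text.lower()
    leaked: List[str] = []
    for finding in original_findings:
        value = finding["value"].strip()
        # Case-insensitive substring match; each leaked value is reported once.
        if value and value.lower() in lowered and value not in leaked:
            leaked.append(value)
    return leaked
```

This is a plain substring match, so very short values can produce false alarms, and a synthetic replacement that happens to equal the original would also be reported.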

Integration into Full Pipeline

def full_privacy_pipeline(
    raw_examples: List[str],
    schema: Dict,
    text_fields: List[str],
    masking_strategy: str = "synthetic"
) -> Tuple[List[Dict], Dict]:
    """
    Full pipeline: validate → deduplicate → remove PII → return shareable dataset

    Note: full_validation_pipeline and deduplicate_by_embedding are not
    defined in this section and must be available in scope.
    """

    # Step 1: Validate
    validated, val_stats = full_validation_pipeline(raw_examples, schema)

    # Step 2: Deduplicate
    deduplicated, dedup_stats = deduplicate_by_embedding(validated)

    # Step 3: Remove PII
    pii_cleaned, pii_stats = remove_pii_pipeline(
        deduplicated,
        text_fields=text_fields,
        masking_strategy=masking_strategy
    )

    # Step 4: Validate removal
    pii_removed_ok = validate_pii_removal(pii_cleaned, text_fields)

    # Later dicts win on key collisions (e.g. a shared "input" key)
    combined_stats = {**val_stats, **dedup_stats, **pii_stats, "pii_validation": pii_removed_ok}

    return pii_cleaned, combined_stats

# Usage:
# final_dataset, summary = full_privacy_pipeline(
# raw_outputs,
# schema,
# text_fields=['description', 'feedback'],
# masking_strategy='synthetic'
# )
# print(f"Final dataset size: {len(final_dataset)}, PII-safe: {summary['pii_validation']}")

Key Takeaways

  • Roughly 12% of unfiltered LLM-generated text contains realistic PII-like data, according to the analysis cited above, which creates a risk of privacy violations.
  • Combine Presidio (NER models plus built-in pattern recognizers) with your own regex patterns for broader detection coverage.
  • Synthetic PII masking preserves semantic information better than placeholder masking.
  • Always validate that PII removal was successful before sharing datasets.
  • Integrate PII removal into your full pipeline as the final step before export.

Frequently Asked Questions

Should I use placeholder or synthetic masking for PII?

Synthetic masking is better for training datasets because models see realistic name patterns and language structures. Use placeholders only for internal tools where readability matters more than training utility.

What if the model generates PII that matches real people?

This is possible but statistically rare (estimated <0.1% in real-world scenarios). Masking all detected PII greatly reduces this risk, but it cannot eliminate it, because no detector catches every instance. For high-stakes applications, add a final manual review of a sample (for example, 100 examples) to confirm no real-person matches.

Does PII removal affect model performance?

Generally not. Models trained on synthetic-name examples can learn name structure patterns much as they would from real names. Using varied synthetic names (rather than real names clustered by geography/ethnicity) may even reduce bias in your trained model.

What about less obvious PII like hashtags or social media handles?

Presidio detects some of these. For domain-specific PII (product IDs, customer codes, internal acronyms), create custom regex patterns. The key is identifying what constitutes PII in your specific context.
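
As a rough illustration of custom patterns, the sketch below returns findings in the same dict shape used by the detectors above. Both formats (`CUST-` followed by six digits, and `@handle` style usernames) are made-up assumptions; substitute whatever identifiers count as PII in your data.

```python
import re
from typing import Dict, List

# Hypothetical formats, for illustration only; replace with your own.
CUSTOM_PATTERNS = {
    "CUSTOMER_CODE": re.compile(r"\bCUST-\d{6}\b"),
    # The lookbehind skips the "@" inside email addresses.
    "SOCIAL_HANDLE": re.compile(r"(?<![\w@])@[A-Za-z0-9_]{2,30}\b"),
}


def detect_custom_pii(text: str) -> List[Dict]:
    """Find domain-specific identifiers using the patterns above."""
    findings = []
    for entity_type, pattern in CUSTOM_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({
                "entity_type": entity_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
                "score": 0.8,  # arbitrary fixed score, not a calibrated confidence
            })
    return findings
```

Because the output shape matches the other detectors, these findings can be concatenated with theirs and passed to the same masking functions.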

Further Reading