Synthetic data generation for fine-tuning: Techniques
Synthetic data generation creates training examples programmatically instead of collecting them manually. A strong LLM can generate diverse, plausible examples in seconds at a fraction of the cost of human annotation. The figures cited below suggest that datasets blending 70–80% authentic examples with 20–30% high-quality synthetic examples perform about as well as authentic-only datasets on held-out tests, and sometimes slightly better in narrow domains. This article covers five synthetic generation techniques and best practices for validation.
Why Synthetic Data Works
Synthetic data generated by strong models (GPT-4, Claude 3.5 Sonnet) is often diverse and correct. A language model trained on billions of tokens has seen far more phrasings of a task than any single annotator is likely to write, which helps it cover the input distribution. The challenge is ensuring quality: synthetic data must be valid, diverse, and representative of real-world inputs.
A 2025 meta-analysis across 50 fine-tuning projects found:
- 50% authentic + 50% synthetic: 3–5% performance drop on held-out test.
- 70% authentic + 30% synthetic: 0–2% performance drop (within noise).
- 80% authentic + 20% synthetic: Slight improvement (0–2%) on narrow domains.
- 100% synthetic: 10–20% performance drop (diversity insufficient).
The sweet spot is 70–80% authentic, 20–30% synthetic.
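The ratios above translate into a simple budget: for a fixed number of authentic examples, how many synthetic examples can be added before the synthetic share exceeds a target. The following is a minimal sketch of that arithmetic; the function name `synthetic_budget` and the integer-percent parameter are illustrative choices, not part of any library.

```python
def synthetic_budget(n_authentic: int, synthetic_pct: int = 30) -> int:
    """Return the largest synthetic count that keeps the synthetic share
    at or below `synthetic_pct` percent of the final dataset."""
    if not 0 <= synthetic_pct < 100:
        raise ValueError("synthetic_pct must be between 0 and 99")
    # synthetic / (authentic + synthetic) = p / 100
    #   =>  synthetic = authentic * p / (100 - p)
    # Integer arithmetic avoids floating-point rounding surprises.
    return n_authentic * synthetic_pct // (100 - synthetic_pct)


if __name__ == "__main__":
    print(synthetic_budget(700, 30))  # 300
    print(synthetic_budget(800, 20))  # 200
```

The authentic count is the binding constraint here: with only a handful of real examples, the synthetic budget at a 20–30% share is correspondingly small.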
Technique 1: Prompt Augmentation (Paraphrasing)
Given seed examples, generate variations of the instruction with the same response:
import anthropic

client = anthropic.Anthropic()


def augment_examples_by_paraphrase(seed_examples, n_per_example=2):
    """Generate paraphrased instructions from seed examples."""
    augmented = []
    for seed in seed_examples:
        instruction = seed["instruction"]
        response = seed["response"]
        prompt = f"""
Given this instruction:
"{instruction}"
Generate {n_per_example} alternative ways a user might phrase the same request.
Requirements:
- Preserve the original meaning and intent.
- Use different vocabulary and phrasing.
- Keep each variation under 100 words.
Format: one instruction per line (no numbering).
"""
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}]
        )
        # One variation per line; blank lines are skipped below
        variations = message.content[0].text.strip().split("\n")
        for var in variations:
            if var.strip():
                new_example = {
                    "instruction": var.strip(),
                    "response": response,  # response is reused unchanged
                    "source": "synthetic_paraphrase"
                }
                augmented.append(new_example)
    return augmented


# Example: augment a small set of seed examples
seed_examples = [
    {"instruction": "What is machine learning?", "response": "Machine learning is..."},
    {"instruction": "How do I use async/await in Python?", "response": "Async/await allows..."}
]
augmented = augment_examples_by_paraphrase(seed_examples, n_per_example=3)
print(f"Generated {len(augmented)} synthetic examples from {len(seed_examples)} seeds")
Pros: Fast, cheap, preserves correctness (response unchanged). Cons: Limited diversity; all variations share the same response.
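Even when the prompt asks for one instruction per line with no numbering, model output sometimes comes back with list markers, quotes, or a repeat of the original instruction. The sketch below shows one way to clean such output before building examples. The helper name `clean_variations` is hypothetical, and the set of markers it strips is an assumption about typical output rather than a guarantee.

```python
import re

# Matches leading list markers such as "1.", "2)", "-", "*", or "•".
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def clean_variations(raw_text: str, original: str) -> list[str]:
    """Split model output into unique, non-empty instruction lines.

    Drops list markers, surrounding double quotes, case-insensitive
    duplicates, and any line that just repeats the original instruction.
    """
    seen = {original.strip().lower()}
    cleaned = []
    for line in raw_text.splitlines():
        text = _LIST_MARKER.sub("", line).strip().strip('"').strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned
```

In the paraphrasing function above, this could replace the plain `split("\n")` step, so that numbered or quoted lines do not leak into the training set.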
Technique 2: Back-Translation
Translate the instruction into another language, then back into the original language. Small shifts in wording introduced by the round trip create natural variations:
def back_translate_example(example, intermediate_language="Spanish"):
    """Back-translate an instruction via an intermediate language."""
    instruction = example["instruction"]
    response = example["response"]

    # Step 1: Translate to intermediate language
    prompt_1 = (
        f'Translate this to {intermediate_language}. '
        f'Reply with the translation only: "{instruction}"'
    )
    msg_1 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt_1}]
    )
    intermediate = msg_1.content[0].text.strip()

    # Step 2: Translate back to English
    prompt_2 = (
        f'Translate this from {intermediate_language} to English. '
        f'Reply with the translation only: "{intermediate}"'
    )
    msg_2 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt_2}]
    )
    # Drop surrounding quotes the model may echo back
    back_translated = msg_2.content[0].text.strip().strip('"')

    return {
        "instruction": back_translated,
        "response": response,
        "source": f"synthetic_back_translation_{intermediate_language}"
    }


# Example
example = {"instruction": "Explain neural networks", "response": "Neural networks are..."}
synthetic = back_translate_example(example, intermediate_language="French")
print(synthetic)
Pros: Creates diverse, natural rephrasing; response accuracy preserved. Cons: Translation errors; slower than paraphrasing.
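Because translation errors are the main risk, it can help to discard round trips that either come back unchanged or drift too far from the original wording. The sketch below uses the standard library's `difflib.SequenceMatcher` as a cheap character-level check. The function name and both thresholds are assumptions chosen for illustration and would need tuning on real data; a character-level ratio is only a rough proxy for meaning.

```python
from difflib import SequenceMatcher


def back_translation_ok(original: str, back_translated: str,
                        min_ratio: float = 0.4, max_ratio: float = 0.95) -> bool:
    """Keep a back-translation only if it differs from the original
    without drifting too far from it (character-level similarity)."""
    ratio = SequenceMatcher(None, original.lower(), back_translated.lower()).ratio()
    # Below min_ratio: wording drifted so far that the meaning may have changed.
    # Above max_ratio: effectively a copy, so it adds no diversity.
    return min_ratio <= ratio <= max_ratio
```

An embedding-based comparison would capture meaning better, at the cost of an extra dependency and slower filtering.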
Technique 3: In-Context Learning (Few-Shot Generation)
Provide a strong model with a few examples, then ask it to generate more:
def generate_examples_few_shot(seed_examples, n_to_generate=20):
    """Use few-shot prompting to generate new examples."""
    # Create few-shot prompt from up to three seed examples
    few_shot_examples = "\n".join([
        f"Instruction: {ex['instruction']}\nResponse: {ex['response']}\n---"
        for ex in seed_examples[:3]
    ])
    prompt = f"""
You are a data generator. Given examples of instruction-response pairs, generate {n_to_generate} more examples that follow the same pattern and quality.
Examples:
{few_shot_examples}
Generate {n_to_generate} new instruction-response pairs. Format each as:
Instruction: [text]
Response: [text]
---
Ensure:
- Instructions are diverse and realistic.
- Responses are accurate and helpful.
- No duplicates of the seed examples.
"""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}]
    )
    output = message.content[0].text

    # Parse generated examples
    generated = []
    for pair in output.split("---"):
        instruction, response_lines = None, []
        in_response = False
        for line in pair.strip().split("\n"):
            if line.startswith("Instruction:"):
                instruction = line[len("Instruction:"):].strip()
                in_response = False
            elif line.startswith("Response:"):
                response_lines = [line[len("Response:"):].strip()]
                in_response = True
            elif in_response:
                # Keep continuation lines so multi-line responses are not truncated
                response_lines.append(line)
        response = "\n".join(response_lines).strip()
        if instruction and response:
            generated.append({
                "instruction": instruction,
                "response": response,
                "source": "synthetic_few_shot"
            })
    return generated


seed = [
    {"instruction": "What is recursion?", "response": "Recursion is a function that calls itself..."}
]
generated = generate_examples_few_shot(seed, n_to_generate=10)
print(f"Generated {len(generated)} examples via few-shot")
Pros: Highly flexible; model understands full task from examples. Cons: Requires careful prompt engineering; can be expensive for large scale.
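One way to keep cost and repetition down at larger scale is to issue several smaller requests, each primed with a different random subset of seeds, instead of one large request that always sees the same three examples. The sketch below only builds those seed subsets. The function name `sample_seed_batches` and its parameters are hypothetical, and whether rotating seeds actually improves diversity for a given task is something to check empirically.

```python
import random


def sample_seed_batches(seed_examples, shots=3, n_batches=5, rng_seed=0):
    """Return `n_batches` lists of seeds, each with up to `shots` examples.

    Seeds are sampled without replacement within a batch, so a single
    prompt never shows the same seed twice.
    """
    rng = random.Random(rng_seed)  # fixed seed keeps runs reproducible
    k = min(shots, len(seed_examples))
    return [rng.sample(seed_examples, k) for _ in range(n_batches)]
```

Each returned batch could then be passed to a few-shot generator with a smaller per-call target, and the results concatenated before validation.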
Technique 4: Task-Specific Generation
For specialized tasks (code generation, SQL queries, regex), generate with constraints:
def generate_code_examples(task_description, language="Python", n=10):
    """Generate code examples with problem description and solution."""
    prompt = f"""
Generate {n} programming exercises in {language} with solution code.
Task type: {task_description}
For each example, provide:
1. Problem description (2-3 sentences)
2. Example input/output if applicable
3. Complete, working solution code
Format each as:
Problem: [description]
Code: [solution]
---
"""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=3000,
        messages=[{"role": "user", "content": prompt}]
    )
    output = message.content[0].text

    # Parse examples
    examples = []
    for pair in output.split("---"):
        # Initialize per pair so a chunk without "Code:" cannot raise NameError
        # or inherit code lines from the previous chunk
        problem, code_lines = None, []
        in_code = False
        for line in pair.strip().split("\n"):
            if line.startswith("Problem:"):
                problem = line[len("Problem:"):].strip()
            elif line.startswith("Code:"):
                in_code = True
                code_lines = [line[len("Code:"):].strip()]
            elif in_code:
                # Preserve indentation of the solution code
                code_lines.append(line)
        code = "\n".join(code_lines).strip("\n")
        if problem and code.strip():
            examples.append({
                "instruction": f"Write code: {problem}",
                "response": code,
                "source": "synthetic_code_generation"
            })
    return examples


code_examples = generate_code_examples(
    task_description="Fibonacci sequence with memoization",
    language="Python",
    n=5
)
print(f"Generated {len(code_examples)} code examples")
Pros: Task-specific; can generate complex structured examples. Cons: Requires domain expertise in prompt design.
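For generated Python solutions, a cheap first filter is to check that the code at least parses. The sketch below uses the standard library's `ast.parse`, which checks syntax without executing anything. The helper names are hypothetical, and passing this check says nothing about whether the solution is correct; responses wrapped in markdown fences would also fail unless the fences are stripped first.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python; the code is never executed."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def filter_parsable(examples):
    """Keep only examples whose `response` field is syntactically valid Python."""
    return [ex for ex in examples if is_valid_python(ex["response"])]
```

A stricter pipeline would go further and run each solution against test cases in a sandbox, which is outside the scope of this sketch.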
Technique 5: Conditional Generation
Generate examples conditioned on attributes (e.g., sentiment, domain, difficulty):
def generate_conditional_examples(attributes, n_per_attribute=5):
    """Generate examples with specific attributes."""
    attribute_specs = ", ".join([f"{k}={v}" for k, v in attributes.items()])
    prompt = f"""
Generate {n_per_attribute} instruction-response pairs with these characteristics:
{attribute_specs}
Requirements:
- Each instruction must genuinely require the specified attributes.
- Responses should be high-quality and helpful.
- Preserve diversity across the examples.
Format:
Instruction: [text]
Response: [text]
---
"""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1500,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse examples, keeping multi-line responses intact
    output = message.content[0].text
    examples = []
    for pair in output.split("---"):
        instruction, response_lines = None, []
        in_response = False
        for line in pair.strip().split("\n"):
            if line.startswith("Instruction:"):
                instruction = line[len("Instruction:"):].strip()
                in_response = False
            elif line.startswith("Response:"):
                response_lines = [line[len("Response:"):].strip()]
                in_response = True
            elif in_response:
                response_lines.append(line)
        response = "\n".join(response_lines).strip()
        if instruction and response:
            example = {"instruction": instruction, "response": response}
            example.update(attributes)  # record the conditioning attributes
            example["source"] = "synthetic_conditional"
            examples.append(example)
    return examples


# Generate technical support responses
tech_examples = generate_conditional_examples({
    "domain": "technical_support",
    "difficulty": "advanced",
    "tone": "professional"
}, n_per_attribute=3)

# Generate beginner-friendly explanations
beginner_examples = generate_conditional_examples({
    "domain": "education",
    "difficulty": "beginner",
    "tone": "encouraging"
}, n_per_attribute=3)
Pros: Precise control over example attributes; ensures diverse coverage. Cons: Requires explicit attribute definition.
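To get even coverage without writing each attribute dictionary by hand, the combinations can be enumerated up front. This is a minimal sketch using `itertools.product`; the function name `attribute_grid` and the example attribute values are illustrative.

```python
from itertools import product


def attribute_grid(attribute_values):
    """Expand {name: [values, ...]} into one attribute dict per combination."""
    names = list(attribute_values)
    return [dict(zip(names, combo)) for combo in product(*attribute_values.values())]


if __name__ == "__main__":
    grid = attribute_grid({
        "difficulty": ["beginner", "advanced"],
        "tone": ["professional", "encouraging"],
    })
    for attributes in grid:
        print(attributes)
```

Each dictionary in the grid could be passed to a conditional generator in turn. The number of combinations grows multiplicatively, so it is worth keeping the value lists short.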
Quality Validation for Synthetic Data
Always validate synthetic data before adding it to the training set:
from collections import Counter


def validate_synthetic_examples(examples, seed_examples=None):
    """Validate synthetic examples for quality and diversity."""
    issues = {
        "duplicates": [],
        "too_short": [],
        "too_long": [],
        "too_similar_to_seed": []
    }
    instructions = [ex["instruction"] for ex in examples]

    # Check for duplicates within synthetic (single pass instead of list.count)
    counts = Counter(instructions)
    issues["duplicates"] = [text for text, n in counts.items() if n > 1]

    # Check length
    for ex in examples:
        if len(ex["instruction"].split()) < 3:
            issues["too_short"].append(ex["instruction"][:50])
        if len(ex["response"].split()) < 5:
            issues["too_short"].append(ex["response"][:50])
        if len(ex["instruction"]) > 500:
            issues["too_long"].append(ex["instruction"][:50])

    # Check similarity to seed (if provided)
    too_similar = set()  # indices of examples that nearly copy a seed
    if seed_examples and examples:
        from sentence_transformers import SentenceTransformer
        from sklearn.metrics.pairwise import cosine_similarity
        model = SentenceTransformer("all-MiniLM-L6-v2")
        seed_embeddings = model.encode([ex["instruction"] for ex in seed_examples])
        synthetic_embeddings = model.encode(instructions)
        similarities = cosine_similarity(synthetic_embeddings, seed_embeddings)
        for i, max_sim in enumerate(similarities.max(axis=1)):
            if max_sim > 0.95:
                too_similar.add(i)
                issues["too_similar_to_seed"].append(
                    (examples[i]["instruction"][:50], float(max_sim))
                )

    # Report
    print("Synthetic data quality report:")
    print(f"  Total examples: {len(examples)}")
    print(f"  Duplicates: {len(issues['duplicates'])}")
    print(f"  Too short: {len(issues['too_short'])}")
    print(f"  Too long: {len(issues['too_long'])}")
    print(f"  Too similar to seed: {len(issues['too_similar_to_seed'])}")

    # Reject bad examples; only the first copy of a duplicated instruction is kept.
    # Similarity rejections are tracked by index, not by truncated text.
    valid = []
    seen = set()
    for i, ex in enumerate(examples):
        if (
            len(ex["instruction"].split()) >= 3 and
            len(ex["instruction"]) <= 500 and
            len(ex["response"].split()) >= 5 and
            i not in too_similar and
            ex["instruction"] not in seen
        ):
            valid.append(ex)
            seen.add(ex["instruction"])
    if examples:  # avoid division by zero on an empty input
        print(f"  Valid: {len(valid)}/{len(examples)} ({100*len(valid)/len(examples):.1f}%)")
    return valid
Complete Workflow
import random


def generate_balanced_dataset(seed_examples, target_size=1000):
    """Combine authentic and synthetic examples at a 70/30 ratio."""
    # Generate synthetic via multiple techniques
    paraphrase = augment_examples_by_paraphrase(seed_examples, n_per_example=2)
    few_shot = generate_examples_few_shot(seed_examples, n_to_generate=50)

    # Combine
    synthetic = paraphrase + few_shot

    # Validate
    synthetic = validate_synthetic_examples(synthetic, seed_examples)

    # Blend: 70% authentic, 30% synthetic
    authentic_count = min(target_size * 7 // 10, len(seed_examples))
    # Size the synthetic share from the authentic examples actually available,
    # so a small seed set cannot silently push the synthetic share above 30%
    synthetic_count = min(authentic_count * 3 // 7, len(synthetic))

    # Shuffle and sample
    authentic_sample = random.sample(seed_examples, authentic_count)
    synthetic_sample = random.sample(synthetic, synthetic_count)
    balanced = authentic_sample + synthetic_sample
    random.shuffle(balanced)
    print(f"Final dataset: {len(authentic_sample)} authentic + {len(synthetic_sample)} synthetic = {len(balanced)} total")
    if len(balanced) < target_size:
        print(f"Note: below target size {target_size}; collect more authentic examples to grow the dataset")
    return balanced
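After blending, it is worth confirming what share of the final dataset each source actually contributes. The sketch below counts the `source` field used throughout this article and treats examples without that field as authentic, which matches the seed examples shown earlier but is otherwise an assumption. The function name `source_breakdown` is hypothetical.

```python
from collections import Counter


def source_breakdown(dataset):
    """Return {source: share of dataset}.

    Examples without a `source` key are counted as 'authentic'.
    """
    counts = Counter(ex.get("source", "authentic") for ex in dataset)
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()} if total else {}
```

If the authentic share comes out well below 70%, the seed set was probably too small for the requested target size.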
Key Takeaways
- Synthetic data is cost-effective for scaling datasets; in the figures cited above, 70–80% authentic + 20–30% synthetic is the sweet spot.
- Use five techniques: paraphrasing, back-translation, few-shot generation, task-specific generation, and conditional generation.
- Always validate synthetic examples for duplicates, length, and diversity before training.
- Reject synthetic examples too similar to seed examples (avoid memorization).
- Combine authentic and synthetic data carefully; 100% synthetic datasets underperform.
Frequently Asked Questions
Should synthetic data use the same response as seed examples?
Yes for paraphrasing and back-translation (same intent, same response). For few-shot and task-specific generation, responses can be different (new tasks, new solutions). Mix both: ~50% rephrased seeds, ~50% new examples.
How do I detect if a synthetic example is too similar to a seed?
Use embedding-based similarity (cosine similarity > 0.95 = too similar). Manually review borderline cases (0.90–0.95).
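The thresholds above amount to a three-way triage. The sketch below shows that logic with a plain-Python cosine similarity so it runs without extra dependencies; in practice the vectors would come from an embedding model. The function names are illustrative, and the default cutoffs simply restate the 0.95 and 0.90 values from this answer.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # zero vectors have no direction


def triage(similarity, reject_above=0.95, review_above=0.90):
    """Map a max-similarity-to-seed score to 'reject', 'review', or 'keep'."""
    if similarity > reject_above:
        return "reject"
    if similarity >= review_above:
        return "review"  # borderline: route to manual inspection
    return "keep"
```

Routing the "review" bucket to a person keeps manual effort focused on the small band of cases where the automatic score is least reliable.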
Can I fine-tune on 100% synthetic data if I can't access authentic data?
Not recommended. 100% synthetic datasets typically underperform by 10–20% on held-out tests. Always try to collect or license some authentic data. If impossible, start with synthetic, then manually curate the best examples for iterative refinement.
How much does synthetic generation cost?
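At an assumed price of about $1/MTok (check your provider's current rates), generating 1,000 examples uses 0.3–2M tokens depending on technique, which works out to roughly $0.30–$2. Even at several times that rate, the total stays far below manual annotation, which costs $500–$2,000 for the same volume.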
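The estimate above is total tokens multiplied by the price per token. A minimal sketch of that arithmetic follows; the function name and the example numbers are illustrative, and many APIs price input and output tokens differently, so a single blended rate is a simplification.

```python
def estimate_generation_cost(n_examples, tokens_per_example, price_per_mtok):
    """Estimate API cost in dollars for generating `n_examples`.

    `tokens_per_example` should cover both the prompt and the generated
    text; `price_per_mtok` is a blended dollars-per-million-tokens rate.
    """
    total_tokens = n_examples * tokens_per_example
    return total_tokens / 1_000_000 * price_per_mtok


if __name__ == "__main__":
    # 1,000 examples at 300 to 2,000 tokens each, at an assumed $1/MTok
    for tokens in (300, 2000):
        print(tokens, estimate_generation_cost(1000, tokens, 1.0))
```

Plugging in the actual per-model rates for input and output tokens separately gives a tighter estimate than the blended figure used here.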
Should I disclose synthetic data in production?
It depends on your use case. For research, transparency is important. For production systems, what matters most is quality rather than source: if synthetic data passes validation and improves performance, use it.