How to source training data for fine-tuning
Training data sourcing is the critical first step: gathering raw examples that reflect the behaviors you want your model to learn. The best fine-tuned models are built on diverse, representative datasets. A customer support model trained only on billing questions will fail on technical support; a code model trained on Python alone will struggle with JavaScript. This article covers five sourcing strategies—production logs, public datasets, manual annotation, vendor APIs, and hybrid pipelines—and how to evaluate source quality.
Five Data Sourcing Strategies
Strategy 1: Production Logs and Historical Data
Your own production logs are the highest-fidelity source: real user input-output pairs already in production. For customer support, mine your ticketing system for past conversations and resolutions. For code generation, use GitHub commits and pull requests that have been approved. For recommendation systems, use logged user queries and the items they selected.
The advantage is authenticity: these examples represent actual use cases, not hypothetical ones. The disadvantage is cost and privacy: logs often contain personally identifiable information (PII), must be anonymized, and may include erroneous or incomplete interactions.
Use production logs when you have 6+ months of operational history and clear task labels. Query your logs with a filter for high-quality interactions—exclude cases the model got wrong or users abandoned. Here's a Python pattern to extract customer support examples:
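import json
from datetime import datetime, timedelta


def query_ticketing_api(status, date_range, rating_min):
    """Placeholder: replace with a call to your ticketing system's client.

    Should return a list of dicts with "customer_message",
    "agent_response", and "customer_satisfaction" keys.
    """
    raise NotImplementedError("Connect this to your ticketing system")


# Query resolved tickets from the past 12 months.
# Tickets resolved by a human agent serve as ground truth.
now = datetime.now()
tickets = query_ticketing_api(
    status="resolved",
    date_range=(now - timedelta(days=365), now),
    rating_min=4,  # High satisfaction only
)

examples = []
for ticket in tickets:
    # Skip incomplete interactions
    if not ticket.get("customer_message") or not ticket.get("agent_response"):
        continue
    examples.append({
        "instruction": ticket["customer_message"],
        "response": ticket["agent_response"],
        "source": "production_logs",
        "quality_score": ticket["customer_satisfaction"],
    })

# Write one JSON object per line (JSONL); anonymize PII before this step
with open("sourced_examples.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

print(f"Extracted {len(examples)} examples from production logs")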
Strategy 2: Public Datasets
Public datasets are freely available and reduce annotation burden. Popular sources include:
- StackExchange Data: 20M+ Q&A pairs across programming, science, and domain topics. Available via archive.org and HuggingFace.
- The Stack: 3.1TB of code from GitHub, de-duplicated and language-tagged. Available on HuggingFace.
- Common Crawl: billions of web pages per crawl, which can be filtered by domain or topic.
- OpenWebText: about 38GB of web text, built as an open reproduction of the WebText corpus used to train GPT-2.
- Medical and Scientific Corpora: PubMed, arXiv, and bioRxiv abstracts for domain-specific fine-tuning.
Public datasets are low-cost but often noisy, out-of-domain, and require heavy filtering. Use them for initial exploration or as a baseline before fine-tuning on proprietary data. Always check the license (MIT, CC-BY, ODbL, etc.) to ensure legal use in your product.
from datasets import load_dataset

# Placeholder ID: substitute a Stack Exchange dataset you have verified on the
# HuggingFace Hub. The "tags" and "score" field names are assumptions; check
# the dataset card for the actual schema.
DATASET_ID = "your-org/stackexchange-qa"
dataset = load_dataset(DATASET_ID, split="train")

# Keep Python questions with a score above 10
filtered = dataset.filter(
    lambda ex: bool(ex["tags"]) and "python" in ex["tags"] and ex["score"] > 10
)

print(f"Loaded {len(filtered)} high-quality Python Q&A examples")
Strategy 3: Manual Annotation and Expert Labels
When production logs are unavailable and public data is off-domain, hire annotators to label examples. This is the gold standard for quality but expensive (~$0.50–$2 per labeled example depending on complexity).
Use platforms like Scale, Surge, Label Studio, or internal teams. Define clear labeling guidelines (a 1-2 page rubric) so annotators understand the task, provide example annotations, and do a pilot with 50 examples to catch ambiguity before labeling at scale.
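One way to surface ambiguity during the pilot is to have two annotators label the same items and measure how often they agree. The sketch below assumes the annotators assign a categorical label to each item; the label names, the sample data, and the choice of Cohen's kappa as the metric are illustrative assumptions rather than a feature of any particular platform.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # Both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical pilot labels from two annotators on the same 10 items
annotator_1 = ["ok", "ok", "revise", "ok", "revise", "ok", "ok", "revise", "ok", "ok"]
annotator_2 = ["ok", "revise", "revise", "ok", "revise", "ok", "ok", "ok", "ok", "ok"]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```

Low agreement on the pilot usually points at a gap in the rubric, so the items where annotators disagree are good candidates for new worked examples in the guidelines.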
For an instruction-following task, your guideline might be:
Write a response that (1) answers the instruction fully, (2) is concise (under 200 words), (3) uses the user's language and tone, (4) cites sources when relevant, and (5) avoids assumptions about the user's prior knowledge.
Manual annotation works best for specialized domains: medical diagnosis, legal document review, or industry-specific tasks where only experts can judge correctness. For general tasks (customer support, Q&A), a mix of public data and a small set of manually curated examples is cost-effective.
Strategy 4: Vendor APIs and Synthetic Augmentation
For initial datasets, use a strong general-purpose LLM (e.g., GPT-4, Claude 3.5 Sonnet) to generate examples, then filter and curate them. This is faster than manual annotation and produces diverse examples, though they lack the authenticity of production logs.
Here's a workflow:
- Write 50 seed examples manually.
- Use a strong LLM to generate 10–20 similar examples per seed (prompt variation).
- Manually review the generated examples and discard the weak ones; expect to keep roughly 80–90%.
- Combine with production logs and public data.
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

# Seed examples
seed_instructions = [
    "How do I cancel my subscription?",
    "What payment methods do you accept?",
    "Can I get a refund?",
]

generated_examples = []
for seed in seed_instructions:
    prompt = (
        "Given the following customer support question:\n"
        f'"{seed}"\n'
        "Generate 3 variations of this question that a customer might ask instead.\n"
        "Format: one question per line, no numbering."
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    # One variation per line; skip blank lines
    variations = message.content[0].text.strip().split("\n")
    for var in variations:
        if var.strip():
            generated_examples.append({"instruction": var.strip()})

print(f"Generated {len(generated_examples)} synthetic examples from seeds")
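Before the manual review step, it can help to drop duplicates among the generated questions so reviewers do not read the same item twice. The sketch below assumes examples are dicts with an "instruction" key, as in the generation code; the normalization rule (lowercase, strip punctuation, collapse whitespace) is a simple assumption and will not catch paraphrases.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace for comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(examples, key="instruction"):
    """Keep the first occurrence of each normalized instruction."""
    seen = set()
    unique = []
    for ex in examples:
        fingerprint = normalize(ex[key])
        # Skip empty strings and anything already seen
        if fingerprint and fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(ex)
    return unique


# Hypothetical generated output containing near-identical variations
candidates = [
    {"instruction": "How do I cancel my subscription?"},
    {"instruction": "how do I cancel my subscription"},
    {"instruction": "Can I stop my plan before renewal?"},
    {"instruction": "   "},
]
unique_examples = deduplicate(candidates)
print(f"Kept {len(unique_examples)} of {len(candidates)} generated examples")
```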
Strategy 5: Hybrid Approach
The best datasets combine multiple sources:
- 50% production logs (ground truth, but incomplete).
- 20% public datasets (broad coverage, higher noise).
- 20% manually annotated (domain-specific, high quality).
- 10% synthetic augmentation (fill gaps, increase diversity).
This hybrid approach balances cost, authenticity, and coverage. A 1,000-example dataset might be:
- 500 extracted from 1 year of customer support tickets.
- 200 sampled from a public customer support corpus.
- 200 manually written by domain experts for edge cases.
- 100 generated by a strong LLM to cover underrepresented scenarios.
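The mix above can be assembled with a small sampling routine. This is a minimal sketch: the pool contents are hypothetical stand-ins for data loaded from each source, and the function name `build_hybrid_dataset` is made up for illustration.

```python
import random

# Mixing ratios from the hybrid recipe
MIX = {
    "production_logs": 0.50,
    "public_datasets": 0.20,
    "manual_annotation": 0.20,
    "synthetic": 0.10,
}


def build_hybrid_dataset(pools, mix, total, seed=0):
    """Sample from each source pool according to the target ratios."""
    rng = random.Random(seed)
    dataset = []
    for source, ratio in mix.items():
        wanted = round(total * ratio)
        pool = pools[source]
        if len(pool) < wanted:
            raise ValueError(f"{source} has {len(pool)} examples, need {wanted}")
        # Sample without replacement so no example appears twice
        dataset.extend(rng.sample(pool, wanted))
    rng.shuffle(dataset)
    return dataset


# Hypothetical pools; in practice each would be loaded from its own JSONL file
pools = {
    source: [{"instruction": f"{source} example {i}", "source": source} for i in range(600)]
    for source in MIX
}
hybrid = build_hybrid_dataset(pools, MIX, total=1000)
counts = {source: sum(ex["source"] == source for ex in hybrid) for source in MIX}
print(counts)
```

Keeping a "source" field on every example makes it possible to trace quality problems back to the pool they came from after training.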
Evaluating Source Quality
Before using a data source, assess it on four dimensions:
Relevance: Does the data reflect your use case? A code model trained on Python will struggle with JavaScript. Use domain matching: filter by keywords, language tags, or topic classifiers.
Diversity: Does the data cover the full range of inputs your model will see in production? If 80% of your examples are for the happy path, the model will perform poorly on edge cases. Plot example distributions and oversample underrepresented clusters.
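As a rough illustration of the diversity check, the sketch below counts examples per category and oversamples the small ones. It assumes each example already carries a "category" field (hypothetical, for instance from a topic classifier or a clustering step), and the `min_count` of 30 is an arbitrary choice.

```python
import random
from collections import Counter, defaultdict


def oversample(examples, min_count, key="category", seed=0):
    """Duplicate examples from small categories until each has at least min_count."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for ex in examples:
        by_category[ex[key]].append(ex)

    balanced = []
    for items in by_category.values():
        balanced.extend(items)
        shortfall = min_count - len(items)
        if shortfall > 0:
            # Sample with replacement from the existing items
            balanced.extend(rng.choices(items, k=shortfall))
    return balanced


# Hypothetical dataset: 80% happy path, 20% spread over two edge-case categories
examples = (
    [{"instruction": f"happy {i}", "category": "happy_path"} for i in range(80)]
    + [{"instruction": f"refund dispute {i}", "category": "refund_dispute"} for i in range(15)]
    + [{"instruction": f"outage {i}", "category": "outage"} for i in range(5)]
)

before = Counter(ex["category"] for ex in examples)
balanced = oversample(examples, min_count=30)
after = Counter(ex["category"] for ex in balanced)
print("Before:", dict(before))
print("After: ", dict(after))
```

Duplicating examples is the cheapest fix but repeats the same text many times; for very small categories, writing or generating new examples is likely the better option.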
Quality: Are the labels correct and consistent? Sample 100 random examples and manually validate them. If more than 10% are incorrect, the source needs refinement.
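The audit described under Quality can be sketched as follows. The example pool and the reviewer verdicts are hypothetical; in practice the verdicts come from a person checking each sampled example by hand.

```python
import random


def draw_audit_sample(examples, sample_size=100, seed=0):
    """Pick a reproducible random sample for manual validation."""
    rng = random.Random(seed)
    return rng.sample(examples, min(sample_size, len(examples)))


def error_rate(verdicts):
    """verdicts: list of booleans, True where the reviewer judged the label correct."""
    if not verdicts:
        raise ValueError("No verdicts to score")
    return verdicts.count(False) / len(verdicts)


# Hypothetical source of 2,000 examples
examples = [{"id": i} for i in range(2000)]
audit_sample = draw_audit_sample(examples)

# Hypothetical reviewer verdicts: 88 correct, 12 incorrect
verdicts = [True] * 88 + [False] * 12
rate = error_rate(verdicts)
needs_refinement = rate > 0.10  # Threshold from the text above
print(f"Error rate: {rate:.0%}, needs refinement: {needs_refinement}")
```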
Size: Is there enough data? As a general floor, aim for at least 100 examples, enough to hold out a validation split and detect overfitting. For narrow, specialized domains, as few as 50 well-chosen examples can suffice. For broad, open-ended tasks, 1,000–10,000 examples are ideal.
Use a quality score matrix to compare sources:
| Source | Relevance | Diversity | Quality | Size (examples) | Cost (USD) |
|---|---|---|---|---|---|
| Production logs | 5/5 | 3/5 | 4/5 | 2,000 | $0 |
| StackExchange | 3/5 | 5/5 | 3/5 | 100K+ | $0 |
| Manual annotation | 5/5 | 4/5 | 5/5 | 200 | $500 |
| Claude 3.5 generated | 4/5 | 4/5 | 3/5 | 1,000 | $5 |
| Hybrid (weighted) | 4/5 | 4/5 | 4/5 | 3,200 | $505 |
Key Takeaways
- Production logs are high-fidelity but require anonymization; public datasets are low-cost but noisy; manual annotation is costly but high-quality.
- Use a hybrid approach: 50% logs, 20% public, 20% manually labeled, 10% synthetic.
- Always evaluate sources on relevance, diversity, quality, and size before committing to fine-tuning.
- For specialized domains, prioritize quality (small, expert-labeled dataset) over quantity.
- Synthetic augmentation is cheap and fast but should not exceed 20% of your training set without careful validation.
Frequently Asked Questions
How do I anonymize production data?
Use a PII detection tool (e.g., Microsoft Presidio for Python, or a cloud service such as Azure AI Language's PII detection, formerly Text Analytics) to identify and mask names, email addresses, phone numbers, and payment information. Replace each with a placeholder like [NAME] or [EMAIL]. Automated detection misses cases, so manually review a sample to confirm that sensitive information is fully masked.
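To show the masking step itself, here is a deliberately simplified sketch that uses regular expressions from the standard library. It is not Presidio's API and is not a substitute for a dedicated PII detector: the two patterns are assumptions that cover only common email and phone formats, and names are not handled at all because they require entity recognition.

```python
import re

# Simplified patterns; they will miss many real-world formats
PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def mask_pii(text):
    """Replace every match of each pattern with its placeholder."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text


sample = "Contact me at jane.doe@example.com or +1 (555) 010-2345 about my refund."
masked = mask_pii(sample)
print(masked)
```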
Should I balance sources by equal count or by quality?
Balance by quality. A 100-example, high-quality subset will contribute more to model learning than a 900-example noisy subset. Use weighted sampling: assign higher sampling probability to sources with higher quality scores.
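Weighted sampling can be sketched with `random.choices` from the standard library. The quality scores and pools below are hypothetical, and this version draws with replacement, so an example may appear more than once in a batch.

```python
import random

# Hypothetical quality scores on a 1-5 scale, one per source
SOURCE_QUALITY = {"manual_annotation": 5, "production_logs": 4, "public_datasets": 3}


def sample_weighted(pools, quality, k, seed=0):
    """Draw k examples; a source's chance of being picked is proportional to its score."""
    rng = random.Random(seed)
    sources = list(pools)
    weights = [quality[source] for source in sources]
    # First choose a source for each draw, then pick an example from that source
    picks = rng.choices(sources, weights=weights, k=k)
    return [rng.choice(pools[source]) for source in picks]


# Hypothetical pools of examples tagged with their source
pools = {
    source: [{"id": f"{source}-{i}", "source": source} for i in range(500)]
    for source in SOURCE_QUALITY
}
batch = sample_weighted(pools, SOURCE_QUALITY, k=1200)
share = {s: sum(ex["source"] == s for ex in batch) / len(batch) for s in SOURCE_QUALITY}
print({source: f"{fraction:.0%}" for source, fraction in share.items()})
```

With these scores, the expected shares are 5/12, 4/12, and 3/12; the sampled shares will vary around those values.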
Can I use copyrighted text (e.g., from books or articles) for fine-tuning?
This is a legal gray area in 2026. Fair use arguments apply to research, but commercial use is riskier. For safety, focus on public domain works, Creative Commons licensed content, or data you own. Consult legal counsel for your jurisdiction.
How do I detect and handle out-of-distribution examples?
Train a simple classifier (logistic regression or random forest) to predict domain membership, using your known in-domain examples as positives and a sample of known off-domain text as negatives. Run it on new data to flag outliers. Alternatively, use embedding-based methods: compute embeddings of all examples, flag those with low cosine similarity to the centroid of the in-domain set, and manually review them.
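The embedding-based variant can be sketched in plain Python. The 3-dimensional vectors stand in for real sentence embeddings, which would come from whatever embedding model you use, and the 0.7 threshold is an assumption to be tuned on your own data.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flag_outliers(in_domain, candidates, threshold=0.7):
    """Return indices of candidates whose similarity to the in-domain centroid is below threshold."""
    center = centroid(in_domain)
    return [
        i for i, vec in enumerate(candidates)
        if cosine_similarity(vec, center) < threshold
    ]


# Toy 3-dimensional vectors standing in for real sentence embeddings
in_domain_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
new_embeddings = [[0.85, 0.15, 0.05], [0.0, 0.1, 0.95], [0.7, 0.3, 0.0]]

outlier_indices = flag_outliers(in_domain_embeddings, new_embeddings)
print(f"Flag for manual review: {outlier_indices}")
```

A single centroid works when the in-domain data forms one cluster; if your domain covers several distinct topics, comparing against the nearest of several cluster centroids may be a better fit.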
What's the cost breakdown for sourcing 1,000 training examples?
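Using the hybrid mix for a 1,000-example dataset: Production logs (500 examples): $0 for the data itself, since it is already yours. Public datasets (200 examples): $0 to $100 for data cleaning. Manual annotation (200 examples at $0.50 to $2 each): $100 to $400. Synthetic generation (100 examples at API rates): $5 to $20. Total: roughly $105–$520, before the engineering time needed for anonymization and review.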