Building Golden Datasets for RAG Systems
A golden dataset is the foundation of RAG evaluation. It is a curated collection of queries, reference documents, and ground-truth answers annotated by domain experts and version-controlled like code. Without a golden dataset, you cannot measure whether your RAG changes improve or degrade quality. I learned this the hard way: after deploying what I thought was an improved retriever, users reported worse answers, but I had no benchmark to measure against.
Golden datasets serve two purposes: they enable offline evaluation (measuring metrics on historical data) and they enable regression testing (ensuring new changes do not break existing queries). A well-maintained golden dataset grows over time as you encounter edge cases and hard queries.
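As a rough illustration of the regression-testing use, the sketch below compares mean recall@k for a candidate retriever against a baseline on a toy golden set. The `relevant_doc_ids` field, the lookup-table "retrievers", and the 0.02 tolerance are assumptions made for the example, not part of any particular framework.

```python
from typing import Callable, Dict, List


def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of the relevant document IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for doc_id in relevant if doc_id in top_k) / len(relevant)


def mean_recall(golden: List[Dict],
                retrieve: Callable[[str], List[str]],
                k: int = 5) -> float:
    """Average recall@k over all golden examples."""
    scores = [
        recall_at_k(retrieve(ex["query"]), ex["relevant_doc_ids"], k)
        for ex in golden
    ]
    return sum(scores) / len(scores)


def has_regressed(baseline: float, candidate: float,
                  tolerance: float = 0.02) -> bool:
    """Flag a regression when the candidate falls more than `tolerance` below baseline."""
    return candidate < baseline - tolerance


# Toy golden set and two hypothetical retrievers (stand-ins for real systems).
golden = [
    {"query": "q1", "relevant_doc_ids": ["d1", "d2"]},
    {"query": "q2", "relevant_doc_ids": ["d3"]},
]
old_index = {"q1": ["d1", "d2", "d9"], "q2": ["d3", "d8"]}
new_index = {"q1": ["d1", "d9", "d7"], "q2": ["d8", "d6"]}

baseline = mean_recall(golden, lambda q: old_index[q], k=3)
candidate = mean_recall(golden, lambda q: new_index[q], k=3)
print(f"baseline={baseline:.2f} candidate={candidate:.2f} "
      f"regressed={has_regressed(baseline, candidate)}")
```

Running a check like this before every deployment is what turns the golden dataset into a regression test rather than a one-off benchmark.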
Annotation Schema Design
Before collecting data, design your annotation schema. Define what "relevant" and "correct" mean for your domain. For a medical RAG system, you might define a relevant document as "any source that directly addresses the query's clinical question." A correct answer is "one that accurately reflects the diagnosis, treatment, or guideline mentioned in relevant documents."
Your schema should include:
- Query format and allowed types (factual questions, multi-step reasoning, open-ended)
- Relevance definition (binary or graded: highly relevant, partially relevant, not relevant)
- Answer correctness criteria (factual accuracy, completeness, conciseness)
- Citation requirements (which passages must be cited in the answer)
from dataclasses import dataclass
from typing import List, Literal
from enum import Enum


class RelevanceLevel(str, Enum):
    """Relevance annotation levels."""
    HIGHLY_RELEVANT = "highly_relevant"
    PARTIALLY_RELEVANT = "partially_relevant"
    NOT_RELEVANT = "not_relevant"


@dataclass
class GoldenExample:
    """Single golden example with full annotation."""
    query_id: str
    query: str
    reference_documents: List[str]  # Document IDs
    reference_doc_relevances: List[RelevanceLevel]  # Relevance per doc
    ground_truth_answer: str
    required_citations: List[str]  # Passages that should be cited
    query_difficulty: Literal["easy", "medium", "hard"]
    domain: str  # e.g., "medical", "legal", "technical"
    annotator: str  # Person who annotated
    annotation_timestamp: str  # ISO 8601 datetime


# Example schema in JSON
example_json = {
    "query_id": "Q_001",
    "query": "What are the treatment options for Type 2 diabetes?",
    "reference_documents": ["doc_1234", "doc_5678", "doc_9012"],
    "reference_doc_relevances": [
        "highly_relevant",
        "highly_relevant",
        "not_relevant"
    ],
    "ground_truth_answer": (
        "Type 2 diabetes can be managed through lifestyle changes, "
        "metformin as first-line medication, and GLP-1 agonists or "
        "SGLT2 inhibitors for additional glycemic control."
    ),
    "required_citations": [
        "metformin as first-line medication",
        "GLP-1 agonists",
        "SGLT2 inhibitors"
    ],
    "query_difficulty": "medium",
    "domain": "medical",
    "annotator": "Dr. Sarah Chen",
    "annotation_timestamp": "2026-06-02T10:30:00Z"
}
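Because the schema stores documents and their relevance labels in two parallel lists, a small consistency check can catch annotation slips early. The following is a minimal sketch that works on plain dictionaries; the `validate_example` helper and the specific checks are assumptions for illustration, not a complete validator.

```python
from datetime import datetime
from typing import Dict, List

ALLOWED_RELEVANCE = {"highly_relevant", "partially_relevant", "not_relevant"}
ALLOWED_DIFFICULTY = {"easy", "medium", "hard"}


def validate_example(example: Dict) -> List[str]:
    """Return a list of problems found in one golden example (empty if valid)."""
    problems = []
    docs = example.get("reference_documents", [])
    relevances = example.get("reference_doc_relevances", [])
    if len(docs) != len(relevances):
        problems.append(
            "reference_documents and reference_doc_relevances differ in length"
        )
    unknown = set(relevances) - ALLOWED_RELEVANCE
    if unknown:
        problems.append(f"unknown relevance labels: {sorted(unknown)}")
    if example.get("query_difficulty") not in ALLOWED_DIFFICULTY:
        problems.append("query_difficulty must be easy, medium, or hard")
    try:
        # Older Python versions reject a trailing "Z", so normalize it first.
        timestamp = example.get("annotation_timestamp", "")
        datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    except ValueError:
        problems.append("annotation_timestamp is not a valid ISO 8601 datetime")
    return problems


good = {
    "query_id": "Q_001",
    "reference_documents": ["doc_1234", "doc_5678"],
    "reference_doc_relevances": ["highly_relevant", "not_relevant"],
    "query_difficulty": "medium",
    "annotation_timestamp": "2026-06-02T10:30:00Z",
}
# Same record with one relevance label missing and an invalid difficulty.
bad = dict(good,
           reference_doc_relevances=["highly_relevant"],
           query_difficulty="extreme")

print(validate_example(good))
print(validate_example(bad))
```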
Sampling Strategies for Diversity
A biased golden dataset (skewed toward easy queries, common topics) hides failure modes. Use stratified sampling to ensure diversity:
- Domain breadth: Sample evenly across your primary domains (medical, legal, technical).
- Query difficulty: Ensure some easy (factual lookup), medium (light reasoning), and hard (multi-hop reasoning) queries.
- Query type: Include factual questions, comparison questions, and open-ended questions.
- Negative examples: Include queries where the correct answer is "I don't know" or "insufficient information" (these are hard for RAG systems).
- Hard negatives: Include queries where your current system fails; these reveal gaps in retrieval or generation.
from collections import defaultdict
from typing import List


def stratify_examples(examples: List[GoldenExample],
                      strata_fields: List[str],
                      target_per_stratum: int) -> List[GoldenExample]:
    """
    Select up to `target_per_stratum` examples per stratum combination.

    Examples are taken in input order, so shuffle `examples` first if the
    candidate list is sorted (e.g., by date or source). Strata with fewer
    candidates than the target stay underfilled.

    Args:
        examples: All candidate examples.
        strata_fields: Fields to stratify by (e.g., ['domain', 'query_difficulty']).
        target_per_stratum: Target count per stratum combination.

    Returns:
        Stratified subset with at most `target_per_stratum` examples per stratum.
    """
    strata_counts = defaultdict(int)
    stratified = []
    for example in examples:
        # Create stratum key (e.g., ('medical', 'hard'))
        stratum_key = tuple(getattr(example, field) for field in strata_fields)
        if strata_counts[stratum_key] < target_per_stratum:
            stratified.append(example)
            strata_counts[stratum_key] += 1
    return stratified


# Example: keep at most 5 examples per (domain, difficulty) pair
examples = [...]  # Load all candidates
stratified = stratify_examples(
    examples,
    strata_fields=['domain', 'query_difficulty'],
    target_per_stratum=5
)
print(f"Stratified dataset size: {len(stratified)}")
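One possible variation, sketched here on plain dictionaries, draws a seeded random sample within each stratum instead of taking the first candidates, and then reports which strata fell short of the target. The helper names `sample_per_stratum` and `underfilled_strata` are hypothetical, and the report only covers strata that have at least one candidate.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def sample_per_stratum(examples: Sequence[Dict],
                       strata_fields: Sequence[str],
                       target_per_stratum: int,
                       seed: int = 0) -> List[Dict]:
    """Randomly pick up to `target_per_stratum` examples from each stratum."""
    buckets: Dict[Tuple, List[Dict]] = defaultdict(list)
    for example in examples:
        buckets[tuple(example[f] for f in strata_fields)].append(example)

    rng = random.Random(seed)  # A fixed seed keeps the sample reproducible.
    sampled: List[Dict] = []
    for key in sorted(buckets):
        bucket = buckets[key]
        sampled.extend(rng.sample(bucket, min(target_per_stratum, len(bucket))))
    return sampled


def underfilled_strata(sampled: Sequence[Dict],
                       strata_fields: Sequence[str],
                       target_per_stratum: int) -> Dict[Tuple, int]:
    """Map each stratum that has fewer than the target to its actual count."""
    counts = Counter(tuple(ex[f] for f in strata_fields) for ex in sampled)
    return {key: n for key, n in counts.items() if n < target_per_stratum}


# Toy candidates: six easy medical queries and a single hard legal one.
candidates = [
    {"query_id": f"M{i}", "domain": "medical", "query_difficulty": "easy"}
    for i in range(6)
] + [{"query_id": "L0", "domain": "legal", "query_difficulty": "hard"}]

fields = ["domain", "query_difficulty"]
sampled = sample_per_stratum(candidates, fields, target_per_stratum=3)
gaps = underfilled_strata(sampled, fields, target_per_stratum=3)
print(len(sampled), gaps)
```

The gap report is a convenient to-do list for the next annotation round: each underfilled stratum is a place where more queries need to be collected.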
Annotation Workflow and Quality Control
For small datasets (50–100 examples), use domain experts. For larger datasets (500+), use crowdsourcing with quality control:
- Write detailed annotation guidelines (2–3 pages) with examples of relevant vs. irrelevant documents.
- Create a "gold standard" set (20–50 examples) annotated by experts, used to filter crowdworkers.
- Run qualification tests: ask workers to annotate gold examples and only accept those with 90%+ agreement.
- Measure inter-annotator agreement to check schema clarity: Cohen's kappa for two annotators, Fleiss' kappa for three or more. As a rule of thumb, a value below 0.70 indicates ambiguous guidelines.
- Have a senior reviewer spot-check 10–20% of crowdsourced annotations to catch systematic errors.
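The qualification step in the list above could be sketched roughly as follows. The data layout (one dictionary of labels per worker, keyed by example ID) and the helper names are assumptions for illustration; a crowdsourcing platform would supply its own formats.

```python
from typing import Dict, List


def gold_agreement(worker_labels: Dict[str, str],
                   gold_labels: Dict[str, str]) -> float:
    """Share of gold examples where the worker's label matches the expert label."""
    matches = sum(
        1 for example_id, gold in gold_labels.items()
        if worker_labels.get(example_id) == gold
    )
    return matches / len(gold_labels)


def qualified_workers(submissions: Dict[str, Dict[str, str]],
                      gold_labels: Dict[str, str],
                      threshold: float = 0.90) -> List[str]:
    """Return the workers whose agreement with the gold set meets the threshold."""
    return sorted(
        worker for worker, labels in submissions.items()
        if gold_agreement(labels, gold_labels) >= threshold
    )


# Ten expert-labelled gold examples (toy data).
gold = {
    f"G{i}": "relevant" if i % 2 == 0 else "not_relevant"
    for i in range(10)
}
submissions = {
    "worker_a": dict(gold),                                  # 10/10 correct
    "worker_b": dict(gold, G0="not_relevant"),               # 9/10 correct
    "worker_c": dict(gold, G0="not_relevant", G1="relevant",
                     G2="not_relevant"),                      # 7/10 correct
}
accepted = qualified_workers(submissions, gold)
print(accepted)
```

Unanswered gold examples count as disagreements here, which is a deliberately strict choice for a screening test.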
import numpy as np


def fleiss_kappa(annotations: np.ndarray) -> float:
    """
    Compute Fleiss' kappa for multi-annotator agreement.

    Args:
        annotations: Integer array of shape (n_examples, n_annotators)
            with values in {0, 1, ...} (class indices). Requires at
            least two annotators.

    Returns:
        Kappa score (-1 to 1; >=0.70 is good agreement).
    """
    n_examples, n_annotators = annotations.shape
    n_classes = int(np.max(annotations)) + 1

    # Observed agreement: for each example, the fraction of annotator
    # pairs that agree, (sum_j n_ij^2 - n) / (n * (n - 1)).
    p_agreement = 0.0
    for i in range(n_examples):
        counts = np.bincount(annotations[i], minlength=n_classes)
        p_agreement += (np.sum(counts ** 2) - n_annotators) / (
            n_annotators * (n_annotators - 1)
        )
    p_agreement /= n_examples

    # Chance agreement from the overall class proportions
    overall_counts = np.bincount(annotations.flatten(), minlength=n_classes)
    p_k = overall_counts / annotations.size
    p_chance = np.sum(p_k ** 2)

    if p_chance == 1.0:
        return 1.0  # All labels identical; kappa is otherwise undefined
    return float((p_agreement - p_chance) / (1 - p_chance))


# Example: 4 examples annotated by 3 annotators (0=not relevant, 1=relevant)
annotations = np.array([
    [1, 1, 1],  # Full agreement
    [1, 1, 0],  # One annotator disagrees
    [0, 0, 0],
    [1, 1, 1],
])
kappa = fleiss_kappa(annotations)
print(f"Fleiss' Kappa: {kappa:.3f}")  # Fleiss' Kappa: 0.625
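When only two annotators label each example, Cohen's kappa is the usual choice. A minimal pure-Python sketch, assuming both annotators labelled the same examples in the same order:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable],
                labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same examples."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set")
    n = len(labels_a)

    # Observed agreement: share of examples with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Two annotators, eight examples (0 = not relevant, 1 = relevant).
annotator_a = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]
pair_kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {pair_kappa:.3f}")
```

In this toy case the annotators agree on 6 of 8 examples (75%), yet kappa is only about 0.47 because much of that agreement is expected by chance; this is why raw percent agreement alone can overstate schema clarity.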
Version Control and Maintenance
Store your golden dataset in version control alongside your code, in JSON or CSV format. One possible layout:
data/
    golden_datasets/
        medical_v1.0.json    # Release version
        medical_dev.json     # Development version
.gitattributes               # Mark data files for diff=data if supported
Maintain a changelog documenting:
- New examples added (count, date, why)
- Examples removed (count, date, reason)
- Annotation schema changes
{
  "metadata": {
    "version": "1.0.0",
    "release_date": "2026-06-02",
    "total_examples": 150,
    "changelog": [
      {
        "date": "2026-06-01",
        "action": "added",
        "count": 20,
        "reason": "hard negatives from production failures"
      },
      {
        "date": "2026-05-15",
        "action": "fixed",
        "count": 5,
        "reason": "corrected incorrect ground_truth_answer in examples Q_101–Q_105"
      }
    ]
  },
  "examples": [
    {
      "query_id": "Q_001",
      "query": "What is the capital of France?",
      "reference_documents": ["doc_123", "doc_456"],
      "reference_doc_relevances": ["highly_relevant", "not_relevant"],
      "ground_truth_answer": "Paris is the capital of France.",
      "required_citations": ["capital of France"],
      "query_difficulty": "easy",
      "domain": "geography",
      "annotator": "Dr. Alice Smith",
      "annotation_timestamp": "2026-05-15T12:00:00Z"
    }
  ]
}
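A file in this shape lends itself to a small integrity check, for example in a pre-commit hook or CI job. The sketch below assumes the `metadata` and `examples` keys shown above; the `check_dataset` helper and the two checks it performs are illustrative choices, not an exhaustive validation.

```python
import json
from collections import Counter
from typing import Dict, List


def check_dataset(dataset: Dict) -> List[str]:
    """Return a list of integrity problems in a golden dataset file."""
    problems = []
    examples = dataset.get("examples", [])

    # The declared count should match what the file actually contains.
    declared = dataset.get("metadata", {}).get("total_examples")
    if declared != len(examples):
        problems.append(
            f"metadata.total_examples is {declared}, "
            f"but the file contains {len(examples)} examples"
        )

    # Query IDs should be unique so results can be tracked across versions.
    id_counts = Counter(ex["query_id"] for ex in examples)
    duplicates = sorted(qid for qid, n in id_counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate query_id values: {duplicates}")
    return problems


# Toy file contents; in practice this would be read from the dataset file.
raw = """
{
  "metadata": {"version": "1.0.0", "total_examples": 3},
  "examples": [{"query_id": "Q_001"}, {"query_id": "Q_001"}]
}
"""
issues = check_dataset(json.loads(raw))
for issue in issues:
    print(issue)
```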
Key Takeaways
- A golden dataset is your ground truth for measuring RAG quality and detecting regressions.
- Design an explicit annotation schema that defines relevance, correctness, and citation requirements.
- Use stratified sampling to ensure diversity across domains, difficulty levels, and query types.
- For crowdsourced annotation, enforce quality gates (gold standard qualification tests, inter-annotator agreement checks, senior review).
- Version-control your dataset and maintain a changelog documenting all changes.
Frequently Asked Questions
How large should a golden dataset be?
Start with 50–100 examples for prototyping and local regression testing. For production systems, aim for 200–500 examples. Critical domains (medical, legal, financial) may need 1,000+ examples and multi-level annotation. Larger is better if annotation quality is maintained.
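One way to reason about size is the statistical noise in a measured metric. The sketch below uses the standard normal-approximation confidence interval for a proportion; treating per-query correctness as independent yes/no outcomes is a simplifying assumption, and the 0.80 accuracy is just an example value.

```python
import math


def margin_of_error(accuracy: float, n_examples: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_examples)


# How precisely can an observed accuracy of 0.80 be pinned down?
for n in (50, 100, 200, 500, 1000):
    print(f"n={n:>4}: 0.80 +/- {margin_of_error(0.80, n):.3f}")
```

At 50 examples the margin is roughly 0.11, so a change from 0.80 to 0.85 is within noise; at 1,000 examples it shrinks to about 0.025.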
What if I don't have domain experts to annotate?
Start with an automated baseline: retrieve candidate documents using BM25 or semantic search, rank them by relevance score, and use human annotators only to verify and correct the results. Alternatively, recruit domain experts part-time, contract a specialized annotation service (such as Scale AI), or use an annotation tool (such as Prodigy) to speed up in-house labeling.
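To make the automated baseline concrete, here is a small self-contained BM25 ranker that proposes candidate documents for a human to verify. It is a sketch under simplifying assumptions: whitespace tokenization, the common defaults k1=1.5 and b=0.75, and a toy three-document corpus. A production setup would more likely use an existing search library.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_rank(query: str, docs: Dict[str, str],
              k1: float = 1.5, b: float = 0.75,
              top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank documents for a query with Okapi BM25; returns (doc_id, score) pairs."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(
        term for tokens in tokenized.values() for term in set(tokens)
    )

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1
            )
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


docs = {
    "doc_a": "metformin is the first-line medication for type 2 diabetes",
    "doc_b": "statins lower cholesterol in adults",
    "doc_c": "lifestyle changes help manage type 2 diabetes",
}
candidates = bm25_rank("type 2 diabetes treatment", docs)
for doc_id, score in candidates:
    print(f"{doc_id}: {score:.3f}  -> needs human verification")
```

The ranked list is only a starting point: the annotator confirms or corrects each suggested label, which is usually much faster than searching for relevant documents from scratch.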
Should I annotate multiple relevance grades or just binary relevant/not relevant?
Binary annotation is faster and usually yields higher inter-annotator agreement. Graded relevance (highly/partially/not relevant) provides a richer signal but requires more careful annotation guidelines. For initial datasets, use binary; upgrade to graded if you observe low precision at shallow cutoffs (for example, precision@3) and need to tell partially relevant results apart from irrelevant ones.
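The difference in signal can be seen in a small comparison. In the sketch below, two rankings look identical under binary precision@k but differ under NDCG@k, which uses the graded labels. The gain values (2, 1, 0), the choice to count "partially_relevant" as relevant when collapsing to binary, and computing the ideal ordering from the retrieved list only are all assumptions made for the example.

```python
import math
from typing import List

# Assumed gain per graded label.
GAIN = {"highly_relevant": 2, "partially_relevant": 1, "not_relevant": 0}


def precision_at_k(labels: List[str], k: int) -> float:
    """Binary view: anything other than not_relevant counts as relevant."""
    return sum(1 for label in labels[:k] if label != "not_relevant") / k


def dcg_at_k(labels: List[str], k: int) -> float:
    """Discounted cumulative gain over the top-k labels."""
    return sum(
        GAIN[label] / math.log2(rank + 2)
        for rank, label in enumerate(labels[:k])
    )


def ndcg_at_k(labels: List[str], k: int) -> float:
    """DCG normalized by the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(labels, key=GAIN.get, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


# Same documents retrieved, but the best one is ranked first only in ranking_a.
ranking_a = ["highly_relevant", "partially_relevant", "not_relevant"]
ranking_b = ["partially_relevant", "highly_relevant", "not_relevant"]

for name, ranking in (("a", ranking_a), ("b", ranking_b)):
    print(f"ranking {name}: precision@3={precision_at_k(ranking, 3):.2f} "
          f"ndcg@3={ndcg_at_k(ranking, 3):.2f}")
```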
How often should I update the golden dataset?
Add new examples whenever you encounter production failures or coverage gaps. Review and refresh the whole dataset every 3–6 months to capture domain drift. For fast-moving domains (news, legislation), stay at the short end of that range and update at least quarterly.
Further Reading
- Data Annotation Guidelines Best Practices (Prodigy Docs): Annotation schema and quality control.
- Building NLP Datasets with Human Annotation (Scriptorium): Crowdsourcing strategies for NLP annotation.
- A Coefficient of Agreement for Nominal Scales (Cohen, 1960): The original paper defining Cohen's kappa for measuring annotation agreement.