
Quality assurance and auditing training data

Quality assurance (QA) auditing is the final checkpoint before fine-tuning. A comprehensive audit catches errors before they're baked into the model. Studies show that a 2-hour audit of 500 training examples prevents 60–80% of downstream production issues. This article covers manual spot-checking, consistency validation, bias auditing, and pre-training test runs to ensure dataset quality.

Why Audit Before Training?

Fine-tuning amplifies errors. A dataset with 5% mislabeled examples teaches the model to make similar mistakes. A dataset with biased responses teaches the model to replicate that bias at scale. Auditing before training is cheaper than retraining after discovering issues.

Cost comparison:

  • Audit before training: 2–4 hours, ~$100–$200 labor. Prevents costly retraining.
  • Discover issue after training: 10–20 hours (retraining, analysis, redeployment), ~$1,000–$5,000 compute + labor.
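
To make the trade-off concrete, the figures above can be turned into a break-even probability. The sketch below uses the midpoints of the quoted ranges ($150 for an audit, $3,000 for a post-training fix). The midpoints, the function name, and the simplification that an audit always catches the issue are illustrative assumptions, not results from the article.

```python
def break_even_probability(audit_cost, fix_cost):
    """Probability of a post-training issue above which auditing pays off.

    Assumes (for illustration) that the audit catches the issue. The expected
    cost without an audit is p * fix_cost, so auditing wins when
    p > audit_cost / fix_cost.
    """
    if fix_cost <= 0:
        raise ValueError("fix_cost must be positive")
    return audit_cost / fix_cost


# Midpoints of the ranges quoted above (an illustrative assumption).
audit_cost = (100 + 200) / 2    # $150
fix_cost = (1000 + 5000) / 2    # $3,000
p = break_even_probability(audit_cost, fix_cost)
print(f"Audit pays off if P(issue) > {p:.1%}")
```

Under these assumptions the audit is worthwhile whenever the chance of a post-training issue exceeds about one in twenty.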

A 2025 survey found that 40% of fine-tuned models in production had detectable quality issues (wrong facts, inconsistent tone, biased responses) that could have been caught by pre-training audits.

Step 1: Manual Spot-Checking

Randomly sample 50–100 examples and review them manually.

import json
import random

def sample_for_review(filepath, sample_size=50, random_seed=42):
    """Sample random examples for manual review."""

    random.seed(random_seed)
    # Skip blank lines so a trailing newline does not break json.loads
    with open(filepath, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f if line.strip()]

    sampled = random.sample(examples, min(sample_size, len(examples)))

    # Write to human-readable format
    with open("review_sample.txt", "w", encoding="utf-8") as f:
        for i, ex in enumerate(sampled, 1):
            f.write(f"\n{'='*60}\n")
            f.write(f"Example {i}\n")
            f.write(f"{'='*60}\n")
            f.write(f"Instruction: {ex.get('instruction', 'N/A')}\n\n")
            f.write(f"Response: {ex.get('response', 'N/A')}\n")
            f.write(f"Source: {ex.get('source', 'unknown')}\n")

    print(f"Sampled {len(sampled)} examples for review: review_sample.txt")
    return sampled

# Create review file
sample_for_review("dataset.jsonl", sample_size=50)

# Then manually open review_sample.txt and check:
# 1. Is the instruction clear and unambiguous?
# 2. Is the response correct and helpful?
# 3. Is the tone consistent with your brand?
# 4. Are there typos, formatting issues, or encoding errors?
# 5. Does the response answer the instruction fully?

Checklist for manual review:

| Item | Pass | Fail | Notes |
|---|---|---|---|
| Instruction clarity | | | Ambiguous or unclear instructions? |
| Response accuracy | | | Is the response factually correct? |
| Tone consistency | | | Does tone match brand voice? |
| Completeness | | | Does response fully address the instruction? |
| Grammar/spelling | | | Any typos or grammatical errors? |
| PII safety | | | Are personal details anonymized? |
| Format validity | | | Does the example match expected schema? |

Count failures. If > 10% of sampled examples fail, the dataset needs cleaning before training.
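
The counting step can be sketched in a few lines. The example assumes each reviewed example has been reduced to a single boolean (True if it failed any checklist item); that representation and the function name are hypothetical.

```python
def review_failure_rate(results, threshold=0.10):
    """Summarize manual review outcomes.

    `results` is a list of booleans, one per reviewed example, where True
    means the example failed at least one checklist item.
    """
    if not results:
        raise ValueError("no review results supplied")
    failures = sum(1 for failed in results if failed)
    rate = failures / len(results)
    return {
        "reviewed": len(results),
        "failures": failures,
        "failure_rate": rate,
        "needs_cleaning": rate > threshold,
    }


# Hypothetical outcome: 6 failures in a 50-example sample.
summary = review_failure_rate([True] * 6 + [False] * 44)
print(summary)
```

With only 50 examples the estimate is noisy, so a result close to the threshold is a reason to review a larger sample rather than to trust the number as is.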

Step 2: Consistency Validation

Consistency checks ensure related examples agree.

import json
from collections import defaultdict

def check_consistency(filepath):
    """Detect inconsistency: same instruction, different responses."""

    with open(filepath, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f if line.strip()]

    # Group examples by normalized (lowercased, stripped) instruction
    by_instruction = defaultdict(list)
    for ex in examples:
        instr = ex.get("instruction", "").lower().strip()
        by_instruction[instr].append(ex)

    inconsistencies = []

    for instruction, group in by_instruction.items():
        if len(group) > 1:
            # Multiple examples with same instruction
            responses = set(ex.get("response", "").lower().strip() for ex in group)

            if len(responses) > 1:
                inconsistencies.append({
                    "instruction": instruction,
                    "num_examples": len(group),
                    "num_distinct_responses": len(responses),
                    "examples": group
                })

    if inconsistencies:
        print(f"Found {len(inconsistencies)} instructions with conflicting responses:\n")
        for item in inconsistencies[:5]:  # Show first 5
            print(f"Instruction: {item['instruction']}")
            print(f"  {item['num_examples']} examples, {item['num_distinct_responses']} distinct responses")
            for i, ex in enumerate(item['examples'][:2], 1):
                # .get avoids a KeyError when a record has no response field
                print(f"    Response {i}: {ex.get('response', '')[:100]}...")
            print()
    else:
        print("No consistency issues detected.")

    return inconsistencies

inconsistencies = check_consistency("dataset.jsonl")

Action items:

  • If inconsistencies are trivial (different wording, same meaning), keep all.
  • If inconsistencies are significant (conflicting facts), pick the best response and remove others.
  • If you're unsure, mark for manual review.
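
A rough first pass at separating trivial from significant conflicts can use surface similarity between responses. The sketch below uses `difflib.SequenceMatcher` from the standard library; the 0.8 cutoff and the function name are assumptions chosen for illustration and should be tuned on real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def triage_conflict(responses, similar_at=0.8):
    """Label a group of responses that share one instruction.

    Returns ("trivial", score) when every pair of responses is at least
    `similar_at` similar by character-level ratio, otherwise
    ("manual_review", score). `score` is the lowest pairwise ratio.
    """
    normalized = [" ".join(r.lower().split()) for r in responses]
    lowest = 1.0
    for a, b in combinations(normalized, 2):
        lowest = min(lowest, SequenceMatcher(None, a, b).ratio())
    label = "trivial" if lowest >= similar_at else "manual_review"
    return label, lowest


print(triage_conflict([
    "Paris is the capital of France.",
    "Paris is the capital of France",
]))
print(triage_conflict([
    "The capital of Australia is Canberra.",
    "Sydney.",
]))
```

Surface similarity is only a hint: two responses that differ in a single digit or name score as near-identical while stating conflicting facts, so groups labeled "trivial" still deserve a quick look when they contain numbers or named entities.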

Step 3: Bias Audit

Audit for gender, cultural, geographic, and demographic biases.

import json
import re

def has_term(text, term):
    """Whole-word match, so 'he' does not match 'the' or 'she'."""
    return re.search(rf"\b{re.escape(term)}\b", text) is not None

def audit_biases(filepath):
    """Detect potential biases in dataset."""

    with open(filepath, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f if line.strip()]

    # Define sensitive terms by category
    biases = {
        "gender": {
            "male_terms": ["he", "his", "him", "man", "boy", "male"],
            "female_terms": ["she", "her", "hers", "woman", "girl", "female"]
        },
        "race_ethnicity": {
            "terms": ["asian", "african", "latino", "white", "black", "indigenous"]
        },
        "ability": {
            "terms": ["disabled", "blind", "deaf", "mental illness", "retarded"]
        }
    }

    findings = {
        "gender_imbalance": 0,
        "racial_references": [],
        "ability_references": [],
        "potentially_offensive": []  # Not populated by the checks below
    }

    for ex in examples:
        instruction = ex.get("instruction", "")
        text = (instruction + " " + ex.get("response", "")).lower()

        # Gender imbalance: examples that mention male terms only
        male_count = sum(1 for term in biases["gender"]["male_terms"] if has_term(text, term))
        female_count = sum(1 for term in biases["gender"]["female_terms"] if has_term(text, term))

        if male_count > 0 and female_count == 0:
            findings["gender_imbalance"] += 1

        # Racial references
        for term in biases["race_ethnicity"]["terms"]:
            if has_term(text, term):
                findings["racial_references"].append({
                    "instruction": instruction[:100],
                    "term": term
                })

        # Ability references
        for term in biases["ability"]["terms"]:
            if has_term(text, term):
                findings["ability_references"].append({
                    "instruction": instruction[:100],
                    "term": term
                })

    # Report
    print("Bias Audit Report:")
    print(f"  Gender imbalance (male-only): {findings['gender_imbalance']}")
    print(f"  Racial references: {len(findings['racial_references'])}")
    print(f"  Ability references: {len(findings['ability_references'])}")

    if findings["racial_references"]:
        print("\n  Sample racial references:")
        for item in findings["racial_references"][:3]:
            print(f"    - {item['instruction'][:80]}... (term: {item['term']})")

    return findings

findings = audit_biases("dataset.jsonl")

Action items:

  • Gender imbalance: Rewrite examples to use gender-neutral pronouns or mix pronouns.
  • Racial/ethnicity references: Review context. Legitimate references (e.g., cultural history) are OK; stereotypes are not.
  • Ability references: Replace offensive language with respectful terminology.
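
Reviewing context is easier when each flagged term is shown with the words around it. The following keyword-in-context sketch is one way to do that; the function name and the window size are illustrative assumptions.

```python
import re


def keyword_in_context(text, term, window=40):
    """Return snippets of `text` around whole-word matches of `term`."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippets.append(text[start:end].strip())
    return snippets


# Made-up sentence: "blackboard" must not be flagged, "Black voters" should be.
example = "The blackboard was cleaned. Black voters turned out in record numbers."
for snippet in keyword_in_context(example, "black", window=20):
    print(snippet)
```

A reviewer can then decide from the snippet whether a reference is legitimate or a stereotype, without opening the full example.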

Step 4: Factuality Check

Spot-check facts in responses, especially for knowledge-intensive domains.

import json
import random
import re

def factuality_check(filepath, domain="general"):
    """Manually verify factual accuracy of sample responses."""

    with open(filepath, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f if line.strip()]

    # Sample examples that contain factual claims
    factual_examples = []
    for ex in examples:
        response = ex.get("response", "")

        # Heuristic: responses with dates, numbers, or citations likely have facts
        if re.search(r"\d{4}|[0-9]+%|http", response):
            factual_examples.append(ex)

    # Sample and review
    sample = random.sample(factual_examples, min(10, len(factual_examples)))

    print(f"Factuality check (sample of {len(sample)} fact-bearing examples):\n")

    for i, ex in enumerate(sample, 1):
        print(f"{i}. Instruction: {ex.get('instruction', '')[:100]}...")
        print(f"   Response: {ex.get('response', '')[:150]}...")
        print("   Action: VERIFY (manually check facts)\n")

    return sample

factuality_check("dataset.jsonl")

# Manual verification process:
# For each example:
# 1. Identify factual claims (dates, statistics, named entities).
# 2. Verify them against reliable sources (Wikipedia, official docs, academic papers).
# 3. Mark examples with factual errors for removal or correction.
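
The first step of that process, identifying checkable claims, can be partly automated. The sketch below pulls years, percentages, and URLs out of a response so a reviewer sees exactly what needs verifying. The patterns and names are assumptions for illustration, and they will miss claims expressed in words.

```python
import re

# Hypothetical patterns; extend them for the claim types in your domain.
CLAIM_PATTERNS = {
    "year": r"\b(?:1[5-9]|20)\d{2}\b",
    "percentage": r"\b\d+(?:\.\d+)?\s?%",
    "url": r"https?://[^\s)]+",
}


def extract_claims(response):
    """Return checkable fragments (years, percentages, URLs) found in a response."""
    return {
        name: re.findall(pattern, response)
        for name, pattern in CLAIM_PATTERNS.items()
    }


# Made-up sentence used only to exercise the patterns.
claims = extract_claims(
    "The bridge opened in 1932 and carries 12.5% of city traffic "
    "(https://example.com/stats)."
)
print(claims)
```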

Step 5: Pre-Training Test Run

Before full fine-tuning, do a mini test run: fine-tune on 10% of data for 1 epoch.

import json
import random

def create_test_split(filepath, test_ratio=0.1):
    """Create a test dataset for mini fine-tuning."""

    with open(filepath, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f if line.strip()]

    # At least 10 examples, but never more than the dataset contains
    # (random.sample raises ValueError if asked for too many)
    n = min(len(examples), max(10, int(len(examples) * test_ratio)))
    test_examples = random.sample(examples, n)

    with open("test_finetune.jsonl", "w", encoding="utf-8") as f:
        for ex in test_examples:
            f.write(json.dumps(ex) + "\n")

    print(f"Created test dataset: {len(test_examples)} examples")
    return test_examples

# Create test split
test_examples = create_test_split("dataset.jsonl", test_ratio=0.1)

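# Fine-tune on this small set (1 epoch, rapid)
# Example with the OpenAI Python SDK (v1 or later). Chat models expect each
# JSONL record in {"messages": [...]} form, so convert instruction/response
# pairs before uploading.
#
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
#
# # Upload the file first; the job takes a file ID, not a local path
# upload = client.files.create(file=open("test_finetune.jsonl", "rb"), purpose="fine-tune")
# job = client.fine_tuning.jobs.create(
#     training_file=upload.id,
#     model="gpt-3.5-turbo",
#     hyperparameters={"n_epochs": 1}  # Just 1 epoch for testing
# )
#
# # Wait for completion (fine_tuned_model is None until the job succeeds), then test:
# job = client.fine_tuning.jobs.retrieve(job.id)
# test_response = client.chat.completions.create(
#     model=job.fine_tuned_model,
#     messages=[{"role": "user", "content": "Test prompt"}]
# )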

# After test run, evaluate:
# 1. Did training complete without errors?
# 2. Is the response quality acceptable?
# 3. Are outputs in the expected format?
# 4. Did you spot any unexpected behavior?
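
Chat-model fine-tuning endpoints such as OpenAI's expect each training record as a list of messages rather than as instruction/response fields. A minimal conversion sketch follows; the helper names and the optional system prompt are assumptions, and records missing either field are skipped rather than repaired.

```python
import json


def to_chat_record(example, system_prompt=None):
    """Convert an instruction/response pair into a chat-style training record."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": example["instruction"]})
    messages.append({"role": "assistant", "content": example["response"]})
    return {"messages": messages}


def convert_examples(examples, system_prompt=None):
    """Return one JSON line per example, skipping records missing either field."""
    lines = []
    for ex in examples:
        if ex.get("instruction") and ex.get("response"):
            lines.append(json.dumps(to_chat_record(ex, system_prompt)))
    return lines


lines = convert_examples([
    {"instruction": "Reset my password", "response": "Open Settings, then Security."},
    {"instruction": "Missing response"},  # skipped: no response field
])
print(lines[0])
```

Writing the returned lines to a file, one per line, gives a JSONL file in the chat format.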

Test run checklist:

  • Training completes without crashing.
  • Loss decreases over the course of training (model is learning).
  • Validation loss is reasonable (not NaN or Inf).
  • Sample inferences produce valid, on-topic responses.
  • Response format matches expected schema.
  • No hallucinations or nonsensical outputs.

If the test run fails, debug the dataset before full training.
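
The two loss-related checklist items can be checked automatically once the per-step training losses are available as a list. How to obtain that list depends on the training framework, so the sketch below takes it as input; the function name and the quarter-window comparison are illustrative choices.

```python
import math


def check_loss_curve(losses, min_drop=0.0):
    """Basic sanity checks on a list of per-step training losses.

    Returns a list of problem descriptions (empty if the curve looks sane).
    """
    problems = []
    if len(losses) < 2:
        problems.append("need at least two loss values")
        return problems
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        problems.append("loss contains NaN or Inf")
        return problems
    # Compare the mean of the first and last quarter to smooth out noise.
    k = max(1, len(losses) // 4)
    start = sum(losses[:k]) / k
    end = sum(losses[-k:]) / k
    if start - end <= min_drop:
        problems.append(f"loss did not decrease ({start:.3f} -> {end:.3f})")
    return problems


print(check_loss_curve([2.1, 1.8, 1.6, 1.5, 1.4, 1.35, 1.3, 1.28]))  # prints []
print(check_loss_curve([2.1, 2.2, float("nan")]))
```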

Step 6: Data Quality Metrics

Compute quantitative metrics for overall dataset quality.

import json
import re
import numpy as np
from collections import Counter

def compute_quality_metrics(filepath):
    """Compute comprehensive dataset quality metrics."""

    with open(filepath, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f if line.strip()]

    # Word counts per example, computed once and reused below
    instruction_lengths = [len(ex.get("instruction", "").split()) for ex in examples]
    response_lengths = [len(ex.get("response", "").split()) for ex in examples]

    metrics = {
        "total_examples": len(examples),
        "avg_instruction_length": np.mean(instruction_lengths),
        "avg_response_length": np.mean(response_lengths),
        "min_instruction_length": min(instruction_lengths),
        "max_instruction_length": max(instruction_lengths),
        "examples_with_pii": sum(1 for ex in examples if contains_pii(ex)),
        "examples_with_special_chars": sum(1 for ex in examples if re.search(r"[^\w\s\.\,\!\?\-\']", ex.get("response", ""))),
    }

    # Source distribution
    sources = Counter(ex.get("source", "unknown") for ex in examples)
    metrics["source_distribution"] = dict(sources)

    # Report
    print("Dataset Quality Metrics:")
    print(f"  Total: {metrics['total_examples']}")
    print(f"  Avg instruction length: {metrics['avg_instruction_length']:.1f} words")
    print(f"  Avg response length: {metrics['avg_response_length']:.1f} words")
    print(f"  Examples with PII: {metrics['examples_with_pii']}")
    print(f"  Examples with special chars: {metrics['examples_with_special_chars']}")
    print(f"  Sources: {metrics['source_distribution']}")

    return metrics

def contains_pii(example):
    """Quick PII detection."""
    # Join with a space so the two fields cannot merge into a false match
    text = (example.get("instruction", "") + " " + example.get("response", "")).lower()
    patterns = [
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSN
        r"\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b"  # Email (text is lowercased)
    ]
    return any(re.search(p, text) for p in patterns)

metrics = compute_quality_metrics("dataset.jsonl")

Quality Standards

| Metric | Acceptable | Needs Review |
|---|---|---|
| Manual review failure rate | < 5% | > 10% |
| Inconsistent responses | < 2% of examples | > 5% |
| PII detected | 0 | > 0 |
| Factual errors (sampled) | 0–2% | > 5% |
| Gender bias (single-gender only) | < 20% | > 40% |
| Test run success | Yes | No |
| Avg instruction length | 10–50 words | < 5 or > 100 |
| Avg response length | 20–200 words | < 10 or > 500 |

If any metric is in the "Needs Review" column, pause and fix before training.
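
The rate-based rows of the table can be encoded as a simple gate. The sketch below is one possible encoding, not part of the original workflow: the metric names are hypothetical, values that fall between the two columns are labeled "borderline" (the table does not name that zone), boundary values are treated as acceptable, and the length rows are left out for brevity.

```python
# (acceptable_max, needs_review_min) as fractions, taken from the table above.
RATE_GATES = {
    "manual_failure_rate": (0.05, 0.10),
    "inconsistent_rate": (0.02, 0.05),
    "factual_error_rate": (0.02, 0.05),
    "single_gender_rate": (0.20, 0.40),
}


def grade_rate(value, acceptable_max, review_min):
    """Return 'acceptable', 'borderline', or 'needs_review' for one rate."""
    if value > review_min:
        return "needs_review"
    if value <= acceptable_max:
        return "acceptable"
    return "borderline"  # assumption: the table leaves this zone unnamed


def grade_dataset(rates, pii_count, test_run_ok):
    """Grade each metric; the dataset is blocked if any metric needs review."""
    report = {name: grade_rate(rates[name], *RATE_GATES[name]) for name in RATE_GATES}
    report["pii_detected"] = "acceptable" if pii_count == 0 else "needs_review"
    report["test_run"] = "acceptable" if test_run_ok else "needs_review"
    blocked = any(status == "needs_review" for status in report.values())
    return report, blocked


# Hypothetical measurements for one dataset.
report, blocked = grade_dataset(
    {
        "manual_failure_rate": 0.04,
        "inconsistent_rate": 0.03,
        "factual_error_rate": 0.0,
        "single_gender_rate": 0.45,
    },
    pii_count=0,
    test_run_ok=True,
)
print(report)
print("Blocked:", blocked)
```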

Key Takeaways

  • Manually spot-check 50–100 random examples; if > 10% fail, the dataset needs cleaning.
  • Check consistency: the same instruction should have the same (or very similar) responses.
  • Audit for gender, racial, ability, and geographic biases.
  • Verify a sample of factual claims against reliable sources.
  • Do a mini test run on 10% of data before full fine-tuning.
  • Compute quality metrics; flag datasets that exceed thresholds.

Frequently Asked Questions

How much of the dataset should I manually review?

Aim for 5–10% of the dataset or at least 50 examples, whichever is larger. For small datasets (< 500), review 10–20%. For large datasets (> 10,000), 5% is sufficient.
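
These rules of thumb can be written as a small helper. The sketch below takes the lower end of each quoted range, which is an assumption; raise the percentages if review capacity allows. Integer percentages keep the arithmetic exact.

```python
import math


def review_sample_size(total, minimum=50):
    """Suggest how many examples to review, using the lower end of each range."""
    if total <= 0:
        return 0
    if total < 500:
        percent = 10    # small datasets: 10-20%
    else:
        percent = 5     # medium datasets: 5-10%; large datasets: 5%
    size = max(minimum, math.ceil(total * percent / 100))
    return min(size, total)  # cannot review more examples than exist


for n in (30, 400, 2_000, 20_000):
    print(n, review_sample_size(n))
```

For datasets smaller than the 50-example floor, the helper simply returns the whole dataset.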

What if I find errors during manual review?

Fix them directly in the dataset (if it's a clear typo or formatting error) or mark them for removal (if they're wrong/incomplete). Re-run consistency checks and quality metrics after fixes.

How do I define "acceptable" bias?

Context matters. Gendered pronouns in examples of diverse characters are OK. A dataset where 95% of technical experts are men is biased. Use demographic representation targets (e.g., "responses should mention women in roles at least 30% of the time") and track them.
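
Tracking a representation target needs a measurable proxy. The sketch below uses a crude one: among examples that contain any gendered term, the share that mention at least one female term. The term lists, the function name, and the proxy itself are assumptions; it does not detect roles, so it only approximates the "women in roles" target described above.

```python
import re

# Hypothetical term lists; adjust for your domain and language.
FEMALE = {"she", "her", "hers", "woman", "women", "girl", "female"}
MALE = {"he", "his", "him", "man", "men", "boy", "male"}


def representation_share(texts):
    """Share of gendered examples that mention at least one female term.

    Returns None when no example contains a gendered term.
    """
    gendered = with_female = 0
    for text in texts:
        words = set(re.findall(r"[a-z']+", text.lower()))
        has_f, has_m = bool(words & FEMALE), bool(words & MALE)
        if has_f or has_m:
            gendered += 1
            with_female += has_f
    return with_female / gendered if gendered else None


texts = [
    "The engineer said he would review the patch.",
    "She led the incident response.",
    "The theory holds in general.",        # no gendered terms; ignored
    "He asked her for the design doc.",
]
share = representation_share(texts)
print(f"{share:.0%} of gendered examples mention women (target: 30%)")
```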

Should I audit before or after cleaning?

Audit after cleaning. Cleaning removes duplicates and PII; auditing checks the remaining examples for consistency and quality.

Can I automate the entire QA process?

Partially. Automation catches format errors, PII, and consistency issues. Manual review is still necessary for factuality, tone, and subtle biases. Aim for 80% automated checks, 20% manual.
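
Format checks are the easiest part to automate. The sketch below validates one JSONL line at a time against the instruction/response schema used throughout this article; the function name and the exact wording of the problem messages are illustrative.

```python
import json

REQUIRED_FIELDS = ("instruction", "response")


def validate_line(line):
    """Return a list of format problems for one JSONL line (empty if valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str):
            problems.append(f"'{field}' missing or not a string")
        elif not value.strip():
            problems.append(f"'{field}' is empty")
    return problems


sample_lines = [
    '{"instruction": "Say hi", "response": "Hi!"}',
    '{"instruction": "No response"}',
    '{not json}',
]
for number, line in enumerate(sample_lines, 1):
    print(number, validate_line(line) or "ok")
```

Running a check like this before the other audits keeps malformed records from causing errors in later steps.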

Further Reading