Fine-tuning Embeddings for Your Domain

Fine-tuning an embedding model on domain-specific data can improve retrieval recall by 5–15% when your terminology, documents, or query patterns differ significantly from general-purpose training data. A domain-tuned model learns that "cardiac ablation" and "heart rhythm procedure" are close in a medical context, or that "GPU compute" and "CUDA acceleration" are closely related in ML engineering. Fine-tuning requires labeled pairs (queries with their relevant documents) and takes 4–16 hours on a GPU. If your general-purpose model already achieves recall >0.85 on your test set, fine-tuning yields marginal gains; if recall is <0.75, fine-tuning is essential. This article walks through fine-tuning embeddings end to end, from preparing training data to evaluating improvements.

In deploying RAG for a legal tech startup, I fine-tuned BGE-base on 2,000 labeled query-document pairs from case law. Recall improved from 0.78 (general-purpose) to 0.91 (fine-tuned), directly reducing hallucinations in the LLM's answers. This article reproduces that workflow.

When to Fine-tune

Fine-tune if:

  • General-purpose model achieves recall <0.80 on your domain (measured on 100+ labeled test pairs).
  • You have 1,000+ labeled query-document pairs (fewer works but risks overfitting).
  • Your domain has unique terminology or context (medical, legal, specialized technical).

Don't fine-tune if:

  • General-purpose recall is already >0.85 (diminishing returns).
  • You have fewer than 500 labeled pairs (not enough data).
  • You can afford a larger general-purpose model (text-embedding-3-large) instead.
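
The rules of thumb above can be sketched as a small helper. This is only an illustration: the function name and return strings are hypothetical, the thresholds are copied from the two lists, and the qualitative criteria (unique terminology, budget for a larger model) are left out because they cannot be reduced to a number.

```python
def fine_tuning_recommendation(baseline_recall: float, n_labeled_pairs: int) -> str:
    """Apply the numeric rules of thumb from the two lists above."""
    if baseline_recall > 0.85:
        return "skip: baseline recall is already high"
    if n_labeled_pairs < 500:
        return "skip: collect more labeled pairs first"
    if baseline_recall < 0.80 and n_labeled_pairs >= 1000:
        return "fine-tune"
    # Recall between 0.80 and 0.85, or only 500-999 labeled pairs
    return "borderline: fine-tune cautiously and watch for overfitting"


print(fine_tuning_recommendation(0.78, 2000))  # fine-tune
print(fine_tuning_recommendation(0.90, 5000))  # skip: baseline recall is already high
```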

Step 1: Prepare Training Data

Training data for embedding fine-tuning consists of triplets or pairs:

  • Triplet: (query, positive_doc, negative_doc). The query is similar to positive_doc and dissimilar to negative_doc.
  • Pair: (query, positive_doc). Assume other docs in the batch are negatives (in-batch negatives).
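
To make "in-batch negatives" concrete, here is a minimal pure-Python sketch of the idea, not the library's implementation. For a batch of (query, positive_doc) pairs, row i of the similarity matrix is scored as a classification problem whose correct answer is column i; every other document in the batch serves as a negative. The function name is hypothetical, and the scale of 20 is an assumption chosen to mirror the default used by Sentence Transformers' MultipleNegativesRankingLoss.

```python
import math


def in_batch_negatives_loss(sim: list[list[float]], scale: float = 20.0) -> float:
    """Mean cross-entropy where the correct 'class' for row i is column i.

    sim[i][j] is the cosine similarity between query i and positive_doc j,
    so every off-diagonal entry acts as a negative for query i.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)


# Each query is closest to its own document: low loss
good = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
# Each query is closest to someone else's document: high loss
bad = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.9, 0.1]]
print(in_batch_negatives_loss(good) < in_batch_negatives_loss(bad))  # True
```

One consequence of this setup is that larger batches supply more negatives per query, which is why batch size matters more for pair-based training than for many other fine-tuning tasks.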

Example from legal domain:

training_data = [
{
"query": "liability for product defects",
"positive": "A manufacturer is liable for defects in its products if they cause injury.",
"negative": "The warranty covers manufacturing defects for one year."
},
{
"query": "copyright infringement remedies",
"positive": "Remedies for copyright infringement include injunctions, damages, and attorney fees.",
"negative": "Patents provide 20 years of exclusive rights to inventions."
},
# ... more triplets
]

To create training data:

  1. Collect queries (customer search queries, user logs, or representative questions).
  2. Label positive documents (which documents are relevant to each query).
  3. Sample negatives (random documents, or hard negatives—documents retrieved by current model but marked irrelevant).

import json
import random

# Assuming you have:
# - queries: list of strings
# - corpus: list of documents
# - relevance_labels: dict mapping (query_id, doc_id) -> True/False

training_triplets = []

for query_id, query_text in enumerate(queries):
    # Find positive docs
    positives = [doc_id for (qid, doc_id), is_relevant in relevance_labels.items()
                 if qid == query_id and is_relevant]

    # Find negative docs
    negatives = [doc_id for (qid, doc_id), is_relevant in relevance_labels.items()
                 if qid == query_id and not is_relevant]

    # Sample one positive and one negative per query
    if positives and negatives:
        pos_doc_id = random.choice(positives)
        neg_doc_id = random.choice(negatives)

        training_triplets.append({
            "query": query_text,
            "positive": corpus[pos_doc_id],
            "negative": corpus[neg_doc_id]
        })

# Save to file
with open("training_triplets.json", "w") as f:
    json.dump(training_triplets, f, indent=2)

print(f"Created {len(training_triplets)} triplets for fine-tuning")
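
Because Step 3 relies on a held-out test set, it helps to split the data before training. The sketch below is one possible approach, assuming the triplet dictionaries shown above; the function name, the 20% test fraction, and the seed are illustrative choices. It splits by query text so that no query appears on both sides.

```python
import random


def split_triplets_by_query(triplets, test_fraction=0.2, seed=42):
    """Split triplets so that no query text appears in both train and test."""
    queries = sorted({t["query"] for t in triplets})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(queries)
    n_test = max(1, int(len(queries) * test_fraction))
    test_queries = set(queries[:n_test])
    train = [t for t in triplets if t["query"] not in test_queries]
    test = [t for t in triplets if t["query"] in test_queries]
    return train, test


# Toy data: 10 queries with 2 triplets each
toy = [
    {"query": f"q{i}", "positive": f"pos{i}-{j}", "negative": f"neg{i}-{j}"}
    for i in range(10)
    for j in range(2)
]
train, test = split_triplets_by_query(toy)
print(len(train), len(test))  # 16 4
```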

Step 2: Fine-tune Using Sentence Transformers

Use Hugging Face's Sentence Transformers library, which provides pre-built fine-tuning recipes:

import json

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import TripletEvaluator

# Load pre-trained model
base_model = "all-MiniLM-L6-v2"
model = SentenceTransformer(base_model)

# Load training data
with open("training_triplets.json") as f:
    training_triplets = json.load(f)

# Convert to InputExample format.
# MultipleNegativesRankingLoss expects (anchor, positive) or (anchor, positive, negative)
# texts and ignores labels, so each triplet becomes ONE example. Do not add
# (query, negative) as a separate pair: the loss would treat it as a positive.
examples = [
    InputExample(texts=[t["query"], t["positive"], t["negative"]])
    for t in training_triplets
]

# Hold out the last 10% of triplets as a validation set
n_val = max(1, len(examples) // 10)
train_examples, val_examples = examples[:-n_val], examples[-n_val:]
evaluator = TripletEvaluator.from_input_examples(val_examples, name="validation")

# Batch size is set on the DataLoader, not in fit().
# Larger batches provide more in-batch negatives.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Define loss function (ranks the positive above the explicit and in-batch negatives)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Configure training
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=1,  # 1 epoch is usually sufficient for embedding fine-tuning
    warmup_steps=int(0.1 * len(train_dataloader)),  # 10% of training steps
    show_progress_bar=True,
    output_path="./fine_tuned_model"
)

# Save model
model.save("./legal_embeddings_fine_tuned")

Training time on a GPU: roughly 30 minutes for 1,000 pairs and roughly 5 hours for 10,000 pairs.

Step 3: Evaluate Improvements

Benchmark the fine-tuned model against the base model on a held-out test set:

import json

import numpy as np
from sentence_transformers import SentenceTransformer

# Load base and fine-tuned models
base_model = SentenceTransformer("all-MiniLM-L6-v2")
fine_tuned_model = SentenceTransformer("./legal_embeddings_fine_tuned")

# Load test set (queries + corpus + labels)
with open("test_queries.json") as f:
    test_queries = json.load(f)

with open("corpus.json") as f:
    corpus = json.load(f)

with open("test_labels.json") as f:
    test_labels = json.load(f)

# Embed corpus with both models
corpus_embeddings_base = base_model.encode(corpus, normalize_embeddings=True, show_progress_bar=True)
corpus_embeddings_tuned = fine_tuned_model.encode(corpus, normalize_embeddings=True, show_progress_bar=True)

# Evaluate each query. The query must be encoded with the same model
# that produced corpus_embeddings, so the model is passed in explicitly.
def evaluate_model(model, queries, corpus_embeddings, labels):
    recalls_at_10 = []
    precisions_at_10 = []

    for query_id, query_text in enumerate(queries):
        # Encode query
        query_embedding = model.encode(query_text, normalize_embeddings=True)

        # Retrieve top-10 (dot product equals cosine similarity on normalized vectors)
        similarities = corpus_embeddings @ query_embedding
        top_10_indices = np.argsort(similarities)[-10:][::-1].tolist()

        # Get ground truth
        relevant_indices = set(labels.get(str(query_id), []))

        # Compute recall@10
        retrieved_relevant = len(set(top_10_indices) & relevant_indices)
        recall_at_10 = retrieved_relevant / len(relevant_indices) if relevant_indices else 0.0
        recalls_at_10.append(recall_at_10)

        # Compute precision@10
        precision_at_10 = retrieved_relevant / 10
        precisions_at_10.append(precision_at_10)

    return np.mean(recalls_at_10), np.mean(precisions_at_10)

# Benchmark
recall_base, precision_base = evaluate_model(base_model, test_queries, corpus_embeddings_base, test_labels)
recall_tuned, precision_tuned = evaluate_model(fine_tuned_model, test_queries, corpus_embeddings_tuned, test_labels)

print("Base Model:")
print(f" Recall@10: {recall_base:.3f}, Precision@10: {precision_base:.3f}")
print("\nFine-tuned Model:")
print(f" Recall@10: {recall_tuned:.3f}, Precision@10: {precision_tuned:.3f}")
print("\nImprovement (relative):")
print(f" Recall gain: {(recall_tuned - recall_base) / recall_base * 100:.1f}%")
print(f" Precision gain: {(precision_tuned - precision_base) / precision_base * 100:.1f}%")

# Example output:
# Base Model:
# Recall@10: 0.782, Precision@10: 0.078
#
# Fine-tuned Model:
# Recall@10: 0.901, Precision@10: 0.090
#
# Improvement (relative):
# Recall gain: 15.2%
# Precision gain: 15.4%
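
As a worked example of the two metrics, the sketch below computes recall@k and precision@k for a single query using plain lists. The function name and the document IDs are made up for illustration.

```python
def recall_and_precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """ranked_doc_ids: retrieval order, best first. relevant_doc_ids: set of ground-truth IDs."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    recall = hits / len(relevant_doc_ids) if relevant_doc_ids else 0.0
    precision = hits / k
    return recall, precision


ranked = [7, 3, 12, 5, 9, 1, 4, 8, 2, 6, 11]  # doc 11 is ranked 11th, outside the top 10
relevant = {3, 9, 11, 20}                     # doc 20 was never retrieved
recall, precision = recall_and_precision_at_k(ranked, relevant, k=10)
print(recall, precision)  # 0.5 0.2
```

Two of the four relevant documents appear in the top 10, so recall@10 is 2/4 and precision@10 is 2/10. This also explains the low precision in the example output above: a Precision@10 that is one tenth of Recall@10 is consistent with each test query having exactly one relevant document, which caps precision@10 at 0.1.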

Advanced: Hard Negative Mining

Naively sampling random negatives can lead to easy negatives (obviously irrelevant docs). Hard negatives (retrieved by the model but marked irrelevant) improve fine-tuning:

# Hard negative mining: find docs retrieved by the base model but not relevant.
# Mine on TRAINING queries only; mining on the test set would leak it into training
# and inflate the Step 3 benchmark.
# Assumes train_queries (list of strings) and train_labels (dict: str(query_id) ->
# list of relevant corpus indices) in the same format as the Step 3 test files,
# plus base_model, corpus, and corpus_embeddings_base as loaded in Step 3.

hard_negative_triplets = []

for query_id, query_text in enumerate(train_queries):
    # Get positive docs (ground truth)
    positives = set(train_labels.get(str(query_id), []))

    # Retrieve candidates with base model
    query_embedding = base_model.encode(query_text, normalize_embeddings=True)
    similarities = corpus_embeddings_base @ query_embedding
    top_k_indices = np.argsort(similarities)[-50:][::-1].tolist()  # Top 50

    # Find hard negatives: retrieved but not relevant
    hard_negatives = [idx for idx in top_k_indices if idx not in positives]

    # Create triplets
    for pos_idx in positives:
        for hard_neg_idx in hard_negatives[:3]:  # Up to 3 hard negatives per positive
            hard_negative_triplets.append({
                "query": query_text,
                "positive": corpus[pos_idx],
                "negative": corpus[hard_neg_idx]
            })

# Combine with the random-negative triplets from Step 1 for balanced training
with open("training_triplets.json") as f:
    random_negative_triplets = json.load(f)
training_triplets = random_negative_triplets + hard_negative_triplets

# Continue with fine-tuning as before

Hard negative mining typically improves final recall by 2–4% over random negatives.

Fine-tuning on Large Models

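For larger base models (e.g., BGE-base with 110M+ parameters), a lower learning rate, weight decay, and a warmup proportional to the number of training steps are recommended. Hosted API models such as text-embedding-3-small cannot be fine-tuned with this workflow because their weights are not available locally. Note that model.fit() has no gradient accumulation option, so if a batch of 32 does not fit in GPU memory, lower the batch size instead:

# For larger models (assumes model, train_examples, and train_loss are built as in
# Step 2, with the loss wrapping the larger model)
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=int(len(train_dataloader) * 0.1),  # 10% of training steps for warmup
    optimizer_params={"lr": 1e-5},  # Lower than the 2e-5 default
    weight_decay=0.01,  # Weight decay to limit overfitting
    output_path="./fine_tuned_large_model"
)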

Production Deployment of Fine-tuned Models

Once fine-tuned and validated:

  1. Save the model: model.save("./production_model")
  2. Embed your corpus: Re-embed all documents with the fine-tuned model.
  3. Rebuild indexes: Rebuild FAISS, Pinecone, or Weaviate indexes with new embeddings.
  4. A/B test: Compare fine-tuned index recall against base model index on a sample of queries.
  5. Promote to production: Once validated (recall improves by 5%+ with no regression), switch indexes.

Timing: 1–4 weeks from fine-tuning to production (testing and validation included).
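
The promotion criterion in step 5 can be expressed as a simple gate. The sketch below makes two assumptions that the checklist does not spell out: the 5% is treated as a relative gain in overall recall (matching how Step 3 reports gains), and "no regression" is read as no query segment getting worse. The function name and the segment names are hypothetical.

```python
def ready_to_promote(base_recalls, tuned_recalls, min_relative_gain=0.05):
    """base_recalls / tuned_recalls: dicts mapping a query-segment name to recall.

    Returns (decision, overall relative gain, list of regressed segments).
    """
    overall_base = sum(base_recalls.values()) / len(base_recalls)
    overall_tuned = sum(tuned_recalls.values()) / len(tuned_recalls)
    gain = (overall_tuned - overall_base) / overall_base
    regressed = [s for s in base_recalls if tuned_recalls[s] < base_recalls[s]]
    return gain >= min_relative_gain and not regressed, gain, regressed


base = {"contracts": 0.80, "torts": 0.76}
tuned = {"contracts": 0.90, "torts": 0.88}
ok, gain, regressed = ready_to_promote(base, tuned)
print(ok, regressed)  # True []
```

Checking segments separately guards against a model that improves the average while degrading one class of queries.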

Key Takeaways

  • Fine-tune only if general-purpose recall is < 0.80: Higher baseline means fine-tuning yields marginal gains.
  • Prepare 1,000+ labeled query-document pairs: Fewer pairs risk overfitting; more data helps, with diminishing returns beyond roughly 5,000 pairs.
  • Use triplet or pair loss: Both work; pairs (with in-batch negatives) are simpler in Sentence Transformers.
  • Hard negative mining improves results: Include retrieved-but-irrelevant docs in training for harder learning.
  • Fine-tuning takes 4–16 hours on GPU: Plan accordingly; one epoch is usually sufficient.

Frequently Asked Questions

How many labeled pairs do I need for fine-tuning?

Minimum 500 pairs; 1,000+ is recommended. Below 500, you risk overfitting (the fine-tuned model memorizes the training data and generalizes poorly). At 5,000+ pairs, gains plateau (diminishing returns).

Can I fine-tune without a GPU?

Technically yes, but training on 1,000 pairs takes 12+ hours on a CPU versus roughly 30 minutes on a GPU. For production, allocate a GPU (rent from GCP, AWS, or Azure for $1–3/hour).

Should I fine-tune the entire model or just the final layers?

For embeddings, fine-tune the entire model. Embedding models are small relative to LLMs (22M–110M params), and full fine-tuning is feasible. Layer freezing (training only final layers) often yields lower recall.

How do I prevent overfitting on limited training data?

Use dropout (built into Sentence Transformers), weight decay (L2 regularization), and validation on a held-out set. If validation recall plateaus or degrades, stop training (early stopping).
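
A minimal sketch of the early-stopping rule, assuming you record validation recall after each evaluation round. The function name and the patience value are illustrative choices; this is not a Sentence Transformers API.

```python
def best_checkpoint(val_recalls, patience=2):
    """Return the index of the checkpoint to keep.

    Scanning stops once `patience` consecutive evaluations fail to beat the best score.
    """
    best_idx, best = 0, val_recalls[0]
    for i, recall in enumerate(val_recalls[1:], start=1):
        if recall > best:
            best_idx, best = i, recall
        elif i - best_idx >= patience:
            break  # no improvement for `patience` evaluations: stop training
    return best_idx


# Recall peaks at index 2, then degrades for two evaluations, so training stops there.
print(best_checkpoint([0.80, 0.85, 0.88, 0.87, 0.86, 0.90]))  # 2
```

The later value of 0.90 is never reached because training has already stopped. That is the trade-off of a small patience value: less wasted compute, at the risk of stopping before a late improvement.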

Can I combine fine-tuned and general-purpose embeddings?

Yes (ensemble). Normalize the vectors from each model, concatenate them, and index the result. Recall often improves 2–3%, but storage roughly doubles. Test before deploying.
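
A small pure-Python sketch of what concatenation does, under the assumption that each model's vector is L2-normalized first. In that case the similarity of two concatenated vectors equals the average of the two models' cosine similarities. The helper names and toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def concat_embeddings(vec_a, vec_b):
    """Normalize each model's vector, concatenate, and rescale to unit length."""
    joined = l2_normalize(vec_a) + l2_normalize(vec_b)
    return [x / math.sqrt(2) for x in joined]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy vectors: model A has 2 dimensions, model B has 3
query_a, doc_a = [1.0, 0.0], [1.0, 1.0]            # cosine = 0.7071...
query_b, doc_b = [0.0, 2.0, 0.0], [0.0, 1.0, 0.0]  # cosine = 1.0

cos_a = dot(l2_normalize(query_a), l2_normalize(doc_a))
cos_b = dot(l2_normalize(query_b), l2_normalize(doc_b))
score = dot(concat_embeddings(query_a, query_b), concat_embeddings(doc_a, doc_b))
print(round(score, 3), round((cos_a + cos_b) / 2, 3))  # 0.854 0.854
```

Without per-model normalization, the model that produces larger-magnitude vectors would dominate the combined score.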

Further Reading