Dense Vector Search: Embedding-Based Semantic Retrieval
Dense vector search (or dense retrieval) represents documents and queries as continuous high-dimensional embeddings where semantic similarity corresponds to geometric proximity. An embedding model (such as a pre-trained transformer like BERT or a specialized encoder like all-MiniLM-L6-v2) compresses document meaning into a fixed-size vector—typically 384 to 1536 dimensions—where closely related documents cluster together in vector space. At retrieval time, the query is embedded using the same model, and the system returns the k documents whose embeddings are nearest to the query embedding (using cosine similarity or Euclidean distance). Unlike keyword matching, dense retrieval captures semantic variations: "What causes depression?" will retrieve documents about both clinical depression and emotional sadness, addressing the synonym and semantic paraphrasing gaps left by BM25.
How Embeddings Encode Semantic Meaning
A word embedding (and by extension, document embedding) is a learned mapping from text to a vector space where semantic relationships are preserved as geometric relationships. For example, the embedding space for animals might arrange vectors such that:
- "dog" and "cat" are close together (both are mammals).
- "dog" and "bone" are nearby (semantic association).
- "dog" and "programming" are far apart (no semantic relation).
Modern embeddings are learned using self-supervised transformer models. A model like BERT is pre-trained on massive text corpora to predict masked tokens, learning to encode contextual meaning. When you pass a document through BERT, it outputs a sequence of token embeddings, typically pooled (averaged or [CLS] token extracted) to a single document embedding.
For example, encoding the document "Transformer models revolutionized NLP":
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2') # 384-dim embeddings
doc = "Transformer models revolutionized NLP"
embedding = model.encode(doc)
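# embedding.shape == (384,)
# embedding is a NumPy array, e.g. array([-0.021, 0.042, ..., 0.156]) (values illustrative)

# Similarity to a related query
query = "neural networks for language"
query_embedding = model.encode(query)
# model.similarity is available in recent sentence-transformers releases; it uses cosine by default
cosine_sim = model.similarity(query_embedding, embedding)
# cosine_sim is a 1x1 tensor, e.g. ~0.78 for a related pair (illustrative, not a measured value)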
The embedding is a dense vector where every dimension carries semantic information, unlike sparse BM25 representations. This allows the model to capture meaning even when exact words differ.
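The pooling step mentioned above (collapsing a sequence of token embeddings into one document embedding) can be sketched in a few lines. This is a minimal illustration with made-up 3-dimensional token vectors, not the internals of any particular library; the names `mean_pool` and `cls_pool` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector]) -> Vector:
    """Average the token vectors dimension by dimension."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def cls_pool(token_embeddings: List[Vector]) -> Vector:
    """Use the first token's vector ([CLS] in BERT-style models)."""
    return list(token_embeddings[0])


# Hypothetical token embeddings for "[CLS] transformer models"
tokens = [
    [0.5, 0.0, 1.0],  # [CLS]
    [1.0, 2.0, 0.0],  # "transformer"
    [0.0, 1.0, 2.0],  # "models"
]

doc_mean = mean_pool(tokens)  # [0.5, 1.0, 1.0]
doc_cls = cls_pool(tokens)    # [0.5, 0.0, 1.0]
print(doc_mean, doc_cls)
```

Real encoders typically also mask out padding tokens before averaging, which this sketch omits for brevity.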
Embedding Models: Choices and Trade-offs
Several families of embedding models exist, with different trade-offs:
General-purpose models (BERT, RoBERTa, MPNet) are pre-trained on diverse corpora and work well across domains. all-MiniLM-L6-v2 is a distilled variant popular in RAG: 384 dimensions, about 22M parameters, and competitive accuracy.
Domain-specific models (SciBERT for scientific papers, LegalBERT for legal documents) are pre-trained or fine-tuned on domain text, often achieving roughly 5–10% better retrieval accuracy within their domain at the cost of worse out-of-domain performance.
Dense retrieval-specific models such as Dense Passage Retrieval (DPR) and ColBERT are fine-tuned with contrastive learning on relevance examples (a query with a relevant passage and one or more irrelevant passages), directly optimizing for retrieval performance. DPR set state-of-the-art results on open-domain QA benchmarks when it was introduced. ColBERT keeps one vector per token and compares them at query time (late interaction), which improves accuracy but requires more computation and storage.
Large embedding models (OpenAI text-embedding-3-large, 3072 dimensions; Cohere embed-english-v3.0) achieve top benchmark accuracy but cost more to compute and store.
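To make the contrastive objective used by retrieval-specific models more concrete, here is a minimal sketch of a DPR-style loss: the negative log-likelihood of the relevant passage among all candidates, scored by dot product. The vectors are toy values and the function name `contrastive_loss` is hypothetical; real training runs this over batches with learned encoders.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(query: Vector, positive: Vector, negatives: List[Vector]) -> float:
    """Negative log-likelihood of the positive passage among all candidates."""
    scores = [dot(query, positive)] + [dot(query, neg) for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]


query = [1.0, 0.0]

# Positive passage is aligned with the query, negatives are not: low loss
good_loss = contrastive_loss(query, [2.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])

# A negative outscores the positive: high loss
bad_loss = contrastive_loss(query, [0.0, 1.0], [[2.0, 0.0]])

print(f"well-separated: {good_loss:.3f}, confused: {bad_loss:.3f}")
```

Minimizing this loss pushes the query embedding toward relevant passages and away from irrelevant ones, which is why such models tend to retrieve better than encoders trained only on masked-token prediction.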
A practical choice matrix for RAG:
| Model | Dimensions | Speed (docs/sec) | Accuracy | Cost | Use Case |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 5,000+ | 7.5/10 | Free | Baseline, volume |
| all-mpnet-base-v2 | 768 | 1,500+ | 8.2/10 | Free | General purpose |
| text-embedding-3-small | 1536 (default; reducible via the `dimensions` parameter) | 1,000+ | 8.8/10 | $0.02/1M tokens | Production |
| text-embedding-3-large | 3072 | 300+ | 9.2/10 | $0.13/1M tokens | High accuracy |
For most RAG applications, start with all-MiniLM-L6-v2 (free, fast) or OpenAI's text-embedding-3-small (small cost, better accuracy). Upgrade to larger models only if retrieval accuracy is a bottleneck.
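The per-token prices in the table translate into a one-off indexing cost that is easy to estimate. The sketch below assumes a hypothetical corpus of 1M documents averaging 500 tokens each; both figures are placeholders to replace with your own.

```python
def embedding_cost_usd(num_docs: int, avg_tokens_per_doc: int,
                       price_per_million_tokens: float) -> float:
    """One-off cost of embedding a corpus with a per-token-priced API."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens


# Assumed workload: 1M documents, 500 tokens each (500M tokens in total)
NUM_DOCS = 1_000_000
AVG_TOKENS = 500

small_cost = embedding_cost_usd(NUM_DOCS, AVG_TOKENS, 0.02)  # text-embedding-3-small
large_cost = embedding_cost_usd(NUM_DOCS, AVG_TOKENS, 0.13)  # text-embedding-3-large

print(f"small: ${small_cost:.2f}, large: ${large_cost:.2f}")
```

Under these assumptions the small model costs about $10 and the large model about $65 to index the corpus once; query-time embedding and re-indexing after updates add to that.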
Vector Search Indexing: FAISS and Vector Databases
Once documents are embedded, they are stored in a vector index for fast similarity search. For small corpora (<100k documents), exact nearest-neighbor search (computing distance to every document) is feasible. For larger corpora, approximate nearest neighbor (ANN) indexing is necessary.
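Exact nearest-neighbor search is simple enough to sketch with the standard library: score every document against the query and keep the k best. The vectors below are toy 2-dimensional values and `exact_top_k` is a hypothetical helper name; the sketch assumes no vector is all zeros.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query: Sequence[float], doc_vectors: List[Sequence[float]],
                k: int) -> List[Tuple[float, int]]:
    """Score every document and return the k highest (score, index) pairs."""
    scored = ((cosine(query, vec), idx) for idx, vec in enumerate(doc_vectors))
    return heapq.nlargest(k, scored)


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = exact_top_k([1.0, 0.1], docs, k=2)
top_ids = [idx for _, idx in top]
print(top_ids)  # [0, 2]
```

The cost grows linearly with corpus size, which is what motivates approximate indexes once the corpus gets large.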
FAISS (Facebook AI Similarity Search) is a popular open-source library for large-scale similarity search:
import faiss
import numpy as np

# Create random stand-in embeddings for 1M documents (768-dim each)
embeddings = np.random.random((1000000, 768)).astype('float32')

# Build FAISS index (Inverted File + Product Quantization)
quantizer = faiss.IndexFlatL2(768)  # Coarse quantizer: exact L2 search over cell centroids
index = faiss.IndexIVFPQ(
    quantizer,
    768,  # Dimension
    100,  # Number of cells (clusters)
    8,    # Number of sub-quantizers (the dimension must be divisible by this)
    8     # Bits per sub-quantizer code: 8 x 8 bits = 8 bytes per vector
)

# Train on a sample (required for IVF and PQ)
sample = embeddings[:10000]
index.train(sample)
index.add(embeddings)

# Query
index.nprobe = 10  # Cells scanned per query (default is 1); higher improves recall but is slower
query_embedding = np.random.random((1, 768)).astype('float32')
distances, indices = index.search(query_embedding, k=10)
print(f"Top 10 nearest document indices: {indices[0]}")
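To show what the inverted-file part of such an index does, here is a toy pure-Python sketch. It is not how FAISS is implemented: the centroids are fixed by hand instead of being learned with k-means, there is no product quantization, and `ToyIVF` is a hypothetical class. The `nprobe` argument mirrors the FAISS parameter of the same name.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyIVF:
    """Vectors are bucketed by nearest centroid; queries scan only a few buckets."""

    def __init__(self, centroids: List[Sequence[float]]):
        self.centroids = centroids
        self.cells: Dict[int, List[Tuple[int, Sequence[float]]]] = defaultdict(list)

    def _nearest_cells(self, vec: Sequence[float], n: int) -> List[int]:
        order = sorted(range(len(self.centroids)),
                       key=lambda c: l2(vec, self.centroids[c]))
        return order[:n]

    def add(self, doc_id: int, vec: Sequence[float]) -> None:
        cell = self._nearest_cells(vec, 1)[0]
        self.cells[cell].append((doc_id, vec))

    def search(self, query: Sequence[float], k: int, nprobe: int = 1) -> List[int]:
        candidates: List[Tuple[int, Sequence[float]]] = []
        for cell in self._nearest_cells(query, nprobe):
            candidates.extend(self.cells[cell])
        candidates.sort(key=lambda item: l2(query, item[1]))
        return [doc_id for doc_id, _ in candidates[:k]]


index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 10.0]])
vectors = [[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [4.9, 4.9]]
for doc_id, vec in enumerate(vectors):
    index.add(doc_id, vec)

query = [5.2, 5.2]
print(index.search(query, k=1, nprobe=1))  # [2]: the true neighbor sits in the other cell
print(index.search(query, k=1, nprobe=2))  # [3]: scanning both cells finds it
```

The query lands just across a cell boundary from its true nearest neighbor, so scanning one cell misses it. That is the recall-versus-speed trade-off that `nprobe` controls.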
Vector databases (Pinecone, Weaviate, Milvus, Qdrant) abstract away index management and provide REST APIs, cloud hosting, and metadata filtering. For production RAG systems, vector databases are preferable to raw FAISS because they handle scale, replication, and updates transparently.
Semantic Similarity Metrics: Cosine vs. Euclidean
Two primary distance metrics are used for embedded vectors:
Cosine similarity: Measures the angle between vectors. Two documents in the same direction (semantically aligned) have high cosine similarity, even if magnitudes differ. This is the standard for text embeddings because magnitude (how "strong" the semantic signal is) is less important than direction.
cosine_similarity = (A · B) / (||A|| * ||B||), which ranges from -1 (opposite directions) to 1 (same direction).
Euclidean distance: Measures straight-line distance in vector space. It is sensitive to magnitude, so documents with similar meaning but different intensities score lower. Euclidean is more appropriate for bounded embeddings or when magnitude carries semantic meaning (e.g., confidence scores).
For text embeddings, cosine similarity is recommended. Most vector databases support it directly, and for normalized (unit-norm) embeddings, cosine similarity, dot product, and Euclidean distance all produce the same ranking.
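A small numeric example shows how the two metrics can disagree when magnitudes differ. The vectors are toy values chosen for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude
c = [2.0, 1.0]  # same magnitude as a, different direction

cos_ab = cosine_similarity(a, b)    # ~1.0
cos_ac = cosine_similarity(a, c)    # ~0.8
dist_ab = euclidean_distance(a, b)  # ~2.24
dist_ac = euclidean_distance(a, c)  # ~1.41

print(f"cosine:    a~b {cos_ab:.2f}, a~c {cos_ac:.2f}")
print(f"euclidean: a~b {dist_ab:.2f}, a~c {dist_ac:.2f}")
```

Cosine ranks b as the closer match to a because it points the same way, while Euclidean distance ranks c closer because b's larger magnitude pushes it away.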
Example: Building a Dense Retrieval System
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample document corpus
documents = [
    "Transformer attention mechanism is a core component of modern NLP.",
    "The attention mechanism in transformer neural networks scales quadratically.",
    "BERT uses transformer architecture for natural language processing.",
    "Recurrent neural networks process sequences one token at a time.",
]

# Embed all documents (in production, use a vector DB)
doc_embeddings = model.encode(documents)  # Shape: (4, 384)

# User query
query = "How do transformers use attention?"
query_embedding = model.encode(query)  # Shape: (384,)

# Compute similarities between the query and every document
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
# e.g. similarities ≈ [0.82, 0.71, 0.79, 0.31] (illustrative; actual scores depend on the model)

# Rank and retrieve top-k
top_k = 3
top_indices = np.argsort(similarities)[::-1][:top_k]
results = [(documents[i], similarities[i]) for i in top_indices]
for doc, score in results:
    print(f"Score: {score:.2f}, Doc: {doc}")

# Illustrative output (documents abbreviated):
# Score: 0.82, Doc: Transformer attention mechanism is...
# Score: 0.79, Doc: BERT uses transformer architecture...
# Score: 0.71, Doc: The attention mechanism in transformer...
Strengths and Limitations of Dense Retrieval
Strengths:
- Semantic Understanding: Captures synonyms, paraphrasing, and implicit semantic relations that BM25 misses.
- Short-Query Robustness: Works well even on short, ambiguous queries because embeddings encode contextual meaning.
- Cross-Lingual Retrieval: Multilingual embedding models encode semantically related documents from different languages nearby in vector space, enabling cross-lingual search.
Limitations:
- Hallucination Risk: A semantically similar document may not contain the exact facts needed. For example, "How many parameters does GPT-3 have?" may retrieve semantically related documents about large language models without stating the actual 175B parameter count. When the fact is missing from the retrieved context, the downstream generator is more likely to hallucinate an answer.
- Latency and Cost: Embedding every query and running k-NN search is typically slower than BM25 and more expensive (especially with large models). The gap depends on index type, corpus size, and hardware, and vector search can be 10–100x slower than a keyword index.
- Sensitivity to Embedding Quality: If the embedding model is weak or misaligned with your domain, retrieval quality degrades. Fine-tuning a domain-specific embedding model requires labeled relevance data.
Key Takeaways
- Dense embeddings represent document semantics as high-dimensional vectors, enabling semantic similarity matching that transcends exact keywords.
- Embedding models (BERT, all-MiniLM, etc.) are pre-trained on massive corpora to encode contextual meaning; document embeddings are typically pooled from token embeddings.
- For most RAG systems, all-MiniLM-L6-v2 (free, 384-dim) or OpenAI text-embedding-3-small (affordable, 1536-dim by default) provide a good accuracy-cost balance.
- Vector indexing (FAISS, vector databases) enables fast approximate k-NN search on millions of documents; cosine similarity is the standard metric for text embeddings.
- Dense retrieval excels at semantic understanding and paraphrasing but can miss documents containing the exact fact or keyword a query needs, which raises hallucination risk on factual queries.
Frequently Asked Questions
How many dimensions should my embeddings have?
Start with 384 (all-MiniLM-L6-v2) for prototyping and most production systems. Larger dimensions (768, 1024, 3072) capture more semantic nuance but increase storage and compute cost by 2–8x with marginal accuracy gains. For dense RAG retrieval, 384–768 dimensions are near-optimal; benefit plateaus beyond 1024.
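The storage side of that trade-off is simple arithmetic: an uncompressed float32 index needs 4 bytes per dimension per vector. The sketch below assumes a hypothetical corpus of 1M vectors and ignores index overhead and compression.

```python
def index_size_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of an uncompressed embedding matrix in gigabytes (10^9 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e9


NUM_VECTORS = 1_000_000  # assumed corpus size

for dim in (384, 768, 1536, 3072):
    print(f"{dim:>4} dims: {index_size_gb(NUM_VECTORS, dim):.2f} GB")
```

Going from 384 to 3072 dimensions multiplies raw storage by 8 (about 1.5 GB versus 12.3 GB for 1M vectors), and brute-force distance computation scales by the same factor.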
Should I fine-tune an embedding model on my domain data?
Only if you have 1,000+ labeled relevance examples (query, relevant document, irrelevant document). Fine-tuning with contrastive learning (e.g., using Sentence Transformers' training framework) can improve domain-specific accuracy by 5–15%. Without labels, fine-tuning risks overfitting and may hurt general retrieval.
How do I choose between cosine similarity and Euclidean distance?
For text embeddings normalized to unit length (which many embedding models do by default, including all-MiniLM-L6-v2), cosine similarity and Euclidean distance produce identical rankings, because squared Euclidean distance equals 2 - 2 × cosine similarity. Use cosine similarity by default; switch to Euclidean only if your embeddings are not normalized and magnitude is semantically meaningful.
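The relationship for unit-length vectors, squared Euclidean distance = 2 - 2 × cosine similarity, can be checked numerically. The sketch uses random 8-dimensional vectors as stand-ins for real embeddings.

```python
import math
import random
from typing import List


def normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)  # fixed seed so the run is repeatable
a = normalize([rng.gauss(0.0, 1.0) for _ in range(8)])
b = normalize([rng.gauss(0.0, 1.0) for _ in range(8)])

cosine = sum(x * y for x, y in zip(a, b))  # dot product of unit vectors
squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))

# The two quantities agree up to floating-point rounding
print(abs(squared_distance - (2 - 2 * cosine)) < 1e-9)  # True
```

Because the distance is a decreasing function of the cosine, sorting by smallest distance or by largest cosine gives the same order.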
Can I combine dense retrieval with keyword expansion (e.g., synonyms)?
Yes. Query expansion techniques (described in Article 8) generate alternative phrasings of the query, embed each variant, and retrieve from multiple embeddings. This is orthogonal to dense retrieval and often improves recall by 5–10%.
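One way to combine the results from several query variants is to keep each document's best score across variants. The merge rule, the helper name `merge_by_max_score`, and the document ids and scores below are all assumptions for illustration; other fusion strategies exist.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, float]  # (document id, similarity score)


def merge_by_max_score(result_lists: List[List[Result]]) -> List[Result]:
    """Deduplicate documents across variants, keeping each one's best score."""
    best: Dict[str, float] = {}
    for results in result_lists:
        for doc_id, score in results:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical top-2 results for the original query and one rephrased variant
results_original = [("doc_a", 0.82), ("doc_b", 0.60)]
results_variant = [("doc_c", 0.75), ("doc_a", 0.70)]

merged = merge_by_max_score([results_original, results_variant])
print(merged)  # [('doc_a', 0.82), ('doc_c', 0.75), ('doc_b', 0.6)]
```

Here the variant surfaces doc_c, which the original query missed, and that is where the recall gain comes from.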
What is the difference between all-MiniLM-L6-v2 and text-embedding-3-small?
all-MiniLM is a free, distilled BERT trained on general text. text-embedding-3-small is OpenAI's embedding model, fine-tuned for retrieval quality and covering more languages. text-embedding-3-small achieves 2–5% better retrieval accuracy on benchmarks but costs $0.02 per 1M tokens. For production systems that prioritize accuracy, choose text-embedding-3-small; for cost-sensitive systems, choose all-MiniLM.
Further Reading
- Sentence Transformers Documentation — Popular Python library for sentence/document embeddings
- FAISS: A library for efficient similarity search — Industrial-scale vector indexing
- Dense Passage Retrieval (DPR) Paper — Pioneering work on dense retrieval for open-domain QA
- Cohere Embeddings API Documentation — Production embedding service with ranking-aware models