Vector Embeddings and Semantic Search in RAG
Vector embeddings are the mathematical heart of modern RAG systems. An embedding is a dense vector (e.g., 384 or 1536 dimensions) that encodes the semantic meaning of text, allowing a machine to compute the similarity between two texts as the distance between their vectors. Unlike keyword matching, which fails on synonyms and domain jargon, semantic search via embeddings captures the meaning behind words, enabling retrieval that understands intent. This article covers how embeddings work, how to choose an embedding model, and how to build a semantic search pipeline that scales.
What Are Vector Embeddings?
An embedding is a list of numbers that represents the semantic meaning of a text. For example, the word "king" might be represented as [0.2, -0.5, 0.8, ...] in 384 dimensions. The key property is that similar texts produce vectors that are close together in this high-dimensional space. Similarity is typically measured with cosine similarity, the cosine of the angle between two vectors: an angle of 0° (cosine = 1) means the vectors point in the same direction, which is read as the same meaning; 90° (cosine = 0) means unrelated; 180° (cosine = -1) means the vectors point in opposite directions. Scores from text embedding models typically cluster in a narrower band than the full -1 to 1 range, so treat them as a relative ranking rather than an absolute scale.
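To make the angle interpretation concrete, here is a minimal sketch using only the standard library. The 2-D vectors are hypothetical and chosen so the angles are easy to see; real embeddings have hundreds of dimensions, but the formula is the same.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors, not output from any real model.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # angle of 0 degrees
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # angle of 90 degrees
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # angle of 180 degrees

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that [1.0, 2.0] and [2.0, 4.0] score as identical even though their lengths differ: cosine similarity ignores magnitude and compares direction only.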
Modern embeddings are produced by neural networks trained on billions of text pairs to predict whether two sentences are similar or different. This training process implicitly learns to place semantically similar texts near each other in the embedding space. For RAG, you use a pre-trained embedding model (you don't train your own; the overhead is prohibitive for most teams). The model encodes your chunks and queries into vectors, then retrieves chunks whose vectors are closest to the query vector.
Choosing an Embedding Model
The embedding model you select dramatically impacts RAG quality and cost. Here are the leading options in 2026:
| Model | Dimensions | Latency | Quality | Cost | Best For |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (reducible) | 50–100ms | Excellent | $0.13/M tokens | Production, general domains |
| OpenAI text-embedding-3-small | 1536 (reducible) | 20–40ms | Good | $0.02/M tokens | Cost-sensitive, faster latency |
| Cohere embed-english-v3.0 | 1024 | 30–70ms | Excellent | $0.10/M tokens | Domain-specific, English-only (a multilingual variant exists) |
| Nomic nomic-embed-text-v1.5 | 768 | 10–30ms | Good | Free (local) | Private data, offline retrieval |
| MTEB Top Open Source | 384–768 | 5–20ms | Very Good | Free (local) | Academic, open-source stack |
For 2026, OpenAI's text-embedding-3-large is a common default for production deployments because of its retrieval quality (few false positives) and its support for dimension reduction, which lets you trade some quality for lower storage and faster search. If cost is a constraint or data privacy is critical, Nomic's open-source model runs locally and performs well for most domains.
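The cost side of this trade-off is simple arithmetic. The sketch below uses the per-token prices from the table above; the corpus size (1,000,000 chunks of about 500 tokens each) and the function name are hypothetical, so substitute your own figures.

```python
# Prices in USD per million tokens, copied from the table above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "embed-english-v3.0": 0.10,
}


def embedding_cost_usd(num_chunks: int, avg_tokens_per_chunk: int, model: str) -> float:
    """One-time cost of embedding a corpus with the given model."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical corpus: 1,000,000 chunks of about 500 tokens each.
for model in PRICE_PER_MILLION_TOKENS:
    cost = embedding_cost_usd(1_000_000, 500, model)
    print(f"{model}: ${cost:,.2f}")
```

For this hypothetical corpus, indexing is a one-time cost of tens of dollars at most. Query embedding is billed the same way but per request, so query volume tends to matter more over time than corpus size.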
Building a Semantic Search Pipeline
Here is a complete end-to-end pipeline using OpenAI embeddings:
```python
import os
from openai import OpenAI
import numpy as np

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def embed_text(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Embed a single text string into a dense vector."""
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding


def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed multiple texts efficiently in a single API call."""
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    # Sort by index to ensure order matches input
    embeddings_dict = {item.index: item.embedding for item in response.data}
    return [embeddings_dict[i] for i in range(len(texts))]


def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:
    """Compute cosine similarity between two vectors."""
    arr1, arr2 = np.array(vec1), np.array(vec2)
    return float(np.dot(arr1, arr2) / (np.linalg.norm(arr1) * np.linalg.norm(arr2) + 1e-8))


# Example: embed chunks and retrieve top-K
chunks = [
    "Machine learning is a subset of artificial intelligence.",
    "The transformer architecture was introduced in 2017.",
    "Python is the most popular language for data science.",
]

# Index: embed all chunks once
chunk_embeddings = embed_batch(chunks)
print(f"Indexed {len(chunks)} chunks")

# Query: embed the query and find top-K similar chunks
query = "What is deep learning and how does it relate to AI?"
query_embedding = embed_text(query)

# Compute similarity scores
similarities = [
    (i, cosine_similarity(query_embedding, chunk_emb))
    for i, chunk_emb in enumerate(chunk_embeddings)
]

# Sort by similarity (descending)
similarities.sort(key=lambda x: x[1], reverse=True)

top_k = 2
print(f"\nTop {top_k} results for query: '{query}'")
for rank, (idx, score) in enumerate(similarities[:top_k], 1):
    print(f"{rank}. (score={score:.3f}) {chunks[idx]}")
```
Output (scores are illustrative; exact values depend on the model and its version):
```text
Indexed 3 chunks

Top 2 results for query: 'What is deep learning and how does it relate to AI?'
1. (score=0.842) Machine learning is a subset of artificial intelligence.
2. (score=0.768) The transformer architecture was introduced in 2017.
```
The pipeline is simple: embed chunks at indexing time (once), embed the query at retrieval time (per query), compute similarities, and return the top-K.
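The "index once, query many times" split can be sketched as a small in-memory class. This is an illustrative sketch, not part of any library: the class name `InMemoryIndex` and the 3-D vectors are hypothetical stand-ins for real embeddings. It normalizes vectors at indexing time so that each query needs only dot products, and it uses `heapq.nlargest` to avoid sorting the full score list.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class InMemoryIndex:
    """Stores unit-length vectors so cosine similarity reduces to a dot product."""

    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[list[float]] = []

    def add(self, text: str, vector: list[float]) -> None:
        # Indexing time: normalize once and keep the result.
        self.texts.append(text)
        self.vectors.append(normalize(vector))

    def search(self, query_vector: list[float], k: int) -> list[tuple[float, str]]:
        # Query time: one dot product per stored vector, then keep the k best.
        q = normalize(query_vector)
        scored = (
            (sum(a * b for a, b in zip(q, v)), text)
            for v, text in zip(self.vectors, self.texts)
        )
        return heapq.nlargest(k, scored)


# Hypothetical 3-D vectors standing in for real embeddings.
index = InMemoryIndex()
index.add("chunk about ML", [0.9, 0.1, 0.0])
index.add("chunk about transformers", [0.6, 0.6, 0.1])
index.add("chunk about Python", [0.0, 0.2, 0.9])

results = index.search([1.0, 0.2, 0.0], k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

This is still a linear scan over every stored vector, so it only suits small indices.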
Optimizing for Scale: Approximate Nearest Neighbor Search
Computing exact cosine similarity with millions of vectors is slow. For a 1M-chunk index, a naive search would require 1M distance computations (~100ms). For real-time systems, this is too slow. Instead, use Approximate Nearest Neighbor (ANN) algorithms, which trade a tiny accuracy loss for huge speed gains. The leading libraries in 2026 are:
- FAISS (Meta): In-memory ANN, best for dense clusters, ~1–5ms for 1M vectors.
- Pinecone: Managed vector DB, handles index replication and filtering automatically.
- Weaviate: Open-source vector DB, supports hybrid (keyword + vector) search.
- PostgreSQL pgvector: If your data is already in Postgres, pgvector adds vector indexing to the same database.
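To show where the speed gain comes from, here is a toy sketch of the inverted-file (IVF) idea that several of these libraries build on: group vectors by their nearest centroid, then scan only the groups closest to the query. Everything here is hypothetical and simplified. The centroids are hard-coded, whereas a real index would typically learn them with k-means, and the class `ToyIVF` is not any library's API.

```python
import math

Vector = tuple[float, ...]


def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyIVF:
    """Toy inverted-file index: vectors are grouped by their nearest centroid."""

    def __init__(self, centroids: list[Vector]) -> None:
        self.centroids = centroids
        self.buckets: list[list[tuple[str, Vector]]] = [[] for _ in centroids]

    def _closest_centroids(self, vec: Vector, n: int) -> list[int]:
        ranked = sorted(
            range(len(self.centroids)),
            key=lambda i: l2_distance(vec, self.centroids[i]),
        )
        return ranked[:n]

    def add(self, item_id: str, vec: Vector) -> None:
        self.buckets[self._closest_centroids(vec, 1)[0]].append((item_id, vec))

    def search(self, query: Vector, k: int, nprobe: int = 1) -> tuple[list[str], int]:
        # Only the nprobe nearest buckets are scanned; the rest are skipped.
        candidates = [
            item
            for bucket in self._closest_centroids(query, nprobe)
            for item in self.buckets[bucket]
        ]
        candidates.sort(key=lambda item: l2_distance(query, item[1]))
        return [item_id for item_id, _ in candidates[:k]], len(candidates)


index = ToyIVF(centroids=[(0.0, 0.0), (10.0, 10.0)])
for item_id, vec in [
    ("a", (0.0, 1.0)), ("b", (1.0, 0.0)), ("c", (9.0, 10.0)),
    ("d", (10.0, 9.0)), ("e", (4.0, 4.0)),
]:
    index.add(item_id, vec)

ids, scanned = index.search((0.5, 0.5), k=2, nprobe=1)
print(ids, f"scanned {scanned} of 5 vectors")
```

The accuracy loss comes from the skipped buckets: a true neighbor that sits just across a bucket boundary is missed unless `nprobe` is raised, which in turn scans more vectors.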
Here is a toy example using FAISS:
```python
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

# Suppose you have 1000 chunks and their embeddings (shape: 1000 x 1536).
# Random vectors stand in for real chunk embeddings in this demo.
num_chunks = 1000
embedding_dim = 1536
chunk_embeddings_array = np.random.randn(num_chunks, embedding_dim).astype('float32')

# Create a FAISS index (IndexFlatL2 = exact, for demo; use IndexIVFFlat or IndexHNSWFlat for scale)
index = faiss.IndexFlatL2(embedding_dim)
index.add(chunk_embeddings_array)
print(f"FAISS index now contains {index.ntotal} vectors")

# Query: find top 5 nearest neighbors
query_text = "How does RAG improve LLM accuracy?"
query_embedding = np.array(
    client.embeddings.create(
        input=query_text,
        model="text-embedding-3-small"
    ).data[0].embedding,
    dtype='float32'
).reshape(1, -1)

# Search: return distances and indices
distances, indices = index.search(query_embedding, k=5)

print(f"\nTop 5 nearest chunks for '{query_text}':")
for rank, (idx, dist) in enumerate(zip(indices[0], distances[0]), 1):
    # IndexFlatL2 returns squared L2 distances: smaller = more similar.
    # Map to a (0, 1] score for display only.
    similarity = 1 / (1 + dist)  # Rough conversion; cosine is more precise
    print(f"{rank}. Chunk {idx} (similarity ~{similarity:.3f})")
```
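The rough distance-to-similarity conversion can be made exact when vectors are normalized to unit length. For unit vectors a and b, the squared L2 distance expands to |a|² + |b|² - 2(a·b) = 2 - 2·cos, so cos = 1 - d²/2. The sketch below checks that identity on two hypothetical vectors using only the standard library. It assumes the distance you feed in is the squared L2 distance, which is what FAISS's IndexFlatL2 reports.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(d2: float) -> float:
    """Valid only when both vectors have unit length."""
    return 1.0 - d2 / 2.0


# Hypothetical vectors, normalized before comparison.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

cos_direct = sum(x * y for x, y in zip(a, b))
cos_via_l2 = cosine_from_squared_l2(squared_l2(a, b))

print(f"direct: {cos_direct:.6f}  via squared L2: {cos_via_l2:.6f}")
```

In practice this means you can normalize embeddings before adding them to an L2 index and recover cosine scores afterwards, or use an inner-product index on the normalized vectors instead.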
FAISS is ideal if your index fits in memory. A flat index of 1536-dimensional float32 vectors needs about 6 KB per vector, so roughly 160,000 vectors fit per GB. For larger indices, use a vector database like Pinecone or Weaviate, which handle sharding and replication transparently.
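As a back-of-the-envelope check on memory requirements, the sketch below computes the raw storage for an uncompressed (flat) float32 index. The helper names are hypothetical, and the figures exclude metadata and any overhead from ANN graph or cluster structures.

```python
def flat_index_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage for a flat index (float32 = 4 bytes per value)."""
    return num_vectors * dim * bytes_per_value


def vectors_per_gb(dim: int, bytes_per_value: int = 4) -> int:
    """How many vectors fit in 1 GB (10**9 bytes) of raw storage."""
    return 10**9 // (dim * bytes_per_value)


print(f"1536 dims: {vectors_per_gb(1536):,} vectors per GB")
print(f" 384 dims: {vectors_per_gb(384):,} vectors per GB")
print(f"1M x 1536 dims: {flat_index_bytes(1_000_000, 1536) / 10**9:.3f} GB")
```

Memory scales linearly with dimension, so a 384-dimensional model stores four times as many vectors per GB as a 1536-dimensional one.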
Common Pitfalls in Semantic Search
Embedding Drift: Your embedding model generates consistent vectors only within its own space. If you switch embedding models mid-index, old and new embeddings are incomparable, and retrieval breaks. Always version your index and plan migrations carefully: re-embed the full corpus with the new model before pointing queries at it.
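One way to guard against mixing embedding spaces is to record the model name and dimension on the index itself and reject anything that does not match. The sketch below is a minimal illustration; `VersionedIndex` is a hypothetical class, not a feature of any particular vector database.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedIndex:
    """Index that remembers which embedding model produced its vectors."""
    model: str
    dim: int
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def add(self, chunk_id: str, vector: list[float], model: str) -> None:
        if model != self.model:
            raise ValueError(
                f"index was built with {self.model!r}, got a vector from {model!r}"
            )
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.vectors[chunk_id] = vector


index = VersionedIndex(model="text-embedding-3-small", dim=3)
index.add("c1", [0.1, 0.2, 0.3], model="text-embedding-3-small")

try:
    # A vector from a different model is rejected instead of silently stored.
    index.add("c2", [0.4, 0.5, 0.6], model="text-embedding-3-large")
except ValueError as err:
    print(f"rejected: {err}")
```

The same check belongs on the query path, since a query embedded with a different model is just as incomparable as a stored chunk.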
Curse of Dimensionality: In very high dimensions, distances between points tend to concentrate, so the gap between the nearest and the farthest neighbor shrinks. This is one reason larger embedding dimensions don't always help; they also increase latency and memory. Experiment with dimension reduction (OpenAI's text-embedding-3 models support this via the dimensions API parameter).
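If you already have full-size vectors stored, dimension reduction can also be sketched locally as truncation followed by re-normalization. This assumes the model was trained so that the leading dimensions carry most of the information, which OpenAI describes for its text-embedding-3 models; for other models, truncation may degrade quality badly. The function name and the 4-D vector are hypothetical.

```python
import math


def shorten_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` values and rescale the result to unit length."""
    truncated = vec[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        return truncated
    return [x / norm for x in truncated]


# Hypothetical 4-D vector reduced to 2 dimensions.
short = shorten_embedding([3.0, 4.0, 1.0, 2.0], dim=2)
print(short)
```

Re-normalizing matters because truncation changes the vector's length, and cosine or inner-product scores on unnormalized vectors would no longer be comparable. Measure retrieval quality on your own queries before committing to a smaller size.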
Forgetting Metadata: Storing only the embedding vector in your index loses crucial information. Always co-store metadata: chunk source, timestamp, access tags, and the original text (for display). This is essential for citations and security (covered in articles 6 and 7).
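A record that keeps the vector and its metadata together might look like the sketch below. The field names and the tag-based visibility rule are hypothetical; real access-control policies are usually richer than a set intersection.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChunkRecord:
    chunk_id: str
    text: str                    # original text, kept for display and citations
    vector: list[float]
    source: str                  # where the chunk came from
    updated_at: datetime
    access_tags: set[str] = field(default_factory=set)


def visible_to(records: list[ChunkRecord], user_tags: set[str]) -> list[ChunkRecord]:
    """Hypothetical policy: a chunk is visible if the user holds any of its tags."""
    return [r for r in records if r.access_tags & user_tags]


now = datetime(2026, 1, 5, tzinfo=timezone.utc)
records = [
    ChunkRecord("c1", "Public pricing overview", [0.1, 0.2], "pricing.md", now, {"public"}),
    ChunkRecord("c2", "Internal salary bands", [0.3, 0.1], "hr/salaries.md", now, {"hr"}),
]

allowed = visible_to(records, user_tags={"public", "engineering"})
print([r.chunk_id for r in allowed])
```

Filtering on access tags before results reach the LLM keeps restricted text out of the prompt entirely, rather than relying on the model to withhold it.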
Key Takeaways
- Vector embeddings encode semantic meaning as dense vectors; similar texts produce nearby vectors.
- Cosine similarity measures the angle between vectors: an angle of 0° indicates identical meaning.
- Choose embedding models based on quality/cost trade-off: OpenAI text-embedding-3-large or Cohere for production; Nomic for privacy.
- Use Approximate Nearest Neighbor (ANN) search (FAISS, Pinecone, Weaviate) for sub-100ms retrieval at scale.
- Always store metadata alongside embeddings for debugging, citations, and access control.
Frequently Asked Questions
Why not just use keyword search (like Elasticsearch) instead of embeddings?
Keyword search is fast and requires no model, but it fails on synonyms and domain jargon. "LLM" and "large language model" are synonymous, but keyword search treats them as different. Embeddings understand intent and handle paraphrases; they outperform keyword-only search by 30–50% on real-world QA tasks. Production systems usually blend both (hybrid retrieval; see article 4).
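The synonym failure is easy to reproduce with a naive token-overlap scorer. This is a deliberately simplified stand-in for keyword search (real engines add stemming, BM25 weighting, and optional synonym lists), and the query and documents are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, doc: str) -> int:
    """Number of distinct tokens shared by the query and the document."""
    return len(tokens(query) & tokens(doc))


query = "LLM training cost"
relevant_doc = "Pretraining a large language model is expensive."
irrelevant_doc = "The LLM conference ticket cost went up."

print(keyword_overlap(query, relevant_doc))    # no shared tokens despite the same topic
print(keyword_overlap(query, irrelevant_doc))  # shares "llm" and "cost"
```

The paraphrased but relevant document scores zero while the off-topic one scores higher, which is the gap embeddings are meant to close.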
What is the difference between word embeddings (Word2Vec) and sentence embeddings?
Word embeddings (Word2Vec, GloVe) encode individual words, useful for downstream models but not directly comparable across documents. Sentence embeddings (trained via contrastive learning) encode entire chunks, directly comparable for retrieval. Always use sentence/passage embeddings for RAG, not word embeddings.
How often should I re-embed my chunks if my knowledge base changes?
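Only re-embed chunks that changed. If a document is updated, re-embed only the affected chunks and update their vectors in the index. Bulk re-embedding every chunk is expensive and unnecessary as long as the embedding model stays the same, because the same text yields effectively the same vector. A full re-embed is needed only when you switch embedding models.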
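One simple way to detect which chunks changed is to store a content hash next to each vector and compare hashes on every sync. The sketch below assumes stable chunk IDs across syncs; the function names, IDs, and texts are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembedding(
    stored_hashes: dict[str, str], current_chunks: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (chunk IDs to embed, chunk IDs to delete from the index)."""
    to_embed = [
        chunk_id
        for chunk_id, text in current_chunks.items()
        if stored_hashes.get(chunk_id) != content_hash(text)
    ]
    to_delete = [chunk_id for chunk_id in stored_hashes if chunk_id not in current_chunks]
    return to_embed, to_delete


# Hashes saved at the last sync (hypothetical chunk IDs and texts).
stored = {
    "doc1#0": content_hash("RAG combines retrieval with generation."),
    "doc1#1": content_hash("Old pricing: $10 per seat."),
    "doc2#0": content_hash("Deprecated section."),
}
current = {
    "doc1#0": "RAG combines retrieval with generation.",
    "doc1#1": "New pricing: $12 per seat.",
    "doc3#0": "A brand-new chunk.",
}

to_embed, to_delete = plan_reembedding(stored, current)
print("embed:", to_embed)
print("delete:", to_delete)
```

Unchanged chunks are skipped, so the embedding bill tracks the size of the edit rather than the size of the corpus. Remember to delete stale vectors too, or removed content will keep surfacing in results.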
Can I use the same embedding model as my LLM?
Not recommended. LLM embeddings (from the LLM's final hidden layer) are optimized for language generation, not retrieval. Use a dedicated embedding model (OpenAI's embedding API, Cohere, Nomic) trained explicitly for semantic search. They outperform LLM embeddings by 20–40%.
How do I handle multi-lingual documents?
Cohere embed-multilingual-v3.0 and Nomic's multilingual model handle 100+ languages in a single embedding space. OpenAI's text-embedding-3 models also accept non-English text, but there are no separate multilingual variants to select, and quality varies by language. Evaluate on your specific language pairs.