
Building a Vector Search Index: Step-by-Step

Building a vector search index takes four steps: embed all documents into vectors, choose a vector database (cloud API, self-hosted, or in-memory), load the vectors with their metadata, and query. A minimal index takes about 30 minutes to build and deploy; a production index (1 billion vectors, replicated, monitored) takes weeks. This article walks through a complete working example in Python, so you can retrieve semantically relevant documents in milliseconds. Along the way you will learn to balance index size, query latency, memory constraints, and cost, which are the real trade-offs of production systems.

In my experience, roughly 60% of vector search projects fail in the early index-building phase because teams choose the wrong database or embedding pipeline. This article helps you avoid those mistakes.

The Vector Search Index Architecture

A vector search index consists of:

  1. Document corpus (CSV, JSON, or database): Your data (e.g., 100,000 product descriptions).
  2. Embedding model: Neural network that converts documents to vectors (e.g., text-embedding-3-small).
  3. Vector database (Pinecone, Weaviate, Milvus, or FAISS): Stores vectors and metadata, enables fast approximate nearest-neighbor (ANN) search.
  4. Indexing algorithm (HNSW, IVF, or tree-based): Makes ANN search fast by organizing vectors into a search structure.
  5. Query pipeline: Encode user query, search index, rank and filter results, return with metadata.

Step 1: Prepare Documents and Metadata

Start with your document corpus. Here is a minimal example:

documents = [
    {
        "id": "doc_001",
        "title": "Best Dogs for Apartment Living",
        "content": "Small breeds like French Bulldogs and Pugs thrive in apartments...",
        "category": "pets"
    },
    {
        "id": "doc_002",
        "title": "Training Your Golden Retriever",
        "content": "Golden Retrievers are intelligent and eager to please...",
        "category": "training"
    },
    {
        "id": "doc_003",
        "title": "Apartment Living Guide",
        "content": "Living in a small space requires smart organization...",
        "category": "lifestyle"
    }
]

Metadata (id, title, category) will be stored alongside vectors for retrieval context.
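Before embedding, it can help to check that every record carries the fields the later steps rely on. The following is a minimal sketch of such a check; `validate_documents` and `REQUIRED_FIELDS` are hypothetical names chosen for illustration, not part of any library.

```python
REQUIRED_FIELDS = ("id", "title", "content", "category")  # fields used in this article


def validate_documents(docs):
    """Return a list of readable problems; an empty list means the corpus looks usable."""
    problems = []
    seen_ids = set()
    for position, doc in enumerate(docs):
        # Every required field must be a non-empty string
        for field in REQUIRED_FIELDS:
            value = doc.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {position}: missing or empty '{field}'")
        # IDs must be unique, because they become the vector IDs in the index
        doc_id = doc.get("id")
        if doc_id is not None and doc_id in seen_ids:
            problems.append(f"record {position}: duplicate id '{doc_id}'")
        seen_ids.add(doc_id)
    return problems


sample = [
    {"id": "doc_001", "title": "A", "content": "text", "category": "pets"},
    {"id": "doc_001", "title": "B", "content": "", "category": "pets"},
]
for problem in validate_documents(sample):
    print(problem)
```

Running a check like this before Step 2 avoids paying for embeddings of records that would later be rejected or silently overwrite each other.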

Step 2: Embed All Documents

Use an embedding model to convert each document to a vector. For larger corpora, batch the encoding to leverage GPU:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Combine title and content for richer embedding
texts_to_embed = [
    f"{doc['title']}. {doc['content']}"
    for doc in documents
]

# Batch encoding (more efficient than one-by-one)
embeddings = model.encode(
    texts_to_embed,
    batch_size=32,  # Adjust based on GPU memory
    normalize_embeddings=True,  # Important: normalize for cosine similarity
    show_progress_bar=True
)

# embeddings shape: (num_documents, embedding_dim)
# e.g., (3, 384) for all-MiniLM-L6-v2

print(f"Embedded {len(embeddings)} documents to {embeddings.shape[1]} dimensions")

For 100,000 documents with batch_size=32 on a modest GPU: ~5–10 minutes. For 1 million documents: ~1–2 hours.
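The timings above imply a throughput of roughly 170 to 330 documents per second. The sketch below turns that arithmetic into a small estimator; `estimate_embedding_minutes` is a hypothetical helper, and the throughput should be measured on your own hardware rather than taken from these figures.

```python
def estimate_embedding_minutes(num_documents, docs_per_second):
    """Wall-clock minutes to embed a corpus at a measured, steady throughput."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return num_documents / docs_per_second / 60


# Throughput implied by "100,000 documents in 5-10 minutes"
slow = 100_000 / (10 * 60)  # about 167 documents per second
fast = 100_000 / (5 * 60)   # about 333 documents per second

for corpus_size in (100_000, 1_000_000):
    low = estimate_embedding_minutes(corpus_size, fast)
    high = estimate_embedding_minutes(corpus_size, slow)
    print(f"{corpus_size:>9,} documents: {low:.0f}-{high:.0f} minutes")
```

At the same throughput, 1 million documents come out at roughly 50 to 100 minutes, which matches the 1–2 hour figure.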

Step 3: Choose a Vector Database

Three options:

Option A: Pinecone (Cloud, Easiest)

Pinecone is a managed vector database. You create an index, push vectors via API, and search. No infrastructure management.

from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone (requires an API key from pinecone.io)
pc = Pinecone(api_key="your-api-key")

# Create index (one-time setup)
index_name = "apartment-dogs"
pc.create_index(
    name=index_name,
    dimension=384,  # Must match the embedding model's output size
    metric="cosine",  # Use cosine similarity
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

# Get index reference
index = pc.Index(index_name)

# Upsert vectors with metadata
vectors_to_upsert = []
for doc, embedding in zip(documents, embeddings):
    vector_id = doc["id"]
    values = embedding.tolist()  # Convert numpy to list
    metadata = {
        "title": doc["title"],
        "category": doc["category"],
        "text_preview": doc["content"][:200]  # Store preview for debugging
    }
    vectors_to_upsert.append((vector_id, values, metadata))

# Upsert in batches (Pinecone limits request size; ~100 vectors per call is a safe choice)
batch_size = 100
for i in range(0, len(vectors_to_upsert), batch_size):
    batch = vectors_to_upsert[i:i+batch_size]
    index.upsert(vectors=batch)
    print(f"Upserted {min(i+batch_size, len(vectors_to_upsert))}/{len(vectors_to_upsert)}")

# Query
query_text = "best small dogs for apartments"
query_embedding = model.encode(query_text, normalize_embeddings=True).tolist()

results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)

for match in results["matches"]:
    print(f"ID: {match['id']}, Score: {match['score']:.3f}")
    print(f"  Title: {match['metadata']['title']}")

Cost: ~$0.50/month for < 100K vectors (free tier available, 2026 pricing).
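The batched upsert pattern used above does not depend on Pinecone. A minimal sketch of a reusable helper follows; `batched` is a hypothetical name defined here (Python 3.12 ships a similar `itertools.batched`), and the record list is placeholder data.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder records standing in for (id, values, metadata) tuples
records = [f"vec_{i}" for i in range(250)]
for number, batch in enumerate(batched(records, 100), start=1):
    print(f"batch {number}: {len(batch)} records")
```

Keeping the batching logic separate makes it easy to add retries or rate limiting around each API call later.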

Option B: FAISS (Local, Free, Fast)

FAISS (Facebook AI Similarity Search) is an in-memory index library. Ideal for < 10 million vectors on a single machine.

import faiss
import numpy as np

# Create a simple flat (brute-force) index
dimension = 384
index = faiss.IndexFlatL2(dimension)  # L2 is Euclidean; on normalized vectors it ranks like cosine

# Convert embeddings to float32 (required by FAISS)
embeddings_float32 = embeddings.astype('float32')

# Add vectors to index
index.add(embeddings_float32)

# Create a mapping from FAISS index position to document ID and metadata
doc_metadata = {
    i: {
        "id": doc["id"],
        "title": doc["title"],
        "category": doc["category"]
    }
    for i, doc in enumerate(documents)
}

# Query
query_text = "best small dogs for apartments"
query_embedding = model.encode(query_text, normalize_embeddings=True).astype('float32').reshape(1, -1)

distances, indices = index.search(query_embedding, k=3)

for idx, distance in zip(indices[0], distances[0]):
    metadata = doc_metadata[int(idx)]
    # IndexFlatL2 returns squared L2 distances. For normalized vectors,
    # squared_distance = 2 * (1 - cosine_sim), so convert back to cosine:
    cosine_sim = 1 - distance / 2
    print(f"ID: {metadata['id']}, Cosine Similarity: {cosine_sim:.3f}")
    print(f"  Title: {metadata['title']}")

# Save index to disk for later loading
faiss.write_index(index, "apartment_dogs_index.faiss")

Cost: Free. The trade-off is that the whole index must fit on a single machine.
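An L2 index can stand in for cosine search because, for unit-length vectors, squared Euclidean distance and cosine similarity are tied by ||a − b||² = 2(1 − cos(a, b)). FAISS's IndexFlatL2 reports the squared distance, so the conversion back is cos = 1 − d/2 with no square root involved. The sketch below checks the identity in plain Python on two made-up vectors; the helper names are illustrative only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two arbitrary example vectors, normalized to unit length
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cosine = dot(a, b)
recovered = 1 - squared_l2(a, b) / 2  # cosine recovered from the squared distance
print(f"cosine={cosine:.6f} recovered={recovered:.6f}")
```

The two values agree up to floating-point error, which is why ranking by ascending L2 distance on normalized vectors gives the same order as ranking by descending cosine similarity.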

Option C: Weaviate (Self-Hosted, Balanced)

Weaviate is an open-source vector database. Deploy it locally with Docker or in Kubernetes.

import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import MetadataQuery
from weaviate.util import generate_uuid5

# Connect to Weaviate (assumes Weaviate is running in Docker with
# HTTP port 8080 and gRPC port 50051 exposed; uses the v4 Python client)
client = weaviate.connect_to_local()

# Define the collection (called a "class" in older client versions)
collection = client.collections.create(
    name="Document",
    description="A document with embedding",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
    ],
    # We provide vectors manually (newer client releases may offer a renamed option)
    vectorizer_config=Configure.Vectorizer.none()
)

# Add documents with vectors
for doc, embedding in zip(documents, embeddings):
    collection.data.insert(
        properties={
            "title": doc["title"],
            "content": doc["content"],
            "category": doc["category"]
        },
        # Weaviate requires UUIDs, so derive a deterministic one from the document ID
        uuid=generate_uuid5(doc["id"]),
        vector=embedding.tolist()
    )

# Query
query_text = "best small dogs for apartments"
query_embedding = model.encode(query_text, normalize_embeddings=True).tolist()

response = collection.query.near_vector(
    near_vector=query_embedding,
    limit=3,
    return_metadata=MetadataQuery(distance=True)
)

for obj in response.objects:
    print(f"Title: {obj.properties['title']}, Distance: {obj.metadata.distance:.3f}")

# Release the connection when done
client.close()

Cost: Free software; hosting cost depends on your infrastructure (~$200–1,000/month for modest load).

Step 4: Query the Index

Once the index is built, querying is simple:

  1. Encode the query text.
  2. Search the index for nearest neighbors.
  3. Retrieve and rank results.
  4. Return with metadata.

A complete query loop (including ranking and filtering) in Pinecone:

def semantic_search(query_text, top_k=5, filter_category=None):
    # Encode query
    query_embedding = model.encode(query_text, normalize_embeddings=True).tolist()

    # Search index
    search_filter = None
    if filter_category:
        search_filter = {"category": {"$eq": filter_category}}

    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        filter=search_filter
    )

    # Format results
    ranked_results = []
    for match in results["matches"]:
        ranked_results.append({
            "id": match["id"],
            "score": match["score"],
            "title": match["metadata"]["title"],
            "category": match["metadata"]["category"],
            "preview": match["metadata"]["text_preview"]
        })

    return ranked_results

# Usage
results = semantic_search("small dogs for city living", top_k=3, filter_category="pets")
for result in results:
    print(f"{result['title']} (score: {result['score']:.3f})")
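To see the same steps without a hosted service, the sketch below runs filter, score, rank, and return over a tiny in-memory list. It is a brute-force illustration, not how a vector database works internally; the 2-dimensional vectors are made up and stand in for real embeddings, and `search` is a hypothetical helper.

```python
def search(query_vector, index_rows, top_k=5, filter_category=None):
    """Brute-force search: filter by metadata, score by dot product, rank, truncate."""
    hits = []
    for row in index_rows:
        if filter_category and row["category"] != filter_category:
            continue  # metadata filter
        # Dot product equals cosine similarity when both vectors are unit length
        score = sum(q * v for q, v in zip(query_vector, row["vector"]))
        hits.append({"id": row["id"], "title": row["title"], "score": score})
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]


# Toy unit vectors standing in for real embeddings (made up for illustration)
toy_index = [
    {"id": "doc_001", "title": "Best Dogs for Apartment Living", "category": "pets", "vector": [0.8, 0.6]},
    {"id": "doc_002", "title": "Training Your Golden Retriever", "category": "training", "vector": [0.6, 0.8]},
    {"id": "doc_003", "title": "Apartment Living Guide", "category": "lifestyle", "vector": [0.0, 1.0]},
]
toy_query = [1.0, 0.0]  # the query is assumed to be encoded already

for hit in search(toy_query, toy_index, top_k=2):
    print(hit["id"], round(hit["score"], 3))
```

A version like this is also handy in unit tests, where calling a real index would be slow or require credentials.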

Indexing Algorithm and Index Size

Different algorithms optimize for different corpus sizes:

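  • Flat (brute-force): Up to about 1M vectors. Query time O(N×d); every query scans all vectors, so results are exact.
  • HNSW (Hierarchical Navigable Small World): Up to about 100M vectors. Query time is roughly O(log N) in practice; fast, but the graph is held in memory.
  • IVF (Inverted File): Up to about 10B vectors. A query is compared against the cluster centroids, then only the nprobe closest clusters are scanned, so cost is roughly O((num_clusters + nprobe × N / num_clusters) × d); memory-efficient.
  • Product Quantization + IVF: Up to about 100B vectors. Queries stay fast because vectors are stored as compact codes; recall is slightly lower due to compression.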

Pinecone and Weaviate abstract algorithm choice; FAISS requires manual selection.
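Memory is often what forces the move from one algorithm to the next. The sketch below estimates raw storage for uncompressed float32 vectors and for product-quantized codes; the function names and the choice of 48 sub-quantizers are assumptions for illustration, and real indexes add overhead (HNSW graph links, IVF lists, codebooks) on top of these figures.

```python
def flat_index_bytes(num_vectors, dimension, bytes_per_value=4):
    """Raw storage for uncompressed vectors (float32 = 4 bytes per value)."""
    return num_vectors * dimension * bytes_per_value


def pq_code_bytes(num_vectors, num_subquantizers, bits_per_code=8):
    """Storage for product-quantized codes only (codebooks and lists excluded)."""
    return num_vectors * num_subquantizers * bits_per_code // 8


GB = 1e9
for n in (1_000_000, 100_000_000):
    flat = flat_index_bytes(n, 384) / GB
    pq = pq_code_bytes(n, num_subquantizers=48) / GB
    print(f"{n:>11,} vectors: flat {flat:.2f} GB, PQ codes {pq:.3f} GB")
```

For 384-dimensional vectors, the uncompressed data alone is about 1.5 GB per million vectors, which is why flat and HNSW indexes become expensive well before the billion-vector range.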

Key Takeaways

  • A vector search index has five components: corpus, embedding model, vector database, indexing algorithm, and query pipeline.
  • Embed documents first: Use a pre-trained model (all-MiniLM-L6-v2 for speed, text-embedding-3-small for quality).
  • Choose a database: Pinecone for ease, FAISS for control and cost, Weaviate for balance.
  • Normalize embeddings: Essential for cosine similarity; ensure normalize_embeddings=True in encoding.
  • Batch encode: 32–128 documents per batch to leverage GPU; 100K documents take 5–10 minutes on a modest GPU.

Frequently Asked Questions

How long does it take to build an index of 1 million vectors?

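Embedding: roughly 1–2 hours on a modest GPU at the throughput described in Step 2; 20–40 minutes is realistic only on faster hardware. Uploading to a cloud database (Pinecone): 10–20 minutes with batched API calls. Building a local FAISS index from embeddings that are already computed: about 5 minutes. In every case, embedding dominates the total.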

Can I update vectors in the index without rebuilding?

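Yes. Upsert new or modified vectors by ID. Pinecone and Weaviate accept upserts without rebuilding the index. FAISS can append new vectors with add(), but changing or deleting existing vectors requires an ID-mapped index (for example IndexIDMap with remove_ids on a flat index) or a rebuild, which takes minutes. Many systems run incremental upserts daily and full rebuilds weekly.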
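Upsert means "insert if the ID is new, overwrite if it already exists". The toy class below illustrates only those semantics with a plain dictionary; `TinyVectorStore` is hypothetical and does not reflect how any of the databases above store or index vectors.

```python
class TinyVectorStore:
    """Hypothetical in-memory store that illustrates upsert-by-ID semantics only."""

    def __init__(self):
        self._rows = {}  # vector ID -> (vector, metadata)

    def upsert(self, vector_id, vector, metadata=None):
        """Insert a new vector or overwrite the one already stored under this ID."""
        existed = vector_id in self._rows
        self._rows[vector_id] = (list(vector), dict(metadata or {}))
        return "updated" if existed else "inserted"

    def delete(self, vector_id):
        """Remove a vector; return True if it was present."""
        return self._rows.pop(vector_id, None) is not None

    def __len__(self):
        return len(self._rows)


store = TinyVectorStore()
print(store.upsert("doc_001", [0.1, 0.9], {"category": "pets"}))
print(store.upsert("doc_001", [0.2, 0.8], {"category": "pets"}))
print(len(store))
```

Because the second call reuses the ID, the store still holds one vector afterwards. Stable document IDs are what make incremental updates safe to repeat.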

What is the query latency?

Pinecone: 50–200 ms (includes API overhead). FAISS: 1–10 ms (local). Weaviate: 10–100 ms (depends on deployment).

Do I need to normalize embeddings when building an index?

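Yes, if you want cosine ranking from an index that computes inner product or L2 distance. Some models and databases normalize automatically (for example, an index created with a cosine metric), but verify this rather than assume it, and apply the same treatment to query vectors as to document vectors.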
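Verification can be as simple as checking that each vector's L2 norm is close to 1 before loading it. The following is a minimal sketch; `find_unnormalized` and its tolerance are assumptions for illustration.

```python
import math


def find_unnormalized(vectors, tolerance=1e-3):
    """Return positions of vectors whose L2 norm differs from 1 by more than tolerance."""
    bad = []
    for position, vector in enumerate(vectors):
        norm = math.sqrt(sum(x * x for x in vector))
        if abs(norm - 1.0) > tolerance:
            bad.append(position)
    return bad


# The middle vector has norm 5, so it is flagged
batch = [[0.6, 0.8], [3.0, 4.0], [1.0, 0.0]]
print(find_unnormalized(batch))
```

Running the check on a sample of the batch is usually enough to catch a pipeline that forgot the normalization step.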

How do I handle documents longer than the embedding model's token limit?

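Chunk documents (e.g., 512 tokens per chunk), embed each chunk separately, and retrieve the top-k chunks rather than whole documents. Then group the hits by parent document and keep the best-scoring chunk for each, so the same document does not appear several times in the results.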
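A minimal sketch of both halves follows. It splits on whitespace, so words only approximate the model's tokens; a real pipeline would count tokens with the model's tokenizer. The helper names and the sample hits are made up for illustration.

```python
def chunk_words(text, chunk_size=512, overlap=0):
    """Split text into chunks of at most chunk_size words (words approximate tokens)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def best_chunk_per_document(chunk_hits):
    """Keep the highest-scoring chunk for each parent document, best first."""
    best = {}
    for hit in chunk_hits:
        current = best.get(hit["doc_id"])
        if current is None or hit["score"] > current["score"]:
            best[hit["doc_id"]] = hit
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)


chunks = chunk_words("one two three four five six seven", chunk_size=3)
print(chunks)

# Made-up search hits: two chunks come from the same document
hits = [
    {"doc_id": "doc_001", "chunk": 0, "score": 0.71},
    {"doc_id": "doc_002", "chunk": 3, "score": 0.64},
    {"doc_id": "doc_001", "chunk": 2, "score": 0.83},
]
print([(h["doc_id"], h["chunk"]) for h in best_chunk_per_document(hits)])
```

Storing the parent document ID and chunk position in each vector's metadata is what makes the grouping step possible at query time.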

Further Reading