Vector Database Schema Design: Step-by-Step

A vector database schema defines how you structure embeddings, metadata, and collections to enable efficient search and filtering in production. Unlike relational databases with fixed column schemas, vector databases store flexible payloads (metadata objects) alongside vectors. Designing schemas poorly leads to slow queries, failed filters, and operational headaches; designing them well enables fast, precise retrieval at scale.

Core Schema Components: Vectors, Payloads, and Collections

Every stored item in a vector database has three parts:

Vector: The embedding (e.g., 384-dimensional float array from a sentence encoder).

Payload (metadata): An object containing searchable and displayable data: document ID, source URL, creation date, category, author, rating, etc.

Point ID: A unique identifier for the vector-payload pair within a collection.

A collection groups related vectors. You might have one collection per tenant (SaaS), per data source (documents, images), or per application (search, recommendations).

Example structure for a document retrieval system:

{
  "id": 12345,
  "vector": [0.23, -0.41, 0.18, ...],
  "payload": {
    "doc_id": "doc_xyz_789",
    "title": "Vector Databases in Production",
    "content_chunk": "A vector database stores high-dimensional...",
    "source_url": "https://example.com/blog/vectors",
    "author": "Dr. Alex Turner",
    "created_at": "2026-05-20T10:30:00Z",
    "category": "tutorial",
    "tags": ["databases", "ai", "production"],
    "chunk_index": 2,
    "rating": 4.8
  }
}
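
Before upserting, it can help to check that each item really has the three parts described above. The following is a minimal sketch in plain Python, not part of any client library; `validate_point` and `EXPECTED_DIM` are hypothetical names, and the 384-dimension default is an assumption taken from the example encoder.

```python
import math

EXPECTED_DIM = 384  # assumed embedding size; match it to your encoder


def validate_point(point: dict, dim: int = EXPECTED_DIM) -> list:
    """Return a list of problems found in a point dict (empty if none)."""
    problems = []
    if "id" not in point:
        problems.append("missing id")
    vector = point.get("vector")
    if not isinstance(vector, list) or len(vector) != dim:
        problems.append(f"vector must be a list of {dim} numbers")
    elif not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        problems.append("vector contains non-numeric or non-finite values")
    if not isinstance(point.get("payload", {}), dict):
        problems.append("payload must be an object")
    return problems


good = {"id": 12345, "vector": [0.0] * 384, "payload": {"doc_id": "doc_xyz_789"}}
bad = {"vector": [0.1, 0.2, 0.3], "payload": "not a dict"}
print(validate_point(good))
print(validate_point(bad))
```

Running a check like this at ingestion time surfaces dimension mismatches early, before they show up as rejected upserts.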

Designing Payload Structure for Efficient Filtering

Payload fields are queryable, and indexing the ones you filter on keeps those filters fast. Choose them based on:

  1. What will users filter by? If your search interface allows "filter by date," "filter by category," or "filter by author," those fields must exist in the payload.

  2. What data must you return? Payloads are returned with search results. Include user-facing metadata (title, URL, rating) alongside internal metadata (document ID for logging).

  3. Cardinality: Consider how many unique values a field has. Low-cardinality fields (a handful of values, such as the categories "tech", "news", and "opinion") have small, cheap indexes. High-cardinality fields (millions of unique values) produce larger indexes that cost more memory, so index them only if you actually filter on them.

  4. Data type: Use appropriate types in your vector database to enable correct filtering:

    • Text (string): Full-text search or exact match filtering.
    • Number (int, float): Range queries (rating >= 4.0).
    • Keyword (enum): Exact matching only. More efficient than text for categories.
    • Datetime: Timestamp filtering (created_at > "2026-01-01").
    • Boolean: Simple flags (e.g., is_premium: true).
    • Array (nested): Lists of tags or categories.
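
One way to keep payload types consistent is to check them against a small schema before ingestion. The sketch below is plain Python under stated assumptions: `PAYLOAD_SCHEMA`, `CHECKS`, and `type_errors` are hypothetical names, and the field list simply mirrors this article's example payload.

```python
from datetime import datetime

# Hypothetical schema for this article's example payload: field -> expected kind.
PAYLOAD_SCHEMA = {
    "category": "keyword",
    "rating": "float",
    "created_at": "datetime",
    "is_premium": "bool",
    "tags": "keyword[]",
}


def _is_iso_datetime(value) -> bool:
    """True for ISO 8601 strings such as 2026-05-20T10:30:00Z."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


CHECKS = {
    "keyword": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "datetime": _is_iso_datetime,
    "bool": lambda v: isinstance(v, bool),
    "keyword[]": lambda v: isinstance(v, list) and all(isinstance(x, str) for x in v),
}


def type_errors(payload: dict) -> dict:
    """Map each mistyped field to the kind it should have; absent fields are skipped."""
    return {
        field: kind
        for field, kind in PAYLOAD_SCHEMA.items()
        if field in payload and not CHECKS[kind](payload[field])
    }


print(type_errors({"category": "tutorial", "rating": "4.8", "tags": ["ai"]}))
```

A rating stored as the string "4.8" would silently fail a numeric range filter, which is the kind of mistake this check is meant to catch.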

Example Payload Schema in Qdrant

Here is a Qdrant payload definition for a document search system:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PayloadSchemaType, VectorParams

client = QdrantClient("localhost", port=6333)

# Create a collection; only the vector configuration is declared up front
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Index specific payload fields for filtering efficiency
client.create_payload_index(
    collection_name="documents",
    field_name="created_at",
    field_schema=PayloadSchemaType.DATETIME,
)

client.create_payload_index(
    collection_name="documents",
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD,
)

client.create_payload_index(
    collection_name="documents",
    field_name="rating",
    field_schema=PayloadSchemaType.FLOAT,
)

Then, when upserting a point:

from datetime import datetime, timezone

from qdrant_client.models import PointStruct

client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=[0.23, -0.41, 0.18, ...],  # 384-dim embedding
            payload={
                "doc_id": "doc_xyz_789",
                "title": "Vector Databases in Production",
                "source_url": "https://example.com/blog/vectors",
                # Store timestamps as RFC 3339 strings so they serialize cleanly
                "created_at": datetime(2026, 5, 20, 10, 30, 0, tzinfo=timezone.utc).isoformat(),
                "category": "tutorial",
                "rating": 4.8,
                "tags": ["databases", "ai", "production"],
            },
        )
    ],
)

Vector databases support filtering at query time: return the top-k vectors matching a query embedding and satisfying metadata constraints. This enables precise, context-aware search.

For example, filter document search to only recent tutorials with high ratings:

from datetime import datetime

from qdrant_client.models import (
    DatetimeRange, FieldCondition, Filter, MatchValue, Range
)

# Note: recent qdrant-client releases deprecate search() in favor of query_points()
results = client.search(
    collection_name="documents",
    query_vector=[0.1, 0.2, 0.3, ...],  # "vector database tutorial" embedding
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="tutorial"),
            ),
            FieldCondition(
                key="created_at",
                range=DatetimeRange(
                    gte=datetime(2026, 1, 1),
                    lte=datetime(2026, 6, 2),
                ),
            ),
            FieldCondition(
                key="rating",
                range=Range(gte=4.0),
            ),
        ]
    ),
    limit=5,
)

This query returns the 5 points semantically closest to the query embedding for "vector database tutorial", restricted to documents that are tutorials, were created between January 1 and June 2, 2026, and have a rating of at least 4.0.
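
To make the semantics of filtered search concrete, here is a brute-force reference in plain Python. It is only a sketch: real engines use approximate nearest-neighbor indexes and integrate the filter into the index traversal, and `filtered_top_k`, `wanted`, and the toy 2-dimensional vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_top_k(points, query, predicate, k):
    """Keep points whose payload passes the predicate, then rank by similarity."""
    candidates = [p for p in points if predicate(p["payload"])]
    candidates.sort(key=lambda p: cosine(p["vector"], query), reverse=True)
    return candidates[:k]


points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"category": "tutorial", "rating": 4.8}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"category": "news", "rating": 4.9}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"category": "tutorial", "rating": 4.2}},
    {"id": 4, "vector": [0.8, 0.2], "payload": {"category": "tutorial", "rating": 3.1}},
]


def wanted(payload):
    """Same constraints as the query above, minus the date range."""
    return payload["category"] == "tutorial" and payload["rating"] >= 4.0


hits = filtered_top_k(points, [1.0, 0.0], wanted, k=5)
print([p["id"] for p in hits])  # [1, 3]
```

Point 2 is very close to the query but is excluded by the category filter, and asking for 5 results returns only 2 because only 2 points satisfy the constraints.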

Structuring Nested Payloads and Arrays

For complex metadata, nest objects and use arrays:

payload = {
    "doc_id": "doc_123",
    "title": "Advanced Vector Search",
    "author": {
        "name": "Alice Chen",
        "email": "[email protected]",
        "affiliation": "TechCorp",
    },
    "tags": ["vector search", "optimization", "production"],
    "metadata": {
        "version": 2,
        "reviewed": True,
        "review_date": "2026-05-15",
    },
}

Filter on nested fields:

Filter(
    must=[
        FieldCondition(
            key="author.affiliation",
            match=MatchValue(value="TechCorp"),
        ),
        FieldCondition(
            key="tags",
            match=MatchValue(value="production"),
        ),
    ]
)
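
The two conditions above rely on dotted keys reaching into nested objects and on a value match succeeding when any array element equals the value. A rough plain-Python model of that behavior, with hypothetical helper names `resolve` and `matches_value`, might look like this:

```python
def resolve(payload, dotted_key):
    """Follow a dotted key such as 'author.affiliation' into nested dicts."""
    current = payload
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # missing path: nothing to match against
        current = current[part]
    return current


def matches_value(payload, dotted_key, expected):
    """True if the field equals expected, or is a list containing expected."""
    found = resolve(payload, dotted_key)
    if isinstance(found, list):
        return expected in found
    return found == expected


doc = {
    "author": {"name": "Alice Chen", "affiliation": "TechCorp"},
    "tags": ["vector search", "optimization", "production"],
}
print(matches_value(doc, "author.affiliation", "TechCorp"))
print(matches_value(doc, "tags", "production"))
```

This is useful mainly for unit-testing filter logic offline; the database remains the source of truth for how it interprets keys.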

Payload Size and Storage Implications

Each payload is stored alongside the vector and returned in search results. Large payloads increase latency and storage. Best practices:

  1. Minimize payload size: Store only essential metadata. If you need large unstructured data (full document text), store it separately in blob storage (S3, GCS) and include only a reference (URL, doc_id) in the payload.

  2. Trim large strings: For strings over about 1 KB, store a truncated preview (for display) or a hash (for deduplication) instead of the full text.

  3. Denormalize intelligently: Include frequently-displayed metadata (author, title, URL, rating) in the payload. Avoid including data you never display or filter on.

Example for a document RAG system:

# Inefficient: large payload
payload_bad = {
    "doc_id": "doc_123",
    "full_document_text": "...",  # 10 KB of text
    "summary": "...",  # 2 KB
    "metadata_json": "...",  # 5 KB
}

# Efficient: reference external storage
payload_good = {
    "doc_id": "doc_123",
    "title": "Advanced Vector Search",
    "s3_url": "s3://my-bucket/doc_123.txt",
    "chunk_index": 2,  # which chunk of the document this vector represents
    "created_at": "2026-05-20",
}
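
A quick way to spot oversized payloads is to measure their JSON encoding before ingestion. This is a minimal sketch: the JSON byte count only approximates what a database actually stores, and `payload_bytes`, `oversized_fields`, and the 1 KB default limit are assumptions for illustration.

```python
import json


def payload_bytes(payload: dict) -> int:
    """Approximate size as the length of the compact UTF-8 JSON encoding."""
    encoded = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))


def oversized_fields(payload: dict, limit: int = 1024) -> list:
    """Names of fields whose encoding (key included) exceeds limit bytes."""
    return [key for key, value in payload.items() if payload_bytes({key: value}) > limit]


candidate = {
    "doc_id": "doc_123",
    "title": "Advanced Vector Search",
    "full_document_text": "x" * 10_000,  # stand-in for 10 KB of text
}
print(payload_bytes(candidate))
print(oversized_fields(candidate))
```

Fields flagged here are candidates for moving to blob storage and replacing with a reference.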

Collections, Namespaces, and Multi-Tenancy

Most vector databases support logical partitioning:

  • Collections (Qdrant, Milvus): Create separate collections per data source or per tenant. Each collection has its own indexes and can scale independently.

  • Namespaces (Pinecone) and tenants (Weaviate): Within a single index or collection, partition vectors by namespace or tenant (e.g., one per customer or per project).

For a multi-tenant SaaS RAG system:

# Create a collection per tenant
client.create_collection(
    collection_name=f"documents_tenant_{tenant_id}",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Or use a shared collection with tenant_id in payload
payload = {
    "tenant_id": "acme_corp",
    "doc_id": "doc_123",
    "title": "Q2 2026 Report",
}

# Filter by tenant during search
Filter(
    must=[
        FieldCondition(
            key="tenant_id",
            match=MatchValue(value="acme_corp"),
        )
    ]
)
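
With a shared collection, the tenant condition has to be present on every query, so it is safer to add it in one place than to trust each call site. The sketch below builds a filter as a plain dict in the shape of Qdrant's JSON filter syntax; `tenant_scoped_filter` is a hypothetical helper, not a library function.

```python
from typing import Optional


def tenant_scoped_filter(tenant_id: str, user_filter: Optional[dict] = None) -> dict:
    """Return a filter dict whose `must` list always includes the tenant condition."""
    tenant_condition = {"key": "tenant_id", "match": {"value": tenant_id}}
    scoped = {"must": [tenant_condition]}
    if user_filter:
        # Nest the caller's filter as one condition so it cannot widen the tenant scope
        scoped["must"].append(user_filter)
    return scoped


user_filter = {
    "should": [
        {"key": "category", "match": {"value": "tutorial"}},
        {"key": "category", "match": {"value": "guide"}},
    ]
}
print(tenant_scoped_filter("acme_corp", user_filter))
```

Because the user's filter is nested under `must`, even an OR-heavy user filter is still ANDed with the tenant condition.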

Schema Evolution in Production

Vector database schemas are flexible, but evolving them requires care:

  1. Adding fields: Upsert points with new fields; old points without the new field are unaffected.

  2. Removing fields: Simply stop including them in new upserts; old points retain the field but it is ignored.

  3. Changing data types: Not directly supported. Create a new collection with the new schema and re-ingest data.

  4. Indexing new payload fields: Create a payload index on the field; existing points are indexed automatically in the background.

Plan schema at design time to minimize changes in production.
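
When a type change does force a re-ingest, a small pure function that maps old payloads to the new schema keeps the migration testable. This is a sketch under assumptions: the string-to-float `rating` change, the `is_premium` default, and the `legacy_score` field are all hypothetical examples.

```python
def migrate_payload(old: dict) -> dict:
    """Build the new-schema payload from an old one without mutating the input."""
    new = dict(old)
    # Hypothetical type change: rating was stored as a string, new schema wants a float
    if isinstance(new.get("rating"), str):
        new["rating"] = float(new["rating"])
    # Hypothetical added field: give old points an explicit default
    new.setdefault("is_premium", False)
    # Hypothetical removed field: drop it instead of carrying dead data forward
    new.pop("legacy_score", None)
    return new


old_payload = {"doc_id": "doc_123", "rating": "4.8", "legacy_score": 7}
print(migrate_payload(old_payload))
```

During re-ingestion, each point read from the old collection would pass through a function like this before being upserted into the new one.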

Key Takeaways

  • Design payloads to match your query requirements: filter by fields users care about, return fields users see.
  • Index the fields you filter on (categories, dates, ratings) for fast filtering; skip indexes on fields you never filter by, especially high-cardinality ones.
  • Minimize payload size by storing large data externally (S3) and including only references.
  • Use nested objects and arrays for complex metadata; filter on nested fields naturally.
  • Plan multi-tenancy via collections or namespace+payload filters; isolate critical workloads.

Frequently Asked Questions

Should I store the document ID in the payload or use it as the point ID?

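Use the point ID as the vector's unique identifier and store the document ID in the payload. Accepted ID types vary by database: Qdrant takes unsigned integers or UUIDs, Milvus takes integer or string primary keys, Weaviate uses UUIDs, and Pinecone uses strings. A document is usually split into several chunks, each with its own point, so the document ID alone cannot serve as the point ID. The document ID (which might be a UUID or slug) is user-facing and should live in the payload so you can return it in search results and log it.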
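
One common pattern is to derive the point ID deterministically from the document ID and chunk index, so that re-ingesting the same chunk overwrites the same point. The sketch below uses Python's standard `uuid.uuid5`; `POINT_NAMESPACE` and `point_id` are hypothetical names, and the namespace URL is a placeholder.

```python
import uuid

# Hypothetical namespace: any fixed UUID works as long as it never changes
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/points")


def point_id(doc_id: str, chunk_index: int) -> str:
    """Deterministic UUID string for one chunk of one document."""
    return str(uuid.uuid5(POINT_NAMESPACE, f"{doc_id}:{chunk_index}"))


first = point_id("doc_xyz_789", 2)
again = point_id("doc_xyz_789", 2)
other = point_id("doc_xyz_789", 3)
print(first == again, first == other)  # True False
```

This suits databases that accept UUID point IDs; for integer-only IDs you would need a different mapping.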

How do I filter on multiple conditions (AND, OR)?

In Qdrant, use Filter with must (AND), should (OR), and must_not (NOT); most vector databases support equivalent boolean filter expressions. For example, (category == "tutorial" OR category == "guide") AND rating >= 4.0 becomes a must list containing two conditions: a nested filter whose should list holds the two category matches, and a range condition rating >= 4.0. The exact syntax depends on the API.
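
As an illustration, the expression above can be written as a plain dict in the shape of Qdrant's JSON filter syntax, where a nested filter appears as one condition inside `must`. Treat this as a sketch and check it against your client version before relying on it.

```python
import json

# (category == "tutorial" OR category == "guide") AND rating >= 4.0
query_filter = {
    "must": [
        {
            # Nested filter: at least one of these must match
            "should": [
                {"key": "category", "match": {"value": "tutorial"}},
                {"key": "category", "match": {"value": "guide"}},
            ]
        },
        {"key": "rating", "range": {"gte": 4.0}},
    ]
}

print(json.dumps(query_filter, indent=2))
```

For an OR over values of a single field, Qdrant also offers an "any" match, which expresses the same category condition more compactly.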

Can I update a payload without re-uploading the vector?

Yes, in most vector databases. Qdrant, for example, provides set_payload to update payload fields by point ID, so you do not need to fetch the point and re-upsert it. The vector is not re-indexed unless you explicitly upload a new one.

What happens if I filter for conditions no vectors match?

The database typically returns an empty result set rather than an error. Because an empty result can also hide a mistake, such as a misspelled field name or a value of the wrong type, test your filters in development to ensure they return the expected results.

Further Reading