Vector Embeddings for Long-Term Memory: Step-by-Step
Vector embeddings transform unstructured text (user queries, agent responses, events) into high-dimensional vectors, enabling semantic search: "find past interactions similar to the current task" without exact keyword matching. A vector-backed memory system lets agents ask "Have I seen a similar issue before?" in milliseconds, even with millions of stored events.
What Vector Embeddings Are and How They Enable Memory Retrieval
A vector embedding is a numerical representation of text in a fixed-dimensional space (e.g., 1,536 dimensions for OpenAI's text-embedding-3-small, 3,072 for text-embedding-3-large). Similar texts have embeddings that are close together (small cosine distance), while dissimilar texts are far apart. For example:
- "User wants to export invoice as PDF" → [0.12, -0.45, 0.89, ..., 0.23] (1536 dims)
- "User requests invoice export in PDF format" → [0.13, -0.44, 0.88, ..., 0.24] (very similar)
- "System error: database timeout" → [-0.78, 0.32, -0.45, ..., -0.67] (very different)
This similarity enables retrieval without scripting every keyword combination. An agent facing a new task can embed it and retrieve the N most similar past events, which gives it context specific to that user and task.
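To make "close together" concrete, here is a minimal sketch of cosine similarity in plain Python. The 4-dimensional vectors are illustrative stand-ins built from the visible components of the examples above, not real model output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional stand-ins for real embeddings (illustrative values only).
export_pdf = [0.12, -0.45, 0.89, 0.23]
export_pdf_reworded = [0.13, -0.44, 0.88, 0.24]
db_timeout = [-0.78, 0.32, -0.45, -0.67]

sim_close = cosine_similarity(export_pdf, export_pdf_reworded)  # near 1.0
sim_far = cosine_similarity(export_pdf, db_timeout)             # negative

print(f"reworded request: {sim_close:.3f}")
print(f"unrelated error:  {sim_far:.3f}")
```

Cosine distance is simply 1 minus this value, so the two invoice requests end up at a distance close to 0.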
According to a 2025 analysis of production RAG systems (Hugging Face/OpenAI), semantic search reduced irrelevant retrieval by 73% and improved response quality by 31% versus keyword-only indexing. Vector-backed episodic memory is now table stakes for production agents.
Embedding Models and Selection
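Embedding models vary in quality, speed, and cost. Common options (list prices change often, so check current pricing):
- text-embedding-3-large (OpenAI): 3,072 dims, the highest quality of OpenAI's embedding models, ~$0.13 per 1M tokens.
- text-embedding-3-small (OpenAI): 1,536 dims by default (can be shortened, e.g., to 512, with the dimensions parameter), faster, ~$0.02 per 1M tokens.
- Sentence-Transformers (open-source): free, moderate quality, deployable locally.
- Cohere embeddings (embed v3): 1,024 dims, good quality, roughly $0.10 per 1M tokens.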
For agent memory, choose based on: (1) quality (use the strongest model your budget allows), (2) latency (small models are faster), and (3) cost (amortized across many retrievals). A practical approach is to start with a small model and move to a larger one later. Vectors from different models are not comparable, so a switch means re-embedding every stored record and embedding all queries with the new model.
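The cost criterion is easy to check with back-of-the-envelope arithmetic. The sketch below is a hypothetical helper, not part of any SDK; the 150 tokens per event is an assumed average, and the price is the ~$0.02 per 1M tokens quoted above for text-embedding-3-small.

```python
def monthly_embedding_cost(
    events_per_day: int,
    avg_tokens_per_event: int,
    price_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Estimated embedding spend in dollars for one month of writes."""
    total_tokens = events_per_day * avg_tokens_per_event * days
    return total_tokens / 1_000_000 * price_per_million_tokens


# Assumed workload: 1,000 events per day, about 150 tokens per event summary.
cost_small = monthly_embedding_cost(
    events_per_day=1_000,
    avg_tokens_per_event=150,
    price_per_million_tokens=0.02,
)
print(f"Estimated monthly cost: ${cost_small:.2f}")
```

At this volume the write-side cost is a few cents per month, so latency and retrieval quality are likely to matter more than price until the event rate grows by several orders of magnitude.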
# Example: Embedding episodic records with OpenAI API
import openai
from typing import List, Dict

class VectorizedMemory:
    def __init__(self, api_key: str, model: str = "text-embedding-3-small"):
        self.client = openai.OpenAI(api_key=api_key)
        self.model = model
        self.embeddings_cache = {}  # { event_id: embedding_vector }

    def embed_text(self, text: str) -> List[float]:
        """Generate embedding for a text string."""
        response = self.client.embeddings.create(
            input=text,
            model=self.model
        )
        return response.data[0].embedding

    def embed_event(self, event_id: str, event_text: str):
        """Embed and cache an episodic event."""
        embedding = self.embed_text(event_text)
        self.embeddings_cache[event_id] = embedding
        return embedding

    def get_event_summary(self, event: Dict) -> str:
        """
        Convert an episodic record into a text string for embedding.
        Includes user query, agent action, and outcome.
        """
        parts = [
            f"User: {event['input_data'].get('message', '')[:200]}",
            f"Agent action: {event['output_data'].get('action', '')}",
            f"Result: {event['output_data'].get('result', '')}",
            f"Type: {event['event_type']}"
        ]
        return "; ".join(parts)
Vector Databases and Similarity Search
Storing and searching millions of vectors requires specialized infrastructure. Full-scan similarity search (computing the distance from the query to every stored vector) is O(n) per query and impractical at scale. Vector databases use approximate nearest neighbor (ANN) indices (e.g., HNSW, IVF) to achieve sublinear lookup time (roughly O(log n) for HNSW) in exchange for a small loss in recall.
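The idea behind IVF can be shown with a toy sketch: bucket vectors by their nearest centroid, then scan only the buckets closest to the query. This is an illustration under simplifying assumptions, not how any particular database implements it. The class name ToyIVFIndex is hypothetical, and the centroids are passed in by hand, whereas real systems learn them (typically with k-means).

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class ToyIVFIndex:
    """Inverted-file index: each vector is stored in the bucket of its nearest centroid."""

    def __init__(self, centroids: List[Vector]):
        if not centroids:
            raise ValueError("at least one centroid is required")
        self.centroids = centroids
        self.buckets: Dict[int, List[Tuple[str, Vector]]] = defaultdict(list)

    def _rank_centroids(self, vector: Vector) -> List[int]:
        """Centroid indices ordered from most to least similar to the vector."""
        return sorted(
            range(len(self.centroids)),
            key=lambda i: cosine_similarity(vector, self.centroids[i]),
            reverse=True,
        )

    def add(self, item_id: str, vector: Vector) -> None:
        nearest = self._rank_centroids(vector)[0]
        self.buckets[nearest].append((item_id, vector))

    def search(self, query: Vector, top_k: int = 3, nprobe: int = 1) -> List[Tuple[str, float]]:
        """Scan only the nprobe closest buckets instead of every stored vector."""
        candidates = [
            (item_id, cosine_similarity(query, vector))
            for centroid_index in self._rank_centroids(query)[:nprobe]
            for item_id, vector in self.buckets.get(centroid_index, [])
        ]
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        return candidates[:top_k]


index = ToyIVFIndex(centroids=[[1.0, 0.0], [0.0, 1.0]])
index.add("a", [0.9, 0.1])
index.add("b", [0.8, 0.3])
index.add("c", [0.1, 0.9])

narrow = index.search([1.0, 0.05], top_k=3, nprobe=1)  # scans one bucket
wide = index.search([1.0, 0.05], top_k=3, nprobe=2)    # scans both buckets
```

Raising nprobe trades speed for recall: with nprobe=1 the search never sees "c", which is the approximation that makes the lookup sublinear.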
Common vector databases:
- Pinecone (managed, serverless): easiest to deploy, good for prototypes.
- Weaviate (open-source or managed): flexible, supports both vector and keyword search.
- Qdrant (open-source or managed): high-performance, Rust-based.
- Milvus (open-source): enterprise-grade, deployable on-premises.
For prototyping, a local library such as FAISS (Facebook AI Similarity Search) works well and needs no external service. The example below uses Pinecone, a managed option:
# Example: Semantic search using Pinecone (Python client v3 or later)
from pinecone import Pinecone
from typing import List, Optional, Tuple

class SemanticMemoryRetriever:
    def __init__(self, pinecone_api_key: str, index_name: str):
        # Assumes the index already exists and its dimension matches the embedding model
        pc = Pinecone(api_key=pinecone_api_key)
        self.index = pc.Index(index_name)

    def add_to_memory(self, event_id: str, embedding: List[float], metadata: dict):
        """
        Add an embedded episodic record to the vector database.
        metadata includes user_id, timestamp, event_type for filtering.
        """
        self.index.upsert(vectors=[(
            event_id,
            embedding,
            metadata
        )])

    def retrieve_similar(
        self,
        query_embedding: List[float],
        top_k: int = 5,
        user_id: Optional[str] = None
    ) -> List[Tuple[str, float, dict]]:
        """
        Retrieve the top-K most similar past events.
        Optional filter by user_id for personalization.
        """
        filter_dict = {"user_id": {"$eq": user_id}} if user_id else None
        results = self.index.query(
            vector=query_embedding,
            top_k=top_k,
            filter=filter_dict,
            include_metadata=True
        )
        # Return list of (event_id, similarity_score, metadata)
        return [
            (match["id"], match["score"], match["metadata"])
            for match in results["matches"]
        ]
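For unit tests or offline experiments, an in-memory stand-in with the same two method names can replace the managed index. The sketch below is a hypothetical test double (InMemoryRetriever is not a Pinecone class). It does an exact linear scan, so it only suits small datasets.

```python
import math
from typing import Dict, List, Optional, Tuple


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class InMemoryRetriever:
    """Exact-search stand-in exposing add_to_memory and retrieve_similar."""

    def __init__(self) -> None:
        self._records: Dict[str, Tuple[List[float], dict]] = {}

    def add_to_memory(self, event_id: str, embedding: List[float], metadata: dict) -> None:
        self._records[event_id] = (embedding, metadata)

    def retrieve_similar(
        self,
        query_embedding: List[float],
        top_k: int = 5,
        user_id: Optional[str] = None,
    ) -> List[Tuple[str, float, dict]]:
        scored = []
        for event_id, (embedding, metadata) in self._records.items():
            # Apply the same user_id filter the managed index would apply.
            if user_id is not None and metadata.get("user_id") != user_id:
                continue
            scored.append((event_id, _cosine(query_embedding, embedding), metadata))
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


memory = InMemoryRetriever()
memory.add_to_memory("evt-1", [0.9, 0.1], {"user_id": "u1"})
memory.add_to_memory("evt-2", [0.2, 0.9], {"user_id": "u1"})
memory.add_to_memory("evt-3", [1.0, 0.0], {"user_id": "u2"})

hits = memory.retrieve_similar([1.0, 0.0], top_k=2, user_id="u1")
```

Because the return shape matches (event_id, score, metadata), code written against the retriever above can be exercised without network access.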
Building a Production Retrieval Pipeline
A production system combines embeddings, vector search, and ranking to form a multi-stage retrieval pipeline:
from datetime import datetime, timezone
from typing import List, Tuple

def retrieve_episodic_context(
    agent_query: str,
    user_id: str,
    vectorized_memory: VectorizedMemory,
    vector_db: SemanticMemoryRetriever,
    recency_weight: float = 0.15,
    max_results: int = 3
) -> List[Tuple[str, float, dict]]:
    """
    Multi-stage retrieval: (1) embed query, (2) vector search,
    (3) apply recency and relevance ranking, (4) return top results.
    """
    # Stage 1: Embed the current query (same model as the stored vectors)
    query_embedding = vectorized_memory.embed_text(agent_query)

    # Stage 2: Vector search (retrieve more results than needed)
    candidates = vector_db.retrieve_similar(
        query_embedding,
        top_k=max_results * 3,  # Over-retrieve for re-ranking
        user_id=user_id
    )

    # Stage 3: Re-rank by relevance + recency
    scored = []
    now = datetime.now(timezone.utc)
    for event_id, vector_score, metadata in candidates:
        # Vector similarity score (higher = better)
        relevance_score = vector_score

        # Recency bonus: additive boost of up to recency_weight,
        # decaying linearly to 0 over 7 days. Events without a timestamp get no boost.
        recency_boost = 0.0
        timestamp = metadata.get("timestamp")
        if timestamp:
            event_time = datetime.fromisoformat(timestamp)
            if event_time.tzinfo is None:
                event_time = event_time.replace(tzinfo=timezone.utc)  # Treat naive timestamps as UTC
            days_old = (now - event_time).days
            if 0 <= days_old <= 7:
                recency_boost = recency_weight * (1 - days_old / 7)

        # Combined score
        final_score = relevance_score + recency_boost
        scored.append((event_id, final_score, metadata))

    # Stage 4: Return top results
    return sorted(scored, key=lambda x: x[1], reverse=True)[:max_results]
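The effect of the recency boost is easiest to see with numbers. The sketch below isolates the boost formula from the pipeline (a 0.15 maximum, decaying linearly over 7 days) and applies it to two hypothetical candidates. Returning no boost for timestamps in the future is an assumption of this sketch.

```python
def recency_boost(days_old: int, max_boost: float = 0.15, window_days: int = 7) -> float:
    """Additive boost that decays linearly from max_boost (today) to 0 (window_days old)."""
    if days_old < 0 or days_old > window_days:
        return 0.0  # Assumption: future-dated or old events get no boost
    return max_boost * (1 - days_old / window_days)


# Two hypothetical candidates: (event_id, vector similarity, age in days).
candidates = [
    ("old-but-closer", 0.88, 30),
    ("fresh-but-looser", 0.80, 0),
]

ranked = sorted(
    ((event_id, score + recency_boost(age)) for event_id, score, age in candidates),
    key=lambda item: item[1],
    reverse=True,
)
for event_id, final_score in ranked:
    print(f"{event_id}: {final_score:.2f}")
```

Here the fresh event overtakes a closer match from a month ago. The boost is added to the raw score rather than multiplied, so its weight relative to similarity depends on the score range of the chosen distance metric.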
Embedding and Caching Strategy
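Embedding every episodic record adds cost and write latency. The per-event cost is small (about $0.13 per 1M tokens for text-embedding-3-large), but it scales linearly with volume and is paid again on every re-embedding pass. A practical strategy: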
- Embed on write: When a new event is logged, embed it immediately (or batch every hour).
- Cache embeddings: Store embeddings alongside event data to avoid re-computing.
- Use cheaper models for volume: Use text-embedding-3-small for most events, and upgrade to a larger model only as part of a periodic (e.g., quarterly) quality pass that re-embeds the whole index.
- Skip low-value events: Don't embed internal tool calls or errors; focus on user interactions and decisions.
from typing import Dict, Sequence

def log_event_with_embedding(
    event: Dict,
    episodic_store,
    vectorized_memory: VectorizedMemory,
    skip_embedding_types: Sequence[str] = ("internal_tool_call",)  # Tuple avoids a shared mutable default
):
    """
    Log an event and optionally embed it for semantic search.
    """
    # Add to episodic database
    episodic_store.add_event(event)

    # Decide if worth embedding
    if event["event_type"] in skip_embedding_types:
        return

    # Generate embedding
    event_text = vectorized_memory.get_event_summary(event)
    embedding = vectorized_memory.embed_text(event_text)

    # Cache embedding with event
    episodic_store.set_embedding(event["event_id"], embedding)
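Caching can also be keyed by content rather than event ID, so identical summaries are embedded only once. The sketch below is one possible design, not part of the code above. EmbeddingCache is a hypothetical class, and fake_embed stands in for a real API call so the example runs offline.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Caches embeddings by (model, SHA-256 of text) so repeated text costs nothing."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model: str):
        self._embed_fn = embed_fn
        self._model = model
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # Number of times the underlying embed function was called

    def _key(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # Including the model name keeps vectors from different models apart.
        return f"{self._model}:{digest}"

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.misses += 1
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    """Placeholder for a real embedding call (illustrative only)."""
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed, model="text-embedding-3-small")
first = cache.get("User: export invoice as PDF")
second = cache.get("User: export invoice as PDF")   # served from cache
third = cache.get("User: database timeout")         # new text, one more call
```

In production the dictionary would be replaced by a persistent store, but the keying scheme carries over unchanged.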
Handling Stale Embeddings
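Embeddings do not decay on their own, but they become stale when the embedding model changes or the underlying record is edited. Vectors from different models are not comparable, so when you adopt a newer, higher-quality model (e.g., in a quarterly quality pass), re-embed the older records and embed queries with the same model. For very old records (>2 years), consider archiving or deleting instead.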
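A maintenance pass along these lines can be sketched as a function that splits records into "re-embed" and "archive" sets. The field names (event_id, created_at, embedding_model) and the 730-day cutoff are assumptions chosen for illustration; adapt them to your own schema.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def plan_maintenance(
    records: List[Dict],
    current_model: str,
    now: datetime,
    archive_after_days: int = 730,
) -> Tuple[List[str], List[str]]:
    """Return (ids to re-embed, ids to archive) for one maintenance pass."""
    to_reembed: List[str] = []
    to_archive: List[str] = []
    for record in records:
        age = now - record["created_at"]
        if age > timedelta(days=archive_after_days):
            to_archive.append(record["event_id"])      # Too old to be worth re-embedding
        elif record["embedding_model"] != current_model:
            to_reembed.append(record["event_id"])      # Embedded with an outdated model
    return to_reembed, to_archive


utc = timezone.utc
records = [
    {"event_id": "r1", "created_at": datetime(2024, 12, 1, tzinfo=utc), "embedding_model": "old-model"},
    {"event_id": "r2", "created_at": datetime(2024, 12, 15, tzinfo=utc), "embedding_model": "new-model"},
    {"event_id": "r3", "created_at": datetime(2022, 6, 1, tzinfo=utc), "embedding_model": "old-model"},
]

reembed_ids, archive_ids = plan_maintenance(
    records, current_model="new-model", now=datetime(2025, 1, 1, tzinfo=utc)
)
```

Storing the model name next to each vector is what makes this check possible. Without it there is no way to tell which records still live in the old embedding space.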
Key Takeaways
- Vector embeddings enable semantic search: retrieve similar past events without exact keyword matching, which the analysis cited above links to a 31% improvement in response quality.
- Choose embedding models by quality (SOTA models like text-embedding-3-large have 5–10% better accuracy) and cost; consider smaller models for frequent retrieval.
- Use vector databases (Pinecone, Weaviate, Qdrant) with ANN indexing to achieve fast, scalable retrieval.
- Build a multi-stage pipeline: embed query, vector search, re-rank by relevance and recency, return top results.
- Cache embeddings to reduce computation cost; periodically refresh with higher-quality models.
Frequently Asked Questions
What's the difference between keyword search and semantic search?
Keyword search matches exact words (fast but brittle: it misses synonyms). Semantic search finds conceptually similar events (flexible, but slow without ANN indices). A hybrid approach uses both: keyword filter first (fast), then semantic ranking within the filtered results.
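A minimal sketch of that hybrid approach, assuming each stored event carries both its text and a precomputed embedding (the dictionary keys and toy 2-dimensional vectors are illustrative):

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def hybrid_search(
    query_text: str,
    query_embedding: Sequence[float],
    events: List[Dict],
    top_k: int = 3,
) -> List[str]:
    """Keyword filter first, then rank the survivors by embedding similarity."""
    query_terms = set(query_text.lower().split())
    # Stage 1: keep events sharing at least one word with the query.
    filtered = [e for e in events if query_terms & set(e["text"].lower().split())]
    # Stage 2: semantic ranking within the filtered set.
    ranked = sorted(
        filtered,
        key=lambda e: cosine_similarity(query_embedding, e["embedding"]),
        reverse=True,
    )
    return [e["event_id"] for e in ranked[:top_k]]


events = [
    {"event_id": "e1", "text": "export invoice as pdf", "embedding": [0.9, 0.1]},
    {"event_id": "e2", "text": "invoice payment failed", "embedding": [0.2, 0.9]},
    {"event_id": "e3", "text": "database timeout error", "embedding": [0.1, 0.95]},
]

matches = hybrid_search("invoice export", [1.0, 0.0], events)
```

The keyword stage reintroduces the synonym problem for anything it filters out, so some systems instead run both searches independently and merge the two ranked lists.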
Do I need a specialized vector database, or can I use PostgreSQL?
PostgreSQL supports vector search through the pgvector extension, which works well at moderate scale (under 1M vectors). For many millions of vectors or high-QPS retrieval, use a dedicated vector DB. PostgreSQL is simpler to operate (one system instead of two), but less optimized.
How often should I re-embed episodic records?
Embeddings don't degrade unless the embedding model changes (e.g., you upgrade to a newer, better model). Re-embed during model upgrades or quality passes. Daily re-embedding is wasteful.
Can I use embeddings to detect duplicate events (memory deduplication)?
Yes. Compute cosine distance between a new event and past events; if distance is very small (< 0.05), flag as duplicate. This prevents memory bloat from repeated similar events.
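A minimal sketch of that check, using the 0.05 threshold from the answer above. The linear scan is for illustration; at scale you would query the vector database for the single nearest neighbor and compare its distance instead. The right threshold depends on the embedding model, so treat 0.05 as a starting point.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def find_duplicate(
    new_embedding: Sequence[float],
    stored: Dict[str, Sequence[float]],
    max_distance: float = 0.05,
) -> Optional[str]:
    """Return the ID of the closest stored event within max_distance, else None."""
    best_id, best_distance = None, max_distance
    for event_id, embedding in stored.items():
        distance = cosine_distance(new_embedding, embedding)
        if distance < best_distance:
            best_id, best_distance = event_id, distance
    return best_id


# Toy 4-dimensional vectors standing in for real embeddings.
stored = {"evt-1": [0.12, -0.45, 0.89, 0.23]}

near_copy = find_duplicate([0.13, -0.44, 0.88, 0.24], stored)
unrelated = find_duplicate([-0.78, 0.32, -0.45, -0.67], stored)
```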
What embedding dimension should I use?
Larger dimensions capture more nuance but increase memory use and latency. 512 dims (e.g., text-embedding-3-small shortened from its default 1,536 with the dimensions parameter) is fine for many retrieval tasks. Use 1,536+ dims for higher accuracy, and only if budget allows. The gain typically plateaus past 1,024 dims.
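The memory side of this trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats and ignores index overhead and metadata, so real footprints will be somewhat larger.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 by default, no index overhead)."""
    return num_vectors * dims * bytes_per_value


one_million = 1_000_000
size_512 = index_size_bytes(one_million, 512)
size_1536 = index_size_bytes(one_million, 1536)

print(f"512 dims:   {size_512 / 1e9:.3f} GB")
print(f"1,536 dims: {size_1536 / 1e9:.3f} GB")
```

Tripling the dimension triples the raw storage, and distance computations grow in proportion, which is why shorter vectors are attractive when the accuracy loss is acceptable.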