Hybrid Vector-Graph Search Architecture
Hybrid search combines vector embeddings (fast, approximate) with knowledge graphs (precise, structured) in a single retrieval system. Vector search identifies candidate documents quickly; graph queries then refine and verify the results. Hybrid systems are reported to deliver 28% better recall and 41% lower latency than pure graph-based retrieval (Retrieval Benchmark 2026).
This article shows how to architect and implement hybrid search that powers LLM augmentation at scale.
Why Hybrid is Better Than Pure Vectors or Pure Graphs
| Scenario | Pure Vectors | Pure Graphs | Hybrid |
|---|---|---|---|
| Large corpus; rare entities | Slow (billion-scale ANN) | Incomplete (graph coverage varies) | Fast (vectors narrow search) |
| Semantic similarity matching | Excellent | Poor (exact matching only) | Good (vectors + graph scoring) |
| Structured multi-hop queries | Not supported (no explicit relations) | Excellent | Excellent (graph) |
| Real-time updates | Slow (re-embedding and re-indexing) | Fast (mutations) | Fast (mutations + cache) |
| Handling new entities | Poor; invisible until embedded and indexed | Easy; add to graph | Easy; add to graph |
Example: consider a biomedical corpus of 1 million research papers. Vector search alone retrieves similar abstracts (good for "papers about diabetes"). Graph search alone can find all drugs that treat diabetes (good for structured queries). A hybrid system first uses fast vector search to narrow the corpus to about 1,000 relevant papers, then uses graph queries to refine those to about 10 highly specific papers that discuss drug interactions.
Architecture: Two-Stage Retrieval
User Query: "What diseases do patients with BRCA1 mutations develop?"
|
v
[Stage 1: Vector Retrieval]
- Embed query: [0.12, 0.45, ...]
- ANN search on disease embeddings
- Return top-100 disease candidates
- Fast: ~50 ms
|
v
[Stage 2: Graph Refinement]
- For each disease, check graph edges
- MATCH (disease)-[:ASSOCIATED_WITH]->(gene:Gene {name: "BRCA1"})
- Score by relation type and confidence
- Return top-10 verified results
- Precise: ~200 ms
|
v
[Synthesis]
- LLM generates answer from verified results
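The flow above can be sketched without any external services. The following is a minimal illustration rather than the full implementation: the three-dimensional vectors, the `TOY_EMBEDDINGS` and `TOY_EDGES` data, and the function names are all made up for the example, and a set of triples stands in for the graph database.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: hand-written 3-d "embeddings" for four diseases.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "Breast cancer": [0.9, 0.1, 0.0],
    "Ovarian cancer": [0.8, 0.2, 0.1],
    "Type 2 diabetes": [0.1, 0.9, 0.2],
    "Influenza": [0.0, 0.2, 0.9],
}
# (disease, relation, gene) triples standing in for graph edges.
TOY_EDGES: Set[Tuple[str, str, str]] = {
    ("Breast cancer", "ASSOCIATED_WITH", "BRCA1"),
    ("Ovarian cancer", "ASSOCIATED_WITH", "BRCA1"),
    ("Type 2 diabetes", "ASSOCIATED_WITH", "TCF7L2"),
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def stage1_vector(query_vec: List[float], top_k: int) -> List[Tuple[str, float]]:
    """Stage 1: rank every entity by similarity and keep the top_k."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in TOY_EMBEDDINGS.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def stage2_graph(candidates: List[Tuple[str, float]], gene: str) -> List[Tuple[str, float]]:
    """Stage 2: keep only candidates with a verified edge to the gene."""
    return [
        (name, score)
        for name, score in candidates
        if (name, "ASSOCIATED_WITH", gene) in TOY_EDGES
    ]


query_vec = [1.0, 0.1, 0.0]  # Pretend embedding of the BRCA1 question
candidates = stage1_vector(query_vec, top_k=3)
verified = stage2_graph(candidates, gene="BRCA1")
print([name for name, _ in verified])
```

Stage 1 deliberately over-fetches (three candidates here, 100 in the diagram) so that stage 2 has room to discard entities that are semantically close but lack the required graph edge.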
Implementation: Two-Stage Retriever
import logging
from typing import Dict, List, Optional, Tuple

import numpy as np
from neo4j import GraphDatabase
from neo4j.exceptions import Neo4jError
from sentence_transformers import SentenceTransformer, util

logger = logging.getLogger(__name__)


class HybridRetriever:
    """Combine vector and graph retrieval."""

    def __init__(self, vector_model: str = "all-MiniLM-L6-v2",
                 graph_uri: Optional[str] = None,
                 graph_user: Optional[str] = None,
                 graph_pass: Optional[str] = None):
        # Vector component
        self.embedder = SentenceTransformer(vector_model)
        self.entity_embeddings = None  # Populated by index_entities()
        self.entity_names: List[str] = []
        # Graph component (optional)
        if graph_uri:
            self.graph_driver = GraphDatabase.driver(
                graph_uri, auth=(graph_user, graph_pass)
            )
        else:
            self.graph_driver = None

    def index_entities(self, entities: List[Dict]) -> None:
        """Pre-compute embeddings for all entities in the graph."""
        self.entity_names = [e["name"] for e in entities]
        texts = [f"{e['name']} {e.get('description', '')}" for e in entities]
        self.entity_embeddings = self.embedder.encode(texts, convert_to_tensor=True)

    def retrieve_vector_candidates(self, query: str, top_k: int = 100) -> List[Tuple[str, float]]:
        """
        Stage 1: Fast vector-based retrieval.
        Returns: [(entity_name, similarity_score), ...]
        """
        if self.entity_embeddings is None:
            raise RuntimeError("Call index_entities() before retrieving.")
        query_embedding = self.embedder.encode(query, convert_to_tensor=True)
        similarities = util.cos_sim(query_embedding, self.entity_embeddings)[0]
        scores = similarities.cpu().numpy()
        # Sort descending and keep the top-k indices
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(self.entity_names[i], float(scores[i])) for i in top_indices]

    def retrieve_graph_refined(self, candidates: List[Tuple[str, float]],
                               query_context: Optional[str] = None,
                               top_k: int = 10) -> List[Dict]:
        """
        Stage 2: Refine candidates using graph queries.
        For each candidate, retrieve its outgoing relations and neighbors.
        query_context is accepted for future relation filtering but is not used yet.
        Returns: [{"entity": name, "relations": [...], "combined_score": ...}, ...]
        """
        if self.graph_driver is None:
            raise RuntimeError("Graph refinement requires graph_uri to be set.")
        # CASE yields null when the entity has no relation, and collect() skips
        # nulls, so an entity without edges gets an empty list rather than
        # a single {relation: null, target: null} entry.
        cypher = """
        MATCH (e {name: $name})
        OPTIONAL MATCH (e)-[r]->(neighbor)
        RETURN e, collect(CASE WHEN r IS NULL THEN null
                          ELSE {relation: type(r), target: neighbor.name} END) AS relations
        LIMIT 1
        """
        refined = []
        with self.graph_driver.session() as session:
            for entity_name, vec_score in candidates:
                try:
                    record = session.run(cypher, name=entity_name).single()
                except Neo4jError as exc:
                    # Log and skip this candidate instead of silently hiding the failure
                    logger.warning("Graph query failed for %r: %s", entity_name, exc)
                    continue
                if record is None:
                    continue  # Entity not in graph; skip
                relations = record["relations"]
                # Boost connected entities; normalized to [0, 1], saturating at 10 relations
                graph_score = min(1.0, len(relations) / 10.0)
                combined_score = 0.6 * vec_score + 0.4 * graph_score
                refined.append({
                    "entity": entity_name,
                    "relations": relations,
                    "vector_score": vec_score,
                    "graph_score": graph_score,
                    "combined_score": combined_score,
                })
        # Sort by combined score and return top-k
        refined.sort(key=lambda x: x["combined_score"], reverse=True)
        return refined[:top_k]

    def retrieve(self, query: str, top_k: int = 10) -> List[Dict]:
        """
        Full hybrid retrieval: vector candidates -> graph refinement.
        """
        # Stage 1: Vector candidates (over-fetch so stage 2 has room to filter)
        candidates = self.retrieve_vector_candidates(query, top_k=100)
        # Stage 2: Graph refinement
        return self.retrieve_graph_refined(candidates, top_k=top_k)

    def close(self) -> None:
        if self.graph_driver:
            self.graph_driver.close()
# Example
# retriever = HybridRetriever(
#     graph_uri="bolt://localhost:7687",
#     graph_user="neo4j",
#     graph_pass="password"
# )
# entities = [
#     {"name": "Type 2 Diabetes", "description": "chronic metabolic disease"},
#     {"name": "BRCA1", "description": "tumor suppressor gene"},
# ]
# retriever.index_entities(entities)
# results = retriever.retrieve("What diseases are linked to BRCA1?")
# for r in results:
#     print(f"{r['entity']}: {r['combined_score']:.3f}")
Advanced Scoring: Combining Vector and Graph Signals
Sophisticated scoring combines multiple signals:
import datetime
from typing import Dict

import torch


class AdvancedHybridScorer:
    """
    Score results using vector similarity, graph connectivity,
    attribute matching, and temporal freshness.
    """

    def __init__(self, w_vector: float = 0.4, w_graph: float = 0.3,
                 w_attributes: float = 0.2, w_freshness: float = 0.1):
        self.weights = {
            "vector": w_vector,
            "graph": w_graph,
            "attributes": w_attributes,
            "freshness": w_freshness,
        }

    def score_vector_similarity(self, query_embedding, entity_embedding) -> float:
        """Cosine similarity between two 1-D embedding tensors."""
        return torch.nn.functional.cosine_similarity(
            query_embedding.unsqueeze(0),
            entity_embedding.unsqueeze(0),
        ).item()

    def score_graph_connectivity(self, entity_node, graph) -> float:
        """Higher score for entities with more relations."""
        # Degree-based: entities with more relations are more "central".
        # Assumes a networkx-style graph whose neighbors() returns an iterable.
        num_relations = len(list(graph.neighbors(entity_node)))
        return min(1.0, num_relations / 10.0)  # Normalize to [0, 1]

    def score_attribute_match(self, entity, query_attributes: Dict) -> float:
        """Fraction of query attributes that the entity matches exactly."""
        matches = 0
        total = 0
        for attr_key, attr_value in query_attributes.items():
            if hasattr(entity, attr_key) and getattr(entity, attr_key) == attr_value:
                matches += 1
            total += 1
        return matches / max(total, 1)

    def score_freshness(self, entity_last_updated: datetime.datetime) -> float:
        """Score based on update recency (expects a naive local datetime)."""
        days_old = (datetime.datetime.now() - entity_last_updated).days
        # Linear decay over 1 year, clamped to [0, 1]
        return min(1.0, max(0.0, 1.0 - (days_old / 365.0)))

    def combine(self, vector_score: float, graph_score: float,
                attribute_score: float = 0.0, freshness_score: float = 1.0) -> float:
        """Weighted combination of all signals."""
        return (
            self.weights["vector"] * vector_score +
            self.weights["graph"] * graph_score +
            self.weights["attributes"] * attribute_score +
            self.weights["freshness"] * freshness_score
        )


# Example usage
scorer = AdvancedHybridScorer(
    w_vector=0.4,
    w_graph=0.35,
    w_attributes=0.15,
    w_freshness=0.1
)
final_score = scorer.combine(
    vector_score=0.85,     # High embedding similarity
    graph_score=0.70,      # Moderate connectivity
    attribute_score=0.95,  # Attribute match
    freshness_score=0.90   # Updated recently
)
print(f"Combined score: {final_score:.3f}")
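Written out, the weighted combination in `combine` and the example call above amount to the following, where s_v, s_g, s_a, and s_f denote the vector, graph, attribute, and freshness scores and w_v, w_g, w_a, and w_f their weights (the symbols are introduced here only for illustration):

```latex
\begin{aligned}
S &= w_v s_v + w_g s_g + w_a s_a + w_f s_f \\
  &= 0.4(0.85) + 0.35(0.70) + 0.15(0.95) + 0.1(0.90) \\
  &= 0.34 + 0.245 + 0.1425 + 0.09 \\
  &= 0.8175
\end{aligned}
```

Because the example weights sum to 1, S stays in [0, 1] whenever each individual signal does, which keeps combined scores comparable across queries.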
Hybrid RAG Pipeline
Integrate hybrid retrieval into an LLM pipeline:
from anthropic import Anthropic


class HybridRAG:
    """Hybrid vector-graph RAG for LLM augmentation."""

    def __init__(self, retriever: HybridRetriever,
                 model: str = "claude-3-5-sonnet-20241022"):
        self.retriever = retriever
        self.model = model
        # Create the client once; it reads ANTHROPIC_API_KEY from the environment
        self.client = Anthropic()

    def answer_question(self, question: str) -> str:
        """
        Answer a question using hybrid retrieval.
        """
        # Step 1: Hybrid retrieval
        results = self.retriever.retrieve(question, top_k=10)
        if not results:
            return "No relevant information found."
        # Step 2: Format results for LLM context
        lines = ["Retrieved facts:"]
        for i, result in enumerate(results, 1):
            lines.append(f"{i}. {result['entity']} (score: {result['combined_score']:.2f})")
            # Show at most 3 relations per entity, skipping empty ones
            for rel in (result.get("relations") or [])[:3]:
                if rel.get("relation"):
                    lines.append(f"   - {rel['relation']}: {rel['target']}")
        context = "\n".join(lines)
        # Step 3: LLM synthesis
        response = self.client.messages.create(
            model=self.model,
            max_tokens=500,
            system=(
                "You are a helpful assistant. Answer the user's question using the retrieved facts. "
                "Cite specific entities from the retrieved results. Be accurate and concise."
            ),
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Question: {question}\n\n{context}\n\n"
                        "Please answer the question based on the retrieved facts."
                    ),
                }
            ],
        )
        return response.content[0].text
# Usage
# hybrid_rag = HybridRAG(retriever)
# answer = hybrid_rag.answer_question("What gene mutations are associated with breast cancer?")
# print(answer)
Caching and Performance Optimization
Optimize hybrid search with caching:
import time
from typing import Dict, List, Tuple


class CachedHybridRetriever:
    """Hybrid retriever with time-based (TTL) result caching."""

    def __init__(self, base_retriever: HybridRetriever, cache_ttl_seconds: int = 3600):
        self.base_retriever = base_retriever
        self.cache_ttl = cache_ttl_seconds
        # Maps (query, top_k) -> (results, time stored)
        self.cache: Dict[Tuple[str, int], Tuple[List[Dict], float]] = {}

    def retrieve(self, query: str, top_k: int = 10) -> List[Dict]:
        """Retrieve with caching."""
        cache_key = (query, top_k)
        # Check cache
        cached = self.cache.get(cache_key)
        if cached is not None:
            cached_results, stored_at = cached
            if time.monotonic() - stored_at < self.cache_ttl:
                return cached_results
            del self.cache[cache_key]  # Entry expired; drop it
        # Cache miss; retrieve and cache
        results = self.base_retriever.retrieve(query, top_k)
        self.cache[cache_key] = (results, time.monotonic())
        return results

    def clear_cache(self) -> None:
        """Clear the cache."""
        self.cache.clear()
# Usage
# cached_retriever = CachedHybridRetriever(retriever, cache_ttl_seconds=1800)
# results1 = cached_retriever.retrieve("BRCA1 mutations") # Computed
# results2 = cached_retriever.retrieve("BRCA1 mutations") # From cache (instant)
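A plain dictionary cache grows with every distinct query, which can become a problem for a long-running service. One possible remedy, sketched here under the assumption that least-recently-used eviction is acceptable, is to bound the number of entries. The class name `BoundedTTLCache` and its `max_entries` and `clock` parameters are hypothetical and not part of the retriever above.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional, Tuple


class BoundedTTLCache:
    """LRU cache with a per-entry time-to-live (illustrative sketch)."""

    def __init__(self, max_entries: int = 1024, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self.clock = clock  # Injectable so tests can control time
        # key -> (value, time stored); insertion order tracks recency
        self._store: "OrderedDict[Hashable, Tuple[Any, float]]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value, or None if missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at >= self.ttl:
            del self._store[key]  # Expired
            return None
        self._store.move_to_end(key)  # Mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        """Store a value, evicting least recently used entries if full."""
        self._store[key] = (value, self.clock())
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # Drop the oldest entry

    def __len__(self) -> int:
        return len(self._store)


cache = BoundedTTLCache(max_entries=2, ttl_seconds=1800)
cache.put(("BRCA1 mutations", 10), ["result A"])
print(cache.get(("BRCA1 mutations", 10)))
```

A retriever wrapper could call `get` before running the two-stage search and `put` afterwards, using `(query, top_k)` as the key as in the class above.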
Key Takeaways
- Hybrid retrieval combines vector embeddings (fast, approximate) and graphs (precise, structured).
- The two-stage architecture (vector candidates -> graph refinement) is reported to achieve 28% better recall than pure graph-based retrieval (Retrieval Benchmark 2026).
- Advanced scoring combines multiple signals: similarity, connectivity, attributes, freshness.
- Caching and indexing are crucial for production latency (target <500 ms total).
- Hybrid systems adapt to both unstructured text and structured domains.
Frequently Asked Questions
How do I decide weights for vector vs. graph scoring?
Start with 0.6 vector / 0.4 graph. Adjust based on benchmark results: if graph facts are missing from top results, increase graph weight. If false positives appear, increase vector weight. Use A/B testing with real user queries.
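Before running an A/B test, the weights can be tuned offline against a small set of labelled queries. The sketch below is one way to do that, assuming each query has candidates with precomputed vector and graph scores plus a set of entities judged relevant. The `LABELLED` data is invented for illustration, and `recall_at_k` and `best_vector_weight` are hypothetical helper names.

```python
from typing import List, Set, Tuple

# Hypothetical validation set: for each query, candidates as
# (entity, vector_score, graph_score) and the entities judged relevant.
LABELLED: List[Tuple[List[Tuple[str, float, float]], Set[str]]] = [
    ([("A", 0.9, 0.1), ("B", 0.6, 0.9), ("C", 0.5, 0.8), ("D", 0.1, 1.0)], {"B", "C"}),
    ([("E", 0.7, 0.7), ("F", 0.9, 0.2), ("G", 0.4, 0.9), ("H", 0.05, 1.0)], {"E", "G"}),
]


def recall_at_k(w_vector: float, k: int) -> float:
    """Mean recall@k over the labelled queries for a given vector weight."""
    w_graph = 1.0 - w_vector
    total = 0.0
    for candidates, relevant in LABELLED:
        ranked = sorted(
            candidates,
            key=lambda c: w_vector * c[1] + w_graph * c[2],
            reverse=True,
        )
        hits = sum(1 for name, _, _ in ranked[:k] if name in relevant)
        total += hits / len(relevant)
    return total / len(LABELLED)


def best_vector_weight(k: int = 2, steps: int = 10) -> float:
    """Grid-search the vector weight in [0, 1]; ties go to the smaller weight."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda w: recall_at_k(w, k))


print("recall@2 at 0.6/0.4:", recall_at_k(0.6, k=2))
print("best vector weight:", best_vector_weight(k=2))
```

On this toy data neither extreme wins: pure vector ranking promotes semantically similar but unverified entities, while pure graph ranking promotes highly connected but off-topic ones. Real validation sets need many more queries for the chosen weight to be trustworthy.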
What's the memory overhead of indexing embeddings?
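The all-MiniLM-L6-v2 model used above produces 384-dimensional float32 embeddings; other Sentence-BERT models may use different dimensions. For 1 million entities: 1M * 384 * 4 bytes ≈ 1.5 GB. Keeping the index in memory is fastest; storing it on disk is cheaper but adds disk reads (~10 ms) to lookups. Quantizing float32 values to int8 reduces memory by 4x, at some cost in accuracy.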
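The arithmetic above, and the 4x saving from quantization, can be checked with a few lines of Python. The sketch uses only the standard library. `embedding_memory_bytes`, `quantize_int8`, and `dequantize` are illustrative names, and symmetric per-vector int8 quantization is just one simple scheme, not necessarily what a given vector database uses.

```python
from typing import List, Tuple


def embedding_memory_bytes(num_entities: int, dims: int, bytes_per_value: int) -> int:
    """Raw storage for a dense embedding matrix, ignoring index overhead."""
    return num_entities * dims * bytes_per_value


def quantize_int8(vector: List[float]) -> Tuple[List[int], float]:
    """Symmetric scalar quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vector], scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


float32_bytes = embedding_memory_bytes(1_000_000, 384, 4)
int8_bytes = embedding_memory_bytes(1_000_000, 384, 1)
print(f"float32: {float32_bytes / 1e9:.2f} GB, int8: {int8_bytes / 1e9:.2f} GB")

codes, scale = quantize_int8([0.5, -1.0, 0.25])
print(codes, dequantize(codes, scale))
```

The reconstruction error per value is bounded by about half the scale factor, which is why quantization usually costs a little recall rather than breaking retrieval outright.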
Can I use different embedding models for different entity types?
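Yes. Use specialized models for different domains (BioBERT for biomedical text, FinBERT for finance). Similarity scores from different models are not directly comparable, so either encode every entity and every query with each model and concatenate the normalized embeddings, or keep one index per model and normalize the scores before combining the per-model results. Multi-modal embeddings (handling text + images) are emerging.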
What if a query has no vector match?
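Implement fallback strategies: (a) widen the search (increase top_k or lower the similarity threshold), (b) if the index is approximate (LSH, HNSW), rerun the query with exact nearest-neighbor search to rule out candidates the approximate index missed, (c) fall back to graph traversal from seed entities.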
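A fallback chain of this kind might look like the sketch below. It assumes the vector search returns `(entity, score)` pairs and treats "no match" as "no candidate above a similarity threshold". The function `retrieve_with_fallback`, its parameters, and the stub searchers are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, float]  # (entity_name, similarity_score)


def retrieve_with_fallback(
    vector_search: Callable[[int], List[Candidate]],
    graph_fallback: Callable[[], List[Candidate]],
    thresholds: Sequence[float] = (0.5, 0.3),
    top_k: int = 100,
) -> Tuple[List[Candidate], str]:
    """Relax the similarity threshold step by step, then fall back to the graph."""
    candidates = vector_search(top_k)
    for min_score in thresholds:
        hits = [c for c in candidates if c[1] >= min_score]
        if hits:
            return hits, f"vector(min_score={min_score})"
    # Nothing similar enough: use graph traversal from seed entities instead
    return graph_fallback(), "graph"


# Stub searchers standing in for a real vector index and graph traversal.
def weak_vector_search(top_k: int) -> List[Candidate]:
    return [("Unrelated entity", 0.12)][:top_k]


def seed_graph_traversal() -> List[Candidate]:
    return [("BRCA1", 1.0), ("Breast cancer", 1.0)]


results, source = retrieve_with_fallback(weak_vector_search, seed_graph_traversal)
print(source, results)
```

Returning the name of the strategy that produced the results makes it easy to log how often each fallback fires, which in turn shows whether the thresholds need tuning.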
How do I handle real-time updates to the hybrid index?
For vector index: rebuilding embeddings is expensive. Use incremental updates: only recompute for new/changed entities. For graph: mutations are fast. Keep both in sync: when you update the graph, regenerate affected entity embeddings.
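One way to recompute embeddings only for new or changed entities is to store a content hash next to each vector. The sketch below assumes entities are dictionaries with `name` and optional `description` keys, as in the indexing code earlier. `IncrementalIndex` and `embed_fn` are hypothetical names, and a stand-in embedding function is used so the example runs without a model.

```python
import hashlib
from typing import Callable, Dict, List


class IncrementalIndex:
    """Re-embed an entity only when its indexed text has changed (sketch)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self.embed_fn = embed_fn
        self.vectors: Dict[str, List[float]] = {}
        self._hashes: Dict[str, str] = {}  # entity name -> hash of indexed text

    def upsert(self, entities: List[Dict]) -> List[str]:
        """Embed new or changed entities; return the names that were re-embedded."""
        changed = []
        for entity in entities:
            name = entity["name"]
            text = f"{name} {entity.get('description', '')}"
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self._hashes.get(name) != digest:
                self.vectors[name] = self.embed_fn(text)
                self._hashes[name] = digest
                changed.append(name)
        return changed

    def delete(self, name: str) -> None:
        """Remove an entity that was deleted from the graph."""
        self.vectors.pop(name, None)
        self._hashes.pop(name, None)


def toy_embed(text: str) -> List[float]:
    # Stand-in for a real model call such as SentenceTransformer.encode
    return [float(len(text))]


index = IncrementalIndex(toy_embed)
print(index.upsert([{"name": "BRCA1", "description": "gene"}]))
print(index.upsert([{"name": "BRCA1", "description": "gene"}]))
print(index.upsert([{"name": "BRCA1", "description": "tumor suppressor gene"}]))
```

Calling `upsert` from the same code path that writes graph mutations keeps the two stores in sync, and unchanged entities cost only a hash computation.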