Context Retrieval for Agents: Smart Code Selection

Retrieving the right context is harder than it seems. A naive search for "user authentication" might return 50 files. An agent cannot reason about all 50 in its context window. Which ones matter? A smarter retrieval system ranks files by relevance, expands to related code (if you're editing user login, you probably need the password validation function too), and stops before hitting the context limit. This article covers retrieval strategies from keyword-based to dependency-aware.

Context retrieval is the second-most critical system in an agent, after indexing. Poor retrieval leaves the agent guessing about code it cannot see, which invites hallucination. Good retrieval keeps the agent focused and accurate.

Retrieval Problem: What Matters?

When an agent needs to "add two-factor authentication to the login endpoint," the retrieval system should return:

  1. Primary target: The login endpoint code.
  2. Dependencies: Database schema, password validation function, session management.
  3. Related code: Existing authentication patterns, error handling conventions.
  4. Test fixtures: Test files that validate authentication behavior.

But it should not return the user profile endpoint, billing module, or unrelated admin code—even if they contain the word "authentication."

A good retrieval system answers: Given this task, what is the minimal set of files/functions the agent needs to succeed?

Strategy 1: Keyword + Frequency Ranking

The simplest retrieval uses TF-IDF (Term Frequency-Inverse Document Frequency) to rank files by keyword relevance:

import math
from collections import Counter, defaultdict


class TFIDFRetriever:
    """Rank code files by keyword relevance."""

    def __init__(self, index: dict[str, str]):
        """Index is {filepath: content_text}."""
        self.index = index
        self.idf: dict[str, float] = {}
        self.compute_idf()

    def compute_idf(self) -> None:
        """Compute IDF for every term in the index."""
        doc_freq = defaultdict(int)
        for doc in self.index.values():
            # Count each token once per document
            for token in set(doc.lower().split()):
                doc_freq[token] += 1

        total_docs = len(self.index)
        for token, freq in doc_freq.items():
            self.idf[token] = math.log(total_docs / freq)

    def retrieve(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        """Retrieve the top-k files by TF-IDF score."""
        query_tokens = query.lower().split()
        scores = {}

        for filepath, content in self.index.items():
            # TF: how many times each token appears in this file
            term_counts = Counter(content.lower().split())

            # Weight each query token's TF by that token's own IDF, then sum.
            # Tokens absent from the index get IDF 0 and contribute nothing.
            scores[filepath] = sum(
                term_counts[token] * self.idf.get(token, 0.0)
                for token in query_tokens
            )

        # Return top-k by score
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return ranked[:top_k]

Limitations:

  • Doesn't understand code structure (functions, classes).
  • Blind to synonyms: a query for "authenticate" will not match code that only says "login".
  • No dependency awareness.

When to use: Large codebases where building and maintaining an embedding index is too expensive.
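
Part of the "doesn't understand code structure" limitation comes from whitespace splitting: `validate_password(user_input):` becomes a single token, so a query for "password" never matches it. A minimal sketch of an identifier-aware tokenizer follows. The name `tokenize_code` and its splitting rules are illustrative assumptions, not part of the retriever above.

```python
import re


def tokenize_code(text: str) -> list[str]:
    """Split source text into lowercase word tokens, breaking identifiers apart."""
    tokens = []
    # Pull out identifier-like runs, dropping punctuation and operators
    for word in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        # Split snake_case on underscores, then camelCase on case boundaries
        for part in word.split("_"):
            pieces = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
            tokens.extend(piece.lower() for piece in pieces)
    return tokens


print(tokenize_code("def validate_password(userInput):"))
```

Using a tokenizer like this for both the indexed files and the query would let "password" match `validate_password`. It still does nothing for true synonyms.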

Strategy 2: Semantic Ranking with Fallback

Combine semantic search (embeddings) with keyword ranking. Query via both, fuse the rankings:

from typing import Callable

import numpy as np


def retrieve_with_fusion(query: str,
                         embed: Callable[[str], np.ndarray],
                         semantic_index: dict,
                         keyword_index: "TFIDFRetriever",
                         top_k: int = 5,
                         semantic_weight: float = 0.7) -> list[str]:
    """
    Retrieve using semantic + keyword search, fused by weighted score.

    `embed` maps text to a vector and must use the same embedding model
    that produced semantic_index["embeddings"]. It is passed in so the
    function does not depend on any particular embedding provider.
    """
    # Semantic search: embed the query and compare against every chunk
    query_embedding = np.asarray(embed(query))

    # A file may have several chunks; keep its best-matching chunk so
    # long files are not favored just for having more chunks.
    semantic_scores = {}
    for i, embedding in enumerate(semantic_index["embeddings"]):
        sim = float(np.dot(query_embedding, embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(embedding)
        ))
        filepath = semantic_index["metadata"][i]["filepath"]
        semantic_scores[filepath] = max(semantic_scores.get(filepath, sim), sim)

    # Keyword search: TF-IDF ranking
    keyword_scores = dict(keyword_index.retrieve(query, top_k=top_k * 2))
    # Normalize keyword scores to [0, 1]; default guards an empty result
    max_keyword = max(keyword_scores.values(), default=0.0) + 1e-9

    # Fuse: weighted combination
    all_files = set(semantic_scores) | set(keyword_scores)
    fused_scores = {}
    for filepath in all_files:
        semantic = semantic_scores.get(filepath, 0.0)
        keyword = keyword_scores.get(filepath, 0.0) / max_keyword
        fused_scores[filepath] = (
            semantic_weight * semantic +
            (1 - semantic_weight) * keyword
        )

    ranked = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return [f for f, _ in ranked[:top_k]]

Advantages:

  • Semantic search catches meaning matches (synonyms, concepts).
  • Keyword search as fallback for exact matches and code specifics.
  • Weights can be tuned per task.

Trade-off:

  • Slightly higher latency (two searches per query).
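
Weighted score fusion needs the two score scales to be comparable. One alternative, swapped in here purely for illustration, is reciprocal rank fusion (RRF), which uses only each file's position in each ranked list. The sketch below assumes each search returns an ordered list of file paths. The function name is hypothetical, and `k = 60` is a commonly used default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60,
                           top_k: int = 5) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a file's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, filepath in enumerate(ranking, start=1):
            scores[filepath] = scores.get(filepath, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties alphabetically for stable output
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [filepath for filepath, _ in ranked[:top_k]]


semantic_hits = ["auth.py", "session.py", "login.py"]
keyword_hits = ["login.py", "auth.py", "billing.py"]
print(reciprocal_rank_fusion([semantic_hits, keyword_hits], top_k=3))
```

Files that appear near the top of both lists rise to the top of the fused list, and no score normalization is required. The cost is that RRF ignores how large the score gaps were within each list.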

Strategy 3: Dependency-Aware Expansion

When retrieving a file, also retrieve files it depends on and files that depend on it. This ensures the agent sees the full context of a change:

from collections import deque


def retrieve_with_dependencies(
    primary_file: str,
    dependency_graph: dict,  # {file: [imported_files]}
    max_files: int = 10,
    max_depth: int = 2,      # Expand up to 2 hops to avoid explosion
) -> list[str]:
    """
    Retrieve primary file + its dependencies + dependents (up to max_files).
    The primary file comes first; the rest follow in breadth-first order.
    """
    retrieved = [primary_file]
    seen = {primary_file}
    # Track each file's hop count so max_depth limits distance from the
    # primary file rather than the number of files expanded.
    to_visit = deque([(primary_file, 0)])

    while to_visit and len(retrieved) < max_files:
        current, depth = to_visit.popleft()
        if depth >= max_depth:
            continue

        # Forward: files that current imports (limit to 3 per file)
        forward = dependency_graph.get(current, [])[:3]

        # Backward: files that import current
        backward = [
            other for other, imports in dependency_graph.items()
            if current in imports
        ]

        for neighbor in forward + backward:
            if neighbor not in seen and len(retrieved) < max_files:
                seen.add(neighbor)
                retrieved.append(neighbor)
                to_visit.append((neighbor, depth + 1))

    return retrieved

# Usage: Agent needs to edit 'auth.py'
deps = retrieve_with_dependencies(
    primary_file="src/auth.py",
    dependency_graph=build_dependency_graph(codebase),
    max_files=10
)
# deps might be: [auth.py, password_hash.py, user_model.py, session.py, ...]

Advantages:

  • Catches implicit dependencies (database models, configuration).
  • Prevents agent from breaking dependent code unknowingly.
  • Natural expansion based on code structure.

Drawbacks:

  • Graph construction requires good indexing.
  • Can retrieve too much if dependency graph is dense.
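
The usage example above calls `build_dependency_graph` without defining it. For a Python codebase, one way to sketch it is with the standard-library `ast` module. This version assumes the input is a `{filepath: source_text}` dict and that absolute imports map directly onto file paths (`src.auth` to `src/auth.py`). Relative imports and `__init__.py` packages are ignored, so treat it as a starting point, not a complete resolver.

```python
import ast


def build_dependency_graph(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each file to the in-repo files it imports."""
    # Dotted module name -> file path, e.g. "src.auth" -> "src/auth.py"
    module_to_file = {
        path[:-3].replace("/", "."): path
        for path in sources if path.endswith(".py")
    }
    graph: dict[str, list[str]] = {}
    for path, text in sources.items():
        try:
            tree = ast.parse(text)
        except SyntaxError:
            graph[path] = []  # Unparseable file: record no dependencies
            continue
        imported: list[str] = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif (isinstance(node, ast.ImportFrom)
                  and node.module and node.level == 0):
                names = [node.module]
            else:
                continue
            for name in names:
                target = module_to_file.get(name)
                # Keep only imports that resolve to another indexed file
                if target and target != path and target not in imported:
                    imported.append(target)
        graph[path] = imported
    return graph


sources = {
    "src/auth.py": "import src.session\nfrom src.user_model import User\nimport os\n",
    "src/session.py": "from src.user_model import User\n",
    "src/user_model.py": "class User:\n    pass\n",
}
print(build_dependency_graph(sources))
```

Standard-library and third-party imports such as `os` drop out because they do not resolve to a file in `sources`, which keeps the graph limited to code the agent can actually edit.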

Strategy 4: Query Expansion and Reranking

For complex tasks, expand the query into multiple sub-queries, retrieve for each, and rerank the union:

class ExpandingRetriever:
    """Query expansion: ask the LLM to generate related queries."""

    def __init__(self, client, index):
        self.client = client
        self.index = index

    def expand_query(self, query: str) -> list[str]:
        """Use LLM to generate semantically similar queries."""
        prompt = f"""
Given this task: "{query}"
Generate 3 related search queries that would help find relevant code.
Return only the queries, one per line, no numbering.
"""

        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}]
        )

        expanded = response.content[0].text.strip().split('\n')
        return [q.strip() for q in expanded if q.strip()]

    def retrieve_expanded(self, query: str, top_k: int = 5) -> list[str]:
        """Retrieve using original query + expanded queries."""
        queries = [query] + self.expand_query(query)

        all_results = {}
        for q in queries:
            results = self.index.semantic_search(q, top_k=top_k)
            for rank, (filepath, score) in enumerate(results):
                # Accumulate scores: first result gets score 5, second gets 4, etc.
                all_results[filepath] = all_results.get(filepath, 0) + (top_k - rank)

        # Return top-k by accumulated score
        ranked = sorted(all_results.items(), key=lambda x: x[1], reverse=True)
        return [f for f, _ in ranked[:top_k]]

Advantages:

  • Captures multi-faceted tasks (e.g., "add auth + logging" → searches for both).
  • Higher recall (fewer missed relevant files).
  • LLM-powered intelligence (understands context).

Cost:

  • Extra LLM calls (slower, more expensive).

Ranking: How to Score Relevance

Once you retrieve candidates, rank them. A good ranking function considers:

| Factor | Weight | Reasoning |
|---|---|---|
| Semantic similarity | 40% | Embeddings capture meaning. |
| Keyword match | 20% | Exact terminology matters in code. |
| Dependency degree | 15% | Core modules (high in-degree) are more likely relevant. |
| Recency | 15% | Recently modified files are more active. |
| Popularity (call count) | 10% | Frequently used functions are more central. |

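def rank_candidates(
    candidates: list[str],
    rankings: dict[str, dict[str, float]],
) -> list[tuple[str, float]]:
    """Combine multiple ranking signals into one score per file.

    Each signal in `rankings` maps filepath -> score. Signals should be
    on a common scale (e.g. [0, 1]) for the weights to be meaningful.
    Returns (filepath, score) pairs, best first.
    """
    scores = {}

    for filepath in candidates:
        score = (
            0.40 * rankings["semantic"].get(filepath, 0) +
            0.20 * rankings["keyword"].get(filepath, 0) +
            0.15 * rankings["dependency"].get(filepath, 0) +
            0.15 * rankings["recency"].get(filepath, 0) +
            0.10 * rankings["popularity"].get(filepath, 0)
        )
        scores[filepath] = score

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)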
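
The weights only behave as percentages if every signal is on the same scale. Raw call counts in the hundreds would otherwise swamp cosine similarities near 1. A minimal sketch of min-max normalization that could be applied to each signal before combining them follows. The helper name `min_max_normalize` is an assumption for illustration.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one signal's scores to [0, 1]; a constant signal maps to 0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # No spread means the signal cannot distinguish files
        return {key: 0.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


call_counts = {"auth.py": 2.0, "session.py": 4.0, "billing.py": 6.0}
print(min_max_normalize(call_counts))
```

Min-max scaling is sensitive to outliers: one extremely popular file compresses every other file toward 0. Rank-based or log-scaled normalization are alternatives if that becomes a problem.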

Key Takeaways

  • Retrieval is the bridge between indexing and agent reasoning: retrieve well, agents succeed.
  • Use semantic search for meaning, keyword search for precision, combine via fusion.
  • Dependency expansion ensures the agent sees related code it will inevitably need.
  • Query expansion (via LLM) captures multi-faceted tasks better than single-query retrieval.
  • Rank candidates by multiple signals: similarity, keyword match, dependency, recency, popularity.

Frequently Asked Questions

How many files should an agent retrieve?

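It depends on the context window. As a rough guide, a 100k-token agent can handle 50–100 files comfortably, and a 10k-token agent 5–10 files. Both figures work out to roughly 1–2k tokens per file, so scale them down for larger files. Retrieve just enough to reason, not so much that the agent gets lost. Start with the top 5 and expand if the agent asks for more context.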
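
A fixed file count ignores how much file sizes vary, so a budget in tokens is often easier to reason about. The sketch below walks a ranked list and keeps files while they fit. The names `select_within_budget` and `token_counts` are hypothetical, and the token counts are assumed to come from whatever tokenizer your model uses.

```python
def select_within_budget(ranked_files: list[str],
                         token_counts: dict[str, int],
                         budget: int) -> list[str]:
    """Greedily keep ranked files while the running token total fits."""
    selected: list[str] = []
    used = 0
    for path in ranked_files:
        cost = token_counts[path]
        if used + cost > budget:
            # Skip this file; a smaller, lower-ranked file may still fit
            continue
        selected.append(path)
        used += cost
    return selected


ranked = ["auth.py", "user_model.py", "session.py"]
sizes = {"auth.py": 600, "user_model.py": 500, "session.py": 300}
print(select_within_budget(ranked, sizes, budget=1000))
```

Skipping oversized files instead of stopping at the first one that does not fit uses more of the budget. If strict rank order matters more than filling the window, break out of the loop instead.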

What if the right file isn't in the top-k?

Use query expansion or ask the agent to search again with a different query. Agents can also call a search(query) tool directly, iterating if needed. The retrieval system is not always perfect; let agents adapt.

How do I handle codebase changes (new files, deletions)?

Re-index incrementally. Git hooks can trigger a re-index on every commit, updating the index in seconds. For huge codebases, batch updates (every 1–4 hours) are fine; agents can work with slightly stale indexes.
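
Incremental re-indexing depends on knowing which files changed. One approach that does not rely on Git is to compare content hashes between runs. This is a sketch under the assumption that files are available as a `{filepath: text}` dict and that hashes from the previous run were stored somewhere. `diff_index` is a hypothetical name.

```python
import hashlib


def diff_index(files: dict[str, str],
               known_hashes: dict[str, str]) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare current file contents against hashes from the previous run.

    Returns (changed, deleted, new_hashes): paths to re-index, paths to
    drop from the index, and the hash table to store for the next run.
    """
    new_hashes = {
        path: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for path, text in files.items()
    }
    # New files have no stored hash, so they count as changed
    changed = sorted(
        path for path, digest in new_hashes.items()
        if known_hashes.get(path) != digest
    )
    deleted = sorted(path for path in known_hashes if path not in new_hashes)
    return changed, deleted, new_hashes


_, _, stored = diff_index({"a.py": "x = 1", "b.py": "y = 2"}, {})
changed, deleted, stored = diff_index({"a.py": "x = 1", "c.py": "z = 3"}, stored)
print(changed, deleted)
```

Only the files in `changed` need new embeddings or TF-IDF statistics, and entries for `deleted` files can be removed, which keeps each update proportional to the size of the change and not to the size of the codebase.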

Can I retrieve documentation + code together?

Yes. Index docstrings, README files, and API docs the same way as code. In retrieval, blend code results with documentation results. Agents benefit from both.

Further Reading