Indexing Code Repositories: Agents Need Code Maps
Without an index, a coding agent is blind. Given a prompt like "add authentication to the user endpoint," the agent has no way to know where the endpoint is defined, what dependencies exist, or what patterns the codebase follows. Repository indexing solves this: it builds a searchable map of your code. Index your repo once (offline), then agents can retrieve relevant context in milliseconds—enabling them to reason accurately about multi-file changes.
Indexing is the foundation of agent grounding. It answers three questions for the agent: (1) What files exist? (2) What does each file do? (3) How do I find the code relevant to a specific task? This article covers indexing strategies from naive to sophisticated, and shows you how to build one for your own codebase.
Why Indexing Matters More Than Raw Context Size
A 100k-token context window does little good if the agent fills it with irrelevant files. An agent tasked with "fix the authentication bug" should retrieve the auth module, not the entire codebase. Indexing solves the retrieval problem: given a query, return the k files or functions most relevant to it, within the token budget of the agent's context window.
Internal evaluations of AI agent productivity (Google/Anthropic, 2025–2026) reportedly show that agents with good indexing and retrieval solve tasks 3–5x faster than those with large context windows but no indexing, even when both agents see the same number of tokens. Why? Because focused context reduces hallucination and keeps the agent reasoning about real code.
Here's the difference in agent behavior:
Without good indexing:
Prompt: "add pagination"
Agent: "I'll search the entire codebase for pagination..."
Result: Retrieves 50 random files, loses focus, generates incorrect code.
With good indexing:
Prompt: "add pagination"
Agent: "I queried the index for 'pagination' and got the API endpoint,
the database query module, and the UI component. Now I'll edit them."
Result: Retrieves 3 focused files, reasons clearly, generates correct code.
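The retrieval problem described in this section (return the k most relevant items that still fit in the context window) can be sketched as a greedy selection. This is a minimal illustration, not a prescribed algorithm: the `Candidate` type, the relevance scores, and the token estimates are all hypothetical inputs that a real index would have to supply.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    path: str
    score: float   # relevance to the query, higher is better (assumed given)
    tokens: int    # estimated token cost of including this file (assumed given)


def select_context(candidates: list[Candidate], k: int, token_budget: int) -> list[Candidate]:
    """Greedily pick up to k highest-scoring candidates that fit the budget."""
    selected: list[Candidate] = []
    remaining = token_budget
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if len(selected) == k:
            break
        # Skip anything that would overflow the remaining budget.
        if cand.tokens <= remaining:
            selected.append(cand)
            remaining -= cand.tokens
    return selected


candidates = [
    Candidate("auth/login.py", 0.92, 1200),
    Candidate("auth/tokens.py", 0.88, 900),
    Candidate("vendor/big_generated.py", 0.85, 50000),
    Candidate("api/users.py", 0.60, 800),
]
picked = select_context(candidates, k=3, token_budget=4000)
print([c.path for c in picked])
# → ['auth/login.py', 'auth/tokens.py', 'api/users.py']
```

The large generated file scores well but is skipped because it alone would exceed the budget, which is the behavior the "not the entire codebase" argument relies on.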
Indexing Strategy 1: Filesystem Crawl (Simple, Fast, Limited)
The simplest indexing approach is to enumerate all files and store minimal metadata:
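import json
import os
import re

# Build and dependency directories that are not worth indexing
SKIP_DIRS = {"node_modules", "build", "dist", "target", "__pycache__"}


def index_filesystem(repo_path: str, output_file: str = "index.json") -> dict:
    """Build a simple file- and function-level index via regex."""
    index = {"files": [], "functions": [], "classes": []}

    for root, dirs, files in os.walk(repo_path):
        # Prune hidden and build directories in place so os.walk skips them
        dirs[:] = [d for d in dirs if not d.startswith(".") and d not in SKIP_DIRS]

        for file in files:
            if not file.endswith((".py", ".js", ".go", ".rs")):
                continue

            filepath = os.path.join(root, file)
            rel_path = os.path.relpath(filepath, repo_path)

            try:
                with open(filepath, "r", encoding="utf-8") as f:
                    content = f.read()
            except (OSError, UnicodeDecodeError) as e:
                print(f"Skipped {rel_path}: {e}")
                continue

            # The regexes only understand Python syntax; files in other
            # languages are listed but get no symbols.
            is_python = file.endswith(".py")
            functions = extract_functions_regex(content) if is_python else []
            classes = extract_classes_regex(content) if is_python else []

            index["files"].append({
                "path": rel_path,
                "lines": len(content.splitlines()),
                "functions": functions,
                "classes": classes,
                "size_bytes": os.path.getsize(filepath),
            })
            index["functions"].extend({"name": n, "file": rel_path} for n in functions)
            index["classes"].extend({"name": n, "file": rel_path} for n in classes)

    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=2)

    print(f"Indexed {len(index['files'])} files, {len(index['functions'])} functions")
    return index


def extract_functions_regex(content: str) -> list[str]:
    """Extract function names from Python code (naive regex)."""
    # Matches "def function_name(" at the start of a line only, so methods,
    # nested functions, and "async def" are missed.
    return re.findall(r'^def\s+(\w+)\s*\(', content, re.MULTILINE)


def extract_classes_regex(content: str) -> list[str]:
    """Extract top-level class names from Python code (naive regex)."""
    return re.findall(r'^class\s+(\w+)', content, re.MULTILINE)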
Limitations of this approach:
- Regex is fragile (doesn't handle nested functions, decorators, comments).
- No understanding of what each function does.
- Cannot search by meaning (e.g., "authentication functions").
- Scales poorly beyond 10k files.
When to use: Small projects (<1k files), or as a first pass before semantic indexing.
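To make the fragility concrete, the sketch below runs the same naive regex on a small sample and compares it with Python's standard `ast` module (covered in the next strategy). The `SAMPLE` source is a made-up snippet for illustration only.

```python
import ast
import re

SAMPLE = '''
class UserService:
    def authenticate(self, token):
        return token == "ok"

async def refresh(token):
    return token

# def commented_out():
def top_level():
    def helper():
        pass
'''

# Same pattern as the naive extractor: "def" must start the line.
regex_names = re.findall(r'^def\s+(\w+)\s*\(', SAMPLE, re.MULTILINE)

# The parser sees every definition, regardless of indentation or "async".
ast_names = [
    node.name
    for node in ast.walk(ast.parse(SAMPLE))
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
]

print(regex_names)
# → ['top_level']
print(sorted(ast_names))
# → ['authenticate', 'helper', 'refresh', 'top_level']
```

The regex finds one of the four definitions. It misses the method, the nested helper, and the `async def`, which is why it only suits a quick first pass.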
Indexing Strategy 2: AST Parsing (Robust, Language-Specific)
Abstract Syntax Trees (ASTs) parse code into a tree structure, making it easy to extract precise function/class definitions, their signatures, and their dependencies:
import ast
import json
from typing import Any, Dict


class CodeIndexer(ast.NodeVisitor):
    """Extract functions, classes, and imports via AST parsing."""

    def __init__(self, filepath: str, content: str):
        self.filepath = filepath
        self.content = content
        self.functions = []
        self.classes = []
        self.imports = []
        self.tree = None

    def index(self) -> Dict[str, Any]:
        """Parse and return structured index."""
        try:
            self.tree = ast.parse(self.content)
        except SyntaxError as e:
            return {"filepath": self.filepath, "error": f"Parse error: {e}"}

        self.visit(self.tree)
        return {
            "filepath": self.filepath,
            "functions": self.functions,
            "classes": self.classes,
            "imports": self.imports,
            "lines": len(self.content.splitlines()),
        }

    def visit_FunctionDef(self, node: ast.FunctionDef):
        """Extract function definitions (methods and nested functions too)."""
        args = [arg.arg for arg in node.args.args]
        self.functions.append({
            "name": node.name,
            "lineno": node.lineno,
            "args": args,
            "docstring": ast.get_docstring(node) or "",
        })
        self.generic_visit(node)

    # "async def" has its own node type; record it the same way
    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_ClassDef(self, node: ast.ClassDef):
        """Extract class definitions."""
        methods = [
            item.name
            for item in node.body
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
        ]
        self.classes.append({
            "name": node.name,
            "lineno": node.lineno,
            "methods": methods,
            "docstring": ast.get_docstring(node) or "",
        })
        self.generic_visit(node)

    def visit_Import(self, node: ast.Import):
        """Extract 'import module' statements."""
        for alias in node.names:
            self.imports.append({
                "module": alias.name,
                "lineno": node.lineno,
                "alias": alias.asname,
            })
        self.generic_visit(node)

    def visit_ImportFrom(self, node: ast.ImportFrom):
        """Extract 'from module import name' statements."""
        # Leading dots mark relative imports (node.level counts them)
        module = "." * node.level + (node.module or "")
        for alias in node.names:
            self.imports.append({
                "module": module,
                "name": alias.name,
                "lineno": node.lineno,
                "alias": alias.asname,
            })
        self.generic_visit(node)


# Usage
with open("mymodule.py", encoding="utf-8") as f:
    content = f.read()
indexer = CodeIndexer("mymodule.py", content)
index_data = indexer.index()
print(json.dumps(index_data, indent=2))
Advantages:
- Precise function/class extraction (handles complex syntax correctly).
- Captures docstrings and function signatures.
- Extracts dependency graph (what imports what).
Trade-offs:
- Language-specific (need different parsers for Python, JavaScript, Go, etc.).
- Scales to 50k files, but indexing is slower than a plain filesystem crawl because every file must be fully parsed (seconds to minutes, depending on the language and parser).
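The "what imports what" dependency graph mentioned above can be sketched with the standard `ast` module. This is an illustrative example only: the function names `imported_modules` and `build_import_graph` and the in-memory `sources` mapping are assumptions, and module names are recorded as written rather than resolved to files.

```python
import ast
from collections import defaultdict


def imported_modules(source: str) -> set[str]:
    """Return the module names a Python source string imports."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from . import x" has module=None and is skipped in this sketch.
            modules.add(node.module)
    return modules


def build_import_graph(sources: dict[str, str]):
    """Return (file -> imported modules, module -> importing files)."""
    forward = {path: imported_modules(src) for path, src in sources.items()}
    reverse = defaultdict(set)
    for path, modules in forward.items():
        for module in modules:
            reverse[module].add(path)
    return forward, dict(reverse)


sources = {
    "api/users.py": "import json\nfrom auth.tokens import verify\n",
    "auth/tokens.py": "import hmac\nimport json\n",
}
forward, reverse = build_import_graph(sources)
print(sorted(forward["api/users.py"]))
# → ['auth.tokens', 'json']
print(sorted(reverse["json"]))
# → ['api/users.py', 'auth/tokens.py']
```

The reverse mapping answers the question an agent usually needs before editing a module: which files depend on it.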
Indexing Strategy 3: Semantic Indexing with Embeddings
For true semantic search—finding code by meaning, not syntax—embed code snippets into vector space using an embedding model. This enables queries like "find all database queries" or "find error handling code":
import os
from typing import Callable, List

import numpy as np

# Any function that maps a list of texts to one vector per text.
# The Anthropic SDK has no embeddings endpoint, so pass in a wrapper
# around the client of whichever embedding provider you use.
EmbedFn = Callable[[List[str]], List[List[float]]]


def split_into_chunks(content: str, size: int = 500) -> List[str]:
    """Split text into chunks of roughly `size` tokens.

    Whitespace-separated words stand in for tokens, and chunks break
    only on line boundaries.
    """
    chunks, current, count = [], [], 0
    for line in content.splitlines(keepends=True):
        words = len(line.split())
        if current and count + words > size:
            chunks.append("".join(current))
            current, count = [], 0
        current.append(line)
        count += words
    if current:
        chunks.append("".join(current))
    return chunks


def semantic_index_codebase(repo_path: str, embed_fn: EmbedFn, batch_size: int = 10) -> dict:
    """Build semantic embeddings for code snippets."""
    index = {"embeddings": [], "metadata": []}

    code_chunks = []
    for root, dirs, files in os.walk(repo_path):
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for file in files:
            if not file.endswith((".py", ".js")):
                continue
            filepath = os.path.join(root, file)
            try:
                with open(filepath, "r", encoding="utf-8") as f:
                    content = f.read()
            except (OSError, UnicodeDecodeError) as e:
                print(f"Skipped {filepath}: {e}")
                continue
            # Split into ~500-token chunks
            for chunk in split_into_chunks(content, size=500):
                code_chunks.append({
                    "filepath": filepath,
                    "content": chunk,
                    "size": len(chunk),
                })

    # Embed chunks in batches to limit request size
    for i in range(0, len(code_chunks), batch_size):
        batch = code_chunks[i:i + batch_size]
        vectors = embed_fn([c["content"] for c in batch])
        for chunk, vector in zip(batch, vectors):
            index["embeddings"].append(vector)
            index["metadata"].append({
                "filepath": chunk["filepath"],
                "preview": chunk["content"][:200],
                "content": chunk["content"],  # full text, returned by search
                "size": chunk["size"],
            })
    return index


def semantic_search(query: str, index: dict, embed_fn: EmbedFn, top_k: int = 5) -> list:
    """Find code snippets semantically similar to query."""
    if not index["embeddings"]:
        return []

    # Embed the query with the same function used at indexing time
    query_vec = np.asarray(embed_fn([query])[0], dtype=float)
    matrix = np.asarray(index["embeddings"], dtype=float)

    # Cosine similarity between the query and every indexed chunk
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    sims = (matrix @ query_vec) / np.where(norms == 0, 1.0, norms)

    # Return top-k results, best first
    top_indices = np.argsort(-sims)[:top_k]
    return [
        {
            "score": float(sims[idx]),
            "metadata": index["metadata"][idx],
            "content": index["metadata"][idx]["content"],
        }
        for idx in top_indices
    ]
Advantages:
- Search by meaning: "database connection pooling" finds relevant code even if it uses different terminology.
- Language-agnostic (embeddings work for any code).
- Can surface related code across files, since chunks are matched by similarity rather than by location.
Trade-offs:
- Requires embedding model (cost/latency).
- Indexing is slower (seconds per file).
- More storage (one embedding vector per chunk).
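Because a real embedding model adds cost and latency, it can help to exercise the search pipeline offline first. The sketch below uses a hashed bag-of-words vector as a stand-in. It is not a semantic model (it only matches shared tokens), and `toy_embed` and `cosine_top_k` are hypothetical names introduced here for illustration.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 512) -> list[float]:
    """Hash each lowercase alphanumeric token into a fixed-size count vector."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    # Normalize to unit length so a dot product equals cosine similarity.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine_top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Rank chunks by cosine similarity to the query, best first."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, toy_embed(c))), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "def get_connection(pool): return pool.acquire()",
    "def render_template(name): return html",
]
results = cosine_top_k("pool connection", chunks)
print(results[0][1])
```

Swapping `toy_embed` for a call to a real embedding provider is the only change needed to turn this into meaning-based search. The ranking logic stays the same.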
Building an Effective Hybrid Index
In production, combine strategies:
- Fast filesystem layer: Quick list of files (milliseconds).
- AST layer: Function/class names and signatures (100ms).
- Semantic layer: Embeddings for meaning-based search (lazy-loaded).
When an agent queries, it hits the fast layer first ("does auth.py exist?"), then AST layer ("find the authenticate function"), then semantic if needed ("find code that handles expired tokens"). This layered approach keeps latency low while enabling powerful searches.
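The layered lookup described above might be wired together as follows. This is a sketch under stated assumptions: `hybrid_lookup`, the `file_paths` set, the `symbol_table` mapping, and the optional `semantic_search` callable are all hypothetical stand-ins for the three layers, and matching is exact-string only to keep the example short.

```python
from typing import Callable, Optional


def hybrid_lookup(
    query: str,
    file_paths: set[str],
    symbol_table: dict[str, str],
    semantic_search: Optional[Callable[[str], list[str]]] = None,
) -> dict:
    """Try the cheap layers first and fall back to semantic search last."""
    # Layer 1: exact file path match from the filesystem index.
    if query in file_paths:
        return {"layer": "filesystem", "hits": [query]}
    # Layer 2: exact symbol name match from the AST index.
    if query in symbol_table:
        return {"layer": "ast", "hits": [symbol_table[query]]}
    # Layer 3: meaning-based search, only if a backend was supplied.
    if semantic_search is not None:
        return {"layer": "semantic", "hits": semantic_search(query)}
    return {"layer": "none", "hits": []}


file_paths = {"auth.py", "api/users.py"}
symbol_table = {"authenticate": "auth.py:12"}


def fake_semantic(q: str) -> list[str]:
    # Placeholder for an embedding-backed search.
    return ["auth.py:40"]


print(hybrid_lookup("auth.py", file_paths, symbol_table, fake_semantic)["layer"])
# → filesystem
print(hybrid_lookup("authenticate", file_paths, symbol_table, fake_semantic)["layer"])
# → ast
print(hybrid_lookup("code that handles expired tokens", file_paths, symbol_table, fake_semantic)["layer"])
# → semantic
```

Passing the semantic backend as an optional callable keeps it lazy: nothing is loaded or billed unless the two cheaper layers miss.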
Key Takeaways
- Indexing builds a searchable map of your codebase, enabling agents to retrieve relevant context in milliseconds.
- Filesystem crawling is simple but fragile; AST parsing is robust; semantic embeddings enable meaning-based search.
- In production, use a hybrid approach: fast filesystem layer, AST layer for precision, semantic embeddings for understanding.
- Index once (offline), query many times (agent requests). Refresh when code changes significantly.
- Good indexing reduces hallucination: agents reason about real code, not invented functions.
Frequently Asked Questions
How often should I re-index a repository?
For a small team, re-index on every major commit or daily. For large teams, index every 1–4 hours (batch jobs). Agents can work with slightly stale indexes; what matters is that indexes don't point to deleted/renamed functions. Use version control to detect changes and trigger incremental re-indexing.
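One way to find what needs re-indexing is sketched below. Note that it swaps in content hashes for version-control queries so that the example stays self-contained. A git-based variant would compare two commits instead. The names `file_digest` and `plan_reindex` and the manifest format (relative path to SHA-256 digest) are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_reindex(repo: Path, previous: dict[str, str], suffixes=(".py",)):
    """Compare current file hashes with the previous run's manifest.

    Returns (new manifest, changed-or-added paths, deleted paths).
    """
    current = {
        p.relative_to(repo).as_posix(): file_digest(p)
        for p in repo.rglob("*")
        if p.is_file() and p.suffix in suffixes
    }
    # Changed or new files need their index entries rebuilt.
    changed = sorted(p for p, digest in current.items() if previous.get(p) != digest)
    # Deleted files need their entries removed so the index has no dead pointers.
    deleted = sorted(p for p in previous if p not in current)
    return current, changed, deleted
```

On the first run, pass an empty manifest so that every file counts as changed. On later runs, pass the manifest returned by the previous call and re-index only the `changed` list, dropping entries for `deleted`.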
Can I index a 1-million-line codebase?
Yes. Filesystem crawling is O(n) and takes seconds. AST parsing takes 5–30 minutes depending on language. Semantic embeddings take hours but can be parallelized. For very large codebases, sample key modules (entry points, API handlers) rather than indexing everything.
What if my codebase has mixed languages?
Use language-specific AST parsers (Python's ast module, @babel/parser for JavaScript (formerly Babylon), Go's go/parser package) and store results in a unified format. For semantic search, embeddings are language-agnostic, so the same embedding index works across languages.
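A unified format can be as simple as one record type that every language-specific extractor emits. The sketch below is an assumption-laden illustration: the `Symbol` record, the `EXTRACTORS` registry, and `extract_symbols` are hypothetical names, and only the Python extractor is implemented because it needs nothing beyond the standard library.

```python
import ast
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable


@dataclass
class Symbol:
    name: str
    kind: str       # "function" or "class"
    language: str
    file: str
    line: int


def extract_python(path: str, source: str) -> list[Symbol]:
    """Emit unified Symbol records for a Python source string."""
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(Symbol(node.name, "function", "python", path, node.lineno))
        elif isinstance(node, ast.ClassDef):
            symbols.append(Symbol(node.name, "class", "python", path, node.lineno))
    return symbols


# Registry keyed by file extension. Extractors for other languages would
# wrap that language's parser and return the same Symbol records.
EXTRACTORS: dict[str, Callable[[str, str], list[Symbol]]] = {
    ".py": extract_python,
}


def extract_symbols(path: str, source: str) -> list[Symbol]:
    """Dispatch to the extractor for this file type, if one is registered."""
    extractor = EXTRACTORS.get(PurePosixPath(path).suffix)
    return extractor(path, source) if extractor else []


found = extract_symbols("svc.py", "class A:\n    def m(self):\n        pass\n")
print(sorted((s.kind, s.name, s.line) for s in found))
# → [('class', 'A', 1), ('function', 'm', 2)]
```

Because every extractor returns the same record shape, the query side of the index never needs to know which parser produced an entry.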
How much storage does an index need?
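Filesystem index: negligible (KB). AST index: proportional to codebase size (a 1 MB Python codebase indexes to ~100 KB JSON). Embedding index: ~4 KB per code chunk, which is what a 1024-dimensional float32 vector occupies, so 500 chunks take ~2 MB. The filesystem and AST layers stay a small fraction of raw code size; the embedding layer usually dominates and, depending on chunk size, can approach or exceed the size of the source itself.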
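The embedding figure can be checked with simple arithmetic. The sketch below assumes 1024-dimensional float32 vectors (which gives the ~4 KB per chunk quoted above), about 10 tokens per line of code, and 500-token chunks. All three are illustrative assumptions, and smaller or overlapping chunks would produce larger totals.

```python
def estimate_chunks(total_lines: int, tokens_per_line: int = 10, chunk_tokens: int = 500) -> int:
    """Estimate how many chunks a codebase splits into (ceiling division)."""
    total_tokens = total_lines * tokens_per_line
    return -(-total_tokens // chunk_tokens)


def embedding_index_bytes(num_chunks: int, dims: int = 1024, bytes_per_value: int = 4) -> int:
    """Raw vector storage, excluding metadata and any index overhead."""
    return num_chunks * dims * bytes_per_value


for lines in (1_000, 1_000_000):
    chunks = estimate_chunks(lines)
    size = embedding_index_bytes(chunks)
    print(f"{lines} lines -> {chunks} chunks -> {size} bytes")
# → 1000 lines -> 20 chunks -> 81920 bytes
# → 1000000 lines -> 20000 chunks -> 81920000 bytes
```

The chunk count drives the result, so it is worth measuring on a real repository before provisioning storage.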