
Access Control and Security in Knowledge Bases

A knowledge base without access control is a data breach waiting to happen. If your RAG system indexes proprietary customer data, medical records, or internal memos, and you forget to filter documents at retrieval time, users will see information they shouldn't. Access control in RAG means ensuring that each user retrieves only documents they are authorized to read. This is fundamentally different from application-level access control; here, filtering happens at the database layer, before documents reach the LLM. This article covers implementing user-based retrieval filtering, document-level permission tagging, and auditing access.

Why RAG Access Control Is Different

Traditional web applications use role-based access control (RBAC) at the application layer: you authenticate a user, check their role, and show them authorized pages. RAG systems have an additional layer: the retrieval engine. If a user with role "customer" triggers a query, the retrieval engine must filter the vector database to return only documents tagged for that role, before the LLM sees them. If you skip this step, the LLM may inadvertently leak private data by citing a document the user shouldn't access.
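
The ordering described above can be sketched in a few lines. This is a minimal illustration rather than a full pipeline: `Chunk`, `authorized_chunks`, and `build_prompt` are hypothetical names, and the permission check is assumed to be a simple tag intersection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    permissions: frozenset[str]  # tags attached at indexing time


def authorized_chunks(chunks: list[Chunk], user_tags: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one tag with the user."""
    return [c for c in chunks if c.permissions & user_tags]


def build_prompt(question: str, chunks: list[Chunk], user_tags: set[str]) -> str:
    """Build the LLM prompt from authorized chunks only."""
    # Filtering happens here, before any chunk text enters the prompt.
    context = "\n".join(c.text for c in authorized_chunks(chunks, user_tags))
    return f"Context:\n{context}\n\nQuestion: {question}"


retrieved = [
    Chunk("Reset your password from the login page.", frozenset({"customer"})),
    Chunk("Board memo: acquisition target shortlist.", frozenset({"executive"})),
]
prompt = build_prompt("How do I reset my password?", retrieved, {"customer"})
```

Because the restricted chunk never enters the prompt, the model cannot quote or paraphrase it, whatever the instructions in the prompt say.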

A 2024 incident at a major RAG-powered customer support system revealed that filtering was missing; support agents could query the knowledge base and inadvertently retrieve documents marked "CEO only", leaking strategic decisions to the chat transcript. The fix: implement access control at retrieval time, not prompting time.

Tagging Documents with Permissions

At indexing time, assign each document or chunk one or more permission tags. Common tag schemes:

  • Role-based: "admin", "employee", "customer", "public"
  • User-based: "user_id:123", "team:sales", "group:engineering"
  • Sensitivity-based: "public", "internal", "confidential", "secret"
  • Composite: "role:employee AND sensitivity:internal"
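
One way to evaluate these tag schemes, including the composite form, is sketched below. It rests on several assumptions that are not part of any library: tags are plain strings, a composite tag joins its terms with " AND ", a document is readable when any one of its tags is satisfied, and the user's tag set already lists every sensitivity level the user is cleared for. The function names are hypothetical.

```python
def satisfies(tag_expression: str, user_tags: set[str]) -> bool:
    """Return True if the user holds every term of an AND-joined tag expression."""
    terms = [term.strip() for term in tag_expression.split(" AND ")]
    return all(term in user_tags for term in terms)


def can_read(document_tags: list[str], user_tags: set[str]) -> bool:
    """A document is readable if it is public or any one of its tags is satisfied."""
    return "public" in document_tags or any(
        satisfies(tag, user_tags) for tag in document_tags
    )


user_tags = {"role:employee", "team:sales", "sensitivity:internal"}

internal_ok = can_read(["role:employee AND sensitivity:internal"], user_tags)
confidential_ok = can_read(["role:employee AND sensitivity:confidential"], user_tags)
```

Here `internal_ok` is true because both terms are present in `user_tags`, while `confidential_ok` is false because one term is missing.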

Here is how to tag documents during indexing:

from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()

def index_document_with_permissions(
    document_text: str,
    source: str,
    permissions: list[str],  # e.g., ["public", "role:employee"]
    vector_db
) -> None:
    """Index a document with fine-grained permission tags."""

    # Split document into chunks (see article 2); chunk_document is assumed
    # to be defined elsewhere and to return a list of strings
    chunks = chunk_document(document_text)

    for i, chunk_text in enumerate(chunks):
        # Embed chunk
        embedding_response = client.embeddings.create(
            input=chunk_text,
            model="text-embedding-3-small"
        )
        embedding = embedding_response.data[0].embedding

        # Index with metadata; every chunk carries the document's tags
        chunk_id = f"{source}_chunk_{i}"
        vector_db.upsert([
            {
                "id": chunk_id,
                "values": embedding,
                "metadata": {
                    "source": source,
                    "chunk_index": i,
                    "text": chunk_text,
                    "permissions": permissions,  # Key: permission tags
                    "indexed_at": datetime.now(timezone.utc).isoformat()
                }
            }
        ])

# Example: index a document with role-based access
# (your_vector_db is a placeholder for your vector database client)
confidential_doc = """
Q3 Revenue Strategy: We plan to launch new product line...
Competitors are weak in this segment...
"""

index_document_with_permissions(
    document_text=confidential_doc,
    source="q3-strategy.pdf",
    permissions=["role:cfo", "role:executive"],  # Only CFOs and executives
    vector_db=your_vector_db
)

# Index a public document
public_doc = """
Python 3.11 Release Notes: Major improvements to async/await...
"""

index_document_with_permissions(
    document_text=public_doc,
    source="python-3.11-notes.md",
    permissions=["public"],  # Everyone can access
    vector_db=your_vector_db
)
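
The indexing code relies on `chunk_document` and on a vector database client, neither of which is defined in this article. For local experiments, stand-ins along the following lines may be enough. Both are hypothetical: the word-count chunker is deliberately naive, and `InMemoryVectorDB` only mimics the `upsert` and `query` shapes used in this article, not any real product's API.

```python
import math


def chunk_document(text: str, max_words: int = 50) -> list[str]:
    """Hypothetical stand-in: split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class InMemoryVectorDB:
    """Hypothetical stand-in for a vector database client, for local testing only."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def upsert(self, vectors: list[dict]) -> None:
        # Insert or overwrite by id.
        for record in vectors:
            self.records[record["id"]] = record

    def query(self, vector: list[float], top_k: int, include_metadata: bool = True) -> dict:
        # Brute-force similarity search over every stored vector.
        scored = [(cosine(vector, r["values"]), r) for r in self.records.values()]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        matches = []
        for score, record in scored[:top_k]:
            match = {"id": record["id"], "score": score}
            if include_metadata:
                match["metadata"] = record["metadata"]
            matches.append(match)
        return {"matches": matches}
```

The brute-force search is linear in the number of stored chunks, which is fine for tests and unsuitable for production.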

Filtering Retrieval by User Role

At query time, filter results based on the logged-in user's permissions. Here is the retrieval function:

from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()

class User:
    """Represents an authenticated user with roles/permissions."""
    def __init__(self, user_id: str, roles: list[str]):
        self.user_id = user_id
        self.roles = roles  # e.g., ["employee", "sales-team"]

    def can_access(self, document_permissions: list[str]) -> bool:
        """Check if this user is allowed to access a document."""
        # Public documents are readable by everyone, including users with no roles
        if "public" in document_permissions:
            return True
        # Explicit per-user allowlist tag, e.g. "user_id:alice_123"
        if f"user_id:{self.user_id}" in document_permissions:
            return True
        # Otherwise a role must match a tag, either bare ("employee") or
        # prefixed ("role:employee") as in the indexing example
        return any(
            role in document_permissions or f"role:{role}" in document_permissions
            for role in self.roles
        )

def retrieve_with_access_control(
    query: str,
    user: User,
    vector_db,
    k: int = 10
) -> list[dict]:
    """Retrieve documents, filtering by user permissions."""

    # Step 1: Embed query
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding

    # Step 2: Query vector database, over-fetching 2x to compensate for
    # permission filtering. This is a heuristic: fewer than k results can
    # still come back if most of the top matches are restricted.
    raw_results = vector_db.query(
        vector=query_embedding,
        top_k=k * 2,
        include_metadata=True
    )

    # Step 3: Filter by permissions
    authorized_results = []
    for result in raw_results["matches"]:
        # Fail closed: a chunk with no permission tags is visible to no one
        document_permissions = result["metadata"].get("permissions", [])

        # Check if user is authorized
        if user.can_access(document_permissions):
            authorized_results.append(result)
        else:
            # Log the filtered chunk (for compliance)
            log_unauthorized_access(
                user_id=user.user_id,
                chunk_id=result["id"],
                permissions_required=document_permissions,
                timestamp=datetime.now(timezone.utc).isoformat()
            )

    # Return up to k authorized results
    return authorized_results[:k]

def log_unauthorized_access(
    user_id: str,
    chunk_id: str,
    permissions_required: list[str],
    timestamp: str
) -> None:
    """Log access denial for compliance auditing."""
    log_entry = {
        "event": "unauthorized_access_attempt",
        "user_id": user_id,
        "chunk_id": chunk_id,
        "permissions_required": permissions_required,
        "timestamp": timestamp
    }
    # Write to secure audit log (database, S3, syslog); print is a placeholder
    print(f"[AUDIT] {log_entry}")

# Example usage
current_user = User(
    user_id="alice_123",
    roles=["employee", "sales-team"]
)

results = retrieve_with_access_control(
    query="What is our Q3 revenue strategy?",
    user=current_user,
    vector_db=your_vector_db,
    k=10
)

if not results:
    print("No documents found, or all matching documents require higher permissions.")
else:
    for result in results:
        print(f"- {result['metadata']['source']}: {result['metadata']['text'][:100]}...")
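
The function above filters after the similarity search, in application code. To push the filter into the database query itself, many vector databases accept a metadata filter alongside the query vector. The sketch below only builds such a filter. It assumes a MongoDB-style `$in` operator that matches list-valued metadata when any element overlaps, which several hosted vector databases offer, so check your database's documentation for the exact syntax. `effective_tags` and `permission_filter` are hypothetical helper names.

```python
def effective_tags(user_id: str, roles: list[str]) -> list[str]:
    """All permission tags a user can satisfy, in the formats used at indexing time."""
    tags = ["public", f"user_id:{user_id}"]
    for role in roles:
        # Cover both the bare and the prefixed spelling of each role.
        tags.extend([role, f"role:{role}"])
    return tags


def permission_filter(user_id: str, roles: list[str]) -> dict:
    """Build a metadata filter matching chunks whose permissions overlap the user's tags."""
    # Assumption: "$in" on a list-valued field matches if any element is listed.
    return {"permissions": {"$in": effective_tags(user_id, roles)}}


query_filter = permission_filter("alice_123", ["employee", "sales-team"])
# Passed to the query call, for example (the parameter name is an assumption):
# vector_db.query(vector=query_embedding, top_k=10,
#                 include_metadata=True, filter=query_filter)
```

With the filter applied inside the query, restricted chunks never leave the database and `top_k` counts only authorized chunks, so the 2x over-fetch is no longer needed. Keeping the application-side check as a second line of defense is still reasonable.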

Advanced Permission Models

Simple role matching scales only so far. For complex organizations, use more sophisticated models:

Attribute-Based Access Control (ABAC)

Each document and user has attributes (department, clearance level, project), and permissions are computed dynamically:

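def user_can_access(user: dict, document: dict) -> bool:
    """ABAC: check multiple attributes. Any one matching rule grants access."""
    # User in same department as document: allow
    department = document.get("department")
    if department is not None and user.get("department") == department:
        return True

    # User clearance >= document sensitivity: allow
    clearance_levels = {"public": 0, "internal": 1, "confidential": 2, "secret": 3}
    # Fail closed: an unknown clearance counts as the lowest level and an
    # unknown or missing sensitivity as the highest
    user_level = clearance_levels.get(user.get("clearance"), 0)
    doc_level = clearance_levels.get(
        document.get("sensitivity"), max(clearance_levels.values())
    )
    if user_level >= doc_level:
        return True

    # User in document's explicit allowlist: allow
    if user["user_id"] in document.get("allowed_users", []):
        return True

    return False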
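
In the function above any single rule grants access, so a department match alone unlocks even the most sensitive document in that department. Where that is too permissive, a stricter combination is possible. The sketch below is one such variant under an assumed policy, not a standard one: clearance is mandatory, and non-public documents additionally need a department or allowlist match. `strict_can_access` is a hypothetical name.

```python
CLEARANCE_LEVELS = {"public": 0, "internal": 1, "confidential": 2, "secret": 3}


def strict_can_access(user: dict, document: dict) -> bool:
    """Deny by default: clearance is mandatory, plus a department or allowlist match."""
    top = max(CLEARANCE_LEVELS.values())
    # Unknown clearance maps to the lowest level, unknown sensitivity to the highest.
    user_level = CLEARANCE_LEVELS.get(user.get("clearance"), 0)
    doc_level = CLEARANCE_LEVELS.get(document.get("sensitivity"), top)
    if user_level < doc_level:
        return False

    # Public documents need no further match.
    if doc_level == 0:
        return True

    same_department = (
        document.get("department") is not None
        and user.get("department") == document.get("department")
    )
    allowlisted = user.get("user_id") in document.get("allowed_users", [])
    return same_department or allowlisted


alice = {"user_id": "alice", "department": "sales", "clearance": "internal"}
secret_sales_doc = {"department": "sales", "sensitivity": "secret"}
internal_sales_doc = {"department": "sales", "sensitivity": "internal"}
```

Under this policy `alice` can read `internal_sales_doc` but not `secret_sales_doc`, even though both belong to her department.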

Time-Based Access

Allow temporary access to sensitive documents that expires after a date:

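from datetime import datetime

def can_access_with_expiry(user: User, document: dict, current_time: str) -> bool:
    """Check if user has access, considering expiry dates."""
    expiry = document.get("access_expiry")
    if expiry is not None:
        # Compare parsed datetimes, not raw strings; both values are assumed
        # to be ISO 8601 timestamps that carry a UTC offset or a trailing "Z"
        now = datetime.fromisoformat(current_time.replace("Z", "+00:00"))
        if now > datetime.fromisoformat(expiry.replace("Z", "+00:00")):
            return False  # Access has expired

    return user.can_access(document["permissions"])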
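
An expiry stored on the document closes the window for every user at once. To give one user temporary access, the expiry can be attached to that user's grant instead. The sketch below assumes a hypothetical `grants` mapping from user ID to an ISO 8601 expiry timestamp. Where it lives (a separate table, or flattened into chunk metadata) is left open, since not every vector database stores nested metadata.

```python
from datetime import datetime, timezone


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed


def has_active_grant(user_id: str, grants: dict[str, str], now: datetime) -> bool:
    """True if this user holds a temporary grant that has not yet expired."""
    expiry = grants.get(user_id)
    return expiry is not None and now <= parse_utc(expiry)


# Hypothetical grants for one document: bob may read it until the end of June 2026.
grants = {"bob_456": "2026-06-30T23:59:59Z"}

mid_june = datetime(2026, 6, 15, tzinfo=timezone.utc)
early_july = datetime(2026, 7, 1, tzinfo=timezone.utc)
```

A retrieval check could then allow a chunk when either the regular permission check passes or `has_active_grant` returns true for the current user.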

Auditing Access

Log every document access (success and failure) for compliance:

from datetime import datetime, timezone

def retrieve_with_audit(
    query: str,
    user: User,
    vector_db,
    k: int = 10
) -> list[dict]:
    """Retrieve with complete audit logging."""

    results = retrieve_with_access_control(query, user, vector_db, k)

    # Audit log
    audit_entry = {
        "query": query,
        "user_id": user.user_id,
        "user_roles": user.roles,
        "results_returned": len(results),
        "chunks_accessed": [r["id"] for r in results],
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "approved": True
    }

    # write_audit_log is assumed to be defined elsewhere and to append
    # to an immutable, tamper-evident log
    write_audit_log(audit_entry)

    return results
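
The code above leaves `write_audit_log` undefined. One common way to make a log tamper-evident is a hash chain, where each record stores a hash covering its own content and the previous record's hash. The class below is a minimal in-memory sketch of that idea. `HashChainedLog` is a hypothetical name, and a real deployment would persist the records to append-only storage and keep the latest hash somewhere the log writer cannot modify.

```python
import hashlib
import json


class HashChainedLog:
    """Append-only in-memory log where each record commits to the previous one."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first record

    def __init__(self) -> None:
        self.entries: list[dict] = []

    @staticmethod
    def _digest(entry: dict, prev_hash: str) -> str:
        # Canonical JSON so the same entry always hashes the same way.
        payload = json.dumps(entry, sort_keys=True) + prev_hash
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def append(self, entry: dict) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        self.entries.append(
            {"entry": entry, "prev_hash": prev_hash, "hash": self._digest(entry, prev_hash)}
        )

    def verify(self) -> bool:
        """Recompute the chain; an edited or reordered record breaks it."""
        prev_hash = self.GENESIS
        for record in self.entries:
            if record["prev_hash"] != prev_hash:
                return False
            if record["hash"] != self._digest(record["entry"], prev_hash):
                return False
            prev_hash = record["hash"]
        return True


log = HashChainedLog()
log.append({"user_id": "alice_123", "chunks_accessed": ["q3-strategy.pdf_chunk_0"]})
log.append({"user_id": "bob_456", "chunks_accessed": []})
```

Note that the chain by itself does not detect records dropped from the end of the log, which is why the latest hash should be stored separately.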

Key Takeaways

  • Tag every document and chunk with fine-grained permission labels at indexing time.
  • Filter retrieval results by user permissions before passing to the LLM; filtering is mandatory for security.
  • Use role-based or attribute-based access control depending on organizational complexity.
  • Log all access attempts (success and failure) for compliance and incident investigation.
  • When filtering after retrieval, over-fetch (for example 2x) so that users are more likely to get K results once restricted chunks are dropped; this is a heuristic, not a guarantee.

Frequently Asked Questions

What if a user should have temporary access to a sensitive document?

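Use time-based access: add an access_expiry field to the document's metadata. At retrieval time, check whether the current timestamp exceeds the expiry; if so, deny access. This avoids having to reindex documents. Note that an expiry stored on the document applies to every user; to limit the window for a single user, store the expiry with that user's entry in the allowlist instead.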

Can I use LLM-based filtering instead of database-level filtering?

Not recommended for security. LLMs are generative; they might inadvertently leak information in reasoning chains or logs. Always filter at the database layer before the LLM touches the data.

How do I handle documents with mixed permissions (some users allowed, others not)?

Use an explicit allowlist in the permission tags, for example permissions=["user_id:alice", "user_id:bob", "role:admin"]. At retrieval time, check whether the current user's ID or any of their roles matches one of the tags.

Should I show users why a document was filtered?

Record the reason in the audit log, but do not show it in the chat interface. If a query returns no results, a generic message such as "You don't have access to documents matching this query" is appropriate. Listing which specific documents the user cannot see is a security risk (information leakage).

How do I rotate permissions if a user's role changes?

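A user's roles should be read from your identity system at query time, so a role change takes effect on the next query without touching the index. Document permissions are metadata in the vector database: when they change, update the metadata for every chunk of that document via your vector database's update API (no re-indexing needed). Log the change in your audit trail.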
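
Because permission tags are stored per chunk, changing a document's permissions means touching every chunk that came from it. The sketch below shows the loop against a hypothetical in-memory `records` mapping (chunk ID to record) that stands in for the vector database. With a real database you would look up the chunk IDs for the source and issue its metadata-update call for each one.

```python
def update_source_permissions(
    records: dict[str, dict], source: str, new_permissions: list[str]
) -> list[str]:
    """Replace the permission tags on every chunk of one source; return the updated ids."""
    updated = []
    for chunk_id, record in records.items():
        if record["metadata"].get("source") == source:
            # Copy the list so chunks do not share one mutable object.
            record["metadata"]["permissions"] = list(new_permissions)
            updated.append(chunk_id)
    return updated


records = {
    "q3-strategy.pdf_chunk_0": {"metadata": {"source": "q3-strategy.pdf", "permissions": ["role:cfo"]}},
    "q3-strategy.pdf_chunk_1": {"metadata": {"source": "q3-strategy.pdf", "permissions": ["role:cfo"]}},
    "python-3.11-notes.md_chunk_0": {"metadata": {"source": "python-3.11-notes.md", "permissions": ["public"]}},
}
changed = update_source_permissions(records, "q3-strategy.pdf", ["role:cfo", "role:executive"])
```

The returned list of chunk IDs is a natural payload for the audit entry that records the permission change.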
