Retrieval Augmented Generation (RAG) Systems
TL;DR
- RAG connects an LLM to fresh evidence retrieved from documents or databases—useful when training cutoff dates or privacy rules forbid stuffing everything into weights.
- Reliability depends less on “better prompting” alone and more on retrieval quality, chunk boundaries, metadata filtering, and answer protocols (“cite or abstain”).
- Treat retrieval hits as untrusted strings: sanitize, dedupe, attribute sources, and measure hallucinations against golden tests.
Prerequisites
Before diving deep here, skim:
- Your First LLM-Powered Application — interfaces for prompts and evaluations.
- Building Robust Q&A Systems — abstention patterns and refusal UX.
Optional sibling preview:
- Building LangChain-Powered Applications — orchestration tooling once pipelines mature.
Core explanation
What RAG changes about prompting
Without retrieval, models interpolate from memorized statistics inside weights—fine for brainstorming, risky for regulated facts or volatile domains.
With retrieval, you assemble context packs: snippets plus citations plus timestamps plus confidence cues. The model becomes more like an interpreter sitting atop evidence—but does not magically verify truth. Garbage retrieval yields articulate nonsense unless governance catches it.
Anatomy of a baseline pipeline
- Ingest documents → normalize encoding, strip boilerplate, preserve headings where helpful.
- Chunk into retrieval units balancing recall vs specificity—tiny chunks lose surrounding meaning; giant chunks bury needles.
- Embed / index chunks into a searchable store (vector + lexical hybrids often outperform pure cosine similarity alone).
- Retrieve top‑k candidates using hybrid scoring + metadata filters (tenant ID, locale, doc version).
- Rerank (optional but frequently ROI-positive) using cross-encoder models or lightweight heuristics.
- Compose prompt that forbids extrapolation beyond cited passages unless flagged explicitly as speculation.
- Evaluate offline + online: citation precision, grounded answer accuracy, latency SLOs.
Grounding prompts that survive upgrades
Avoid vague encouragements (“be truthful”). Encode mechanical behavior:
You answer ONLY using QUOTES labeled [n].
Hard rules:
- If evidence does not resolve the question, respond with INSUFFICIENT_EVIDENCE plus what is missing.
- Each factual sentence ends with bracket citations like [2].
- Never infer pricing, regulatory obligations, or medical diagnoses beyond quoted text.
Evidence:
[1] Title: ...
Text: ...
Question:
...
Adapt brackets/citations to your UI renderer—consistency beats flair.
Global notes
- Legal/compliance framing stays generic here: jurisdictions differ—always involve qualified reviewers before deploying advice-heavy assistants (finance, healthcare, immigration, hiring).
- Locale: retrieval corpora often mix languages—detect language early or expose explicit locale switches—avoid silently assuming USD dates (
MM/DD) everywhere.
Worked example (synthetic HR handbook snippet)
Evidence excerpt:
[1] Remote employees must submit expense receipts within 30 calendar days.
[2] Meal reimbursement caps apply per region table §4.3 (updated quarterly).
Question: “Can I submit February receipts in April?”
Grounded answer sketch:
Based on [1], receipts must be submitted within 30 calendar days—February receipts submitted in April likely violate policy unless another clause grants exceptions (not present here).
Regarding caps: [2] references regional limits but exact amounts are not quoted—answer requires consulting §4.3 table.
Notice abstention where numeric caps aren’t quoted—this prevents plausible-but-wrong allowances.
Common mistakes
- Semantic collisions: embeddings confuse similarly worded clauses across jurisdictions—filter metadata aggressively.
- Stale citations: retrieving evergreen docs without version stamps trains users to trust outdated guidance—surface doc freshness explicitly.
- Over stuffing context: dumping entire PDFs wastes tokens and invites contradictory snippets—prefer hierarchical summaries + selective drill-down.
- No adversarial retrieval QA: attackers poison uploads—sandbox ingestion content and scan for injection attempts aimed at downstream prompts.
Checklist before promoting RAG to production traffic
- Offline golden set covering easy/medium/adversarial retrieval misses
- Canary dashboards for citation precision + abstention rate drift
- Tenant isolation validated at storage + prompt composition boundary
- Incident playbook when embeddings/index drift after re-ingestion
Hybrid lexical + dense retrieval
Pure embedding search excels at semantic similarity but struggles when users paste identifiers exactly—purchase order numbers, legal citations, hexadecimal faults. Combine signals:
- Run BM25 / SQLite FTS / OpenSearch style lexical retrieval for literal overlaps.
- Run dense retrieval for paraphrases (“invoice delays” vs “payment backlog”).
- Merge with reciprocal rank fusion or learned rerankers—simple max-score blends mis-weight noisy channels.
Operational guidance:
- Keep lexical indexes updated alongside embeddings—missing deletes poison answers silently (“ghost citations”).
- Tag chunks with origin (
pdf_page,wiki_rev,ticket_export) so prompts can tune skepticism—wikis drift faster than signed PDF policy bundles.
Latency budgeting matters globally: reranking adds milliseconds acceptable internally but painful on spotty mobile networks—tier retrieval depth per device class when UX demands it.
Evaluation loops beyond static datasets
Golden prompts decay once products evolve—invest in continuous evaluation scaffolding:
Offline replay harness
Capture anonymized retrieval payloads alongside model configurations—replay nightly against staging indexes measuring:
- Citation recall: percentage of facts anchored to retrieved passages vs plausible hallucinations flagged by graders.
- Answer completeness: rubric scores independent from lexical overlap—semantic similarity metrics complement human judging but rarely substitute entirely.
Store embeddings versions tied to replay manifests—otherwise you cannot bisect regressions spanning both model and retrieval updates.
Online guardrails with human spot checks
Automate canary prompts hitting production routes—alert when distributions shift:
- Sudden spikes in abstentions might indicate ingestion outages—not harmless conservatism.
- Drops in abstentions paired with rising user thumbs-down signal dangerous overconfidence—possibly poisoned corpora promoting misleading unanimity.
Red teaming ingestion pathways
Assume hostile uploads—even internal wikis include compromised contractor laptops occasionally:
- Sandbox parsers extracting PDF text—reject macros outright.
- Normalize unicode homoglyphs attackers embed to collide filenames—cheap preprocessing prevents perplexing retrieval collisions downstream.
These defensive investments rarely appear on roadmap decks yet disproportionately reduce existential QA headlines later.
FAQ
Do we always need vectors?
No—pure lexical search remains competitive for SKU/part-number-heavy corpora; hybrids reduce surprises.
Should answers quote verbatim or paraphrase?
Higher-risk domains benefit from quoting constraints + shorter paraphrases separately labeled.
How often should corpora refresh?
Align with document SLA owners—automatic staleness badges beat silent drift—international subsidiaries publish amendments asynchronously—surface revision timestamps prominently.
Does multilingual retrieval require separate indexes?
Often yes—or tokenization-aware hybrid pipelines struggle crossing scripts—evaluate cross-lingual embedding models deliberately rather than assuming English anchors suffice globally.
Chunk migrations without silent regressions
Re-embedding looks mechanical until subtle regressions erase recall overnight. Treat migrations like database schema changes:
Dual-write rehearsal
Keep prior indexes readable while staging replacements. Shadow-query both stacks and compare top-k overlap—not chasing perfection, but catching catastrophic divergence early. Replay grounded-answer harnesses nightly so citation regressions surface before executives demo dashboards live.
Gradual traffic shifting
Flip cohorts gradually—finance desks deserve slower migrations than internal labs. Combine quantitative deltas with qualitative notes (“sales wiki synonyms diverged”) so writers adjust ingestion normalization proactively instead of blaming models reflexively.
Timestamps and locales
Store canonical UTC timestamps alongside localized display titles—subsidiaries publish asynchronously worldwide; ambiguous “published yesterday” narratives confuse auditors. Separate multilingual variants rather than collapsing titles prematurely—collision ghosts are miserable to debug after launch.
Capacity and rollback
Schedule bursts outside regional peaks—not headquarters midnight assumptions—and define automated rollback thresholds when latency or abstention spikes breach guardrails.
What's next?
Continue orchestration patterns with Building LangChain-Powered Applications—then revisit fundamentals:
Exercise
Pick ten internal FAQ pairs—simulate three deliberate retrieval misses (synonyms, duplicates, stale docs)—measure whether your prompt abstains vs hallucinates; iterate grounding instructions accordingly.