
RAGAS Framework: Automated RAG Evaluation

RAGAS (Retrieval-Augmented Generation Assessment) is an open-source framework that automates RAG evaluation with little or no gold-standard labeling. It scores RAG outputs using four core metrics: Faithfulness (is the answer grounded?), Answer Relevance (does it address the query?), Context Precision (are retrieved passages relevant?), and Context Recall (do passages contain the answer?). Faithfulness and Answer Relevance need only the question, the retrieved contexts, and the generated answer; the context metrics are usually computed against a reference answer (the ground_truth column in the examples below). RAGAS enabled me to evaluate RAG systems in production without expensive annotation campaigns.

The framework uses LLM-based evaluation, leveraging language models to score answers and passages semantically rather than relying on exact-match metrics. This makes it flexible across domains, since it needs no task-specific rules or lexical matching. RAGAS is widely used in the RAG community for automated evaluation.

Understanding RAGAS Metrics

RAGAS defines four metrics that correspond to different stages of the RAG pipeline:

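Faithfulness measures whether the generated answer is grounded in the retrieved context (0–1 scale). It decomposes the answer into atomic claims and checks each against the context using an LLM. A high score means most claims are supported by the context, which makes hallucination unlikely; it does not by itself guarantee a correct answer, since the context may be wrong or incomplete.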
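Once each claim has a verdict, the faithfulness score reduces to a simple ratio. The following is a minimal sketch of that final step only, assuming the claim extraction and the per-claim verdicts (both produced by an LLM in RAGAS) are already available. The function name `faithfulness_score` and the verdict list are hypothetical and are not part of the RAGAS API.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of answer claims that the judge marked as supported by the context."""
    if not claim_supported:
        # No claims were extracted; this sketch returns 0.0 rather than raising
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for an answer that was split into three claims,
# two of which the judge found in the retrieved context
verdicts = [True, True, False]
print(faithfulness_score(verdicts))
```

With two of three claims supported, the score lands at two thirds, which would fall below the 0.8 guideline discussed later in this article.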

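Answer Relevance measures whether the answer actually addresses the user's question (0–1 scale). In RAGAS this is estimated by having an LLM generate questions that the answer would plausibly respond to, then comparing them with the original query via embedding similarity.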
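As far as I understand the RAGAS approach, the answer relevance score is the mean cosine similarity between the embedding of the original question and the embeddings of questions an LLM generates from the answer. The sketch below shows only that averaging step with toy two-dimensional vectors. The function names and vectors are hypothetical, and a real run would use an embedding model rather than hand-written numbers.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance_score(question_vec: Sequence[float],
                           generated_question_vecs: List[Sequence[float]]) -> float:
    """Mean similarity between the real question and questions generated from the answer."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches the original, one is orthogonal to it
question = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevance_score(question, generated))
```

An answer that drifts off-topic tends to yield generated questions far from the original one, which pulls the mean similarity down.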

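Context Precision measures whether retrieved passages are relevant to answering the query (0–1 scale). Unlike traditional precision, it does not require per-passage relevance labels: an LLM judges each passage's relevance directly. In the RAGAS implementation the score also rewards ranking relevant passages ahead of irrelevant ones.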
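A rank-aware precision score of this kind can be sketched as the average of precision@k over the positions that hold a relevant passage. This is my reading of the RAGAS definition rather than its actual code; the function name is hypothetical, and the relevance verdicts are assumed to come from an LLM judge.

```python
from typing import List


def context_precision_score(relevant: List[bool]) -> float:
    """Average precision@k over the ranks where a relevant passage appears.

    `relevant[i]` is the judge's verdict for the passage at rank i + 1.
    """
    total_relevant = sum(relevant)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank at this position
    return weighted_sum / total_relevant


# Same two relevant passages, different ranking
print(context_precision_score([True, True, False]))   # relevant passages ranked first
print(context_precision_score([False, True, True]))   # relevant passages ranked last
```

The first ranking scores higher than the second even though both retrieve the same passages, which is why a low value can point at ranking problems as well as at irrelevant results.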

Context Recall measures whether retrieved passages contain all information needed to answer the query (0–1 scale). It checks whether the claims in a reference (ground-truth) answer can be attributed to the retrieved context, so it needs a reference answer for each example.

Installing and Using RAGAS

Install RAGAS from PyPI and set up your API credentials for the LLM backend (OpenAI, Anthropic, or local models).

# pip install ragas

import os

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,  # ragas names the Answer Relevance metric "answer_relevancy"
    context_precision,
    context_recall,
)

# Set API key for LLM evaluation (in real projects, export it in your shell instead)
os.environ["OPENAI_API_KEY"] = "your-key-here"

# Prepare evaluation dataset
# Format: a dict of equal-length lists with keys "question", "answer", "contexts", "ground_truth".
# These column names follow the ragas 0.1.x schema; newer releases may use different names.
eval_dataset = {
    "question": [
        "What is the capital of France?",
        "How do you handle errors in Rust?",
    ],
    "answer": [
        "Paris is the capital of France.",
        "Rust uses the Result type to handle errors.",
    ],
    # One list of retrieved passages per question
    "contexts": [
        ["France is a Western European country. Its capital is Paris."],
        ["The Result enum in Rust represents either Ok or Err variants."],
    ],
    "ground_truth": [
        "Paris",
        "Result and Option types",
    ],
}

# Convert to HuggingFace Dataset
dataset = Dataset.from_dict(eval_dataset)

# Run RAGAS evaluation
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print(result)
# Output: Aggregated scores across all metrics
# Output: Aggregated scores across all metrics

Building a RAGAS Evaluation Pipeline

For production RAG systems, wrap RAGAS in a monitoring pipeline that runs evaluations on batches of queries and tracks metrics over time.

import json
from datetime import datetime
from typing import Dict, List

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,  # ragas names the Answer Relevance metric "answer_relevancy"
    context_precision,
    context_recall,
)

# Metric objects, and the column names they produce in the results table
METRICS = [faithfulness, answer_relevancy, context_precision, context_recall]
METRIC_NAMES = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]


class RAGASEvaluationPipeline:
    """RAGAS evaluation harness that tracks batch scores over time."""

    def __init__(self, model_name: str = "gpt-4"):
        # Recorded for reference only; to change the judge model,
        # pass an llm=... argument to evaluate().
        self.model_name = model_name
        self.evaluation_history = []

    def evaluate_rag_output(self,
                            query: str,
                            answer: str,
                            retrieved_passages: List[str]) -> Dict:
        """
        Evaluate a single RAG output using RAGAS metrics.

        Args:
            query: User query.
            answer: Generated answer.
            retrieved_passages: Retrieved context passages.

        Returns:
            Dict with RAGAS metrics (faithfulness, answer_relevancy, etc).
        """
        eval_dataset = Dataset.from_dict({
            "question": [query],
            "answer": [answer],
            "contexts": [retrieved_passages],
            # No reference available here, so the answer stands in for it.
            # This makes the reference-based context metrics circular; treat them with caution.
            "ground_truth": [answer],
        })

        # to_pandas() yields one row per example and one column per metric
        row = evaluate(eval_dataset, metrics=METRICS).to_pandas().iloc[0]
        return {name: float(row[name]) for name in METRIC_NAMES}

    def evaluate_batch(self, examples: List[Dict]) -> Dict:
        """
        Evaluate a batch of RAG examples.

        Args:
            examples: List of dicts with "query", "answer", "contexts",
                and optionally "ground_truth".

        Returns:
            Mean score per metric, plus a timestamp.
        """
        eval_dataset = Dataset.from_dict({
            "question": [e["query"] for e in examples],
            "answer": [e["answer"] for e in examples],
            "contexts": [e["contexts"] for e in examples],
            # Fall back to the generated answer when no reference is supplied
            "ground_truth": [e.get("ground_truth", e["answer"]) for e in examples],
        })

        per_example = evaluate(eval_dataset, metrics=METRICS).to_pandas()

        # Cast to plain floats so the result is JSON-serializable
        aggregate_scores = {
            f"{name}_mean": float(per_example[name].mean()) for name in METRIC_NAMES
        }
        aggregate_scores["timestamp"] = datetime.now().isoformat()

        self.evaluation_history.append(aggregate_scores)
        return aggregate_scores

    def check_regressions(self, new_scores: Dict,
                          regression_threshold: float = 0.05) -> List[str]:
        """
        Check if new scores represent a regression vs. the previous run.

        Args:
            new_scores: Latest evaluation scores.
            regression_threshold: Relative drop that counts as a regression
                (e.g., 0.05 = 5% drop).

        Returns:
            List of messages naming the metrics that regressed.
        """
        # evaluate_batch() has already appended new_scores to the history,
        # so exclude it before picking the baseline.
        previous_runs = [run for run in self.evaluation_history if run is not new_scores]
        if not previous_runs:
            return []  # No baseline to compare against

        baseline = previous_runs[-1]  # Most recent earlier run
        regressions = []

        for metric in (f"{name}_mean" for name in METRIC_NAMES):
            # Skip metrics that are missing or have a zero baseline (relative drop undefined)
            if baseline.get(metric) and metric in new_scores:
                drop = (baseline[metric] - new_scores[metric]) / baseline[metric]
                if drop > regression_threshold:
                    regressions.append(f"{metric} dropped {drop:.1%}")

        return regressions

# Example usage
pipeline = RAGASEvaluationPipeline()

examples = [
    {
        "query": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "contexts": ["France is a Western European country. Its capital is Paris."],
    },
    {
        "query": "How do you handle errors in Rust?",
        "answer": "Rust uses the Result type for error handling.",
        "contexts": ["The Result enum in Rust represents Ok or Err variants."],
    },
]

scores = pipeline.evaluate_batch(examples)
print(json.dumps(scores, indent=2))
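The regression check in the pipeline compares relative drops between two runs, not absolute differences. The arithmetic can be tried without any API calls using the standalone sketch below. The function `relative_drops` and the score values are hypothetical, and the dicts are assumed to hold numeric metrics only.

```python
from typing import Dict, List


def relative_drops(baseline: Dict[str, float],
                   current: Dict[str, float],
                   threshold: float = 0.05) -> List[str]:
    """Return a message for each metric whose relative drop exceeds the threshold."""
    flagged = []
    for metric, old in baseline.items():
        new = current.get(metric)
        if new is None or old <= 0:
            continue  # nothing to compare, or relative drop undefined
        drop = (old - new) / old
        if drop > threshold:
            flagged.append(f"{metric} dropped {drop:.1%}")
    return flagged


# Hypothetical mean scores from two consecutive evaluation runs
previous_run = {"faithfulness_mean": 0.90, "context_recall_mean": 0.80}
latest_run = {"faithfulness_mean": 0.81, "context_recall_mean": 0.79}

print(relative_drops(previous_run, latest_run))
# → ['faithfulness_mean dropped 10.0%']
```

A relative threshold treats a fall from 0.90 to 0.81 the same as a fall from 0.50 to 0.45. If small absolute changes on low-scoring metrics produce noisy alerts, an absolute threshold may suit better.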

Interpreting RAGAS Scores

RAGAS scores range from 0.0 (worst) to 1.0 (best). Use the following thresholds as starting points for production decisions, and calibrate them against human-reviewed samples from your own data:

  • Faithfulness > 0.8: Good grounding, minimal hallucinations.
  • Answer Relevance > 0.7: Answer adequately addresses the query.
  • Context Precision > 0.6: Retrieved passages are mostly relevant.
  • Context Recall > 0.8: Retrieved context is comprehensive.

If any metric falls below its threshold, investigate the corresponding pipeline stage. Low context precision or context recall suggests retrieval problems (irrelevant passages or missing information, respectively). Low faithfulness suggests the generator is not respecting the retrieved content. Low answer relevance suggests the generator is drifting off-topic.
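The triage logic above can be sketched as a small lookup. The threshold values come from the list in this section; the function name `triage`, the dict layout, and the hint wording are assumptions made for illustration.

```python
from typing import Dict, List

# Thresholds from the list above; treat them as starting points, not fixed rules
THRESHOLDS = {
    "faithfulness": 0.8,
    "answer_relevance": 0.7,
    "context_precision": 0.6,
    "context_recall": 0.8,
}

# Which pipeline stage to inspect when a metric is low
STAGE_HINTS = {
    "faithfulness": "generation: answer is not grounded in the retrieved content",
    "answer_relevance": "generation: answer is off-topic for the query",
    "context_precision": "retrieval: retrieved passages are not relevant enough",
    "context_recall": "retrieval: retrieved context is missing needed information",
}


def triage(scores: Dict[str, float]) -> List[str]:
    """List every metric that falls below its threshold, with a stage hint."""
    findings = []
    for metric, threshold in THRESHOLDS.items():
        score = scores.get(metric)
        if score is not None and score < threshold:
            findings.append(f"{metric}={score:.2f} (< {threshold}): {STAGE_HINTS[metric]}")
    return findings


# Hypothetical scores where only retrieval precision is weak
example_scores = {
    "faithfulness": 0.90,
    "answer_relevance": 0.75,
    "context_precision": 0.50,
    "context_recall": 0.85,
}
for finding in triage(example_scores):
    print(finding)
```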

Comparison with Other RAG Evaluation Methods

RAGAS uses LLM-based scoring, which is flexible but can be slow (API latency) and expensive (API costs). Alternative approaches include exact-match metrics (fast, limited coverage) and custom evaluation harnesses (tailored but require development effort).

Method | Speed | Accuracy | Cost
RAGAS (LLM-based) | Medium | High | Medium
Token overlap | Fast | Low | Low
Semantic similarity | Fast | Medium | Low
Custom harness | Variable | High | High (development)
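To make the "Token overlap" row concrete, here is a minimal sketch of a token-level F1 score, a common overlap metric from extractive question answering. It is a generic baseline, not something RAGAS provides, and the function name `token_f1` is my own.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A correct but verbose answer scored against a one-word reference
print(token_f1("Paris is the capital of France.", "Paris"))
```

The answer is correct, yet only one of its six tokens matches the reference, so the score stays low. This is the kind of limited coverage that LLM-based scoring is meant to address, at the price of latency and API cost.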

Key Takeaways

  • RAGAS provides four metrics (faithfulness, answer relevance, context precision, context recall) covering the entire RAG pipeline.
  • RAGAS uses LLMs to evaluate, making it domain-agnostic and flexible.
  • Build a production evaluation pipeline that runs RAGAS on batches, tracks metrics over time, and alerts on regressions.
  • Use RAGAS metrics to pinpoint failures: low context precision = retrieval issue, low faithfulness = generation issue.
  • Combine RAGAS with golden datasets and human review for comprehensive evaluation.

Frequently Asked Questions

How much does RAGAS evaluation cost?

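RAGAS makes at least one LLM API call per metric per example, so evaluating 100 examples across 4 metrics requires at least 400 API calls. Some metrics need more than one call (faithfulness, for instance, first extracts claims and then verifies them), so treat this as a lower bound. Costs depend on your LLM provider. Using GPT-4 is expensive; using GPT-3.5 or Claude-Haiku is cheaper but may be less reliable as a judge.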
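The call-count arithmetic above can be turned into a rough budget estimate. The sketch below is a back-of-the-envelope calculation only: the function names are hypothetical, and the tokens-per-call and price figures are placeholders, not real provider rates.

```python
def estimate_llm_calls(num_examples: int,
                       num_metrics: int,
                       calls_per_metric: int = 1) -> int:
    """Lower-bound call count; raise calls_per_metric for multi-step metrics."""
    return num_examples * num_metrics * calls_per_metric


def estimate_cost_usd(num_calls: int,
                      tokens_per_call: int,
                      usd_per_1k_tokens: float) -> float:
    """Approximate spend, assuming every call uses about the same number of tokens."""
    return num_calls * tokens_per_call / 1000 * usd_per_1k_tokens


calls = estimate_llm_calls(num_examples=100, num_metrics=4)
print(calls)  # → 400

# Placeholder figures; substitute your provider's actual pricing
print(estimate_cost_usd(calls, tokens_per_call=1500, usd_per_1k_tokens=0.01))
```

Running the estimate on a small sample first, then checking it against the provider's usage dashboard, gives a more trustworthy per-example figure than any fixed assumption.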

Can I use RAGAS with local models?

Yes, RAGAS supports local models via LiteLLM or by implementing a custom evaluator interface. Local models avoid per-call API fees but require hosting infrastructure and may be less accurate than API-based models.

What if my domain is not well-covered by the RAGAS LLM evaluator?

RAGAS is flexible. You can define custom metrics by implementing the Metric interface and plugging them into the evaluation pipeline. The default metrics work well for most domains but can be supplemented with domain-specific scorers.

Should I use RAGAS alone or combine it with other evaluation methods?

Combine RAGAS with golden datasets and spot-check human review. RAGAS is fast and reproducible but can miss subtle issues. Use it for continuous monitoring and regression testing; supplement with human evaluation for releases.
