Retrieval Evaluation Frameworks: Ragas and Similar
Actualizado: 2026-05-03
Building a RAG system is relatively easy: embeddings + vector DB + LLM. Measuring whether it works well is the real challenge. Are answers faithful to the retrieved context? Is the context relevant? Does the answer address the question? Without metrics, evaluation is intuition. Ragas[1] and similar frameworks turn those questions into numbers comparable across versions.
Key takeaways
- Ragas defines four core metrics: faithfulness, answer_relevancy, context_precision, and context_recall.
- Faithfulness detects hallucinations: what fraction of answer claims are supported by the retrieved context.
- Integrating evaluation in CI from day one detects regressions before they reach production.
- Evaluating with GPT-4 has significant cost: subset mode and cheaper evaluators reduce spending.
- Metrics are a proxy: periodic human review remains the ground truth.
The four metrics that matter
Faithfulness: fraction of answer claims derivable from context. Low faithfulness = hallucinations.
Answer Relevancy: does the answer address the original question? Evaluated by generating hypothetical questions from the answer and comparing with the original.
Context Precision: what fraction of retrieved context is relevant? Penalises noisy retrieval. Useful for tuning chunk size and top-k.
Context Recall: does retrieved context contain all info needed to answer? Requires ground truth. Detects when retrieval misses important information.
Basic Ragas usage
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
data = {
"question": ["What is RAG?"],
"answer": ["RAG combines retrieval with generation."],
"contexts": [["RAG (Retrieval Augmented Generation) combines..."]],
"ground_truth": ["RAG is a technique..."]
}
result = evaluate(Dataset.from_dict(data),
metrics=[faithfulness, answer_relevancy, context_precision, context_recall])CI integration
results = evaluate(dataset, metrics=[faithfulness])
if results["faithfulness"] < 0.8:
sys.exit(1) # Fails CI, blocks mergeThresholds from historical baseline. A 10% drop in faithfulness is a red flag.
Ragas alternatives
TruLens: similar metrics + built-in web dashboard, strong LangChain integration. DeepEval: pytest-like framework, easy custom metrics, CI/CD ready. Giskard: RAG eval + security + bias, commercial with free tier. Arize Phoenix: LLM-app observability including eval, open source + SaaS.
Cost management
500 questions × 4 metrics × 2-3 LLM calls ≈ 6000 calls ($60-300 per full eval). Strategies: subset mode in CI (50 questions per PR), cheaper evaluator (GPT-4o mini) for most, GPT-4 for critical, cache results.
The evaluation dataset
Tools are cheap. A representative, curated, maintained dataset is the difference between metrics that detect real problems and metrics that give false confidence. Build it from real user logs, annotated by domain experts, 50-500 examples to start, including edge cases.
Conclusion
Evaluating RAG rigorously is the difference between a system that “works in demo” and one that “works in production”. Ragas offers standard metrics with accessible implementation. Building your own representative dataset is the most valuable asset. Integrating evaluation in CI from day 1 is the investment with the best return in any serious RAG project.