Retrieval Evaluation Frameworks: Ragas and Similar

Updated: 2026-05-03

Building a RAG system is relatively easy: combine embeddings, a vector database, and an LLM. Measuring whether it works well is the real challenge. Are answers faithful to the retrieved context? Is the context relevant? Does the answer address the question? Without metrics, evaluation rests on intuition. Ragas[1] and similar frameworks turn those questions into numbers that can be compared across versions.

Key takeaways

  • Ragas defines four core metrics: faithfulness, answer_relevancy, context_precision, and context_recall.
  • Faithfulness detects hallucinations: it measures what fraction of the answer's claims are supported by the retrieved context.
  • Integrating evaluation into CI from day one catches regressions before they reach production.
  • Evaluating with GPT-4 carries a significant cost: subset mode and cheaper evaluators reduce spending.
  • Metrics are a proxy: periodic human review remains the ground truth.

The four metrics that matter

Faithfulness: the fraction of claims in the answer that can be derived from the retrieved context. A low score signals hallucinations.

Answer Relevancy: does the answer address the original question? Evaluated by generating hypothetical questions from the answer and comparing them with the original question.
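To make the comparison step concrete, here is a minimal sketch of the scoring part only. It assumes the hypothetical questions have already been generated by an LLM and embedded as vectors, and it averages their cosine similarity to the original question. The function names and the toy two-dimensional vectors are illustrative assumptions, not Ragas internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean similarity between the original question and the questions
    generated back from the answer (hypothetical helper)."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches the original, one is unrelated.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
score = answer_relevancy_score(original, generated)
print(round(score, 2))
```

An answer that drifts off topic produces generated questions that sit far from the original in embedding space, which pulls the average down.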

Context Precision: what fraction of the retrieved context is relevant? Penalises noisy retrieval. In Ragas the score is also rank-aware, so relevant chunks placed near the top count for more. Useful for tuning chunk size and top-k.

Context Recall: does the retrieved context contain all the information needed to answer? Requires ground truth. Detects when retrieval misses important information.
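Three of these metrics reduce to simple ratios once each claim, chunk, or ground-truth statement has a yes/no verdict. The sketch below assumes those verdicts are already available as booleans (in Ragas an LLM judge produces them) and uses the plain, unranked fraction for context precision, as in the definition above. All function names are hypothetical.

```python
def fraction_true(verdicts):
    """Share of True values in a list of booleans (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v) / len(verdicts)


def faithfulness_score(claim_supported):
    """Fraction of answer claims that the retrieved context supports."""
    return fraction_true(claim_supported)


def context_precision_score(chunk_relevant):
    """Fraction of retrieved chunks that are relevant (unranked simplification)."""
    return fraction_true(chunk_relevant)


def context_recall_score(truth_statement_found):
    """Fraction of ground-truth statements that appear in the retrieved context."""
    return fraction_true(truth_statement_found)


# Hand-labelled verdicts for one question (illustrative data).
f_score = faithfulness_score([True, True, False, True])              # 3 of 4 claims supported
p_score = context_precision_score([True, False, True, False, False])  # 2 of 5 chunks relevant
r_score = context_recall_score([True, True, False])                  # 2 of 3 statements retrieved
print(f_score, p_score, round(r_score, 3))
```

Reading the three numbers together helps locate the fault: low recall points at retrieval missing material, low precision at retrieval returning noise, and low faithfulness at the generator inventing content.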

Basic Ragas usage

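python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One entry per evaluated question; "contexts" holds the list of retrieved chunks.
# Column names follow the schema this article was written against;
# check them against your installed Ragas version.
data = {
    "question": ["What is RAG?"],
    "answer": ["RAG combines retrieval with generation."],
    "contexts": [["RAG (Retrieval Augmented Generation) combines..."]],
    "ground_truth": ["RAG is a technique..."]
}

# Each metric is scored by an LLM judge, so evaluator credentials must be configured.
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # aggregate score per metric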

CI integration

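python
import sys
from ragas import evaluate
from ragas.metrics import faithfulness

# `dataset` is the evaluation Dataset, built as in the basic usage example
results = evaluate(dataset, metrics=[faithfulness])
if results["faithfulness"] < 0.8:
    sys.exit(1)  # Non-zero exit code fails CI and blocks the merge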

Derive thresholds from your historical baseline rather than picking them arbitrarily. A 10% drop in faithfulness is a red flag.
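One way to turn a historical baseline into a gate is sketched below. It assumes the baseline is the mean of recent scores on the main branch and that the 10% figure is a relative drop; both are interpretation choices, and the function names are hypothetical.

```python
from statistics import mean


def baseline_from_history(scores):
    """Baseline = mean of previous scores (assumed to come from main-branch runs)."""
    if not scores:
        raise ValueError("need at least one historical score")
    return mean(scores)


def passes_gate(current, baseline, max_relative_drop=0.10):
    """True if the current score has dropped by less than max_relative_drop
    relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    relative_drop = (baseline - current) / baseline
    return relative_drop < max_relative_drop


history = [0.91, 0.89, 0.90]          # illustrative faithfulness scores
baseline = baseline_from_history(history)
ok_run = passes_gate(0.88, baseline)  # small dip, within tolerance
bad_run = passes_gate(0.75, baseline) # large drop, should block the merge
print(ok_run, bad_run)
```

In a CI script the boolean would decide the exit code, in the same way as the fixed 0.8 threshold above.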

Ragas alternatives

  • TruLens: similar metrics plus a built-in web dashboard; strong LangChain integration.
  • DeepEval: pytest-like framework with easy custom metrics; CI/CD ready.
  • Giskard: RAG evaluation plus security and bias testing; commercial with a free tier.
  • Arize Phoenix: observability for LLM apps, including evaluation; open source plus SaaS.

Cost management

A full run of 500 questions × 4 metrics × 2-3 LLM calls each comes to roughly 4,000-6,000 calls ($60-300 per full eval). Strategies: subset mode in CI (50 questions per PR), a cheaper evaluator (GPT-4o mini) for most evaluations with GPT-4 reserved for critical ones, and cached results.
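The arithmetic and the subset strategy can be sketched in a few lines. The per-call price of $0.05 is only what the upper figures above imply ($300 over 6,000 calls), not a quoted provider price, and the helper names are hypothetical.

```python
import random


def estimate_eval_cost(n_questions, n_metrics, calls_per_metric, cost_per_call):
    """Return (total LLM calls, estimated cost in dollars) for one evaluation run."""
    calls = n_questions * n_metrics * calls_per_metric
    return calls, calls * cost_per_call


def sample_subset(items, k, seed=0):
    """Pick k items reproducibly; a fixed seed keeps the PR subset stable between runs."""
    items = list(items)
    if k >= len(items):
        return items
    return random.Random(seed).sample(items, k)


# Full evaluation at the upper end of the estimate.
full_calls, full_cost = estimate_eval_cost(500, 4, 3, 0.05)

# Subset mode: 50 questions per pull request.
question_ids = list(range(500))
pr_subset = sample_subset(question_ids, 50)
pr_calls, pr_cost = estimate_eval_cost(len(pr_subset), 4, 3, 0.05)

print(full_calls, round(full_cost, 2), pr_calls, round(pr_cost, 2))
```

Keeping the seed fixed means every PR is scored on the same questions, so score changes reflect the code change rather than the sample.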

The evaluation dataset

Tools are cheap. A representative, curated, maintained dataset is the difference between metrics that detect real problems and metrics that give false confidence. Build it from real user logs annotated by domain experts, start with 50-500 examples, and include edge cases.
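As a rough sketch of how a curated set might feed the evaluation, the snippet below stores only the question and the expert-written ground truth, then regenerates answers and contexts from the current pipeline on every run. The `rag_pipeline` callable, the stub that stands in for it, and the example records are all hypothetical; the output uses the same four column names as the basic usage example.

```python
def build_eval_columns(examples, rag_pipeline):
    """Run each curated question through the pipeline and collect the
    column-oriented dict used for evaluation.

    `rag_pipeline` is assumed to take a question and return (answer, contexts).
    """
    columns = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for i, example in enumerate(examples):
        if not example.get("question") or not example.get("ground_truth"):
            raise ValueError(f"example {i} needs both 'question' and 'ground_truth'")
        answer, contexts = rag_pipeline(example["question"])
        columns["question"].append(example["question"])
        columns["answer"].append(answer)
        columns["contexts"].append(list(contexts))
        columns["ground_truth"].append(example["ground_truth"])
    return columns


def stub_pipeline(question):
    """Stand-in for the real RAG system, used only to make the sketch runnable."""
    return f"stub answer to: {question}", [f"stub context for: {question}"]


# Illustrative curated examples; real ones would come from annotated user logs.
curated = [
    {"question": "How do I reset my password?", "ground_truth": "Use the reset link."},
    {"question": "", "ground_truth": "Edge case placeholder"},
]

columns = build_eval_columns(curated[:1], stub_pipeline)
print(columns["question"], len(columns["contexts"][0]))
```

Separating the curated questions from the generated answers means the same dataset can score every new version of the system without re-annotation.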

Conclusion

Evaluating RAG rigorously is the difference between a system that “works in demo” and one that “works in production”. Ragas offers standard metrics with an accessible implementation. Your own representative dataset is the most valuable asset. Integrating evaluation into CI from day one is the investment with the best return in any serious RAG project.

  1. Ragas

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.