Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.

Inteligencia Artificial Metodologías

faithfulness llm metrics rag evaluation ragas relevance trulens

Retrieval Evaluation Frameworks: Ragas and Similar

June 16, 2024 8 min read 110 reads

Table of contents

Key takeaways
The four metrics that matter
Basic Ragas usage
CI integration
Ragas alternatives
Cost management
The evaluation dataset
Conclusion

Actualizado: 2026-05-03

Building a RAG system is relatively easy: embeddings + vector DB + LLM. Measuring whether it works well is the real challenge. Are answers faithful to the retrieved context? Is the context relevant? Does the answer address the question? Without metrics, evaluation is intuition. Ragas^[1] and similar frameworks turn those questions into numbers comparable across versions.

Key takeaways

Ragas defines four core metrics: faithfulness, answer_relevancy, context_precision, and context_recall.
Faithfulness detects hallucinations: what fraction of answer claims are supported by the retrieved context.
Integrating evaluation in CI from day one detects regressions before they reach production.
Evaluating with GPT-4 has significant cost: subset mode and cheaper evaluators reduce spending.
Metrics are a proxy: periodic human review remains the ground truth.

The four metrics that matter

Faithfulness: fraction of answer claims derivable from context. Low faithfulness = hallucinations.

Answer Relevancy: does the answer address the original question? Evaluated by generating hypothetical questions from the answer and comparing with the original.

Context Precision: what fraction of retrieved context is relevant? Penalises noisy retrieval. Useful for tuning chunk size and top-k.

Context Recall: does retrieved context contain all info needed to answer? Requires ground truth. Detects when retrieval misses important information.

Basic Ragas usage

python

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

data = {
    "question": ["What is RAG?"],
    "answer": ["RAG combines retrieval with generation."],
    "contexts": [["RAG (Retrieval Augmented Generation) combines..."]],
    "ground_truth": ["RAG is a technique..."]
}
result = evaluate(Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall])

CI integration

python

results = evaluate(dataset, metrics=[faithfulness])
if results["faithfulness"] < 0.8:
    sys.exit(1)  # Fails CI, blocks merge

Thresholds from historical baseline. A 10% drop in faithfulness is a red flag.

Ragas alternatives

TruLens: similar metrics + built-in web dashboard, strong LangChain integration. DeepEval: pytest-like framework, easy custom metrics, CI/CD ready. Giskard: RAG eval + security + bias, commercial with free tier. Arize Phoenix: LLM-app observability including eval, open source + SaaS.

Cost management

500 questions × 4 metrics × 2-3 LLM calls ≈ 6000 calls ($60-300 per full eval). Strategies: subset mode in CI (50 questions per PR), cheaper evaluator (GPT-4o mini) for most, GPT-4 for critical, cache results.

The evaluation dataset

Tools are cheap. A representative, curated, maintained dataset is the difference between metrics that detect real problems and metrics that give false confidence. Build it from real user logs, annotated by domain experts, 50-500 examples to start, including edge cases.

Conclusion

Evaluating RAG rigorously is the difference between a system that “works in demo” and one that “works in production”. Ragas offers standard metrics with accessible implementation. Building your own representative dataset is the most valuable asset. Integrating evaluation in CI from day 1 is the investment with the best return in any serious RAG project.

Was this useful?

[Total: 11 · Average: 4.6]

Post Views: 110

Ragas

Written by

Javier Cañete

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.

Retrieval Evaluation Frameworks: Ragas and Similar

Key takeaways

The four metrics that matter

Basic Ragas usage

CI integration

Ragas alternatives

Cost management

The evaluation dataset

Conclusion

Related posts

“EU AI Act 2026: a technical checklist for Spanish CTOs”

Agent observability with OpenTelemetry GenAI semconv in 2026

How to install and tune oMLX on M5 Max 128 GB

Multi-agent systems: LangGraph vs CrewAI vs Autogen in 2026