How to Evaluate a RAG System Without Fooling Yourself

Updated: 2026-05-03

Building a RAG system is relatively straightforward. Deciding whether it works well is much harder. Many teams deploy systems that look impressive in internal demos and disappoint in production because they never established a rigorous evaluation process. This article covers how to measure RAG quality honestly, what metrics to use, how to build evaluation sets, and the most common mistakes.

Key takeaways

  • Intuition deceives: a handful of engineer-built questions is not rigorous evaluation.
  • The four key dimensions are faithfulness, answer relevance, context precision, and context coverage.
  • The golden set is the most valuable asset: 100-500 curated questions representing real, difficult, and out-of-scope cases.
  • LLM as judge is scalable but has biases — periodically validating with humans is mandatory.
  • Integrating evaluation into the CI pipeline is the difference between improving and believing you’re improving.

Why Intuition Deceives

The natural temptation when evaluating a RAG system is to try a handful of questions, see seemingly reasonable answers, and declare success. This approach fails for several reasons:

  • The questions an engineer constructs to test are different from those real users will ask.
  • Easy cases are over-represented.
  • Confirmation bias leads to interpreting ambiguous answers as correct.
  • Model variability is high and a small sample doesn’t capture it.

Without structured measurement, the team ships changes without knowing whether they improve anything. Worse, a change may degrade quality and nobody notices until users complain.
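To make the small-sample point concrete, here is a minimal sketch that puts a 95% Wilson score interval around an observed pass rate. The function name `wilson_interval` and the example counts are illustrative assumptions, not part of any evaluation framework.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical counts: the same 90% pass rate on 10 vs. 300 questions.
for ok, n in [(9, 10), (270, 300)]:
    low, high = wilson_interval(ok, n)
    print(f"{ok}/{n} correct: plausible true rate between {low:.2f} and {high:.2f}")
```

With 9 of 10 answers judged correct, the interval runs from roughly 0.60 to 0.98; with 270 of 300 it narrows to roughly 0.86 to 0.93. Ten hand-picked questions cannot distinguish a mediocre system from a very good one.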

The Four Key Dimensions

RAG quality isn’t a single metric but a combination. Four dimensions capture most problems:

Faithfulness: measures whether the response is supported by the retrieved context. A response can be plausible but contain information the context didn’t support — hallucinations.

Answer relevance: measures whether the answer actually addresses the question asked. A system can return answers that are correct but tangential to what the user wanted.

Context precision: measures what fraction of retrieved context is actually relevant to the question. Retrieving a hundred documents of which only two are useful penalises the generator.

Context coverage: measures whether the retrieved context contains all the information needed to answer completely. It requires a reference answer to compare against.

Frameworks like Ragas automate the calculation of these metrics using an LLM as judge. For teams just starting out, Ragas is a pragmatic entry point.
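The two retrieval-side dimensions can also be approximated without any framework when the golden set lists which documents should be retrieved. The sketch below is a simplified, set-based proxy built on document IDs; it is not the formula Ragas uses, and the function names and IDs are assumptions for illustration. Faithfulness and answer relevance are not covered here because they need a judge (human or LLM).

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_coverage(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the documents needed for the answer that were retrieved."""
    if not relevant_ids:
        return 1.0  # nothing was required, so nothing is missing
    found = relevant_ids & set(retrieved_ids)
    return len(found) / len(relevant_ids)


# Hypothetical example: 4 chunks retrieved, 3 documents needed for a full answer.
retrieved = ["d1", "d7", "d9", "d4"]
needed = {"d1", "d4", "d5"}
print(context_precision(retrieved, needed))  # 2 of 4 retrieved are relevant
print(context_coverage(retrieved, needed))   # 2 of 3 needed were retrieved
```

Document-level IDs are a coarse stand-in: a retrieved document may be relevant yet still lack the specific fact required, which is why the article's definition of coverage relies on a reference answer.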

Build a Golden Set

The most valuable asset in RAG evaluation isn’t the framework but the golden set — a carefully curated set of question-answer pairs representing real cases.

The starting point is real user logs, if the system is already in production. For new systems, interviews with target users and domain experts generate the most realistic questions. A typical golden set has between 100 and 500 questions, with deliberate coverage of easy, medium, hard, ambiguous, and out-of-scope cases.

For each question, the golden set should include the correct answer (human-curated) and, optionally, the knowledge-base documents that should be retrieved.
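One possible shape for a golden set entry, with a check that every case type is represented, might look like the sketch below. The field names, the category labels, and the `missing_categories` helper are assumptions chosen for illustration; adapt them to your own schema.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed labels mirroring the case types described above.
CATEGORIES = ("easy", "medium", "hard", "ambiguous", "out_of_scope")


@dataclass(frozen=True)
class GoldenItem:
    question: str
    reference_answer: str                   # human-curated correct answer
    category: str                           # one of CATEGORIES
    expected_doc_ids: tuple[str, ...] = ()  # optional: documents that should be retrieved

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def missing_categories(items: list[GoldenItem], min_per_category: int = 1) -> list[str]:
    """Return the categories that have fewer than min_per_category items."""
    counts = Counter(item.category for item in items)
    return [c for c in CATEGORIES if counts[c] < min_per_category]


# Hypothetical two-item set, far too small and missing most case types.
golden = [
    GoldenItem("How do I reset my password?", "Use the 'Forgot password' link.", "easy", ("kb-12",)),
    GoldenItem("What is the weather in Madrid?", "Out of scope: decline politely.", "out_of_scope"),
]
print(missing_categories(golden))  # ['medium', 'hard', 'ambiguous']
```

Running the coverage check whenever the set changes helps keep it from drifting toward easy cases only.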

Maintaining the golden set is a continuous activity. As the product evolves and new cases emerge, new questions are added.

LLM as Judge: Uses and Limits

Many modern evaluations use an LLM as judge to score answers. It’s scalable and relatively cheap. But it has limitations:

  • The LLM judge tends to prefer answers that are longer, more formal, and more elaborately structured.
  • It can be too lenient on subtle errors that a human would catch.
  • If the same model is used for both generating and judging, the bias is even higher.

The usual mitigation is to judge with a different model than the generator, validate regularly against a human-reviewed sample, and use carefully designed evaluation prompts.

Never rely on an LLM judge alone without periodic human validation. A monthly sample of 50 human-reviewed answers, compared against the automatic verdicts, shows whether the judge is calibrated with reality.
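A minimal sketch of that comparison, assuming both the human reviewer and the judge give a binary correct/incorrect verdict per answer. It reports raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The function name and the sample verdicts are made up for illustration.

```python
def agreement_and_kappa(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Raw agreement rate and Cohen's kappa for two lists of binary verdicts."""
    if len(human) != len(judge) or not human:
        raise ValueError("verdict lists must be non-empty and of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n  # share of answers the human marked correct
    p_judge = sum(judge) / n  # share of answers the judge marked correct
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return observed, 1.0  # degenerate case: both raters gave one identical verdict throughout
    return observed, (observed - expected) / (1 - expected)


# Hypothetical monthly sample of 50 answers.
human = [True] * 40 + [False] * 10
judge = [True] * 38 + [False] * 2 + [True] * 6 + [False] * 4
observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

In this made-up sample the judge agrees with the human on 84% of answers, yet kappa is only about 0.41 because the judge accepts 6 of the 10 answers the human rejected. Raw agreement alone can hide exactly the leniency described above.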

Continuous Evaluation in the Pipeline

Once the golden set is built and the metrics are defined, evaluation should be integrated into the development cycle. Every prompt change, every model change, and every chunking change should pass through automated evaluation before merge.

The pattern that works in production includes thresholds on key metrics:

  • If faithfulness drops below 0.8, the build fails.
  • If answer relevance drops more than 5% against baseline, alert.
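The two thresholds above could be wired into a CI step along these lines. This is a sketch under assumptions: metric scores arrive as plain dictionaries, the "5%" is read as a relative drop against the baseline, and names such as `check_gate` and `GateResult` are hypothetical.

```python
from dataclasses import dataclass, field

FAITHFULNESS_FLOOR = 0.8   # below this, the build fails
MAX_RELEVANCE_DROP = 0.05  # assumed to mean a 5% relative drop vs. baseline


@dataclass
class GateResult:
    failures: list[str] = field(default_factory=list)  # block the merge
    alerts: list[str] = field(default_factory=list)    # notify, do not block

    @property
    def passed(self) -> bool:
        return not self.failures


def check_gate(current: dict[str, float], baseline: dict[str, float]) -> GateResult:
    result = GateResult()
    faithfulness = current["faithfulness"]
    if faithfulness < FAITHFULNESS_FLOOR:
        result.failures.append(
            f"faithfulness {faithfulness:.2f} is below {FAITHFULNESS_FLOOR:.2f}"
        )
    base = baseline["answer_relevance"]
    if base > 0:
        drop = (base - current["answer_relevance"]) / base
        if drop > MAX_RELEVANCE_DROP:
            result.alerts.append(f"answer relevance dropped {drop:.1%} against baseline")
    return result


# Hypothetical scores from one evaluation run against the golden set.
result = check_gate(
    current={"faithfulness": 0.78, "answer_relevance": 0.80},
    baseline={"faithfulness": 0.85, "answer_relevance": 0.86},
)
print(result.passed, result.failures, result.alerts)
```

In a real pipeline the calling script would exit with a non-zero status when `result.passed` is false, so the merge is blocked, and would route `result.alerts` to whatever notification channel the team uses.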

Additionally, real traffic samples in production should pass through continuous evaluation — not just the golden set.

Common Mistakes

Patterns that repeat across projects:

  • Optimising metrics that don’t correlate with user satisfaction: a system can score high on faithfulness and still give answers users perceive as useless.
  • Too-easy golden set: if the system scores 95% on all metrics but users complain, the set doesn’t capture real difficulty.
  • Not versioning the golden set with code: if the set changes without traceability, results can’t be compared across system versions.
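Besides keeping the golden set file in the same repository as the code, one lightweight way to get traceability is to store a content fingerprint of the set next to every evaluation result. The sketch below assumes items are plain dictionaries with a `question` key; the function name and the 12-character truncation are arbitrary choices for illustration.

```python
import hashlib
import json


def golden_set_fingerprint(items: list[dict]) -> str:
    """Short, order-independent SHA-256 fingerprint of the golden set content."""
    canonical = json.dumps(
        sorted(items, key=lambda item: item["question"]),
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical two-item set.
items = [
    {"question": "How do I reset my password?", "reference_answer": "Use the reset link."},
    {"question": "Which plans include SSO?", "reference_answer": "Enterprise only."},
]
print(golden_set_fingerprint(items))
```

Two evaluation runs are directly comparable only when their fingerprints match; a different fingerprint signals that the questions or reference answers changed between runs.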

Conclusion

Rigorous evaluation is the difference between a RAG system that works well and one that only seems to. The cost of implementing continuous evaluation is moderate; the cost of skipping it is a system that disappoints in production without the team knowing why. Tools like Ragas have democratised the infrastructure; the golden set and the discipline of measuring consistently are now the hard part. For teams just starting out, a minimum set of 50 well-selected questions and the four basic metrics is already an enormous leap over evaluating without structure.

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.