Continuous evaluation of RAG: dashboards that actually matter
Updated: 2026-05-03
RAG systems have gone from interesting prototypes to production components at many companies over the last year. With that transition comes the problem any production system has: it degrades if nobody watches it. The difference is that a RAG degrades more silently than a traditional service. It doesn’t go down, it doesn’t return 500s, it doesn’t trip the latency alert. It simply starts answering worse, and users notice before the teams do.
Key takeaways
- Initial evaluation has a short shelf life: the index changes, the model changes, user questions evolve.
- The metrics that hold up in production are: retrieval precision, effective recall, answer faithfulness to context, answer relevance, and tail latency (p95 and p99).
- LLM-as-judge has documented length and position biases; mitigation requires a judge from a different model family and periodic calibration against human evaluation.
- A two-layer architecture (continuous shadow evaluation plus weekly regression evaluation) catches 80 % of degradation scenarios.
- Four well-chosen metrics with clear alerts are worth more than fifteen metrics nobody reviews.
Why the initial evaluation is not enough
Almost every team deploying a RAG runs an evaluation set upfront. That initial evaluation is necessary but has a short shelf life.
The index changes. New documents get reindexed, others are retired, chunks split differently after a reingestion. The model changes too: OpenAI ships new versions, Anthropic refreshes Claude, teams swap the generator for cost or latency reasons. What worked yesterday with a specific model over a specific index stops working when either piece moves.
The third factor is question drift. Real users don’t ask the same things in the launch month as they do six months later. Emerging use cases aren’t covered by the initial evaluation set, and the RAG may be failing in the long tail without anyone noticing. The cost of this silent degradation ties directly into FinOps for AI: spend rises while quality falls.
Metrics that hold up in production
The RAG evaluation literature proposes many metrics. In production, a small subset earns its keep:
- Retrieval precision. The fraction of retrieved documents that are relevant. Precision below 60 % is almost always bad news.
- Effective recall. For verifiable questions, whether the fragment that should answer them is among those retrieved. It requires ground truth, but it catches a retriever that is missing key information.
- Answer faithfulness to context. Whether the model’s answer relies on the retrieved chunks or fabricates content. It is the key metric for detecting hallucinations and the one most prone to degrade when the generator changes.
- Answer relevance. Whether the answer actually addresses the question or drifts toward whatever is in the chunks.
- End-to-end latency and its distribution. Not just the mean but the p95 and p99 tail. RAGs have a skewed latency distribution because reranking can stretch the tail without moving the median.
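As a rough illustration of the retrieval and latency metrics above, here is a minimal Python sketch. The function names, the document-id representation, and the nearest-rank percentile method are assumptions made for the example, not part of any particular framework.

```python
import math
from typing import Sequence, Set


def retrieval_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of retrieved document ids that are labelled relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)


def recall_hit(retrieved: Sequence[str], gold_ids: Set[str]) -> bool:
    """True if at least one ground-truth fragment is among those retrieved."""
    return any(doc_id in gold_ids for doc_id in retrieved)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100] (an assumed convention)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: one slow rerank stretches the tail
# while the median stays where it was.
latencies_ms = [100.0, 110.0, 120.0, 130.0, 2000.0]
median_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

With these sample values the median is 120 ms while the p99 is 2000 ms, which is the kind of gap a mean-only panel hides.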
The LLM-as-judge problem
Many of the metrics above are implemented with an LLM as the evaluator. That beats having no evaluation, but the approach has documented biases:
- Length sensitivity: the judge prefers longer responses even when they are no better.
- Positional bias: the judge favours the first option in comparative evaluations.
- Excessive leniency: the judge is too generous with answers from its own model or model family.
Mitigations are known: use a judge from a different model family than the generator, randomise answer order in comparative evaluations, calibrate with a periodic subset of human evaluations, and track relative trends rather than trusting absolute scores. Continuous evaluation is useful primarily for detecting changes, not establishing absolute quality truths.
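Two of these mitigations, randomising answer order and calibrating against human labels, can be sketched in a few lines of Python. The `Judge` callable is a hypothetical stand-in for whatever LLM call you use; it is assumed to return "first" or "second". None of these names come from a real library.

```python
import random
from typing import Callable, Sequence

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def judge_pair(judge: Judge, question: str, answer_a: str, answer_b: str,
               rng: random.Random) -> str:
    """Return "a" or "b". Presentation order is randomised so that a judge
    favouring the first slot does not systematically favour one system."""
    swap = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swap else (answer_a, answer_b)
    verdict = judge(question, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    if swap:
        # Answer B was shown first, so "first" means B won.
        return "b" if picked_first else "a"
    return "a" if picked_first else "b"


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Share of items where the automated judge matches the human label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(1 for j, h in zip(judge_labels, human_labels) if j == h)
    return matches / len(judge_labels)
```

Tracking `agreement_rate` over the periodic human-labelled subset gives a trend line for the judge itself, which is what reveals a judge that has quietly changed behaviour.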
Dashboard architecture
The pattern that has worked best separates evaluation into two layers.
First layer: shadow evaluation, running continuously over a sample of real queries. A small percentage of traffic passes through automated LLM-as-judge evaluations and the results are aggregated. It gives a near-real-time signal, useful for catching sharp degradations after a deployment.
Second layer: regression evaluation, run over a curated question set after every significant system change and at least weekly. The curated set covers critical use cases with verified answers, which enables ground-truth metrics. It gives a more rigorous signal, useful for auditing longer-term trends.
The dashboard combines both: near-real-time metrics go to Grafana with alerts configured on threshold bands; regression metrics feed weekly reports with version comparisons. If the first layer detects a drop, the second layer confirms it before the team is alerted.
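One way to pick the shadow-evaluation sample is deterministic hashing of a query identifier, so the same query is always in or out of the sample regardless of which replica handles it. The sketch below is an assumption about how this could be done; `in_shadow_sample`, the use of a query id, and the 5 % rate are illustrative choices only.

```python
import hashlib


def in_shadow_sample(query_id: str, sample_rate: float) -> bool:
    """Decide whether a query is routed to shadow evaluation.

    The decision depends only on the query id, so it is reproducible
    across processes and restarts."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to one of 10,000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sample_rate * 10_000


# Hypothetical usage: evaluate roughly 5 % of traffic.
sampled = [qid for qid in (f"query-{i}" for i in range(10_000))
           if in_shadow_sample(qid, 0.05)]
```

Hash-based sampling keeps the sample stable when the rate is unchanged, which makes before-and-after comparisons around a deployment easier to interpret than with independent random draws.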
Useful alert examples
- Retrieval precision drops below 60 % in a one-hour window → retriever problem, almost always caused by an index change or an embedding model update.
- Context faithfulness drops by 5 % or more after a deployment → the new generator is hallucinating more. Grounds for automatic rollback if the deployment setup permits it.
- Automated metrics diverge from periodic human evaluation → the judge is broken. This is more common than it looks after judge model updates.
- Question distribution shift detected by a simple classifier → users are asking different things, forcing a review of whether the system covers the new cases.
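The alert conditions above can be expressed as small predicate functions before they are wired into any alerting tool. This is a sketch under stated assumptions: scores are on a 0 to 1 scale, the 5 % faithfulness drop is read as five absolute points, and question drift is measured as the total variation distance between category labels produced by a classifier. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Sequence


def precision_alert(window_scores: Sequence[float], threshold: float = 0.60) -> bool:
    """Fire when mean retrieval precision over the window is below the threshold."""
    return bool(window_scores) and mean(window_scores) < threshold


def faithfulness_drop_alert(before: float, after: float, max_drop: float = 0.05) -> bool:
    """Fire when faithfulness fell by max_drop or more (absolute points, assumed)."""
    return (before - after) >= max_drop


def category_shift(baseline_labels: Sequence[str], current_labels: Sequence[str]) -> float:
    """Total variation distance between two question-category distributions.

    0.0 means identical distributions, 1.0 means no overlap at all."""
    if not baseline_labels or not current_labels:
        raise ValueError("both label lists must be non-empty")
    base = Counter(baseline_labels)
    cur = Counter(current_labels)
    categories = set(base) | set(cur)
    return 0.5 * sum(
        abs(base[c] / len(baseline_labels) - cur[c] / len(current_labels))
        for c in categories
    )
```

The threshold on `category_shift` is left to the operator; it depends on how noisy the classifier is and how many queries fall into each window.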
What is not worth wiring up
Some metrics from papers yield little production return:
- BLEU and ROUGE look at lexical overlap and are practically noise for free-form generative answers.
- Lexical diversity metrics on output don’t correlate with perceived quality.
- Measuring generator perplexity over its own answers is a closed loop.
It is also worth avoiding dashboard bloat. A panel with fifteen metrics no one reviews doesn’t help. Better to have four or five well-chosen ones with clear alerts.
My read
Setting up continuous evaluation on a RAG demands an investment that teams often postpone because the system appears to work. That postponement carries a hidden cost: quality incidents in RAG are slow, cumulative and hard to attribute to a specific change when they finally surface.
The reasonable investment is modest. With a framework like Ragas[1] or DeepEval[2] and a couple of engineering weeks, you can set up a usable pipeline. The expensive part is not technical but organisational: convincing the team that maintaining a curated evaluation set is ongoing work, not a one-off effort. Without that discipline, the set goes stale and the dashboard loses its value within months.
My recommendation to anyone operating a RAG without continuous evaluation is to start from the simplest part: a traffic sample, three well-chosen metrics, a Grafana panel with threshold alerts. That alone catches 80 % of degradations. Finer metrics and regression evaluation are added later, once the investment is clearly paying off. The same observability discipline applies to the continuous eBPF profiling pattern described elsewhere on this site.