Production-grade agent evaluations: the framework that works

April 22, 2026

Figure: ROC curve, representing the model-evaluation and performance-comparison work that underlies production agent evaluation systems.

Table of contents

Key takeaways
Why "works on my machine" no longer cuts it
The golden dataset as foundation
LLM-as-judge for qualitative metrics
Deterministic metrics for verifiable properties
CI regression
Production observability
What to avoid
A practical starting point
Conclusion

Updated: 2026-07-12

The 2024 narrative was optimistic: a capable model plus a thoughtful prompt was all you needed to build a useful agent. The 2026 reality, after hundreds of teams have shipped agents to real users, is that what separates an agent that works from one that drifts is almost always the least glamorous piece of the stack: evaluations. Not models, not prompts, not tools: measurement. See also: AI agent incidents: recovery runbooks that work.

This piece captures the pattern that has converged across teams keeping agents in production with reasonable reliability. It’s not academic theory; it’s the set of decisions you see repeated when you compare notes with engineers who have been minding the same system for eighteen months.

Key takeaways

The golden dataset is the foundation: 30 to 200 cases, with 60% normal usage, 30% edge cases, and 10% adversarial.
The LLM judge must not be the same model being evaluated; the judge’s prompt must ask for rubric evaluation, not a global score.
Deterministic metrics (JSON validation, URL verification, hard-failure counts) must run as a filter before the judge is invoked.
The regression threshold that blocks merge: 2% is noise, 5% is signal, 10% is an incident.
The minimum viable setup can be built in a week and starts paying back from the first incident avoided.

Why "works on my machine" no longer cuts it

During the first adoption wave, many teams leaned on manual testing: someone on the product team read the answers, agreed that they "sounded right", and pushed to production. That works while the state space is small. It stops working as soon as three things happen at once:

The model updates.
Query distribution shifts.
The team grows beyond the point where anyone holds the full context in their head.

The symptoms are always the same: silent regressions after a model change, prompts that looked equivalent but degrade quality in specific cases, and new types of input that trigger failures nobody anticipated. This is the point where lacking evaluations stops being manageable debt and becomes the primary bottleneck. For broader context on failure modes, see lessons from agents in production.

The golden dataset as foundation

The indispensable component is a representative, small, carefully curated test set. The initial temptation is to capture thousands of real inputs and fire them at the model, but teams that succeed converge on deliberately compact datasets:

Between 30 and 200 examples, each with its expected answer and the rationale for being there.
Typical split:
60% happy-path cases representing normal usage.
30% edge cases where the agent tends to get confused.
10% adversarial cases designed to break it on purpose.

Curation matters because the dataset must cover the failure modes your system has historically exhibited. When someone reports a bug, the correct response is to fix it and also add the case to the golden dataset with a clear annotation: "this case failed in version X because the model ignored constraint Y". Six months in, the dataset is a living record of every mistake the team has learned to avoid.

A frequent error is including only examples where the correct answer is unique and clear. Agents rarely operate in that regime. Most real tasks admit several valid answers, and the dataset must reflect that. Overly deterministic datasets overfit to one style and miss subtle quality degradations.
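A golden case as described above could be stored as a small record. The sketch below is one possible shape, not a standard schema: the field names, the category labels, and the 10-point tolerance are all assumptions made for illustration. It keeps several acceptable answers per case and reports how far the dataset sits from the 60/30/10 split.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical schema: field names and category labels are illustrative.
@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    category: str                        # "happy_path", "edge" or "adversarial"
    user_input: str
    acceptable_answers: tuple[str, ...]  # several valid answers, not a single one
    rationale: str                       # why the case is in the dataset

# Target split taken from the text: 60% / 30% / 10%.
TARGET_SPLIT = {"happy_path": 0.60, "edge": 0.30, "adversarial": 0.10}

def split_report(cases: list[GoldenCase]) -> dict[str, float]:
    """Return the actual share of each category in the dataset."""
    if not cases:
        raise ValueError("the golden dataset is empty")
    counts = Counter(case.category for case in cases)
    return {cat: counts.get(cat, 0) / len(cases) for cat in TARGET_SPLIT}

def split_drift(cases: list[GoldenCase], tolerance: float = 0.10) -> list[str]:
    """List categories whose share deviates from the target by more than `tolerance`."""
    actual = split_report(cases)
    return [
        cat for cat, target in TARGET_SPLIT.items()
        if abs(actual[cat] - target) > tolerance
    ]
```

Running `split_drift` whenever a bug report adds a case is a cheap way to notice that the dataset has tilted toward one category.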

LLM-as-judge for qualitative metrics

For the metrics you can’t check programmatically (coherence, helpfulness, style adherence, absence of fabricated information), the mature pattern is to use a model as judge. Two non-negotiable rules:

The judge must NOT be the same model being evaluated.
The judge’s prompt must be extremely specific about criteria.

The format that generates the fewest false positives is to ask the judge not for a global score but for a rubric evaluation: "rate these six dimensions, each on a 1–5 scale, and justify each with one sentence". Asking the judge for a holistic score produces inconsistent results across runs; asking it to rate independent dimensions produces results that correlate reasonably with human judgment.
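A rubric request of this kind can be sketched as a prompt builder plus a strict parser for the judge's reply. Everything here is an assumption for illustration: the four dimension names are taken from the qualities listed earlier and stand in for a full six-dimension rubric, the JSON reply format is one possible convention, and the call to the judge model itself is left out.

```python
import json

# Hypothetical rubric: dimension names are illustrative, not a fixed standard.
RUBRIC = ("coherence", "helpfulness", "style_adherence", "no_fabrication")

def build_judge_prompt(user_input: str, agent_answer: str) -> str:
    """Ask for one 1-5 score per dimension instead of a single global score."""
    dimensions = "\n".join(f"- {name}" for name in RUBRIC)
    return (
        "You are grading an AI agent's answer.\n"
        "Rate each dimension below on a 1-5 scale and justify each with one sentence.\n"
        "Do not give an overall score.\n\n"
        f"Dimensions:\n{dimensions}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Agent answer:\n{agent_answer}\n\n"
        "Reply with JSON only, one entry per dimension: "
        '{"<dimension>": {"score": <1-5>, "why": "<one sentence>"}}'
    )

def parse_judge_reply(reply: str) -> dict[str, int]:
    """Validate the judge's JSON reply and return one integer score per dimension."""
    data = json.loads(reply)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    scores = {}
    for name in RUBRIC:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing dimension: {name}")
        score = entry.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {name}: {score!r}")
        scores[name] = score
    return scores
```

Rejecting malformed replies outright, rather than guessing a score, keeps judge formatting failures visible instead of folding them into the metric.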

A worthwhile refinement: calibrate the judge against a subset of cases rated by humans. If the judge gives 4.5 where humans give 3, the judge is inflating and the prompt needs recalibrating. This check doesn’t need to happen often (quarterly, or whenever the judge model changes), but it’s the only mechanism that tells you whether the number you read reflects something real.
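The calibration check can be reduced to two numbers over the human-rated subset: the signed bias (is the judge inflating?) and the mean absolute gap (does it disagree even when errors cancel out?). This is a minimal sketch; the function names and the 0.5-point tolerance are assumptions, not values from any particular team.

```python
from statistics import mean

def judge_bias(judge_scores: list[float], human_scores: list[float]) -> float:
    """Mean of (judge - human); a positive value means the judge inflates."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    return mean([j - h for j, h in zip(judge_scores, human_scores)])

def mean_abs_gap(judge_scores: list[float], human_scores: list[float]) -> float:
    """Mean of |judge - human|; catches disagreement that cancels out in the bias."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    return mean([abs(j - h) for j, h in zip(judge_scores, human_scores)])

def needs_recalibration(
    judge_scores: list[float],
    human_scores: list[float],
    max_gap: float = 0.5,  # hypothetical tolerance on a 1-5 scale
) -> bool:
    """Flag the judge prompt for rework when it drifts too far from human ratings."""
    return mean_abs_gap(judge_scores, human_scores) > max_gap
```

With the example from the text, a judge at 4.5 against humans at 3 gives a bias of 1.5, well past any reasonable tolerance.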

Deterministic metrics for verifiable properties

Anything you can measure without a model, measure without a model:

If the agent produces JSON → validate the schema and count parse failures as hard errors.
If it calls tools → check that calls execute without exceptions.
If outputs contain URLs → verify they resolve with HTTP 200.
If there are length constraints → measure them.

These metrics are cheap, fast, and deterministic, and they catch regressions an LLM-as-judge would miss because a judge tends to focus on the narrative quality of the answer.
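The checks above could look like the following with only the standard library. It is a sketch under assumptions: the required-keys test is a stand-in for full schema validation (a schema library could replace it), the URL regex is deliberately simple, and `url_resolves` needs network access and uses HEAD requests, which some servers reject. Each check returns a list of failure messages, so an empty list means it passed.

```python
import json
import re
import urllib.error
import urllib.request

# Deliberately simple pattern; good enough for a sketch, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")

def check_json(output: str, required_keys: set[str]) -> list[str]:
    """Parse failures and missing keys both count as hard errors."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    return [f"missing key: {key}" for key in sorted(required_keys - set(data))]

def check_length(output: str, max_chars: int) -> list[str]:
    """Enforce a length constraint expressed in characters."""
    if len(output) <= max_chars:
        return []
    return [f"output has {len(output)} chars, limit is {max_chars}"]

def extract_urls(output: str) -> list[str]:
    """Find URLs in the output, dropping trailing sentence punctuation."""
    return [url.rstrip(".,;:") for url in URL_PATTERN.findall(output)]

def url_resolves(url: str, timeout: float = 5.0) -> bool:
    """True if a HEAD request returns HTTP 200. Requires network access."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError, ValueError):
        return False
```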

The working combination is a pyramid:

Filter first with deterministic metrics (any case failing here is a hard failure that blocks integration).
Run the judge over cases that passed the hard filters.
Only then aggregate to a quality score per rubric.
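The three steps can be sketched as a single function. The signatures are hypothetical: a hard check takes the agent output and returns failure messages, and the judge takes the input and output and returns rubric scores. The point it illustrates is the ordering: a case that fails a deterministic check never reaches the judge.

```python
from collections.abc import Callable
from statistics import mean

# Hypothetical signatures for illustration.
HardCheck = Callable[[str], list[str]]        # agent output -> failure messages
Judge = Callable[[str, str], dict[str, int]]  # (input, output) -> rubric scores

def evaluate(
    cases: list[tuple[str, str]],  # (user_input, agent_output) pairs
    hard_checks: list[HardCheck],
    judge: Judge,
) -> dict:
    """Filter with deterministic checks, judge the survivors, aggregate per rubric."""
    hard_failures: dict[int, list[str]] = {}
    per_dimension: dict[str, list[int]] = {}
    for index, (user_input, output) in enumerate(cases):
        failures = [msg for check in hard_checks for msg in check(output)]
        if failures:
            hard_failures[index] = failures  # hard failure: skip the judge
            continue
        for dimension, score in judge(user_input, output).items():
            per_dimension.setdefault(dimension, []).append(score)
    return {
        "hard_failures": hard_failures,
        "blocked": bool(hard_failures),  # any hard failure blocks integration
        "rubric_means": {dim: mean(scores) for dim, scores in per_dimension.items()},
    }
```

Reporting a mean per rubric dimension, rather than one blended number, keeps it visible which quality dropped.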

CI regression

Evaluations without automated regression are nostalgic exercises. The consolidated pattern is running the suite on every change that touches the prompt, the model, the tools, or the agent’s chain, and blocking merge if the aggregate metric drops beyond a threshold:

2% drop → typical noise between runs.
5% drop → regression signal worth investigating.
10% drop → incident.
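These bands can be turned into a small classifier for the CI job. Two assumptions are made here that the text does not settle: the drop is measured relative to the baseline score, and anything between the named points falls into the band below it (so a 3% drop is still treated as noise). Both are worth adjusting to how noisy your own suite is.

```python
def relative_drop(baseline: float, current: float) -> float:
    """Percentage drop of the aggregate metric against the baseline (0 if it improved)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # Rounding avoids floating-point artefacts right at a band boundary.
    return round(max(0.0, (baseline - current) / baseline * 100), 6)

def classify_drop(drop_pct: float) -> str:
    """Map a drop to the bands from the text: noise, investigate, incident."""
    if drop_pct >= 10.0:
        return "incident"
    if drop_pct >= 5.0:
        return "investigate"
    return "noise"

def blocks_merge(baseline: float, current: float) -> bool:
    """Assumed policy: anything above the noise band blocks the merge."""
    return classify_drop(relative_drop(baseline, current)) != "noise"
```

For example, a rubric mean falling from 4.0 to 3.75 is a 6.25% drop, which lands in the "investigate" band and blocks the merge under this policy.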

For high-change-volume teams, the suite splits into three tiers:

Smoke test (~10 critical cases): runs on every push, feedback in under a minute.
Nightly full suite: takes 10 to 20 minutes over the full golden dataset.
Cross-model comparison: triggered when considering a model switch, comparing the new candidate against current on every dimension.

Production observability

The golden dataset covers cases you know about. Real users produce cases you don’t. The final piece is capturing production traces and promoting them to the dataset when you detect unexpected behaviour.

Tools like LangSmith, Braintrust, HoneyHive, or Promptfoo offer specific flows for this loop. The delicate part is what to capture. Ideally:

The user input.
The agent response.
Every tool call with its result.
The agent’s final state.

Capture less than that and you can’t reproduce the failure. With that trace level, any incident can be replayed offline, debugged calmly, and added to the golden dataset so it never happens again. This is the same level of traceability that enterprise agent governance requires from the audit plane.
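The four items above map naturally onto a trace record that can be written as one JSON line and read back for offline replay. This sketch does not use the API of any of the tools named earlier; the class names, fields, and the shape of the promoted case are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str

@dataclass
class Trace:
    user_input: str
    agent_response: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_state: dict = field(default_factory=dict)

def to_jsonl_line(trace: Trace) -> str:
    """Serialise a trace as a single JSON line for an append-only log."""
    return json.dumps(asdict(trace), ensure_ascii=False)

def from_jsonl_line(line: str) -> Trace:
    """Rebuild a trace from a logged line so the incident can be replayed offline."""
    raw = json.loads(line)
    raw["tool_calls"] = [ToolCall(**call) for call in raw["tool_calls"]]
    return Trace(**raw)

def promote_to_golden(trace: Trace, annotation: str) -> dict:
    """Turn an unexpected trace into a candidate golden case.

    The expected answer is left empty on purpose: a human has to decide
    what the agent should have done before the case joins the dataset.
    """
    return {
        "user_input": trace.user_input,
        "observed_response": trace.agent_response,
        "annotation": annotation,
        "expected_answer": None,
    }
```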

What to avoid

Some antipatterns are already folklore among experienced teams:

Evaluating only with the model-as-judge, without any deterministic anchor, because judge errors then propagate undetected.
Using a single aggregate metric as sole indicator, because it hides where the problem is.
Running evals only on the golden dataset without production observability, because real failures happen outside the dataset.
Accepting judge scores without calibrating against humans, because they may measure something other than what you think.
Blocking merges with rigid thresholds that sit within statistical noise, because the team ends up ignoring them and loses the benefit.

A practical starting point

If you have nothing, the minimum path is:

30 curated cases with expected answers.
A judge prompt with a five-dimension rubric.
Three or four domain-specific deterministic checks.
A script that runs everything in ninety seconds and outputs a number and a breakdown.
A CI job that runs the script on every PR that touches the agent.

That baseline can be built in a week and starts paying back from the first incident avoided.

From there, expansion is opportunistic: every new bug adds a case, every relevant dimension adds a metric, every model change triggers a full comparison. The system grows with the product.

Conclusion

Evaluations are the quiet engine of agents that work in production. They aren’t glamorous, don’t demo well, and don’t go viral. They are the infrastructure that lets everything else (models, prompts, tools, architectures) be compared, improved, and defended.

Teams that invest in evaluations six months before needing them are grateful they did; teams that build them after an incident learn the expensive way. The question is when, not if.
