Testing with AI: the determinism problem

November 11, 2025

Table of contents

Key takeaways
Why classical tests break
The three-layer strategy
Layer 1: code surrounding the model
Layer 2: tolerant snapshots
Layer 3: automated-judge evaluation
Offline evaluations: what actually catches regressions
How to think about the decision
Conclusion

Updated: 2026-05-03

Testing systems that contain language models breaks the fundamental axiom on which the entire discipline of automated testing was built: given the same input, the system produces the same output. Generative models don’t guarantee that property, even at zero temperature. After more than a year integrating language models into products with real users, this article collects the strategies that have worked and the ones that haven’t.

Key takeaways

Traditional testing with exact assertions breaks with language models: the same call can return slightly different sentences without anything changing in your code.
The three-layer strategy works: deterministic code around the model (classic unit tests), tolerant snapshots by semantic similarity, and LLM-as-judge evaluations for critical cases.
The offline evaluation suite measures drift, not absolute quality: the signal is a sudden drop in the score, not any specific number.
Evaluation cases must be adversarial, not representative: a suite that replicates average usage produces high, stable scores that detect nothing.
No test suite replaces watching real users use the product.

Why classical tests break

Traditional testing rests on exact assertions. With a language model, the same call can return slightly different sentences. If you write an exact assertion against the output, the test can fail the next day without anything having changed in your code.

The second problem is that models evolve underneath you. The model you call today is not strictly the same one you’ll call three months from now, even if its name doesn’t change. A test suite that works perfectly can start failing without you touching the code. In classical engineering that would be unacceptable; in AI systems it’s the normal state.

The third problem is external dependency. Every test run consumes tokens, costs money, and incurs network latency. A 500-test suite can take 15 minutes and cost several euros per run, which makes it unfeasible to run on every push in continuous integration. This forces you to separate fast deterministic tests from slow stochastic ones.
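One way to keep the two groups apart is an opt-in switch, sketched here with the standard library’s unittest. The environment variable RUN_LLM_TESTS and the helper truncate_context are hypothetical names chosen for illustration; the model-backed test is only a placeholder.

```python
import os
import unittest

# Hypothetical opt-in switch: model-backed tests run only when this is "1".
RUN_LLM_TESTS = os.environ.get("RUN_LLM_TESTS") == "1"


def truncate_context(text: str, max_chars: int) -> str:
    """Hypothetical deterministic helper: trim context before it reaches the model."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."


class FastDeterministicTests(unittest.TestCase):
    """No network and no tokens: cheap enough to run on every push."""

    def test_truncate_context(self):
        self.assertEqual(truncate_context("hello world", 5), "hello...")


@unittest.skipUnless(RUN_LLM_TESTS, "set RUN_LLM_TESTS=1 to run model-backed tests")
class SlowStochasticTests(unittest.TestCase):
    """Would call the real model; the body is a placeholder for illustration."""

    def test_summary_mentions_refund(self):
        self.skipTest("wire up a real model client here")


def run_suite() -> unittest.TestResult:
    """Load both groups; the stochastic group is skipped unless opted in."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(FastDeterministicTests))
    suite.addTests(loader.loadTestsFromTestCase(SlowStochasticTests))
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With this layout, every push runs only the fast group, while a scheduled or manually triggered job can set the variable to run the expensive one.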

The three-layer strategy

Layer 1: code surrounding the model

The first layer is the code surrounding the model, which must be fully deterministic and tested with classical assertions. All the logic for composing prompts, parsing responses, handling errors, normalizing formats, and deciding when to call the model belongs here. In the projects analyzed, this layer accounts for 70 to 80 percent of the tests and runs in under ten seconds.
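A minimal sketch of what this layer can look like, assuming a hypothetical support-ticket classifier. The names build_prompt, parse_response, and Ticket, and the JSON shape, are illustrative assumptions rather than anything from the projects described; the point is that none of this code calls the model, so exact assertions are safe.

```python
import json
from dataclasses import dataclass

FENCE = "`" * 3  # models often wrap JSON in a Markdown code fence


@dataclass(frozen=True)
class Ticket:
    category: str
    urgent: bool


def build_prompt(customer_message: str) -> str:
    """Compose the prompt deterministically: same input, same string."""
    cleaned = " ".join(customer_message.split())  # collapse all whitespace runs
    return (
        "Classify the support message below. "
        'Answer with JSON: {"category": str, "urgent": bool}.\n\n'
        f"Message: {cleaned}"
    )


def parse_response(raw: str) -> Ticket:
    """Parse the model's raw text, tolerating a code fence around the JSON."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text.strip("`")                    # drop the fence characters
        text = text.removeprefix("json").strip()  # drop an optional language tag
    data = json.loads(text)
    if not isinstance(data, dict) or not isinstance(data.get("urgent"), bool):
        raise ValueError("expected an object with a boolean 'urgent' field")
    category = str(data["category"]).strip().lower()  # normalize the format
    return Ticket(category=category, urgent=data["urgent"])
```

Tests for these functions feed in hand-written strings, including the malformed responses seen in production, and never touch the network.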

Layer 2: tolerant snapshots

The second layer is tolerant snapshot tests. The real model output is captured in a reference run and saved as a snapshot, and subsequent runs are compared against it with a semantic similarity metric. If the similarity between the current output and the snapshot, measured on sentence embeddings, exceeds 90 percent, the test passes.
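The comparison can be sketched as a cosine similarity check against a threshold. Note the loud assumption: toy_embed below is a word-count placeholder so the example runs offline, and it is not semantic. A real setup would pass a sentence-embedding model as the embed argument. The function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Mapping

Vector = Mapping[str, float]


def toy_embed(text: str) -> Vector:
    """Placeholder embedding (word counts). NOT semantic: swap in a real model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_snapshot(
    output: str,
    snapshot: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.90,
) -> bool:
    """Pass when the output is more than `threshold` similar to the snapshot."""
    return cosine_similarity(embed(output), embed(snapshot)) > threshold
```

The threshold is a tuning knob: set it too high and the test becomes an exact assertion again, too low and it stops noticing real changes.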

Layer 3: automated-judge evaluation

The third layer is LLM-as-judge: a model distinct from the one producing the answer evaluates whether the answer meets specific criteria. It is the most powerful layer but also the most expensive and slowest, so it is reserved for a curated set of critical cases.
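A rough sketch of the judge pattern, under several assumptions: the judge is injected as a plain callable that takes a prompt and returns raw text, the verdict format is a small JSON object, and fake_judge is a stand-in so the example runs without a model. None of these names come from a real library.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: in a real setup this wraps a client for a model
# different from the one under test.
JudgeFn = Callable[[str], str]


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str


def build_judge_prompt(question: str, answer: str, criteria: list[str]) -> str:
    """Ask the judge to grade one answer against explicit criteria."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    return (
        "You are grading an assistant's answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Criteria:\n{numbered}\n"
        'Reply with JSON only: {"passed": bool, "reason": str}. '
        "Set passed to true only if every criterion is met."
    )


def judge_answer(
    question: str, answer: str, criteria: list[str], judge: JudgeFn
) -> Verdict:
    """Run the judge and parse its verdict, failing closed on malformed output."""
    raw = judge(build_judge_prompt(question, answer, criteria))
    try:
        data = json.loads(raw)
        passed = data["passed"]
        if not isinstance(passed, bool):
            raise TypeError("'passed' must be a boolean")
        return Verdict(passed, str(data.get("reason", "")))
    except (json.JSONDecodeError, KeyError, TypeError):
        return Verdict(False, "judge returned an unparseable verdict")


def fake_judge(prompt: str) -> str:
    """Offline stand-in for a judge model: passes answers mentioning '14 days'."""
    ok = "14 days" in prompt
    reason = "states the return window" if ok else "return window missing"
    return json.dumps({"passed": ok, "reason": reason})
```

Failing closed on an unparseable verdict is a design choice: a judge that breaks should show up as a red test, not as a silent pass.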

Offline evaluations: what actually catches regressions

The offline evaluation suite is built from a set of use cases, each with an input and a set of acceptance criteria, and it runs every time you change the prompt, the model, or the temperature. The results are aggregated into a global score that is compared against the previous run.

The important point is that the suite measures drift, not absolute quality. What signals a problem is a sudden drop in the score, not its value. Evaluation cases must be adversarial, not representative: a suite built from the cases where the product is most likely to fail is the one that actually warns you when something changes.
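The run-over-run comparison can be sketched as follows. The global score here is simply the fraction of passing cases, and the max_drop tolerance of 0.05 is an assumed value for illustration; both the scoring rule and the threshold would need tuning for a real suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passed: bool


@dataclass(frozen=True)
class DriftReport:
    previous_score: float
    current_score: float
    drop: float
    regressed: bool
    newly_failing: list[str]


def global_score(results: list[CaseResult]) -> float:
    """Fraction of cases that met their acceptance criteria."""
    if not results:
        raise ValueError("empty evaluation run")
    return sum(r.passed for r in results) / len(results)


def detect_drift(
    previous: list[CaseResult],
    current: list[CaseResult],
    max_drop: float = 0.05,  # assumed tolerance, not a recommended value
) -> DriftReport:
    """Compare two runs; flag a regression when the score falls by more than max_drop."""
    prev_score = global_score(previous)
    cur_score = global_score(current)
    drop = prev_score - cur_score
    previously_passing = {r.case_id for r in previous if r.passed}
    newly_failing = sorted(
        r.case_id
        for r in current
        if not r.passed and r.case_id in previously_passing
    )
    return DriftReport(prev_score, cur_score, drop, drop > max_drop, newly_failing)
```

Listing the newly failing cases alongside the score makes the report actionable: the drop says something changed, and the case IDs say where to look.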

How to think about the decision

AI testing requires abandoning nostalgia for exact assertions and accepting a probabilistic model of quality. The practical consequence: teams integrating AI seriously must invest proportionally more in evaluation and observability than in classical coverage. Not less discipline; different discipline.

Conclusion

No test suite replaces watching real users use the product. Suites catch known regressions; users find the unknown ones. Detecting issues in production, through conversation logs and satisfaction surveys, must be a central piece of the strategy.

Written by

Javier Cañete

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.
