Testing with AI: the determinism problem
Actualizado: 2026-05-03
Testing systems that contain language models breaks the fundamental axiom on which the entire automated testing discipline was built: given the same input, the system produces the same output. Generative models don’t guarantee that property even with zero temperature. After more than a year integrating language models in products with real users, this article collects a set of strategies that work and another that doesn’t.
Key takeaways
- Traditional testing with exact assertions breaks with language models: the same call can return slightly different sentences without anything changing in your code.
- The three-layer strategy works: deterministic code around the model (classic unit tests), tolerant snapshots by semantic similarity, and LLM-as-judge evaluations for critical cases.
- The offline evaluation belt doesn’t measure absolute quality but drift: what detects problems is the score suddenly dropping, not a specific number.
- Evaluation cases must be adversarial, not representative: a belt replicating average usage gives high stable scores that detect nothing.
- No test belt replaces watching real users use the product.
Why classical tests break
Traditional testing rests on exact assertions. With a language model, the same call can return slightly different sentences. Writing an exact assertion against the output, the test fails the next day without anything having changed in your code.
The second wall is models evolving underneath you. The model you call today is not strictly the same one you’ll call three months from now, even if the name doesn’t change. A test belt working perfectly can start failing without you touching the code. In classical engineering that would be unacceptable; in AI systems it’s the normal state.
The third issue is external dependency. Each test run consumes tokens, costs money, has network latency. A 500-test belt can take 15 minutes and cost several euros per run, making it unfeasible on every continuous integration push. This forces segregating fast deterministic tests from slow stochastic tests.
The three-layer strategy
Layer 1: code surrounding the model
The first layer is the code surrounding the model, which must be fully deterministic and tested with classical assertions. All the logic of prompt composition, response parsing, error handling, format normalization, and deciding when to call the model belongs here. In the projects analyzed, this layer concentrates 70 to 80 percent of tests and runs in under ten seconds.
Layer 2: tolerant snapshots
The second layer is tolerant snapshot tests. The real model output is captured in a reference run, saved as a snapshot, and subsequent runs compare against it with a semantic similarity metric. If the current output resembles the snapshot at more than 90 percent by sentence embeddings, it passes.
Layer 3: automated-judge evaluation
The third layer is LLM-as-judge: a model distinct from the one producing the answer evaluates whether the answer meets specific criteria. The most powerful but also the most expensive and slowest layer, reserved for a curated set of critical cases.
Offline evaluations: what actually catches regressions
The offline evaluation belt is built from representative use cases, each with an input and a set of acceptance criteria, and it runs every time you change the prompt, model, or temperature. Results aggregate into a global score compared against the previous run.
The important point is the belt doesn’t measure absolute quality but drift. What detects problems is when it suddenly drops. Evaluation cases must be adversarial, not representative: a belt built with the cases where the product is most likely to fail actually warns when something changes.
How to think the decision
AI testing requires abandoning nostalgia for exact assertions and accepting a probabilistic model of quality. The practical consequence: teams integrating AI seriously must invest proportionally more in evaluation and observability than in classical coverage. Not less discipline; different discipline.
Conclusion
No test belt replaces watching real users use the product. Belts catch known regressions; users find the unknown ones. Detection of issues in production, with conversation logs and satisfaction surveys, must be a central piece of the strategy.