AI incident postmortems: what they have taught us

Portátil abierto sobre escritorio con luz tenue y terminal visible, escena representativa de los equipos que han escrito postmortems públicos durante 2025 y 2026 tras incidentes en sistemas con inteligencia artificial en producción; el conjunto de casos compartidos ha permitido identificar patrones comunes como fallos silenciosos de guardrails, deriva sin alarma de modelos en producción, dependencia oculta de proveedores externos y errores clásicos de operación agravados por la novedad de la capa de IA

Over the last year, more and more teams running AI in production have started publishing detailed incident postmortems. The practice, inherited from classic SRE culture, is consolidating in the new territory of LLMs and agent systems, and the 2025 harvest plus the first months of 2026 now allows for an ordered reading of the patterns that repeat. It’s worth distilling them because many teams are about to make the same mistakes others have documented in detail.

Pattern one: silently failing guardrails

The most repeated pattern in recent postmortems is silent guardrail failure. Teams that built systems with input validation, output filtering, prompt-injection detection, and tool-call containment discovered, sometimes months later, that one of those mechanisms had stopped working without generating an alert. The typical pattern is a base-model update by the provider slightly changing behavior, breaking the guardrail’s heuristic, and nobody finds out because the observable metric doesn’t visibly change.

A documented case involved a customer-support system where the output PII filter depended on regex detection assuming certain response formats. When the provider updated to a version slightly reformatting some outputs, the regex began letting sensitive information through for weeks. Final detection came via external audit, not the monitoring system, illustrating a deeper problem: the team had never validated the guardrail still worked after model changes.

The lesson drawn clearly is that guardrails need their own periodic synthetic tests verifying end-to-end function, not just component activation. A guardrail not receiving traffic it should reject isn’t proving it works; only that nobody’s testing it. Mature teams have introduced guardrail tests injecting known adversarial inputs at regular intervals and verifying the filter keeps blocking them.

Pattern two: silent model drift

Another recurring pattern is what several postmortems call silent drift. The base model, run by an external provider, changes behavior subtly without the team detecting until a sharp user reports it. Changes can be style, response length, tolerance of certain input types, or real capability on complex tasks. Rarely catastrophic, but they degrade system quality during the time they pass unnoticed.

A postmortem published by a medical-assistant company describes exactly this phenomenon. For roughly six weeks, response accuracy on a subset of clinical questions degraded gradually after a minor model update. The system kept working, users hadn’t changed use patterns, and basic availability metrics showed nothing. Only by introducing an automated evaluation question bank with expected answers did the team detect the regression and have evidence to demand greater transparency from the provider.

The lesson is that any production system with an external model needs its own evaluation bank run regularly, with expected answers validated by expert humans, and alerts when match rate falls below a threshold. Without this mechanism, the team depends on provider goodwill for notification of relevant changes, and experience shows notification is late and incomplete.

Pattern three: hidden vendor dependency

Several postmortems have highlighted how teams who believed they had manageable model-provider dependency discovered, during an incident, the dependency was much deeper than assumed. A particularly instructive case happened when a provider had a prolonged outage and a team that had designed failover to an alternative provider discovered their prompt system was so tuned to the specific behavior of the down model that the alternative model produced significantly worse results.

The dependency wasn’t only availability; it was exact behavior. Prompts, interaction patterns, format expectations, and evaluation criteria had evolved over months to fit a specific model’s peculiarities, and switching providers required non-trivial redesign of much of the system. Failover technically existed but was nearly useless.

The lesson here is twofold. First, regularly test failover with real traffic, not just verify that pipes are connected. Second, design the system to work reasonably well with at least two different models from the start, imposing less-specific-peculiarity prompt discipline. It’s extra work during development but avoids the trap of discovering hidden dependency at the worst possible moment.

Pattern four: classic operation worsened by novelty

A considerable share of recent postmortems aren’t actually AI incidents, but classic operation incidents manifesting in novel or delayed ways because the AI layer masked signals. Memory leaks in workers processing large inputs, database connection problems, expired certificates, poorly coordinated deployments, rotated secrets not updated. These problems have been known for decades, but the team didn’t anticipate them well in the context of AI-component systems because standard monitoring didn’t cover the specific patterns of these systems.

A concrete case involved a very short timeout configured on the external-model client. During normal conditions it worked, but during provider high-load moments, timeouts triggered retries saturating internal resources and generating an error cascade presenting to users as general system slowness. Nobody had reviewed AI-specific timeouts during usual capacity-planning exercises, because they didn’t fit the traditional-system mindset.

The lesson is that systems with AI components aren’t a separate category from reliability engineering; they’re systems requiring application of known practices to new components. Circuit breakers, exponential-backoff retries, specific external-API call monitoring, dashboards combining AI metrics with traditional system metrics. Nothing of this is conceptually new, but many teams are relearning these lessons in the AI context, and learning is expensive when done in production.

Pattern five: tool use with unexpected effects

Agent systems with tool use have produced their own particularly interesting postmortem category. The typical pattern is an agent that, under normal conditions, invokes external tools reasonably, but under certain adversarial or unexpected inputs enters loops, invokes tools with harmful parameters, or combines several tools in sequences with unforeseen side effects.

A case documented in a postmortem involved an agent with access to an email-sending API that, after a specific input, invoked the tool repeatedly sending hundreds of emails to users before the external rate limit broke the chain. The immediate lesson was that agents need their own rate limits per tool, not just at the global system, because external limits protect the provider but not necessarily the system’s users.

Another more general lesson is that every tool accessible to the agent needs its own explicit threat model. It’s not enough to think of the system as a whole; enumerate what each tool can do, its side effects, reversibility, and applicable controls. This exercise, done before production, prevents many surprises postmortems document over and over.

Practices mature teams are adopting

From the accumulation of postmortems, several concrete practices are consolidating among experienced operational teams. Continuous synthetic evaluations against reference banks, for both verifying base-model behavior and testing guardrails and tools. Clear separation between infrastructure, model, and product metrics, with dashboards correlating incidents across the three layers.

AI-specific incident-response procedures, with runbooks covering scenarios like evaluation-detected model drift, external-provider saturation, guardrail failure, anomalous agent behavior. These procedures are no longer improvised on the spot; mature companies have written them and drill them with periodic game days.

Provider contracts including clauses on communication of relevant changes, SLAs differentiated by use criticality, and access to model-behavior metrics. 2023 generic contracts didn’t have these terms; 2026 ones increasingly include them because teams have learned what to ask after others’ postmortems.

When it pays to publish postmortems

Not all companies can or want to publish postmortems, but those that do gain technical reputation and receive valuable feedback from other teams. It pays off when the incident has generalizable lessons helping the ecosystem, when the company can describe the problem without compromising security or confidential information, and when there’s internal will to learn in public, which not all cultures tolerate.

For those who don’t publish, at least writing rigorous internal postmortems is a discipline separating teams that learn from those that repeat. The classic incident-impact-root-cause-timeline-lessons-actions format remains useful in AI context, with added specific sections on model version, affected prompts, involved tools, and relevant evaluation data.

My reading

Postmortem culture in AI systems has matured noticeably between 2025 and 2026. It’s still uneven by team, and there are still companies publishing vague descriptions avoiding useful technical detail, but the overall level has risen. The engineering community now has a documented-case corpus sufficient to learn without having to make each mistake for the first time, and teams systematically reading these postmortems are clearly better prepared than those only stumbling on their own incidents.

The most important transversal lesson is that AI in production is reliability engineering applied to new components, not a completely different discipline. Principles of observability, fault containment, defense in depth, and systematic learning remain valid, and teams applying them rigorously have fewer incidents and better postmortems than those still treating AI as special territory where old rules don’t apply. No shortcuts: what worked for decades for critical systems keeps working, only now there are more components requiring specific attention.

Entradas relacionadas