AI agent incidents: recovery runbooks that work
Updated: 2026-05-03
Eighteen months into running agents in production, what distinguishes serious operation from amateur operation is not model sophistication but the ability to contain incidents in minutes instead of hours. Agents break in unexpected ways, and incident response has matured more slowly than the technology.
This runbook captures the pattern that works: the sequence of actions experienced teams execute when the alarm fires, in the order that most reduces damage.
Key takeaways
- Severity classification in the first two minutes determines the response: SEV-1 and SEV-2 trigger immediate paging; SEV-3 and SEV-4 don’t.
- The natural impulse is to check logs first. The correct pattern is isolate first, understand second.
- If the agent has persistent memory or a vector store, code rollback isn’t enough: purge the suspicious window.
- An honest communiqué admitting “there was unexpected behaviour, we’re investigating” ages better than a narrative later dismantled.
- An incident that doesn’t enter the evaluation battery as a regression test hasn’t been closed; only masked.
Severity classification in the first two minutes
A clear taxonomy avoids overreaction and underestimation:
- SEV-1: exposure of personal data or unauthorised actions with external consequences (payments, sent emails, altered records). Immediate paging, 24×7 response.
- SEV-2: quality degradation affecting all users without data exposure. Immediate paging, 24×7 response.
- SEV-3: regressions scoped to one input category. Handled during business hours.
- SEV-4: slow drift detected by metrics without user-visible impact. Enters backlog with weekly SLA.
This classification isn’t bureaucracy; it’s the difference between waking the team at 3 AM for something that can wait until Monday and not waking anyone when something critical is happening.
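The taxonomy above can be sketched as a small decision function. This is a minimal illustration, not a definitive implementation: the `IncidentSignal` fields and the helper names are hypothetical, and a real setup would map them to whatever the alerting pipeline actually emits.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class IncidentSignal:
    # Hypothetical field names, chosen to mirror the four definitions above.
    data_exposed: bool = False
    unauthorised_external_action: bool = False
    degradation_affects_all_users: bool = False
    regression_in_one_input_category: bool = False


def classify(signal: IncidentSignal) -> Severity:
    """Check the most severe conditions first so the worst match wins."""
    if signal.data_exposed or signal.unauthorised_external_action:
        return Severity.SEV1
    if signal.degradation_affects_all_users:
        return Severity.SEV2
    if signal.regression_in_one_input_category:
        return Severity.SEV3
    # Nothing user-visible: treated as slow drift for the backlog.
    return Severity.SEV4


def pages_immediately(severity: Severity) -> bool:
    """SEV-1 and SEV-2 page around the clock; SEV-3 and SEV-4 do not."""
    return severity <= Severity.SEV2
```

Checking conditions from most to least severe means an incident that matches several definitions is always classified at the highest applicable level.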
Isolate before investigating
The natural impulse is to read logs. The correct pattern is isolate first, understand second.
For SEV-1 and SEV-2, the first action is pulling the current version and rolling back to the last known stable version. Rollback doesn’t close the incident but freezes it: the system stops producing damage while the team analyses.
If the agent has persistent memory or a vector store, code rollback isn’t enough. Purge memory for the suspicious window. Tools like LangSmith[1] and Braintrust[2] allow marking traces as toxic. Without that, the nuclear option is wiping memory and accepting that the agent “forgets” the last few hours.
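Purging a suspicious window can be sketched independently of any particular store. The example below assumes each memory record carries a write timestamp; `MemoryRecord` and `purge_window` are hypothetical names, and a real vector store would need its own delete-by-filter mechanism instead of this in-memory split.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MemoryRecord:
    record_id: str
    written_at: datetime
    text: str


def purge_window(records, start, end):
    """Split records into (kept, quarantined).

    Records written in the half-open interval [start, end) are quarantined
    rather than silently dropped, so they stay available for the analysis.
    """
    kept, quarantined = [], []
    for record in records:
        if start <= record.written_at < end:
            quarantined.append(record)
        else:
            kept.append(record)
    return kept, quarantined
```

Returning the quarantined records instead of deleting them outright keeps the evidence for the later trace analysis while still removing it from what the agent can read.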
User communication without invention
Layered messaging that works:
- In-product banner: indicates degraded service with an estimated time to recovery.
- Specific notification: directly affected users receive individual communication.
- Internal channel: stakeholders (legal, support, sales) see the situation in the incident channel in real time.
What you don’t do:
- Promise technical details before having them.
- Blame the model provider without evidence.
- Minimise scope.
An honest communiqué admitting “there was unexpected behaviour, we’re investigating” always ages better than a narrative later dismantled.
Cold analysis: the trace is everything
With the agent isolated, analysis begins. The irreplaceable tool is the full trace:
- User input.
- Retrieved content (RAG, tools).
- Tool calls with their results.
- Agent internal state.
- Final response.
Teams without this level of tracing spend hours speculating; teams with it reproduce the failure in minutes.
The practice: tag each trace with a stable hash and archive it with a minimum retention of 30 days. When anomalies are reported, similar cases can be found by search instead of redoing the investigation from scratch.
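One way to get a stable hash is to serialise the trace canonically and hash the bytes. The sketch below assumes the trace is a JSON-serialisable dict; `trace_hash`, `archive_entry`, and the entry layout are hypothetical, and only the 30-day minimum comes from the text above. Python's built-in `hash()` is randomised per process for strings, so it is not suitable here.

```python
import hashlib
import json
from datetime import datetime, timedelta

MIN_RETENTION = timedelta(days=30)


def trace_hash(trace: dict) -> str:
    """SHA-256 of a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(
        trace, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def archive_entry(trace: dict, recorded_at: datetime) -> dict:
    """Wrap a trace with its id and the earliest date it may be deleted."""
    return {
        "trace_id": trace_hash(trace),
        "recorded_at": recorded_at.isoformat(),
        "delete_not_before": (recorded_at + MIN_RETENTION).isoformat(),
        "trace": trace,
    }
```

Because the hash depends only on content, two traces with identical input, retrieved content, tool calls, state, and response get the same id, which makes exact duplicates of a reported anomaly easy to find.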
Permanent mitigation: the case enters as regression
Once the failure mode is identified, the temptation is “fix and move on”. The maturity step is writing the case as a regression test before shipping the fix.
If the incident doesn’t enter the evaluation battery, it hasn’t been closed, only masked until next time.
The fix is validated against the test. If it passes, it is promoted; if it fails, the team iterates. This loop forces the team to actually understand the failure.
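The loop can be sketched as a gate over incident-derived cases. This is an illustration under assumptions: the agent is modelled as a plain callable from input text to response text, and `RegressionCase`, `run_case`, and `gate` are hypothetical names; a real evaluation battery would likely use richer checks than substring matching.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class RegressionCase:
    incident_id: str
    user_input: str
    # Strings that appeared in the bad response and must never appear again.
    must_not_contain: Tuple[str, ...]


def run_case(agent: Callable[[str], str], case: RegressionCase) -> bool:
    """Return True when the agent's response avoids every forbidden string."""
    response = agent(case.user_input).lower()
    return not any(bad.lower() in response for bad in case.must_not_contain)


def gate(agent: Callable[[str], str], cases: Iterable[RegressionCase]) -> List[str]:
    """Return the incident ids that still fail; an empty list means promotable."""
    return [case.incident_id for case in cases if not run_case(agent, case)]
```

Writing the case before the fix means the gate fails first against the broken version, which confirms the test actually captures the incident.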
Actionable post-mortem without scapegoats
Within 48 hours of close, write the post-mortem. Format:
- Minute-by-minute timeline.
- Failure mode in one sentence.
- Root cause (or “pending” explicitly, if still unclear).
- Corrective actions with owner and date.
- Preventive actions to avoid recurrence.
What you don’t do: blame individuals, omit uncomfortable parts, close without a follow-up date.
Six months on, a well-done post-mortem is an asset; a poorly done one is a document nobody revisits.
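The format above can be enforced with a simple completeness check before a post-mortem is allowed to close. This is a sketch only: `PostMortem`, `Action`, and `gaps` are hypothetical names, and the rules encode just the checklist in this section.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Action:
    description: str
    owner: str = ""
    due: Optional[date] = None


@dataclass
class PostMortem:
    timeline: List[str] = field(default_factory=list)
    failure_mode: str = ""
    root_cause: str = ""  # write "pending" explicitly if still unclear
    corrective: List[Action] = field(default_factory=list)
    preventive: List[Action] = field(default_factory=list)
    follow_up: Optional[date] = None


def gaps(pm: PostMortem) -> List[str]:
    """Return the reasons this post-mortem is not ready to close (empty if none)."""
    problems = []
    if not pm.timeline:
        problems.append("timeline is empty")
    if not pm.failure_mode.strip():
        problems.append("failure mode is missing")
    if not pm.root_cause.strip():
        problems.append('root cause is blank (write "pending" if unclear)')
    if not pm.corrective:
        problems.append("no corrective actions")
    for action in pm.corrective:
        if not action.owner or action.due is None:
            problems.append(f"corrective action lacks owner or date: {action.description}")
    if not pm.preventive:
        problems.append("no preventive actions")
    if pm.follow_up is None:
        problems.append("no follow-up date")
    return problems
```

The check deliberately has no field for naming a responsible individual: it records what failed and who owns the fix, not who is to blame.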
Conclusion
Operating agents in production is operations, in the classic SRE sense: metrics, runbooks, post-mortems, continuous improvement. The difference from traditional services is a broader and more unpredictable failure surface, which makes operational discipline more important, not less. Teams that internalise this sleep well.