Blameless Post-Mortems: How to Actually Improve

February 8, 2024

Table of contents

Key Takeaways
Why “Officially Blameless” Post-Mortems Fail
The Three Non-Negotiable Elements
Timeline: Details Matter
The “5 Whys” Problem
Blameless Interview Techniques
Action Items That Get Done
The Quarterly Meta-Post-Mortem
Small Incidents Too
Culture: The Unseen Factor
Conclusion

Updated: 2026-05-03

“Doing blameless post-mortems” has been the SRE mantra since Google popularised it. Everyone nods. Few organisations do it well. Between concept and practice there is a distance most teams fail to close — and the result is empty ritual: filled-out templates nobody reads, action items nobody executes, and the same incidents repeating.

This article is about how to run post-mortems that actually produce learning and change — with concrete techniques, not good intentions.

Key Takeaways

Performative blamelessness, where the document says “no blame” while the narrative makes clear who was at fault, destroys the mechanism.
The three non-negotiable elements are: factual timeline, honest contributor analysis, and action items with owner and deadline.
The “5 whys” assume linear causality; real incidents are multi-causal.
Error budget policy and post-mortems are complementary tools: both require that decisions change based on data.
The quarterly meta-post-mortem is what converts individual learning into systemic resilience.

Why “Officially Blameless” Post-Mortems Fail

Failure patterns are recognisable:

Disguised blame. The post-mortem says “no blame” but the narrative makes clear who was at fault. The implicated person knows.
Sanitised official narrative. What really happened is softened not to offend stakeholders. Real learning stays in private conversations.
Theatrical action items. “Add more monitoring” / “Improve documentation”. Vague, no owner, no deadline. Never done.
Not reading old post-mortems. Each incident seems new because nobody checks if it happened before.
Only the big ones go to post-mortem. You lose the learning from near-misses, which are more valuable because they are more frequent.

Recognising these patterns is step one.

The Three Non-Negotiable Elements

A functional post-mortem has:

A factual timeline: what happened, when, who saw it first, and what was done.
An honest analysis of contributors — not just “what failed” but “what made it easy or possible to fail”.
Specific action items with owner and deadline, tracked to completion in a centralised system.

Without all three, the document is worthless.

Timeline: Details Matter

The timeline must answer six questions:

T-0: what was happening before the incident. Often reveals a forgotten trigger (deploy, cron, config change).
T+n: the moment of initial failure. Who saw it first, and how (alert, customer, luck)?
Escalation: how it reached the right person. If it took too long, that is a process problem, not a personal one.
Mitigation: what worked, what didn’t, what was tried first.
Recovery: when service came back.
Follow-up: when officially closed.

Times in UTC or declared timezone. Better too many timestamps than too few.
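
As a rough illustration of the timestamp rule, the timeline can be kept as structured records rather than free text. The sketch below is only one possible shape: the `TimelineEvent` class, its field names, the phase labels, and the sample events are all assumptions made for this example, not a standard format. It rejects timestamps that carry no timezone and renders every entry in UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEvent:
    """One timeline entry. Field names are hypothetical, chosen for this sketch."""
    at: datetime   # must carry an explicit timezone
    phase: str     # e.g. "before", "failure", "escalation", "mitigation"
    actor: str     # a role, not a person's name
    note: str

    def __post_init__(self):
        # A naive datetime hides which timezone the author meant.
        if self.at.tzinfo is None or self.at.utcoffset() is None:
            raise ValueError(f"timestamp for {self.note!r} has no timezone")


def render(events):
    """Return the events in chronological order, one UTC-stamped line each."""
    lines = []
    for event in sorted(events, key=lambda e: e.at):
        stamp = event.at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
        lines.append(f"{stamp} [{event.phase}] {event.actor}: {event.note}")
    return lines


def minutes_between(first, second):
    """Elapsed minutes between two events, stated as a plain observation."""
    return (second.at - first.at).total_seconds() / 60


# Illustrative data only, not taken from a real incident.
cet = timezone(timedelta(hours=1))
deploy = TimelineEvent(datetime(2024, 2, 8, 10, 0, tzinfo=cet), "before",
                       "deployer", "config change rolled out")
alert = TimelineEvent(datetime(2024, 2, 8, 9, 12, tzinfo=timezone.utc), "failure",
                      "on-call", "latency alert fired")

for line in render([alert, deploy]):
    print(line)
print(minutes_between(deploy, alert))
```

Here the deploy is entered in local time and the alert in UTC, yet both land on a single ordered UTC timeline, and the gap between them comes out as a number rather than a judgement.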

The “5 Whys” Problem

The 5 Whys is a traditional technique: why did X fail? Because A. Why A? Because B. And so on until you reach the “root cause”. The problem is the assumption of linear, single causality. Real incidents are multi-causal: three services misalign at once, an alert existed but the pager was misconfigured, a runbook existed but wasn’t found.

The better alternative is to think in terms of contributors rather than a single root cause: a list of factors that individually would not have caused the incident, but together did. Each deserves its own action item.
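
One way to keep the "each contributor gets an action item" rule honest is to record contributors as data and check for gaps. This is a minimal sketch under assumptions: the `Contributor` class and the `without_action_item` helper are hypothetical names, and the action item text is invented. Only the two contributor descriptions echo the examples given here.

```python
from dataclasses import dataclass, field


@dataclass
class Contributor:
    """A contributing factor and the action items that address it."""
    description: str
    action_items: list = field(default_factory=list)


def without_action_item(contributors):
    """Return descriptions of contributors that nothing in the plan addresses."""
    return [c.description for c in contributors if not c.action_items]


# Contributor descriptions follow the text; the action item is invented.
contributors = [
    Contributor("alert existed but the pager was misconfigured",
                ["verify pager routing for the alert with a test page"]),
    Contributor("runbook existed but was not found"),
]

print(without_action_item(contributors))
```

A non-empty result does not mean the post-mortem is wrong; it means a contributor was consciously or accidentally left without follow-up, which is worth stating in the document.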

Blameless Interview Techniques

In the post-mortem meeting, the facilitator makes the difference. Five techniques that work:

Ask “what information did you have”, not “why did you make that decision”. The decision is explained by available information, not the reverse.
Chronology before interpretation. First agree what happened at each moment; then discuss why.
Refer to people by role, not name, in the document: “the on-call” instead of “John”. This keeps later readers from focusing on who was involved.
Normalise human errors. “Anyone in that position with that information would have done the same” — if true, say it explicitly.
Separate observations from judgements. “The alert took 7 minutes to fire” (observation) vs “the alert took too long” (judgement).

Action Items That Get Done

Badly defined action items are the post-mortem graveyard. Ones that get done have five characteristics:

Specific owner. A person, not a team. When a team owns it, nobody does it.
Bounded deadline. “Q1” is too vague. “By 28 February” lands.
Clear completion criteria. Not “improve monitoring” — “add alert X with threshold Y, reviewed by Z”.
Centralised tracking. A system (Jira, Linear, GitHub Issues) where all action items live, with monthly review.
Proportionality. Not 20 action items per incident. Prioritise 3-5 that actually move the needle.
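
The five characteristics above can be checked mechanically before a post-mortem is closed. The following is a sketch, not a prescribed tool: the `ActionItem` fields, the `problems` and `review` functions, the ticket reference `OPS-1`, the owner `alice`, and the team name `platform-team` are all hypothetical. The default limit of five items follows the 3-5 range suggested above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: str             # a person; known team names are flagged
    due: Optional[date]    # a concrete date, not "Q1"
    done_when: str         # completion criteria
    tracker_ref: str = ""  # reference in the shared tracker (hypothetical field)


def problems(item, team_names=frozenset()):
    """List which of the characteristics this action item is missing."""
    issues = []
    if not item.owner.strip():
        issues.append("no owner")
    elif item.owner in team_names:
        issues.append("owner is a team, not a person")
    if item.due is None:
        issues.append("no concrete deadline")
    if not item.done_when.strip():
        issues.append("no completion criteria")
    if not item.tracker_ref.strip():
        issues.append("not in the central tracker")
    return issues


def review(items, team_names=frozenset(), max_items=5):
    """Map each flawed action item to its issues; also flag an oversized list."""
    report = {}
    for item in items:
        issues = problems(item, team_names)
        if issues:
            report[item.title] = issues
    if len(items) > max_items:
        report["(post-mortem)"] = [f"{len(items)} action items, limit is {max_items}"]
    return report


# Invented examples: one well-formed item, one "theatrical" item.
items = [
    ActionItem("Add alert X", "alice", date(2024, 2, 28),
               "alert X with threshold Y, reviewed by Z", "OPS-1"),
    ActionItem("Improve documentation", "platform-team", None, ""),
]

print(review(items, team_names={"platform-team"}))
```

The check cannot tell whether the completion criteria are good, only whether they exist, so it complements the monthly review rather than replacing it.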

The Quarterly Meta-Post-Mortem

Reviewing the accumulated post-mortems every quarter is what separates learning organisations from those that repeat the same cycles. Key questions:

Which action items are still open and overdue?
Which patterns repeat across incidents?
Are there structural investments that would have prevented several incidents?
Are SLOs and error budgets informing those investment priorities?

Without meta-analysis the cycle is infinite. With it, focus shifts from firefighting to building resilience.
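
The first two questions lend themselves to a small script over whatever the tracker exports. The sketch below assumes a made-up data shape: the dictionary keys (`title`, `due`, `done`, `contributors`), the contributor tags, and the sample quarter are invented for illustration, and a real export from Jira, Linear, or GitHub Issues would need its own mapping.

```python
from collections import Counter
from datetime import date


def overdue(action_items, today):
    """Titles of open action items whose deadline has passed."""
    return [a["title"] for a in action_items
            if not a["done"] and a["due"] < today]


def recurring_contributors(postmortems, min_incidents=2):
    """Contributor tags seen in at least `min_incidents` post-mortems."""
    counts = Counter()
    for pm in postmortems:
        counts.update(set(pm["contributors"]))  # count each tag once per incident
    repeated = [(tag, n) for tag, n in counts.items() if n >= min_incidents]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(repeated, key=lambda pair: (-pair[1], pair[0]))


# Invented quarter of data, for illustration only.
postmortems = [
    {"id": "PM-1", "contributors": ["pager-misconfigured", "runbook-not-found"]},
    {"id": "PM-2", "contributors": ["runbook-not-found"]},
    {"id": "PM-3", "contributors": ["runbook-not-found", "pager-misconfigured",
                                    "unreviewed-config-change"]},
]
action_items = [
    {"title": "verify pager routing", "due": date(2024, 2, 28), "done": False},
    {"title": "index runbooks in search", "due": date(2024, 3, 15), "done": True},
]

print(overdue(action_items, today=date(2024, 4, 1)))
print(recurring_contributors(postmortems))
```

This only works if contributors are tagged consistently across post-mortems, which is itself a reason to agree on a small shared vocabulary of tags.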

Small Incidents Too

Most organisations only post-mortem SEV-1 incidents. But the cheapest learnings come from SEV-3 and near-misses — events where something serious almost happened but was caught in time.

A light model for small incidents: a five-line timeline, three contributors, one or two specific action items, and no formal meeting. Aggregated, the learning from many small incidents often exceeds that from a few large ones.

Culture: The Unseen Factor

Techniques help but culture decides. Signals of a healthy culture:

A junior engineer can say “I broke production” without fear.
Leaders openly discuss their own mistakes.
Lessons learned are celebrated, not hidden.
Resources for action items are a priority, not an afterthought.

Changing culture takes years. Starting with techniques is the way — over time culture adapts to well-executed rituals.

Conclusion

Blameless post-mortems are a powerful tool when done well. The difference between theatre and real learning is in the details: factual timeline, honest contributor analysis, action items with owner and deadline, continuous tracking, and quarterly pattern review. The bigger cost is in rigour, not technique.

Written by

Javier Cañete

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.