Blameless Post-Mortems: How to Actually Improve

Board with sticky notes and a flowchart representing post-incident analysis

“Doing blameless post-mortems” has been the SRE mantra since Google popularised it. Everyone nods. Few organisations do it well. Between concept and practice there’s a distance most teams fail to close — and the result is empty ritual: templates filled out that nobody reads, action items nobody executes, and the same incidents repeating.

This article is about how to do post-mortems that actually produce learning and change — with concrete techniques, not good intentions.

Why “Officially Blameless” Post-Mortems Fail

Failure patterns are recognisable:

  • Disguised blame. The post-mortem says “no blame” but the narrative makes clear who failed. The implicated person knows. The blamelessness is performative.
  • Sanitised official narrative. What really happened is softened so as not to offend stakeholders. Real learning stays in private conversations.
  • Theatrical action items. “Add more monitoring” / “Improve documentation”. Vague, no owner, no deadline. Never done.
  • Old post-mortems go unread. Each incident seems new because nobody checks whether it has happened before.
  • Only big incidents get a post-mortem. You lose the learning from near-misses, which is often more valuable.

Recognising these patterns is step one to doing it well.

The Three Non-Negotiable Elements

A decent post-mortem has:

  1. A factual timeline of what happened, when, who saw it first, what was done.
  2. An honest analysis of contributors — not just “what failed” but “what made it easy or possible to fail”.
  3. Specific action items with owner and deadline, tracked to completion.

Without all three, it’s a dead letter.

Timeline: Details Matter

The timeline must answer:

  • T-0: what was happening before the incident. Often reveals a forgotten trigger (deploy, cron, config change).
  • T+n: moment of initial failure. Who saw it first, and how (alert, customer, luck)?
  • Escalation: how it reached the right person. If it took too long, that’s a process problem, not a personal one.
  • Mitigation: what worked, what didn’t, what was tried first.
  • Recovery: when service came back.
  • Follow-up: when officially closed.

Give times in UTC or with an explicit time zone. Better too many timestamps than too few.
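As a rough illustration of why dense timestamps pay off, the sketch below computes the gap between consecutive timeline entries, which makes slow detection or slow escalation easy to spot. The event list, its labels, and the date are made up for the example; only the clock times echo the sample template later in this article.

```python
from datetime import datetime

# Illustrative events only: the date and the labels are assumptions for this sketch.
EVENTS = [
    ("2024-02-01T14:02:00+00:00", "trigger: deploy"),
    ("2024-02-01T14:15:00+00:00", "detection: first alert"),
    ("2024-02-01T14:17:00+00:00", "escalation: on-call paged"),
    ("2024-02-01T14:20:00+00:00", "mitigation: rollback started"),
    ("2024-02-01T14:35:00+00:00", "recovery: service restored"),
]


def gaps_in_minutes(events):
    """Return (from_label, to_label, minutes) for each pair of consecutive events."""
    # Parse the ISO 8601 stamps (with explicit UTC offset) and sort chronologically.
    parsed = sorted((datetime.fromisoformat(ts), label) for ts, label in events)
    return [
        (a_label, b_label, (b_ts - a_ts).total_seconds() / 60)
        for (a_ts, a_label), (b_ts, b_label) in zip(parsed, parsed[1:])
    ]


for start, end, minutes in gaps_in_minutes(EVENTS):
    print(f"{start} -> {end}: {minutes:.0f} min")
```

With these sample values, the longest gaps are trigger to detection (13 minutes) and mitigation to recovery (15 minutes), which is where the discussion would likely start.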

The “5 Whys” Problem

5-whys is traditional: why did X fail? Because A. Why A? Because B. And so on to “root cause”.

The problem: it assumes a single, linear chain of causes. Real incidents are multi-causal. Three services align badly at once. An alert existed but the pager was misconfigured. A runbook existed but wasn’t found.

A better alternative: think in contributors, not root cause. List the factors that individually wouldn’t have caused the incident but together did. Each deserves action.

Blameless Interview Techniques

In the post-mortem meeting, the facilitator makes the difference. Techniques that work:

  • Ask “what information did you have”, not “why did you make that decision”. The decision is explained by available information, not the other way around.
  • Chronology before interpretation. First agree what happened at each moment; then discuss why.
  • Refer to people by role, not name, in the document: “the on-call” instead of “John”. This keeps later readers from focusing on who.
  • Normalise human errors. “Anyone in that position with that information would have done the same” — if true, say it.
  • Separate observations from judgements. “The alert took 7 minutes to fire” (observation) vs “the alert took too long” (judgement).

A trained facilitator changes the room’s tone.

Action Items That Get Done

Badly defined action items are the post-mortem graveyard. Ones that get done have:

  • Specific owner. A person, not a team. If a team owns it, nobody does it.
  • Bounded deadline. “Q1” is too vague. “By 28 February” lands.
  • Clear completion criteria. Not “improve monitoring” — “add alert X with threshold Y, reviewed by Z”.
  • Centralised tracking. A system (Jira, Linear, GitHub Issues) where all action items live. Monthly review.
  • Proportionality. Not 20 action items per incident. Prioritise 3-5 that actually move the needle.

An action item without a deadline won’t get done. Neither will one with a deadline but no follow-up.
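The criteria above are easy to check mechanically before a post-mortem is published. The following is a minimal sketch, not a prescribed tool: the `ActionItem` fields and the `problems` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    # Hypothetical structure; adapt the field names to your tracker.
    title: str
    owner: Optional[str] = None       # a named person, not a team
    deadline: Optional[date] = None   # a concrete date, not "Q1"
    done_when: Optional[str] = None   # clear completion criteria


def problems(item: ActionItem) -> List[str]:
    """List the reasons an action item is unlikely to get done."""
    issues = []
    if not item.owner:
        issues.append("no owner")
    if item.deadline is None:
        issues.append("no deadline")
    if not item.done_when:
        issues.append("no completion criteria")
    return issues


vague = ActionItem(title="Improve monitoring")
print(problems(vague))
```

A check like this cannot tell whether an owner is really a person or whether the criteria are meaningful, so it complements the review rather than replacing it.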

The Archive: Search Before You Write

Before writing a new post-mortem, search the archive for similar ones. Common patterns:

  • Same cause, same mitigation, same theoretical lesson. Action items weren’t done.
  • Different cause but similar category. There’s a systemic issue.
  • Incident predicted in an old post-mortem but not prioritised.

A searchable wiki or folder with all post-mortems is gold. Greppable > pretty-formatted.
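If the archive is a folder of text files, searching it needs very little code. The sketch below assumes post-mortems are stored as `.md` files under one directory; the function name and the folder layout are assumptions, and plain `grep -ri` does the same job from a shell.

```python
from pathlib import Path


def search_postmortems(root, term):
    """Return (file name, line number, line) for case-insensitive matches in *.md files."""
    needle = term.lower()
    hits = []
    # rglob walks subfolders too, so per-year or per-team folders are covered.
    for path in sorted(Path(root).rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if needle in line.lower():
                hits.append((path.name, number, line.strip()))
    return hits


# Example call (the folder name is hypothetical):
# for name, number, line in search_postmortems("postmortems", "runbook"):
#     print(f"{name}:{number}: {line}")
```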

The Meta-Post-Mortem

Quarterly, look at accumulated post-mortems:

  • Which action items are still open and overdue?
  • Which patterns repeat?
  • What are we not learning?
  • Are there structural investments that would have prevented several incidents?

This meta-analysis is what converts organisations from “firefighting” to “building resilience”. Without it, the cycle is infinite.
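The first two questions can be answered from structured data if post-mortems are recorded consistently. This is a sketch under that assumption: the record layout, the IDs, the contributor labels, and the dates below are all invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical records; in practice these would come from your tracker or archive.
POSTMORTEMS = [
    {"id": "PM-1", "contributors": ["alert delay", "stale runbook"],
     "actions": [{"owner": "alice", "deadline": date(2024, 2, 22), "done": True}]},
    {"id": "PM-2", "contributors": ["stale runbook", "untested rollback"],
     "actions": [{"owner": "bob", "deadline": date(2024, 2, 15), "done": False}]},
    {"id": "PM-3", "contributors": ["alert delay", "stale runbook"],
     "actions": [{"owner": "carol", "deadline": date(2024, 4, 20), "done": False}]},
]


def overdue_actions(postmortems, today):
    """Return (post-mortem id, owner) for open action items past their deadline."""
    return [
        (pm["id"], action["owner"])
        for pm in postmortems
        for action in pm["actions"]
        if not action["done"] and action["deadline"] < today
    ]


def repeated_contributors(postmortems, min_count=2):
    """Return (contributor, count) for contributors seen in several post-mortems."""
    # set() ensures a contributor is counted once per post-mortem.
    counts = Counter(c for pm in postmortems for c in set(pm["contributors"]))
    repeated = [(name, n) for name, n in counts.items() if n >= min_count]
    # Sort by count descending, then by name, for a stable report.
    return sorted(repeated, key=lambda pair: (-pair[1], pair[0]))


print(overdue_actions(POSTMORTEMS, today=date(2024, 3, 31)))
print(repeated_contributors(POSTMORTEMS))
```

Counting only works if contributors are labelled consistently across documents, which is itself a reason to agree on a small shared vocabulary.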

Small Incidents Too

Most organisations only do post-mortems for SEV-1. But the cheapest learnings come from SEV-3 and near-misses — events where something serious almost happened but was caught in time.

Light model for small incidents:

  • 5-line timeline.
  • 3 contributors.
  • 1-2 specific action items.

No formal meeting: write it up in Slack or in an issue. In aggregate, the learning from many small incidents exceeds that from a few big ones.

Practical Template

Minimal viable template:

## Summary
[2 paragraphs: what happened, customer impact, duration]

## Timeline
- 14:02 UTC - Deploy service X version 1.4.2
- 14:15 UTC - First alert: latency p99 >2s
- 14:17 UTC - On-call receives page
- 14:20 UTC - Rollback initiated
- 14:35 UTC - Service restored

## Impact
- Customers affected: ~15% of traffic
- Impact duration: 20 min (first alert at 14:15 to restoration at 14:35 UTC)
- Revenue impact: $X (if applicable)

## Contributors
1. Change included N+1 query undetected in staging (low load).
2. Latency alert configured with 5-min delay.
3. Rollback runbook outdated.

## What Worked Well
- On-call reacted fast.
- Rollback procedure worked when executed.

## Action Items
- [ ] @alice: add slow-query detection to CI, deadline 22/02.
- [ ] @bob: reduce p99 alert delay from 5min to 2min, deadline 15/02.
- [ ] @carol: update rollback runbook with current steps, deadline 20/02.

Add links to the relevant code, dashboards, and tickets.
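If action items follow the checkbox format used in the template, they can be extracted and pushed into whatever tracker you use. The parser below is a sketch that assumes exactly that line format (`- [ ] @owner: task, deadline DD/MM.`); the function name and the regular expression are illustrative, not part of any standard.

```python
import re

# Matches lines such as: "- [ ] @alice: add slow-query detection to CI, deadline 22/02."
ACTION_RE = re.compile(
    r"^- \[(?P<state>[ xX])\] @(?P<owner>\w+): (?P<task>.+), "
    r"deadline (?P<deadline>\d{2}/\d{2})\.?$"
)


def parse_action_items(markdown):
    """Extract owner, task, deadline, and done state from checkbox lines."""
    items = []
    for line in markdown.splitlines():
        match = ACTION_RE.match(line.strip())
        if match:
            items.append({
                "owner": match["owner"],
                "task": match["task"],
                "deadline": match["deadline"],
                "done": match["state"].lower() == "x",
            })
    return items


sample = "- [ ] @alice: add slow-query detection to CI, deadline 22/02."
print(parse_action_items(sample))
```

Lines that do not match the format are silently skipped, so an action item without an owner or deadline simply never reaches the tracker. Whether to skip or to flag such lines is a design choice worth making explicitly.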

Culture: The Unseen Factor

Techniques help but culture decides. Healthy culture signals:

  • A junior engineer can say “I broke production” without fear.
  • Leaders openly discuss their own mistakes.
  • Lessons learned are celebrated, not hidden.
  • Resources for action items are a priority, not an afterthought.
  • Process health is measured (action items completed, time to publish the post-mortem).

Changing culture takes years. Start with techniques; over time, culture adapts.

Conclusion

Blameless post-mortems are a powerful tool when done well. The difference between theatre and real learning is in the details: a factual timeline, honest contributors, action items with an owner and a deadline, continuous tracking, and a quarterly pattern review. Establishing the practice takes time, but the return is real: organisations that do it well have fewer repeated incidents and teams that don’t burn out under unresolved problems. The bigger cost is in rigour, not technique.

Follow us on jacar.es for more on SRE, operations culture, and incident management.
