How to Write a Blameless Postmortem That People Actually Read

I’ve written and reviewed a few hundred postmortems over 25 years of on-call. Most of the bad ones share a tell: they read like a legal deposition. Lots of “the engineer failed to,” lots of passive voice hiding who did what, and a list of action items nobody ever closes. They get filed and forgotten.

A blameless postmortem is supposed to do one thing: make the next incident less likely or less painful. If yours isn’t doing that, the format is wrong, not your team.

What “blameless” actually means

Blameless does not mean “we don’t mention what people did.” That’s the misunderstanding that produces useless, vague documents. Blameless means you assume every person acted reasonably given the information they had at the time, and you investigate why the system let a reasonable action cause harm.

The deploy that took down checkout wasn’t caused by the engineer who clicked the button. It was caused by a pipeline that let an untested migration reach production, a rollback that took eleven minutes, and an alert that fired four minutes too late. Name the human action neutrally, then go after the system.

When you get this right, people volunteer information instead of hiding it. That’s the entire payoff: psychological safety produces better data, and better data produces better fixes.

The template I use

Keep it boring and consistent. Engineers should know exactly where to look for each piece.

1. Summary

Three sentences, written for someone who wasn’t there. What broke, who was affected, how long, and how it ended. No jargon.

2. Impact

Quantify it. “Checkout error rate peaked at 38% for 22 minutes; an estimated 4,100 customers saw a failed payment.” Vague impact (“some users were affected”) makes prioritization impossible later.

3. Timeline

The spine of the document. Timestamp every meaningful event: first symptom, alert firing, human detection, key diagnostic findings, the fix, and resolution. Mark the gap between when it broke and when you knew it broke — that gap is usually your biggest detection finding.
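To make that gap hard to miss, I sometimes compute it straight from the timeline rather than eyeballing it. A minimal sketch, assuming the timeline is kept as timestamp/label pairs; the timestamps and the labels (`first_symptom`, `human_detection`, and so on) are made up for illustration, not a standard:

```python
from datetime import datetime

# Hypothetical timeline entries: (ISO 8601 timestamp, event label).
TIMELINE = [
    ("2024-03-05T14:02:00", "first_symptom"),
    ("2024-03-05T14:06:00", "alert_fired"),
    ("2024-03-05T14:09:00", "human_detection"),
    ("2024-03-05T14:24:00", "resolved"),
]


def minutes_between(timeline, start_label, end_label):
    """Return the minutes elapsed between two labelled timeline events."""
    times = {label: datetime.fromisoformat(ts) for ts, label in timeline}
    return (times[end_label] - times[start_label]).total_seconds() / 60


# The gap between "it broke" and "we knew it broke".
detection_gap = minutes_between(TIMELINE, "first_symptom", "human_detection")
# Total duration of the incident, from first symptom to resolution.
time_to_resolve = minutes_between(TIMELINE, "first_symptom", "resolved")

print(f"Detection gap: {detection_gap:.0f} min")
print(f"Time to resolve: {time_to_resolve:.0f} min")
```

If the detection gap is a large share of the total, that usually points at a detect action item rather than a prevent one.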

4. Root cause(s)

Plural on purpose. Real incidents have a chain. Use the “five whys” but stop chasing once you reach something you can actually change.

5. What went well

Don’t skip this. If the rollback worked, say so — you want to protect what’s working when you start changing things.

6. What went wrong / where we got lucky

“Where we got lucky” is the most valuable section and the one people forget. Near-misses are free lessons.

7. Action items

The whole point. Covered below.
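
One way to keep the seven sections above boring and consistent is to generate the empty skeleton instead of copying an old document. A small sketch, assuming plain-text output; the function name `skeleton` and the hint wording are mine, so adapt them to whatever your team uses:

```python
# Section names follow the template above; the hints are short reminders.
SECTIONS = [
    ("Summary", "Three sentences for someone who wasn't there."),
    ("Impact", "Numbers: error rate, duration, customers affected."),
    ("Timeline", "Timestamp every meaningful event; mark the detection gap."),
    ("Root cause(s)", "The chain, down to something you can change."),
    ("What went well", "What to protect when you start changing things."),
    ("What went wrong / where we got lucky", "Include the near-misses."),
    ("Action items", "Specific, owned, tracked, classified."),
]


def skeleton(title):
    """Return an empty postmortem document with numbered section headings."""
    lines = [title, ""]
    for number, (name, hint) in enumerate(SECTIONS, start=1):
        lines += [f"{number}. {name}", "", f"[{hint}]", ""]
    return "\n".join(lines)


print(skeleton("Checkout outage postmortem"))
```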

Action items that actually get done

Most postmortem action items die because they’re vague, unowned, and unscheduled. Fix all three, then classify each item:

Specific: Not “improve monitoring.” Instead: “Add an alert on checkout p99 latency > 800ms for 2 min.”
Owned: A single named human, not a team.
Tracked: A real ticket with a due date, linked from the postmortem.
Classified: Tag each as prevent (stops recurrence), detect (finds it faster), or mitigate (reduces blast radius). A postmortem with only “prevent” items and no “detect” items means you’ll be just as blind next time.
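
The four checks above are mechanical enough to automate before the review. A rough sketch, assuming action items are available as simple records; the `ActionItem` fields, the ticket ID format, and the names are hypothetical and do not reflect any particular tracker’s API:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

CATEGORIES = {"prevent", "detect", "mitigate"}


@dataclass
class ActionItem:
    description: str
    owner: Optional[str]   # a single named person, not a team
    ticket: Optional[str]  # tracker ID (hypothetical format)
    due: Optional[date]
    category: str          # prevent, detect, or mitigate


def problems(items):
    """Return human-readable problems; an empty list means the items pass."""
    found = []
    for item in items:
        if not item.owner:
            found.append(f"no owner: {item.description}")
        if not item.ticket or item.due is None:
            found.append(f"not tracked: {item.description}")
        if item.category not in CATEGORIES:
            found.append(f"unknown category {item.category!r}: {item.description}")
    # A list with no detect items leaves you just as blind next time.
    if not any(item.category == "detect" for item in items):
        found.append("no 'detect' items")
    return found


items = [
    ActionItem("Block untested migrations in the deploy pipeline",
               "A. Rivera", "OPS-101", date(2024, 3, 19), "prevent"),
    ActionItem("Speed up rollback", None, None, None, "mitigate"),
]

for problem in problems(items):
    print(problem)
```

This cannot judge whether an item is specific enough; that part still needs a human reading the description.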

I keep a running rule: no postmortem is “done” until every action item is in the tracker. Writing them in a doc that nobody triages is theater.

Run the review as a conversation, not a verdict

Schedule the review within a few days while memory is fresh. Open by restating the blameless contract out loud — it sounds corny, but it resets the room. Walk the timeline together; that’s where you’ll discover the three things nobody mentioned in the heat of the incident.

Watch your language as facilitator. “Why did you restart the pod?” sounds like an accusation. “What did the dashboard show that made restarting look right?” gets you the actual reasoning. Same question, completely different data.

Where AI helps — and where it shouldn’t

The blank page is the enemy of the postmortem getting written at all. This is exactly where AI earns its place. Right after resolution, paste the incident-channel scrollback and your command history and ask for a first-draft timeline and summary. You’ll get a structured draft in seconds that you then correct and enrich.

A prompt I reach for:

“Here is the incident channel transcript and command history. Draft a blameless postmortem: summary, impact, chronological timeline with timestamps, candidate root causes, what went well, and proposed action items tagged prevent/detect/mitigate. Use neutral language and do not assign blame to individuals.”

Two cautions. First, AI flattens nuance — it will turn “we suspected the cache but weren’t sure” into a confident statement. You have to put the uncertainty back in, because the uncertainty is often the lesson. Second, scrub secrets and customer data before pasting anything.
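
For the scrubbing step, a first pass can be scripted. This is only a sketch under loud assumptions: the two patterns below are illustrative, they will miss secrets in formats they do not describe, and they will sometimes redact harmless text. Treat it as a pre-filter before a human read-through, not a guarantee.

```python
import re

# Illustrative patterns only. Add patterns for your own token formats,
# customer identifiers, and anything else specific to your systems.
PATTERNS = [
    # Email addresses.
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    # key=value or key: value pairs whose key looks like a credential.
    (
        re.compile(
            r"\b(password|passwd|token|secret|api[_-]?key)\b(\s*[=:]\s*)\S+",
            re.IGNORECASE,
        ),
        r"\1\2[REDACTED]",
    ),
]


def scrub(text):
    """Apply each redaction pattern in order and return the scrubbed text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


sample = "14:07 set API_KEY=abc123 then emailed jane.doe@example.com"
print(scrub(sample))
# prints: 14:07 set API_KEY=[REDACTED] then emailed [EMAIL]
```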

We keep a set of incident-response prompts tuned for this, and the Incident Response tool will turn a resolved-incident timeline into a structured postmortem draft you can edit.

The test of a good postmortem

Six months later, can a new hire read it and understand what happened, why, and what you changed? If yes, it worked. If it reads like someone defending themselves, it didn’t — and your next incident will teach you the same lesson at full price.

AI-generated postmortem drafts are a starting point. Always verify the timeline and root cause against your own records before publishing.
