Finding Systemic Themes Across Postmortems With AI

Our postmortem archive had two years of incidents in it, written by dozens of engineers, and I’d bet money that buried in there were five patterns explaining most of our pain. The trouble was that nobody had read all of it. You’d have to sit down with sixty documents, each written in a slightly different style, and hold all of them in your head at once to see the threads connecting them. No human does that on a Tuesday. So the patterns stayed invisible, and we kept fixing the same class of problem one incident at a time, indefinitely.

This is the central frustration of incident response done well: you can write excellent postmortems and still never improve, because each one optimizes locally. A postmortem makes you fix the specific bug that bit you. It doesn’t make you see that the same root weakness has bitten you eight different ways. Seeing that requires reading across the whole corpus, and reading across a corpus of unstructured documents is exactly the task that modern AI is unreasonably good at.

The local-fix trap

A single postmortem is a microscope. It zooms in on one incident, finds its contributing factors, and produces action items to prevent that exact failure. This is necessary and good. It’s also structurally incapable of catching systemic problems, because a systemic problem shows up as a minor footnote in many incidents rather than as the headline of any one.

Consider a flaky deployment process. In one postmortem it’s “deploy partially failed, rolled back.” In another it’s “config didn’t propagate to all nodes.” In a third it’s “a manual deploy step got skipped.” Three different headlines, three different local fixes, one underlying truth: your deployment process is unreliable. Read individually, you fix three symptoms. Read together, you see the disease. But “read together” is the hard part, and it’s the part that gets skipped.

Pointing AI at the corpus

Hand the model the corpus and ask it to do the wide read no human will. The task is clustering and pattern extraction across documents: which incidents share contributing factors, which services recur in timelines, which action items keep getting written against the same gap. The AI Incident Response Assistant is built for exactly this cross-document synthesis — turning a folder of inconsistent writeups into a ranked list of candidate themes a human can evaluate.

The framing matters enormously. Ask for patterns and evidence, not conclusions and fixes. My prompt: “Across these postmortems, identify recurring contributing factors. For each pattern, list the specific incidents that exhibit it and quote the relevant text. Rank by frequency. Do not propose solutions.” You want the model to be a research assistant that brings you grouped evidence, not an oracle that hands you answers you’ll be tempted to act on without thinking.
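
To make the framing concrete, here is a minimal sketch of how that prompt might be assembled from a folder of writeups. The function names, the `=== INCIDENT ... ===` delimiter, and the assumption that postmortems live as `.md` files are all mine, not part of any particular tool. It only builds the text; sending it to a model is left out, and a large archive may need to be split into batches to fit a model's input limit.

```python
from pathlib import Path

# The instruction text is the prompt quoted above, verbatim.
INSTRUCTIONS = (
    "Across these postmortems, identify recurring contributing factors. "
    "For each pattern, list the specific incidents that exhibit it and "
    "quote the relevant text. Rank by frequency. Do not propose solutions."
)


def load_postmortems(folder: str) -> dict[str, str]:
    """Read every .md file in a folder, keyed by file name without extension.

    Assumption: one postmortem per file, and the file name is the incident ID.
    """
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in Path(folder).glob("*.md")
    }


def build_prompt(postmortems: dict[str, str]) -> str:
    """Join the instructions and every postmortem into one prompt string.

    Each document is labelled with its incident ID so the model has
    something specific to cite.
    """
    sections = [INSTRUCTIONS]
    for incident_id in sorted(postmortems):
        body = postmortems[incident_id].strip()
        sections.append(f"=== INCIDENT {incident_id} ===\n{body}")
    return "\n\n".join(sections)
```

Labelling each document with its ID is the design choice that matters here: without a stable label in the prompt, the model has nothing to point at when it cites evidence.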

Pro Tip: Require a citation for every claimed pattern — the specific incident and the exact quote. This does two jobs: it lets you instantly verify a pattern is real rather than a hallucinated theme, and it gives you the evidence you’ll need to convince leadership that the pattern is worth funding a fix for. An uncited pattern is worthless; a cited one is a business case.
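
A cheap first pass on those citations can be automated before a human reads anything. The sketch below assumes you have already parsed the model's output into (incident ID, quote) pairs, which is a step I'm not showing, and the `Citation` type and function names are hypothetical. It checks only that each quote really appears in the incident it is attributed to.

```python
import re
from dataclasses import dataclass


@dataclass
class Citation:
    incident_id: str
    quote: str


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping doesn't cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_citations(citations, postmortems):
    """Return the citations whose quote does not appear in the cited incident.

    `postmortems` maps incident ID to the full text of that postmortem.
    A citation fails if the incident is unknown, the quote is empty,
    or the quote is not found in that incident's text.
    """
    missing = []
    for citation in citations:
        source = postmortems.get(citation.incident_id)
        quote = _normalize(citation.quote)
        if source is None or not quote or quote not in _normalize(source):
            missing.append(citation)
    return missing
```

This catches invented quotes and misattributed incidents. It does not tell you whether a real quote actually supports the pattern it was filed under, so it narrows the human read rather than replacing it.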

A theme worth a quarter of engineering time

Here’s what this looked like in practice. We ran our two-year archive through this analysis, and the top pattern by frequency wasn’t anything we’d have guessed: a third of our SEV2-and-above incidents had a timeline entry where someone was delayed because they couldn’t find or trust the relevant runbook. It was never the root cause — the root causes were varied and technical. But “responder lost time on bad documentation” was a contributing factor in a third of our serious incidents, and no single postmortem had ever flagged it as worth fixing because in each one it was just a footnote.

That pattern, with its citations, made the case for a real investment in runbook quality and discoverability — a project that touched dozens of future incidents instead of one. The AI found the thread by reading what no human had read in full. The humans made every decision that mattered: confirming the pattern was real by spot-checking the cited incidents, deciding it was worth a quarter of someone’s time, and designing what the actual fix looked like. AI for the read; humans for the judgment.

Verifying before you act

The danger with cross-corpus AI analysis is that a plausible-looking pattern can be coincidence or hallucination, and acting on a false pattern wastes real engineering time. So verification is non-negotiable and it’s a human job. For each top pattern, pull the cited incidents and read the actual text yourself. Does the pattern hold up, or did the model group three things that share surface words but not substance? Roughly one in four candidate patterns won’t survive this check, and killing the false ones is exactly the human contribution.

The patterns that do survive are gold, because they’re the ones backed by evidence across many independent incidents. Those are the systemic risks worth real investment. The model couldn’t have found them without reading everything, and you can’t trust them without verifying each one — both halves are necessary, and they belong to different parties.

The hard line on action

Everything in this workflow is analysis, and analysis is where AI belongs. The model reads, clusters, and cites. Humans verify, decide, and prioritize. And nothing in this loop goes anywhere near a production action — this is strategic reflection on past incidents, conducted well after the fact, specifically to find systemic fixes. If your corpus-analysis tooling can do anything other than produce a document, you’ve over-built it.

It’s also worth being explicit that the model doesn’t get to decide what’s worth fixing. Frequency is one input, but the cost of a pattern, the cost of fixing it, and the org’s current priorities are human judgments the model has no basis for. A pattern that shows up ten times but costs nothing might rank below one that shows up twice but threatens a key customer. The AI ranks by what it can count; humans rank by what actually matters. Synthesis from the machine, decisions from the people.
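
One way to keep that separation honest in tooling is to store the human judgment as its own field and sort on it, rather than on the count. This is a small sketch under my own assumptions: the `Theme` type, the `human_priority` field, and the theme names in the usage example are illustrative, not from our archive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Theme:
    name: str
    incident_count: int                    # what the model can count
    human_priority: Optional[int] = None   # 1 = most important; set by a person


def ranked_for_planning(themes):
    """Order themes by human priority, not by frequency.

    Themes nobody has judged yet go last, ordered by incident count,
    so they read as a review queue rather than as a recommendation.
    """
    judged = sorted(
        (t for t in themes if t.human_priority is not None),
        key=lambda t: t.human_priority,
    )
    unjudged = sorted(
        (t for t in themes if t.human_priority is None),
        key=lambda t: -t.incident_count,
    )
    return judged + unjudged


themes = [
    Theme("flaky deploys", 10),
    Theme("key customer exposure", 2, human_priority=1),
    Theme("stale runbooks", 7, human_priority=2),
]
for theme in ranked_for_planning(themes):
    print(theme.name, theme.incident_count, theme.human_priority)
```

In this example the theme seen twice outranks the one seen ten times, because a person said so, and the unjudged theme sits at the bottom until someone looks at it.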

Pro Tip: Run this analysis on a cadence — quarterly works well — and compare against the last run. Patterns that shrink tell you your investments are working; patterns that grow despite a fix tell you the fix didn’t address the real cause. The trend across runs is often more informative than any single run, and it’s the closest thing incident response has to a reliability scorecard.
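
The run-over-run comparison is simple arithmetic, sketched here with two assumptions of mine: each run covers only that quarter's incidents, and a human has already matched theme names across runs, since the model may label the same theme differently each time. The function names and the numbers in the usage example are illustrative only.

```python
def theme_rates(counts: dict[str, int], total_incidents: int) -> dict[str, float]:
    """Convert per-theme incident counts into a share of all incidents in the run.

    Shares are compared instead of raw counts so a busy quarter does not
    look like a worsening theme.
    """
    if total_incidents <= 0:
        raise ValueError("total_incidents must be positive")
    return {theme: n / total_incidents for theme, n in counts.items()}


def rate_changes(previous: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Return current share minus previous share for every theme in either run.

    A theme missing from one run is treated as a share of 0.0 in that run.
    """
    themes = sorted(set(previous) | set(current))
    return {t: current.get(t, 0.0) - previous.get(t, 0.0) for t in themes}


last_quarter = theme_rates({"stale runbooks": 6, "flaky deploys": 3}, total_incidents=12)
this_quarter = theme_rates({"stale runbooks": 2, "flaky deploys": 4}, total_incidents=8)
for theme, change in rate_changes(last_quarter, this_quarter).items():
    print(f"{theme}: {change:+.2f}")
```

A negative change means the theme touched a smaller share of incidents than last time. With only a handful of incidents per quarter these shares are noisy, so treat a single swing as a prompt to look closer, not as proof.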

From archive to advantage

A postmortem archive that nobody reads in aggregate is a cost — all that careful writing, locked in documents that only ever fixed one bug each. Run it through cross-corpus synthesis and it becomes your single best source of strategic reliability work, telling you not what broke once but what keeps breaking and why.

The work of incident response isn’t finished when the postmortem is written. It’s finished when the systemic pattern behind a dozen postmortems gets fixed, and finding that pattern is exactly the kind of wide, tedious read you should hand to a model so the humans have energy for the deciding.