Prioritizing Reliability Work Across a Quarter of Postmortems With AI

Reliability roadmaps are usually written by the most recent fire. The big outage happens, leadership asks “how do we make sure this never happens again,” the action items from that one postmortem jump the queue, and the team spends the next month building defenses against a failure that, statistically, may never recur in that exact form. Meanwhile a quieter gap that contributed to six smaller incidents over the quarter never gets funded, because no single one of those was loud enough to demand a meeting.

This is recency bias running your reliability strategy, and it’s why prioritizing one incident at a time reliably produces the wrong answer. The right unit of analysis isn’t the incident — it’s the portfolio of incidents. The question isn’t “what should we fix about the last outage,” it’s “across everything that broke this quarter, where does a small amount of engineering time buy the most reliability.” Answering that means synthesizing across a stack of postmortems, which is exactly the kind of cross-document work nobody has time to do by hand.

The same gap wears different clothes

The first problem with cross-incident analysis is that the same underlying gap shows up in different postmortems described in completely different words. One writeup says “config drift between staging and prod.” Another says “the deploy didn’t validate the manifest.” A third says “staging didn’t catch it because it doesn’t mirror production.” These might be three faces of one gap — environment parity — and you only realize that gap is worth serious investment when you see all three together and the count clicks: this didn’t contribute to one incident, it contributed to three.

Doing that clustering manually across twenty postmortems is brutal, and it’s the first thing AI earns its place on. It can normalize the contributing factors and action items across a whole set, then group the ones that are the same problem in different language.

Across this set of recent postmortems, prioritize reliability work.

1. Extract a normalized list of contributing factors and action
   items from all incidents. Cluster items that are the same
   underlying gap described in different words.
2. For each cluster: how many incidents it touched, the combined
   severity/duration it contributed to, and whether it's a
   prevent / detect / mitigate gap.
3. Rank clusters by leverage-per-effort. Effort is coarse S/M/L —
   mark every effort guess as [ESTIMATE].
4. RECENCY CHECK: which visible recent item ranks LOWER than it
   feels like it should, and which quiet recurring gap ranks
   HIGHER?
5. Recommend a "fund these N" shortlist and name what you'd
   consciously NOT fund this quarter.

Rules: Cluster on systems and gaps, NEVER on which team caused
more incidents. Mark every effort/leverage value as [ESTIMATE].

Postmortems: <paste summaries>
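
Once the model has returned its normalized labels, the per-cluster incident count from step 2 can be tallied deterministically instead of trusted to the model's arithmetic. The following is a minimal Python sketch under stated assumptions: the incident IDs, the fourth factor, and the "missing alerting" label are invented for illustration, and the tuple layout is hypothetical rather than any particular tool's output format.

```python
from collections import defaultdict

# Hypothetical labeled output: (incident_id, original wording, normalized gap).
# Incident IDs are invented; the gap label would come from the model
# and be checked by a human reviewer.
labeled_factors = [
    ("INC-101", "config drift between staging and prod", "environment parity"),
    ("INC-107", "the deploy didn't validate the manifest", "environment parity"),
    ("INC-112", "staging doesn't mirror production", "environment parity"),
    ("INC-112", "no alert fired until customers reported it", "missing alerting"),
]

def incidents_per_gap(factors):
    """Count the distinct incidents each normalized gap touched."""
    touched = defaultdict(set)
    for incident_id, _wording, gap in factors:
        # A set keeps one incident from being counted twice for the same gap.
        touched[gap].add(incident_id)
    return {gap: len(ids) for gap, ids in touched.items()}

counts = incidents_per_gap(labeled_factors)
for gap, n in counts.items():
    print(f"{gap}: touched {n} incident(s)")
```

Counting distinct incidents rather than mentions is a deliberate choice here: a postmortem that describes the same gap three ways should still add one to the tally, not three.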

The recency check is the whole point

Step 4 is the part that fights the bias directly, and it’s where the analysis stops being a spreadsheet and starts being a decision. The model is asked to name, explicitly, which loud recent item ranks lower than it feels like it should — and which boring recurring gap ranks higher. That framing forces the comparison everyone avoids, because saying “the thing that just embarrassed us in front of the customer is not actually our highest-leverage fix” is politically uncomfortable, and a model will say it plainly.

After the human prunes and adds real effort estimates, the output reads like a portfolio view:

Top clusters by leverage (effort is [ESTIMATE], confirm before funding):

Underlying gap	Incidents touched	Combined impact	Type	Effort
Environment parity (staging ≠ prod)	3	2 Sev2, 1 Sev3	Detect/Prevent	M
Missing burn-rate alerting on tier-1 SLOs	4	contributed to slow detection in 4	Detect	S
The recent payments outage’s specific bug	1	1 Sev1	Prevent	M

Recency note: the payments outage feels like the priority because it was the loudest, but it was a one-off bug in a path we’ve now patched. The burn-rate alerting gap (Effort: S) contributed to slow detection across four incidents and is cheaper to fix. It should rank above the payments-specific hardening.

Not funding this quarter: broad service-mesh migration suggested in two postmortems — high effort, and the parity and alerting fixes address most of the same pain at a fraction of the cost.

That recency note is the sentence a roadmap meeting needs and rarely produces on its own.
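
The comparison behind a recency note can also be sketched in code, as a cross-check on what the model reports. This Python example is a rough illustration under loud assumptions: leverage is approximated as incidents touched (severity and duration are ignored), the S/M/L cost weights are arbitrary, and `felt_rank` is a hypothetical input capturing how urgent each item feels in the room. Because the proxy ignores severity, its ordering is illustrative and need not match a human-pruned table.

```python
from dataclasses import dataclass

# Assumed relative costs for the coarse S/M/L sizes; these are guesses.
EFFORT_COST = {"S": 1, "M": 2, "L": 4}

@dataclass
class Cluster:
    gap: str
    incidents_touched: int
    effort: str      # "S", "M", or "L"; still an [ESTIMATE]
    felt_rank: int   # 1 = feels most urgent (hypothetical, supplied by a human)

def leverage_per_effort(cluster):
    # Crude proxy: incidents touched divided by the assumed effort cost.
    return cluster.incidents_touched / EFFORT_COST[cluster.effort]

def recency_check(clusters):
    """Flag clusters whose computed rank differs from how urgent they feel."""
    ranked = sorted(clusters, key=leverage_per_effort, reverse=True)
    notes = {}
    for rank, cluster in enumerate(ranked, start=1):
        if rank > cluster.felt_rank:
            notes[cluster.gap] = "ranks LOWER than it feels"
        elif rank < cluster.felt_rank:
            notes[cluster.gap] = "ranks HIGHER than it feels"
    return notes

clusters = [
    Cluster("payments outage bug", 1, "M", felt_rank=1),
    Cluster("environment parity", 3, "M", felt_rank=2),
    Cluster("burn-rate alerting", 4, "S", felt_rank=3),
]
notes = recency_check(clusters)
for gap, note in notes.items():
    print(f"{gap}: {note}")
```

With these inputs the loud one-off is flagged as ranking lower than it feels and the quiet alerting gap as ranking higher, which is the same disagreement the recency note puts into words.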

Coarse estimates, marked as such

Everything about leverage and effort here is an estimate, and the prompt insists on labeling it. This matters because a clustered, ranked, confident-looking table is very easy to mistake for a funded plan. It isn’t. The incident counts are real; the effort sizing is the model’s guess until an engineer who knows the systems replaces S/M/L with something grounded. Marking every estimate keeps a prioritization input from being quoted as a prioritization decision in the next planning doc.
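
One way to keep the label from falling off when the table is copied elsewhere is to make it part of how the value is rendered. The sketch below is a hypothetical Python structure, not something the prompt produces: the `confirmed_by` field and the `render` method are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EffortSize:
    size: str                           # "S", "M", or "L"
    confirmed_by: Optional[str] = None  # hypothetical: who grounded the guess

    def render(self) -> str:
        # Until an engineer signs off, the size always carries its label.
        if self.confirmed_by is None:
            return f"{self.size} [ESTIMATE]"
        return f"{self.size} (confirmed by {self.confirmed_by})"

model_guess = EffortSize("S")
grounded = EffortSize("M", confirmed_by="service owner")
print(model_guess.render())
print(grounded.render())
```

The point of the design is that an unconfirmed size cannot be printed without its marker, so a quoted table cell stays visibly a guess.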

The one line you must not cross

There’s a specific way this analysis goes toxic, and it’s worth stating bluntly: the moment you cluster on who caused incidents instead of what gaps caused them, you’ve built a tool for ranking teams by blame. “The payments team had four incidents this quarter” is not analysis. It’s a leaderboard and a blameless-culture violation at organizational scale, and it makes every future postmortem less honest because nobody wants to feed the leaderboard. The prompt explicitly instructs the model to cluster on systems and gaps only, never on team incident counts, and that constraint is non-negotiable. It matters most here because cross-incident analysis is the level at which blame is most tempting and most damaging.
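
A prompt rule can be backed by a mechanical guard: remove attribution fields from the summaries before they ever reach the model. The following Python sketch assumes postmortem summaries are available as dictionaries, and the field names (`team`, `owner`, `author`, `on_call`) are hypothetical; substitute whatever your postmortem template uses. It only strips structured fields, so team names mentioned in free text still need a human eye.

```python
# Hypothetical field names; adjust to match your postmortem template.
ATTRIBUTION_FIELDS = {"team", "owner", "author", "on_call"}

def strip_attribution(summary: dict) -> dict:
    """Return a copy of a postmortem summary without team or person fields."""
    return {
        key: value
        for key, value in summary.items()
        if key.lower() not in ATTRIBUTION_FIELDS
    }

summary = {
    "incident": "INC-112",  # invented ID for illustration
    "Team": "payments",
    "contributing_factors": ["staging doesn't mirror production"],
}
cleaned = strip_attribution(summary)
print(cleaned)
```

Stripping the fields does not replace the rule in the prompt, but it removes the easiest path to a per-team count.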

The human owns the funding call

The model produces a ranked, de-biased, clustered view across the whole quarter. It does not decide the budget. A person who understands the real effort, the team’s capacity, and the things that aren’t in any postmortem — the architectural debt everyone knows about, the migration already half-done — makes the actual call about what gets funded and what gets deferred. The value of the AI pass is that the call is now informed by all the data instead of by whichever incident shouted loudest last week.

The portfolio-prioritization prompt is in the prompts library, and it pairs with the recurring-pattern work that surfaces systemic themes in the first place. Together they turn a pile of postmortems from a history archive into a reliability roadmap.

Stop letting the last fire write your quarter. Look across all of them, find the gap that’s worth more than any single incident, and fund that.
