By James Joyner IV · 11 min read

The MTTR Retro: Using AI to Find and Kill Recurring Time-Sinks

Your MTTR is dragged down by the same time-sinks every incident. Use AI to mine your retros, find the recurring drains, and kill them — verify-first, not vibes.

  • #reduce-mttr
  • #mttr
  • #ai
  • #retrospective
  • #sre

I read twelve of our postmortems back-to-back one slow afternoon and noticed something that no single retro had ever surfaced: in nine of them, somebody had lost five-plus minutes finding the right dashboard. Nine separate retros, each with an action item like “improve dashboard discoverability,” none of them ever done, because in isolation it always looked like a papercut. Stacked up, it was the single biggest recurring drain on our MTTR: at least 45 minutes in total, invisible because it arrived in five-minute slices spread across nine incidents. The time-sinks killing your MTTR usually aren’t dramatic. They’re the same boring five minutes, lost over and over, in a place no individual retro is shaped to catch.

The per-incident retro is the wrong altitude for finding these. You need to look across incidents, and reading a quarter of postmortems looking for patterns is exactly the tedious, high-volume reading AI is built for.

Per-incident retros miss the cross-incident pattern

A good single-incident retro asks “what slowed this one down” and produces a fix for this failure mode. That’s valuable, but it’s blind to repetition. The dashboard-hunting that cost five minutes here and five minutes there never rises to a “root cause” in any one retro, so it never gets the priority it deserves. The recurring time-sinks hide in the aggregate — the phase of the MTTR funnel that consistently runs long across many incidents — and you only see them by reading many retros at once with a pattern-finding lens.

That’s the AI job: ingest a batch of postmortems, attribute lost time to phases, and surface the drains that repeat. Not to decide what to fix — to show you where the minutes actually went, across the whole quarter, so you stop guessing.

Mine the retros for repeated drains

First, get your retros into one place in a parseable shape:

# Collect a quarter of postmortems and their timeline/lessons sections
grep -l "pubDate.*2026-0[4-6]" postmortems/*.md \
  | xargs grep -E -A6 "^## (Timeline|What slowed us down)" > /tmp/retro-batch.txt
# Rough count of how often each theme shows up as a complaint
# (lowercase the matches first so "Dashboard" and "dashboard" tally together)
grep -ioE "dashboard|runbook|ownership|access|escalat|alert noise" /tmp/retro-batch.txt \
  | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn

That grep is a blunt first pass — it’ll show you the obvious words but miss the paraphrases. Hand the batch to a model for the real pattern-finding:

You are analyzing a quarter of incident retrospectives to find recurring MTTR time-sinks. From these postmortems, identify the cross-incident patterns: which phase of incident response (detect, acknowledge, triage, diagnose, mitigate, verify) repeatedly ran long, and the specific recurring cause. For each pattern: how many of the incidents it appears in, the total estimated time lost, representative quotes (with which incident each is from), and the single highest-leverage change that would address it. Rank by total time lost. Attribute every claim to specific incidents. Do not invent numbers — if time lost isn’t stated, say “not quantified” and count occurrences instead.

The output reframes a quarter of papercuts as a priority list:

  1. Dashboard discovery (diagnose phase) — appears in 9/12 incidents. “Spent 6 min finding the right Grafana board” (INC-204), “no link to the relevant dashboard from the alert” (INC-211). Highest leverage: link the dashboard directly from the alert. Est. recurring: large, ~5 min/incident.
  2. Ownership ambiguity (triage phase) — 5/12. “Took 8 min to find who owned the service after the reorg” (INC-198). Highest leverage: a maintained ownership registry the triage step reads.
  3. Premature resolution (verify phase) — 3/12 reopened. “Declared resolved, re-paged 40 min later” (INC-220). Highest leverage: a standard post-remediation verification checklist.

Now the dashboard problem isn’t a papercut in one retro — it’s the #1 drain across nine, with the quotes to prove it and an obvious fix. That’s a case you can actually take to a planning meeting.
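
The ranking rule from the prompt (sum the stated minutes per pattern, fall back to counting occurrences when time isn't quantified) is simple enough to recompute yourself once the model's attributions are in a table. Here is a minimal sketch, assuming a hypothetical tab-separated file with one row per incident in which a pattern appears: pattern name, incident ID, and minutes lost, or `-` when the retro gave no number. The file layout and the function name `rank_drains` are illustrative assumptions, not part of any tool mentioned here.

```shell
#!/usr/bin/env bash
# rank_drains FILE
# FILE is a hypothetical TSV: pattern <TAB> incident <TAB> minutes ("-" = not quantified).
# Prints: pattern <TAB> occurrences <TAB> total ("N min" or "not quantified"),
# sorted by total stated minutes, then by occurrence count.
rank_drains() {
  awk -F'\t' '
    {
      count[$1]++                        # occurrences per pattern
      if ($3 ~ /^[0-9]+$/) {             # only sum minutes the retro actually stated
        total[$1] += $3
        quantified[$1] = 1
      }
    }
    END {
      for (p in count) {
        shown = (p in quantified) ? total[p] " min" : "not quantified"
        # the first two fields are sort keys and are cut off below
        printf "%d\t%d\t%s\t%d\t%s\n", total[p], count[p], p, count[p], shown
      }
    }
  ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2nr | cut -f3-
}
```

Call it as `rank_drains attributions.tsv`. When only some rows for a pattern carry a number, the total is a lower bound, so read it next to the occurrence count rather than on its own.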

Verify the pattern before you spend the engineering

The trap with cross-incident analysis is over-trusting the synthesis. The model said “9/12,” so I check it: an action item that commits engineering effort has to rest on a real count, not a hallucinated tally. The attribution requirement is what makes verification possible:

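# Confirm the model's "9/12" claim. First: do the incidents it cited for the
# dashboard pattern actually mention dashboards?
for f in INC-204 INC-211; do
  grep -li "dashboard" postmortems/*"$f"* 2>/dev/null || echo "NOT FOUND: $f"
done
# Then: how many of the quarter's postmortems mention it at all? (keyword floor for the 9)
grep -l "pubDate.*2026-0[4-6]" postmortems/*.md | xargs grep -li "dashboard" | wc -l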

If the cited incidents don’t actually contain the complaint, the pattern is inflated and I downgrade it. Verify-first applies to analysis just as much as to live incidents — a confidently-wrong retro finding sends a quarter of engineering effort at the wrong target. The model proposes the ranked drains; I confirm the counts before anything goes on a roadmap.
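
The same check generalizes to every pattern the model reports, not just the top one. Here is a sketch, assuming the cited claims have been copied into a hypothetical two-column file (incident ID, then a keyword or phrase the retro should contain) and that postmortem filenames include the incident ID. The claims file and the function name `verify_citations` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# verify_citations CLAIMS_FILE POSTMORTEM_DIR
# CLAIMS_FILE is a hypothetical TSV: incident-id <TAB> phrase the retro should contain.
# Prints OK or MISSING per claim and returns non-zero if any citation fails.
verify_citations() {
  local claims=$1 dir=$2 missing=0 id phrase
  while IFS="$(printf '\t')" read -r id phrase || [ -n "$id" ]; do
    [ -n "$id" ] || continue                      # skip blank lines
    # -F: literal phrase, -i: ignore case, -q: exit status only
    if grep -Fiq -- "$phrase" "$dir"/*"$id"* 2>/dev/null; then
      printf 'OK\t%s\t%s\n' "$id" "$phrase"
    else
      printf 'MISSING\t%s\t%s\n' "$id" "$phrase"
      missing=$((missing + 1))
    fi
  done < "$claims"
  [ "$missing" -eq 0 ]
}
```

A literal phrase match is deliberately strict: a MISSING line means "go read that retro," not "the model lied." Paraphrased complaints fail the grep and still need a human look before the pattern is downgraded.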

Turn verified drains into killed time-sinks

A pattern you’ve confirmed is only worth finding if you actually kill it. The fix for the meta-problem — action items that never get done — is to make each verified drain a concrete, owned change, not another “improve discoverability”:

  • Dashboard discovery → add the dashboard URL to the alert annotation. One alerting-rule change (annotations are set in the Prometheus rule file and passed through by Alertmanager), one owner, done this sprint.
    annotations:
      dashboard: "https://grafana/d/payments?var-region={{ $labels.region }}"
  • Ownership ambiguity → a checked-in ownership.yaml the triage step reads, with a CI check that fails if a service has no owner.
  • Premature resolution → a standard verification checklist generated per incident before close.

Each one targets a phase that the retro analysis proved runs long, so the engineering goes where the minutes actually are instead of where the loudest recent incident was.
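
The ownership CI check can be very small. Here is a sketch, assuming a hypothetical flat `ownership.yaml` with one `service: owner` pair per line. The article doesn't specify the file's layout, so treat this format as an assumption. It is a line-based check, not a YAML parser; a nested file needs a real parser instead.

```shell
#!/usr/bin/env bash
# check_owners FILE
# Assumes a hypothetical flat ownership.yaml: one "service: owner" pair per line.
# Prints each service that has no owner and returns non-zero if any were found,
# so it can be dropped into a CI step as-is.
check_owners() {
  awk '
    /^[ \t]*(#|$)/ { next }                       # skip comments and blank lines
    {
      line = $0
      sub(/[ \t]+#.*$/, "", line)                 # drop trailing comments
      i = index(line, ":")
      if (i == 0) next                            # not a key/value line
      service = substr(line, 1, i - 1)
      owner = substr(line, i + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", service)
      gsub(/^[ \t]+|[ \t]+$/, "", owner)
      if (owner == "" || owner == "null" || owner == "~" || owner == "\"\"") {
        print "no owner: " service
        bad = 1
      }
    }
    END { exit bad }
  ' "$1"
}
```

Running `check_owners ownership.yaml` in the pipeline fails the build the moment someone adds a service without an owner, which is what keeps the registry maintained after the next reorg.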

A few practices that make the MTTR retro pay off:

  • Run it on a batch, not one incident. The whole value is cross-incident. A single retro can’t see repetition by definition.
  • Rank by total time lost, not drama. The scariest incident isn’t usually your biggest MTTR drain; the boring repeated five minutes is.
  • Close the loop. Re-run the analysis next quarter and confirm the killed drains actually dropped out of the top of the list. If “dashboard discovery” is still #1, the fix didn’t land.
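
Closing the loop can be a mechanical comparison between quarters. Here is a sketch, assuming two hypothetical text files: the drains you shipped fixes for last quarter, and this quarter's ranked list, each with one drain name per line. Names must match exactly, so normalize the model's labels (for example, always "dashboard discovery") before comparing.

```shell
#!/usr/bin/env bash
# still_draining FIXED_FILE RANKED_FILE [TOP_N]
# FIXED_FILE: drains fixed last quarter, one name per line (hypothetical file).
# RANKED_FILE: this quarter's ranked drains, biggest first (hypothetical file).
# Prints each fixed drain that still appears in the top N (default 3)
# and returns non-zero if any do.
still_draining() {
  local fixed=$1 ranked=$2 top=${3:-3} hits
  # -F literal, -x whole-line match, -f read the names to look for from FIXED_FILE
  hits=$(head -n "$top" "$ranked" | grep -Fxf "$fixed" || true)
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits" | sed "s/^/still in top $top: /"
    return 1
  fi
}
```

A non-zero exit here is the signal from the bullet above: the fix shipped, but the minutes are still being lost.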

You can prototype this on the free incident assistant: paste several retro summaries and ask for the cross-incident pattern analysis, then verify the counts before you act. The prompt library has the retro-mining prompt with the attribution and no-invented-numbers rules built in.

Your MTTR isn’t dragged down by exotic once-a-year disasters. It’s dragged down by the same five minutes, lost in the same phase, across a dozen incidents nobody connected. AI reads the whole pile and surfaces the pattern; you verify the counts and ship the owned fix. Kill the recurring drains and your MTTR drops not because any one incident went better, but because all of them stopped wasting the same time.
