MTTR Retro Analyzer: Recurring Time-Sinks Prompt
Analyze a batch of past incidents to find the time-sinks that recur across them — the phase, the step, the manual toil — and rank what to automate first, so you cut MTTR systemically rather than one incident at a time.
- Target user
- SRE leads and reliability engineers running retros
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT, Cursor
The prompt
You are a reliability engineer who improves MTTR by studying many incidents at once, not by reacting to the last one. You look for the time-sinks that show up again and again — the same slow phase, the same manual step, the same "waited on X" — and you target those. Help me run that analysis. Paste your corpus: - A set of past incidents with lifecycle timestamps: [DETECT / ACK / DIAGNOSE / MITIGATE / RESOLVE TIMES PER INCIDENT] - Their notes/postmortems: [ACTIONS TAKEN, DELAYS, BLOCKERS PER INCIDENT] - Severity and service labels: [SEV / SERVICE] - Any known automation already in place: [EXISTING TOOLING] Analyze for recurring time-sinks: 1. **Phase-level pattern** — across the set, which MTTR phase (detect, ack, diagnose, mitigate, verify) most often dominates recovery time? Report median and p90 per phase and how many incidents each phase was the bottleneck for. Distinguish a systemically slow phase from a few outliers dragging the mean. 2. **Recurring concrete steps** — find the specific repeated time-sinks named in the notes: "waited for vendor", "couldn't find the runbook", "manual approval", "re-derived the same query", "no dashboard for X". Count how often each recurs and the time it tends to cost. 3. **Cluster by cause** — group the time-sinks into themes (observability gaps, manual toil, coordination/handoff, approval/process). Show which theme costs the most cumulative time. 4. **Rank what to automate** — propose concrete interventions (auto-enrichment, runbook automation, pre-approved rollback, better routing) ranked by recurrence × time-cost × feasibility. For each, name which past incidents it would have helped. 5. **Caveats** — call out small-sample bias, missing timestamps, and survivorship (incidents never recorded). Don't over-fit to one dramatic outage. Output format: a phase table (median/p90 + bottleneck count), a ranked "TOP TIME-SINKS" table (sink | recurrences | est. time cost | theme), and a ranked "AUTOMATE FIRST" list with the incidents each would have helped. Show your arithmetic. Analyze the system and process, never individual responders. You produce the analysis and recommendations; humans decide what to build.
Why this prompt works
This prompt operates in the learn phase, where the biggest MTTR wins are actually made. Improving one incident at a time is reactive and slow; the durable reductions come from spotting the time-sink that recurs across dozens of incidents and removing it once. That requires looking at a batch with statistical discipline, which is well suited to a structured LLM pass over a pasted corpus.
The analysis is deliberately two-level. The phase decomposition (median and p90, plus how many incidents each phase was the bottleneck for) finds where time systemically goes, while the recurring-step extraction finds the concrete repeated toil — the missing runbook, the manual approval, the re-derived query — that the phase numbers alone won’t name. Clustering those into themes and ranking interventions by recurrence times cost times feasibility turns a pile of postmortems into a prioritized automation backlog tied to the specific incidents each fix would have helped.
The guardrails protect the integrity of the conclusion. Demanding medians, p90s, sample sizes, and outlier flags stops a single dramatic outage from skewing priorities, and the hard rule against ranking individuals keeps retros blameless — which is what keeps the underlying data honest in the first place. The AI produces the analysis; humans decide what to build, so the learning phase stays grounded in real, defensible patterns rather than confident noise.
Related prompts
-
Have We Seen This Before? Symptom-Match Prompt
Match the current symptom signature against your own past incidents and their fixes — fast — so a recurrence is resolved from prior knowledge instead of diagnosed from scratch, collapsing time-to-diagnose.
-
Post-Fix Verification Checklist Prompt
Build the queries and checks that confirm the fix actually resolved the incident — across the metric, the user, and the dependencies — before calling the all-clear, so you don't reopen later and inflate real MTTR.