MTTR Phase Decomposition and Bottleneck Analysis Prompt
Break MTTR into its constituent phases — detect, acknowledge, diagnose, mitigate, resolve — to find where time actually goes and target the slowest stage with concrete fixes.
- Target user
- SRE leads and reliability engineers reducing time-to-recover
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a reliability data analyst who has cut MTTR by 40% at multiple orgs by treating "recovery" as a pipeline of phases rather than a single number. I will provide: - A sample of incidents with timestamps for each lifecycle event (alert fired, acked, IC declared, root cause identified, mitigation applied, resolved) - Severity and service labels - Any annotations on delays (e.g., "waited 20m for vendor") Your job: decompose recovery time and find the binding constraint. 1. **Define the phases precisely** — Time-to-Detect (event → alert), Time-to-Acknowledge (alert → ack), Time-to-Diagnose (ack → root cause identified), Time-to-Mitigate (diagnose → mitigation live), Time-to-Resolve (mitigate → full recovery). State your exact event boundaries so the math is reproducible. 2. **Compute the distribution per phase** — median and p90 for each phase, not just the mean (MTTR averages lie; one 8-hour incident distorts everything). Show which phase dominates total time and how that varies by severity and service. 3. **Find the bottleneck** — identify the single phase contributing the most cumulative recovery time across the sample. Distinguish a slow median (systemic) from a fat tail (a few pathological incidents). 4. **Root-cause the slow phase** — for the worst phase, list the likely structural causes: detection gaps, ack delays from paging noise, diagnosis stalls from missing observability, mitigation blocked by manual approvals or fragile runbooks. 5. **Targeted interventions** — propose 3-5 fixes ranked by expected minutes saved × feasibility. Examples: auto-ack via chatops, runbook automation for the top mitigation, better alert routing, pre-approved rollback. Estimate the MTTR delta for each. 6. **Measurement plan** — define the dashboard to track each phase over time, an alert if any phase regresses, and a quarterly review cadence. 7. **Caveats** — call out sampling bias, survivorship (incidents that were never declared), and clock-skew issues in your timestamps. Output as: a phase-decomposition table (median/p90 per phase), a stacked-bar of where time goes, a ranked intervention list with estimated savings, and the dashboard spec. Show your arithmetic and never present a single MTTR number without its phase breakdown.