Reduce MTTR with AI

MTTR Phase Decomposition and Bottleneck Analysis Prompt

Break MTTR into its constituent phases — detect, acknowledge, diagnose, mitigate, resolve — to find where time actually goes and target the slowest stage with concrete fixes.

Target user: SRE leads and reliability engineers reducing time-to-recover
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a reliability data analyst who has cut MTTR by 40% at multiple orgs by treating "recovery" as a pipeline of phases rather than a single number.

I will provide:
- A sample of incidents with timestamps for each lifecycle event (impact start where known, alert fired, acked, IC declared, root cause identified, mitigation applied, resolved)
- Severity and service labels
- Any annotations on delays (e.g., "waited 20m for vendor")

Your job: decompose recovery time and find the binding constraint.

1. **Define the phases precisely** — Time-to-Detect (impact start → alert fired), Time-to-Acknowledge (alert fired → ack), Time-to-Diagnose (ack → root cause identified), Time-to-Mitigate (root cause identified → mitigation live), Time-to-Resolve (mitigation live → full recovery). State your exact event boundaries so the math is reproducible.

2. **Compute the distribution per phase** — median and p90 for each phase, not just the mean (MTTR averages lie; one 8-hour incident distorts everything). Show which phase dominates total time and how that varies by severity and service.

3. **Find the bottleneck** — identify the single phase contributing the most cumulative recovery time across the sample. Distinguish a slow median (systemic) from a fat tail (a few pathological incidents).

4. **Root-cause the slow phase** — for the worst phase, list the likely structural causes: detection gaps, ack delays from paging noise, diagnosis stalls from missing observability, mitigation blocked by manual approvals or fragile runbooks.

5. **Targeted interventions** — propose 3-5 fixes ranked by expected minutes saved × feasibility. Examples: auto-ack via chatops, runbook automation for the top mitigation, better alert routing, pre-approved rollback. Estimate the MTTR delta for each.

6. **Measurement plan** — define the dashboard to track each phase over time, an alert if any phase regresses, and a quarterly review cadence.

7. **Caveats** — call out sampling bias, survivorship bias (incidents that were never declared), and clock skew between the systems that recorded the timestamps.

Output as: a phase-decomposition table (median/p90 per phase), a stacked-bar chart of where time goes, a ranked intervention list with estimated savings, and the dashboard spec. Show your arithmetic and never present a single MTTR number without its phase breakdown.
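
To sanity-check the model's arithmetic for steps 1 to 3, the phase math can be reproduced locally. The sketch below is a minimal illustration, not part of the prompt: the field names (`impact_start`, `alert_fired`, and so on), the nearest-rank p90, and the three sample incidents are all assumptions made up for the example, so map them to whatever your incident tracker actually exports.

```python
import math
from datetime import datetime
from statistics import median

# Hypothetical lifecycle event names, in chronological order.
EVENTS = ["impact_start", "alert_fired", "acked",
          "root_cause_identified", "mitigation_applied", "resolved"]
# Phase i spans EVENTS[i] -> EVENTS[i + 1].
PHASES = ["detect", "acknowledge", "diagnose", "mitigate", "resolve"]


def phase_minutes(incident):
    """Minutes spent in each phase for one incident (ISO-8601 timestamp strings)."""
    times = [datetime.fromisoformat(incident[e]) for e in EVENTS]
    return {p: (end - start).total_seconds() / 60
            for p, start, end in zip(PHASES, times, times[1:])}


def p90(values):
    """Nearest-rank 90th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]


def summarize(incidents):
    """Median, p90, and cumulative minutes per phase across the sample."""
    rows = [phase_minutes(i) for i in incidents]
    summary = {}
    for phase in PHASES:
        vals = [row[phase] for row in rows]
        summary[phase] = {"median": median(vals), "p90": p90(vals), "total": sum(vals)}
    return summary


def bottleneck(summary):
    """Phase with the most cumulative recovery time (step 3 of the prompt)."""
    return max(summary, key=lambda phase: summary[phase]["total"])


# Illustrative data only; these incidents are invented for the example.
SAMPLE = [
    {"impact_start": "2024-01-10T10:00", "alert_fired": "2024-01-10T10:05",
     "acked": "2024-01-10T10:08", "root_cause_identified": "2024-01-10T10:40",
     "mitigation_applied": "2024-01-10T10:50", "resolved": "2024-01-10T11:00"},
    {"impact_start": "2024-01-12T14:00", "alert_fired": "2024-01-12T14:02",
     "acked": "2024-01-12T14:10", "root_cause_identified": "2024-01-12T15:10",
     "mitigation_applied": "2024-01-12T15:15", "resolved": "2024-01-12T15:30"},
    {"impact_start": "2024-01-15T09:00", "alert_fired": "2024-01-15T09:10",
     "acked": "2024-01-15T09:12", "root_cause_identified": "2024-01-15T09:30",
     "mitigation_applied": "2024-01-15T09:45", "resolved": "2024-01-15T10:00"},
]

summary = summarize(SAMPLE)
for phase, stats in summary.items():
    print(f"{phase:<12} median={stats['median']:.0f}m "
          f"p90={stats['p90']:.0f}m total={stats['total']:.0f}m")
print("bottleneck:", bottleneck(summary))
```

With only a handful of incidents the p90 collapses to the maximum, so treat tail figures from small samples as indicative at best. Comparing each phase's median against its p90 is one simple way to separate a systemically slow phase from one dominated by a few pathological incidents.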
