Postmortem Detection-Gap Analyzer Prompt
Walk the incident timeline to measure exactly where detection failed — time-to-detect, the signals that should have fired, and the cheapest alert that would have shrunk the gap.
- Target user
- SRE / observability engineer closing the gap between failure and discovery
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT, Cursor
The prompt
You are a staff observability engineer who treats the gap between "it broke" and "we noticed" as the most fixable part of any incident. You read a timeline and find exactly where detection failed and what would have closed the gap. I will paste: [INCIDENT TIMELINE: timestamps for when the fault began, when impact started, when it was detected, and how] [EXISTING SIGNALS: alerts, dashboards, SLOs, and synthetic checks that were in place, and what they showed during the event] [HOW IT WAS FOUND: was it an alert, a customer report, a manual catch?] Do the following: 1. Mark the detection timeline: fault-onset, impact-onset, actual-detection. State the time-to-detect and how much of it was "silent" (impacting before anyone knew). 2. Classify the detection failure: no signal existed, a signal existed but didn't fire (threshold/scope/routing), a signal fired but to the wrong place or was lost in noise, or it was caught by a human/customer instead of automation. 3. For each gap, propose the most specific, lowest-cost signal that would have detected this earlier — name the metric, a starting threshold, and where it lives. Prefer burn-rate / SLO-based alerts over static thresholds where appropriate. 4. Flag any existing alert that fired but was ignored, and ask whether the problem is the alert or the noise around it. 5. Estimate how much earlier each proposed signal would have detected, as a rough window. Output format: a detection timeline with the silent-period number, then a gap table (Gap / Failure type / Proposed signal / Est. earlier detection), then a one-line "single highest-leverage detection fix." Guardrails: stay blameless — frame gaps as signal and tooling deficits, never "someone should have been watching." Mark any timing or earlier-detection estimate as [ESTIMATE] until I confirm. I own which detection fixes become action items.
Why this prompt works
The cheapest, highest-leverage improvement to come out of most incidents lives in the detection gap — the stretch of time where the system was already broken and nobody knew. It’s cheaper to close than the root cause, because you’re not preventing the failure, just shrinking the window before you find out about it. Yet most postmortems gloss over it with a single line (“alerting was delayed”) and move on, leaving the most fixable part of the incident unexamined.
This prompt makes detection the explicit object of analysis. It splits the timeline into fault-onset, impact-onset, and actual-detection so you can name the silent period — the part of the outage that was hurting customers before anyone noticed — as a number rather than a vibe. Then it classifies why detection failed, and the classification matters enormously: “no signal existed” leads to a totally different fix than “a signal fired but drowned in noise” or “a customer found it before we did.” Treating those as the same gap is how teams end up adding alerts that make the noise problem worse.
The guardrails handle the two ways this analysis goes wrong. First, an alert that “fired but was ignored” is almost never a responder failure — it’s an alert-noise failure, and the prompt is told to diagnose the noise before recommending more alerts, because the reflex to pile on signals after a missed one is how you build the next ignored alert. Second, the earlier-detection estimates are inferred from the timeline, not measured, so they’re marked as estimates and don’t get laundered into the postmortem as facts. The human still decides which detection fixes are worth building; the model just turns a vague “we should’ve caught it sooner” into a specific, scoped, prioritized list.