Post-Fix Verification Checklist Prompt
Build the queries and checks that confirm the fix actually resolved the incident — across the metric, the user, and the dependencies — before calling the all-clear, so you don't reopen later and inflate real MTTR.
- Target user
- On-call SREs and incident commanders closing out an incident
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT, Cursor
The prompt
You are a senior SRE who has been burned by premature all-clears — the graph dipped, someone called it resolved, and it came roaring back twenty minutes later. Help me build a verification checklist that proves the fix actually worked before we close. Paste your context: - The incident and its primary symptom: [WHAT BROKE + THE KEY METRIC] - The fix that was applied: [WHAT WE CHANGED / ROLLED BACK / RESTARTED] - Signals available: [METRICS / DASHBOARDS / LOGS / SYNTHETIC CHECKS] - The original alert/SLO: [THRESHOLD + WINDOW] Build the verification plan: 1. **Define "resolved" precisely** — state the exact condition that must hold to declare all-clear: which metric, back below which threshold, sustained for how long. A momentary dip is not resolution. 2. **Layered checks** — give the checks at three layers: the firing metric itself, the user-facing signal (synthetic/real-user/error budget), and the downstream dependencies that may have been affected or backed up. A fix can clear the alert while leaving a queue draining or a cache cold. 3. **Write the actual queries** — for each check, the concrete read-only query or command, with my placeholders filled in, and the result that means "good". 4. **Watch for rebound and side effects** — list what would indicate the fix is masking rather than resolving (e.g. retries hiding errors, traffic shifted not recovered) and any new problem the fix could have introduced. 5. **Stand-down criteria and monitoring window** — state how long to keep watching after the metric recovers and what to keep an eye on post-close. Output format: a "RESOLVED WHEN" definition, then a checklist table — layer | check | query | expected-good-result | status. Then "REBOUND WATCH" and "STAND-DOWN AFTER". Every query must be read-only. Rank the checks so the most decisive one is first, and present this as a checklist for the human to run and confirm — do not declare the incident resolved yourself.
Why this prompt works
This targets the verify phase, the most under-invested stage of the MTTR funnel. Teams pour energy into diagnosis and mitigation, then close on a single recovered graph — and a premature all-clear is one of the worst things for true MTTR, because a reopened incident counts the whole second round on top of the first. Disciplined verification is cheaper than re-living the outage.
The prompt insists on a precise, sustained definition of “resolved” rather than a momentary dip, and it spreads checks across three layers — the firing metric, the user-facing signal, and downstream dependencies — because fixes routinely clear the alert while leaving a queue draining or a cache cold. Concrete, filled-in read-only queries turn that into something the responder can actually run in the closing minutes.
The rebound-watch and stand-down criteria encode the hard-won instinct that a fix can mask rather than resolve. By keeping every query read-only and leaving the resolution call to the human, the prompt gives responders a fast, thorough confirmation routine without letting the AI prematurely declare victory — which is exactly the discipline that keeps a clean-looking MTTR number honest.
Related prompts
-
Live Incident Scribe and Timeline Prompt
Maintain a running, structured incident timeline as events happen — actions, findings, decisions — so handoffs transfer state instead of resetting it, keeping cumulative recovery time from compounding.
-
Runbook and Next-Step Surfacer Prompt
Match the live symptom to the right runbook and surface the exact command to run now — so the responder acts from a known-good procedure instead of improvising, shortening time-to-mitigate.