Post-Fix Verification Checklist Prompt

Build the queries and checks that confirm the fix actually resolved the incident — across the metric, the user, and the dependencies — before calling the all-clear, so you don't reopen later and inflate real MTTR.

Target user

On-call SREs and incident commanders closing out an incident

Difficulty

Intermediate

Tools

Claude, ChatGPT, Cursor

You are a senior SRE who has been burned by premature all-clears: the graph dipped, someone called it resolved, and it came roaring back twenty minutes later. Help me build a verification checklist that proves the fix actually worked before we close.

Paste your context:
- The incident and its primary symptom: [WHAT BROKE + THE KEY METRIC]
- The fix that was applied: [WHAT WE CHANGED / ROLLED BACK / RESTARTED]
- Signals available: [METRICS / DASHBOARDS / LOGS / SYNTHETIC CHECKS]
- The original alert/SLO: [THRESHOLD + WINDOW]

Build the verification plan:
1. **Define "resolved" precisely**: state the exact condition that must hold to declare all-clear: which metric, back below which threshold, sustained for how long. A momentary dip is not resolution.
2. **Layered checks**: give the checks at three layers: the firing metric itself, the user-facing signal (synthetic/real-user/error budget), and the downstream dependencies that may have been affected or backed up. A fix can clear the alert while leaving a queue draining or a cache cold.
3. **Write the actual queries**: for each check, the concrete read-only query or command, with my placeholders filled in, and the result that means "good".
4. **Watch for rebound and side effects**: list what would indicate the fix is masking rather than resolving (e.g. retries hiding errors, traffic shifted not recovered) and any new problem the fix could have introduced.
5. **Stand-down criteria and monitoring window**: state how long to keep watching after the metric recovers and what to keep an eye on post-close.

Output format: a "RESOLVED WHEN" definition, then a checklist table with the columns layer | check | query | expected-good-result | status. Then "REBOUND WATCH" and "STAND-DOWN AFTER". Every query must be read-only. Rank the checks so the most decisive one is first, and present this as a checklist for the human to run and confirm. Do not declare the incident resolved yourself.

Why this prompt works

This targets the verify phase, the most under-invested stage of the MTTR funnel. Teams pour energy into diagnosis and mitigation, then close on a single recovered graph. A premature all-clear is especially costly for true MTTR, because a reopened incident adds the whole second round of response on top of the first. Disciplined verification is cheaper than reliving the outage.

The prompt insists on a precise, sustained definition of “resolved” rather than a momentary dip, and it spreads checks across three layers — the firing metric, the user-facing signal, and downstream dependencies — because fixes routinely clear the alert while leaving a queue draining or a cache cold. Concrete, filled-in read-only queries turn that into something the responder can actually run in the closing minutes.
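The difference between a momentary dip and a sustained recovery can be sketched in a few lines of Python. This is a minimal illustration, not part of the prompt: it assumes the samples have already been fetched read-only from whatever metrics backend is in use, and the function name `is_sustained_recovery`, the threshold, and the hold window are all hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def is_sustained_recovery(
    samples: Sequence[Sample],
    threshold: float,
    hold_seconds: float,
) -> bool:
    """Return True only if the metric has stayed below `threshold`
    for at least `hold_seconds`, measured back from the newest sample.

    Assumption: samples are dense enough that a gap does not hide a spike;
    a real check would also need to treat missing data as "not good".
    """
    if not samples:
        return False
    ordered = sorted(samples)
    latest_ts = ordered[-1][0]

    # Walk backwards from the newest sample to find where the current
    # below-threshold streak began.
    streak_start = None
    for ts, value in reversed(ordered):
        if value >= threshold:
            break
        streak_start = ts

    if streak_start is None:
        return False  # the newest sample is still at or above the threshold
    return latest_ts - streak_start >= hold_seconds


if __name__ == "__main__":
    # Hypothetical error-rate samples, one per minute.
    dip = [(0, 0.09), (60, 0.01), (120, 0.08), (180, 0.01), (240, 0.01)]
    print(is_sustained_recovery(dip, threshold=0.05, hold_seconds=600))
```

With the hypothetical `dip` series, the most recent below-threshold streak is only one minute long, so a ten-minute hold window is not satisfied and the all-clear should wait.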

The rebound-watch and stand-down criteria encode the hard-won instinct that a fix can mask rather than resolve. By keeping every query read-only and leaving the resolution call to the human, the prompt gives responders a fast, thorough confirmation routine without letting the AI prematurely declare victory — which is exactly the discipline that keeps a clean-looking MTTR number honest.
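As a rough sketch of what a rebound watch might look for, the Python below flags two of the masking patterns the prompt names: retries hiding errors, and traffic that shifted away rather than recovered. Everything here is an assumption for illustration: the `WindowCounts` fields, the `masking_signals` function, and the default limits are hypothetical, and the counts are presumed to come from read-only queries against existing telemetry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowCounts:
    requests: int  # logical requests made by callers in the window
    attempts: int  # total attempts, including client or proxy retries


def masking_signals(
    baseline: WindowCounts,
    current: WindowCounts,
    retry_ratio_limit: float = 1.2,  # hypothetical limit on attempts per request
    traffic_floor: float = 0.8,      # hypothetical minimum share of baseline traffic
) -> List[str]:
    """Return warnings suggesting the fix may be masking the problem.

    `baseline` is a comparable pre-incident window; `current` is the
    post-fix window under review. An empty list means no warning fired,
    which is not the same as proof of resolution.
    """
    if current.requests == 0:
        return ["no traffic in the current window; recovery cannot be confirmed"]

    warnings: List[str] = []

    # Many attempts per request means retries are absorbing failures
    # that the top-level error rate no longer shows.
    retry_ratio = current.attempts / current.requests
    if retry_ratio > retry_ratio_limit:
        warnings.append(
            f"retry ratio {retry_ratio:.2f} exceeds {retry_ratio_limit:.2f}; "
            "retries may be hiding errors"
        )

    # A clean error rate on much lower volume can mean traffic was
    # shifted away rather than served successfully.
    if current.requests < traffic_floor * baseline.requests:
        warnings.append(
            "traffic is well below baseline; it may have shifted rather than recovered"
        )

    return warnings


if __name__ == "__main__":
    before = WindowCounts(requests=1000, attempts=1020)
    after = WindowCounts(requests=500, attempts=900)
    for line in masking_signals(before, after):
        print("WARN:", line)
```

The function only returns warnings for a human to review. It never decides that the incident is resolved, which matches the prompt's rule that the resolution call stays with the responder.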

Related prompts

Live Incident Scribe and Timeline Prompt

Runbook and Next-Step Surfacer Prompt
