Skip to content
CloudOps
Newsletter
All prompts
Reduce MTTR with AI Difficulty: Intermediate ClaudeChatGPTCursor

Post-Fix Verification Checklist Prompt

Build the queries and checks that confirm the fix actually resolved the incident — across the metric, the user, and the dependencies — before calling the all-clear, so you don't reopen later and inflate real MTTR.

Target user
On-call SREs and incident commanders closing out an incident
Difficulty
Intermediate
Tools
Claude, ChatGPT, Cursor

The prompt

You are a senior SRE who has been burned by premature all-clears — the graph dipped, someone called it resolved, and it came roaring back twenty minutes later. Help me build a verification checklist that proves the fix actually worked before we close.

Paste your context:
- The incident and its primary symptom: [WHAT BROKE + THE KEY METRIC]
- The fix that was applied: [WHAT WE CHANGED / ROLLED BACK / RESTARTED]
- Signals available: [METRICS / DASHBOARDS / LOGS / SYNTHETIC CHECKS]
- The original alert/SLO: [THRESHOLD + WINDOW]

Build the verification plan:

1. **Define "resolved" precisely** — state the exact condition that must hold to declare all-clear: which metric, back below which threshold, sustained for how long. A momentary dip is not resolution.

2. **Layered checks** — give the checks at three layers: the firing metric itself, the user-facing signal (synthetic/real-user/error budget), and the downstream dependencies that may have been affected or backed up. A fix can clear the alert while leaving a queue draining or a cache cold.

3. **Write the actual queries** — for each check, the concrete read-only query or command, with my placeholders filled in, and the result that means "good".

4. **Watch for rebound and side effects** — list what would indicate the fix is masking rather than resolving (e.g. retries hiding errors, traffic shifted not recovered) and any new problem the fix could have introduced.

5. **Stand-down criteria and monitoring window** — state how long to keep watching after the metric recovers and what to keep an eye on post-close.

Output format: a "RESOLVED WHEN" definition, then a checklist table — layer | check | query | expected-good-result | status. Then "REBOUND WATCH" and "STAND-DOWN AFTER". Every query must be read-only. Rank the checks so the most decisive one is first, and present this as a checklist for the human to run and confirm — do not declare the incident resolved yourself.

Why this prompt works

This targets the verify phase, the most under-invested stage of the MTTR funnel. Teams pour energy into diagnosis and mitigation, then close on a single recovered graph — and a premature all-clear is one of the worst things for true MTTR, because a reopened incident counts the whole second round on top of the first. Disciplined verification is cheaper than re-living the outage.

The prompt insists on a precise, sustained definition of “resolved” rather than a momentary dip, and it spreads checks across three layers — the firing metric, the user-facing signal, and downstream dependencies — because fixes routinely clear the alert while leaving a queue draining or a cache cold. Concrete, filled-in read-only queries turn that into something the responder can actually run in the closing minutes.

The rebound-watch and stand-down criteria encode the hard-won instinct that a fix can mask rather than resolve. By keeping every query read-only and leaving the resolution call to the human, the prompt gives responders a fast, thorough confirmation routine without letting the AI prematurely declare victory — which is exactly the discipline that keeps a clean-looking MTTR number honest.

Related prompts

Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 1,603 DevOps AI prompts
  • One practical workflow email per week