Incident Recovery Verification Checklist Prompt
Build a rigorous all-clear checklist so an incident is declared resolved only after recovery is verified end-to-end — not just when the obvious symptom disappears.
- Target user: Incident commanders and SREs deciding when to call all-clear
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who has seen incidents re-open an hour after a premature all-clear because someone confirmed the dashboard was green but not that the system was actually healthy.

I will provide:
- The affected service(s) and architecture
- The primary symptom and the mitigation applied (rollback, failover, scale-up, flag flip)
- Available signals (SLO dashboards, synthetic checks, queue depths, error rates)
- Downstream consumers and any data integrity concerns

Build a recovery verification checklist. Work through these steps:

1. **Separate symptom from health**: list the difference between "the alert cleared" and "the system is genuinely recovered." Name the false-recovery traps for this service (cached results, drained-then-refilling queues, masked errors, partial failover).
2. **Define verification layers**: checks at each layer: (a) the failing signal itself, (b) golden SLO signals, (c) synthetic / real user journeys, (d) downstream dependents, (e) data integrity / backlog drain, (f) the mitigation's side effects (e.g., is the rollback stable, is the scaled-up capacity sustainable).
3. **Set hold-and-watch criteria**: how long signals must stay healthy before all-clear, and what bounce-back would re-open the incident.
4. **Handle leftover risk**: temporary mitigations still in place (a flag off, capacity over-provisioned, a node cordoned) that must be tracked as follow-ups, not forgotten at all-clear.
5. **Verify the backlog**: queues, retries, dead-letter, delayed jobs, and reconciliation that must be confirmed drained or scheduled.
6. **Write the all-clear gate**: the explicit go/no-go the commander reads aloud before declaring resolution, plus who must confirm.

Output: (a) a layered verification checklist with pass criteria, (b) the false-recovery trap list for this service, (c) hold-and-watch durations with bounce-back triggers, (d) a follow-up tracker for leftover mitigations, (e) the spoken all-clear gate.

Bias toward proving recovery, not assuming it.
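
Outside the prompt itself, the hold-and-watch idea in step 3 can be made concrete with a small sketch. The following is one possible illustration, not part of the prompt and not a reference implementation: the names `Sample` and `hold_and_watch`, the 60-second sampling interval, and the 300-second hold window are all assumptions chosen for the example. It treats any unhealthy sample as a bounce-back that restarts the hold clock.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Sample:
    """One observation of the watched signals (hypothetical shape)."""
    t: float       # seconds since the mitigation was applied
    healthy: bool  # True if every watched signal met its pass criteria at t


def hold_and_watch(
    samples: Iterable[Sample], hold_seconds: float
) -> Tuple[bool, Optional[float]]:
    """Return (all_clear, healthy_since).

    all_clear is True only if the most recent unbroken run of healthy
    samples spans at least hold_seconds. Any unhealthy sample is treated
    as a bounce-back and resets the run.
    """
    healthy_since: Optional[float] = None
    last_t: Optional[float] = None
    for s in sorted(samples, key=lambda s: s.t):
        if s.healthy:
            if healthy_since is None:
                healthy_since = s.t  # start of a new healthy run
        else:
            healthy_since = None     # bounce-back: restart the hold clock
        last_t = s.t
    if healthy_since is None or last_t is None:
        return False, None
    return (last_t - healthy_since) >= hold_seconds, healthy_since


# Assumed example data: one sample per minute for five minutes.
steady = [Sample(t, True) for t in range(0, 301, 60)]
bounced = [
    Sample(0, True), Sample(60, True), Sample(120, False),
    Sample(180, True), Sample(240, True), Sample(300, True),
]

print(hold_and_watch(steady, 300))   # healthy for the full window
print(hold_and_watch(bounced, 300))  # dip at t=120 restarts the clock
```

In the second case the dashboard is green at the moment of the check, yet the gate stays closed because the healthy run only began at t=180. This is the distinction the prompt asks the model to spell out: a signal that is currently healthy versus one that has stayed healthy long enough to trust.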