
Incident Recovery Verification Checklist Prompt

Build a rigorous all-clear checklist so an incident is declared resolved only after recovery is verified end-to-end — not just when the obvious symptom disappears.

Target user: Incident commanders and SREs deciding when to call all-clear
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior SRE who has seen incidents re-open an hour after a premature all-clear because someone confirmed the dashboard was green but not that the system was actually healthy.

I will provide:
- The affected service(s) and architecture
- The primary symptom and the mitigation applied (rollback, failover, scale-up, flag flip)
- Available signals (SLO dashboards, synthetic checks, queue depths, error rates)
- Downstream consumers and any data integrity concerns

Build a recovery verification checklist. Work through these steps:

1. **Separate symptom from health** — list the difference between "the alert cleared" and "the system is genuinely recovered." Name the false-recovery traps for this service (cached results, drained-then-refilling queues, masked errors, partial failover).

2. **Define verification layers** — checks at each layer: (a) the failing signal itself, (b) golden SLO signals, (c) synthetic / real user journeys, (d) downstream dependents, (e) data integrity / backlog drain, (f) the mitigation's side effects (e.g., is the rollback stable, is the scaled-up capacity sustainable).

3. **Set hold-and-watch criteria** — how long signals must stay healthy before all-clear, and what bounce-back would re-open the incident.

4. **Handle leftover risk** — temporary mitigations still in place (a flag off, capacity over-provisioned, a node cordoned) that must be tracked as follow-ups, not forgotten at all-clear.

5. **Verify the backlog** — queues, retries, dead-letter, delayed jobs, and reconciliation that must be confirmed drained or scheduled.

6. **Write the all-clear gate** — the explicit go/no-go the commander reads aloud before declaring resolution, plus who must confirm.

Output: (a) a layered verification checklist with pass criteria, (b) the false-recovery trap list for this service, (c) hold-and-watch durations with bounce-back triggers, (d) a follow-up tracker for leftover mitigations, (e) the spoken all-clear gate.

Bias toward proving recovery, not assuming it.
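
To make steps 3 and 6 concrete, here is a minimal sketch of how a hold-and-watch check and an all-clear gate might be evaluated in code. It is an illustration under stated assumptions, not part of the prompt: the `Sample` type, the function names, the status strings, the 1% error-rate threshold, and the 30-minute hold are all hypothetical, and real values should come from the service's own SLOs.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Sample:
    """One observation of a signal (hypothetical shape, for illustration)."""
    minute: int        # minutes since the mitigation was applied
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def hold_and_watch(samples: Sequence[Sample], threshold: float, hold_minutes: int) -> str:
    """Classify one signal as 'no-data', 'unhealthy', 'bounce-back', 'holding', or 'pass'.

    The hold clock starts at the first healthy sample of the current streak
    and resets whenever a sample breaches the threshold.
    """
    if not samples:
        return "no-data"  # absence of data is not evidence of recovery

    streak_start = None  # minute at which the current healthy streak began
    bounced = False      # True once a healthy streak has been broken
    ordered = sorted(samples, key=lambda s: s.minute)
    for s in ordered:
        if s.error_rate <= threshold:
            if streak_start is None:
                streak_start = s.minute
        else:
            if streak_start is not None:
                bounced = True
            streak_start = None

    if streak_start is None:
        # The latest sample is breaching: either a relapse or never recovered.
        return "bounce-back" if bounced else "unhealthy"
    held = ordered[-1].minute - streak_start
    return "pass" if held >= hold_minutes else "holding"


def all_clear(layer_status: Mapping[str, str]) -> bool:
    """Go/no-go gate: every verification layer must report 'pass'."""
    return bool(layer_status) and all(v == "pass" for v in layer_status.values())


# Usage: a signal that stays healthy for 30 minutes, and one that relapses.
recovered = [Sample(m, 0.002) for m in (0, 10, 20, 30)]
relapsed = recovered + [Sample(40, 0.08)]
print(hold_and_watch(recovered, threshold=0.01, hold_minutes=30))  # pass
print(hold_and_watch(relapsed, threshold=0.01, hold_minutes=30))   # bounce-back
print(all_clear({"failing_signal": "pass", "backlog": "holding"}))  # False
```

The gate deliberately treats anything other than an explicit "pass" as a no-go, including missing data and an empty checklist, which matches the prompt's bias toward proving recovery rather than assuming it.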
