Disaster Recovery Gameday and RTO Validation Prompt

Design a disaster-recovery gameday that actually validates your RTO/RPO by restoring from backups and failing over for real — instead of the tabletop fiction that backups 'probably' work.

Target user: SRE and platform teams who need to prove their DR plan works rather than assume it does
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a DR specialist who has discovered, the hard way, that untested backups are just hope and that most teams' real recovery time exceeds their stated RTO by an order of magnitude. Help me design a disaster-recovery gameday that produces evidence, not vibes.

I will provide:
- The systems in scope (databases, object storage, stateful services, infra-as-code)
- Stated RTO/RPO targets and how they were derived
- Backup/restore mechanisms and where backups live
- Whether prior restores have ever been performed end-to-end

Do this:

1. **Pick a sharp scenario** — Choose one realistic disaster (region loss, ransomware-encrypted primary, accidental table drop, corrupted backup). Define the exact starting state and the success condition.

2. **Measure, don't assert** — Specify precisely what we will time: detection, decision, restore start, data restored, service healthy, traffic restored. The measured RTO is the only RTO that counts.

3. **Restore-from-zero test** — Force an actual restore from backup into a clean environment. Include verifying backup integrity, restore order for dependent data, and confirming application correctness, not just process-up.

4. **RPO truth** — Determine how much data was actually lost between last good backup and the disaster moment, and whether that matches the stated RPO.

5. **Safety rails** — Run against an isolated environment; define blast-radius controls so the gameday itself can't cause a real outage. Include an abort trigger and rollback.

6. **Findings to action** — Template for capturing where measured RTO exceeded target, which steps were undocumented, and which backups were unusable.

Output: the scenario brief, a timed run-of-show with roles, the measurement sheet, the safety/abort plan, and a findings template that converts gaps into owned action items.

Treat any step that 'should work but has never been tested' as a likely failure and design the gameday to expose it.
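
Example: scoring the measurement sheet

This example is not part of the prompt. It is a minimal sketch of how the timestamps from steps 2 and 4 could be turned into a measured RTO and RPO once the gameday has run. The milestone names, the `score_gameday` helper, and every timestamp and target below are hypothetical. The sketch assumes the RTO clock starts when the disaster is injected and stops when traffic is restored, and that "last good backup" means the backup that actually restored cleanly, not the most recent one on disk.

```python
from datetime import datetime, timedelta

# Milestones from step 2, in the order they should occur.
# The names are illustrative; "disaster" marks the injection moment.
MILESTONES = [
    "disaster",
    "detection",
    "decision",
    "restore_start",
    "data_restored",
    "service_healthy",
    "traffic_restored",
]


def score_gameday(timestamps, last_good_backup, rto_target, rpo_target):
    """Turn recorded timestamps into per-phase durations and measured RTO/RPO."""
    missing = [m for m in MILESTONES if m not in timestamps]
    if missing:
        raise ValueError(f"unrecorded milestones: {missing}")

    times = [timestamps[m] for m in MILESTONES]
    if any(later < earlier for earlier, later in zip(times, times[1:])):
        raise ValueError("milestone timestamps are out of order")

    # Duration of each phase between consecutive milestones.
    phases = {
        f"{a} -> {b}": timestamps[b] - timestamps[a]
        for a, b in zip(MILESTONES, MILESTONES[1:])
    }
    # Assumption: RTO runs from disaster injection to traffic restored.
    measured_rto = timestamps["traffic_restored"] - timestamps["disaster"]
    # Data written after the last usable backup and before the disaster is lost.
    measured_rpo = timestamps["disaster"] - last_good_backup
    return {
        "phases": phases,
        "measured_rto": measured_rto,
        "measured_rpo": measured_rpo,
        "rto_met": measured_rto <= rto_target,
        "rpo_met": measured_rpo <= rpo_target,
    }


# Hypothetical run: all values are made up for illustration.
t0 = datetime(2024, 1, 1, 10, 0)
recorded = {
    "disaster": t0,
    "detection": t0 + timedelta(minutes=7),
    "decision": t0 + timedelta(minutes=20),
    "restore_start": t0 + timedelta(minutes=35),
    "data_restored": t0 + timedelta(minutes=155),
    "service_healthy": t0 + timedelta(minutes=185),
    "traffic_restored": t0 + timedelta(minutes=200),
}
result = score_gameday(
    recorded,
    last_good_backup=t0 - timedelta(minutes=45),
    rto_target=timedelta(hours=1),
    rpo_target=timedelta(hours=1),
)

for phase, duration in result["phases"].items():
    print(f"{phase}: {duration}")
print("measured RTO:", result["measured_rto"], "target met:", result["rto_met"])
print("measured RPO:", result["measured_rpo"], "target met:", result["rpo_met"])
```

Keeping per-phase durations alongside the totals shows where the time went, which is what the findings template in step 6 needs: in this made-up run the restore itself dominates, while detection and decision are comparatively small.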
