Disaster Recovery Gameday and RTO Validation Prompt
Design a disaster-recovery gameday that actually validates your RTO/RPO by restoring from backups and failing over for real — instead of the tabletop fiction that backups 'probably' work.
- Target user: SRE and platform teams who need to prove their DR plan rather than assume it
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a DR specialist who has discovered, the hard way, that untested backups are just hope and that most teams' real recovery time exceeds their stated RTO by an order of magnitude. Help me design a disaster-recovery gameday that produces evidence, not vibes.

I will provide:
- The systems in scope (databases, object storage, stateful services, infra-as-code)
- Stated RTO/RPO targets and how they were derived
- Backup/restore mechanisms and where backups live
- Whether prior restores have ever been performed end-to-end

Do this:
1. **Pick a sharp scenario**: Choose one realistic disaster (region loss, ransomware-encrypted primary, accidental table drop, corrupted backup). Define the exact starting state and the success condition.
2. **Measure, don't assert**: Specify precisely what we will time: detection, decision, restore start, data restored, service healthy, traffic restored. The measured RTO is the only RTO that counts.
3. **Restore-from-zero test**: Force an actual restore from backup into a clean environment. Include verifying backup integrity, restore order for dependent data, and confirming application correctness, not just that the process is up.
4. **RPO truth**: Determine how much data was actually lost between the last good backup and the disaster moment, and whether that matches the stated RPO.
5. **Safety rails**: Run against an isolated environment; define blast-radius controls so the gameday itself can't cause a real outage. Include an abort trigger and rollback.
6. **Findings to action**: Provide a template for capturing where measured RTO exceeded target, which steps were undocumented, and which backups were unusable.

Output: the scenario brief, a timed run-of-show with roles, the measurement sheet, the safety/abort plan, and a findings template that converts gaps into owned action items.

Treat any step that 'should work but has never been tested' as a likely failure and design the gameday to expose it.
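
The measurement sheet the prompt asks for comes down to simple timestamp arithmetic, which a small script can do once the gameday is over. The following is a minimal Python sketch, not part of the prompt itself. The milestone names mirror the list in step 2, but the function names, the dictionary layout, and every timestamp and target in the example are hypothetical. It also assumes that measured RTO runs from the disaster moment to traffic restored, and that measured RPO is the gap between the last good backup and the disaster moment.

```python
from datetime import datetime, timedelta

# Milestones from step 2 of the prompt, in the order they are expected to occur.
MILESTONES = [
    "detection",
    "decision",
    "restore_start",
    "data_restored",
    "service_healthy",
    "traffic_restored",
]


def phase_durations(disaster_at, stamps):
    """Return (milestone, time elapsed since the previous milestone) pairs."""
    durations = []
    previous = disaster_at
    for name in MILESTONES:
        current = stamps[name]
        if current < previous:
            raise ValueError(f"{name} is timestamped before the preceding milestone")
        durations.append((name, current - previous))
        previous = current
    return durations


def evaluate(disaster_at, last_good_backup_at, stamps, rto_target, rpo_target):
    """Compare measured RTO/RPO with the stated targets (assumed definitions)."""
    measured_rto = stamps["traffic_restored"] - disaster_at
    measured_rpo = disaster_at - last_good_backup_at
    return {
        "measured_rto": measured_rto,
        "measured_rpo": measured_rpo,
        "rto_met": measured_rto <= rto_target,
        "rpo_met": measured_rpo <= rpo_target,
    }


if __name__ == "__main__":
    # Hypothetical gameday timeline; replace with timestamps recorded during the run.
    disaster = datetime(2024, 1, 1, 10, 0)
    stamps = {
        "detection": datetime(2024, 1, 1, 10, 7),
        "decision": datetime(2024, 1, 1, 10, 20),
        "restore_start": datetime(2024, 1, 1, 10, 35),
        "data_restored": datetime(2024, 1, 1, 12, 5),
        "service_healthy": datetime(2024, 1, 1, 12, 30),
        "traffic_restored": datetime(2024, 1, 1, 12, 45),
    }
    result = evaluate(
        disaster_at=disaster,
        last_good_backup_at=datetime(2024, 1, 1, 9, 15),
        stamps=stamps,
        rto_target=timedelta(hours=1),
        rpo_target=timedelta(minutes=15),
    )
    for name, took in phase_durations(disaster, stamps):
        print(f"{name:<18} +{took}")
    print(result)
```

Breaking the total into per-phase durations shows where the time actually went (for example, the data restore itself versus the decision to fail over), which is the detail the findings template needs in order to assign owners.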