Multi-Region Failover Decision Playbook Prompt
Build a pre-decided playbook for whether and when to fail traffic over to another region during an incident, including the cutover steps, the data-consistency traps, and the criteria for failing back.
- Target user: Platform and SRE teams operating active-passive or active-active multi-region services
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior reliability architect who has run real cross-region failovers and knows that the riskiest moment is the decision to cut over, not the cutover itself. Help me write a failover decision playbook that an on-call engineer can execute at 3am without paging an architect.

I will provide:
- Topology (active-passive / active-active), regions, and the traffic-steering layer (DNS, anycast, global LB)
- Data architecture (primary-replica DB, replication lag SLA, async vs sync, quorum)
- Current RTO/RPO targets and what's been validated vs assumed
- Known failover hazards from past gamedays or incidents

Do this:
1. **Decision criteria**: Define the explicit, measurable conditions under which failover is the right call (e.g., primary-region error rate sustained above X for Y minutes with no fix-forward in sight). Equally define when NOT to fail over because the cure is worse.
2. **Data-loss reckoning**: State the RPO implications honestly. If replication is async, quantify the window of writes that may be lost or need reconciliation. Identify split-brain risk and how the playbook prevents two regions accepting writes.
3. **Cutover sequence**: Numbered, copy-pasteable steps: quiesce or fence the bad region, promote the replica, flip traffic steering, verify health, and the explicit ordering that avoids dual-primary.
4. **Verification gates**: After each major step, the exact check that must pass before proceeding, and the rollback at each gate.
5. **Fail-back plan**: Most teams forget this. Define when and how to return to primary, including data reconciliation and re-replication, and why fail-back is often riskier than failover.
6. **Dry-run hooks**: Mark which steps can be rehearsed in a gameday without customer impact.

Output:
(a) a decision flowchart in text,
(b) the numbered cutover runbook with per-step verification and rollback,
(c) the fail-back runbook,
(d) a list of assumptions that must be validated in a gameday before this is trusted.

Flag every step where async replication or DNS TTL could surprise the operator.
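
Example: what a measurable decision criterion can look like

The prompt asks for conditions of the form "error rate sustained above X for Y minutes with no fix-forward in sight", balanced against the RPO cost of failing over on an async replica. The Python below is a minimal sketch of that shape, not part of the prompt and not a recommended policy. Everything in it is a hypothetical illustration: the `Sample`, `Decision`, `sustained_breach`, and `decide` names, the one-sample-per-minute assumption, and the choice to treat current replication lag as a rough stand-in for the window of writes at risk.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Decision(Enum):
    HOLD = "hold"            # stay in the primary region
    ESCALATE = "escalate"    # failing over would exceed RPO; a human must accept the loss
    FAIL_OVER = "fail_over"  # criteria met; start the cutover runbook


@dataclass(frozen=True)
class Sample:
    minute: int        # minutes since monitoring began (assumes one sample per minute)
    error_rate: float  # fraction of failed requests in that minute, 0.0 to 1.0


def sustained_breach(samples: Sequence[Sample], threshold: float, minutes: int) -> bool:
    """Return True if each of the most recent `minutes` samples exceeds `threshold`."""
    if minutes <= 0 or len(samples) < minutes:
        return False  # not enough history to call the breach sustained
    return all(s.error_rate > threshold for s in samples[-minutes:])


def decide(
    samples: Sequence[Sample],
    threshold: float,            # the "X" in the criterion
    minutes: int,                # the "Y" in the criterion
    replication_lag_s: float,    # assumed proxy for the async write-loss window
    rpo_s: float,                # agreed recovery point objective, in seconds
    fix_forward_in_sight: bool,  # operator judgment, supplied by the on-call engineer
) -> Decision:
    """Apply the criteria in a fixed order so the 3am answer is unambiguous."""
    if not sustained_breach(samples, threshold, minutes):
        return Decision.HOLD
    if fix_forward_in_sight:
        return Decision.HOLD
    if replication_lag_s > rpo_s:
        # The cure may be worse than the disease: promoting now risks losing
        # more writes than the RPO allows.
        return Decision.ESCALATE
    return Decision.FAIL_OVER
```

For instance, ten consecutive one-minute samples at a 30% error rate against a 5% threshold and a 10-minute window would return `Decision.FAIL_OVER` when lag is within RPO, and `Decision.ESCALATE` when lag exceeds it. The real thresholds, and whether lag is an adequate proxy for lost writes in a given data architecture, are exactly the assumptions the prompt asks to have validated in a gameday.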