
Multi-Region Failover Decision Playbook Prompt

Build a pre-decided playbook for whether and when to fail traffic to another region during an incident — including the cutover steps, the data-consistency traps, and the criteria for failing back.

Target user: Platform and SRE teams operating active-passive or active-active multi-region services
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior reliability architect who has run real cross-region failovers and knows that the riskiest moment is the decision to cut over, not the cutover itself. Help me write a failover decision playbook that an on-call engineer can execute at 3am without paging an architect.

I will provide:
- Topology (active-passive / active-active), regions, and the traffic-steering layer (DNS, anycast, global LB)
- Data architecture (primary-replica DB, replication lag SLA, async vs sync, quorum)
- Current RTO/RPO targets and what's been validated vs assumed
- Known failover hazards from past gamedays or incidents

Do this:

1. **Decision criteria** — Define the explicit, measurable conditions under which failover is the right call (e.g., primary-region error rate sustained above X for Y minutes with no fix-forward in sight). Equally, define when NOT to fail over because the cure would be worse than the outage.

2. **Data-loss reckoning** — State the RPO implications honestly. If replication is async, quantify the window of writes that may be lost or need reconciliation. Identify split-brain risk and how the playbook prevents two regions accepting writes.

3. **Cutover sequence** — Numbered, copy-pasteable steps: quiesce or fence the bad region, promote the replica, flip traffic steering, and verify health. State explicitly the ordering that avoids dual-primary.

4. **Verification gates** — After each major step, give the exact check that must pass before proceeding and the rollback action to take if it fails.

5. **Fail-back plan** — Most teams forget this. Define when and how to return to primary, including data reconciliation and re-replication, and why fail-back is often riskier than failover.

6. **Dry-run hooks** — Mark which steps can be rehearsed in a gameday without customer impact.

Output: (a) a decision flowchart in text, (b) the numbered cutover runbook with per-step verification and rollback, (c) the fail-back runbook, (d) a list of assumptions that must be validated in a gameday before this is trusted.

Flag every step where async replication or DNS TTL could surprise the operator.
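To make the "sustained above X for Y minutes" criterion and the async-replication data-loss window concrete, here is a minimal Python sketch of what a pre-decided failover gate might look like. It is an illustration only: the threshold values, the outcome labels, and the names `FailoverThresholds`, `sustained_breach`, `writes_at_risk`, and `decide` are all hypothetical, and the write-loss figure is a rough estimate that assumes a steady write rate.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FailoverThresholds:
    # Hypothetical placeholder values; derive real ones from your own SLOs.
    error_rate: float = 0.05       # "X": fraction of failed requests
    sustained_minutes: int = 10    # "Y": consecutive one-minute samples above X
    max_lag_seconds: float = 30.0  # replication lag the RPO can tolerate


def sustained_breach(per_minute_error_rates: Sequence[float],
                     t: FailoverThresholds) -> bool:
    """True if the most recent Y one-minute samples all exceed X."""
    window = per_minute_error_rates[-t.sustained_minutes:]
    return (len(window) == t.sustained_minutes
            and all(rate > t.error_rate for rate in window))


def writes_at_risk(lag_seconds: float, writes_per_second: float) -> float:
    """Rough count of writes not yet on the replica (assumes a steady write rate)."""
    return lag_seconds * writes_per_second


def decide(per_minute_error_rates: Sequence[float],
           lag_seconds: float,
           fix_forward_in_sight: bool,
           t: FailoverThresholds = FailoverThresholds()) -> str:
    """Return one of the hypothetical outcomes: HOLD, HOLD_RPO_RISK, FAIL_OVER."""
    if not sustained_breach(per_minute_error_rates, t):
        return "HOLD"            # not bad enough, or not for long enough
    if fix_forward_in_sight:
        return "HOLD"            # a fix in the primary region is cheaper
    if lag_seconds > t.max_lag_seconds:
        return "HOLD_RPO_RISK"   # failing over now would exceed the RPO
    return "FAIL_OVER"


if __name__ == "__main__":
    thresholds = FailoverThresholds(error_rate=0.05, sustained_minutes=3)
    print(decide([0.01, 0.20, 0.30, 0.40], lag_seconds=5.0,
                 fix_forward_in_sight=False, t=thresholds))
    print(writes_at_risk(lag_seconds=12.0, writes_per_second=250.0))
```

Encoding the "do not fail over" branches as explicit return values mirrors the prompt's requirement that the playbook define when the cure is worse, so the on-call engineer reads a decision rather than deriving one.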
