Database Failover and Replication-Lag Decision Prompt

Decide during a live database incident whether to promote a replica, wait for the primary to recover, or hold — weighing replication lag, data-loss risk, and split-brain before you pull the trigger.

Target user

On-call engineers and incident commanders facing a degraded or unreachable primary database

Difficulty

Advanced

Tools

Claude, ChatGPT

You are a seasoned SRE and database on-call who knows that a failover done in a panic can cause more damage than the outage it's meant to fix, through data loss from replication lag or split-brain if the old primary comes back. I will paste the situation: the symptom (primary slow, unreachable, or read-only), current replication lag per replica, whether the primary is reachable at all, the topology (sync vs async replicas, quorum/automatic failover config), the write rate, and any recent change.

Your job:

1. **State the failure mode**: classify whether the primary is degraded-but-alive, partitioned (reachable by some, not others), or genuinely down, since the right move differs sharply for each.
2. **Quantify data-loss exposure**: using the replication lag and write rate, estimate how many writes (and roughly which kind of data) would be lost if we promote the most-current replica right now. Be explicit that promotion with lag means accepting that loss.
3. **Split-brain risk**: assess whether the old primary could rejoin and accept writes after promotion, and what fencing/STONITH or config change is required to prevent two primaries. If you can't confirm fencing is in place, say promotion is unsafe.
4. **Option tree**: lay out the realistic choices: (a) wait for primary recovery, (b) promote the most-current replica now, (c) fail over to a synchronous replica with no loss, (d) go read-only and shed writes to buy time. For each: the recovery-time vs. data-loss tradeoff and the rollback.
5. **Recommendation with confidence**: name the option you'd lean toward and the single piece of evidence that would change your mind. Mark confidence.
6. **Execution checklist**: for the recommended path, the ordered steps including fencing the old primary, repointing the application/connection string, and the post-promotion data-integrity check.
7. **Recovery signal**: what proves writes are flowing to the new primary and replicas are re-syncing, and what to watch for the old primary trying to rejoin.

Output as: (a) failure mode and data-loss estimate, (b) split-brain risk verdict, (c) the option tree with tradeoffs, (d) your recommendation and the execution checklist.

Propose; the incident commander decides. Promotion is irreversible for any lagged writes, so never present it as low-risk. If fencing status is unknown, default your recommendation toward the safe option and flag the unknown.

Why this prompt works

Database failover is one of the highest-stakes decisions an on-call engineer makes, because two of the failure modes — data loss from replication lag and split-brain corruption — are irreversible and far worse than the outage that prompted the decision. The instinct under pressure is to “just promote a replica and move on,” and that instinct is exactly what corrupts data. This prompt slows the decision down to the three questions that actually matter: how alive is the primary, how much data would promotion lose, and can the old primary come back to fight the new one.

The quantified data-loss step is the heart of it. By forcing an estimate of unreplicated writes from lag and write rate, the prompt makes the tradeoff concrete instead of abstract, so the commander is choosing “lose roughly N seconds of writes” rather than blindly pulling a lever. The split-brain check is equally deliberate: if fencing can’t be confirmed, the AI is instructed to call promotion unsafe rather than wave it through.
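As a rough illustration of that arithmetic, the sketch below multiplies the lag of the most-current replica by the write rate. It assumes lag is reported in seconds and that the write rate stays roughly constant over the lag window; many databases report lag as a log position or byte offset instead, so treat the result as an order-of-magnitude estimate. The `Replica` class, `estimate_lost_writes` function, and the sample numbers are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Replica:
    name: str
    lag_seconds: float  # assumed: replication lag expressed in seconds


def estimate_lost_writes(
    replicas: List[Replica], writes_per_second: float
) -> Tuple[Replica, float]:
    """Pick the most-current replica and estimate writes lost if it is promoted now.

    Assumes a constant write rate over the lag window.
    """
    best = min(replicas, key=lambda r: r.lag_seconds)
    return best, best.lag_seconds * writes_per_second


# Hypothetical incident numbers, not taken from any real system.
replicas = [Replica("replica-a", 4.0), Replica("replica-b", 12.5)]
best, lost = estimate_lost_writes(replicas, writes_per_second=250)
print(f"Promoting {best.name} now risks losing about {lost:.0f} writes")
```

With these sample values the estimate is about 1,000 writes, which is the kind of concrete figure the commander can weigh against the cost of staying down.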

The guardrails put the irreversible call where it belongs — with the human incident commander. The AI lays out the option tree, quantifies the exposure, and builds the execution checklist, but it never frames an irreversible promotion as low-risk, and it defaults toward safety when fencing status is unknown. That’s the right division of labor for a decision this expensive.
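The safety default can be pictured as a small gate on which options may even be proposed. This is a minimal sketch of the rule as the prompt states it, not a real failover tool: the `Option` enum, the `proposable_options` function, and the flag wording are all hypothetical, and the final decision still rests with the incident commander.

```python
from enum import Enum
from typing import List, Optional, Tuple


class Option(Enum):
    WAIT = "wait for primary recovery"
    PROMOTE_ASYNC = "promote the most-current replica now"
    FAILOVER_SYNC = "fail over to a synchronous replica"
    READ_ONLY = "go read-only and shed writes"


def proposable_options(
    fencing_confirmed: Optional[bool], has_sync_replica: bool
) -> Tuple[List[Option], List[str]]:
    """Return (options that may be proposed, unknowns to flag to the commander).

    fencing_confirmed is True, False, or None when the status is unknown.
    """
    # Waiting and going read-only never create a second writable primary.
    options = [Option.WAIT, Option.READ_ONLY]
    flags: List[str] = []
    if fencing_confirmed is True:
        if has_sync_replica:
            options.append(Option.FAILOVER_SYNC)
        options.append(Option.PROMOTE_ASYNC)
    elif fencing_confirmed is None:
        flags.append("fencing status unknown: promotion treated as unsafe")
    else:
        flags.append("old primary is not fenced: promotion is unsafe")
    return options, flags


options, flags = proposable_options(fencing_confirmed=None, has_sync_replica=True)
for option in options:
    print("may propose:", option.value)
for flag in flags:
    print("flag:", flag)
```

Treating an unknown fencing status the same as a missing fence is the design choice here: the promotion paths only appear once fencing is positively confirmed.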

Related prompts

Incident Data Integrity Verification After Recovery Prompt

Multi-Region Failover Decision Playbook Prompt
