DevOps AI ToolKit

Galera & Group Replication Cluster Health Prompt

Diagnose a multi-node MySQL cluster that is out of sync, stalling on flow control, or at risk of split-brain

Target user: Database reliability engineers and DBAs running Galera or MySQL Group Replication
Difficulty: Advanced
Tools: Claude, ChatGPT, Cursor

The prompt

You are a senior MySQL DBA who has run Galera and MySQL 8.0 Group Replication clusters in production and has been paged to triage a cluster that is degraded — a node has fallen out of sync, writes are stalling, and the on-call team is worried about split-brain. Work only from the evidence I provide and never recommend a blind restart or `SET GLOBAL wsrep_provider_options` change without first explaining the blast radius.

I will provide:
- The cluster topology and replication flavor: [DESCRIBE Galera vs Group Replication, node count, single-primary vs multi-primary]
- `SHOW STATUS LIKE 'wsrep%';` from each reachable node (Galera): [PASTE]
- `SELECT * FROM performance_schema.replication_group_members;` and `SELECT * FROM performance_schema.replication_group_member_stats;` (Group Replication): [PASTE]
- Recent error log excerpts mentioning flow control, certification, or state transfer (SST/IST): [PASTE]
- The symptom and timeline: [DESCRIBE what changed, when writes started stalling, any recent network or DDL events]

Walk through the diagnosis in order:
1. **Establish quorum and membership.** From `wsrep_cluster_size`, `wsrep_cluster_status` (Primary/Non-Primary/Disconnected), and the `replication_group_members` MEMBER_STATE column (ONLINE/RECOVERING/OFFLINE/ERROR/UNREACHABLE), determine which nodes form the primary component and whether any node is partitioned, since a partitioned node is where the split-brain risk lies.
2. **Read each node's local state.** Interpret `wsrep_local_state_comment` (Synced, Donor/Desynced, Joining, Joined) or the GR MEMBER_ROLE and MEMBER_STATE columns to identify the out-of-sync node and whether it is mid-transfer.
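3. **Quantify flow control.** Use `wsrep_flow_control_paused` (fraction of time replication was paused), `wsrep_flow_control_sent` (pause requests this node emitted), and the GR `COUNT_TRANSACTIONS_IN_QUEUE` (certification queue) and `COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE` (applier queue) to decide whether a slow applier is throttling the whole cluster, and which node is the bottleneck.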
4. **Inspect certification and conflicts.** Correlate `wsrep_local_cert_failures` and `wsrep_local_bf_aborts` (or GR `COUNT_CONFLICTS_DETECTED` and the certification queue) with the workload to spot hot-row or multi-primary write conflicts.
5. **Recommend the recovery path.** Choose between letting IST/SST complete, gracefully removing the lagging node, or bootstrapping — and state the exact order of operations and which node is safe to act on first.

Output: a prioritized findings table (node, state, role in the problem, risk), a root-cause statement, and a numbered recovery runbook with the specific commands per node and a rollback note for each step.

Guardrails: validate every recovery step against a replica or staging cluster of the same flavor and version before touching production, take a fresh backup or confirm a recent verified backup exists before any bootstrap or forced membership change, and never force a primary component (`pc.bootstrap`) on more than one node.

Why this prompt works

Cluster health incidents on Galera and Group Replication are dangerous precisely because the wrong instinct — restart the slow node, or bootstrap whatever is in front of you — can turn a recoverable degradation into permanent data divergence. This prompt forces the model into the same disciplined order a seasoned DBA uses: establish quorum and membership before touching anything, identify which node is actually the bottleneck, and only then propose a recovery path. By demanding the real status output from every reachable node rather than a single snapshot, it prevents the model from reasoning about a multi-node system as if it were one server.
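As a minimal sketch of gathering one status snapshot per node before handing anything to the model, the following assumes each node's `SHOW STATUS LIKE 'wsrep%';` output was copied from the `mysql` client, either as a bordered table or as tab-separated batch output. The function name `parse_wsrep_status`, the node labels, and the sample values are hypothetical.

```python
def parse_wsrep_status(text: str) -> dict[str, str]:
    """Turn pasted `SHOW STATUS LIKE 'wsrep%'` output into a name -> value dict.

    Assumes mysql client output: a bordered table or tab-separated batch mode.
    """
    status: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("+"):
            continue  # blank line or table border
        if line.startswith("|"):
            cells = line.strip("|").split("|", 1)
        else:
            cells = line.split("\t", 1)
        if len(cells) != 2:
            continue  # not a name/value row
        name, value = cells[0].strip(), cells[1].strip()
        if name == "Variable_name":
            continue  # header row
        status[name] = value
    return status


# Hypothetical pasted output, one entry per reachable node.
pasted = {
    "node-a": """
+---------------------------+--------------+
| Variable_name             | Value        |
+---------------------------+--------------+
| wsrep_cluster_size        | 2            |
| wsrep_cluster_status      | Primary      |
| wsrep_local_state_comment | Synced       |
+---------------------------+--------------+
""",
    "node-b": "wsrep_cluster_size\t1\nwsrep_cluster_status\tNon-Primary\n",
}

statuses = {node: parse_wsrep_status(text) for node, text in pasted.items()}
for node, status in statuses.items():
    print(node, status.get("wsrep_cluster_status"), status.get("wsrep_cluster_size"))
```

Keeping the snapshots keyed by node makes it harder to mix up which server reported which value when the outputs are pasted into the prompt.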

The numbered steps map directly to the diagnostic surfaces that matter. `wsrep_cluster_status` and `replication_group_members.MEMBER_STATE` reveal split-brain risk; `wsrep_local_state_comment` and the GR member role distinguish a node that is mid-transfer from one that is genuinely stuck; and the flow-control counters separate “one slow applier is throttling everyone” from “the network partitioned.” Tying certification failures and brute-force aborts back to the workload is what catches hot-row contention in multi-primary setups, which is otherwise invisible in lag metrics.
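The first three diagnostic steps can be roughly sketched for Galera as a function over per-node status dictionaries. This is an illustrative heuristic, not a definitive health check: the function name `triage`, the `paused_threshold` value of 0.1, and the sample data are assumptions, and only status variables named in the prompt are read.

```python
def triage(statuses: dict[str, dict[str, str]], paused_threshold: float = 0.1) -> dict:
    """Summarise membership, local state, and flow control across Galera nodes.

    `statuses` maps a node label to its wsrep status variables (all strings).
    `paused_threshold` is an arbitrary illustrative cut-off, not a Galera default.
    """
    # Step 1: which nodes believe they are in the primary component?
    primary = [n for n, s in statuses.items()
               if s.get("wsrep_cluster_status") == "Primary"]
    non_primary = [n for n in statuses if n not in primary]
    # Nodes in one primary component should agree on the cluster size.
    sizes = {statuses[n].get("wsrep_cluster_size") for n in primary}

    # Step 2: any node whose local state is not Synced.
    not_synced = {n: s.get("wsrep_local_state_comment", "unknown")
                  for n, s in statuses.items()
                  if s.get("wsrep_local_state_comment") != "Synced"}

    # Step 3: the node emitting the most flow-control pauses is the likely slow applier.
    sent = {n: int(s.get("wsrep_flow_control_sent", 0)) for n, s in statuses.items()}
    bottleneck = max(sent, key=sent.get) if sent and max(sent.values()) > 0 else None
    throttled = [n for n, s in statuses.items()
                 if float(s.get("wsrep_flow_control_paused", 0)) > paused_threshold]

    return {
        "primary": primary,
        "non_primary": non_primary,
        "size_mismatch": len(sizes) > 1,
        "not_synced": not_synced,
        "bottleneck": bottleneck,
        "throttled": throttled,
    }


# Hypothetical three-node cluster: node-c is partitioned, node-b is applying slowly.
example = {
    "node-a": {"wsrep_cluster_status": "Primary", "wsrep_cluster_size": "2",
               "wsrep_local_state_comment": "Synced",
               "wsrep_flow_control_paused": "0.40", "wsrep_flow_control_sent": "0"},
    "node-b": {"wsrep_cluster_status": "Primary", "wsrep_cluster_size": "2",
               "wsrep_local_state_comment": "Synced",
               "wsrep_flow_control_paused": "0.35", "wsrep_flow_control_sent": "120"},
    "node-c": {"wsrep_cluster_status": "Non-Primary", "wsrep_cluster_size": "1",
               "wsrep_local_state_comment": "Initialized",
               "wsrep_flow_control_paused": "0", "wsrep_flow_control_sent": "0"},
}

report = triage(example)
print(report)
```

A summary like this only narrows down where to look; the error logs and timeline the prompt asks for are still needed to confirm a root cause.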

Finally, the guardrails reflect how these systems actually hurt operators in practice. A full SST saturates I/O on the donor, a forced primary component on two nodes is the textbook split-brain, and a bootstrap without a verified backup leaves no way back. Requiring staging validation and a per-step rollback note keeps the output usable under pressure, rather than a list of commands someone pastes blindly into a production primary at 3 a.m.
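One way to make the guardrails mechanical is to check a proposed runbook against them before anyone runs it. The sketch below is an assumption-laden illustration: `RecoveryStep`, `guardrail_violations`, and the action labels (`"bootstrap"`, `"force_membership"`, and so on) are hypothetical names, not part of any MySQL or Galera tooling.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStep:
    node: str
    action: str         # hypothetical labels, e.g. "wait_for_ist", "remove_node", "bootstrap"
    rollback: str = ""  # rollback note for this step


def guardrail_violations(steps: list[RecoveryStep],
                         backup_verified: bool,
                         validated_on_staging: bool) -> list[str]:
    """Return human-readable guardrail violations for a proposed runbook."""
    violations = []

    # Never force a primary component on more than one node.
    bootstrap_nodes = {s.node for s in steps if s.action == "bootstrap"}
    if len(bootstrap_nodes) > 1:
        violations.append("bootstrap planned on more than one node: "
                          + ", ".join(sorted(bootstrap_nodes)))

    # A verified backup must exist before any bootstrap or forced membership change.
    risky = any(s.action in {"bootstrap", "force_membership"} for s in steps)
    if risky and not backup_verified:
        violations.append("no verified backup before bootstrap or forced membership change")

    # Every step should have been rehearsed on a cluster of the same flavor and version.
    if not validated_on_staging:
        violations.append("runbook not validated on a staging cluster")

    # Every step needs its own rollback note.
    for i, s in enumerate(steps, start=1):
        if not s.rollback.strip():
            violations.append(f"step {i} ({s.action} on {s.node}) has no rollback note")

    return violations


# Hypothetical unsafe plan: two bootstraps, no backup, one missing rollback note.
unsafe_plan = [
    RecoveryStep("node-a", "bootstrap", rollback="stop node-a and restore from backup"),
    RecoveryStep("node-b", "bootstrap"),
]
problems = guardrail_violations(unsafe_plan, backup_verified=False, validated_on_staging=True)
for p in problems:
    print("VIOLATION:", p)
```

An empty result does not prove the runbook is safe; it only shows that none of the listed guardrails was obviously skipped.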
