You are a senior OpenStack platform engineer / DBA who has operated Galera clusters at the heart of OpenStack deployments. You know quorum dynamics, SST/IST, and how to recover from a downed cluster. I will provide: - The symptom (cluster down, split-brain, single node alive, slow sync, schema migration risk) - `wsrep_local_state_comment` from each node (`SHOW STATUS LIKE 'wsrep%'`) - MariaDB logs - Cluster size (3 / 5 / 7 nodes typical) Your job: 1. **Understand Galera state machine**: - **Synced** — operational, joined - **Donor / Joiner** — during SST/IST - **Joined** — caught up - **Initializing** — starting - **Disconnected** — lost cluster 2. **For quorum loss**: - Need majority of nodes (e.g., 2 of 3) - Lost majority → no writes; reads possible from `wsrep_provider_options='pc.weight=...'` configured nodes - Recovery: `SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'` on most-up-to-date OR start with `--wsrep-new-cluster` 3. **For cluster completely down**: - Identify highest seqno node (`grub-cnf-wsrep.dat` `safe_to_bootstrap=1`) - Bootstrap from that node - Start others in sequence 4. **For split-brain**: - Two sub-clusters with different states - Pick "winner" cluster - Stop loser, re-join via SST 5. **For SST/IST issues**: - SST = full snapshot transfer (slow) - IST = incremental, faster - Failures: disk space, network bandwidth, authentication 6. **For OpenStack-specific impact**: - Nova / Neutron / Cinder all use the same Galera typically - DB unavailable = entire OpenStack non-responsive - Read-from-secondary mode helps degraded operation 7. **For schema migrations**: - Rolling Schema Upgrade (RSU) — `wsrep_OSU_method=RSU` per session - Total Order Isolation (TOI) — default; blocks cluster during DDL - For OpenStack upgrades, large schema changes use RSU 8. **For HAProxy / proxysql in front**: - Single-master mode often used (only write to one node) - Failover delay matters Mark DESTRUCTIVE: bootstrapping wrong node (loses recent writes), schema migrations during peak (TOI blocks cluster), removing `grastate.dat` carelessly (loses safe state). --- Symptom: [DESCRIBE] Cluster state: [DESCRIBE — N nodes, current state] `wsrep` status from each node: ``` [PASTE] ``` MariaDB logs: ``` [PASTE] ```

Why this prompt works

Galera is the backbone of OpenStack DB; its failures are cluster-wide. This prompt walks the recovery operations.

How to use it

Identify cluster state across all nodes first.
For bootstrap, pick most-recent.
For schema, choose method carefully.
Test recovery in non-prod before need.

Useful commands

# Cluster state
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep%'" | grep -E "cluster_state|cluster_size|local_state|ready"

# Per-node seqno
sudo cat /var/lib/mysql/grastate.dat

# Status
sudo systemctl status mariadb

# Logs
sudo journalctl -u mariadb -n 200 --no-pager
sudo tail -100 /var/log/mysql/error.log

# Bootstrap (most-recent node)
sudo galera_new_cluster              # specific systemd target
# OR
sudo systemctl start mariadb@bootstrap

# Standard restart (other nodes after bootstrap)
sudo systemctl start mariadb

# Force-bootstrap (after editing grastate.dat safe_to_bootstrap=1)
sudo nano /var/lib/mysql/grastate.dat   # set safe_to_bootstrap to 1
sudo galera_new_cluster

# Re-join after split-brain
sudo systemctl stop mariadb
# Wait for primary cluster to be confirmed
sudo systemctl start mariadb           # joins via SST/IST

# Schema migration (rolling)
SET GLOBAL wsrep_OSU_method='RSU';
ALTER TABLE ... ;
SET GLOBAL wsrep_OSU_method='TOI';

# Per-node graceful shutdown
sudo systemctl stop mariadb

Bootstrap workflow

1. Stop all nodes
2. Find most-recent: highest seqno (or `--wsrep-recover` shows it)
3. Edit grastate.dat on that node: safe_to_bootstrap=1
4. Bootstrap that node: galera_new_cluster
5. Wait until Synced
6. Start other nodes one at a time; each joins via SST (slow) or IST (fast)
7. Verify wsrep_cluster_size = expected

Common findings this catches

safe_to_bootstrap=0 on all nodes → use --wsrep-recover to find seqno; manually set on highest.
SST fails: disk full on donor → free space on donor.
Split-brain: two clusters formed → choose winner; stop loser; re-join.
TOI blocking cluster during DDL → use RSU for OpenStack large migrations.
arbitrator (garbd) needed for 2-node setups to avoid auto-shutdown on split.
Single-master HAProxy with slow failover → tune health-check interval.
wsrep_cluster_address missing node → reconfigure; restart.

When to escalate

Data integrity concerns post-split — DBA verify.
Performance issues from Galera write conflicts — engage upstream / DBA.
Major schema migrations for OpenStack upgrade — coordinate with all teams.

Galera/MariaDB Recovery for OpenStack Prompt

Why this prompt works

How to use it

Useful commands

Bootstrap workflow

Common findings this catches

When to escalate

Related prompts

Linux Disk Full / Inode Exhaustion Diagnosis Prompt

OpenStack Upgrade Pre-Flight Review Prompt

OpenStack VM Troubleshooting Prompt

Why this prompt works

How to use it

Useful commands

Bootstrap workflow

Common findings this catches

When to escalate

Related prompts

Linux Disk Full / Inode Exhaustion Diagnosis Prompt

OpenStack Upgrade Pre-Flight Review Prompt

OpenStack VM Troubleshooting Prompt

Free: the DevOps AI Incident-Triage Cheat Sheet