Galera/MariaDB Recovery for OpenStack Prompt
Recover Galera/MariaDB cluster used by OpenStack — split-brain, single-node bootstrap, WSREP issues, schema changes on busy cluster.
- Target user: OpenStack platform engineers managing the database
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior OpenStack platform engineer / DBA who has operated Galera clusters at the heart of OpenStack deployments. You know quorum dynamics, SST/IST, and how to recover from a downed cluster.

I will provide:
- The symptom (cluster down, split-brain, single node alive, slow sync, schema migration risk)
- `wsrep_local_state_comment` from each node (`SHOW STATUS LIKE 'wsrep%'`)
- MariaDB logs
- Cluster size (3 / 5 / 7 nodes typical)

Your job:

1. **Understand Galera state machine**:
   - **Synced**: operational, fully caught up with the cluster
   - **Donor / Joiner**: during SST/IST
   - **Joined**: state transfer complete, applying the remaining backlog before Synced
   - **Initializing**: starting
   - **Disconnected**: lost contact with the cluster
2. **For quorum loss**:
   - Need majority of nodes (e.g., 2 of 3)
   - Lost majority → the remaining nodes go non-Primary and reject writes; reads are rejected too unless `wsrep_dirty_reads=ON` is set (`pc.weight` in `wsrep_provider_options` only changes how votes are counted)
   - Recovery: `SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'` on the most up-to-date running node, OR start a stopped node with `--wsrep-new-cluster`
3. **For cluster completely down**:
   - Identify the highest-seqno node (`grastate.dat`; `safe_to_bootstrap: 1` marks the node that shut down last)
   - Bootstrap from that node
   - Start others in sequence
4. **For split-brain**:
   - Two sub-clusters with different states
   - Pick "winner" cluster
   - Stop loser, re-join via SST
5. **For SST/IST issues**:
   - SST = full snapshot transfer (slow)
   - IST = incremental, faster
   - Failures: disk space, network bandwidth, authentication
6. **For OpenStack-specific impact**:
   - Nova / Neutron / Cinder all use the same Galera typically
   - DB unavailable = entire OpenStack non-responsive
   - Read-from-secondary mode helps degraded operation
7. **For schema migrations**:
   - Rolling Schema Upgrade (RSU): `wsrep_OSU_method=RSU` per session; the DDL applies only to the local node, so repeat it on each node
   - Total Order Isolation (TOI): default; blocks cluster during DDL
   - For OpenStack upgrades, large schema changes use RSU
8. **For HAProxy / ProxySQL in front**:
   - Single-master mode often used (only write to one node)
   - Failover delay matters

Mark DESTRUCTIVE: bootstrapping wrong node (loses recent writes), schema migrations during peak (TOI blocks cluster), removing `grastate.dat` carelessly (loses safe state).

---

Symptom: [DESCRIBE]

Cluster state: [DESCRIBE: N nodes, current state]

`wsrep` status from each node:
```
[PASTE]
```

MariaDB logs:
```
[PASTE]
```
Why this prompt works
Galera is the backbone of the OpenStack databases, so its failures affect every service at once. This prompt walks through the recovery operations for each failure mode.
How to use it
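- Collect the wsrep status from every node before changing anything.
- For bootstrap, pick the node with the highest committed seqno.
- For schema changes, choose between TOI and RSU deliberately.
- Rehearse recovery in a non-production cluster before you need it.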
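To compare nodes side by side, it can help to reduce each node's status dump to one line. The following is a minimal sketch, not part of any Galera tooling: `wsrep_summary` is a hypothetical helper name, and it assumes the input is the tab-separated output of `mysql -N -B -e "SHOW STATUS LIKE 'wsrep%'"`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: condense one node's wsrep status into a single line.
# $1    = label for the node (any string you choose)
# stdin = tab-separated "name<TAB>value" rows from SHOW STATUS LIKE 'wsrep%'
wsrep_summary() {
  local node="$1"
  awk -F'\t' -v node="$node" '
    $1 == "wsrep_cluster_status"      { status = $2 }
    $1 == "wsrep_cluster_size"        { size   = $2 }
    $1 == "wsrep_local_state_comment" { state  = $2 }
    $1 == "wsrep_ready"               { ready  = $2 }
    END {
      printf "%s status=%s size=%s state=%s ready=%s\n",
             node, status, size, state, ready
    }
  '
}
```

A possible invocation on each node would be `mysql -N -B -e "SHOW STATUS LIKE 'wsrep%'" | wsrep_summary "$(hostname)"`; nodes whose lines disagree on `status` or `size` are the ones to investigate first.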
Useful commands
# Cluster state
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep%'" | grep -E "cluster_status|cluster_size|local_state|ready"
# Per-node seqno
sudo cat /var/lib/mysql/grastate.dat
# Status
sudo systemctl status mariadb
# Logs
sudo journalctl -u mariadb -n 200 --no-pager
sudo tail -100 /var/log/mysql/error.log
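# Bootstrap (only on the node with the highest seqno)
sudo galera_new_cluster # wrapper that starts mariadb with --wsrep-new-cluster
# OR, only where the package still ships a bootstrap unit
sudo systemctl start mariadb@bootstrap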
# Standard restart (other nodes after bootstrap)
sudo systemctl start mariadb
# Force-bootstrap (only after confirming this node has the highest seqno)
sudo nano /var/lib/mysql/grastate.dat # change the line to "safe_to_bootstrap: 1"
sudo galera_new_cluster
# Re-join after split-brain
sudo systemctl stop mariadb
# Wait for primary cluster to be confirmed
sudo systemctl start mariadb # joins via SST/IST
# Schema migration (rolling; RSU applies to the local node only, so repeat in a mysql session on each node)
SET SESSION wsrep_OSU_method='RSU';
ALTER TABLE ... ;
SET SESSION wsrep_OSU_method='TOI';
# Per-node graceful shutdown
sudo systemctl stop mariadb
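Hand-editing `grastate.dat` is easy to get wrong under pressure. As a hedged alternative, the sketch below keeps a backup and changes only the `safe_to_bootstrap` line. `mark_safe_to_bootstrap` is a hypothetical helper name, the default path is the one used in the commands above, and GNU `sed` is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set safe_to_bootstrap to 1, keeping a .bak copy.
# Run only with MariaDB stopped, and only on the node with the highest seqno.
# $1 = path to grastate.dat (defaults to the usual datadir location)
mark_safe_to_bootstrap() {
  local file="${1:-/var/lib/mysql/grastate.dat}"
  # Keep the original so the change can be reverted.
  cp -p "$file" "$file.bak" || return 1
  # Rewrite only the safe_to_bootstrap line; other lines stay untouched.
  sed -i 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "$file" || return 1
  # Print the resulting line so the operator can confirm it.
  grep '^safe_to_bootstrap:' "$file"
}
```

On a real node this would typically be run as root, for example `sudo bash -c '. ./helpers.sh; mark_safe_to_bootstrap'`, where `helpers.sh` is an assumed file name for wherever the function is saved.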
Bootstrap workflow
1. Stop all nodes
2. Find the most recent node: highest seqno in grastate.dat (if seqno is -1 after a crash, `--wsrep-recover` reports the recovered position)
3. Edit grastate.dat on that node: set `safe_to_bootstrap: 1`
4. Bootstrap that node: galera_new_cluster
5. Wait until Synced
6. Start other nodes one at a time; each joins via SST (slow) or IST (fast)
7. Verify wsrep_cluster_size = expected
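Step 2 of the workflow can be sketched with a few small text filters. This is an illustrative sketch with hypothetical function names; it assumes the standard `seqno:` line in `grastate.dat` and the `WSREP: Recovered position: <uuid>:<seqno>` log line that `--wsrep-recover` writes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for choosing the bootstrap node.

# Print the seqno stored in a grastate.dat file (-1 means an unclean shutdown).
grastate_seqno() {
  awk '$1 == "seqno:" { print $2 }' "$1"
}

# Read log text on stdin; print the seqno from the last
# "Recovered position: <uuid>:<seqno>" line, ignoring anything after the seqno.
recovered_seqno() {
  sed -n 's/.*Recovered position: *[0-9a-fA-F-]*:\(-\{0,1\}[0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Read "node seqno" pairs on stdin; print the node with the highest seqno.
pick_bootstrap_node() {
  sort -k2,2n | tail -n 1 | awk '{ print $1 }'
}
```

For example, after collecting one `node seqno` pair per controller into a file, `pick_bootstrap_node < seqnos.txt` names the candidate. If two nodes report the same highest seqno, either one is an equally valid choice.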
Common findings this catches
- `safe_to_bootstrap: 0` on all nodes → use `--wsrep-recover` to find seqno; manually set to 1 on the highest.
- SST fails: disk full on donor → free space on donor.
- Split-brain: two clusters formed → choose winner; stop loser; re-join.
- TOI blocking cluster during DDL → use RSU for OpenStack large migrations.
- Arbitrator (garbd) needed for 2-node setups so the surviving node keeps quorum when the other node or the link fails.
- Single-master HAProxy with slow failover → tune health-check interval.
- `wsrep_cluster_address` missing a node → reconfigure; restart.
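The quorum arithmetic behind the split-brain and garbd findings can be checked with a one-line test. This is a simplified sketch: `has_quorum` is a hypothetical helper, and it assumes every voter has the default weight of 1 and that the count is taken against the last known Primary component.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the reachable voters form a strict majority.
# $1 = voters this partition can still reach (including itself)
# $2 = total voters in the last Primary component (data nodes plus garbd)
has_quorum() {
  local reachable="$1" voters="$2"
  # Strict majority: exactly half is NOT enough.
  [ "$((reachable * 2))" -gt "$voters" ]
}

# Example: a 2-node cluster that loses one node.
if has_quorum 1 2; then echo "2 nodes, 1 left: quorum"; else echo "2 nodes, 1 left: no quorum"; fi
# Same cluster with garbd as a third voter: the survivor plus garbd is 2 of 3.
if has_quorum 2 3; then echo "2 nodes + garbd, 1 node lost: quorum"; else echo "2 nodes + garbd, 1 node lost: no quorum"; fi
```

The same check shows why even-sized clusters split evenly lose on both sides: `has_quorum 2 4` fails for each half.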
When to escalate
- Data integrity concerns after a split-brain: have a DBA verify the data.
- Performance issues from Galera write conflicts: engage upstream or a DBA.
- Major schema migrations for an OpenStack upgrade: coordinate with all affected teams.
Related prompts
- Linux Disk Full / Inode Exhaustion Diagnosis Prompt: Diagnose why a Linux filesystem is full or out of inodes, including deleted-but-held files, journal bloat, reserved blocks, and hidden mount-shadowed data.
- OpenStack Upgrade Pre-Flight Review Prompt: Pre-upgrade safety review of an OpenStack cluster moving release N → N+1, covering config drift, deprecated options, DB migrations, breaking changes, and service ordering.
- OpenStack VM Troubleshooting Prompt: Diagnose Nova VM boot failures, networking issues, and stuck instances using nova/openstack CLI output.