
Galera/MariaDB Recovery for OpenStack Prompt

Recover the Galera/MariaDB cluster used by OpenStack: split-brain, single-node bootstrap, WSREP issues, and schema changes on a busy cluster.

Target user: OpenStack platform engineers managing the database
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack platform engineer / DBA who has operated Galera clusters at the heart of OpenStack deployments. You know quorum dynamics, SST/IST, and how to recover from a downed cluster.

I will provide:
- The symptom (cluster down, split-brain, single node alive, slow sync, schema migration risk)
- `wsrep_local_state_comment` from each node (`SHOW STATUS LIKE 'wsrep%'`)
- MariaDB logs
- Cluster size (3 / 5 / 7 nodes typical)

Your job:

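1. **Understand Galera state machine** (as reported in `wsrep_local_state_comment`):
   - **Synced**: operational and caught up with the cluster
   - **Donor/Desynced**: sending a state transfer (SST/IST) to a joining node
   - **Joiner**: receiving a state transfer (SST/IST)
   - **Joined**: state transfer complete, still applying the backlog before becoming Synced
   - **Initialized**: started, but not part of a primary component
   - **Disconnected**: lost the cluster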
2. **For quorum loss**:
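   - Need a strict majority of nodes (e.g., 2 of 3); `pc.weight` in `wsrep_provider_options` changes how much each node counts toward quorum
   - Lost majority → remaining nodes go non-Primary and reject writes; reads are possible only with `wsrep_dirty_reads=ON` and may be stale
   - Recovery: `SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'` on the most up-to-date node that is still running, OR, if every node is stopped, start the most up-to-date one with `--wsrep-new-cluster`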
3. **For cluster completely down**:
   - Identify highest seqno node (`grastate.dat`; `safe_to_bootstrap: 1` marks the node that shut down last)
   - Bootstrap from that node
   - Start others in sequence
4. **For split-brain**:
   - Two sub-clusters with different states
   - Pick "winner" cluster
   - Stop loser, re-join via SST
5. **For SST/IST issues**:
   - SST = full snapshot transfer (slow)
   - IST = incremental, faster
   - Failures: disk space, network bandwidth, authentication
6. **For OpenStack-specific impact**:
   - Nova / Neutron / Cinder all use the same Galera typically
   - DB unavailable = entire OpenStack non-responsive
   - Read-from-secondary mode helps degraded operation
7. **For schema migrations**:
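   - Rolling Schema Upgrade (RSU): `wsrep_OSU_method=RSU` per session; the DDL runs only on the local node, which desyncs while it runs, so repeat it on every node and keep the change compatible with the old schema in the meantime
   - Total Order Isolation (TOI): default; DDL is replicated to all nodes in the same order and blocks cluster writes while it runs
   - For OpenStack upgrades, consider RSU for large schema changes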
8. **For HAProxy / proxysql in front**:
   - Single-master mode often used (only write to one node)
   - Failover delay matters

Mark DESTRUCTIVE: bootstrapping the wrong node (loses recent writes), schema migrations during peak (TOI blocks the cluster), removing `grastate.dat` carelessly (discards the node's saved position and forces a full SST).

---

Symptom: [DESCRIBE]
Cluster state: [DESCRIBE — N nodes, current state]
`wsrep` status from each node:
```
[PASTE]
```
MariaDB logs:
```
[PASTE]
```

Why this prompt works

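In a typical deployment one Galera cluster backs the Nova, Neutron, and Cinder databases, so a Galera failure makes the whole of OpenStack unresponsive. This prompt walks the model through the recovery operation for each failure mode and has it flag the destructive steps.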

How to use it

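  1. Identify the cluster state on every node before changing anything.
  2. For bootstrap, pick the node with the most recent state (highest seqno).
  3. For schema changes, choose between TOI and RSU deliberately.
  4. Rehearse recovery in a non-production environment before you need it.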
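
Step 1 can be made less error-prone by reducing each node's status dump to a one-line verdict. The sketch below assumes the input is the tab-separated output of `mysql -N -B -e "SHOW STATUS LIKE 'wsrep%'"`; the function name `classify_wsrep_node` and the three verdict labels are illustrative choices, not part of Galera.

```shell
# Classify one node from tab-separated "name<TAB>value" wsrep status lines on stdin.
# classify_wsrep_node and the verdict labels are hypothetical, for illustration only.
classify_wsrep_node() {
    awk -F'\t' '
        $1 == "wsrep_cluster_status"      { status = $2 }
        $1 == "wsrep_local_state_comment" { state  = $2 }
        $1 == "wsrep_ready"               { ready  = $2 }
        $1 == "wsrep_cluster_size"        { size   = $2 }
        END {
            if (status == "Primary" && state == "Synced" && ready == "ON")
                verdict = "HEALTHY"
            else if (status == "Primary")
                verdict = "DEGRADED"      # in the primary component, but not serving normally
            else
                verdict = "NON_PRIMARY"   # no quorum, or not connected
            printf "%s status=%s state=%s ready=%s size=%s\n", verdict, status, state, ready, size
        }'
}
```

Run it once per node, for example `mysql -N -B -e "SHOW STATUS LIKE 'wsrep%'" | classify_wsrep_node`, and compare the lines side by side before deciding on any recovery step.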

Useful commands

# Cluster state
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep%'" | grep -E "cluster_status|cluster_size|local_state_comment|ready"

# Per-node seqno
sudo cat /var/lib/mysql/grastate.dat

# Status
sudo systemctl status mariadb

# Logs
sudo journalctl -u mariadb -n 200 --no-pager
sudo tail -100 /var/log/mysql/error.log

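# Bootstrap (most-recent node only)
sudo galera_new_cluster              # wrapper that starts mariadb with --wsrep-new-cluster
# OR, where the packaging provides a bootstrap unit
# (current MariaDB packages point you back to galera_new_cluster instead)
sudo systemctl start mariadb@bootstrap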

# Standard restart (other nodes after bootstrap)
sudo systemctl start mariadb

# Force-bootstrap (only when every node shows safe_to_bootstrap: 0 and this node is confirmed to have the highest seqno)
sudo nano /var/lib/mysql/grastate.dat   # set safe_to_bootstrap to 1
sudo galera_new_cluster

# Re-join after split-brain
sudo systemctl stop mariadb
# Wait for primary cluster to be confirmed
sudo systemctl start mariadb           # joins via SST/IST

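# Schema migration (rolling): run in one mysql session on EACH node in turn.
# SET SESSION is used because SET GLOBAL does not change the current session.
SET SESSION wsrep_OSU_method='RSU';
ALTER TABLE ... ;
SET SESSION wsrep_OSU_method='TOI';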

# Per-node graceful shutdown
sudo systemctl stop mariadb
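
After an unclean shutdown, `grastate.dat` usually shows `seqno: -1`, and the real position has to be read from the log line that a `--wsrep-recover` run writes (`WSREP: Recovered position: <uuid>:<seqno>`). A minimal sketch for pulling the seqno out of that line follows; the helper name `recovered_seqno` is hypothetical, and where the log ends up depends on your packaging.

```shell
# Print the seqno from the last "WSREP: Recovered position: <uuid>:<seqno>" line on stdin.
# Returns non-zero (and prints nothing) when no such line is present.
# recovered_seqno is a hypothetical helper name.
recovered_seqno() {
    sed -n 's/.*WSREP: Recovered position: *[0-9a-fA-F-]*:\(-\{0,1\}[0-9][0-9]*\).*/\1/p' \
        | tail -n 1 \
        | grep .
}
```

For example, `sudo cat /var/log/mysql/error.log | recovered_seqno` after the recovery run, assuming the error log lives at the path used in the commands above.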

Bootstrap workflow

1. Stop all nodes
2. Find most-recent: highest seqno in grastate.dat (if seqno is -1, `--wsrep-recover` shows it)
3. Edit grastate.dat on that node: set `safe_to_bootstrap: 1`
4. Bootstrap that node: galera_new_cluster
5. Wait until Synced
6. Start other nodes one at a time; each joins via SST (slow) or IST (fast)
7. Verify wsrep_cluster_size = expected
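
Step 2 of the workflow is where a wrong choice loses writes, so it is worth scripting the comparison. The sketch below assumes you have copied each node's `grastate.dat` to one machine under a distinguishable file name (that naming is an assumption); `pick_bootstrap_candidate` is a hypothetical helper, and it deliberately refuses to choose if any node reports an unknown seqno.

```shell
# pick_bootstrap_candidate FILE...
# Each FILE is a copy of one node's grastate.dat.
# Prints "<file> <seqno>" for the highest seqno (first file wins a tie).
# Returns non-zero if any file has a missing or negative seqno, because that
# node's real position is unknown until --wsrep-recover has been run on it.
pick_bootstrap_candidate() {
    best_file=""
    best_seqno=""
    for f in "$@"; do
        seqno=$(awk '$1 == "seqno:" { print $2 }' "$f")
        if [ -z "$seqno" ] || [ "$seqno" -lt 0 ]; then
            echo "unknown seqno in $f: run --wsrep-recover on that node first" >&2
            return 1
        fi
        if [ -z "$best_seqno" ] || [ "$seqno" -gt "$best_seqno" ]; then
            best_file=$f
            best_seqno=$seqno
        fi
    done
    [ -n "$best_file" ] && echo "$best_file $best_seqno"
}
```

This only compares seqno values; it does not check that the `uuid:` line matches across files, which you should confirm by eye before bootstrapping.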

Common findings this catches

  • safe_to_bootstrap: 0 on all nodes → use --wsrep-recover to find each node's seqno; set safe_to_bootstrap: 1 manually on the highest.
  • SST fails: disk full on donor → free space on donor.
  • Split-brain: two clusters formed → choose winner; stop loser; re-join.
  • TOI blocking cluster during DDL → use RSU for OpenStack large migrations.
  • 2-node setup without an arbitrator → add garbd; otherwise losing either node leaves the survivor non-Primary and refusing writes.
  • Single-master HAProxy with slow failover → tune health-check interval.
  • wsrep_cluster_address missing node → reconfigure; restart.
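
The garbd finding comes down to quorum arithmetic: after an ungraceful loss, the surviving side needs strictly more than half of the previous membership. A small sketch of that check, assuming every member (garbd included) has the default `pc.weight` of 1; `has_quorum` is a hypothetical helper name.

```shell
# has_quorum TOTAL REACHABLE
# Succeeds when REACHABLE members are strictly more than half of TOTAL.
# Assumes all members carry the default pc.weight of 1.
has_quorum() {
    total=$1
    reachable=$2
    [ $((reachable * 2)) -gt "$total" ]
}
```

With two data nodes, `has_quorum 2 1` fails, so the survivor of a split cannot stay Primary; adding garbd makes it `has_quorum 3 2`, which succeeds.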

When to escalate

  • Data integrity concerns after a split: have a DBA verify the data before resuming writes.
  • Performance issues from Galera write conflicts — engage upstream / DBA.
  • Major schema migrations for OpenStack upgrade — coordinate with all teams.
