RabbitMQ Cluster Partition Recovery Prompt
Recover an OpenStack RabbitMQ cluster after a network partition or node failure — heal split-brain, restore quorum, and resync OpenStack services whose RPC stalled while the broker was unhealthy.
- Target user: Operators restoring the OpenStack message bus after a broker incident
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior OpenStack operator who has rescued RabbitMQ clusters mid-outage and knows that a sick broker silently freezes Nova, Neutron, and Cinder RPC.

I will provide:
- `rabbitmqctl cluster_status`, `list_queues`, and partition warnings from the logs
- The cluster topology (3-node mirrored/quorum queues, HA policy, `cluster_partition_handling` mode)
- OpenStack symptoms: agents showing XXX/down, instances stuck in BUILD, `oslo.messaging` timeouts in service logs
- What triggered it (network blip, node reboot, disk full, memory alarm)
- Whether queues are classic-mirrored or quorum

Your job:
1. **Assess the partition**: read `cluster_status` to see partitioned nodes, alarms (memory/disk), and which node holds the authoritative state; determine `cluster_partition_handling` (autoheal / pause_minority / ignore) and what it did.
2. **Pick the recovery order**: which node to keep as the source of truth, which to stop and rejoin, and how to avoid losing the wrong side of the split. Spell out the stop/start sequence with `rabbitmqctl stop_app` / `start_app` / `forget_cluster_node` when needed.
3. **Queue health**: distinguish classic-mirrored vs quorum queue recovery, find unsynchronised mirrors (`rabbitmqctl list_queues name slave_pids synchronised_slave_pids`), and force-sync or recreate policy as appropriate.
4. **Unstick OpenStack**: after the broker is healthy, restart the right OpenStack services in order so they re-establish RPC consumers; confirm `openstack network agent list` / `compute service list` come back up, not just the broker.
5. **Drain the backlog**: handle the flood of stale messages and reply queues that piled up, and reconcile any instances/ports stuck mid-operation during the outage.
6. **Prevent recurrence**: recommend `pause_minority` or quorum queues, raise memory/disk watermarks correctly, and add monitoring on partitions and queue depth.
Output as: (a) a partition-state assessment, (b) the exact node-by-node recovery command sequence, (c) the OpenStack service restart order with verification, (d) a backlog/stuck-resource reconciliation checklist, (e) a hardening recommendation.

Bias toward: choosing the correct surviving node before restarting anything; preferring quorum queues / pause_minority over autoheal long-term; verifying OpenStack agents recover, not just RabbitMQ.
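
Before pasting `cluster_status` into the prompt, it can help to pre-digest it so the partition picture is obvious at a glance. The following is a minimal sketch, not part of the prompt itself. It assumes `rabbitmqctl cluster_status --formatter json` is available on the broker node and that the JSON contains `disk_nodes`, `ram_nodes`, `running_nodes`, `partitions` (a mapping of node to the peers it is partitioned from), and `alarms`. Those key names and shapes are assumptions that may differ between RabbitMQ versions, so check them against real output first. The function names `load_status`, `has_majority`, and `assess` are hypothetical helpers introduced only for illustration.

```python
import json
import subprocess


def load_status() -> dict:
    """Run rabbitmqctl on a broker node and parse its JSON output."""
    out = subprocess.run(
        ["rabbitmqctl", "cluster_status", "--formatter", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def has_majority(side, members) -> bool:
    """True if `side` holds a strict majority of all cluster members.

    Exactly half does not count, which matches how pause_minority
    treats a node that can see only half of the cluster.
    """
    return 2 * len(set(side)) > len(set(members))


def assess(status: dict) -> dict:
    """Summarise a parsed cluster_status document.

    The key names read here are assumptions about the JSON layout;
    verify them against your RabbitMQ version.
    """
    members = set(status.get("disk_nodes", [])) | set(status.get("ram_nodes", []))
    running = set(status.get("running_nodes", []))
    raw_partitions = status.get("partitions") or {}
    # Keep only nodes that actually report being partitioned from a peer.
    partitions = {n: sorted(p) for n, p in raw_partitions.items() if p}
    alarms = status.get("alarms") or []
    return {
        "members": sorted(members),
        "stopped": sorted(members - running),
        "partitions": partitions,
        "alarms": alarms,
        "running_has_majority": has_majority(running, members),
        "healthy": not partitions and not alarms and running == members,
    }
```

On a broker node, `print(json.dumps(assess(load_status()), indent=2))` would produce a short summary to paste alongside the raw output. The summary only reflects the view of the node it was run on, so running it on every node and comparing the results is what reveals which side of a split each node believes it is on.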