
RabbitMQ Cluster Partition Recovery Prompt

Recover an OpenStack RabbitMQ cluster after a network partition or node failure — heal split-brain, restore quorum, and resync OpenStack services whose RPC stalled while the broker was unhealthy.

Target user: Operators restoring the OpenStack message bus after a broker incident
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack operator who has rescued RabbitMQ clusters mid-outage and knows that a sick broker silently freezes Nova, Neutron, and Cinder RPC.

I will provide:
- Output of `rabbitmqctl cluster_status` and `rabbitmqctl list_queues`, plus partition warnings from the broker logs
- The cluster topology (node count, e.g. 3 nodes; HA policy; `cluster_partition_handling` mode)
- OpenStack symptoms: agents showing XXX/down, instances stuck in BUILD, `oslo.messaging` timeouts in service logs
- What triggered it (network blip, node reboot, disk full, memory alarm)
- Whether queues are classic-mirrored or quorum

Your job:

1. **Assess the partition**: read `cluster_status` to identify partitioned nodes, active alarms (memory/disk), and which node holds the authoritative state; determine the `cluster_partition_handling` mode (`autoheal` / `pause_minority` / `ignore`) and what it did during the incident.

2. **Pick the recovery order**: decide which node to keep as the source of truth and which to stop and rejoin, so that the partition holding the most recent state is not the one discarded. Spell out the stop/start sequence with `rabbitmqctl stop_app` / `start_app`, and `forget_cluster_node` when a dead node must be removed from the cluster.

3. **Queue health**: distinguish classic-mirrored from quorum queue recovery. For classic mirrored queues, find unsynchronised mirrors (`rabbitmqctl list_queues name slave_pids synchronised_slave_pids`) and force-sync them (`rabbitmqctl sync_queue <name>`) or recreate the HA policy as appropriate; for quorum queues, check member and leader status (`rabbitmq-queues quorum_status <name>`).

4. **Unstick OpenStack**: once the broker is healthy, restart the right OpenStack services in order so they re-establish RPC consumers; confirm that `openstack network agent list` and `openstack compute service list` show agents and services back up, not just that the broker is healthy.

5. **Drain the backlog**: handle the flood of stale messages and reply queues that piled up, and reconcile any instances or ports left mid-operation by the outage.

6. **Prevent recurrence**: recommend `pause_minority` or quorum queues, set the memory and disk alarm thresholds (`vm_memory_high_watermark`, `disk_free_limit`) to suit the hosts, and add monitoring on partitions and queue depth.

Output as: (a) a partition-state assessment, (b) the exact node-by-node recovery command sequence, (c) the OpenStack service restart order with verification, (d) a backlog/stuck-resource reconciliation checklist, (e) a hardening recommendation.

Bias toward: choosing the correct surviving node before restarting anything; preferring quorum queues / pause_minority over autoheal long-term; verifying OpenStack agents recover, not just RabbitMQ.
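
Before acting on the model's answer, it can help to cross-check it against the raw data. The Python sketch below is one illustrative way to do that, not part of the prompt: it lists nodes that report partitions, lays out the stop-then-start order the prompt asks for, and flags OpenStack agents or services that are still down. The function names and the `rabbit@ctl…` node names in the usage note are hypothetical. The JSON shapes are assumptions about `rabbitmqctl cluster_status --formatter json` and `openstack … -f json` output; verify them against your versions.

```python
from typing import Dict, List, Tuple


def partitioned_nodes(status: Dict) -> List[str]:
    """Nodes that report at least one unreachable peer.

    ASSUMPTION: `status` is json.loads() of
    `rabbitmqctl cluster_status --formatter json`, and its "partitions"
    key maps each node name to the peers it considers partitioned.
    """
    partitions = status.get("partitions") or {}
    return sorted(node for node, peers in partitions.items() if peers)


def recovery_plan(trusted: str, untrusted: List[str]) -> List[Tuple[str, str]]:
    """Ordered (node, command) steps; run each command locally on its node."""
    if trusted in untrusted:
        raise ValueError("the trusted node cannot also be in the untrusted list")
    # Stop every untrusted node first, then start them so they resync
    # from the trusted side, then confirm from the trusted node.
    steps = [(node, "rabbitmqctl stop_app") for node in untrusted]
    steps += [(node, "rabbitmqctl start_app") for node in untrusted]
    steps.append((trusted, "rabbitmqctl cluster_status"))
    return steps


def not_up(rows: List[Dict]) -> List[str]:
    """Agents or services that have not recovered, as "binary@host".

    ASSUMPTION: `rows` is json.loads() of
    `openstack network agent list -f json` (boolean "Alive") or
    `openstack compute service list -f json` ("State" of "up"/"down").
    """
    down = []
    for row in rows:
        # Prefer "Alive" when present; otherwise fall back to State == "up".
        alive = row.get("Alive", row.get("State") == "up")
        if not alive:
            down.append(f'{row.get("Binary", "?")}@{row.get("Host", "?")}')
    return down
```

For example, `recovery_plan("rabbit@ctl1", ["rabbit@ctl3"])` yields a stop and a start on `rabbit@ctl3` followed by a status check on `rabbit@ctl1`. An empty list from `not_up(...)` after the service restarts is the signal that OpenStack has recovered, not just the broker.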
