Troubleshooting RabbitMQ in OpenStack

Almost every OpenStack service depends on RabbitMQ. Services call each other over REST APIs, but inside each project the components (the scheduler, the conductor, the compute agents, the network agents) coordinate over the message bus via oslo.messaging. That means when RabbitMQ has a bad day, the entire cloud has a bad day, and the symptoms are maddeningly indirect: instances stuck building, volumes stuck attaching, agents reporting down even though their processes are running.

After years of running OpenStack, I’ve learned that “the cloud is mysteriously slow and half-broken” is RabbitMQ until proven otherwise. Here’s how I prove it.

The tell-tale symptom pattern

RabbitMQ trouble has a signature: multiple unrelated services degrade at once. Nova can’t schedule, Neutron agents flap to down, Cinder operations hang — all simultaneously, with no single service’s logs showing a clear root cause. When I see that pattern, I go straight to the message bus before reading any service log.

Step 1: Check cluster and queue health

rabbitmqctl cluster_status
rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers

Two numbers tell the story:

messages_ready climbing — messages are arriving but nothing is consuming them. A consumer (an OpenStack service) is wedged or disconnected.
messages_unacknowledged high and stuck — a consumer took messages but never acked them, usually because it’s hung mid-processing.

Sort by message count and the worst queue points you straight at the suffering service: queue names map to OpenStack components (nova, q-agent-notifier, cinder-volume).
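To make that reading concrete, here is a minimal Python sketch that parses the tab-separated output of the list_queues command above and ranks queues by backlog. The function names, the triage labels, and the sample data are my own illustrative assumptions, not part of RabbitMQ or OpenStack tooling. A single snapshot also can't show whether a count is climbing, so treat the labels as a starting hint only.

```python
from typing import List, NamedTuple


class QueueStat(NamedTuple):
    name: str
    messages: int
    ready: int
    unacked: int
    consumers: int


def parse_list_queues(output: str) -> List[QueueStat]:
    """Parse output of: rabbitmqctl list_queues name messages
    messages_ready messages_unacknowledged consumers

    Banner and header lines (anything without five tab-separated
    fields ending in four integers) are skipped.
    """
    stats = []
    for line in output.splitlines():
        fields = line.strip().split("\t")
        if len(fields) != 5:
            continue
        try:
            counts = [int(value) for value in fields[1:]]
        except ValueError:
            continue  # header row such as "name<TAB>messages<TAB>..."
        stats.append(QueueStat(fields[0], *counts))
    return stats


def triage(queue: QueueStat) -> str:
    """Hypothetical labels for the patterns described in the text."""
    if queue.ready > 0 and queue.consumers == 0:
        return "backlog, no consumer attached"
    if queue.ready > 0:
        return "backlog, consumer attached but not keeping up"
    if queue.unacked > 0:
        return "messages delivered but not yet acked"
    return "idle"


def worst_first(stats: List[QueueStat]) -> List[QueueStat]:
    """Sort queues by total message count, largest first."""
    return sorted(stats, key=lambda q: q.messages, reverse=True)


# Made-up sample output, for illustration only.
SAMPLE = (
    "Listing queues for vhost / ...\n"
    "name\tmessages\tmessages_ready\tmessages_unacknowledged\tconsumers\n"
    "q-agent-notifier\t0\t0\t0\t3\n"
    "cinder-volume\t12\t0\t12\t1\n"
    "nova\t340\t340\t0\t0\n"
)

for q in worst_first(parse_list_queues(SAMPLE)):
    print(f"{q.name:20} {q.messages:6}  {triage(q)}")
```

Run it against two captures taken a minute apart and compare the numbers; the trend matters more than any single value.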

Step 2: Hunt for partitions and node failures

A clustered RabbitMQ that suffered a network blip can split-brain into a partition, and OpenStack handles that badly:

rabbitmqctl cluster_status | grep -i -A5 partitions

If you see partitions listed, that’s your incident. A partitioned cluster has nodes that disagree about queue state, and oslo.messaging clients get inconsistent behavior depending on which node they hit. The recovery is to decide on an authoritative node and restart the minority partition members to rejoin cleanly.

This is also why your cluster_partition_handling strategy matters: pause_minority pauses any node that finds itself on the minority side of a split, which prevents the worst split-brain corruption at the cost of availability on those nodes. Set it deliberately.
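If you would rather check for partitions from a script than by eye, a sketch like the following can help. It assumes a RabbitMQ release whose CLI supports `rabbitmqctl cluster_status --formatter json`, and it assumes the JSON carries a "partitions" object mapping each node to the peers it has lost contact with. Verify both against your own broker version. The node names and sample documents are invented.

```python
import json
from typing import Dict, List


def find_partitions(cluster_status_json: str) -> Dict[str, List[str]]:
    """Return {node: [unreachable peers]} for nodes reporting a partition.

    Assumes the JSON has a "partitions" mapping; an absent or empty
    value is treated as "no partition seen".
    """
    status = json.loads(cluster_status_json)
    partitions = status.get("partitions") or {}
    return {node: list(peers) for node, peers in partitions.items() if peers}


# Invented sample documents, trimmed to the keys this sketch reads.
HEALTHY = (
    '{"running_nodes": ["rabbit@ctl1", "rabbit@ctl2", "rabbit@ctl3"],'
    ' "partitions": {}}'
)
SPLIT = (
    '{"running_nodes": ["rabbit@ctl1", "rabbit@ctl2"],'
    ' "partitions": {"rabbit@ctl1": ["rabbit@ctl3"],'
    ' "rabbit@ctl2": ["rabbit@ctl3"]}}'
)

for label, doc in (("healthy", HEALTHY), ("split", SPLIT)):
    found = find_partitions(doc)
    print(label, "->", found if found else "no partitions reported")
```

An empty result only means the node you asked sees no partition, so ask more than one node before declaring the cluster healthy.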

Step 3: Find the stuck consumer

When a queue has messages_ready piling up but consumers > 0, the consumer is connected but not processing. Identify which service:

rabbitmqctl list_consumers | grep <queue-name>

Then on the implicated service host, the fix is usually a restart of the OpenStack service — its oslo.messaging connection has gone stale after a broker blip and won’t recover on its own. This is the single most common RabbitMQ-related fix in OpenStack: a nova-compute or neutron agent whose connection died silently, leaving it “running” but deaf.

systemctl restart neutron-openvswitch-agent   # the wedged consumer

Watch the queue drain after the restart — messages_ready falling confirms you fixed the right thing.
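Watching the drain can be scripted too. The sketch below samples messages_ready for one queue a few times and reports whether the count fell. The helper names are hypothetical. It reads the default vhost, so add `-p <vhost>` to the command list if your OpenStack queues live elsewhere.

```python
import subprocess
import time
from typing import Callable, List


def ready_count(list_queues_output: str, queue: str) -> int:
    """Pick one queue's messages_ready out of two-column output."""
    for line in list_queues_output.splitlines():
        fields = line.strip().split("\t")
        if len(fields) == 2 and fields[0] == queue and fields[1].isdigit():
            return int(fields[1])
    raise LookupError(f"queue {queue!r} not found in output")


def run_list_queues() -> str:
    """Read-only call to the broker; needs rabbitmqctl on this host."""
    result = subprocess.run(
        ["rabbitmqctl", "-q", "list_queues", "name", "messages_ready"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def sample_ready(queue: str, samples: int = 5, interval: float = 2.0,
                 fetch: Callable[[], str] = run_list_queues) -> List[int]:
    """Collect messages_ready for one queue, `interval` seconds apart."""
    counts = []
    for i in range(samples):
        if i:
            time.sleep(interval)
        counts.append(ready_count(fetch(), queue))
    return counts


def is_draining(counts: List[int]) -> bool:
    """True when the last sample is lower than the first."""
    return len(counts) >= 2 and counts[-1] < counts[0]
```

A call such as `is_draining(sample_ready("nova"))` after the restart gives a quick yes or no. The `fetch` parameter exists so the logic can be exercised with canned output instead of a live broker.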

Step 4: Memory and disk alarms

RabbitMQ blocks publishers when it hits memory or disk watermarks, which freezes the whole cloud’s coordination:

rabbitmqctl status | grep -A3 -iE 'alarm|memory|disk'
rabbitmqctl list_queues name messages | sort -k2 -n -r | head

If RabbitMQ tripped its high-memory watermark, publishers are blocked and OpenStack appears totally frozen. The cause is usually an unbounded queue whose consumer died days ago, with messages accumulating and nobody reading them. Find the giant queue, fix or purge it, and consider setting a queue length limit and TTL so a dead consumer can’t take down the broker again.
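One way to apply a length limit and TTL is a broker policy. The sketch below only builds the `rabbitmqctl set_policy` command line so you can review it before running anything. The policy name, the queue pattern, and the numbers are placeholders I picked for illustration. Note that RabbitMQ applies a single policy per queue, so a new policy can displace an existing one (an HA policy, for instance) on any queue it matches, and a TTL on RPC queues will silently drop slow requests.

```python
import json
import shlex
from typing import List


def queue_limit_policy(name: str, pattern: str, max_length: int,
                       ttl_ms: int, vhost: str = "/") -> List[str]:
    """Build argv for a policy capping queue length and message age.

    "max-length" counts messages; "message-ttl" is in milliseconds.
    """
    if max_length <= 0 or ttl_ms <= 0:
        raise ValueError("max_length and ttl_ms must be positive")
    definition = {"max-length": max_length, "message-ttl": ttl_ms}
    return [
        "rabbitmqctl", "set_policy",
        "-p", vhost,
        "--apply-to", "queues",
        name, pattern, json.dumps(definition),
    ]


# Placeholder values: cap queues whose names start with "notifications."
# at 10,000 messages and expire messages after one hour.
args = queue_limit_policy("notify-cap", r"^notifications\.", 10000, 3600000)
print(shlex.join(args))
```

Printing the command rather than executing it keeps the step reviewable, which matters for a change that discards messages.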

Step 5: Tune oslo.messaging for resilience

A lot of RabbitMQ pain is really oslo.messaging configuration. In each service’s config:

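[oslo_messaging_rabbit]
# Requests mirrored queues; on RabbitMQ 3.0+ mirroring is controlled by a broker-side HA policy, so set one there too
rabbit_ha_queues = true
# Seconds without a heartbeat before the broker connection is considered dead; 0 disables heartbeats
heartbeat_timeout_threshold = 60
# Queues are not declared durable, so they are lost when the broker restarts
amqp_durable_queues = false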

The heartbeat setting matters most: without sane heartbeats, a half-dead TCP connection lingers and the service keeps trying to use a dead channel. Proper heartbeats let clients detect a broken broker connection and reconnect — which prevents the “running but deaf” consumer in the first place.
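To audit that setting across many service configs, something like this sketch can flag files where heartbeats are switched off. It uses Python's configparser as an approximation of oslo.config's own parser, and it assumes a fallback of 60 seconds when the option is absent, which matches the documented default as far as I know. Check the oslo.messaging release you run. The function names and sample configs are hypothetical.

```python
import configparser

SECTION = "oslo_messaging_rabbit"
OPTION = "heartbeat_timeout_threshold"
ASSUMED_DEFAULT = 60  # assumption: the library default when unset


def heartbeat_threshold(config_text: str) -> int:
    """Return the configured threshold, or the assumed default."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    return parser.getint(SECTION, OPTION, fallback=ASSUMED_DEFAULT)


def heartbeat_finding(config_text: str) -> str:
    """One-line verdict for a service config."""
    threshold = heartbeat_threshold(config_text)
    if threshold == 0:
        return "heartbeats disabled"
    return f"heartbeats on, threshold {threshold}s"


# Invented config snippets for illustration.
DISABLED = (
    "[DEFAULT]\ndebug = false\n\n"
    "[oslo_messaging_rabbit]\nheartbeat_timeout_threshold = 0\n"
)
UNSET = "[DEFAULT]\ndebug = false\n"

print(heartbeat_finding(DISABLED))
print(heartbeat_finding(UNSET))
```

Feed it the text of each service's config file in turn, and any "heartbeats disabled" line is a candidate for the "running but deaf" failure described above.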

Using AI to correlate the cloud-wide symptoms

RabbitMQ incidents are confusing precisely because the symptoms are scattered across many services. That’s where an LLM helps: I paste the list_queues output, the cluster_status, and a few of the affected services’ log tails and ask:

“Here is the RabbitMQ queue list, cluster status, and log tails from Nova, Neutron, and Cinder, all degraded at once. Tell me whether this is a partition, a memory/disk alarm, or stuck consumers, identify which service to restart, and give the read-only command to confirm before I act.”

It’s quick at recognizing “all three services degraded + climbing messages_ready + a partition listed = split-brain, not a service bug,” which is exactly the leap a tired engineer misses at 2am. I keep these messaging-triage prompts with my other OpenStack prompts.

Make the bus boring

The clouds that didn’t wake me up had three things: RabbitMQ memory/disk and messages_ready alerting in Prometheus, a deliberate cluster_partition_handling policy, and sane oslo.messaging heartbeats on every service. Get those right and RabbitMQ fades into the background where it belongs.

When the whole cloud feels broken at once, check the message bus first — it’s faster than reading six services’ logs hoping one confesses. For more messaging and operations prompts, browse our prompt library.

AI triage of messaging incidents is assistive, not authoritative. Confirm queue and cluster state yourself before restarting services or purging queues.