Diagnosing RabbitMQ Queue Buildup and Partitions in OpenStack with AI

The first time a RabbitMQ partition took down my control plane, I was three coffees deep and convinced the database was the problem. It wasn’t. It never is, when nova-compute agents start flapping and openstack server list hangs for forty seconds before timing out. After more than a decade of running OpenStack clouds, I’ve learned that the message bus is the quiet organ failure nobody monitors until it’s already septic. These days I keep an AI assistant in the loop while I triage, and it has genuinely sped me up. But I want to be honest about exactly what it’s good for and where it will happily lead you off a cliff.

Why the message bus is always the last suspect

OpenStack services don’t talk to each other directly. Nova, Neutron, Cinder, and the rest fan their RPC calls through RabbitMQ using oslo.messaging. When the bus slows down, every symptom looks like something else: instances stuck in BUILD, ports never going ACTIVE, agents showing XXX in openstack network agent list. The control plane looks healthy on the surface because the processes are still running — they’re just screaming into a queue nobody is draining.

So the first thing I do is look at queue depth directly:

rabbitmqctl list_queues name messages consumers messages_unacknowledged

If you see a queue with tens of thousands of messages and zero consumers, you’ve found your smoking gun. A healthy reply queue drains almost instantly. A pathological one grows monotonically. I’ll paste that table into an AI like Claude and ask it to flag the rows where consumers is zero but messages is climbing — it’s faster at eyeballing a 400-row dump than I am at 2 a.m. That’s the right use of it: a fast junior engineer who can scan output without getting bored. It is not a substitute for knowing what a healthy queue looks like.
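"Climbing" only shows up when you compare two snapshots. Here is a minimal sketch of that comparison, assuming each snapshot was saved as tab-separated name, messages, consumers columns (for example from rabbitmqctl -q list_queues name messages consumers). The function name and the snapshot file names are placeholders of mine, not part of any tool:

```shell
# queue_growth OLD NEW
# OLD and NEW are snapshot files with tab-separated columns:
#   name <TAB> messages <TAB> consumers
# Prints "name <TAB> old_depth <TAB> new_depth" for every queue that has
# zero consumers in NEW and is deeper in NEW than it was in OLD.
queue_growth() {
  awk -F '\t' '
    NR == FNR { before[$1] = $2; next }   # first file: remember each depth
    NF >= 3 && $3 == 0 && ($1 in before) && $2 + 0 > before[$1] + 0 {
      printf "%s\t%d\t%d\n", $1, before[$1], $2
    }
  ' "$1" "$2"
}

# Hypothetical usage on a broker node, one minute apart:
#   rabbitmqctl -q list_queues name messages consumers > /tmp/q.old
#   sleep 60
#   rabbitmqctl -q list_queues name messages consumers > /tmp/q.new
#   queue_growth /tmp/q.old /tmp/q.new
```

Queues that only exist in the second snapshot are skipped here, so a brand-new queue with a backlog still needs a look at the raw table.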

Pro Tip: Pipe list_queues through sort -k2 -n and read only the tail. The sort is ascending by message count, so the top of the list is noise and the buildup is always at the bottom.
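The filter and the sort can be combined into one step. This is a sketch under the assumption that the input is tab-separated name, messages, consumers; the function name and the default threshold of 1000 are arbitrary choices of mine:

```shell
# backlog_without_consumers [MIN]
# Reads "name <TAB> messages <TAB> consumers" rows on stdin, keeps queues
# with zero consumers and at least MIN messages (default 1000), and sorts
# ascending by depth so the deepest queue is printed last.
backlog_without_consumers() {
  awk -F '\t' -v min="${1:-1000}" '
    NF >= 3 && $3 == 0 && $2 + 0 >= min + 0 { print $1 "\t" $2 }
  ' | sort -t "$(printf '\t')" -k2,2n
}

# Hypothetical usage on a broker node:
#   rabbitmqctl -q list_queues name messages consumers | backlog_without_consumers 10000
```

The NF >= 3 guard drops banner lines, and a header row is skipped because its consumers column is not the number zero.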

Cluster status and the dreaded partition

The single most destructive RabbitMQ failure in OpenStack is a network partition — a “split brain” where cluster nodes can’t see each other and each half thinks it’s the survivor. Check it immediately:

rabbitmqctl cluster_status

Scroll to the partitions section. If it’s empty, good. If it lists nodes, you have a split. With the pause_minority partition handling mode that many OpenStack deployments configure (RabbitMQ’s own default is ignore), a node that finds itself in the minority deliberately pauses to protect data. That is correct behavior, but it means consumers on that side go silent. I’ve watched engineers “fix” this by restarting the wrong node and destroying the only consistent copy of the queue state.

rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'

If that returns autoheal instead of pause_minority, you’ve inherited a cluster that will silently pick a winner and discard messages from the loser. That’s a config decision, not a runtime one, and no AI should be making it for you. I will ask an assistant to explain the tradeoffs between pause_minority, autoheal, and ignore, and it gives a solid summary. I will not let it tell me which node to stonith. When I’ve debated cluster topology choices, I’ve found it more useful to work through them in a structured prompt workspace where I can keep the cluster facts pinned and force myself to state assumptions out loud.
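To keep the modes straight during an incident, here is a small sketch that turns the term printed by the eval command above into a one-line reminder. It assumes the output looks like {ok,pause_minority}; the function name and the wording of the reminders are mine, condensed from how RabbitMQ documents each mode:

```shell
# describe_partition_mode
# Reads the Erlang term printed by
#   rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'
# on stdin and prints a short reminder of what that mode does.
describe_partition_mode() {
  IFS= read -r term || true   # tolerate input without a trailing newline
  case "$term" in
    # Checked first: this mode's tuple also names ignore or autoheal.
    *pause_if_all_down*) echo "pause_if_all_down: a node pauses when none of the listed nodes are reachable" ;;
    *pause_minority*)    echo "pause_minority: nodes that end up in a minority pause until the partition heals" ;;
    *autoheal*)          echo "autoheal: a winning partition is chosen and the other nodes are restarted" ;;
    *ignore*)            echo "ignore: nothing automatic happens, an operator has to resolve the split" ;;
    *)                   echo "unknown: $term" ;;
  esac
}

# Hypothetical usage on a broker node:
#   rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).' | describe_partition_mode
```

This only names the mode. It says nothing about which node holds the good copy of the queue state, and that call stays with you.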

Stale reply queues from nova and neutron

Here’s a failure mode that burned me for years before I understood it. oslo.messaging creates a reply_<uuid> queue per RPC client for direct replies. When a service crashes hard or a node is fenced uncleanly, those reply queues can be orphaned — they linger, sometimes with unacked messages, and they pile up:

rabbitmqctl list_queues name messages | grep '^reply_' | wc -l

If that count is in the thousands and growing, you have leaked reply queues. On older deployments without amqp_auto_delete semantics or a queue TTL (the rabbit_transient_queues_ttl option in newer oslo.messaging releases expires idle reply and fanout queues), they don’t clean themselves up. The fix is usually a rolling restart of the affected services so they re-establish clean reply queues, but the diagnosis is what matters, and counting and grouping thousands of queue names is exactly the kind of tedious pattern-matching where I’ll lean on AI. I describe the symptom, paste a sample of queue names, and ask whether the pattern is consistent with orphaned reply queues. It’s right often enough to save me time and wrong often enough that I always confirm with my own eyes.

Pro Tip: A flood of reply_* queues right after a compute node reboot is almost always orphaned replies, not active traffic. Active reply queues have a consumer; orphans don’t.
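The consumer test in that tip is easy to script. This is a sketch that counts only reply_* queues with zero consumers and totals the messages stranded in them, assuming tab-separated name, messages, consumers input; the function name is mine:

```shell
# orphaned_reply_queues
# Reads "name <TAB> messages <TAB> consumers" rows on stdin and prints
# "<count> <TAB> <total messages>" for reply_* queues with zero consumers.
orphaned_reply_queues() {
  awk -F '\t' '
    NF >= 3 && $1 ~ /^reply_/ && $3 == 0 { n++; msgs += $2 }
    END { printf "%d\t%d\n", n, msgs }
  '
}

# Hypothetical usage on a broker node:
#   rabbitmqctl -q list_queues name messages consumers | orphaned_reply_queues
```

Compared with the plain grep count, this ignores reply queues that still have a live consumer, so a busy but healthy cloud does not look like a leak.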

oslo.messaging heartbeat timeouts and agents going down

When the bus is under load, heartbeats between an oslo.messaging client and RabbitMQ start getting missed. The service logs something like Timed out waiting for a reply to message ID ... or Too many heartbeats missed. Its connection drops, its status reports stop arriving, and in openstack network agent list or openstack compute service list it shows as down even though the process is alive and the host is fine.

openstack network agent list --long
openstack compute service list

The trap is treating this as an agent problem. It’s a bus latency problem wearing an agent’s clothing. The relevant knobs live in each service’s config under the [oslo_messaging_rabbit] section — heartbeat_timeout_threshold and heartbeat_rate. If your bus is genuinely slow, raising the timeout just delays the inevitable; you need to fix the queue buildup. I keep a running set of these triage steps as reusable prompts, and for the message-bus-specific runbooks I’ve packaged the better ones into a prompt pack so I’m not rebuilding the same diagnostic checklist every incident.
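To see what a service is actually configured with, here is a small sketch that pulls those two options out of a config file read on stdin. The function name is mine and the nova.conf path in the usage comment is only an example. Options that are absent from the file fall back to oslo.messaging defaults, which this does not show:

```shell
# rabbit_heartbeat_opts
# Reads an oslo.config style INI file on stdin and prints any
# heartbeat_timeout_threshold / heartbeat_rate lines found in the
# [oslo_messaging_rabbit] section, with whitespace stripped.
# Commented-out lines and other sections are ignored.
rabbit_heartbeat_opts() {
  awk '
    /^[ \t]*\[/ { in_section = ($0 ~ /^[ \t]*\[oslo_messaging_rabbit\]/); next }
    in_section && /^[ \t]*(heartbeat_timeout_threshold|heartbeat_rate)[ \t]*=/ {
      gsub(/[ \t]/, "")
      print
    }
  '
}

# Hypothetical usage on a compute node:
#   rabbit_heartbeat_opts < /etc/nova/nova.conf
```

Empty output means the service is running on defaults, which is worth knowing before anyone starts raising timeouts.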

Mirrored versus quorum queues

If you’re still running classic mirrored queues (ha-mode: all policies), you are running deprecated technology that handles partitions poorly and is a frequent buildup culprit. Modern OpenStack deployments should be on quorum queues, which use Raft and have far saner partition semantics. Check what you actually have:

rabbitmqctl list_queues name type

A type of quorum is what you want for durable RPC queues. Classic mirrored queues can end up with diverged replicas after a partition, and the “sync” that follows can itself stall the bus. When I’m planning a migration to quorum queues, AI is a decent rubber duck for drafting the changes, but I test every one on a staging cluster first. Queue type is fixed when a queue is first declared, so an x-queue-type change aimed at a queue that already exists does not convert it: the old classic queue stays in place until it is deleted and redeclared, and you’ll think you migrated when you didn’t.
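A quick way to confirm a migration actually took is to count queues by type before and after. This is a minimal sketch, assuming tab-separated name and type columns from the list_queues command above; the function name is mine:

```shell
# queue_type_counts
# Reads "name <TAB> type" rows on stdin and prints "type <TAB> count",
# one line per queue type, sorted by type name.
queue_type_counts() {
  awk -F '\t' '
    NF >= 2 && $2 != "type" { count[$2]++ }   # skip a "name type" header row
    END { for (t in count) print t "\t" count[t] }
  ' | sort
}

# Hypothetical usage on a broker node:
#   rabbitmqctl -q list_queues name type | queue_type_counts
```

If the classic count has not dropped after the change, the old queues are still in place.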

Where I draw the hard line with AI

Let me be blunt about the boundary. I will paste sanitized list_queues output, cluster_status topology, and anonymized log lines into an assistant. I will never hand it my clouds.yaml, my RabbitMQ admin credentials, the contents of /etc/rabbitmq/, or anything that grants standing access to the control plane. An AI is a fast, tireless, occasionally overconfident junior engineer. You would not give a junior the root password to production on day one, and the AI has been on the job for zero days, every conversation. Verify every command it suggests against the docs before you run it on a live bus. A rabbitmqctl command run on the wrong node during a partition can lose data permanently, and the model has no idea which node is the survivor — only you do.

For the actual incident coordination — paging, timeline, comms — I route through our incident response dashboard rather than improvising in a chat window, and I keep all my OpenStack messaging notes consolidated under the OpenStack category so the next partition is a thirty-minute incident instead of a three-hour one.

Conclusion

RabbitMQ buildup and partitions are the most underdiagnosed failure class in OpenStack, precisely because every symptom impersonates another subsystem. Start at queue depth, check cluster_status for partitions before you touch anything, hunt for leaked reply_* queues, and treat agent flapping as a bus-latency signal rather than an agent fault. AI will make you faster at scanning the output and reasoning about tradeoffs — but it stays firmly outside the credential boundary, and you verify everything it tells you. Speed without verification on a message bus is just a faster way to lose messages.
