Oslo.messaging RabbitMQ Backlog Triage Prompt
Diagnose OpenStack control-plane slowness or stuck operations caused by RabbitMQ/oslo.messaging issues: ballooning reply/notification queues, partitioned clusters, stale agent consumers, and RPC timeouts across Nova/Neutron/Cinder.
- Target user: OpenStack platform and messaging operators
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior OpenStack operator triaging control-plane RPC problems rooted in RabbitMQ / oslo.messaging: symptoms like "Timed out waiting for a reply", agents flapping between alive and dead, or operations that hang and then succeed minutes later. Operate read-only and advisory; restarting RabbitMQ or services has cluster-wide impact.

I will provide:
- `rabbitmqctl cluster_status`, `rabbitmqctl list_queues name messages consumers memory` (sorted by depth), and `list_connections` / `list_consumers` for the suspect vhost.
- Service logs showing `MessagingTimeout`, `AMQP server ... closed the connection`, or reconnect storms, with request-ids.
- The oslo.messaging config: `transport_url`, `heartbeat_timeout_threshold` under `[oslo_messaging_rabbit]`, `rpc_response_timeout` (a `[DEFAULT]` option), and whether quorum or HA (mirrored) queues are used.
- `openstack network agent list` / `openstack compute service list` showing which agents are reported down.

Your tasks:
1. **Locate the backlog**: identify queues with growing `messages` and zero or too few `consumers` (the classic sign of a dead consumer, or of a `reply_*` / `*_fanout` queue whose client has vanished).
2. **Check cluster health**: detect partitions, Mnesia split-brain, or a node under a memory or disk alarm that is blocking publishers.
3. **Correlate to symptoms**: map a specific stuck queue to the timing-out service and request-id so the diagnosis is concrete, not generic.
4. **Distinguish causes**: heartbeat misconfiguration (false agent death) vs. a real consumer crash vs. a network partition vs. a resource alarm.
5. **Recommend a graded fix**: first clear stale queues or restart the single stuck consumer, then tune heartbeat and timeout settings, and restart RabbitMQ nodes only as a last, sequenced step.

Output:
(a) the offending queue(s) and their owner service,
(b) root-cause classification,
(c) least-disruptive remediation ordered by blast radius,
(d) verification (queues drain, agents report alive).
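To illustrate task 1 of the prompt, the following is a minimal Python sketch that pre-filters two snapshots of `rabbitmqctl list_queues name messages consumers memory` before pasting them into the model. It assumes tab-separated output in exactly that column order; the names `QueueStat`, `parse_list_queues`, `growing_unconsumed`, the `min_depth` threshold, and the sample queue names are all hypothetical choices made for this example, not part of RabbitMQ or oslo.messaging.

```python
from typing import NamedTuple


class QueueStat(NamedTuple):
    name: str
    messages: int
    consumers: int
    memory: int


def parse_list_queues(text: str) -> list[QueueStat]:
    """Parse tab-separated `list_queues name messages consumers memory` output.

    Banner lines and the column-header row are skipped because they do not
    consist of a name followed by three integer fields.
    """
    stats = []
    for line in text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) != 4:
            continue
        name, *numbers = fields
        if not all(n.isdigit() for n in numbers):
            continue
        stats.append(QueueStat(name, *map(int, numbers)))
    return stats


def growing_unconsumed(before, after, min_depth=100):
    """Return queues that have no consumers and grew between two snapshots.

    min_depth is an arbitrary illustrative threshold; tune it per deployment.
    """
    earlier = {q.name: q.messages for q in before}
    flagged = [
        q for q in after
        if q.consumers == 0
        and q.messages >= min_depth
        and q.messages > earlier.get(q.name, 0)
    ]
    return sorted(flagged, key=lambda q: q.messages, reverse=True)


# Hypothetical snapshots taken a few minutes apart (queue names are made up).
SNAPSHOT_BEFORE = (
    "Listing queues for vhost / ...\n"
    "name\tmessages\tconsumers\tmemory\n"
    "reply_abc123\t40\t0\t90000\n"
    "conductor\t3\t4\t50000\n"
    "scheduler_fanout_def456\t500\t0\t120000\n"
)
SNAPSHOT_AFTER = (
    "Listing queues for vhost / ...\n"
    "name\tmessages\tconsumers\tmemory\n"
    "reply_abc123\t950\t0\t400000\n"
    "conductor\t0\t4\t50000\n"
    "scheduler_fanout_def456\t500\t0\t120000\n"
)

suspects = growing_unconsumed(
    parse_list_queues(SNAPSHOT_BEFORE), parse_list_queues(SNAPSHOT_AFTER)
)
for queue in suspects:
    print(queue.name, queue.messages)
```

A queue that is deep but static (like the fanout queue in the sample) is deliberately not flagged here; it may still be stale, so the full listing should be given to the model as well.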
Related prompts
- OpenStack AMQP TLS Certificate Rotation Runbook Prompt: Plan and execute rotation of RabbitMQ AMQP TLS certificates across all OpenStack services without dropping RPC connectivity or stranding controllers, computes, and agents.
- RabbitMQ Performance Tuning for OpenStack Prompt: Tune RabbitMQ for an OpenStack control plane (queue/HA policies, connection and channel limits, heartbeats, prefetch, memory/flow-control watermarks, and durable vs. transient reply queues) so RPC stays fast and the broker never wedges under load.