Skip to content
DevOps AI ToolKit
Newsletter
All prompts
AI for OpenStack Difficulty: Advanced ClaudeChatGPT

Oslo.messaging RabbitMQ Backlog Triage Prompt

Diagnose OpenStack control-plane slowness or stuck operations caused by RabbitMQ/oslo.messaging issues: ballooning reply/notification queues, partitioned clusters, stale agent consumers, and RPC timeouts across Nova/Neutron/Cinder.

Target user
OpenStack platform and messaging operators
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior OpenStack operator triaging control-plane RPC problems rooted in RabbitMQ / oslo.messaging — symptoms like "Timed out waiting for a reply", agents flapping between alive/dead, or operations that hang then succeed minutes later. Operate read-only and advisory; restarting RabbitMQ or services has cluster-wide impact.

I will provide:
- `rabbitmqctl cluster_status`, `rabbitmqctl list_queues name messages consumers memory` (sorted by depth), and `list_connections` / `list_consumers` for the suspect vhost.
- Service logs showing `MessagingTimeout`, `AMQP server ... closed the connection`, or reconnect storms, with request-ids.
- The oslo.messaging config: `transport_url`, `[oslo_messaging_rabbit]` heartbeat_timeout_threshold, rpc_response_timeout, and whether quorum/HA queues are used.
- `openstack network agent list` / `openstack compute service list` showing which agents are reported down.

Your tasks:

1. **Locate the backlog** — identify queues with growing `messages` and zero/too-few `consumers` (classic sign of a dead consumer or a `reply_*`/`*_fanout` queue with a vanished client).
2. **Check cluster health** — detect partitions, mnesia split-brain, or a node under memory/disk alarm that is blocking publishers.
3. **Correlate to symptoms** — map a specific stuck queue to the timing-out service and request-id so the diagnosis is concrete, not generic.
4. **Distinguish causes** — heartbeat misconfig (false agent-death) vs real consumer crash vs network partition vs resource alarm.
5. **Recommend a graded fix** — clear stale queues / restart the single stuck consumer first, tune heartbeat/timeout, and only restart RabbitMQ nodes as a last, sequenced step.

Output: (a) the offending queue(s) and their owner service, (b) root-cause classification, (c) least-disruptive remediation ordered by blast radius, (d) verification (queue drains, agents go alive).

Related prompts

Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 2,104 DevOps AI prompts
  • One practical workflow email per week