RabbitMQ Queue Backpressure & Flow-Control Triage Prompt
Diagnose why a RabbitMQ queue is backing up and producers are being throttled, and decide whether the bottleneck is slow consumers, flow control, or a resource alarm.
- Target user: Platform and SRE engineers triaging RabbitMQ throughput incidents
- Difficulty: Advanced
- Tools: Claude, ChatGPT, Cursor
The prompt
You are a senior platform engineer who has triaged many RabbitMQ backpressure incidents where queues grow without bound and publishers stall. Walk me through diagnosing mine.

I will provide:
- `rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers consumer_utilisation` [PASTE OUTPUT]
- Connection state showing flow control: `rabbitmqctl list_connections name state`, plus channel state from `rabbitmqctl list_channels` [PASTE OUTPUT]
- Any resource alarms: the memory/disk alarm section of `rabbitmqctl status`, and `rabbitmqctl list_queues name memory` [PASTE OUTPUT]
- Symptoms: publishers slow/blocked, growing queue depth, rising latency [DESCRIBE]

Your job:
1. **Locate the bottleneck**: separate "queue growing because consumers are slow/absent" (high `messages_ready`, low `consumer_utilisation`) from "broker is throttling producers" (connections in `flow` state) from "a memory or disk alarm has blocked all publishers" (connections in `blocking` or `blocked` state).
2. **Read the signals correctly**: explain `messages_ready` (waiting in the queue for delivery) versus `messages_unacknowledged` (delivered to consumers, bounded by prefetch, but not yet acked), `consumer_utilisation` well below 1.0 meaning the queue often cannot deliver immediately because consumers are the limit (near 1.0 means consumers are keeping up), and connection `flow` state meaning internal credit-based flow control is engaged.
3. **Trace causes**: slow downstream dependency, too few consumers, prefetch too low (consumers idle waiting) or too high (one consumer hoards), large unacked backlog from a stuck consumer, or memory/disk watermark crossed.
4. **Recommend fixes**: scale or speed up consumers, tune prefetch/QoS, use a lazy queue or set a max-length with an overflow policy, fix the resource alarm, or apply backpressure deliberately at the producer with publisher confirms.
5. **Prevent recurrence**: what to alert on (queue depth trend, `messages_unacknowledged`, connections in flow, alarm state) so this is caught before publishers block.

Output as: (a) the diagnosed bottleneck with the specific metric that proves it, (b) immediate mitigation, (c) root-cause fix, (d) the alerts to add.
Validate any queue-policy or prefetch change on a staging broker before prod. Do not purge a backed-up queue to "relieve pressure" without review — purging discards real messages and hides the actual cause.
Why this prompt works
Backpressure incidents are confusing because three different mechanisms produce similar symptoms: slow consumers, RabbitMQ’s internal credit-based flow control, and resource alarms that block publishers outright. The prompt forces you to distinguish them using the exact metrics that tell them apart — messages_ready versus messages_unacknowledged, consumer_utilisation, and connection flow state — rather than guessing. That distinction changes the fix entirely: scaling consumers does nothing if the real problem is a disk-free alarm that has blocked every publisher.
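The three-way distinction can be sketched as a small decision function. This is only an illustration, assuming the `rabbitmqctl` output has already been parsed into plain values. The `QueueStats` type, the `diagnose` function, the finding labels, and both thresholds are hypothetical and would need tuning for a real workload.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueueStats:
    """One parsed row of `rabbitmqctl list_queues` (parsing is out of scope here)."""
    messages_ready: int
    messages_unacknowledged: int
    consumers: int
    consumer_utilisation: Optional[float]  # None when the broker reports no value


def diagnose(queue: QueueStats,
             connection_states: List[str],
             alarms: List[str],
             ready_threshold: int = 1000,          # assumed cut-off, not a RabbitMQ default
             utilisation_threshold: float = 0.5    # assumed cut-off, not a RabbitMQ default
             ) -> List[str]:
    """Return every mechanism the metrics point to, most severe first."""
    findings = []

    # A memory or disk alarm blocks publishers outright, so check it first.
    if alarms or any(s in ("blocking", "blocked") for s in connection_states):
        findings.append("resource_alarm")

    # Connections in `flow` mean credit-based flow control is throttling publishers.
    if "flow" in connection_states:
        findings.append("flow_control")

    # A deep ready backlog with absent or under-delivering consumers points at consumers.
    if queue.messages_ready >= ready_threshold:
        if queue.consumers == 0:
            findings.append("no_consumers")
        elif (queue.consumer_utilisation is not None
              and queue.consumer_utilisation < utilisation_threshold):
            findings.append("slow_consumers")

    return findings or ["inconclusive"]
```

The function returns a list rather than a single label because the mechanisms can co-occur: a disk alarm may be blocking publishers on a queue whose consumers are also slow.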
The prompt also encodes the right mental model of unacked messages. A large messages_unacknowledged count usually means consumers have pulled work via prefetch but aren’t acking it, either because they’re slow or stuck, or because prefetch is set too high and one consumer is hoarding. Reading that signal correctly is what separates a five-minute fix from an hour of restarting random services.
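One way to check for a hoarding consumer is to compare unacked counts per channel. The sketch below assumes those counts were parsed from `rabbitmqctl list_channels name messages_unacknowledged` into a dictionary. The `find_hoarders` helper and its two thresholds are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def find_hoarders(unacked_by_channel: Dict[str, int],
                  share_threshold: float = 0.8,  # assumed: one channel holding 80% is suspicious
                  min_total: int = 100           # assumed: ignore tiny backlogs
                  ) -> List[str]:
    """Return channels holding a disproportionate share of all unacked messages."""
    total = sum(unacked_by_channel.values())
    # With a small backlog or a single channel there is nothing to compare against.
    if total < min_total or len(unacked_by_channel) < 2:
        return []
    return [name for name, count in unacked_by_channel.items()
            if count / total >= share_threshold]
```

An empty result with a large total suggests the backlog is spread evenly, which points toward uniformly slow consumers rather than one stuck or over-prefetching consumer.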
The guardrails address the most common harmful reflex during a backpressure incident: purging the queue to make the number go down. That destroys real messages and erases the evidence of what caused the backup. By steering toward staging validation, deliberate producer-side backpressure with publisher confirms, and the right alerts, the prompt turns a panic into a diagnosis.
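A queue depth trend alert can be approximated with a least-squares slope over recent samples. This is a rough sketch, assuming depth is sampled periodically as `(unix_seconds, messages)` pairs by some existing monitoring job. The function names and the `min_rate` threshold are hypothetical.

```python
from typing import List, Tuple


def depth_growth_rate(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of queue depth, in messages per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_d = sum(d for _, d in samples) / n
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        return 0.0  # all samples share one timestamp, so no trend can be computed
    covariance = sum((t - mean_t) * (d - mean_d) for t, d in samples)
    return covariance / variance


def should_alert(samples: List[Tuple[float, float]], min_rate: float = 5.0) -> bool:
    """Fire when depth grows faster than `min_rate` messages/second (assumed threshold)."""
    return depth_growth_rate(samples) > min_rate
```

Alerting on the slope rather than on absolute depth catches a queue that is steadily filling before it reaches a size where publishers block, while ignoring a queue that is deep but draining.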