RabbitMQ Memory & Disk Alarm Resource-Limit Triage Prompt
Triage a RabbitMQ memory or disk-free alarm that has blocked publishers cluster-wide, find what is consuming the resource, and recover safely without dropping messages.
- Target user: Platform and SRE engineers responding to RabbitMQ resource alarms
- Difficulty: Advanced
- Tools: Claude, ChatGPT, Cursor
The prompt
You are a senior platform engineer who has recovered RabbitMQ nodes from memory and disk-free alarms that blocked every publisher. Walk me through triaging mine.

I will provide:
- `rabbitmqctl status` showing the memory and disk_free sections and current alarm state [PASTE OUTPUT]
- `rabbitmqctl list_queues name messages memory message_bytes_ram` sorted by memory [PASTE OUTPUT]
- The configured watermarks: `vm_memory_high_watermark`, `disk_free_limit` [PASTE OUTPUT]
- The trigger context: traffic spike, slow/absent consumers, a big backlog, log growth [DESCRIBE]

Your job:
1. **Confirm which alarm fired**: memory high-watermark vs disk-free limit. Explain that either one blocks all publishers cluster-wide while consumers keep draining, which is the intended backpressure, not a crash.
2. **Find the consumer of the resource**: large queues holding messages in RAM, an unbounded backlog from missing consumers, big binary message bodies, mnesia/metadata, or (for disk) message store plus logs. Use the `list_queues` memory columns and the `status` breakdown to point at the cause.
3. **Recover safely**: the priority is to let consumers drain or to add consumers so the queue shrinks and the alarm clears. Explain when raising the watermark briefly is acceptable as breathing room and when it just delays a crash.
4. **Reduce footprint**: lazy queues / queue mode to keep messages on disk instead of RAM (on RabbitMQ 3.12 and later, classic queues already behave this way and the lazy setting is ignored), bounded queues via max-length, and quorum-queue memory behavior; for disk, clear/rotate logs and check the message store.
5. **Prevent recurrence**: right-size watermarks, alert on memory/disk headroom and alarm state ahead of the limit, and fix the consumer or backlog pattern that caused it.

Output as: (a) which alarm and what it's blocking, (b) the resource hog identified from the data, (c) the safe recovery sequence, (d) the config and alerts to prevent a repeat. Recovery should favor draining over deletion.
Do not purge or delete queues to clear an alarm without review, because that destroys messages. On prod, adding consumers to drain or briefly raising the watermark is almost always the safer first move, validated against a staging broker where possible.
Why this prompt works
A RabbitMQ resource alarm feels like an outage because publishers everywhere suddenly block, but the prompt reframes it correctly: a memory high-watermark or disk-free alarm is intentional cluster-wide backpressure, with consumers still draining underneath. Knowing that the alarm is a safety mechanism, not a crash, changes the response from panic-restart to deliberate recovery — and it’s why the prompt’s first step is simply identifying which alarm fired and what it’s protecting.
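The "which alarm fired" step can also be checked mechanically. The sketch below is illustrative rather than part of the prompt: it assumes node records shaped like the entries returned by the management plugin's `/api/nodes` endpoint (`name`, `mem_alarm`, `disk_free_alarm`), and the helper names `active_alarms` and `describe` as well as the sample data are hypothetical. It makes no network call; fetching the records is left to whatever tooling is already in place.

```python
from typing import Dict, List


def active_alarms(node: Dict) -> List[str]:
    """Return the resource alarms a single node reports as active."""
    alarms = []
    if node.get("mem_alarm"):
        alarms.append("memory")
    if node.get("disk_free_alarm"):
        alarms.append("disk")
    return alarms


def describe(nodes: List[Dict]) -> str:
    """Summarize alarm state per node and the cluster-wide effect."""
    lines = []
    for node in nodes:
        alarms = active_alarms(node)
        state = ", ".join(alarms) if alarms else "none"
        lines.append(f"{node.get('name', '?')}: alarms={state}")
    # One alarmed node is enough to block publishing connections everywhere.
    if any(active_alarms(node) for node in nodes):
        lines.append("publishers blocked cluster-wide")
    else:
        lines.append("publishers not blocked by resource alarms")
    return "\n".join(lines)


# Made-up sample records, shaped like /api/nodes entries (assumption).
sample_nodes = [
    {"name": "rabbit@node-a", "mem_alarm": True, "disk_free_alarm": False},
    {"name": "rabbit@node-b", "mem_alarm": False, "disk_free_alarm": False},
]

if __name__ == "__main__":
    print(describe(sample_nodes))
```

With the sample data, only `rabbit@node-a` reports a memory alarm, yet the summary still marks publishers as blocked for the whole cluster, which is the behavior the prompt asks the model to explain.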
It then drives toward the actual resource hog using the right data: the per-queue memory and message_bytes_ram columns, the status breakdown, and the watermark settings. Most alarms trace back to a queue holding a large backlog in RAM because consumers are slow or gone, or to disk filled by the message store and logs. Pinpointing that lets the recovery target the cause instead of flailing. The structural fixes — lazy queues to keep messages on disk, bounded queues via max-length, log rotation for disk — address footprint rather than just the symptom.
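As a rough illustration of reading the per-queue data, the following sketch ranks queues by the `memory` column. It assumes the tab-separated output of `rabbitmqctl list_queues name messages memory message_bytes_ram` has been captured as text; the `QueueUsage` type, the function names, and the sample rows are hypothetical. Banner and header lines are skipped by checking for three numeric columns instead of relying on any particular CLI flag.

```python
from typing import List, NamedTuple


class QueueUsage(NamedTuple):
    name: str
    messages: int
    memory: int             # bytes of memory attributed to the queue
    message_bytes_ram: int  # bytes of message bodies held in RAM


def parse_list_queues(output: str) -> List[QueueUsage]:
    """Parse tab-separated list_queues output, ignoring non-data lines."""
    rows = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # informational banner or blank line
        name = fields[0]
        numbers = [value.strip() for value in fields[1:]]
        if not all(value.isdigit() for value in numbers):
            continue  # column header row
        rows.append(QueueUsage(name, *(int(value) for value in numbers)))
    return rows


def top_by_memory(rows: List[QueueUsage], limit: int = 5) -> List[QueueUsage]:
    """Return the queues using the most memory, largest first."""
    return sorted(rows, key=lambda row: row.memory, reverse=True)[:limit]


# Made-up sample output for illustration only.
SAMPLE = (
    "Listing queues for vhost / ...\n"
    "name\tmessages\tmemory\tmessage_bytes_ram\n"
    "orders\t120\t2400000\t90000\n"
    "events.backlog\t4800000\t3100000000\t2900000000\n"
    "audit\t0\t55000\t0\n"
)

if __name__ == "__main__":
    for row in top_by_memory(parse_list_queues(SAMPLE), limit=3):
        print(f"{row.name}: memory={row.memory} messages={row.messages}")
```

A queue whose `message_bytes_ram` is close to its `memory` figure, as with the hypothetical `events.backlog` row, is holding most of its footprint as message bodies in RAM, which points at a backlog rather than metadata overhead.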
The guardrails encode the safe order of operations under pressure. Restarting a node mid-alarm can lose non-persistent messages and usually doesn’t fix anything; purging a queue clears the alarm by destroying real data. By favoring draining, adding consumers, or a temporary watermark bump as breathing room, and by adding headroom alerts that fire before the limit, the prompt turns a recurring fire drill into a controlled recovery with early warning.
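One way to express "alerts that fire before the limit" is as a headroom check. The sketch below is an assumption-laden example, not a recommended policy: it takes byte values shaped like the management API's node fields (`mem_used`, `mem_limit`, `disk_free`, `disk_free_limit`), and the thresholds of 20% memory headroom and 50% disk headroom above the limit are placeholder numbers to tune per cluster. The function names are hypothetical.

```python
from typing import Dict, List


def memory_headroom(mem_used: int, mem_limit: int) -> float:
    """Fraction of the memory watermark still unused (0.0 means at the limit)."""
    if mem_limit <= 0:
        raise ValueError("mem_limit must be positive")
    return max(0.0, 1.0 - mem_used / mem_limit)


def disk_headroom(disk_free: int, disk_free_limit: int) -> float:
    """Free space above the disk limit, as a multiple of the limit."""
    if disk_free_limit <= 0:
        raise ValueError("disk_free_limit must be positive")
    return max(0.0, disk_free / disk_free_limit - 1.0)


def headroom_warnings(
    node: Dict,
    min_memory_headroom: float = 0.2,  # placeholder threshold (assumption)
    min_disk_headroom: float = 0.5,    # placeholder threshold (assumption)
) -> List[str]:
    """Name the resources that are close to tripping an alarm."""
    warnings = []
    if memory_headroom(node["mem_used"], node["mem_limit"]) < min_memory_headroom:
        warnings.append("memory")
    if disk_headroom(node["disk_free"], node["disk_free_limit"]) < min_disk_headroom:
        warnings.append("disk")
    return warnings


if __name__ == "__main__":
    # Made-up numbers: memory at 90% of its limit, disk at 3x its limit.
    node = {
        "mem_used": 900,
        "mem_limit": 1000,
        "disk_free": 3000,
        "disk_free_limit": 1000,
    }
    print(headroom_warnings(node))
```

Measuring headroom relative to the configured limits, not as absolute byte counts, keeps the alert meaningful after a watermark is resized.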