RabbitMQ Prometheus Monitoring & Alerting Design Prompt
Design a RabbitMQ observability stack with the right Prometheus metrics, dashboards, and alert thresholds for queue depth, memory/disk alarms, flow control, and node health, so that problems surface before they become incidents.
- Target user: SRE and observability engineers
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE designing RabbitMQ monitoring and alerting. Produce a config for review; do not apply anything live.

I will provide:

- How RabbitMQ exposes metrics (built-in `rabbitmq_prometheus` plugin, exporter, or management API) and the scrape setup
- Cluster size, queue types in use, and rough message rates / queue-depth expectations
- The current alert rules (if any) and the on-call team's noise tolerance
- SLOs or business expectations (max acceptable consumer lag, delivery latency)

Your job:

1. **Pick the signal metrics**: choose the high-value series: `rabbitmq_queue_messages_ready`, `rabbitmq_queue_messages_unacked`, consumer count, `rabbitmq_process_resident_memory_bytes` against `rabbitmq_resident_memory_limit_bytes`, disk free, `rabbitmq_connections`/`rabbitmq_channels`, and flow-control state plus `rabbitmq_alarms_*`. These are the names used by the built-in plugin; adjust them to whatever the exporter in use actually exposes.
2. **Define alert tiers**: propose page vs ticket vs info alerts for: memory/disk alarm active, partition/node down, queue depth growing unbounded, zero consumers on a live queue, and high redelivery rate.
3. **Set thresholds smartly**: prefer rate-of-change and "growing for N minutes" conditions over static depth thresholds to avoid flapping on bursts, and suggest a `for:` duration for each rule.
4. **Cover cluster health**: alert on node count drop, unsynchronized mirrors (where classic mirrored queues are still in use), under-replicated quorum queues, and high file-descriptor usage.
5. **Design dashboards**: group panels by node health, queue throughput, consumer lag, and resource alarms.
6. **Reduce noise**: recommend grouping and inhibition (for example, suppress queue-depth alerts while a node is down) and a runbook link per alert.

Output: (a) metric catalog with why-it-matters, (b) tiered alert rules with thresholds and `for:`, (c) dashboard layout, (d) noise-reduction notes.

Validate thresholds against a few weeks of real data before paging on them.
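To illustrate why step 3 of the prompt prefers "growing for N minutes" over a static depth threshold, here is a minimal Python sketch of that check applied to a list of queue-depth samples. It is not Prometheus code: `Sample`, `slope_per_second`, `growing_for`, and `series` are hypothetical names invented for this illustration, and the 5-minute window and 10-minute hold are assumed values, not recommendations. The slope is a plain least-squares fit, similar in spirit to what a rate-of-change expression would evaluate, and the hold period plays the role of a rule's `for:` duration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    t: float      # sample timestamp in seconds
    depth: float  # ready messages in the queue at time t


def slope_per_second(samples: Sequence[Sample]) -> float:
    """Least-squares slope of depth over time, in messages per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(s.t for s in samples) / n
    mean_d = sum(s.depth for s in samples) / n
    den = sum((s.t - mean_t) ** 2 for s in samples)
    if den == 0:
        return 0.0
    num = sum((s.t - mean_t) * (s.depth - mean_d) for s in samples)
    return num / den


def growing_for(samples: Sequence[Sample], window_s: float, for_s: float,
                min_slope: float = 0.0) -> bool:
    """True only if the trailing-window slope stayed above min_slope at
    every sample during the last for_s seconds (the hold period)."""
    if not samples:
        return False
    now = samples[-1].t
    # Require enough history to cover the hold period plus one full window.
    if now - samples[0].t < window_s + for_s:
        return False
    for i, s in enumerate(samples):
        if s.t < now - for_s:
            continue  # outside the hold period
        window = [p for p in samples[: i + 1] if p.t >= s.t - window_s]
        if slope_per_second(window) <= min_slope:
            return False  # growth was interrupted, so the alert resets
    return True


def series(depths: Sequence[float], step_s: float = 60.0) -> list:
    """Build evenly spaced samples from a list of depths (assumed 60 s scrape)."""
    return [Sample(i * step_s, d) for i, d in enumerate(depths)]


if __name__ == "__main__":
    steady = series([10 * i for i in range(31)])                       # slow, unbounded growth
    burst = series([5000 if i in (25, 26) else 0 for i in range(31)])  # short spike that drains
    print("steady growth fires:", growing_for(steady, 300, 600))
    print("short burst fires:", growing_for(burst, 300, 600))
```

In this sketch the two-minute burst peaks far above the slowly growing queue, so a static depth threshold would page on the burst and could miss the slow growth entirely, while the sustained-growth check does the opposite. Real thresholds and windows would still need to be tuned against historical data, as the prompt says.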