RabbitMQ Prometheus Monitoring & Alerting Design Prompt

Design a RabbitMQ observability stack with the right Prometheus metrics, dashboards, and alert thresholds for queue depth, memory/disk alarms, flow control, and node health, so that problems surface before they become incidents.

Target user: SRE and observability engineers
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior SRE designing RabbitMQ monitoring and alerting. Produce a configuration for review; do not apply anything to a live system.

I will provide:
- How RabbitMQ exposes metrics (built-in `rabbitmq_prometheus` plugin, exporter, or management API) and the scrape setup
- Cluster size, queue types in use, and rough message rates / queue-depth expectations
- The current alert rules (if any) and the on-call team's noise tolerance
- SLOs or business expectations (max acceptable consumer lag, delivery latency)

Your job:

1. **Pick the signal metrics**: focus on the high-value series, such as `rabbitmq_queue_messages_ready`, `rabbitmq_queue_messages_unacked`, `rabbitmq_queue_consumers`, memory used against its limit (`rabbitmq_process_resident_memory_bytes` vs `rabbitmq_resident_memory_limit_bytes`), free disk (`rabbitmq_disk_space_available_bytes`), `rabbitmq_connections`, `rabbitmq_channels`, flow-control state, and the `rabbitmq_alarms_*` series.
2. **Define alert tiers**: propose page vs ticket vs info alerts for memory/disk alarm active, network partition or node down, queue depth growing unbounded, zero consumers on a live queue, and high redelivery rate.
3. **Set thresholds smartly**: prefer rate-of-change and "growing for N minutes" conditions over static depth thresholds to avoid flapping on bursts, and suggest a `for:` duration for each rule.
4. **Cover cluster health**: alert on node count drop, unsynchronized mirrors or under-replicated quorum queues, and high file-descriptor usage.
5. **Design dashboards**: group panels by node health, queue throughput, consumer lag, and resource alarms.
6. **Reduce noise**: recommend grouping and inhibition (for example, suppress queue-depth alerts while a node is down) and a runbook link for every alert.

Output: (a) metric catalog with why-it-matters, (b) tiered alert rules with thresholds and `for:`, (c) dashboard layout, (d) noise-reduction notes.
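
As an illustration of deliverable (b), the Python sketch below builds a few tiered rules and prints them in the shape of a Prometheus rule file (JSON is also valid YAML). The alert names, severities, and `for:` durations are placeholders invented for this example. The per-queue expressions assume per-queue series are being scraped and that `rabbitmq_queue_messages_ready` and `rabbitmq_queue_consumers` carry matching labels; check metric names against your own `/metrics` output before using any of it.

```python
import json
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str      # hypothetical alert name
    expr: str      # PromQL expression (assumed metric names, verify locally)
    hold: str      # value for the rule's `for:` field
    severity: str  # "page", "ticket", or "info"
    summary: str

    def to_dict(self) -> dict:
        # Field names follow the Prometheus alerting-rule layout.
        return {
            "alert": self.name,
            "expr": self.expr,
            "for": self.hold,
            "labels": {"severity": self.severity},
            "annotations": {"summary": self.summary},
        }


# Illustrative rules only: every threshold and duration is a placeholder.
RULES = [
    AlertRule(
        name="RabbitMQMemoryAlarmActive",
        expr="rabbitmq_alarms_memory_used_watermark == 1",
        hold="1m",
        severity="page",
        summary="Memory alarm active; publishers are blocked",
    ),
    AlertRule(
        name="RabbitMQQueueDepthGrowing",
        # Positive slope over 10 minutes, sustained, instead of a static depth.
        expr="deriv(rabbitmq_queue_messages_ready[10m]) > 0",
        hold="15m",
        severity="ticket",
        summary="Ready messages have been growing for 15 minutes",
    ),
    AlertRule(
        name="RabbitMQQueueNoConsumers",
        expr="rabbitmq_queue_messages_ready > 0 and rabbitmq_queue_consumers == 0",
        hold="10m",
        severity="ticket",
        summary="Queue holds messages but has no consumers",
    ),
]


def rule_group(name: str, rules: list) -> dict:
    """Wrap rules in the top-level `groups:` structure of a rule file."""
    return {"groups": [{"name": name, "rules": [r.to_dict() for r in rules]}]}


if __name__ == "__main__":
    print(json.dumps(rule_group("rabbitmq", RULES), indent=2))
```

Keeping severity in a label lets Alertmanager route pages and tickets differently and gives inhibition rules something to match on.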

Validate thresholds against a few weeks of real data before paging on them.
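
One way to do that validation is to replay exported samples through a candidate rule and count how often it would have fired. The sketch below is a simplified model of `for:` semantics, not Prometheus itself: it assumes evenly spaced samples, treats each sample as one rule evaluation, and uses made-up queue-depth data and a hypothetical threshold of 1000 messages.

```python
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def count_firings(
    samples: Iterable[Sample],
    condition: Callable[[float], bool],
    hold_seconds: float,
) -> int:
    """Count how many times an alert would have fired over `samples`.

    The condition must hold at every consecutive sample for at least
    `hold_seconds` before the alert fires, and each continuous episode
    counts once. This approximates the effect of a rule's `for:` field.
    """
    firings = 0
    pending_since = None  # timestamp at which the condition became true
    fired = False         # whether the current episode has already fired
    for ts, value in samples:
        if condition(value):
            if pending_since is None:
                pending_since = ts
            if not fired and ts - pending_since >= hold_seconds:
                firings += 1
                fired = True
        else:
            pending_since = None
            fired = False
    return firings


# Synthetic history at 60 s resolution: two 2-minute bursts, then a
# 20-minute sustained backlog. Real data would come from your metrics store.
depths: List[float] = (
    [100] * 5 + [5000] * 2 + [100] * 5 + [5000] * 2
    + [100] * 5 + [5000] * 20 + [100] * 5
)
history: List[Sample] = [(i * 60.0, d) for i, d in enumerate(depths)]


def over_threshold(value: float) -> bool:
    return value > 1000  # hypothetical static depth threshold


if __name__ == "__main__":
    # Without a hold, every burst fires; with a 10-minute hold, only the
    # sustained backlog does.
    print(count_firings(history, over_threshold, 0))    # prints 3
    print(count_firings(history, over_threshold, 600))  # prints 1
```

Comparing the two counts over a few weeks of history gives a rough estimate of how much paging noise a given `for:` duration would remove.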

Related prompts

RabbitMQ Memory & Disk Alarm Resource-Limit Triage Prompt

Triage a RabbitMQ memory or disk-free alarm that has blocked publishers cluster-wide, find what is consuming the resource, and recover safely without dropping messages.
