RabbitMQ Cluster Capacity & Sizing Review Prompt
Right-size a RabbitMQ cluster's node count, memory/disk headroom, file descriptors, and Erlang scheduler settings against measured publish/consume rates and queue depth before scaling or a traffic event.
- Target user: Infrastructure and capacity planning engineers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior RabbitMQ capacity engineer producing a sizing review, not a live change.

I will provide:
- Per-node specs (vCPU, RAM, disk type/size) and node count
- Output of `rabbitmqctl status` and `rabbitmq-diagnostics memory_breakdown`, plus the current `vm_memory_high_watermark` and `disk_free_limit`
- Steady-state and peak publish/consume rates (msg/s and bytes/s), average message size, and typical/peak queue depth
- Queue types in use (classic, quorum, streams), connection/channel counts, and the file-descriptor limit (`ulimit -n`) alongside the usage reported by `rabbitmqctl status`
- Any planned growth or traffic-spike multiplier

Your job:
1. **Compute memory budget**: estimate RAM needed for queued messages, connections/channels, and binary/metadata overhead; compare against the high-watermark headroom and flag where an alarm would trip at peak.
2. **Size disk**: project disk growth for persistent messages and quorum/stream segments, and validate that `disk_free_limit` leaves room before the disk alarm blocks publishers.
3. **Check FD/socket limits**: verify that file descriptor and Erlang process limits cover peak connections + channels + queues with margin.
4. **Assess node count & placement**: advise on adding nodes vs. scaling up, the cost of the quorum-queue replication factor, and spreading across AZs without cross-AZ replication surprises.
5. **Tune Erlang**: note scheduler binding (`+sbt`) and scheduler count (`+S`) settings relevant to the CPU count.
6. **Define guardrails**: give target alarm thresholds and the metrics to watch (memory, disk, FDs, queue depth, flow-control state).

Output: (a) per-resource sizing table with current vs. recommended, (b) risk list at peak, (c) scale-up vs. scale-out recommendation, (d) monitoring guardrails.

Treat all numbers as estimates to validate with a load test; do not change watermarks in production without a tested rollback.
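The memory and disk checks in steps 1 and 2 of the prompt come down to simple arithmetic, and it can help to sanity-check a model's answer against a rough calculation of your own. The Python sketch below is one way to do that, not a RabbitMQ formula. The function names and the per-message, per-connection, per-channel, and baseline overhead parameters are hypothetical inputs that you would replace with values measured from `rabbitmq-diagnostics memory_breakdown`. It also assumes every queued message body is resident in memory, which is pessimistic for queue types that keep most data on disk.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEstimate:
    limit_bytes: float      # absolute threshold implied by a relative watermark
    estimated_bytes: float  # projected usage at peak
    headroom_bytes: float   # negative means the memory alarm would trip
    alarm_at_peak: bool


def estimate_memory(ram_bytes, watermark, peak_depth, avg_msg_bytes,
                    connections, channels, per_msg_overhead_bytes,
                    per_conn_bytes, per_channel_bytes, baseline_bytes):
    """Rough per-node memory budget; every overhead argument is an assumption."""
    limit = ram_bytes * watermark
    # Worst case: all queued bodies plus per-message metadata held in RAM.
    queued = peak_depth * (avg_msg_bytes + per_msg_overhead_bytes)
    clients = connections * per_conn_bytes + channels * per_channel_bytes
    estimated = queued + clients + baseline_bytes
    return MemoryEstimate(limit, estimated, limit - estimated, estimated >= limit)


def seconds_until_disk_alarm(free_bytes, disk_free_limit_bytes,
                             publish_bytes_per_s, consume_bytes_per_s):
    """Time until free disk falls to disk_free_limit at a constant net growth rate."""
    growth = publish_bytes_per_s - consume_bytes_per_s
    if growth <= 0:
        return math.inf  # consumers keep up, so the backlog does not grow
    return max(free_bytes - disk_free_limit_bytes, 0) / growth


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Illustrative placeholder numbers only, not RabbitMQ constants.
    mem = estimate_memory(
        ram_bytes=16 * GIB, watermark=0.4,
        peak_depth=2_000_000, avg_msg_bytes=2048,
        connections=5_000, channels=10_000,
        per_msg_overhead_bytes=512, per_conn_bytes=100 * 1024,
        per_channel_bytes=30 * 1024, baseline_bytes=1 * GIB,
    )
    print(mem)
    print(seconds_until_disk_alarm(200 * GIB, 50 * GIB, 40_000_000, 25_000_000))
```

A negative `headroom_bytes` marks a node where the memory alarm would likely trip at peak. As the prompt itself says, treat the result as an estimate to confirm with a load test.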