RabbitMQ oslo.messaging RPC Timeout Debug Prompt
Triage MessagingTimeout and lost-reply errors between OpenStack services, distinguishing slow workers, broken reply queues, broker overload, and oslo.messaging misconfiguration.
- Target user: OpenStack operators running private clouds
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior OpenStack operator who has root-caused dozens of MessagingTimeout incidents and understands the full oslo.messaging RPC path: caller, the broker, reply queues, and the slow consumer on the far end.

I will provide:
- The error: `MessagingTimeout`, `MessagingException`, or "reply queue ... wait for a reply" tracebacks, with the service and request-id
- The relevant `[DEFAULT]`/`[oslo_messaging_rabbit]` config (rpc_response_timeout, heartbeat, transport_url, pool sizes)
- RabbitMQ state: `rabbitmqctl list_queues name messages consumers`, `cluster_status`, and connection/channel counts plus broker logs

Your job:
1. **Identify which leg times out**: caller cannot publish, message sits unconsumed in the target queue, the worker is slow, or the reply never returns on the reply_* queue.
2. **Check the consumer side**: confirm the target service has live consumers on its queue and is not blocked on the DB, a lock, or a downstream call slower than rpc_response_timeout.
3. **Inspect the broker**: look for flow control / blocked connections, memory or disk alarms, queue backlog, and unacked-message pileups that stall delivery.
4. **Audit oslo.messaging tuning**: evaluate rpc_response_timeout, heartbeat_timeout_threshold, connection pool and executor settings against the workload.
5. **Separate transient from structural**: decide whether this is a one-off slow operation, a broker resource alarm, a partitioned cluster, or a chronically undersized worker pool.
6. **Recommend fixes in order**: least-disruptive first (restart the stuck consumer, clear an alarm) before broker-wide or timeout changes, and note what each fix does not solve.

Output as: a per-leg diagnosis, the most likely root cause with evidence, a prioritized fix list, and the specific commands to confirm the fix held.

If the broker shows a memory/disk alarm or partition, treat that as the prime suspect and avoid masking it by simply raising timeouts.
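Outside the prompt itself, it can help to pre-screen the `rabbitmqctl list_queues name messages consumers` output before pasting it in, so the per-leg diagnosis starts from the suspicious queues. The following is a minimal sketch, not an official tool. It assumes whitespace-separated columns in the order name, messages, consumers, and queue names without spaces. The helper names (`parse_list_queues`, `classify`, `flag_queues`) and the backlog threshold of 100 are hypothetical choices for illustration, and the hints it prints are starting points rather than conclusions.

```python
from typing import NamedTuple, Optional


class QueueStat(NamedTuple):
    name: str
    messages: int
    consumers: int


def parse_list_queues(text: str) -> list[QueueStat]:
    """Parse output of `rabbitmqctl list_queues name messages consumers`."""
    stats = []
    for line in text.splitlines():
        # Split off the last two columns; banner and header rows are
        # skipped because their last two fields are not integers.
        parts = line.rsplit(None, 2)
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        stats.append(QueueStat(parts[0], int(parts[1]), int(parts[2])))
    return stats


def classify(stat: QueueStat, backlog_threshold: int = 100) -> Optional[str]:
    """Map one queue to a triage hint, or None if nothing stands out.

    backlog_threshold is an arbitrary illustrative value; tune it per cloud.
    """
    if stat.name.startswith("reply_") and stat.consumers == 0:
        return "reply queue has no consumer: check the reply leg / caller"
    if stat.consumers == 0 and stat.messages > 0:
        return "messages waiting with no consumer: check the target service"
    if stat.messages >= backlog_threshold:
        return "backlog despite consumers: check for slow or too few workers"
    return None


def flag_queues(text: str, backlog_threshold: int = 100) -> dict[str, str]:
    """Return {queue name: hint} for every queue that looks suspicious."""
    flagged = {}
    for stat in parse_list_queues(text):
        hint = classify(stat, backlog_threshold)
        if hint is not None:
            flagged[stat.name] = hint
    return flagged


if __name__ == "__main__":
    # Made-up sample output, for illustration only.
    sample = (
        "Listing queues for vhost / ...\n"
        "name\tmessages\tconsumers\n"
        "conductor\t250\t4\n"
        "compute.node-1\t12\t0\n"
        "reply_0a1b2c\t0\t0\n"
        "scheduler\t0\t2\n"
    )
    for name, hint in flag_queues(sample).items():
        print(f"{name}: {hint}")
```

The three hints correspond to the reply leg, the unconsumed-message leg, and the slow-worker leg named in step 1. Broker alarms and partitions do not show up in this output, so `cluster_status` and the broker logs still need to be checked separately.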