AI for OpenStack | Difficulty: Intermediate | Claude, ChatGPT

RabbitMQ oslo.messaging RPC Timeout Debug Prompt

Triage MessagingTimeout and lost-reply errors between OpenStack services, distinguishing slow workers, broken reply queues, broker overload, and oslo.messaging misconfiguration.

Target user: OpenStack operators running private clouds
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack operator who has root-caused dozens of MessagingTimeout incidents and understands the full oslo.messaging RPC path: the caller, the broker, the consumer on the far end, and the reply queue back to the caller.

I will provide:
- The error: `MessagingTimeout`, `MessagingException`, or "reply queue ... wait for a reply" tracebacks, along with the service name and request ID
- The relevant `[DEFAULT]` and `[oslo_messaging_rabbit]` config (rpc_response_timeout, heartbeat settings, transport_url, pool sizes)
- RabbitMQ state: `rabbitmqctl list_queues name messages consumers`, `rabbitmqctl cluster_status`, connection and channel counts, and broker logs

Your job:

1. **Identify which leg times out**: the caller cannot publish, the message sits unconsumed in the target queue, the worker is slow, or the reply never arrives on the reply_* queue.
2. **Check the consumer side**: confirm the target service has live consumers on its queue and is not blocked on the DB, a lock, or a downstream call that takes longer than rpc_response_timeout.
3. **Inspect the broker**: look for flow control or blocked connections, memory or disk alarms, queue backlog, and unacked-message pileups that stall delivery.
4. **Audit oslo.messaging tuning**: evaluate rpc_response_timeout, heartbeat_timeout_threshold, and the connection-pool and executor settings against the workload.
5. **Separate transient from structural**: decide whether this is a one-off slow operation, a broker resource alarm, a partitioned cluster, or a chronically undersized worker pool.
6. **Recommend fixes in order**: least disruptive first (restart the stuck consumer, clear an alarm) before broker-wide or timeout changes, and note what each fix does not solve.

Output as: a per-leg diagnosis, the most likely root cause with evidence, a prioritized fix list, and the specific commands to confirm the fix held.

If the broker shows a memory/disk alarm or partition, treat that as the prime suspect and avoid masking it by simply raising timeouts.
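
Before pasting the queue listing into the prompt above, it can help to pre-screen it so the obvious suspects stand out. The following is a minimal Python sketch, not part of the prompt and not an oslo.messaging or RabbitMQ API. It assumes the tab-separated output of `rabbitmqctl list_queues name messages consumers`, skips banner and header lines, and sorts queues into rough buckets that mirror steps 1 and 2. The bucket names, the `backlog_threshold` of 100, and the queue names in `SAMPLE` are all illustrative assumptions.

```python
from __future__ import annotations

from typing import NamedTuple


class QueueStat(NamedTuple):
    name: str
    messages: int
    consumers: int


def parse_list_queues(text: str) -> list[QueueStat]:
    """Parse tab-separated `rabbitmqctl list_queues name messages consumers` output."""
    stats = []
    for line in text.splitlines():
        fields = line.strip().split("\t")
        # Skip banner lines and the header row: only keep rows that look like
        # <name> <int> <int>.
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        stats.append(QueueStat(fields[0], int(fields[1]), int(fields[2])))
    return stats


def classify(q: QueueStat, backlog_threshold: int = 100) -> str:
    """Return a rough triage label; the labels and threshold are assumptions."""
    if q.consumers == 0 and q.messages > 0:
        # Messages are waiting and nothing is listening.
        return "orphaned-reply" if q.name.startswith("reply_") else "no-consumer"
    if q.messages >= backlog_threshold:
        # Consumers exist but are not keeping up (slow worker or broker stall).
        return "backlog"
    return "ok"


# Made-up sample data for illustration only.
SAMPLE = (
    "Listing queues for vhost / ...\n"
    "name\tmessages\tconsumers\n"
    "conductor\t0\t4\n"
    "compute.node-17\t42\t0\n"
    "reply_abc123\t3\t0\n"
    "scheduler\t250\t2\n"
)

if __name__ == "__main__":
    for q in parse_list_queues(SAMPLE):
        print(f"{q.name}: {classify(q)} (messages={q.messages}, consumers={q.consumers})")
```

A "no-consumer" or "orphaned-reply" label points at a dead or restarted service rather than a timeout that is too low, while "backlog" only says the consumers are behind; it does not distinguish a slow worker from a broker alarm, so the broker checks in step 3 still apply.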