An oslo.messaging MessagingTimeout, the classic RabbitMQ RPC timeout in OpenStack,
means one service made an RPC call, waited up to
rpc_response_timeout (60 seconds by default), and never received its reply. The critical thing to
internalize before you touch anything: the service that logs the timeout is the caller. When
nova-api or cinder-api throws a MessagingTimeout, that service is usually
healthy — it asked a question and got silence back. The actual fault lives on the other side of the queue: the
callee that should have answered, the queue that isn’t being drained, or the
broker that dropped the connection.
That single distinction is what separates a five-minute fix from an hour of restarting the wrong things. An
OpenStack RPC timeout is not "the API is broken"; it is "the reply never came." So the workflow is
always the same: read the log to identify caller and callee, check RabbitMQ cluster health and the specific queue
that should carry the reply, and only then decide what to restart. This guide is the runbook I use for exactly that,
and it pairs with the deeper dives on troubleshooting RabbitMQ
in OpenStack and the MessagingTimeout error
reference.
How RPC works in OpenStack
OpenStack services don’t call each other over HTTP for internal work — they use RPC over
oslo.messaging, which in almost every production deployment is backed by RabbitMQ
(AMQP 0-9-1). Every service is configured with a transport_url in its config
(e.g. rabbit://openstack:pass@ctrl1:5672,openstack:pass@ctrl2:5672,openstack:pass@ctrl3:5672/) that
tells it how to reach the broker and, critically, which nodes to fail over between.
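A quick way to see whether a service can fail over at all is to count the broker endpoints in its transport_url. The sketch below is a minimal helper under stated assumptions: the function name count_transport_hosts is hypothetical, and it assumes the URL follows the comma-separated form shown above with any special characters in the password URL-encoded.

```shell
# Print how many broker endpoints a transport_url lists.
# Hypothetical helper; assumes the rabbit://user:pass@host:port,.../vhost form.
count_transport_hosts() {
  url=$1
  endpoints=${url#*://}        # drop the scheme (rabbit://)
  endpoints=${endpoints%%/*}   # drop the trailing /vhost part
  # One endpoint per line, then count the non-empty lines.
  printf '%s\n' "$endpoints" | tr ',' '\n' | grep -c .
}
```

Paste in the value from a service's config: a result of 1 means that service has no broker to fail over to.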
On the wire, oslo.messaging uses topic exchanges to route requests to the right
service queue, and per-caller direct/reply queues to carry answers back. There are two call types:
- cast: fire-and-forget. The caller publishes and moves on; no reply is expected. A lost cast fails silently, not with a timeout.
- call: request/reply. The caller publishes to the service topic, then blocks on its reply queue waiting for the answer. This is the one that produces MessagingTimeout when the reply doesn’t arrive within rpc_response_timeout.
Underneath all of this, AMQP heartbeats keep each connection alive: the client and broker exchange small frames every few seconds so a dead peer is detected quickly. If a service is too busy to send heartbeats — or the network drops them — RabbitMQ decides the client is gone and closes the connection, taking any in-flight reply with it. That is the mechanism behind RabbitMQ missed heartbeats in OpenStack, and it’s a common, sneaky source of RPC timeouts.
Symptoms
You are probably here because you are seeing one or more of these:
- MessagingTimeout: Timed out waiting for a reply to message ID ... in a service log, the oslo.messaging timeout signature.
- Operations hang then fail: server create stuck in BUILD, volumes stuck creating, Neutron ports stuck BUILD/DOWN, Heat stacks stuck CREATE_IN_PROGRESS.
- RabbitMQ logs missed heartbeats from client, timeout: 60s, followed by connection churn.
- Blocked connections: publishers stall and clients report the connection is blocked, a sign the broker has hit a memory or disk watermark.
- The failure spans multiple services: Nova, Cinder, and Neutron all degrade at once because they share the broker.
Likely causes
A Nova, Cinder, or Neutron RPC timeout traces back to one of these, roughly in order of frequency:
- Dead or slow consumer (the callee). The service that should answer the call is crashed, hung, or so overloaded it never publishes the reply. The caller times out.
- Queue backlog with consumers=0. Messages pile up on a service queue that has no attached consumer: the callee lost its subscription and nothing is draining the queue.
- Missed heartbeats. An overloaded event loop, packet loss, or an aggressive heartbeat_timeout_threshold causes RabbitMQ to close the connection mid-reply.
- Blocked connections (backpressure). RabbitMQ hit its memory (vm_memory_high_watermark) or disk (disk_free_limit) alarm and blocked publishers; see RabbitMQ queue backpressure and flow control.
- Cluster partition. A network partition split the RabbitMQ cluster; queues become unavailable or inconsistent and RPCs time out unpredictably.
- Single-node transport_url. Only one broker node is listed, so when it hiccups there is no failover and every RPC stalls.
- rpc_response_timeout too low for a genuinely slow operation (a huge scheduler pass or volume delete). This is the least common real cause.
Immediate checks
Ninety seconds of triage tells you which side of the queue to chase. First, read the failing log and identify the caller and the callee — the timeout message names both the service that logged it and, usually, the target method/topic:
# The caller is whatever container logged this; note the target method/topic
docker logs --tail=120 nova_conductor 2>&1 | grep -Ei "MessagingTimeout|reply to message"
docker logs --tail=120 cinder_scheduler 2>&1 | grep -Ei "MessagingTimeout|reply to message"
# Typical line names the message ID and the unanswered call:
# MessagingTimeout: Timed out waiting for a reply to message ID abc123
# -> caller = this service, callee = the topic/host it was calling
Whoever logs the timeout is the caller. The method or topic in the traceback tells you which callee (and therefore which queue) never answered.
Then confirm the broker itself is healthy before you blame any service, and find backed-up or consumer-less queues:
docker exec rabbitmq rabbitmqctl cluster_status
docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
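| awk '$2 ~ /^[0-9]+$/ && ($2>100 || $3==0) {print}'
The numeric test on $2 skips rabbitmqctl's banner and column-header rows, which would otherwise slip through the filter. A partition in cluster_status changes everything (fix the broker first). Any queue with messages>100 or consumers==0 is your prime suspect.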
Diagnostic commands
All read-only. Run these before restarting anything so you restart the right, narrowest component.
1. RabbitMQ cluster health
docker exec rabbitmq rabbitmqctl cluster_status
# Look for the "partitions" section — it must be empty
docker exec rabbitmq rabbitmqctl list_alarms # memory / disk / file-descriptor alarms
docker exec rabbitmq rabbitmqctl -q eval 'rabbit_node_monitor:partitions().'
Any non-empty partitions list or a raised alarm means the broker is the fault. Do NOT blind-restart a partitioned cluster; jump to the fixes section.
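If you want the partition check to gate a script, the eval output can be reduced to a pass/fail. This is a minimal sketch: partitions_clear is a hypothetical helper name, and it assumes the eval command prints an empty Erlang list ([]) when no partitions exist.

```shell
# Succeeds only when the piped-in partitions() output is the empty list.
# Hypothetical helper; assumes "[]" is printed when the cluster is not partitioned.
partitions_clear() {
  result=$(tr -d '[:space:]')   # read stdin and drop newlines/spaces
  [ "$result" = "[]" ]
}
```

Pipe the output of the rabbit_node_monitor:partitions() command into partitions_clear and stop with a "fix the broker first" message when it fails.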
2. Queue depth
docker exec rabbitmq rabbitmqctl list_queues name messages messages_ready consumers \
| awk '$2 ~ /^[0-9]+$/ && ($2>100 || $4==0) {print}'
# messages = total in the queue (ready + unacknowledged)
# messages_ready = ready for delivery, not yet handed to a consumer
# consumers = attached consumers; 0 means nothing is draining it
messages climbing with consumers==0 means the callee lost its subscription. messages high with consumers>0 means the consumer is too slow, not absent.
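The two readings of queue depth can be made explicit by labelling each suspect row. The following is a sketch, not a definitive tool: classify_queues and its NO_CONSUMER / SLOW_CONSUMER labels are hypothetical, the default threshold of 100 is simply the one used in the filter above, and the input is assumed to be the four-column list_queues output (name, messages, messages_ready, consumers).

```shell
# Label suspect queues from "name messages messages_ready consumers" rows on stdin.
# Hypothetical helper; optional first argument overrides the backlog threshold.
classify_queues() {
  awk -v limit="${1:-100}" '
    $2 !~ /^[0-9]+$/ { next }                        # skip banner and header rows
    $4 == 0          { print $1, "NO_CONSUMER", $2; next }   # nothing is draining it
    $2 > limit       { print $1, "SLOW_CONSUMER", $2 }       # attached but falling behind
  '
}
```

Pipe the list_queues command from this step into classify_queues: NO_CONSUMER rows point at a callee to restart, SLOW_CONSUMER rows at a callee to scale.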
3. Missed heartbeat checks
docker logs --tail=300 rabbitmq 2>&1 | grep -Ei "missed heartbeats from client"
# Each line carries the peer address, e.g.:
# missed heartbeats from client, timeout: 60s ... connection <0.123.0> (10.0.0.14:54432 -> ...)
# Map the peer_host back to the offending service:
docker exec rabbitmq rabbitmqctl list_connections peer_host name state \
| grep -Ei "10\.0\.0\.14|blocked|flow"
Correlate the peer_host in the heartbeat log with list_connections to name the service dropping heartbeats; that box is CPU- or network-starved.
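When many connections are churning, it helps to rank the peers instead of reading the log by eye. This sketch assumes each relevant log line carries the peer in the "(ip:port -> ...)" form shown in the sample above; heartbeat_peers is a hypothetical helper name, and it only handles IPv4 addresses.

```shell
# Count missed-heartbeat peers from broker log lines on stdin.
# Hypothetical helper; assumes the "(ip:port -> ...)" form and IPv4 peers.
heartbeat_peers() {
  grep -Eo '\(([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]+ ->' \
    | sed -E 's/^\(//; s/:[0-9]+ ->$//' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $2, $1 }'      # print "peer_ip count", busiest first
}
```

Feed it only the heartbeat-related lines. If your broker writes the peer address on the line before the "missed heartbeats" message, filter with grep -B1 first so both lines reach the helper.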
4. Blocked connection checks
docker exec rabbitmq rabbitmqctl list_connections name peer_host state \
| grep -Ei "blocked|blocking"
docker exec rabbitmq rabbitmqctl status | grep -A6 -Ei "alarms|memory|disk_free"
state = blocked/blocking means the broker hit a memory or disk watermark and paused publishers. This is backpressure: free resources, don't just restart.
5. Consumer / publisher checks
# Is anything consuming the queue the caller was talking to?
docker exec rabbitmq rabbitmqctl list_consumers | grep -Ei "cinder-scheduler|nova_conductor|q-l3-plugin"
docker exec rabbitmq rabbitmqctl list_connections name peer_host user state channels No consumer row for the expected queue confirms the callee isn't attached — restart the callee so it re-subscribes.
6. Kolla-Ansible RabbitMQ container checks
docker ps --filter name=rabbitmq --format 'table {{.Names}}\t{{.Status}}'
docker logs --tail=200 rabbitmq 2>&1 \
| grep -Ei "missed heartbeats|partition|closing|alarm|memory resource limit|disk"
A rabbitmq container flapping (restart count climbing) or logging alarms is a broker-level fault, not a service-level one.
Fix & remediation steps
Use this decision tree to map what you found to the smallest safe action:
Broker partitioned or alarm raised?
-> YES: fix the broker FIRST. Do NOT blind-restart services or the cluster.
Recover the partition per policy, clear the memory/disk alarm, escalate.
-> NO: continue.
Target queue has consumers == 0?
-> YES: the callee lost its subscription. Restart the CALLEE service
(it re-subscribes and drains the backlog).
Queue backed up but consumers > 0?
-> The consumer is too slow, not absent. Check its CPU/load, scale it out
(add workers/replicas). Do NOT restart RabbitMQ.
Only missed heartbeats (no backlog, no alarm)?
-> Restart the affected service to get a clean connection. If it recurs,
tune heartbeat_timeout_threshold; check that host for CPU/network saturation.
Genuinely slow-but-healthy op (large delete/scheduler pass)?
-> Consider a deliberate, documented rpc_response_timeout bump on that service only
(the option is set per service in its config, not per call).
Work top-down. The broker check gates everything else: you never restart services while the cluster is partitioned.
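The same tree can be written as a small function so the ordering of the checks is unambiguous. This is an illustrative sketch only: next_action, its argument order, and the action labels it prints are all hypothetical, and the backlog threshold of 100 is borrowed from the queue filter used earlier.

```shell
# Map triage findings to the smallest safe action, in the tree's top-down order.
# Args: broker_fault (yes/no), consumers, messages, missed_heartbeats (yes/no).
# Hypothetical helper; labels are illustrative, not part of any tool.
next_action() {
  broker_fault=$1 consumers=$2 messages=$3 heartbeats=$4
  if [ "$broker_fault" = yes ]; then
    echo "fix-broker-first"             # partition or alarm gates everything else
  elif [ "$consumers" -eq 0 ]; then
    echo "restart-callee"               # callee lost its subscription
  elif [ "$messages" -gt 100 ]; then
    echo "scale-consumer"               # attached but too slow
  elif [ "$heartbeats" = yes ]; then
    echo "restart-affected-service"     # clean connection, then tune if it recurs
  else
    echo "review-rpc-response-timeout"  # slow-but-healthy operation
  fi
}
```

For example, next_action no 0 250 no reports restart-callee, while the same queue with a partitioned broker reports fix-broker-first.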
Once you’ve identified the callee, restart the narrowest thing that re-subscribes it. In Kolla-Ansible these are single containers:
# Restart only the consumer that lost its subscription:
docker restart cinder_scheduler # Cinder RPC timeouts / volumes stuck creating
docker restart nova_conductor # Nova conductor timeouts / server create stuck BUILD
docker restart neutron_l3_agent # L3 RPC timeouts / routers + floating IPs down
docker restart heat_engine # Heat stacks stuck IN_PROGRESS
# RabbitMQ is the LAST resort, and only when the broker is the confirmed fault.
# Single node at a time — never the whole cluster at once:
docker restart rabbitmq
Restart the service, then re-check the queue: consumers should return to >0 and messages should start draining toward 0.
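Re-checking by hand after a restart is easy to forget mid-incident, so a short polling loop can do it. This is a sketch under assumptions: wait_for_consumers and list_queue_rows are hypothetical names, the container is assumed to be called rabbitmq as in the commands above, and the defaults of 12 checks at 5-second intervals are arbitrary.

```shell
# Hypothetical wrapper: prints "name messages consumers" rows from the broker.
list_queue_rows() {
  docker exec rabbitmq rabbitmqctl -q list_queues name messages consumers
}

# Poll until the named queue has at least one consumer.
# Args: queue name, number of checks (default 12), seconds between checks (default 5).
wait_for_consumers() {
  queue=$1 tries=${2:-12} delay=${3:-5}
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    consumers=$(list_queue_rows | awk -v q="$queue" '$1 == q { print $3; exit }')
    if [ "${consumers:-0}" -gt 0 ]; then
      echo "$queue consumers=$consumers after $attempt check(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "$queue still has no consumers after $tries checks" >&2
  return 1
}
```

Run it right after the restart, for example wait_for_consumers cinder-scheduler. A non-zero exit means the callee never re-subscribed and the restart did not fix the fault.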
See the sibling runbooks for the two most common callees: neutron-l3-agent dead / XXX state and Cinder scheduler timeout. For a queue that keeps refilling, the deeper walkthrough is diagnosing RabbitMQ queue buildup with AI.
Service-specific symptoms
Map the timeout to the affected service to jump straight to the prime-suspect queue:
| Service | What the timeout looks like | Prime suspect queue / target |
|---|---|---|
| Nova | Server create stuck in BUILD; nova-conductor logs MessagingTimeout | conductor / compute fanout (conductor ↔ compute RPC) |
| Cinder | Volume stuck creating; cinder-scheduler RPC timeout | cinder-scheduler / cinder-volume.<host> |
| Neutron | Ports stuck BUILD; agents show XXX; server logs RPC timeout to agents | q-l3-plugin / q-plugin (server ↔ agent RPC) |
| Heat | Stacks stuck CREATE_IN_PROGRESS; heat-engine RPC timeout | engine / engine_worker |
Validation steps
Don’t declare victory on one successful call. Confirm the RPC path is genuinely healthy:
- Queues draining: the previously backed-up queue trends toward messages ~0 with consumers > 0.
- No new timeouts: tail the caller’s log and confirm there are no fresh MessagingTimeout lines.
- Real op per affected service: boot and delete a tiny instance, create and delete a small volume, create a test port, create and delete a trivial stack.
- No blocked connections and no raised alarms on the broker.
docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
| awk '$2 ~ /^[0-9]+$/ && ($2>100 || $3==0) {print}' # should print nothing
docker exec rabbitmq rabbitmqctl list_connections state | grep -c blocked # expect 0
openstack compute service list -f value -c State | sort | uniq -c
openstack volume service list -f value -c State | sort | uniq -c
openstack network agent list -f value -c Alive | sort | uniq -c
An empty queue-filter output, zero blocked connections, and all services up/alive means the RPC path recovered.
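To turn the three service listings into a single pass/fail, the State and Alive values can be checked mechanically. This is a minimal sketch: all_up is a hypothetical helper, and it assumes the value formatter prints up for healthy services and True (or the ":-)" marker on some client versions) for live agents.

```shell
# Succeeds only if stdin has at least one value and every value is healthy.
# Hypothetical helper; treats "up", "True" and ":-)" as healthy (assumption).
all_up() {
  awk '
    NF == 0 { next }                       # ignore blank lines
    {
      total++
      v = tolower($1)
      if (v != "up" && v != "true" && v != ":-)") bad++
    }
    END { exit ((total > 0 && bad == 0) ? 0 : 1) }
  '
}
```

Pipe each listing into it without the sort and uniq steps, for example the compute service State column. Empty input counts as a failure, so a CLI error is not mistaken for a healthy cloud.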
Prevention
- Alert on the leading indicators, not the timeout: per-queue depth, consumers==0 on service queues, missed heartbeats in the broker log, and blocked connections. A MessagingTimeout should never be your first signal; see monitoring OpenStack with Prometheus.
- List every broker node in transport_url on every service so a single-node hiccup fails over instead of stalling RPC.
- Tune heartbeats sensibly: keep heartbeat_timeout_threshold generous enough for a busy event loop; never disable heartbeats to "fix" churn.
- Right-size workers (*_workers / replicas) so consumers keep up under real concurrency and queues don’t back up.
- Watch memory and disk watermarks so the broker never blocks publishers; the mechanics are in RabbitMQ backpressure and flow control.
- Use quorum queues where appropriate for durability and cleaner partition behavior than classic mirrored queues.
- Turn recurring incidents into reusable runbooks and use the free AI Incident Response assistant to draft triage steps fast.