RabbitMQ RPC Timeout in OpenStack: Fix Guide

A MessagingTimeout means a service published an RPC request and never got its reply in time. The service that logs the error is the caller — the fault is almost always the callee, the queue, or the broker. This runbook localizes which one across Nova, Cinder, Neutron, and Heat.

Updated July 3, 2026 · 12 min read · Runbook-style guide · copy/paste commands

Free runbook · PDF

Download the free RabbitMQ RPC Timeout Runbook Pack

A copy/paste runbook for oslo.messaging timeouts and missed heartbeats — cluster health, queue depth, and a service restart decision tree.

  • OpenStack RPC timeout checklist
  • RabbitMQ cluster + queue depth commands
  • Consumer / publisher + heartbeat checks
  • oslo.messaging config review
  • Nova/Cinder/Neutron/Heat symptom matrix
  • Service restart decision tree + notes template

No account needed · single opt-in · we never share your email.

An oslo.messaging MessagingTimeout means one service made an RPC call, waited up to rpc_response_timeout (60 seconds by default), and never received its reply. The critical thing to internalize before you touch anything: the service that logs the timeout is the caller. When nova-api or cinder-api throws a MessagingTimeout, that service is usually healthy: it asked a question and got silence back. The actual fault lives on the other side of the queue: the callee that should have answered, the queue that isn’t being drained, or the broker that dropped the connection.

That single distinction is what separates a five-minute fix from an hour of restarting the wrong things. An OpenStack RPC timeout is not "the API is broken"; it is "the reply never came." So the workflow is always the same: read the log to identify caller and callee, check RabbitMQ cluster health and the specific queue that should carry the reply, and only then decide what to restart. This guide is the runbook I use for exactly that, and it pairs with the deeper dives on troubleshooting RabbitMQ in OpenStack and the MessagingTimeout error reference. Want it in your hand during the incident? Grab the free runbook pack above.

How RPC works in OpenStack

OpenStack services don’t call each other over HTTP for internal work — they use RPC over oslo.messaging, which in almost every production deployment is backed by RabbitMQ (AMQP 0-9-1). Every service is configured with a transport_url in its config (e.g. rabbit://openstack:pass@ctrl1:5672,openstack:pass@ctrl2:5672,openstack:pass@ctrl3:5672/) that tells it how to reach the broker and, critically, which nodes to fail over between.
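As a quick way to confirm that a transport_url really lists more than one broker, here is a minimal sketch. The helper name count_brokers is hypothetical (it is not part of any OpenStack tool), and it assumes the password contains no "/" or "," characters.

```shell
# count_brokers: hypothetical helper, not part of any OpenStack tool.
# Assumes the password contains no "/" or "," characters.
count_brokers() {
  # Drop the scheme (rabbit://) and the trailing /vhost, then count the
  # comma-separated user:pass@host:port entries that remain.
  printf '%s\n' "$1" \
    | sed -e 's|^[a-z]*://||' -e 's|/.*$||' \
    | awk -F',' '{ print NF }'
}

# Usage (URL taken from the example above):
#   count_brokers 'rabbit://openstack:pass@ctrl1:5672,openstack:pass@ctrl2:5672,openstack:pass@ctrl3:5672/'
```

A result of 1 means the service has no broker to fail over to, which is the single-node transport_url case.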

On the wire, oslo.messaging uses topic exchanges to route requests to the right service queue, and per-caller direct/reply queues to carry answers back. There are two call types:

  • cast — fire-and-forget. The caller publishes and moves on; no reply is expected. A lost cast fails silently, not with a timeout.
  • call — request/reply. The caller publishes to the service topic, then blocks on its reply queue waiting for the answer. This is the one that produces MessagingTimeout when the reply doesn’t arrive in rpc_response_timeout.

Underneath all of this, AMQP heartbeats keep each connection alive: the client and broker exchange small frames at a regular interval, derived from heartbeat_timeout_threshold (60 seconds by default in oslo.messaging), so a dead peer is detected quickly. If a service is too busy to send heartbeats, or the network drops them, RabbitMQ decides the client is gone and closes the connection, taking any in-flight reply with it. That is the mechanism behind RabbitMQ missed heartbeats in OpenStack, and it’s a common, sneaky source of RPC timeouts.

Symptoms

You are probably here because you are seeing one or more of these:

  • MessagingTimeout: Timed out waiting for a reply to message ID ... in a service log: the oslo.messaging timeout signature.
  • Operations hang then fail: server create stuck in BUILD, volumes stuck creating, Neutron ports stuck BUILD/DOWN, Heat stacks stuck CREATE_IN_PROGRESS.
  • RabbitMQ logs missed heartbeats from client, timeout: 60s, followed by connection churn.
  • Blocked connections: publishers stall and clients report the connection is blocked, a sign the broker has hit a memory or disk watermark.
  • The failure spans multiple services: Nova, Cinder, and Neutron all degrade at once because they share the broker.

Likely causes

An RPC timeout in Nova, Cinder, or Neutron traces back to one of these, roughly in order of frequency:

  • Dead or slow consumer (the callee). The service that should answer the call is crashed, hung, or so overloaded it never publishes the reply. The caller times out.
  • Queue backlog with consumers=0. Messages pile up on a service queue that has no attached consumer — the callee lost its subscription and nothing is draining the queue.
  • Missed heartbeats. An overloaded event loop, packet loss, or an aggressive heartbeat_timeout_threshold causes RabbitMQ to close the connection mid-reply.
  • Blocked connections (backpressure). RabbitMQ hit its memory (vm_memory_high_watermark) or disk (disk_free_limit) alarm and blocked publishers — see RabbitMQ queue backpressure and flow control.
  • Cluster partition. A network partition split the RabbitMQ cluster; queues become unavailable or inconsistent and RPCs time out unpredictably.
  • Single-node transport_url. Only one broker node is listed, so when it hiccups there is no failover and every RPC stalls.
  • rpc_response_timeout too low for a genuinely slow operation (a huge scheduler pass or volume delete) — the least common real cause.

Immediate checks

Ninety seconds of triage tells you which side of the queue to chase. First, read the failing log and identify the caller and the callee: the caller is the service that logged the timeout, and the surrounding traceback usually names the target method/topic:

Identify caller and callee from the log
# The caller is whatever container logged this; note the target method/topic
docker logs --tail=120 nova_conductor 2>&1 | grep -Ei "MessagingTimeout|reply to message"
docker logs --tail=120 cinder_scheduler 2>&1 | grep -Ei "MessagingTimeout|reply to message"

# Typical line names the message ID and the unanswered call:
#   MessagingTimeout: Timed out waiting for a reply to message ID abc123
# -> caller = this service, callee = the topic/host it was calling

Whoever logs the timeout is the caller. The method or topic in the traceback tells you which callee (and therefore which queue) never answered.
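To see how many distinct calls went unanswered, the message IDs can be pulled out of the log. This is a minimal sketch: timed_out_ids is a hypothetical helper name, and it assumes the log line format shown above ("... reply to message ID <id>").

```shell
# timed_out_ids: hypothetical helper. Reads service log lines on stdin and
# prints each distinct message ID that timed out, with its line count.
timed_out_ids() {
  grep -oE 'reply to message ID [0-9a-zA-Z]+' \
    | awk '{ print $NF }' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }'
}

# Usage:
#   docker logs --tail=500 nova_conductor 2>&1 | timed_out_ids
```

Many distinct IDs in a short window suggest a callee or broker problem rather than one unlucky request.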

Then confirm the broker itself is healthy before you blame any service, and find backed-up or consumer-less queues:

Broker health + backed-up queues at a glance
docker exec rabbitmq rabbitmqctl cluster_status
# The numeric test on $2 skips rabbitmqctl's banner and column-header lines
docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
  | awk '$2 ~ /^[0-9]+$/ && ($2>100 || $3==0) {print}'

A partition in cluster_status changes everything (fix the broker first). Any queue with messages>100 or consumers==0 is your prime suspect.

Diagnostic commands

All read-only. Run these before restarting anything so you restart the right, narrowest component.

1. RabbitMQ cluster health

Cluster status, alarms, and partitions
docker exec rabbitmq rabbitmqctl cluster_status
# Look for the "partitions" section — it must be empty
docker exec rabbitmq rabbitmq-diagnostics alarms      # memory / disk / file-descriptor alarms
docker exec rabbitmq rabbitmqctl -q eval 'rabbit_node_monitor:partitions().'

Any non-empty partitions list or a raised alarm means the broker is the fault. Do NOT blind-restart a partitioned cluster — jump to the fixes section.

2. Queue depth

Which queues are backed up or unconsumed
docker exec rabbitmq rabbitmqctl list_queues name messages messages_ready consumers \
  | awk '$2 ~ /^[0-9]+$/ && ($2>100 || $4==0) {print}'
# messages       = total in the queue (ready + delivered but unacknowledged)
# messages_ready = ready for delivery, not yet handed to any consumer
# consumers      = attached consumers; 0 means nothing is draining it

messages climbing with consumers==0 = the callee lost its subscription. messages high with consumers>0 = the consumer is too slow, not absent.
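The two cases above can be labelled automatically. This is a minimal sketch, assuming the four-column list_queues output used in this step; classify_queues is a hypothetical helper, and the default threshold of 100 messages simply mirrors the filter above.

```shell
# classify_queues: hypothetical helper. Reads the output of
#   rabbitmqctl list_queues name messages messages_ready consumers
# on stdin and labels suspicious queues. Optional $1 = backlog threshold.
classify_queues() {
  awk -v max="${1:-100}" '
    $2 !~ /^[0-9]+$/    { next }                         # banner / header lines
    $4 == 0 && $2 > 0   { print $1, "NO_CONSUMER";   next } # backlog, nothing draining it
    $4 > 0  && $2 > max { print $1, "SLOW_CONSUMER"; next } # consumer attached but behind
  '
}

# Usage:
#   docker exec rabbitmq rabbitmqctl list_queues name messages messages_ready consumers \
#     | classify_queues
```

Unlike the raw filter, this sketch ignores empty queues with no consumer, since only a backlog that nobody drains points at a lost subscription.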

3. Missed heartbeat checks

Find missed heartbeats and map them to a service
docker logs --tail=300 rabbitmq 2>&1 | grep -B1 -Ei "missed heartbeats from client"
# RabbitMQ usually logs the peer address on the "closing AMQP connection" line
# just before the heartbeat message (hence -B1), e.g.:
#   closing AMQP connection <0.123.0> (10.0.0.14:54432 -> ...):
#   missed heartbeats from client, timeout: 60s
# Map the peer_host back to the offending service:
docker exec rabbitmq rabbitmqctl list_connections peer_host name state \
  | grep -Ei "10.0.0.14|blocked|flow"

Correlate the peer_host in the heartbeat log with list_connections to name the service dropping heartbeats — that box is CPU- or network-starved.
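When the broker log contains many heartbeat drops, counting them per peer address shows which host to look at first. This is a minimal sketch: heartbeat_offenders is a hypothetical helper, and it assumes the peer appears as "(ip:port -> ...)" either on the heartbeat line itself or on the line just before it.

```shell
# heartbeat_offenders: hypothetical helper. Reads RabbitMQ log lines on stdin
# and prints "peer_ip count" for every missed-heartbeat event, busiest first.
heartbeat_offenders() {
  awk '
    # Remember the most recent peer address seen as "(ip:port ->".
    match($0, /\(([0-9]+\.)+[0-9]+:[0-9]+ ->/) {
      peer = substr($0, RSTART + 1, RLENGTH - 1)   # drop the leading "("
      sub(/:[0-9]+ ->$/, "", peer)                 # drop ":port ->"
    }
    # Attribute the heartbeat drop to that peer, then forget it.
    /missed heartbeats from client/ && peer != "" { count[peer]++; peer = "" }
    END { for (p in count) print p, count[p] }
  ' | sort -k2,2nr -k1,1
}

# Usage:
#   docker logs --tail=2000 rabbitmq 2>&1 | heartbeat_offenders
```

The address at the top of the list is the one to feed into the list_connections lookup.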

4. Blocked connection checks

Publishers blocked by backpressure
docker exec rabbitmq rabbitmqctl list_connections name peer_host state \
  | grep -Ei "blocked|blocking"
docker exec rabbitmq rabbitmqctl status | grep -A6 -Ei "alarms|memory|disk_free"

state = blocked/blocking means the broker hit a memory or disk watermark and paused publishers. This is backpressure — free resources, don't just restart.

5. Consumer / publisher checks

Confirm the callee is actually subscribed
# Is anything consuming the queue the caller was talking to?
# Match on queue names (e.g. conductor), not container names (nova_conductor)
docker exec rabbitmq rabbitmqctl list_consumers | grep -Ei "cinder-scheduler|conductor|q-l3-plugin"
docker exec rabbitmq rabbitmqctl list_connections name peer_host user state channels

No consumer row for the expected queue confirms the callee isn't attached — restart the callee so it re-subscribes.

6. Kolla-Ansible RabbitMQ container checks

Container liveness and broker logs
docker ps --filter name=rabbitmq --format 'table {{.Names}}\t{{.Status}}'
docker logs --tail=200 rabbitmq 2>&1 \
  | grep -Ei "missed heartbeats|partition|closing|alarm|memory resource limit|disk"

A rabbitmq container flapping (restart count climbing) or logging alarms is a broker-level fault, not a service-level one.

Fix & remediation steps

Use this decision tree to map what you found to the smallest safe action:

RPC timeout restart decision tree
Broker partitioned or alarm raised?
  -> YES: fix the broker FIRST. Do NOT blind-restart services or the cluster.
          Recover the partition per policy, clear the memory/disk alarm, escalate.
  -> NO:  continue.

Target queue has consumers == 0?
  -> YES: the callee lost its subscription. Restart the CALLEE service
          (it re-subscribes and drains the backlog).

Queue backed up but consumers > 0?
  -> The consumer is too slow, not absent. Check its CPU/load, scale it out
     (add workers/replicas). Do NOT restart RabbitMQ.

Only missed heartbeats (no backlog, no alarm)?
  -> Restart the affected service to get a clean connection. If it recurs,
     tune heartbeat_timeout_threshold; check that host for CPU/network saturation.

Genuinely slow-but-healthy op (large delete/scheduler pass)?
  -> Consider a deliberate, documented rpc_response_timeout bump for that call only.

Work top-down. The broker check gates everything else — you never restart services while the cluster is partitioned.
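The same tree can be written as a small function, which helps keep the order of checks honest during an incident. This is a sketch, not a supported tool: rpc_next_action and its action labels are hypothetical, the inputs are values you have already gathered from the diagnostic commands, and the backlog threshold of 100 is taken from the queue filter used earlier.

```shell
# rpc_next_action: hypothetical helper encoding the decision tree above.
# Args: partitioned(yes|no) alarm(yes|no) messages consumers missed_heartbeats(yes|no)
# Note: plain POSIX sh, so the variables below are global to the shell.
rpc_next_action() {
  partitioned=$1; alarm=$2; messages=$3; consumers=$4; heartbeats=$5
  if [ "$partitioned" = yes ] || [ "$alarm" = yes ]; then
    echo "fix-broker-first"            # never restart services on a sick broker
  elif [ "$consumers" -eq 0 ]; then
    echo "restart-callee"              # callee lost its subscription
  elif [ "$messages" -gt 100 ]; then
    echo "scale-out-consumer"          # consumer present but too slow
  elif [ "$heartbeats" = yes ]; then
    echo "restart-affected-service"    # clean connection, then check host load
  else
    echo "consider-timeout-bump"       # slow-but-healthy operation
  fi
}

# Usage:
#   rpc_next_action no no 250 0 no
```

The broker check comes first on purpose, so a partition or alarm short-circuits every service-level action.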

Once you’ve identified the callee, restart the narrowest thing that re-subscribes it. In Kolla-Ansible these are single containers:

Targeted callee restarts (Kolla-Ansible)
# Restart only the consumer that lost its subscription:
docker restart cinder_scheduler      # Cinder RPC timeouts / volumes stuck creating
docker restart nova_conductor        # Nova conductor timeouts / server create stuck BUILD
docker restart neutron_l3_agent      # L3 RPC timeouts / routers + floating IPs down
docker restart heat_engine           # Heat stacks stuck IN_PROGRESS

# RabbitMQ is the LAST resort, and only when the broker is the confirmed fault.
# Single node at a time — never the whole cluster at once:
docker restart rabbitmq

Restart the service, then re-check the queue: consumers should return to >0 and messages should start draining toward 0.
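For that re-check, a small filter that reports a single queue is easier to watch than the full listing. This is a minimal sketch: queue_state is a hypothetical helper, and it assumes the three-column list_queues output (name, messages, consumers) and an exact queue name.

```shell
# queue_state: hypothetical helper. $1 = exact queue name; stdin = output of
#   rabbitmqctl list_queues name messages consumers
queue_state() {
  awk -v q="$1" '
    $1 == q && $2 ~ /^[0-9]+$/ { print "messages=" $2, "consumers=" $3; found = 1 }
    END { if (!found) print "missing" }
  '
}

# Usage: poll the callee queue a few times after the restart.
#   for i in 1 2 3 4 5 6; do
#     docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
#       | queue_state cinder-scheduler
#     sleep 10
#   done
```

A healthy recovery shows consumers above 0 on the first or second poll and messages falling on each subsequent one.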

See the sibling runbooks for the two most common callees: neutron-l3-agent dead / XXX state and Cinder scheduler timeout. For a queue that keeps refilling, the deeper walkthrough is diagnosing RabbitMQ queue buildup with AI.

Service-specific symptoms

Map the timeout to the affected service to jump straight to the prime-suspect queue:

Service | What the timeout looks like | Prime suspect queue / target
--------|-----------------------------|-----------------------------
Nova    | Server create stuck in BUILD; nova-conductor logs MessagingTimeout | conductor / compute fanout (conductor ↔ compute RPC)
Cinder  | Volume stuck creating; cinder-scheduler RPC timeout | cinder-scheduler / cinder-volume.<host>
Neutron | Ports stuck BUILD; agents show XXX; server logs RPC timeout to agents | q-l3-plugin / q-plugin (server ↔ agent RPC)
Heat    | Stacks stuck CREATE_IN_PROGRESS; heat-engine RPC timeout | engine / engine_worker

Validation steps

Don’t declare victory on one successful call. Confirm the RPC path is genuinely healthy:

  • Queues draining: the previously backed-up queue trends toward messages ~0 with consumers > 0.
  • No new timeouts: tail the caller’s log — no fresh MessagingTimeout lines.
  • Real op per affected service: boot and delete a tiny instance, create and delete a small volume, create a test port, create and delete a trivial stack.
  • No blocked connections and no raised alarms on the broker.

Post-fix validation
docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
  | awk '$2 ~ /^[0-9]+$/ && ($2>100 || $3==0) {print}'   # should print nothing
docker exec rabbitmq rabbitmqctl list_connections state | grep -c blocked   # expect 0
openstack compute service list -f value -c State | sort | uniq -c
openstack volume service list -f value -c State | sort | uniq -c    # State is up/down; Status is only enabled/disabled
openstack network agent list -f value -c Alive | sort | uniq -c

An empty queue-filter output, zero blocked connections, and all services up/alive means the RPC path recovered.

Prevention

  • Alert on the leading indicators, not the timeout: per-queue depth, consumers==0 on service queues, missed heartbeats in the broker log, and blocked connections. A MessagingTimeout should never be your first signal — see monitoring OpenStack with Prometheus.
  • List every broker node in transport_url on every service so a single-node hiccup fails over instead of stalling RPC.
  • Tune heartbeats sensibly — keep heartbeat_timeout_threshold generous enough for a busy event loop; never disable heartbeats to "fix" churn.
  • Right-size workers (*_workers / replicas) so consumers keep up under real concurrency and queues don’t back up.
  • Watch memory and disk watermarks so the broker never blocks publishers — the mechanics are in RabbitMQ backpressure and flow control.
  • Use quorum queues where appropriate: they offer better durability and cleaner partition behavior than classic mirrored queues.
  • Turn recurring incidents into reusable runbooks and use the free AI Incident Response assistant to draft triage steps fast.
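For the config side of prevention, it helps to read the effective RPC settings straight from each service's config file. This is a minimal sketch: ini_get is a hypothetical helper, and the file path in the usage comment is an assumption about where Kolla-Ansible places generated configs on the host, so adjust it for your deployment.

```shell
# ini_get: hypothetical helper. Prints the value of KEY in [SECTION] of an
# INI-style file, or nothing if the key is not set explicitly there.
# Args: $1 = file, $2 = section, $3 = key
ini_get() {
  awk -v section="$2" -v key="$3" '
    /^[ \t]*\[/ {                          # section header such as [DEFAULT]
      gsub(/^[ \t]*\[|\][ \t]*$/, "")
      current = $0
      next
    }
    /^[ \t]*[#;]/ { next }                 # skip comment lines
    current == section {
      eq = index($0, "=")
      if (eq == 0) next
      name  = substr($0, 1, eq - 1)
      value = substr($0, eq + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", value)
      if (name == key) { print value; exit }
    }
  ' "$1"
}

# Usage (path is an assumption; adjust to your deployment):
#   ini_get /etc/kolla/nova-conductor/nova.conf DEFAULT rpc_response_timeout
#   ini_get /etc/kolla/nova-conductor/nova.conf oslo_messaging_rabbit heartbeat_timeout_threshold
```

An empty result means the option is not set in that file, so the service falls back to the library default.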

Want the always-current prompts and tools behind this workflow? Browse the AI prompt library, the free in-browser DevOps tools, and — when a production incident needs senior hands — work with me directly.

Frequently asked questions

What causes a MessagingTimeout in OpenStack?
A MessagingTimeout means a service made an oslo.messaging RPC call and never got its reply within rpc_response_timeout (60s by default). The service that logs the error is the caller; the fault is almost always elsewhere — a dead or slow consumer on the other end, a backed-up RabbitMQ queue with no consumers, missed heartbeats dropping the AMQP connection, or the broker itself blocking publishers because a memory/disk watermark was hit. Treat the log line as "the reply never came," then go find out why on the callee and broker side.
How do I fix RabbitMQ missed heartbeats in OpenStack?
RabbitMQ logs missed heartbeats from client, timeout: 60s and closes the connection; the service then reconnects but any in-flight RPC reply is lost, surfacing as a timeout. Root causes are an overloaded service event loop (green threads starved so heartbeats aren’t sent), packet loss/latency on the AMQP path, or an overly aggressive heartbeat. Restart the affected service so it re-establishes a clean connection, confirm the box isn’t CPU- or network-saturated, and if it recurs, tune [oslo_messaging_rabbit] heartbeat_timeout_threshold and heartbeat_rate rather than disabling heartbeats.
Should I restart RabbitMQ or the OpenStack service?
Restart the consumer, not the messenger. RabbitMQ is stateful and clustered — a blind restart of a partitioned or alarmed broker can extend the outage or lose queued messages. If a queue has consumers=0, the consuming service (e.g. cinder_scheduler) is what lost its subscription; restarting it re-subscribes and drains the backlog. Only touch RabbitMQ when the broker itself is the confirmed fault (a partition or memory/disk alarm), and even then follow the cluster recovery procedure, single node at a time.
Why do Nova, Cinder, and Neutron operations hang on RPC?
These services are asynchronous: the API accepts your request and hands the real work to another process (conductor, scheduler, agent) over RPC. If that peer is dead, overloaded, or its queue isn’t being consumed, the API blocks waiting for a reply and the operation sits in BUILD, creating, or IN_PROGRESS until rpc_response_timeout fires. The hang is a symptom of a broken RPC path, not a slow API — which is why you diagnose the queue and the callee, not the endpoint you called.
Is raising rpc_response_timeout a good fix?
No — it’s a stopgap, not a fix. Bumping rpc_response_timeout only helps in the narrow case of a genuinely slow-but-healthy operation (a large volume delete, a heavy scheduler pass). If the reply is missing because a consumer is dead, a queue is unconsumed, or heartbeats are dropping, a higher timeout just makes users wait longer for the same failure. Fix the broken RPC path; raise the timeout only as a deliberate, documented allowance for a known-slow call.
How do I check RabbitMQ queue depth in Kolla-Ansible?
Exec into the container and use rabbitmqctl: docker exec rabbitmq rabbitmqctl list_queues name messages messages_ready consumers. Any queue where messages keeps climbing while consumers is 0 is your smoking gun — the consuming service isn’t attached. Pair it with docker exec rabbitmq rabbitmqctl cluster_status to rule out a partition and list_connections state to catch blocked publishers.