OpenStack Error Guide: 'MessagingTimeout' oslo.messaging / RabbitMQ Unreachable

Overview

MessagingTimeout is the error oslo.messaging raises when a service sends an RPC call over RabbitMQ and does not receive a reply within rpc_response_timeout (60 seconds by default). Because nearly every OpenStack service (Nova, Neutron, Cinder, Glance tasks, Heat) talks over AMQP, a RabbitMQ problem surfaces as timeouts and “AMQP server closed the connection” errors across the whole control plane at once.

The literal errors you will see:

oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 9f3c1d2e4a5b6c7d8e9f0a1b2c3d4e5f

ERROR oslo.messaging._drivers.impl_rabbit [-] [...] AMQP server on controller-02:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None
ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server closed the connection. Check login credentials: Socket closed

It occurs whenever a service tries to use the message bus: scheduling an instance, plugging a port, creating a volume. Symptoms appear simultaneously in unrelated services, which is the tell that the problem is RabbitMQ/the transport, not any one service.

Symptoms

Many services log MessagingTimeout or “AMQP server … is unreachable” at the same time.
openstack commands hang then fail; agents show as down across hosts.
Instances stick in BUILD, volumes in creating, ports unbound.

openstack compute service list -c Binary -c Host -c State

+----------------+------------+-------+
| Binary         | Host       | State |
+----------------+------------+-------+
| nova-conductor | controller | down  |
| nova-scheduler | controller | down  |
| nova-compute   | compute-01 | down  |
| nova-compute   | compute-02 | down  |
+----------------+------------+-------+

docker logs nova_conductor 2>&1 | grep -iE "MessagingTimeout|AMQP server" | tail -3

ERROR oslo.messaging._drivers.impl_rabbit AMQP server on 10.0.0.12:5672 is unreachable: [Errno 111] ECONNREFUSED.

Common Root Causes

1. RabbitMQ is down

The broker crashed, was OOM-killed, or never started. With no broker, every RPC times out.

# Kolla-Ansible
docker ps --filter name=rabbitmq --format '{{.Names}} {{.Status}}'
docker exec rabbitmq rabbitmqctl status 2>&1 | head -20   # keep stderr so a nodedown error stays visible
# Traditional
sudo systemctl status rabbitmq-server --no-pager

rabbitmq Restarting (1) 5 seconds ago
Error: unable to perform an operation on node 'rabbit@controller-01'. ... nodedown

2. RabbitMQ cluster partition (split-brain)

In an HA cluster, a network blip can partition nodes. Mirrored queues become unavailable and publishes/consumes stall.

docker exec rabbitmq rabbitmqctl cluster_status 2>/dev/null

Network Partitions
  Partitions:
    rabbit@controller-01:
      - rabbit@controller-03

A non-empty Network Partitions section means split-brain: clients connected to one side cannot reach mirrored queues hosted on the other.

3. Network / firewall blocks port 5672 (or 5671 TLS)

A new firewall rule, security group, or routing change cuts the AMQP path between a service host and the broker.

ss -ltnp | grep -E ':567(1|2)'        # on the rabbit host
nc -vz 10.0.0.12 5672                  # from a service host
sudo iptables -L -n | grep 5672

Connection to 10.0.0.12 5672 port [tcp/*] failed: Connection timed out

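A “timed out” result (as opposed to “connection refused”) typically means a firewall is silently dropping packets; “refused” means the host answered but nothing is listening on that port.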
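Where nc is not installed on a service host, the same timed-out versus refused distinction can be approximated with bash's /dev/tcp pseudo-device. The sketch below is illustrative only: the function names probe_port and classify_probe_status are hypothetical, the 3-second limit is an arbitrary assumption, and it relies on coreutils timeout returning 124 when it has to kill the probe.

```shell
# Map the exit status of a timed TCP probe to a likely cause.
# 0 = connected, 124 = coreutils `timeout` gave up, anything else = connect failed fast.
classify_probe_status() {
  case "$1" in
    0)   echo "open" ;;
    124) echo "filtered" ;;
    *)   echo "refused-or-unreachable" ;;
  esac
}

# Probe host:port through bash's /dev/tcp, giving up after 3 seconds (assumed limit).
probe_port() {
  host="$1"
  port="$2"
  status=0
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null || status=$?
  classify_probe_status "$status"
}

# Usage (not run here): probe_port 10.0.0.12 5672
```

A fast failure cannot distinguish a refused connection from an unreachable network, so the last category is deliberately broad.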

4. Wrong credentials / vhost

After a password rotation or partial redeploy, services authenticate with stale credentials and the broker closes the connection.

grep -E '^transport_url' /etc/nova/nova.conf
docker exec rabbitmq rabbitmqctl list_users 2>/dev/null
docker exec rabbitmq rabbitmqctl list_vhosts 2>/dev/null

transport_url = rabbit://openstack:STALEPASS@10.0.0.12:5672//

ERROR oslo.messaging._drivers.impl_rabbit AMQP server closed the connection. Check login credentials: Socket closed
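Because transport_url embeds the password, it helps to mask it before pasting the line into a ticket or chat. The helper below is a minimal sketch; mask_transport_url is a hypothetical name, and it assumes the usual user:password@host layout, including comma-separated multi-host URLs.

```shell
# Replace every password in a transport_url line with ****.
# The (//|,) anchor covers the first host and each additional comma-separated host.
mask_transport_url() {
  sed -E 's#(//|,)([^:@/,]+):[^@,]*@#\1\2:****@#g'
}

# Usage (not run here): grep -E '^transport_url' /etc/nova/nova.conf | mask_transport_url
```

To check whether the broker accepts a given password, rabbitmqctl authenticate_user <user> <password> can be run inside the RabbitMQ container.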

5. Queue buildup / memory or disk alarm

If consumers fall behind, queues grow until RabbitMQ trips its memory or disk_free alarm and blocks publishing connections; RPC calls sent over those connections then time out.

docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
  2>/dev/null | sort -k2 -n -r | head -10
docker exec rabbitmq rabbitmqctl status 2>/dev/null \
  | grep -iE 'alarm|mem_used|disk_free'

reply_q_nova_conductor   148233   0
notifications.info        92011   1

A queue with 100k+ messages and 0 consumers will keep growing until a memory or disk alarm fires, and once an alarm is raised RabbitMQ blocks publishers across the whole bus.
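The "deep queue with no consumer" check can be automated with a small filter over the list_queues output. This is a sketch under stated assumptions: find_orphaned_queues is a hypothetical helper, and the 10000-message default threshold is illustrative, not a RabbitMQ default.

```shell
# Read "name messages consumers" rows on stdin and print queues that are
# deeper than the threshold and have zero consumers. Header lines are skipped
# because their second field is not a plain integer.
find_orphaned_queues() {
  threshold="${1:-10000}"
  awk -v limit="$threshold" \
    '$2 ~ /^[0-9]+$/ && $2 + 0 > limit + 0 && $3 == 0 { print $1, $2 }'
}

# Usage (not run here):
#   docker exec rabbitmq rabbitmqctl list_queues name messages consumers | find_orphaned_queues 100000
```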

6. Clock skew or connection limits

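Large clock skew makes service heartbeat timestamps look stale, so healthy services can be reported as down, and it can invalidate Keystone tokens; hitting RabbitMQ’s connection or file-descriptor limit causes new connections to be refused.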

timedatectl status | grep -E 'synchronized|NTP'
docker exec rabbitmq rabbitmqctl status 2>/dev/null \
  | grep -iE 'connections|file_descriptors|sockets_used'
docker exec rabbitmq rabbitmqctl list_connections 2>/dev/null | wc -l

System clock synchronized: no
{file_descriptors,[{total_limit,1024},{total_used,1023}, ...]}

A near-exhausted FD limit means new AMQP connections are refused.
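To turn the file_descriptors line into a number that can be alerted on, the two counters can be extracted and expressed as a percentage. The sketch below assumes the Erlang-term layout shown in the sample output above; other RabbitMQ releases may print this section differently, in which case the patterns would need adjusting. fd_usage_percent is a hypothetical name.

```shell
# Read rabbitmqctl status text on stdin and print file-descriptor usage.
# Assumes the {total_limit,N} and {total_used,M} terms from the sample output.
fd_usage_percent() {
  awk '
    match($0, /total_limit,[0-9]+/) { limit = substr($0, RSTART + 12, RLENGTH - 12) }
    match($0, /total_used,[0-9]+/)  { used  = substr($0, RSTART + 11, RLENGTH - 11) }
    END { if (limit + 0 > 0) printf "%d%% (%d/%d)\n", used * 100 / limit, used, limit }
  '
}

# Usage (not run here): docker exec rabbitmq rabbitmqctl status | fd_usage_percent
```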

Diagnostic Workflow

Step 1: Confirm it’s the bus, not one service

openstack compute service list -c Binary -c Host -c State
openstack network agent list -c "Agent Type" -c Host -c Alive

If services across multiple hosts are down/not-alive simultaneously, suspect RabbitMQ.
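The "multiple hosts at once" test can be made mechanical by counting distinct hosts with a down service. This is a rough sketch, not a definitive health check: summarize_down_services is a hypothetical helper, and it assumes the rows come from the openstack client's value formatter (-f value), which prints space-separated columns without a table border.

```shell
# Read "Binary Host State" rows on stdin and report how widespread the outage is.
summarize_down_services() {
  awk '
    $3 == "down" { down++; hosts[$2] = 1 }
    END {
      n = 0
      for (h in hosts) n++
      if (down == 0)  verdict = "nothing is down"
      else if (n > 1) verdict = "suspect the message bus"
      else            verdict = "suspect that host or service"
      printf "%d down on %d host(s): %s\n", down, n, verdict
    }
  '
}

# Usage (not run here):
#   openstack compute service list -f value -c Binary -c Host -c State | summarize_down_services
```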

Step 2: Check broker process and cluster state

# Kolla-Ansible
docker ps --filter name=rabbitmq
docker exec rabbitmq rabbitmqctl cluster_status
# Traditional
sudo systemctl status rabbitmq-server --no-pager
sudo rabbitmqctl cluster_status

Look for a down node or a Network Partitions block.

Step 3: Read a service’s AMQP log for the exact failure

# Kolla-Ansible
docker logs nova_conductor 2>&1 | grep -iE "MessagingTimeout|AMQP server|credentials" | tail -10
# Traditional
sudo journalctl -u nova-conductor --no-pager | grep -iE "MessagingTimeout|AMQP server" | tail -10

ECONNREFUSED → broker down; timed out → firewall; “Check login credentials” → auth/vhost.
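The rule of thumb above can be applied to a whole log by classifying each line and tallying the results. The function below is a minimal sketch that matches only the message fragments quoted in this guide; classify_amqp_error is a hypothetical name, and real logs may word these errors differently.

```shell
# Map one oslo.messaging log line to the likely cause.
# Patterns are case-sensitive, so the MessagingTimeout text ("Timed out waiting
# for a reply") does not collide with the lowercase "timed out" socket error.
classify_amqp_error() {
  case "$1" in
    *ECONNREFUSED*)               echo "broker down or not listening" ;;
    *"Check login credentials"*)  echo "wrong credentials or vhost" ;;
    *"timed out"*)                echo "firewall or routing drop" ;;
    *MessagingTimeout*)           echo "rpc timeout: check broker, queues, and consumers" ;;
    *)                            echo "unclassified" ;;
  esac
}

# Usage (not run here): tally causes across a service log.
#   docker logs nova_conductor 2>&1 |
#     while IFS= read -r line; do classify_amqp_error "$line"; done | sort | uniq -c
```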

Step 4: Test connectivity and credentials from a service host

nc -vz <RABBIT_HOST> 5672
grep -E '^transport_url' /etc/nova/nova.conf
docker exec rabbitmq rabbitmqctl list_users
docker exec rabbitmq rabbitmqctl list_permissions -p /

Step 5: Inspect queues and alarms

docker exec rabbitmq rabbitmqctl list_queues name messages consumers | sort -k2 -n -r | head
docker exec rabbitmq rabbitmqctl status | grep -iE 'alarm|mem_used|disk_free|file_descriptors'
timedatectl status | grep synchronized

Example Root Cause Analysis

At 03:10 the on-call sees Nova, Neutron, and Cinder agents all flip to down, and new instances stick in BUILD. The nova-conductor log:

ERROR oslo.messaging._drivers.impl_rabbit AMQP server on 10.0.0.12:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 9f3c1d2e...

ECONNREFUSED points at the broker itself. Checking the RabbitMQ container:

docker ps --filter name=rabbitmq --format '{{.Names}} {{.Status}}'

rabbitmq Restarting (137) 8 seconds ago

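Exit code 137 is 128 + 9, meaning the process was killed with SIGKILL, which in a crash loop like this usually indicates an OOM kill. The RabbitMQ log confirms that a memory alarm preceded the crash, caused by a runaway notifications.info queue with no consumer: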

docker logs rabbitmq 2>&1 | grep -iE 'memory|alarm|oom' | tail -5

vm_memory_high_watermark set. Memory used:... Disk free limit ... memory resource limit alarm set on node rabbit@controller-01.
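The exit status in the container listing can be decoded with simple arithmetic: a status above 128 conventionally means the process died from signal (status - 128). The helper below is a small sketch of that convention; decode_exit_status is a hypothetical name.

```shell
# Decode a container exit status. Signal 9 is SIGKILL, the signal the kernel
# OOM killer sends, although an operator or orchestrator can send it too.
decode_exit_status() {
  status="$1"
  if [ "$status" -gt 128 ]; then
    echo "killed by signal $((status - 128))"
  else
    echo "exited with code $status"
  fi
}

# Usage (not run here): decode_exit_status 137
```

Since SIGKILL alone does not prove an OOM kill, docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' rabbitmq can be used to see whether Docker recorded the container as OOM-killed.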

Fix: restart the broker, purge the orphaned queue once the node accepts commands again (docker exec fails while the container is still restarting), then confirm services recover:

docker restart rabbitmq
docker exec rabbitmq rabbitmqctl purge_queue notifications.info
docker exec rabbitmq rabbitmqctl cluster_status   # node running, no partitions
openstack compute service list -c Binary -c State # States flip back to 'up'

Longer term, disable the unused notification topic (or attach a consumer) so the queue cannot grow unbounded again.

Prevention Best Practices

Monitor RabbitMQ directly: node up/down, mem_used vs. watermark, disk_free, partitions, and total connections/FDs. Page on any raised alarm.
Alert on per-queue depth (rabbitmqctl list_queues messages consumers); a queue climbing with 0 consumers is an early warning before the bus blocks.
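Run the RabbitMQ cluster with an odd node count and an explicit partition-handling policy (cluster_partition_handling = pause_minority) so the minority side pauses during a partition and rejoins when connectivity returns, instead of running split-brain.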
Keep transport_url and broker credentials managed by config management so rotations update every service atomically.
Sync clocks with NTP/chrony on all control and compute nodes; verify with timedatectl.
Pre-open 5672/5671 in firewalls and security groups for all service hosts, and test with nc -vz after any network change.

Quick Command Reference

# Is the whole bus down? (services across hosts go down together)
openstack compute service list -c Binary -c Host -c State
openstack network agent list -c "Agent Type" -c Host -c Alive

# Broker process & cluster state
docker ps --filter name=rabbitmq
docker exec rabbitmq rabbitmqctl cluster_status
sudo systemctl status rabbitmq-server --no-pager

# Exact AMQP error from a service
docker logs nova_conductor 2>&1 | grep -iE "MessagingTimeout|AMQP server|credentials" | tail -10
sudo journalctl -u nova-conductor | grep -iE "MessagingTimeout|AMQP server" | tail -10

# Connectivity & credentials
nc -vz <RABBIT_HOST> 5672
grep -E '^transport_url' /etc/nova/nova.conf
docker exec rabbitmq rabbitmqctl list_users
docker exec rabbitmq rabbitmqctl list_permissions -p /

# Queues, alarms, limits, clock
docker exec rabbitmq rabbitmqctl list_queues name messages consumers | sort -k2 -n -r | head
docker exec rabbitmq rabbitmqctl status | grep -iE 'alarm|mem_used|disk_free|file_descriptors'
timedatectl status | grep synchronized

# Recover
docker exec rabbitmq rabbitmqctl purge_queue <QUEUE>
docker restart rabbitmq

Conclusion

MessagingTimeout and “AMQP server closed connection” are transport-layer failures: the RPC bus, not the calling service, is broken. The simultaneous, cross-service nature is the diagnostic signature. Typical root causes:

RabbitMQ is down (crash/OOM/not started).
A cluster network partition (split-brain) makes mirrored queues unavailable.
A firewall, security group, or routing change blocks port 5672/5671.
Stale or wrong credentials/vhost after a rotation or partial redeploy.
Queue buildup tripping a memory/disk alarm that blocks publishers.
Clock skew or exhausted connection/file-descriptor limits refusing connections.

Confirm the broker and cluster status first; the ECONNREFUSED vs. timed out vs. “Check login credentials” distinction in the log tells you which of these you’re chasing.
