OpenStack Error Guide: 'MessagingTimeout' oslo.messaging / RabbitMQ Unreachable
Fix oslo.messaging MessagingTimeout and 'AMQP server closed connection' errors in OpenStack: diagnose RabbitMQ down, partitions, firewall to 5672, creds, and queue buildup.
Overview
MessagingTimeout is the error oslo.messaging raises when a service sends an RPC call over RabbitMQ and never receives a reply within rpc_response_timeout. Because nearly every OpenStack service (Nova, Neutron, Cinder, Glance tasks, Heat) talks over AMQP, a RabbitMQ problem surfaces as timeouts and “AMQP server closed connection” all across the control plane at once.
The literal errors you will see:
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 9f3c1d2e4a5b6c7d8e9f0a1b2c3d4e5f
ERROR oslo.messaging._drivers.impl_rabbit [-] [...] AMQP server on controller-02:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None
ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server closed the connection. Check login credentials: Socket closed
It occurs whenever a service tries to use the message bus: scheduling an instance, plugging a port, creating a volume. Symptoms appear simultaneously in unrelated services, which is the tell that the problem is RabbitMQ/the transport, not any one service.
Symptoms
- Many services log MessagingTimeout or “AMQP server … is unreachable” at the same time.
- openstack commands hang, then fail; agents show as down across hosts.
- Instances stick in BUILD, volumes in creating, ports unbound.
openstack compute service list -c Binary -c Host -c State
+----------------+------------+-------+
| Binary | Host | State |
+----------------+------------+-------+
| nova-conductor | controller | down |
| nova-scheduler | controller | down |
| nova-compute | compute-01 | down |
| nova-compute | compute-02 | down |
+----------------+------------+-------+
docker logs nova_conductor 2>&1 | grep -iE "MessagingTimeout|AMQP server" | tail -3
ERROR oslo.messaging._drivers.impl_rabbit AMQP server on 10.0.0.12:5672 is unreachable: [Errno 111] ECONNREFUSED.
Common Root Causes
1. RabbitMQ is down
The broker crashed, was OOM-killed, or never started. With no broker, every RPC call times out.
# Kolla-Ansible
docker ps --filter name=rabbitmq --format '{{.Names}} {{.Status}}'
docker exec rabbitmq rabbitmqctl status 2>/dev/null | head -20
# Traditional
sudo systemctl status rabbitmq-server --no-pager
rabbitmq Restarting (1) 5 seconds ago
Error: unable to perform an operation on node 'rabbit@controller-01'. ... nodedown
2. RabbitMQ cluster partition (split-brain)
In an HA cluster, a network blip can partition nodes. Mirrored queues become unavailable and publishes/consumes stall.
docker exec rabbitmq rabbitmqctl cluster_status 2>/dev/null
Network Partitions
Partitions:
rabbit@controller-01:
- rabbit@controller-03
A non-empty Network Partitions section means split-brain — clients on one side cannot reach mirrored queues on the other.
3. Network / firewall blocks port 5672 (or 5671 TLS)
A new firewall rule, security group, or routing change cuts the AMQP path between a service host and the broker.
ss -ltnp | grep -E ':567(1|2)' # on the rabbit host
nc -vz 10.0.0.12 5672 # from a service host
sudo iptables -L -n | grep 5672
Connection to 10.0.0.12 5672 port [tcp/*] failed: Connection timed out
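A “timed out” result (rather than “refused”) typically means a firewall is silently dropping packets; “refused” means the host answered but nothing is listening on the port.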
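The refused-versus-timed-out distinction can also be checked from Python on a service host. The following is a minimal sketch using only the standard library; the function name `probe_amqp_port` and its return strings are illustrative choices, not part of any OpenStack tool.

```python
import socket


def probe_amqp_port(host: str, port: int = 5672, timeout: float = 3.0) -> str:
    """Attempt a plain TCP connect and report how it ended.

    Returns 'open', 'refused', 'timed out', or 'error: <reason>'.
    This only tests TCP reachability; it does not speak AMQP or authenticate.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout: packets are likely being dropped.
        return "timed out"
    except OSError as exc:
        # Anything else, e.g. no route to host or name resolution failure.
        return f"error: {exc.strerror or exc}"
```

Calling `probe_amqp_port("10.0.0.12")` from a service host gives the same signal as the nc check: “refused” points at the broker process, “timed out” at the network path.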
4. Wrong credentials / vhost
After a password rotation or partial redeploy, services authenticate with stale credentials and the broker closes the connection.
grep -E '^transport_url' /etc/nova/nova.conf
docker exec rabbitmq rabbitmqctl list_users 2>/dev/null
docker exec rabbitmq rabbitmqctl list_vhosts 2>/dev/null
transport_url = rabbit://openstack:STALEPASS@10.0.0.12:5672//
ERROR oslo.messaging._drivers.impl_rabbit AMQP server closed the connection. Check login credentials: Socket closed
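When comparing a service's transport_url against the broker's users and vhosts, it helps to split the URL into its parts without printing the password. The sketch below uses the standard library only; `describe_transport_url` is a hypothetical helper, it assumes a single broker host in the URL (not a comma-separated list), and it assumes the vhost is the URL path with its first slash removed.

```python
from urllib.parse import unquote, urlsplit


def describe_transport_url(url: str) -> dict:
    """Split a single-host rabbit:// URL into its parts, hiding the password."""
    parts = urlsplit(url)
    path = parts.path
    # Assumption: the vhost is the path minus its leading slash,
    # so a trailing '//' maps to the default vhost '/'.
    vhost = path[1:] if path.startswith("/") else path
    return {
        "user": unquote(parts.username or ""),
        "password_set": parts.password is not None,
        "host": parts.hostname,
        "port": parts.port,
        "vhost": unquote(vhost),
    }


if __name__ == "__main__":
    print(describe_transport_url("rabbit://openstack:STALEPASS@10.0.0.12:5672//"))
```

The user and vhost reported here should both appear in the list_users and list_vhosts output; a mismatch in either is enough for the broker to close the connection.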
5. Queue buildup / memory or disk alarm
If consumers fall behind, queues grow until RabbitMQ trips its memory or disk_free alarm and blocks publishers — which then time out.
docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
2>/dev/null | sort -k2 -n -r | head -10
docker exec rabbitmq rabbitmqctl status 2>/dev/null \
| grep -iE 'alarm|mem_used|disk_free'
reply_q_nova_conductor 148233 0
notifications.info 92011 1
A queue with 100k+ messages and 0 consumers (or a raised mem_alarm) blocks the bus.
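The “deep queue with no consumer” check is easy to automate. Here is a minimal sketch that scans text in the three-column layout produced by the list_queues command above (name, messages, consumers); the function name `find_stuck_queues` and the 100,000-message default threshold are illustrative assumptions.

```python
def find_stuck_queues(listing: str, min_messages: int = 100_000):
    """Return (name, messages) for queues at or above the threshold with 0 consumers.

    Expects whitespace-separated 'name messages consumers' lines; header and
    banner lines are skipped because their columns are not numeric.
    """
    stuck = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        name, messages, consumers = fields
        if not (messages.isdigit() and consumers.isdigit()):
            continue
        if int(messages) >= min_messages and int(consumers) == 0:
            stuck.append((name, int(messages)))
    # Deepest queue first, mirroring the sort in the shell pipeline.
    return sorted(stuck, key=lambda q: q[1], reverse=True)


if __name__ == "__main__":
    sample = "reply_q_nova_conductor 148233 0\nnotifications.info 92011 1\n"
    print(find_stuck_queues(sample))
```

Lowering `min_messages` turns the same function into an early-warning check for queues that are growing but have not yet tripped an alarm.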
6. Clock skew or connection limits
Large clock skew breaks token/heartbeat assumptions; hitting RabbitMQ’s connection/file-descriptor limit refuses new connections.
timedatectl status | grep -E 'synchronized|NTP'
docker exec rabbitmq rabbitmqctl status 2>/dev/null \
| grep -iE 'connections|file_descriptors|sockets_used'
docker exec rabbitmq rabbitmqctl list_connections 2>/dev/null | wc -l
System clock synchronized: no
{file_descriptors,[{total_limit,1024},{total_used,1023}, ...]}
A near-exhausted FD limit means new AMQP connections are refused.
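To turn the file-descriptor line into a number you can alert on, the used and limit values can be pulled out with a regular expression. This sketch assumes the Erlang-term layout shown above (`{total_limit,N}` and `{total_used,N}`); other RabbitMQ versions may print status differently, in which case the function simply returns None. The name `fd_usage` is hypothetical.

```python
import re


def fd_usage(status_text: str):
    """Return used/limit as a float between 0 and 1, or None if not found."""
    limit = re.search(r"\{total_limit,(\d+)\}", status_text)
    used = re.search(r"\{total_used,(\d+)\}", status_text)
    if not (limit and used):
        return None
    total_limit = int(limit.group(1))
    total_used = int(used.group(1))
    if total_limit == 0:
        return None
    return total_used / total_limit


if __name__ == "__main__":
    line = "{file_descriptors,[{total_limit,1024},{total_used,1023}, ...]}"
    ratio = fd_usage(line)
    print(f"FD usage: {ratio:.1%}")
```

A ratio close to 1.0, as in the example line, means the broker is about to refuse new connections.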
Diagnostic Workflow
Step 1: Confirm it’s the bus, not one service
openstack compute service list -c Binary -c Host -c State
openstack network agent list -c "Agent Type" -c Host -c Alive
If services across multiple hosts are down/not-alive simultaneously, suspect RabbitMQ.
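The “multiple hosts down at once” test can be scripted against the table that `openstack compute service list -c Binary -c Host -c State` prints. The following is a rough sketch assuming exactly those three columns in that order; `parse_service_table`, `looks_bus_wide`, and the two-host threshold are illustrative choices.

```python
def parse_service_table(table: str):
    """Extract (binary, host, state) rows from the CLI's ASCII table."""
    rows = []
    for line in table.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3 or cells[0] == "Binary":
            continue  # skip the header row and anything malformed
        rows.append(tuple(cells))
    return rows


def looks_bus_wide(table: str, min_hosts: int = 2) -> bool:
    """True when services are down on at least min_hosts distinct hosts."""
    down_hosts = {host for _, host, state in parse_service_table(table) if state == "down"}
    return len(down_hosts) >= min_hosts
```

A single down host usually points at that host or service; several at once is the pattern that justifies looking at RabbitMQ first.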
Step 2: Check broker process and cluster state
# Kolla-Ansible
docker ps --filter name=rabbitmq
docker exec rabbitmq rabbitmqctl cluster_status
# Traditional
sudo systemctl status rabbitmq-server --no-pager
sudo rabbitmqctl cluster_status
Look for a down node or a Network Partitions block.
Step 3: Read a service’s AMQP log for the exact failure
# Kolla-Ansible
docker logs nova_conductor 2>&1 | grep -iE "MessagingTimeout|AMQP server|credentials" | tail -10
# Traditional
sudo journalctl -u nova-conductor --no-pager | grep -iE "MessagingTimeout|AMQP server" | tail -10
ECONNREFUSED → broker down; timed out → firewall; “Check login credentials” → auth/vhost.
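The mapping above can be applied mechanically when there are many log lines to sift. This is a minimal sketch that matches on the substrings quoted in this guide; the rule order, labels, and the name `classify_amqp_error` are assumptions for illustration. A MessagingTimeout line is treated as the generic symptom rather than a cause, since it appears in every scenario.

```python
import re

# Ordered rules: the first pattern that matches wins.
_RULES = [
    (re.compile(r"ECONNREFUSED"), "broker down or not listening"),
    (re.compile(r"Check login credentials"), "wrong credentials or vhost"),
    (re.compile(r"MessagingTimeout"), "RPC timeout (symptom only)"),
    (re.compile(r"timed out", re.IGNORECASE), "firewall or routing drop"),
]


def classify_amqp_error(line: str) -> str:
    """Map one log or probe line to the likely cause named in this guide."""
    for pattern, label in _RULES:
        if pattern.search(line):
            return label
    return "unclassified"
```

Feeding it the tail of a service log and counting the labels shows at a glance which branch of the workflow to follow.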
Step 4: Test connectivity and credentials from a service host
nc -vz <RABBIT_HOST> 5672
grep -E '^transport_url' /etc/nova/nova.conf
docker exec rabbitmq rabbitmqctl list_users
docker exec rabbitmq rabbitmqctl list_permissions -p /
Step 5: Inspect queues and alarms
docker exec rabbitmq rabbitmqctl list_queues name messages consumers | sort -k2 -n -r | head
docker exec rabbitmq rabbitmqctl status | grep -iE 'alarm|mem_used|disk_free|file_descriptors'
timedatectl status | grep synchronized
Example Root Cause Analysis
At 03:10 the on-call sees Nova, Neutron, and Cinder agents all flip to down, and new instances stick in BUILD. The nova-conductor log:
ERROR oslo.messaging._drivers.impl_rabbit AMQP server on 10.0.0.12:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 9f3c1d2e...
ECONNREFUSED points at the broker itself. Checking the RabbitMQ container:
docker ps --filter name=rabbitmq --format '{{.Names}} {{.Status}}'
rabbitmq Restarting (137) 8 seconds ago
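Exit code 137 is 128 + 9, meaning the process was killed with SIGKILL, which on a memory-constrained host most often indicates an OOM kill. The RabbitMQ log confirms a memory alarm preceded the crash, caused by a runaway notifications.info queue with no consumer: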
docker logs rabbitmq 2>&1 | grep -iE 'memory|alarm|oom' | tail -5
vm_memory_high_watermark set. Memory used:... Disk free limit ... memory resource limit alarm set on node rabbit@controller-01.
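The exit status in the container listing above follows the common convention that values over 128 encode a fatal signal as 128 plus the signal number. A small sketch of that decoding, using the standard `signal` module (the helper name `describe_exit_status` is hypothetical):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a container or shell exit status into a readable form."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))
```

SIGKILL alone does not prove an OOM kill, so it is still worth confirming with the broker log or the host's kernel log, as done here.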
Fix: drain the orphaned queue and restart the broker, then confirm services recover:
docker exec rabbitmq rabbitmqctl purge_queue notifications.info
docker restart rabbitmq
docker exec rabbitmq rabbitmqctl cluster_status # node running, no partitions
openstack compute service list -c Binary -c State # States flip back to 'up'
Longer term, disable the unused notification topic (or attach a consumer) so the queue cannot grow unbounded again.
Prevention Best Practices
- Monitor RabbitMQ directly: node up/down, mem_used vs. watermark, disk_free, partitions, and total connections/FDs. Page on any raised alarm.
- Alert on per-queue depth (rabbitmqctl list_queues messages consumers); a queue climbing with 0 consumers is an early warning before the bus blocks.
- Run RabbitMQ HA with an odd node count and a sane partition handling policy (pause_minority) so a blip self-heals instead of splitting brain.
- Keep transport_url and broker credentials managed by config management so rotations update every service atomically.
- Sync clocks with NTP/chrony on all control and compute nodes; verify with timedatectl.
- Pre-open 5672/5671 in firewalls and security groups for all service hosts, and test with nc -vz after any network change.
Quick Command Reference
# Is the whole bus down? (services across hosts go down together)
openstack compute service list -c Binary -c Host -c State
openstack network agent list -c "Agent Type" -c Host -c Alive
# Broker process & cluster state
docker ps --filter name=rabbitmq
docker exec rabbitmq rabbitmqctl cluster_status
sudo systemctl status rabbitmq-server --no-pager
# Exact AMQP error from a service
docker logs nova_conductor 2>&1 | grep -iE "MessagingTimeout|AMQP server|credentials" | tail -10
sudo journalctl -u nova-conductor | grep -iE "MessagingTimeout|AMQP server" | tail -10
# Connectivity & credentials
nc -vz <RABBIT_HOST> 5672
grep -E '^transport_url' /etc/nova/nova.conf
docker exec rabbitmq rabbitmqctl list_users
docker exec rabbitmq rabbitmqctl list_permissions -p /
# Queues, alarms, limits, clock
docker exec rabbitmq rabbitmqctl list_queues name messages consumers | sort -k2 -n -r | head
docker exec rabbitmq rabbitmqctl status | grep -iE 'alarm|mem_used|disk_free|file_descriptors'
timedatectl status | grep synchronized
# Recover
docker exec rabbitmq rabbitmqctl purge_queue <QUEUE>
docker restart rabbitmq
Conclusion
MessagingTimeout and “AMQP server closed connection” are transport-layer failures: the RPC bus, not the calling service, is broken. The simultaneous, cross-service nature is the diagnostic signature. Typical root causes:
- RabbitMQ is down (crash/OOM/not started).
- A cluster network partition (split-brain) makes mirrored queues unavailable.
- A firewall, security group, or routing change blocks port 5672/5671.
- Stale or wrong credentials/vhost after a rotation or partial redeploy.
- Queue buildup tripping a memory/disk alarm that blocks publishers.
- Clock skew or exhausted connection/file-descriptor limits refusing connections.
Confirm the broker and cluster status first; the ECONNREFUSED vs. timed out vs. “Check login credentials” distinction in the log tells you which of these you’re chasing.