OpenStack Error Guide: 'Missed heartbeats from client, timeout: 60 seconds' RabbitMQ Connection Churn
Fix oslo.messaging 'Missed heartbeats from client' and 'AMQP server is unreachable' errors in OpenStack: tune heartbeats, eventlet, firewalls, and RabbitMQ HA.
- #openstack
- #troubleshooting
- #errors
- #rabbitmq
Overview
OpenStack services talk to each other over RabbitMQ using oslo.messaging. Every connection runs an AMQP heartbeat so the broker and client can detect a dead peer. When the client fails to send its heartbeat frame in time, RabbitMQ tears the connection down and logs a missed-heartbeat error; the OpenStack service then logs that the AMQP server is unreachable and reconnects. Under load this becomes constant connection churn that stalls RPC calls, delays instance scheduling, and produces sporadic MessagingTimeout failures.
You will see this on the RabbitMQ side:
=ERROR REPORT==== 24-Jun-2026::14:02:11 ===
closing AMQP connection <0.30125.7> (10.0.0.12:54122 -> 10.0.0.10:5672 - nova-compute:1234):
missed heartbeats from client, timeout: 60 seconds
And in the OpenStack service log (here nova-compute):
ERROR oslo.messaging._drivers.impl_rabbit [req-...] [a1b2c3d4] AMQP server on 10.0.0.10:5672 is unreachable: <RecoverableConnectionError: (0, 0, '', '')>. Trying again in 1 seconds. Client port: 54122
Heartbeats are tracked per connection, so a controller that is fine for hours can suddenly shed dozens of connections when an event loop blocks or the broker gets overloaded.
Symptoms
- RabbitMQ logs repeated 'missed heartbeats from client, timeout: 60 seconds' and 'closing AMQP connection'.
- Services log 'AMQP server on ... is unreachable' followed by 'Reconnected to AMQP server'.
- rabbitmqctl list_connections shows connection counts that spike and collapse.
- RPC operations (boot, attach volume, list) intermittently time out with MessagingTimeout.
sudo rabbitmqctl list_connections name state | head
Listing connections ...
10.0.0.12:54122 -> 10.0.0.10:5672 running
10.0.0.13:51990 -> 10.0.0.10:5672 closing
docker logs nova_compute 2>&1 | grep -ci "AMQP server on .* is unreachable"
317
Common Root Causes
1. Heartbeat timeout too aggressive for the client
oslo.messaging runs its heartbeat check at an interval derived from both options, roughly heartbeat_timeout_threshold / heartbeat_rate / 2. If the threshold is low and the service is briefly busy, it misses the window and the broker drops it.
grep -E '^(heartbeat_timeout_threshold|heartbeat_rate)' /etc/nova/nova.conf
heartbeat_timeout_threshold = 60
heartbeat_rate = 2
With a 60 s threshold and a rate of 2, the client should run its heartbeat check about every 15 s; a worker that stays blocked long enough for the broker to see no frame within the timeout gets reaped.
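As a rough worked example of that arithmetic, the sketch below computes the check interval from the two options. The helper name heartbeat_check_interval is hypothetical, and the threshold / rate / 2 formula is an assumption inferred from the 60 s and rate 2 figures above, so verify it against the oslo.messaging release you run.

```shell
# Hypothetical helper: estimate how often the client runs its heartbeat check.
# Assumption: interval = heartbeat_timeout_threshold / heartbeat_rate / 2.
heartbeat_check_interval() {
  local threshold="$1" rate="$2"
  awk -v t="$threshold" -v r="$rate" 'BEGIN { printf "%.1f\n", t / r / 2 }'
}

heartbeat_check_interval 60 2   # prints 15.0
```

Raising the threshold widens the window proportionally, which helps judge how long a blocking call the service can tolerate.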
2. eventlet/native-thread blocking the heartbeat
Heartbeats run on the service’s green thread (or, with [oslo_messaging_rabbit] heartbeat_in_pthread, a real thread). A long synchronous call — a slow DB query, a blocking C library, image hashing — monopolizes the loop and the heartbeat frame never fires.
grep -E 'heartbeat_in_pthread' /etc/nova/nova.conf
heartbeat_in_pthread = false
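Setting heartbeat_in_pthread = true moves heartbeats onto a native thread, off the blocked eventlet loop, for most services. Validate the change on a few nodes first, because the option's behaviour depends on the oslo.messaging release and on how the service is deployed.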
3. Network or firewall dropping idle TCP
A stateful firewall, conntrack table, or load balancer with a short idle timeout silently drops idle AMQP sockets. The next heartbeat lands on a dead connection.
sudo sysctl net.netfilter.nf_conntrack_tcp_timeout_established
ss -tnp | grep ':5672'
net.netfilter.nf_conntrack_tcp_timeout_established = 3600
ESTAB 0 0 10.0.0.12:54122 10.0.0.10:5672 users:(("nova-compute",pid=1234,fd=21))
Tune net.ipv4.tcp_keepalive_time below the firewall idle timeout, or keep heartbeat_timeout_threshold shorter than it.
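The comparison can be sketched as a small check. The helper name idle_timeout_check and the sample values are assumptions for illustration; it only encodes the rule that at least one of the two refresh sources (heartbeat threshold or TCP keepalive) must be shorter than the idle timeout.

```shell
# Hypothetical helper: report whether idle AMQP sockets get refreshed before
# a firewall/conntrack idle timeout expires. All arguments are whole seconds.
idle_timeout_check() {
  local idle="$1" keepalive="$2" heartbeat="$3"
  if [ "$heartbeat" -lt "$idle" ] || [ "$keepalive" -lt "$idle" ]; then
    echo "ok"
  else
    echo "risk"
  fi
}

# Assumed values: 3600 s conntrack timeout, 7200 s keepalive, 60 s heartbeat.
idle_timeout_check 3600 7200 60

# On a real host the first two values could come from sysctl, for example:
#   idle_timeout_check "$(sysctl -n net.netfilter.nf_conntrack_tcp_timeout_established)" \
#                      "$(sysctl -n net.ipv4.tcp_keepalive_time)" 60
```

A "risk" result means neither mechanism fires often enough, so the firewall may drop the flow before either side notices.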
4. RabbitMQ partition or unsynced HA queues
In a clustered/HA setup a network partition or unmirrored queue causes the cluster to drop or stall connections, which surfaces as mass heartbeat loss.
sudo rabbitmqctl cluster_status
Cluster status of node rabbit@controller-01 ...
Network Partitions
rabbit@controller-02 cannot communicate with rabbit@controller-03
Any non-empty Network Partitions section means split-brain; resolve it before chasing client tuning.
5. File descriptor / socket limits exhausted
When the broker or a controller hits its fd ceiling it cannot accept new sockets, and existing ones get starved, producing heartbeat misses across the board.
sudo rabbitmqctl status | grep -A3 file_descriptors
{file_descriptors,
[{total_limit,65536},{total_used,65210},{sockets_limit,58981},{sockets_used,58970}]}
total_used near total_limit means you are out of headroom; raise LimitNOFILE.
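To turn that output into a single number, the following sketch extracts sockets_used and sockets_limit and prints the usage as a percentage. The function name socket_usage_pct is hypothetical, and the parsing assumes the Erlang-term layout shown above; other RabbitMQ versions may print status differently.

```shell
# Hypothetical helper: read `rabbitmqctl status` text on stdin and print
# socket usage as an integer percentage (rounded down).
socket_usage_pct() {
  local input limit used
  input="$(cat)"
  limit="$(printf '%s\n' "$input" | sed -n 's/.*{sockets_limit,\([0-9][0-9]*\)}.*/\1/p' | head -1)"
  used="$(printf '%s\n' "$input" | sed -n 's/.*{sockets_used,\([0-9][0-9]*\)}.*/\1/p' | head -1)"
  # Bail out if either value is missing or the limit is zero.
  [ -n "$limit" ] && [ -n "$used" ] && [ "$limit" -gt 0 ] || return 1
  echo $(( used * 100 / limit ))
}

# Sample line copied from the output above.
printf '%s\n' '[{total_limit,65536},{total_used,65210},{sockets_limit,58981},{sockets_used,58970}]}' \
  | socket_usage_pct

# Live usage (assumes the same output format):
#   sudo rabbitmqctl status | socket_usage_pct
```

Feeding the result into monitoring makes it easy to alert well before the ceiling is reached.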
6. Overloaded controllers / oversized queues
A controller pegged on CPU, or a service with too many RPC workers hammering the broker, starves the heartbeat thread and inflates queue depth.
sudo rabbitmqctl list_queues name messages consumers memory | sort -k2 -n -r | head
Listing queues ...
reply_8f3c... 148210 1 402653184
conductor 9921 24 33554432
A reply queue backed up into six figures points at a stuck/slow consumer rather than client config.
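A minimal sketch for spotting such queues automatically: it filters list_queues output for queues at or above a message threshold. The function name deep_queues, the queue name reply_example, and the 100000 cutoff are illustrative assumptions, not recommended values.

```shell
# Hypothetical helper: read `rabbitmqctl list_queues name messages ...` on
# stdin and print "name messages" for queues at or above the threshold.
deep_queues() {
  local threshold="$1"
  # Header lines such as "Listing queues ..." have a non-numeric second
  # field and are skipped.
  awk -v limit="$threshold" '$2 ~ /^[0-9]+$/ && $2 + 0 >= limit { print $1, $2 }'
}

# Sample input modelled on the listing above (reply_example is a made-up name).
printf 'Listing queues ...\nreply_example\t148210\t1\t402653184\nconductor\t9921\t24\t33554432\n' \
  | deep_queues 100000

# Live usage:
#   sudo rabbitmqctl list_queues name messages consumers memory | deep_queues 100000
```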
Diagnostic Workflow
Step 1: Confirm it is heartbeats, not a hard outage
# Kolla-Ansible
docker logs rabbitmq 2>&1 | grep -i "missed heartbeats" | tail -5
# Traditional packages
sudo journalctl -u rabbitmq-server --no-pager | grep -i "missed heartbeats" | tail -5
Repeated missed heartbeats from client (not connection_closed_abruptly) confirms the heartbeat path is the issue.
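To see which clients are affected, the sketch below tallies missed-heartbeat closures per client address. It assumes the two-line log layout shown in the Overview (a 'closing AMQP connection' line followed by 'missed heartbeats from client'); the function name is hypothetical and other RabbitMQ versions may format these entries differently.

```shell
# Hypothetical helper: read broker log text on stdin and print
# "<count> <client-ip>" for connections closed on missed heartbeats.
missed_heartbeats_by_client() {
  awk '
    /closing AMQP connection/ {
      client = ""
      # Capture "(ip:port ->" and strip the "(" prefix and " ->" suffix.
      if (match($0, /\([0-9.]+:[0-9]+ ->/)) {
        client = substr($0, RSTART + 1, RLENGTH - 4)
        sub(/:[0-9]+$/, "", client)   # drop the ephemeral port
      }
    }
    /missed heartbeats from client/ && client != "" { count[client]++; client = "" }
    END { for (c in count) print count[c], c }
  ' | sort -k1,1nr -k2,2
}

# Example usage (Kolla-Ansible):
#   docker logs rabbitmq 2>&1 | missed_heartbeats_by_client | head
```

If one or two hosts dominate the tally, look at those clients first; an even spread across the fleet points back at the broker or the network.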
Step 2: Check cluster health and partitions first
sudo rabbitmqctl cluster_status
sudo rabbitmqctl list_queues name messages consumers | sort -k2 -n -r | head
A partition or a runaway queue is a broker-side problem; fix it before touching client timeouts.
Step 3: Inspect the connection churn
sudo rabbitmqctl list_connections name user state channels | sort -t$'\t' -k4,4nr | head
watch -n2 'sudo rabbitmqctl list_connections | wc -l'
A connection count that oscillates by tens every few seconds is the churn signature.
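To put a number on that oscillation without watching the screen, the sketch below reads a series of sampled connection counts and reports the largest change between consecutive samples. The function name max_swing and the sampling loop in the comment are assumptions for illustration.

```shell
# Hypothetical helper: read one connection count per line on stdin and print
# the largest absolute difference between consecutive samples.
max_swing() {
  awk '
    NR > 1 { d = $1 - prev; if (d < 0) d = -d; if (d > max) max = d }
    { prev = $1 }
    END { print max + 0 }
  '
}

# Sample series (made-up counts taken two seconds apart).
printf '650\n612\n655\n640\n' | max_swing

# Live sampling, 30 samples at 2 s intervals:
#   for i in $(seq 1 30); do sudo rabbitmqctl -q list_connections | wc -l; sleep 2; done | max_swing
```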
Step 4: Check the client service for blocking and current settings
# Kolla-Ansible (on the compute)
docker logs nova_compute 2>&1 | grep -iE "unreachable|Reconnected to AMQP" | tail -20
# Traditional
sudo journalctl -u nova-compute --no-pager | grep -iE "unreachable|Reconnected" | tail -20
grep -E 'heartbeat_timeout_threshold|heartbeat_in_pthread|kombu_reconnect_delay' /etc/nova/nova.conf
Correlate the unreachable timestamps with slow operations in the same log.
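A plain grep matches an option in any section of the file, including ones that oslo.messaging does not read it from. The sketch below is a small section-aware lookup; ini_get is a hypothetical helper, it returns only the first uncommented match, and it does not handle multi-line values.

```shell
# Hypothetical helper: print the value of KEY inside [SECTION] of an INI file.
ini_get() {
  local file="$1" section="$2" key="$3"
  awk -F '=' -v section="$section" -v key="$key" '
    # Remember the most recent [section] header.
    /^[ \t]*\[.*\]/ {
      current = $0
      gsub(/^[ \t]*\[|\].*$/, "", current)
      next
    }
    # Inside the wanted section, skip comments and match "key = value".
    current == section && $0 !~ /^[ \t]*[#;]/ && NF >= 2 {
      name = $1
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      if (name == key) {
        value = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", value)
        print value
        exit
      }
    }
  ' "$file"
}

# Example usage:
#   ini_get /etc/nova/nova.conf oslo_messaging_rabbit heartbeat_in_pthread
```

Empty output means the option is not set in that section, so the service is running with the library default.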
Step 5: Verify fd limits and network keepalive, then tune
sudo rabbitmqctl status | grep -A3 file_descriptors
cat /proc/$(pgrep -f beam.smp | head -1)/limits | grep "open files"
sudo sysctl net.ipv4.tcp_keepalive_time
After raising limits or adjusting timeouts, restart the affected service and re-watch the churn.
Example Root Cause Analysis
After adding ten compute nodes, controllers begin logging waves of missed heartbeats from client, timeout: 60 seconds, and nova-compute across the fleet logs AMQP server is unreachable every few minutes.
cluster_status shows no partitions, and queue depths are normal, so the broker itself is healthy. The connection list, however, churns constantly:
sudo rabbitmqctl list_connections state | sort | uniq -c
41 closing
612 running
Checking the broker’s file descriptors:
{file_descriptors,[{total_limit,4096},{total_used,4061},{sockets_limit,3680},{sockets_used,3679}]}
The default LimitNOFILE was never raised after the fleet grew, so RabbitMQ is starved of sockets and dropping connections, which the clients see as missed heartbeats.
Fix: raise the broker fd limit and restart, then move heartbeats off eventlet on the clients:
# Kolla: set rabbitmq LimitNOFILE / ulimits, then
docker restart rabbitmq
# On computes: heartbeat_in_pthread = true, heartbeat_timeout_threshold = 60
docker restart nova_compute
Socket usage drops well below the new ceiling, the connection count stabilizes, and the missed-heartbeat errors stop.
Prevention Best Practices
- Set heartbeat_in_pthread = true for eventlet-based services so a blocked green thread cannot starve heartbeats.
- Size LimitNOFILE for RabbitMQ to comfortably exceed connections + channels for the whole fleet, and alert when sockets_used passes ~80% of sockets_limit.
- Keep heartbeat_timeout_threshold (default 60s) shorter than any firewall/conntrack idle timeout, and set TCP keepalive accordingly so idle sockets are never silently dropped.
- Monitor rabbitmqctl cluster_status for partitions, and mirror critical queues or make them quorum queues, so an HA event does not cascade into fleet-wide heartbeat loss.
- Watch reply-queue depth with rabbitmqctl list_queues name messages consumers; a backing-up reply_* queue is an early sign of a stuck consumer before churn begins.
Quick Command Reference
# Confirm missed heartbeats on the broker
docker logs rabbitmq 2>&1 | grep -i "missed heartbeats" | tail -5
sudo journalctl -u rabbitmq-server | grep -i "missed heartbeats" | tail -5
# Cluster, partitions, and queue depth
sudo rabbitmqctl cluster_status
sudo rabbitmqctl list_queues name messages consumers memory | sort -k2 -n -r | head
# Connection churn
sudo rabbitmqctl list_connections name user state channels
watch -n2 'sudo rabbitmqctl list_connections | wc -l'
# File descriptors / socket limits
sudo rabbitmqctl status | grep -A3 file_descriptors
# Client-side errors and current oslo.messaging tuning
docker logs nova_compute 2>&1 | grep -iE "unreachable|Reconnected to AMQP" | tail -20
grep -E 'heartbeat_timeout_threshold|heartbeat_in_pthread|kombu_reconnect_delay' /etc/nova/nova.conf
# Restart after tuning
docker restart rabbitmq nova_compute
sudo systemctl restart rabbitmq-server nova-compute
Conclusion
A missed heartbeats from client, timeout: 60 seconds error means an OpenStack service failed to send its AMQP heartbeat before the broker’s deadline, so RabbitMQ closed the connection and the client logged the server as unreachable. The usual root causes:
- A heartbeat_timeout_threshold too aggressive for how busy the client gets.
- eventlet/native-thread blocking that starves the heartbeat frame.
- A firewall, conntrack, or load balancer dropping idle AMQP TCP sockets.
- A RabbitMQ cluster partition or unsynced HA queues.
- File-descriptor/socket limits exhausted on the broker.
- Overloaded controllers or backed-up reply queues.
Rule out broker partitions and fd exhaustion first, then move heartbeats off the blocked event loop and align timeouts with your network — the churn almost always traces to one of those.