OpenStack Error Guide: 'Missed heartbeats from client, timeout: 60 seconds' RabbitMQ Connection Churn

Overview

OpenStack services talk to each other over RabbitMQ using oslo.messaging. Every connection runs an AMQP heartbeat so the broker and client can detect a dead peer. When the client fails to send its heartbeat frame in time, RabbitMQ tears the connection down and logs a missed-heartbeat error; the OpenStack service then logs that the AMQP server is unreachable and reconnects. Under load this becomes constant connection churn that stalls RPC calls, delays instance scheduling, and produces sporadic MessagingTimeout failures.

You will see this on the RabbitMQ side:

=ERROR REPORT==== 24-Jun-2026::14:02:11 ===
closing AMQP connection <0.30125.7> (10.0.0.12:54122 -> 10.0.0.10:5672 - nova-compute:1234):
missed heartbeats from client, timeout: 60 seconds

And in the OpenStack service log (here nova-compute):

ERROR oslo.messaging._drivers.impl_rabbit [req-...] [a1b2c3d4] AMQP server on 10.0.0.10:5672 is unreachable: <RecoverableConnectionError: (0, 0, '', '')>. Trying again in 1 seconds. Client port: 54122

Heartbeats are tracked per connection, so a controller that is fine for hours can suddenly shed dozens of connections when an event loop blocks or the broker gets overloaded.

Symptoms

RabbitMQ logs repeated 'missed heartbeats from client, timeout: 60 seconds' and 'closing AMQP connection' entries.
Services log 'AMQP server on ... is unreachable' followed by 'Reconnected to AMQP server'.
rabbitmqctl list_connections shows connection counts that spike and collapse.
RPC operations (boot, attach volume, list) intermittently time out with MessagingTimeout.

sudo rabbitmqctl list_connections name state | head

Listing connections ...
10.0.0.12:54122 -> 10.0.0.10:5672  running
10.0.0.13:51990 -> 10.0.0.10:5672  closing

docker logs nova_compute 2>&1 | grep -ci "AMQP server on .* is unreachable"

Common Root Causes

1. Heartbeat timeout too aggressive for the client

oslo.messaging runs its heartbeat check roughly every heartbeat_timeout_threshold / heartbeat_rate / 2 seconds. If the threshold is low and the service stays busy across several of those checks, it misses the broker's deadline and the broker drops it.

grep -E '^(heartbeat_timeout_threshold|heartbeat_rate)' /etc/nova/nova.conf

heartbeat_timeout_threshold = 60
heartbeat_rate = 2

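With a 60s threshold and heartbeat_rate = 2, the client should check the heartbeat every ~15s (60 / 2 / 2); a worker that stays blocked across consecutive checks, until the broker's 60s timeout expires, gets reaped.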
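The interval arithmetic can be sketched as a small shell helper. This is only an illustration: it assumes the check interval is heartbeat_timeout_threshold / heartbeat_rate / 2, which reproduces the ~15s figure for a 60s threshold and a rate of 2, and the function name heartbeat_check_interval is invented here.

```shell
# Hypothetical helper: seconds between heartbeat checks, assuming
# interval = heartbeat_timeout_threshold / heartbeat_rate / 2.
# Usage: heartbeat_check_interval THRESHOLD RATE
heartbeat_check_interval() {
  awk -v threshold="$1" -v rate="$2" 'BEGIN {
    if (rate + 0 <= 0) { print "rate must be positive" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", threshold / rate / 2
  }'
}

heartbeat_check_interval 60 2   # prints 15.0
```

Comparing this interval with the longest pause seen in the service log gives a rough idea of how many checks a blocked worker skips before the broker's timeout expires.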

2. eventlet/native-thread blocking the heartbeat

Heartbeats run on the service’s green thread (or, with [oslo_messaging_rabbit] heartbeat_in_pthread, a real thread). A long synchronous call — a slow DB query, a blocking C library, image hashing — monopolizes the loop and the heartbeat frame never fires.

grep -E 'heartbeat_in_pthread' /etc/nova/nova.conf

heartbeat_in_pthread = false

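Setting heartbeat_in_pthread = true moves heartbeats off the blocked eventlet loop. The oslo.messaging option help describes it as intended for WSGI services, so test it on eventlet-based agents such as nova-compute before rolling it out.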
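A plain grep also matches commented-out lines and options placed in the wrong section. The sketch below is a section-aware lookup, assuming the option lives under [oslo_messaging_rabbit] as noted above; ini_get is a hypothetical helper name, not part of any OpenStack tool.

```shell
# Hypothetical helper: print the value of KEY inside [SECTION] of an INI file.
# Usage: ini_get SECTION KEY FILE
ini_get() {
  awk -v section="[$1]" -v key="$2" '
    /^[[:space:]]*[#;]/ { next }                  # skip comment lines
    /^[[:space:]]*\[/ {                           # section header
      line = $0
      gsub(/^[[:space:]]+|[[:space:]]+$/, "", line)
      in_section = (line == section)
      next
    }
    in_section {
      pos = index($0, "=")
      if (pos == 0) next                          # not a key = value line
      k = substr($0, 1, pos - 1)
      v = substr($0, pos + 1)
      gsub(/^[[:space:]]+|[[:space:]]+$/, "", k)
      gsub(/^[[:space:]]+|[[:space:]]+$/, "", v)
      if (k == key) print v
    }
  ' "$3"
}

# Example against the path used above (prints nothing if the option is unset):
# ini_get oslo_messaging_rabbit heartbeat_in_pthread /etc/nova/nova.conf
```

An empty result means the service is running with the library default for that option.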

3. Network or firewall dropping idle TCP

A stateful firewall, conntrack table, or load balancer with a short idle timeout silently drops idle AMQP sockets. The next heartbeat lands on a dead connection.

sudo sysctl net.netfilter.nf_conntrack_tcp_timeout_established
ss -tnp | grep ':5672'

net.netfilter.nf_conntrack_tcp_timeout_established = 3600
ESTAB 0 0 10.0.0.12:54122 10.0.0.10:5672 users:(("nova-compute",pid=1234,fd=21))

Tune net.ipv4.tcp_keepalive_time below the firewall idle timeout, or keep heartbeat_timeout_threshold shorter than it.
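The rule in the previous sentence can be written as a quick check. This is a minimal sketch: idle_timeout_covered is a hypothetical name, all three values are in seconds, and the numbers in the example are illustrative rather than measured.

```shell
# Hypothetical check: succeed if either the heartbeat threshold or the TCP
# keepalive time (both in seconds) is below the firewall idle timeout.
# Usage: idle_timeout_covered HEARTBEAT_THRESHOLD KEEPALIVE_TIME FIREWALL_IDLE
idle_timeout_covered() {
  [ "$1" -lt "$3" ] || [ "$2" -lt "$3" ]
}

# Illustrative values: 60s heartbeat threshold, 7200s keepalive, 3600s idle timeout.
if idle_timeout_covered 60 7200 3600; then
  echo "covered: traffic or keepalives arrive before the idle timeout"
else
  echo "at risk: idle sockets can be dropped silently"
fi
```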

4. RabbitMQ partition or unsynced HA queues

In a clustered/HA setup a network partition or unmirrored queue causes the cluster to drop or stall connections, which surfaces as mass heartbeat loss.

sudo rabbitmqctl cluster_status

Cluster status of node rabbit@controller-01 ...
Network Partitions
  rabbit@controller-02 cannot communicate with rabbit@controller-03

Any non-empty Network Partitions section means split-brain; resolve it before chasing client tuning.
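For monitoring, the partition check can be reduced to a count. The sketch below assumes partitions are reported with the phrase "cannot communicate with", as in the sample output above; other RabbitMQ versions may word this differently, and count_partition_lines is a hypothetical helper name.

```shell
# Hypothetical helper: count partition lines in `rabbitmqctl cluster_status`
# text read from stdin. Assumes the phrase "cannot communicate with" marks
# each reported partition, as in the sample output above.
count_partition_lines() {
  # grep -c prints 0 but exits non-zero when nothing matches; mask that status.
  grep -c 'cannot communicate with' || true
}

# Usage on a live broker:
# sudo rabbitmqctl cluster_status | count_partition_lines
```

Any result above 0 is worth alerting on before looking at client-side tuning.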

5. File descriptor / socket limits exhausted

When the broker or a controller hits its fd ceiling it cannot accept new sockets, and existing ones get starved, producing heartbeat misses across the board.

sudo rabbitmqctl status | grep -A3 file_descriptors

{file_descriptors,
    [{total_limit,65536},{total_used,65210},{sockets_limit,58981},{sockets_used,58970}]}

total_used near total_limit means you are out of headroom; raise LimitNOFILE.
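Headroom is easier to track as a percentage. The following sketch parses the Erlang-term style file_descriptors output shown above (newer RabbitMQ releases format status differently, so treat the patterns as an assumption); socket_usage_percent is a hypothetical helper name.

```shell
# Hypothetical helper: print sockets_used as a whole-number percentage of
# sockets_limit, parsed from Erlang-term style status text on stdin.
socket_usage_percent() {
  awk '
    match($0, /sockets_limit,[0-9]+/) { limit = substr($0, RSTART + 14, RLENGTH - 14) + 0 }
    match($0, /sockets_used,[0-9]+/)  { used  = substr($0, RSTART + 13, RLENGTH - 13) + 0 }
    END { if (limit > 0) printf "%d\n", used * 100 / limit }
  '
}

echo '[{total_limit,65536},{total_used,65210},{sockets_limit,58981},{sockets_used,58970}]}' \
  | socket_usage_percent   # prints 99
```

On a live broker the same function can read from sudo rabbitmqctl status, provided the output uses this format.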

6. Overloaded controllers / oversized queues

A controller pegged on CPU, or a service with too many RPC workers hammering the broker, starves the heartbeat thread and inflates queue depth.

sudo rabbitmqctl list_queues name messages consumers memory | sort -k2 -n -r | head

Listing queues ...
reply_8f3c...   148210  1  402653184
conductor       9921    24 33554432

A reply queue backed up into six figures points at a stuck/slow consumer rather than client config.
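To pick out only the backed-up queues, the listing can be filtered by message count. This is a sketch under the assumption that the input has the queue name in the first column and the message count in the second, as in the output above; deep_queues and the threshold value are illustrative.

```shell
# Hypothetical filter: print queues whose message count exceeds THRESHOLD.
# Expects `rabbitmqctl list_queues name messages ...` output on stdin.
# Usage: deep_queues THRESHOLD
deep_queues() {
  # The numeric test skips header lines such as "Listing queues ...".
  awk -v threshold="$1" '$2 ~ /^[0-9]+$/ && $2 + 0 > threshold + 0 { print $1, $2 }'
}

# Usage on a live broker (threshold chosen for illustration only):
# sudo rabbitmqctl list_queues name messages consumers memory | deep_queues 100000
```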

Diagnostic Workflow

Step 1: Confirm it is heartbeats, not a hard outage

# Kolla-Ansible
docker logs rabbitmq 2>&1 | grep -i "missed heartbeats" | tail -5
# Traditional packages
sudo journalctl -u rabbitmq-server --no-pager | grep -i "missed heartbeats" | tail -5

Repeated 'missed heartbeats from client' entries (rather than connection_closed_abruptly) confirm that the heartbeat path is the issue.

Step 2: Check cluster health and partitions first

sudo rabbitmqctl cluster_status
sudo rabbitmqctl list_queues name messages consumers | sort -k2 -n -r | head

A partition or a runaway queue is a broker-side problem; fix it before touching client timeouts.

Step 3: Inspect the connection churn

# Connection names contain spaces, so sort on the tab-separated channels column
sudo rabbitmqctl list_connections name user state channels | sort -t$'\t' -k4,4nr | head
watch -n2 'sudo rabbitmqctl list_connections | wc -l'

A connection count that oscillates by tens every few seconds is the churn signature.
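The oscillation can be quantified instead of eyeballed. The sketch below reads one connection count per line and reports the spread; churn_spread is a hypothetical helper name, and the sampling loop in the comment is one possible way to feed it.

```shell
# Hypothetical helper: read one connection count per line and report the
# minimum, maximum, and spread (max - min) across the samples.
churn_spread() {
  awk '
    $1 ~ /^[0-9]+$/ {
      n = $1 + 0
      if (!seen || n < min) min = n
      if (!seen || n > max) max = n
      seen = 1
    }
    END { if (seen) printf "min=%d max=%d spread=%d\n", min, max, max - min }
  '
}

# Collect 10 samples, 2 seconds apart, on a live broker:
# for i in $(seq 10); do sudo rabbitmqctl -q list_connections | wc -l; sleep 2; done | churn_spread
```

A spread in the tens over a short sampling window matches the churn signature described above.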

Step 4: Check the client service for blocking and current settings

# Kolla-Ansible (on the compute)
docker logs nova_compute 2>&1 | grep -iE "unreachable|Reconnected to AMQP" | tail -20
# Traditional
sudo journalctl -u nova-compute --no-pager | grep -iE "unreachable|Reconnected" | tail -20
grep -E 'heartbeat_timeout_threshold|heartbeat_in_pthread|kombu_reconnect_delay' /etc/nova/nova.conf

Correlate the unreachable timestamps with slow operations in the same log.
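Bucketing the errors by minute makes that correlation easier. This sketch assumes each log line begins with an oslo.log style timestamp (YYYY-MM-DD HH:MM:SS...), which is not shown in the excerpt above; journalctl output adds its own prefix and would need different column offsets. The name unreachable_per_minute is hypothetical.

```shell
# Hypothetical helper: count "is unreachable" errors per minute.
# Assumes each line on stdin starts with "YYYY-MM-DD HH:MM:SS".
unreachable_per_minute() {
  grep 'is unreachable' \
    | cut -c1-16 \
    | sort \
    | uniq -c \
    | awk '{ print $2, $3, $1 }'   # date, HH:MM, count
}

# Usage with the Kolla-Ansible container from above:
# docker logs nova_compute 2>&1 | unreachable_per_minute
```

Minutes with a high count are the ones to compare against slow operations in the same log.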

Step 5: Verify fd limits and network keepalive, then tune

sudo rabbitmqctl status | grep -A3 file_descriptors
grep "open files" /proc/$(pgrep -f beam.smp | head -1)/limits
sudo sysctl net.ipv4.tcp_keepalive_time

After raising limits or adjusting timeouts, restart the affected service and re-watch the churn.

Example Root Cause Analysis

After adding ten compute nodes, controllers begin logging waves of 'missed heartbeats from client, timeout: 60 seconds', and nova-compute across the fleet logs 'AMQP server is unreachable' every few minutes.

cluster_status shows no partitions, and queue depths are normal, so the broker itself is healthy. The connection list, however, churns constantly:

sudo rabbitmqctl list_connections state | sort | uniq -c

   41 closing
  612 running

Checking the broker’s file descriptors:

{file_descriptors,[{total_limit,4096},{total_used,4061},{sockets_limit,3680},{sockets_used,3679}]}

The default LimitNOFILE was never raised after the fleet grew, so RabbitMQ is starved of sockets and dropping connections, which surfaces as missed heartbeats in the broker log and unreachable errors on the clients.

Fix: raise the broker fd limit and restart, then move heartbeats off eventlet on the clients:

# Kolla: set rabbitmq LimitNOFILE / ulimits, then
docker restart rabbitmq
# On computes: heartbeat_in_pthread = true, heartbeat_timeout_threshold = 60
docker restart nova_compute

Socket usage drops well below the new ceiling, the connection count stabilizes, and the missed-heartbeat errors stop.

Prevention Best Practices

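Consider heartbeat_in_pthread = true where a blocked green thread is starving heartbeats; the oslo.messaging option help recommends it only for WSGI services, so validate it on eventlet-based agents before a fleet-wide rollout.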
Size LimitNOFILE for RabbitMQ to comfortably exceed connections + channels for the whole fleet, and alert when sockets_used passes ~80% of sockets_limit.
Keep heartbeat_timeout_threshold (default 60s) shorter than any firewall/conntrack idle timeout, and set TCP keepalive accordingly so idle sockets are never silently dropped.
Monitor rabbitmqctl cluster_status for partitions, and mirror critical queues (or use quorum queues) so an HA event does not cascade into fleet-wide heartbeat loss.
Watch reply-queue depth with rabbitmqctl list_queues name messages consumers; a backing-up reply_* queue is an early sign of a stuck consumer before churn begins.

Quick Command Reference

# Confirm missed heartbeats on the broker
docker logs rabbitmq 2>&1 | grep -i "missed heartbeats" | tail -5
sudo journalctl -u rabbitmq-server | grep -i "missed heartbeats" | tail -5

# Cluster, partitions, and queue depth
sudo rabbitmqctl cluster_status
sudo rabbitmqctl list_queues name messages consumers memory | sort -k2 -n -r | head

# Connection churn
sudo rabbitmqctl list_connections name user state channels
watch -n2 'sudo rabbitmqctl list_connections | wc -l'

# File descriptors / socket limits
sudo rabbitmqctl status | grep -A3 file_descriptors

# Client-side errors and current oslo.messaging tuning
docker logs nova_compute 2>&1 | grep -iE "unreachable|Reconnected to AMQP" | tail -20
grep -E 'heartbeat_timeout_threshold|heartbeat_in_pthread|kombu_reconnect_delay' /etc/nova/nova.conf

# Restart after tuning
docker restart rabbitmq nova_compute
sudo systemctl restart rabbitmq-server nova-compute

Conclusion

A 'missed heartbeats from client, timeout: 60 seconds' error means an OpenStack service failed to send its AMQP heartbeat before the broker’s deadline, so RabbitMQ closed the connection and the client logged the server as unreachable. The usual root causes:

A heartbeat_timeout_threshold too aggressive for how busy the client gets.
eventlet/native-thread blocking that starves the heartbeat frame.
A firewall, conntrack, or load balancer dropping idle AMQP TCP sockets.
A RabbitMQ cluster partition or unsynced HA queues.
File-descriptor/socket limits exhausted on the broker.
Overloaded controllers or backed-up reply queues.

Rule out broker partitions and fd exhaustion first, then move heartbeats off the blocked event loop and align timeouts with your network — the churn almost always traces to one of those.
