RabbitMQ Error Guide: '{inet_error,etimedout}' Stale Half-Open Connection
Fix RabbitMQ inet_error etimedout half-open connections: vanished clients, disabled heartbeats, TCP keepalive tuning, and NAT idle-timeout drops.
Exact Error Message
When a RabbitMQ broker tries to write to a client that has silently vanished, the operating system’s TCP retransmission timer eventually gives up and the broker logs a connection closure carrying an inet_error reason. The most common variant is etimedout:
=WARNING REPORT==== 24-Jun-2026::14:08:51.442213 ===
closing AMQP connection <0.1234.0> (10.0.5.31:51902 -> 10.0.4.21:5672 - my-worker@10.0.5.31):
{inet_error,etimedout}
=WARNING REPORT==== 24-Jun-2026::14:11:03.918774 ===
closing AMQP connection <0.2087.0> (10.0.5.44:48120 -> 10.0.4.21:5672 - sched-pod-7c9):
{inet_error,ehostunreach}
=WARNING REPORT==== 24-Jun-2026::14:12:40.005611 ===
closing AMQP connection <0.2810.0> (10.0.5.19:39044 -> 10.0.4.21:5672):
{inet_error,econnreset}
On newer releases using the structured logger, the same event appears on one line:
2026-06-24 14:08:51.442 [warning] <0.1234.0> closing AMQP connection <0.1234.0> (10.0.5.31:51902 -> 10.0.4.21:5672 - my-worker@10.0.5.31, vhost: '/', user: 'svc_orders'): {inet_error,etimedout}
The reason atom in braces is the key: etimedout (TCP write timed out), ehostunreach (no route to the peer anymore), or econnreset (an intermediary or a recovered host actively reset the flow).
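To triage a log by reason atom, a small parser can pair each closure with the client address that precedes it. The following is a minimal sketch that assumes the two log layouts shown above; the function names `closures` and `count_reasons` are illustrative and not part of any RabbitMQ tooling.

```python
import re
from collections import Counter
from typing import List, Tuple

# The reason tuple that ends every closure entry, e.g. {inet_error,etimedout}.
REASON_RE = re.compile(r"\{inet_error,([a-z]+)\}")
# The client endpoint at the start of "(client:port -> broker:port ...".
CLIENT_RE = re.compile(r"\((\d+\.\d+\.\d+\.\d+):\d+ -> ")

def closures(log_text: str) -> List[Tuple[str, str]]:
    """Return (client_ip, reason) pairs in log order.

    Handles the legacy layout (endpoint and reason on separate lines)
    and the one-line structured layout.
    """
    pairs = []
    last_client = None
    for line in log_text.splitlines():
        client = CLIENT_RE.search(line)
        if client:
            last_client = client.group(1)
        reason = REASON_RE.search(line)
        if reason and last_client is not None:
            pairs.append((last_client, reason.group(1)))
            last_client = None
    return pairs

def count_reasons(log_text: str) -> Counter:
    """Count closures per reason atom."""
    return Counter(reason for _, reason in closures(log_text))
```

A burst dominated by etimedout from one subnet points at vanished hosts or a dropped flow, while a mix that includes ehostunreach suggests a routing change.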
What the Error Means
{inet_error,etimedout} is reported by the Erlang inet driver, not by AMQP. It means the broker attempted to send bytes to the client’s TCP socket, the kernel queued those bytes and retransmitted them per the standard TCP backoff schedule, and after several minutes with no ACK the kernel returned ETIMEDOUT. RabbitMQ surfaces that errno and tears the connection down.
The critical detail is half-open detection. When a client host crashes, has its power cut, or its NAT/firewall flow is silently dropped, no FIN or RST is ever sent. From the broker’s TCP stack the connection still looks ESTABLISHED. RabbitMQ only discovers the truth the next time it actually writes to that socket, for example when delivering a message to a consumer. The write enters the kernel’s retransmit loop and etimedout only fires after the OS exhausts its retries. On Linux the default net.ipv4.tcp_retries2 = 15 gives a nominal budget of about 924.6 seconds, so expect roughly 15 minutes or more. Until then the connection, its channels, and any unacked messages stay stuck.
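The length of that window follows from exponential backoff. The sketch below estimates the nominal retransmission budget for a given tcp_retries2, assuming the usual Linux bounds of a 200 ms minimum and a 120 s maximum retransmission timeout. Those two bounds and the function name are assumptions for illustration, and a real connection with a higher measured round-trip time will take longer.

```python
def retransmit_budget_seconds(retries2: int = 15,
                              rto_min: float = 0.2,
                              rto_max: float = 120.0) -> float:
    """Sum the wait after the first transmission and after each of the
    `retries2` retransmissions, doubling the timeout each time and
    capping it at rto_max."""
    total = 0.0
    rto = rto_min
    for _ in range(retries2 + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total

if __name__ == "__main__":
    budget = retransmit_budget_seconds()
    print(f"{budget:.1f} s, about {budget / 60:.1f} min")
```

With the defaults this yields about 924.6 seconds. Lowering tcp_retries2 shortens the window, but it applies to every TCP connection on the host, so heartbeats are usually the safer lever.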
Two independent mechanisms shorten that window. AMQP heartbeats make both sides exchange empty frames on an interval; a missed beat closes the connection within 2 * heartbeat seconds regardless of pending writes. TCP keepalive probes the socket at the OS level even when the application is idle. If you disable heartbeats (heartbeat = 0) and never enable keepalive, an idle half-open connection can linger indefinitely, and you will only see etimedout whenever the broker finally needs to push data.
Common Causes
- Ungraceful client host loss. A VM, container, or pod is killed (OOM, node drain, hard reboot, kill -9 on the host process namespace) so no TCP shutdown handshake is sent.
- NAT or stateful firewall idle drops. A NAT gateway or firewall silently evicts an idle flow from its connection table. Subsequent broker writes go nowhere; depending on the device you get etimedout or ehostunreach.
- Network partition or route withdrawal. A subnet, route, or peering link disappears. The broker’s retransmissions are black-holed and ehostunreach is returned.
- Heartbeats disabled. Clients negotiating heartbeat = 0 (older libraries, explicit misconfiguration) remove the fastest detection path, leaving only slow TCP timeouts.
- Aggressive client-side reconnects after a blip. A recovered client opens a new connection while the old socket is still half-open on the broker, so you see a burst of etimedout closures trailing a real outage.
- Cloud load balancer in front of the broker that drops idle TCP flows below the configured heartbeat interval.
How to Reproduce the Error
You can reproduce the half-open scenario safely in a lab:
- Start a consumer that subscribes to a queue and negotiates heartbeat = 0 (heartbeats disabled).
- On the client host, drop all traffic to the broker at the firewall without sending a reset, simulating a vanished host:
# On the CLIENT host (lab only): silently black-hole the broker.
sudo iptables -A OUTPUT -d 10.0.4.21 -j DROP
- Publish a message so the broker must write to the now-unreachable consumer.
- Watch the broker log. Several minutes later, when the kernel exhausts retransmits, the closure appears:
closing AMQP connection <0.1234.0> (10.0.5.31:51902 -> 10.0.4.21:5672): {inet_error,etimedout}
Remove the lab rule afterward with sudo iptables -D OUTPUT -d 10.0.4.21 -j DROP.
Diagnostic Commands
All commands below are read-only. Start on the broker node by listing connections with their state and timeout, then look at the live sockets.
# Connections with negotiated heartbeat timeout and byte counters.
# A timeout of 0 means heartbeats are disabled for that connection.
sudo rabbitmqctl list_connections name timeout state recv_oct send_oct
Listing connections ...
name timeout state recv_oct send_oct
10.0.5.31:51902 -> ... 0 running 142890 9981233
10.0.5.44:48120 -> ... 60 running 88120 430112
10.0.5.19:39044 -> ... 0 running 10240 2210945
The rows with timeout = 0 are the dangerous ones: no heartbeat watchdog, so only TCP can detect a dead peer.
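For a broker with many connections, the same check can be scripted. This is a minimal sketch that assumes the output was requested with `name timeout` as the first two columns, as above, and that the table is tab-separated; the function name is illustrative.

```python
from typing import List

def connections_without_heartbeat(table_text: str) -> List[str]:
    """Return the names of connections whose negotiated timeout is 0.

    Expects tab-separated rows with `name` first and `timeout` second.
    Banner lines and the header row are skipped.
    """
    names = []
    for line in table_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or fields[0] == "name":
            continue
        if fields[1].strip() == "0":
            names.append(fields[0])
    return names
```

Feeding it the captured command output returns only the connections that rely on TCP alone for dead-peer detection.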
# Established sockets on the AMQP port, with owning process.
sudo ss -tnp state established '( sport = :5672 )'
# Show per-socket timers, including keepalive countdown if enabled.
sudo ss -tno state established '( sport = :5672 )'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 262144 10.0.4.21:5672 10.0.5.31:51902 timer:(on,4min12sec,9)
ESTAB 0 0 10.0.4.21:5672 10.0.5.44:48120 timer:(keepalive,118sec,0)
A large Send-Q with an on (retransmit) timer and a rising retry count is the smoking gun for a half-open peer. A keepalive timer means OS keepalive is active on that socket.
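That pattern can be picked out of `ss -tno` output automatically. The sketch below assumes the column layout shown above (Recv-Q, Send-Q, local address, peer address, then the timer field). The retry threshold of 3 and the function name are arbitrary choices for illustration.

```python
import re
from typing import List

# Recv-Q, Send-Q, local, peer, then timer:(kind,remaining,retransmits).
SOCKET_RE = re.compile(
    r"(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+timer:\((\w+),([^,)]*),(\d+)\)"
)

def suspected_half_open(ss_output: str, min_retries: int = 3) -> List[str]:
    """Return peer addresses with queued bytes, an active retransmit
    timer ("on"), and at least `min_retries` retransmissions."""
    peers = []
    for line in ss_output.splitlines():
        match = SOCKET_RE.search(line)
        if not match:
            continue
        send_q = int(match.group(2))
        peer = match.group(4)
        kind = match.group(5)
        retries = int(match.group(7))
        if kind == "on" and send_q > 0 and retries >= min_retries:
            peers.append(peer)
    return peers
```

Sockets that show a keepalive timer or an empty Send-Q are skipped, since they are idle rather than stuck retransmitting.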
# Read the host keepalive settings (read-only).
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
sysctl net.ipv4.tcp_retries2
# Broker health and listeners.
sudo rabbitmq-diagnostics status
# Search the logs for the inet_error family over time.
sudo grep -E "inet_error,(etimedout|ehostunreach|econnreset)" /var/log/rabbitmq/*.log
If net.ipv4.tcp_keepalive_time reads 7200, the kernel waits two hours before its first keepalive probe, which is far too long to be useful on its own.
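To see how the three keepalive settings combine, the worst-case time to declare an idle peer dead is the initial idle wait plus one interval per unanswered probe. A minimal sketch of that arithmetic follows. The function name is illustrative, and the 75-second interval and 9 probes in the example are the usual Linux defaults, assumed here rather than read from a host.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Idle time before the first probe, plus one interval for each
    probe that must go unanswered before the kernel drops the socket."""
    return time_s + intvl_s * probes

if __name__ == "__main__":
    # Assumed Linux defaults: 7200 s idle, 75 s interval, 9 probes.
    print(keepalive_detection_seconds(7200, 75, 9))  # 7875
    # The tuned values used in the resolution steps of this guide.
    print(keepalive_detection_seconds(120, 30, 4))   # 240
```

With the defaults an idle dead peer survives for more than two hours; the tuned values bring that down to four minutes.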
Step-by-Step Resolution
- Confirm the pattern. Run list_connections ... timeout state and the ss -tno checks. If you see timeout = 0 rows and sockets stuck with retransmit timers, you have classic half-open lingering.
- Re-enable heartbeats. Set a sane heartbeat on both the broker and clients. In rabbitmq.conf:
heartbeat = 60
Restart clients so they renegotiate. Many libraries let the server-proposed value win only if the client requests a non-zero value, so fix the client config too.
- Enable TCP keepalive for the listener as a backstop for idle connections, in rabbitmq.conf:
tcp_listen_options.keepalive = true
- Tighten the host keepalive timing so probes start in minutes, not hours. Apply on the broker (and ideally clients) and persist via /etc/sysctl.d/:
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 4
- Align with network idle timeouts. Make sure your heartbeat and keepalive intervals are shorter than the idle-flow timeout of any NAT gateway, load balancer, or firewall between clients and the broker.
- Verify recovery. After redeploying clients, re-run list_connections name timeout and confirm new connections show a non-zero timeout, and that ss -tno shows keepalive timers instead of long-lived on retransmit timers.
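On the client side, keepalive can also be set per socket instead of host-wide. The sketch below shows the socket options involved, using the timing values from the steps above. The helper name `enable_keepalive` is hypothetical, the TCP_KEEP* constants are only present on some platforms (Linux among them), and how you hand a configured socket to your AMQP client depends on the library.

```python
import socket

def enable_keepalive(sock: socket.socket,
                     idle_s: int = 120,
                     interval_s: int = 30,
                     probes: int = 4) -> None:
    """Turn on TCP keepalive and, where the platform exposes them,
    override the idle time, probe interval, and probe count."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    overrides = (
        ("TCP_KEEPIDLE", idle_s),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

Per-socket settings leave other applications on the host untouched, which can be easier to roll out than a sysctl change.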
For correlating these closures with publish/consume errors across a fleet, the incident response dashboard can group the inet_error log lines by client subnet.
Prevention and Best Practices
- Never run heartbeat = 0 in production. A 30-60 second heartbeat is cheap and catches dead peers fast.
- Layer keepalive under heartbeats. Heartbeats cover the AMQP layer; OS keepalive covers idle TCP and intermediaries that strip or ignore application traffic.
- Keep timers below every idle timeout in the path. NAT, LB, and firewall idle windows are the usual silent killers; pick heartbeat and keepalive values well under the smallest one.
- Use durable queues and manually acknowledged consumers so messages stuck on a half-open connection are redelivered after the closure rather than lost.
- Alert on timeout = 0 connections and on bursts of inet_error,etimedout in the broker log; both indicate misconfigured clients or a flaky path.
- Drain nodes gracefully so clients send a clean disconnect instead of vanishing.
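The rule about staying under every idle timeout is easy to encode as a configuration check. The following is a minimal sketch; the function name, the 0.5 safety margin, and the idle-timeout figures in the example are all hypothetical and should be replaced with the values your network devices actually use.

```python
from typing import Dict, List

def timers_exceeding_idle_timeout(timers_s: Dict[str, int],
                                  idle_timeouts_s: Dict[str, int],
                                  margin: float = 0.5) -> List[str]:
    """Return the names of timers that are not comfortably below the
    smallest idle timeout in the path (smallest timeout * margin)."""
    if not idle_timeouts_s:
        return []
    limit = min(idle_timeouts_s.values()) * margin
    return sorted(name for name, value in timers_s.items() if value > limit)

if __name__ == "__main__":
    # Hypothetical idle windows for devices between clients and broker.
    idle = {"nat_gateway": 350, "firewall": 900}
    timers = {"heartbeat": 60, "tcp_keepalive_time": 7200}
    print(timers_exceeding_idle_timeout(timers, idle))  # ['tcp_keepalive_time']
```

Running a check like this in CI or at deploy time catches a timer that drifts above the path's idle window before it shows up as etimedout closures.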
Related Errors
- Missed heartbeats (missed heartbeats from client, timeout: 60s) is the faster sibling of this error; when heartbeats are enabled, you usually see that message instead of inet_error,etimedout. See the dedicated missed-heartbeats guide for that path.
- connection_closed_abruptly appears when a client drops the TCP connection without a clean AMQP close, for example when a client process exits or is killed before closing its connection.
- CONNECTION_FORCED is a broker- or operator-initiated closure (node shutdown, close_connection, or policy), distinct from the peer-vanished case here.
Frequently Asked Questions
Why does it take minutes for the error to appear?
Because the broker only learns the peer is gone when it writes and the kernel exhausts TCP retransmissions. On Linux, net.ipv4.tcp_retries2 = 15 translates to roughly 15 minutes or more of retries before ETIMEDOUT is returned.
Does setting heartbeat = 0 cause this directly?
Not by itself, but it removes the fast detection path. With heartbeats disabled, the broker has no AMQP-level liveness check, so a vanished client is only discovered through slow TCP timeouts, exactly when you see inet_error,etimedout.
What is the difference between etimedout and ehostunreach?
etimedout means writes were sent but never acknowledged (peer silent). ehostunreach means the kernel has no route to the peer at all, typically after a route withdrawal or NAT eviction returning ICMP unreachable.
Will TCP keepalive replace heartbeats?
No. Keepalive is a useful backstop for idle and intermediary cases, but AMQP heartbeats are application-aware and usually detect failures faster. Run both.
Are messages lost when this happens?
Messages delivered to a consumer that did not acknowledge them are requeued when the half-open connection finally closes, provided you use manual acks and durable queues. Auto-ack consumers can lose in-flight messages.