By James Joyner IV · 9 min read

RabbitMQ Error Guide: '{inet_error,etimedout}' Stale Half-Open Connection

Fix RabbitMQ inet_error etimedout half-open connections: vanished clients, disabled heartbeats, TCP keepalive tuning, and NAT idle-timeout drops.

  • #rabbitmq
  • #troubleshooting
  • #errors
  • #connectivity

Exact Error Message

When a RabbitMQ broker tries to write to a client that has silently vanished, the operating system’s TCP retransmission timer eventually gives up and the broker logs a connection closure carrying an inet_error reason. The most common variant is etimedout:

=WARNING REPORT==== 24-Jun-2026::14:08:51.442213 ===
closing AMQP connection <0.1234.0> (10.0.5.31:51902 -> 10.0.4.21:5672 - my-worker@10.0.5.31):
{inet_error,etimedout}

=WARNING REPORT==== 24-Jun-2026::14:11:03.918774 ===
closing AMQP connection <0.2087.0> (10.0.5.44:48120 -> 10.0.4.21:5672 - sched-pod-7c9):
{inet_error,ehostunreach}

=WARNING REPORT==== 24-Jun-2026::14:12:40.005611 ===
closing AMQP connection <0.2810.0> (10.0.5.19:39044 -> 10.0.4.21:5672):
{inet_error,econnreset}

On newer releases using the structured logger, the same event appears on one line:

2026-06-24 14:08:51.442 [warning] <0.1234.0> closing AMQP connection <0.1234.0> (10.0.5.31:51902 -> 10.0.4.21:5672 - my-worker@10.0.5.31, vhost: '/', user: 'svc_orders'): {inet_error,etimedout}

The reason atom in braces is the key: etimedout (TCP write timed out), ehostunreach (no route to the peer anymore), or econnreset (an intermediary or a recovered host actively reset the flow).
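
As a rough illustration of working with these reason atoms, the sketch below tallies inet_error closures by reason from log text on standard input. The function name count_inet_errors is hypothetical, and the pattern assumes the reason appears exactly as {inet_error,<atom>}, as in the samples above.

```shell
# Hypothetical helper: tally inet_error closures by reason atom.
# Reads broker log text on stdin and prints one "<count> <reason>" line per reason.
count_inet_errors() {
  grep -oE '[{]inet_error,[a-z]+[}]' |
    sed -E 's/^[{]inet_error,([a-z]+)[}]$/\1/' |
    sort | uniq -c | awk '{ print $1, $2 }'
}
```

For example, piping the broker logs through it (sudo cat /var/log/rabbitmq/*.log | count_inet_errors) prints the totals sorted alphabetically by reason, which makes a sudden rise in etimedout easy to spot.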

What the Error Means

{inet_error,etimedout} is reported by the Erlang inet driver, not by AMQP. It means the broker attempted to send bytes to the client’s TCP socket, the kernel queued those bytes and retransmitted them per the standard TCP backoff schedule, and after several minutes with no ACK the kernel returned ETIMEDOUT. RabbitMQ surfaces that errno and tears the connection down.

The critical detail is half-open detection. When a client host crashes, loses power, or has its NAT/firewall flow silently dropped, no FIN or RST is ever sent. To the broker’s TCP stack the connection still looks ESTABLISHED. RabbitMQ only discovers the truth the next time it actually writes to that socket, for example when delivering a message to a consumer. The write enters the kernel’s retransmit loop, and etimedout fires only after the OS exhausts its retries. On Linux the default net.ipv4.tcp_retries2 = 15 corresponds to a hypothetical timeout of 924.6 seconds (about 15 minutes), which the kernel documentation describes as a lower bound on the effective timeout. Until then the connection, its channels, and any unacked messages stay stuck.
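
To see where a figure of this size comes from, the retransmission window can be modelled as a series of doubling waits. The sketch below is a simplified model, not the kernel's code: it assumes a 200 ms starting retransmission timeout and a 120 s cap (the usual Linux minimum and maximum), whereas real connections start from a measured timeout, so actual totals vary. The function name is illustrative.

```shell
# Simplified model of the TCP retransmission window, in milliseconds.
# Assumptions: the first wait is 200 ms, each wait doubles, and waits are
# capped at 120000 ms. One wait follows the original send and one follows
# each retransmission, so there are retries + 1 waits in total.
retries2_timeout_ms() {
  retries=$1
  rto=200
  max=120000
  total=0
  i=0
  while [ "$i" -le "$retries" ]; do
    if [ "$rto" -gt "$max" ]; then step=$max; else step=$rto; fi
    total=$(( total + step ))
    rto=$(( rto * 2 ))
    i=$(( i + 1 ))
  done
  echo "$total"
}
```

Under these assumptions, retries2_timeout_ms 15 prints 924600, that is 924.6 seconds or a little over 15 minutes. Passing the live value (retries2_timeout_ms "$(sysctl -n net.ipv4.tcp_retries2)") shows how a lower setting would shrink the window.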

Two independent mechanisms shorten that window. AMQP heartbeats make both sides exchange empty frames at roughly half the negotiated heartbeat timeout; a peer that misses two consecutive heartbeats is considered unreachable, so a dead connection is closed after roughly the negotiated timeout regardless of pending writes. TCP keepalive probes the socket at the OS level even when the application is idle. If you disable heartbeats (heartbeat = 0) and never enable keepalive, an idle half-open connection can linger indefinitely, and you will only see etimedout when the broker finally needs to push data.

Common Causes

  • Ungraceful client host loss. A VM, container, or pod is killed (OOM, node drain, hard reboot, kill -9 on the host process namespace) so no TCP shutdown handshake is sent.
  • NAT or stateful firewall idle drops. A NAT gateway or firewall silently evicts an idle flow from its connection table. Subsequent broker writes go nowhere; depending on the device you get etimedout or ehostunreach.
  • Network partition or route withdrawal. A subnet, route, or peering link disappears. The broker’s retransmissions are either black-holed (etimedout) or the kernel learns that the peer is unreachable and returns ehostunreach.
  • Heartbeats disabled. Clients negotiating heartbeat = 0 (older libraries, explicit misconfiguration) remove the fastest detection path, leaving only slow TCP timeouts.
  • Aggressive client-side reconnects after a blip. A recovered client opens a new connection while the old socket is still half-open on the broker, so you see a burst of etimedout closures trailing a real outage.
  • Cloud load balancer idle timeouts. A load balancer in front of the broker whose idle timeout is shorter than the configured heartbeat interval drops idle TCP flows between heartbeats.

How to Reproduce the Error

You can reproduce the half-open scenario safely in a lab:

  1. Start a consumer that subscribes to a queue and negotiates heartbeat = 0 (heartbeats disabled).
  2. On the client host, drop the broker’s traffic at the firewall without sending a reset, simulating a vanished host:
# On the CLIENT host (lab only): silently black-hole the broker.
sudo iptables -A OUTPUT -d 10.0.4.21 -j DROP
  3. Publish a message so the broker must write to the now-unreachable consumer.
  4. Watch the broker log. Several minutes later (around 15 with default Linux settings), when the kernel exhausts retransmits, the closure appears:
closing AMQP connection <0.1234.0> (10.0.5.31:51902 -> 10.0.4.21:5672): {inet_error,etimedout}

Remove the lab rule afterward with sudo iptables -D OUTPUT -d 10.0.4.21 -j DROP.

Diagnostic Commands

All commands below are read-only. Start on the broker node by listing connections with their state and timeout, then look at the live sockets.

# Connections with negotiated heartbeat timeout and byte counters.
# A timeout of 0 means heartbeats are disabled for that connection.
sudo rabbitmqctl list_connections name timeout state recv_oct send_oct
Listing connections ...
name                    timeout state    recv_oct send_oct
10.0.5.31:51902 -> ...   0       running  142890   9981233
10.0.5.44:48120 -> ...   60      running  88120    430112
10.0.5.19:39044 -> ...   0       running  10240    2210945

The rows with timeout = 0 are the dangerous ones: no heartbeat watchdog, so only TCP can detect a dead peer.
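
On a busy broker it can help to filter for those rows directly. The sketch below is one way to do it, assuming the listing is tab-separated with the name in the first column and the timeout in the second. The function name is hypothetical, and the column layout should be checked against your rabbitmqctl version.

```shell
# Hypothetical filter: print the names of connections whose negotiated
# heartbeat timeout is 0. Expects tab-separated "name<TAB>timeout" rows on
# stdin; a header row is skipped because its second column is not "0".
zero_heartbeat_connections() {
  awk -F '\t' '$2 == "0" { print $1 }'
}
```

A possible invocation is sudo rabbitmqctl -q list_connections name timeout | zero_heartbeat_connections, which stays read-only like the other commands in this section.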

# Established sockets on the AMQP port, with owning process.
sudo ss -tnp state established '( sport = :5672 )'

# Show per-socket timers, including keepalive countdown if enabled.
sudo ss -tno state established '( sport = :5672 )'
State  Recv-Q Send-Q   Local Address:Port   Peer Address:Port
ESTAB  0      262144   10.0.4.21:5672       10.0.5.31:51902  timer:(on,4min12sec,9)
ESTAB  0      0        10.0.4.21:5672       10.0.5.44:48120  timer:(keepalive,118sec,0)

A large Send-Q with an on (retransmit) timer and a rising retry count is the smoking gun for a half-open peer. A keepalive timer means OS keepalive is active on that socket.
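
That check can be sketched as a small filter over the ss -tno output. The function name and the default retry threshold of 3 are assumptions for illustration. The filter locates the timer field and reads Send-Q relative to it, so it does not depend on whether ss prints a State column.

```shell
# Hypothetical filter for `ss -tno` output on stdin.
# Prints peers whose socket has queued unsent data and a retransmit ("on")
# timer with at least MIN retries (first argument, default 3).
flag_half_open() {
  awk -v min="${1:-3}" '
    {
      for (i = 4; i <= NF; i++) {
        if ($i ~ /^timer:\(on,/) {
          retries = $i
          sub(/\)$/, "", retries)   # drop the trailing ")"
          sub(/^.*,/, "", retries)  # keep only the text after the last comma
          send_q = $(i - 3)         # fields before the timer: Send-Q, local, peer
          if (send_q + 0 > 0 && retries + 0 >= min + 0)
            print $(i - 1), "send_q=" send_q, "retries=" retries
        }
      }
    }'
}
```

Fed the sample output above, it prints only the 10.0.5.31:51902 row (send_q=262144 retries=9); the socket with the keepalive timer is ignored.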

# Read the host keepalive settings (read-only).
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
sysctl net.ipv4.tcp_retries2

# Broker health and listeners.
sudo rabbitmq-diagnostics status

# Search the logs for the inet_error family over time.
sudo grep -E "inet_error,(etimedout|ehostunreach|econnreset)" /var/log/rabbitmq/*.log

If net.ipv4.tcp_keepalive_time reads 7200, the kernel waits two hours before its first keepalive probe, which is far too long to be useful on its own.
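
The worst-case time for keepalive to declare an idle peer dead is the idle time before the first probe plus one interval per unanswered probe. A minimal sketch of that arithmetic follows; the function name is illustrative.

```shell
# Worst-case seconds for TCP keepalive to give up on an idle, dead peer.
# Arguments: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes.
keepalive_detect_secs() {
  echo $(( $1 + $2 * $3 ))
}
```

With a keepalive time of 7200 and the usual Linux defaults of 75 seconds and 9 probes (assumed here; read your own values with sysctl), keepalive_detect_secs 7200 75 9 prints 7875, or about 2 hours 11 minutes. Values such as 120, 30, and 4 bring that down to 240 seconds.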

Step-by-Step Resolution

  1. Confirm the pattern. Run list_connections ... timeout state and the ss -tno checks. If you see timeout = 0 rows and sockets stuck with retransmit timers, you have classic half-open lingering.

  2. Re-enable heartbeats. Set a sane heartbeat on both the broker and clients. In rabbitmq.conf:

    heartbeat = 60

    Restart clients so they renegotiate. Many libraries let the server-proposed value win only if the client requests a non-zero value, so fix the client config too.

  3. Enable TCP keepalive for the listener as a backstop for idle connections, in rabbitmq.conf:

    tcp_listen_options.keepalive = true
  4. Tighten the host keepalive timing so probes start in minutes, not hours. Apply on the broker (and ideally clients) and persist via /etc/sysctl.d/:

    net.ipv4.tcp_keepalive_time = 120
    net.ipv4.tcp_keepalive_intvl = 30
    net.ipv4.tcp_keepalive_probes = 4
  5. Align with network idle timeouts. Make sure your heartbeat and keepalive intervals are shorter than the idle-flow timeout of any NAT gateway, load balancer, or firewall between clients and the broker.

  6. Verify recovery. After redeploying clients, re-run list_connections name timeout and confirm new connections show a non-zero timeout, and that ss -tno shows keepalive timers instead of long-lived on retransmit timers.
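
The alignment check in step 5 can be sketched as a small shell test. It assumes heartbeat frames are sent at about half the negotiated timeout and that you already know the smallest idle timeout on the path. The function name and the 350-second figure used below are hypothetical.

```shell
# Hypothetical check: succeed only if heartbeats are enabled and both the
# heartbeat frame interval (assumed to be half the negotiated timeout) and
# the keepalive idle time are shorter than the smallest idle timeout on the
# network path. All values are in seconds.
timers_fit_idle_timeout() {
  heartbeat=$1
  keepalive_time=$2
  idle_timeout=$3
  [ "$heartbeat" -gt 0 ] &&
    [ $(( heartbeat / 2 )) -lt "$idle_timeout" ] &&
    [ "$keepalive_time" -lt "$idle_timeout" ]
}
```

With the values from the steps above and a hypothetical 350-second idle timeout, timers_fit_idle_timeout 60 120 350 succeeds, while a keepalive time of 7200 or a heartbeat of 0 makes it fail.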

For correlating these closures with publish/consume errors across a fleet, the incident response dashboard can group the inet_error log lines by client subnet.

Prevention and Best Practices

  • Never run heartbeat = 0 in production. A 30-60 second heartbeat is cheap and catches dead peers fast.
  • Layer keepalive under heartbeats. Heartbeats cover the AMQP layer; OS keepalive covers idle TCP and intermediaries that strip or ignore application traffic.
  • Keep timers below every idle timeout in the path. NAT, LB, and firewall idle windows are the usual silent killers; pick heartbeat and keepalive values well under the smallest one.
  • Use manual acknowledgements on durable queues so messages stuck on a half-open connection are redelivered after the closure rather than lost.
  • Alert on timeout = 0 connections and on bursts of inet_error,etimedout in the broker log; both indicate misconfigured clients or a flaky path.
  • Drain nodes gracefully so clients send a clean disconnect instead of vanishing.

Related Errors

  • Missed heartbeats (missed heartbeats from client, timeout: 60s) is the faster sibling of this error; when heartbeats are enabled, you usually see that message instead of inet_error,etimedout. See the dedicated missed-heartbeats guide for that path.
  • connection_closed_abruptly appears when a client drops the TCP connection during the AMQP handshake without a clean close, often from health-check probes or aborted connects.
  • CONNECTION_FORCED is a broker- or operator-initiated closure (node shutdown, close_connection, or policy), distinct from the peer-vanished case here.

Browse more broker issues under RabbitMQ troubleshooting.

Frequently Asked Questions

Why does it take minutes for the error to appear? Because the broker only learns the peer is gone when it writes and the kernel exhausts TCP retransmissions. On Linux, the default net.ipv4.tcp_retries2 = 15 corresponds to a hypothetical timeout of 924.6 seconds, so expect about 15 minutes or more of retries before ETIMEDOUT is returned.

Does setting heartbeat = 0 cause this directly? Not by itself, but it removes the fast detection path. With heartbeats disabled, the broker has no AMQP-level liveness check, so a vanished client is only discovered through slow TCP timeouts, exactly when you see inet_error,etimedout.

What is the difference between etimedout and ehostunreach? etimedout means writes were sent but never acknowledged (peer silent). ehostunreach means the kernel has no route to the peer at all, typically after a route withdrawal or NAT eviction returning ICMP unreachable.

Will TCP keepalive replace heartbeats? No. Keepalive is a useful backstop for idle and intermediary cases, but AMQP heartbeats are application-aware and usually detect failures faster. Run both.

Are messages lost when this happens? Messages that were delivered to a consumer but not acknowledged are requeued when the half-open connection finally closes, provided the consumer uses manual acks. Durable queues and persistent messages additionally protect against broker restarts. Auto-ack consumers can lose in-flight messages.
