RabbitMQ Error Guide: 'Node rabbit@host is down' Cluster Member Unreachable

Exact Error Message

When you run a cluster command and a member is unreachable, RabbitMQ reports the node as down:

Cluster status of node rabbit@mq-01 ...
Basics

Cluster name: rabbit@mq-01

Disk Nodes
rabbit@mq-01
rabbit@mq-02
rabbit@mq-03

Running Nodes
rabbit@mq-01
rabbit@mq-03

Error: unable to connect to node rabbit@mq-02: nodedown

A direct command targeting the dead node returns the classic “not responding” diagnostic block:

Error: unable to perform an operation on node 'rabbit@mq-02'.
Please see diagnostics information and suggestions below.

attempted to contact: [rabbit@mq-02]

rabbit@mq-02:
  * connected to epmd (port 4369) on mq-02
  * epmd reports: node 'rabbit' not running at all
  * suggestion: start the node

What the Error Means

A RabbitMQ node runs as an Erlang VM (the “beam” process). Cluster operations rely on Erlang distribution: each peer must be reachable on the epmd port (4369, used for node discovery) and on the distribution port (25672 by default, used for the actual inter-node traffic), and all peers must share the same Erlang cookie. “Node rabbit@host is down” (reported as nodedown) means that a peer in the cluster’s known membership cannot be contacted right now.

The error distinguishes two failure modes that the diagnostic block makes explicit. If epmd answers but reports “node ‘rabbit’ not running at all,” the host is up but the RabbitMQ application/beam is stopped or crashed. If the tool cannot even reach epmd, the host is unreachable or the ports are blocked. Either way the surviving nodes keep the down node in their Disk Nodes list (cluster membership is durable) but omit it from Running Nodes.

Because membership persists, a down node is not automatically forgotten. The cluster will keep waiting for it to return — which is correct, but it also means a permanently-dead node must be explicitly removed, or it will block operations that require all members.
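The gap between durable membership and live membership can be computed directly from the cluster_status text. The sketch below assumes the plain-text layout shown in the sample output above (section headers "Disk Nodes" and "Running Nodes", one node per line); the function name down_nodes is hypothetical, and the layout may differ between RabbitMQ versions.

```shell
# Print every node listed under "Disk Nodes" that is missing from "Running Nodes".
# Reads `rabbitmqctl cluster_status` text output on stdin.
# Assumes the section layout shown in this guide; other versions may differ.
down_nodes() {
  awk '
    /^Disk Nodes/    { section = "disk";    next }
    /^Running Nodes/ { section = "running"; next }
    NF == 1 && $1 ~ /^[^@]+@[^@]+$/ {
      if (section == "disk")    disk[++n] = $1
      if (section == "running") up[$1] = 1
      next
    }
    NF > 0 { section = "" }   # any other non-blank line ends the current section
    END {
      for (i = 1; i <= n; i++)
        if (!(disk[i] in up)) print disk[i]
    }
  '
}

# Usage (on a surviving node):
#   rabbitmqctl cluster_status | down_nodes
```

An empty result means every known member is currently running.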

Common Causes

The RabbitMQ service stopped or crashed (OOM kill, an unhandled error, or an operator stopping it) while the host stays up — epmd reports “not running at all.”
The host itself is down or rebooting, so epmd is unreachable.
A firewall or security-group change blocked port 4369 or 25672 between nodes.
Memory or disk exhaustion. Resource alarms by themselves only block publishers, but if the pressure continues the beam can be OOM-killed or fail on a full disk.
An Erlang cookie mismatch after a redeploy, so the node is up but distribution authentication fails.
A hostname/DNS change so peers resolve mq-02 to the wrong address.
A node that was never recovered or removed after a previous incident, leaving stale membership.
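For the cookie-mismatch cause, one way to compare cookies across hosts without printing the secret is to compare a short hash of each file. This is a minimal sketch: the function name cookie_fingerprint is hypothetical, and the default path /var/lib/rabbitmq/.erlang.cookie is an assumption about a typical Linux package install, so pass the real path if yours differs.

```shell
# Print a short SHA-256 fingerprint of an Erlang cookie file without revealing the cookie.
# The default path is an assumption; pass the actual cookie path as the first argument.
cookie_fingerprint() {
  file=${1:-/var/lib/rabbitmq/.erlang.cookie}
  # sha256sum prints "<hash>  <file>"; keep only the first 16 hex characters.
  sha256sum "$file" | cut -c1-16
}

# Usage (run on each node as a user that can read the cookie file):
#   cookie_fingerprint
```

If the fingerprint printed on one node differs from the others, distribution authentication between them will fail even though every process is up.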

How to Reproduce the Error

Stop the RabbitMQ app on one node, then query the cluster from another:

# On mq-02: stop just the RabbitMQ application (beam keeps running)
rabbitmqctl stop_app

# On mq-01: the peer now shows as not running
rabbitmqctl cluster_status

To reproduce the fully-down nodedown form, stop the whole service (systemctl stop rabbitmq-server) on mq-02 and re-run cluster_status from mq-01.

Diagnostic Commands

Start from a surviving node and confirm which members are running:

# Topology and which nodes are up
rabbitmqctl cluster_status

On the suspected-down host, check whether the service and beam are alive:

# Is the service running at all?
systemctl status rabbitmq-server --no-pager

# Recent crash/exit reasons
sudo journalctl -u rabbitmq-server --since '-30min' | grep -iE 'crash|error|killed|oom|alarm|exit' | tail

Test the distribution layer between nodes — epmd first, then the distribution port:

# What does epmd know about the node? (run on the suspected host)
epmd -names

On a host where the broker is running, epmd -names prints:

epmd: up and running on port 4369 with data:
name rabbit at port 25672

# Reachability of the two ports peers need
nc -vz mq-02 4369     # epmd
nc -vz mq-02 25672    # Erlang distribution
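The distribution port is not always 25672, so it can help to read it from epmd rather than assume it. The sketch below parses the epmd -names output format shown in this guide; the function name epmd_port is hypothetical.

```shell
# Print the distribution port that epmd has registered for a node name (default: rabbit).
# Reads `epmd -names` output on stdin; returns 1 and prints nothing if the name is absent.
epmd_port() {
  awk -v name="${1:-rabbit}" '
    $1 == "name" && $2 == name && $3 == "at" && $4 == "port" { print $5; found = 1 }
    END { exit !found }
  '
}

# Usage (on the suspected host):
#   epmd -names | epmd_port rabbit
```

A non-zero exit status here corresponds to the "not running at all" case: epmd answers, but no node named rabbit is registered with it.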

If the node process is up, confirm from a healthy node that it answers over distribution and check it for memory/disk alarms:

rabbitmq-diagnostics -n rabbit@mq-02 ping
rabbitmq-diagnostics -n rabbit@mq-02 alarms

Step-by-Step Resolution

Step 1: Classify the failure from the diagnostic block

If epmd reports “node ‘rabbit’ not running at all,” the host is up but the broker is stopped — go to Step 2. If epmd itself is unreachable, treat it as host/network down — go to Step 4.
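The classification in this step can be sketched as a small filter over the diagnostic block. It matches only the phrases that appear in the sample output in this guide; the function name classify_diag and the three labels it prints are hypothetical, and real diagnostic text may vary between versions.

```shell
# Classify a rabbitmqctl diagnostic block read from stdin.
# Matches only phrases from the sample in this guide; the labels are illustrative.
classify_diag() {
  diag=$(cat)
  case $diag in
    *"not running at all"*) echo "broker-stopped" ;;             # host up, broker down: Step 2
    *"connected to epmd"*)  echo "epmd-ok-check-distribution" ;; # epmd answers, look at cookie/port
    *)                      echo "host-or-network-unreachable" ;; # epmd not reached: Step 4
  esac
}

# Usage (from a surviving node; the failing command writes the block to stderr or stdout):
#   rabbitmqctl -n rabbit@mq-02 status 2>&1 | classify_diag
```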

Step 2: Restart the broker on the down node

On the affected host, check the service state, start it, and watch it boot:

systemctl status rabbitmq-server --no-pager
sudo systemctl start rabbitmq-server
sudo journalctl -u rabbitmq-server -f

If the beam exited cleanly you can start it again with your service manager. If only the app was stopped, rabbitmqctl start_app brings it back into the cluster.

Step 3: If it crashed, find why before restarting

OOM kills and disk-alarm crashes will recur. Check the log for oom, alarm, or repeated restarts and address the resource pressure first, otherwise the node flaps.

Step 4: For an unreachable host, fix network/ports

Confirm the host is up, then verify that ports 4369 and 25672 are open in both directions between every pair of nodes. A one-way firewall rule produces a node that some peers can see and others cannot.
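Checking every pair in both directions is easy to get wrong by hand, so it may help to generate the full list of checks first. The sketch below only prints commands and runs nothing; the function name port_check_plan is hypothetical, and using ssh to run nc from each source host is an assumption about how you reach the nodes.

```shell
# Print one reachability check per ordered node pair and port.
# Nothing is executed; review the output, then run it or pipe it to sh.
# Running nc over ssh from each source host is an assumption about your access.
port_check_plan() {
  for src in "$@"; do
    for dst in "$@"; do
      if [ "$src" = "$dst" ]; then
        continue
      fi
      for port in 4369 25672; do
        echo "ssh $src nc -z -w 3 $dst $port"
      done
    done
  done
}

# Usage:
#   port_check_plan mq-01 mq-02 mq-03
```

For three nodes this yields twelve checks: six ordered pairs times two ports. A check that passes from mq-01 to mq-02 but fails from mq-02 to mq-01 points at a one-way rule.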

Step 5: Verify the node rejoined

rabbitmqctl cluster_status

The recovered node should reappear under Running Nodes and any partition section should be empty.
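A node can take a while to boot and sync, so a single check right after the restart may be misleading. A minimal retry helper, assuming a plain POSIX shell, can wrap whichever health command you prefer; the function name retry_until and the attempt counts in the usage line are illustrative.

```shell
# Run a command until it succeeds, up to TRIES attempts, sleeping DELAY seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$tries" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (from a healthy node, waiting up to about a minute for the peer to answer):
#   retry_until 30 2 rabbitmq-diagnostics -q -n rabbit@mq-02 ping
```

Once the ping succeeds, rabbitmqctl cluster_status should list the node under Running Nodes again.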

Step 6: If the node is permanently dead, remove it

When a host is gone for good, drop it from membership from a surviving node so cluster-wide operations stop waiting on it:

# From a healthy node, after confirming the dead node will not return
rabbitmqctl forget_cluster_node rabbit@mq-02

Prevention and Best Practices

Run an odd number of nodes (3 or 5) so that losing one member still leaves a majority online.
Use quorum queues, whose Raft replication keeps data safe as long as a majority of replicas is available, so a 3-node cluster tolerates the loss of one node.
Monitor memory and disk watermarks and alert before an alarm escalates to a crash.
Restrict ports 4369 and 25672 to cluster peers, but verify that they stay open between all nodes in both directions, and re-check after any security-group change.
Pin hostnames and the Erlang cookie in configuration management so redeploys do not silently break distribution.
Alert on the Running Nodes count so that a missing member pages immediately rather than being discovered during the next deploy.
Have a documented forget_cluster_node runbook for permanently failed hosts.

Related Errors

Mnesia network partition (split brain): when nodes stay up but lose distribution to each other instead of going fully down.
epmd / NXDOMAIN host resolution — a node that cannot be contacted because its hostname does not resolve.
inconsistent_cluster — a node that comes back up but disagrees about cluster membership.

Frequently Asked Questions

Why does the down node still appear in Disk Nodes? Cluster membership is durable. A node remains a known member until you explicitly remove it with forget_cluster_node, which is what lets it rejoin automatically after a transient outage.

The host is up but epmd says ‘not running at all’ — what does that mean? The machine and epmd are fine, but the RabbitMQ application/beam is stopped or crashed. Restart the service and check the log for the exit reason.

Can I run cluster commands against the down node? No — rabbitmqctl needs to reach the target node over distribution. Run diagnostics from a surviving node and target the dead host only with reachability tools like nc and epmd -names.

Do I need to restart the whole cluster to recover one node? No. Recover the single node (restart the service or fix the network). The surviving majority keeps serving throughout.

When should I use forget_cluster_node? Only when a host is permanently gone and will not return with the same name. A forgotten node that later comes back will report an inconsistent cluster and must be reset before it can rejoin.
