RabbitMQ Error Guide: 'Node rabbit@host is down' Cluster Member Unreachable
Fix RabbitMQ 'Node rabbit@host is down' and 'not responding' errors: a crashed beam, stopped service, or blocked distribution ports. Diagnose and recover safely.
- #rabbitmq
- #troubleshooting
- #errors
- #clustering
Exact Error Message
When you run a cluster command and a member is unreachable, RabbitMQ reports the node as down:
Cluster status of node rabbit@mq-01 ...
Basics
Cluster name: rabbit@mq-01
Disk Nodes
rabbit@mq-01
rabbit@mq-02
rabbit@mq-03
Running Nodes
rabbit@mq-01
rabbit@mq-03
Error: unable to connect to node rabbit@mq-02: nodedown
A direct command targeting the dead node returns the classic “not responding” diagnostic block:
Error: unable to perform an operation on node 'rabbit@mq-02'.
Please see diagnostics information and suggestions below.
attempted to contact: [rabbit@mq-02]
rabbit@mq-02:
* connected to epmd (port 4369) on mq-02
* epmd reports: node 'rabbit' not running at all
* suggestion: start the node
What the Error Means
A RabbitMQ node runs as an Erlang VM (the “beam” process). Cluster operations rely on Erlang distribution: each peer must be reachable over epmd (port 4369, for node discovery) and the distribution port (25672, for the actual inter-node traffic), and must share a matching Erlang cookie. “Node rabbit@host is down” / nodedown means a peer that is part of the cluster’s known membership cannot be contacted right now.
The error distinguishes two failure modes that the diagnostic block makes explicit. If epmd answers but reports “node ‘rabbit’ not running at all,” the host is up but the RabbitMQ application/beam is stopped or crashed. If the tool cannot even reach epmd, the host is unreachable or the ports are blocked. Either way the surviving nodes keep the down node in their Disk Nodes list (cluster membership is durable) but omit it from Running Nodes.
Because membership persists, a down node is not automatically forgotten. The cluster will keep waiting for it to return — which is correct, but it also means a permanently-dead node must be explicitly removed, or it will block operations that require all members.
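Since membership is durable, the down members are the nodes listed under Disk Nodes but missing from Running Nodes. The sketch below computes that difference. It assumes the plain-text layout shown earlier (section headings on their own lines, one node name per line). `down_nodes` is a hypothetical helper name, not a RabbitMQ command, and RAM nodes would need a similar section rule.

```bash
# Hypothetical helper: print cluster members that are known but not running.
# Reads `rabbitmqctl cluster_status` text on stdin.
down_nodes() {
  awk '
    /^Disk Nodes[ \t]*$/    { section = "disk";    next }
    /^Running Nodes[ \t]*$/ { section = "running"; next }
    # A node name is a single token containing "@", e.g. rabbit@mq-02
    NF == 1 && $1 ~ /@/ {
      if (section == "disk")    disk[$1] = 1
      if (section == "running") running[$1] = 1
      next
    }
    # Any other non-empty line ends the current section
    NF > 0 { section = "" }
    END { for (n in disk) if (!(n in running)) print n }
  ' | sort
}

# Usage, from a surviving node:
#   rabbitmqctl cluster_status | down_nodes
```

An empty result means every known disk node is currently running.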
Common Causes
- The RabbitMQ service stopped or crashed (OOM kill, an unhandled error, or an operator stopping it) while the host stays up — epmd reports “not running at all.”
- The host itself is down or rebooting, so epmd is unreachable.
- A firewall or security-group change blocked port 4369 or 25672 between nodes.
- The beam crashed on low memory or disk (the memory/disk alarm escalated to a crash).
- An Erlang cookie mismatch after a redeploy, so the node is up but distribution authentication fails.
- A hostname/DNS change so peers resolve mq-02 to the wrong address.
- A still-down node never re-added after a previous incident, leaving stale membership.
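A cookie mismatch is easier to spot by comparing a fingerprint of the cookie file on each node than by reading the secret itself. This is a minimal sketch. The default path `/var/lib/rabbitmq/.erlang.cookie` is an assumption (it is typical for Linux packages) and `cookie_fingerprint` is a hypothetical helper name.

```bash
# Hypothetical helper: print a short fingerprint of an Erlang cookie file
# without revealing the cookie. The default path is an assumption.
cookie_fingerprint() (
  cookie_file="${1:-/var/lib/rabbitmq/.erlang.cookie}"
  # sha256sum prints "<hash>  <file>"; keep the first 16 hex characters
  sha256sum "$cookie_file" | cut -c1-16
)

# Usage: run on every node and compare the values by eye.
#   sudo sh -c '. ./cookie.sh; cookie_fingerprint'
# For the DNS cause, check what each peer resolves:
#   getent hosts mq-02
```

If the fingerprints differ between nodes, distribution authentication fails even though the broker process is up.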
How to Reproduce the Error
Stop the RabbitMQ app on one node, then query the cluster from another:
# On mq-02: stop just the RabbitMQ application (beam keeps running)
rabbitmqctl stop_app
# On mq-01: the peer now shows as not running
rabbitmqctl cluster_status
To reproduce the fully-down nodedown form, stop the whole service (systemctl stop rabbitmq-server) on mq-02 and re-run cluster_status from mq-01.
Diagnostic Commands
Start from a surviving node and confirm which members are running:
# Topology and which nodes are up
rabbitmqctl cluster_status
On the suspected-down host, check whether the service and beam are alive:
# Is the service running at all?
systemctl status rabbitmq-server --no-pager
# Recent crash/exit reasons
sudo journalctl -u rabbitmq-server --since '-30min' | grep -iE 'crash|error|killed|oom|alarm|exit' | tail
Test the distribution layer between nodes — epmd first, then the distribution port:
# What does epmd know about the node?
epmd -names
# Reachability of the two ports peers need
nc -vz mq-02 4369 # epmd
nc -vz mq-02 25672 # Erlang distribution
On a host where the broker is running, epmd -names prints:
epmd: up and running on port 4369 with data:
name rabbit at port 25672
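The two port probes can be wrapped in a small function so the same check is easy to repeat from every peer. This is a sketch only. `probe` and `check_dist_ports` are hypothetical names, and the 3-second timeout is an arbitrary choice so a silently dropped packet does not hang the check.

```bash
# Probe one TCP port; returns 0 if something accepts the connection.
# -w 3 caps the wait at three seconds (arbitrary choice).
probe() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Report the state of the two ports peers need:
# 4369 (epmd) and 25672 (Erlang distribution).
check_dist_ports() (
  for port in 4369 25672; do
    if probe "$1" "$port"; then
      echo "$port open"
    else
      echo "$port unreachable"
    fi
  done
)

# Usage, from each peer in turn:
#   check_dist_ports mq-02
```

If 4369 is open but 25672 is not, epmd can be queried while inter-node traffic still fails, which usually points at a firewall rule rather than a stopped broker.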
If the node process is up, confirm distribution health and memory/disk alarms from a healthy node:
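# Resource alarms, asked via a healthy node
rabbitmq-diagnostics -n rabbit@mq-01 alarms
# Can the suspect node be reached over Erlang distribution?
rabbitmq-diagnostics -n rabbit@mq-02 ping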
Step-by-Step Resolution
Step 1: Classify the failure from the diagnostic block
If epmd reports “node ‘rabbit’ not running at all,” the host is up but the broker is stopped — go to Step 2. If epmd itself is unreachable, treat it as host/network down — go to Step 4.
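The classification in this step can be sketched as a small text filter over the diagnostic block. `classify_nodedown` is a hypothetical helper. The "not running at all" string comes from the output shown earlier, while "unable to connect to epmd" is an assumption about the wording used when epmd cannot be reached, so adjust it to what your version prints.

```bash
# Hypothetical helper: map the diagnostic block (on stdin) to a next step.
classify_nodedown() (
  diag=$(cat)
  case "$diag" in
    # epmd answered, but no RabbitMQ node is registered with it
    *"not running at all"*)        echo "broker-stopped: go to Step 2" ;;
    # assumed wording for an unreachable epmd
    *"unable to connect to epmd"*) echo "host-or-network: go to Step 4" ;;
    *)                             echo "unclassified: inspect the output by hand" ;;
  esac
)

# Usage, from a surviving node:
#   rabbitmqctl -n rabbit@mq-02 status 2>&1 | classify_nodedown
```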
Step 2: Restart the broker on the down node
On the affected host, start the service and watch it boot:
sudo systemctl start rabbitmq-server
sudo journalctl -u rabbitmq-server -f
If the beam exited cleanly you can start it again with your service manager. If only the app was stopped, rabbitmqctl start_app brings it back into the cluster.
Step 3: If it crashed, find why before restarting
OOM kills and disk-alarm crashes will recur. Check the log for oom, alarm, or repeated restarts and address the resource pressure first, otherwise the node flaps.
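A rough way to triage the log before restarting is to scan it for the patterns this step mentions. The helper below is a sketch. `crash_hints` is a hypothetical name, and the regular expressions are assumptions about common kernel, systemd, and RabbitMQ log wording, so treat a "no-known-pattern" result as "read the log", not "all clear".

```bash
# Hypothetical helper: summarize likely crash causes from log text on stdin.
crash_hints() (
  log=$(cat)
  found=0
  # Assumed wording of kernel/systemd OOM messages
  if printf '%s\n' "$log" | grep -qiE 'out of memory|oom[-_ ]kill'; then
    echo "oom-kill"
    found=1
  fi
  # Assumed wording of RabbitMQ memory/disk alarm log lines
  if printf '%s\n' "$log" | grep -qiE '(memory|disk) resource limit alarm set'; then
    echo "resource-alarm"
    found=1
  fi
  [ "$found" -eq 1 ] || echo "no-known-pattern"
)

# Usage on the affected host (kernel OOM messages usually live in the kernel log):
#   sudo journalctl -u rabbitmq-server --since '-30min' | crash_hints
#   sudo journalctl -k --since '-30min' | crash_hints
```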
Step 4: For an unreachable host, fix network/ports
Confirm the host is up, then verify that 4369 and 25672 are open in both directions between every pair of nodes. A one-way firewall rule produces a node that some peers can see and others cannot.
Step 5: Verify the node rejoined
rabbitmqctl cluster_status
The recovered node should reappear under Running Nodes and any partition section should be empty.
Step 6: If the node is permanently dead, remove it
When a host is gone for good, drop it from membership from a surviving node so cluster-wide operations stop waiting on it:
# From a healthy node, after confirming the dead node will not return
rabbitmqctl forget_cluster_node rabbit@mq-02
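Because forgetting a node is hard to undo, it can help to guard the command. The sketch below refuses to proceed if the node still answers a ping and only prints the command unless `APPLY=1` is set. `node_responds`, `safe_forget`, and the `APPLY` variable are hypothetical names chosen for this example.

```bash
# Returns 0 if the node still answers over Erlang distribution.
node_responds() {
  rabbitmq-diagnostics -n "$1" ping >/dev/null 2>&1
}

# Hypothetical guard: refuse to forget a node that is still reachable.
# Prints the command by default; set APPLY=1 to actually run it.
safe_forget() {
  if node_responds "$1"; then
    echo "refusing: $1 still responds to ping" >&2
    return 1
  fi
  if [ "${APPLY:-0}" = 1 ]; then
    rabbitmqctl forget_cluster_node "$1"
  else
    echo "rabbitmqctl forget_cluster_node $1"
  fi
}

# Usage, from a healthy node:
#   safe_forget rabbit@mq-02            # dry run, prints the command
#   APPLY=1 safe_forget rabbit@mq-02    # runs it
```

A failed ping only shows the node is unreachable right now. Whether it is gone for good is still a human decision.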
Prevention and Best Practices
- Run an odd number of nodes (3 or 5) so the loss of one member never strands a majority.
- Use quorum queues, whose Raft replication keeps a queue available and its data safe as long as a majority of replicas is up (for example, one node lost out of three).
- Monitor memory and disk watermarks and alert before an alarm escalates to a crash.
- Lock down but verify ports 4369 and 25672 between all nodes, in both directions, and re-check after any security-group change.
- Pin hostnames and the Erlang cookie in configuration management so redeploys do not silently break distribution.
- Alert on Running Nodes count so a missing member pages immediately rather than being discovered during the next deploy.
- Have a documented forget_cluster_node runbook for permanently failed hosts.
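The Running Nodes alert can be prototyped as a check that exits non-zero when fewer nodes are running than expected, which most monitoring agents and cron wrappers can consume. This is a sketch. `check_running_count` is a hypothetical helper, and it assumes the plain-text `cluster_status` layout with a "Running Nodes" heading followed by one node name per line.

```bash
# Hypothetical check: exit non-zero when fewer than N nodes are running.
# Reads `rabbitmqctl cluster_status` text on stdin; N is the first argument.
check_running_count() {
  awk -v expected="$1" '
    /^Running Nodes[ \t]*$/ { in_running = 1; next }
    # Count single-token lines containing "@" inside the section
    in_running && NF == 1 && $1 ~ /@/ { count++; next }
    # Any other non-empty line ends the section
    NF > 0 { in_running = 0 }
    END {
      printf "running=%d expected=%d\n", count, expected
      exit ((count + 0 < expected + 0) ? 1 : 0)
    }
  '
}

# Usage, e.g. from a scheduled job on any node of a three-node cluster:
#   rabbitmqctl cluster_status | check_running_count 3 || echo "page someone"
```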
Related Errors
- Mnesia network partition (split brain): when nodes stay up but lose distribution to each other instead of going fully down.
- epmd / NXDOMAIN host resolution: a node that cannot be contacted because its hostname does not resolve.
- inconsistent_cluster: a node that comes back up but disagrees about cluster membership.
Frequently Asked Questions
Why does the down node still appear in Disk Nodes? Cluster membership is durable. A node remains a known member until you explicitly forget_cluster_node it, which is what lets it rejoin automatically after a transient outage.
The host is up but epmd says ‘not running at all’ — what does that mean? The machine and epmd are fine, but the RabbitMQ application/beam is stopped or crashed. Restart the service and check the log for the exit reason.
Can I run cluster commands against the down node? No — rabbitmqctl needs to reach the target node over distribution. Run diagnostics from a surviving node and target the dead host only with reachability tools like nc and epmd -names.
Do I need to restart the whole cluster to recover one node? No. Recover the single node (restart the service or fix the network). The surviving majority keeps serving throughout.
When should I use forget_cluster_node? Only when a host is permanently gone and will not return with the same name. Forgetting a node that later comes back will cause it to report an inconsistent cluster.