RabbitMQ Error Guide: 'timeout_waiting_for_tables' Cluster Startup Failure
Fix RabbitMQ timeout_waiting_for_tables on startup: node boot order, last-disc-node down, Mnesia table sync, and forget_cluster_node recovery.
Exact Error Message
When a clustered RabbitMQ node cannot synchronise its Mnesia tables on boot, the broker refuses to start and logs a fatal error. The full crash typically looks like this:
2026-06-24 09:12:41.882 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2026-06-24 09:13:11.883 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,
[rabbit@node2],
[rabbit_durable_queue,rabbit_durable_exchange,rabbit_runtime_parameters]}
2026-06-24 09:18:12.014 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
BOOT FAILED
===========
Error during startup: {error,
{timeout_waiting_for_tables,
[rabbit@node1,rabbit@node2],
[rabbit_durable_queue,rabbit_durable_exchange,
rabbit_runtime_parameters]}}
2026-06-24 09:18:12.117 [error] <0.272.0> CRASH REPORT Process <0.272.0> with 0 neighbours crashed with reason:
{{timeout_waiting_for_tables,[rabbit@node1,rabbit@node2],
[rabbit_durable_queue,rabbit_durable_exchange]},...}
2026-06-24 09:18:12.119 [error] <0.271.0> Supervisor rabbit_prelaunch_sup had child prelaunch started with
rabbit_prelaunch:run_prelaunch_first_phase() at undefined exit with reason
{timeout_waiting_for_tables,...} in context start_error
2026-06-24 09:18:12.130 [error] <0.270.0> Runtime terminating during boot ({timeout_waiting_for_tables})
init terminating in do_boot ({timeout_waiting_for_tables,...})
You may also see this surfaced by systemd as rabbitmq-server.service: Main process exited, code=exited, status=70/... followed by the unit entering a failed state.
What the Error Means
RabbitMQ stores cluster metadata — durable queues, exchanges, bindings, users, vhosts, and runtime parameters — in Mnesia, Erlang’s distributed database. In a cluster, disc nodes each hold a copy of this schema. When a node boots, it must agree on the current state of these tables with the other nodes it remembers being clustered with.
The error timeout_waiting_for_tables means the booting node waited (by default 30 seconds per retry, 10 retries, so up to ~5 minutes) for one or more peer nodes to come online and share their Mnesia table copies — and they never appeared. Because the node cannot safely guess which copy is authoritative, it aborts the boot rather than risk diverging schema or losing data.
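The wait budget is simple arithmetic, and it helps to compute it before changing either setting. A minimal sketch, assuming the defaults quoted above; the function name total_wait_seconds is illustrative, not part of RabbitMQ:

```shell
#!/bin/sh
# Total time a booting node waits before giving up, in seconds.
# $1: per-retry timeout in milliseconds, $2: number of retries.
total_wait_seconds() {
  timeout_ms=$1
  retries=$2
  echo $(( timeout_ms * retries / 1000 ))
}

# Defaults described above: 30000 ms per retry, 10 retries.
total_wait_seconds 30000 10   # prints 300 (5 minutes)
```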
The list inside the error — [rabbit@node1,rabbit@node2] — is the set of nodes whose tables the booting node is still waiting on. The second list names the actual Mnesia tables that failed to sync.
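To pull that node list out of a log without reading the whole crash report, something like the following may help. It is a rough sketch: the helper name waiting_on_nodes is hypothetical, and the pattern assumes node names shaped like rabbit@host, so adjust it if your nodes use a different prefix.

```shell
#!/bin/sh
# Print each distinct node name found in the text read from stdin.
# Assumes node names look like rabbit@<host>; table names such as
# rabbit_durable_queue do not match because they contain no "@".
waiting_on_nodes() {
  grep -o 'rabbit@[A-Za-z0-9._-]*' | sort -u
}
```

One possible use is to feed it only the lines around the error, for example `grep -A 2 'timeout_waiting_for_tables' /var/log/rabbitmq/rabbit@$(hostname -s).log | waiting_on_nodes`, so that unrelated mentions of node names elsewhere in the log are left out.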
The root principle: the last node to stop must be the first node to start. That node holds the most recent view of the cluster, and every other node defers to it on boot.
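If you record the order in which nodes were stopped, the start order follows mechanically. The sketch below simply reverses the stop order; start_order is an illustrative helper, and reversing the whole list is a conservative choice, since the rule above only requires that the last-stopped node comes first.

```shell
#!/bin/sh
# Given node names in the order they were stopped, print the order in
# which to start them: the last node stopped is listed first.
start_order() {
  reversed=""
  for node in "$@"; do
    reversed="$node${reversed:+ $reversed}"
  done
  echo "$reversed"
}

# node1 was stopped first, node2 last, so node2 must start first.
start_order rabbit@node1 rabbit@node2   # prints: rabbit@node2 rabbit@node1
```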
Common Causes
- Wrong boot order after a full cluster shutdown. If node2 was the last to stop but you start node1 first, node1 waits for node2 to confirm the schema, and times out if node2 stays down.
- The last-stopped disc node is down or unrecoverable. Hardware failure, a deleted VM, or a corrupted data directory on the authoritative node leaves survivors waiting forever.
- Network connectivity or DNS issues between nodes, so the booting node cannot reach a peer that is actually running (blocked port 4369/epmd, 25672 for inter-node traffic, or unresolvable hostnames).
- Erlang cookie mismatch. If /var/lib/rabbitmq/.erlang.cookie differs between nodes, they cannot authenticate to each other, so tables never sync.
- Hostname changes. RabbitMQ node identity is tied to its hostname (rabbit@node1). Renaming the host orphans the old node name in the cluster schema.
- A simultaneous power loss where no node recorded a clean shutdown, leaving ambiguity about which node is most current.
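The cookie-mismatch cause is easy to check without printing the secret itself. The following is a sketch under assumptions: it relies on sha256sum from GNU coreutils being available, and the helper names cookie_digest and cookies_match are made up for illustration.

```shell
#!/bin/sh
# Print a SHA-256 digest of a cookie file without revealing its contents.
cookie_digest() {
  sha256sum "$1" | cut -d' ' -f1
}

# Succeed only when both files are readable and have identical contents.
cookies_match() {
  [ -r "$1" ] && [ -r "$2" ] &&
    [ "$(cookie_digest "$1")" = "$(cookie_digest "$2")" ]
}
```

In practice you would run `cookie_digest /var/lib/rabbitmq/.erlang.cookie` on each node and compare the printed digests by eye; cookies_match is only useful when both files have been copied to one place.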
How to Reproduce the Error
You can recreate this safely in a lab to understand the boot-order dependency:
- Build a two-node cluster: rabbit@node1 and rabbit@node2, both as disc nodes.
- Declare a durable queue so there is schema to sync.
- Stop rabbit@node1 first, then stop rabbit@node2. Now node2 is the last-stopped (authoritative) node.
- Start rabbit@node1 first while node2 stays offline.
node1 will log Waiting for Mnesia tables, retry until exhausted, and finally print BOOT FAILED with timeout_waiting_for_tables,[rabbit@node2]. It is waiting for node2 because node2 held the newest schema view.
Diagnostic Commands
Start by confirming what each node thinks the cluster looks like and whether peers are reachable. These are all read-only.
# Cluster membership and which nodes are running (run on any reachable node)
rabbitmqctl cluster_status
# Lightweight health and Erlang distribution checks
rabbitmq-diagnostics ping
rabbitmq-diagnostics status
rabbitmq-diagnostics check_running
# Full broker status including listeners and partitions
rabbitmqctl status
# Tail the boot log for the timeout error and the node list it is waiting on
grep -i "timeout_waiting_for_tables\|Waiting for Mnesia\|BOOT FAILED" \
/var/log/rabbitmq/rabbit@$(hostname -s).log
# systemd-level view of the failed start
journalctl -u rabbitmq-server --no-pager -n 100
# Confirm epmd (4369) and inter-node port (25672) are listening / reachable
ss -ltnp | grep -E ':4369|:25672'
Healthy cluster_status output on a working two-node cluster looks like this:
Cluster status of node rabbit@node1 ...
Basics
Cluster name: rabbit@node1
Disk Nodes
rabbit@node1
rabbit@node2
Running Nodes
rabbit@node1
rabbit@node2
Versions
rabbit@node1: RabbitMQ 3.13.2 on Erlang 26.2.5
rabbit@node2: RabbitMQ 3.13.2 on Erlang 26.2.5
Maintenance status
Node: rabbit@node1, status: not under maintenance
Node: rabbit@node2, status: not under maintenance
If node2 has failed, you would instead see it listed under Disk Nodes but absent from Running Nodes, confirming the survivor is waiting on an offline peer.
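The comparison between the Disk Nodes and Running Nodes sections can be automated. This is a rough sketch that parses the plain-text layout shown above; the helper name offline_nodes is hypothetical, and the parsing assumes section headings contain no "@" while node lines do, which may not hold for every RabbitMQ version or formatter.

```shell
#!/bin/sh
# Read `rabbitmqctl cluster_status` text output on stdin and print every
# node listed under "Disk Nodes" that is missing from "Running Nodes".
offline_nodes() {
  awk '
    /^Disk Nodes/    { section = "disk"; next }
    /^Running Nodes/ { section = "running"; next }
    NF == 0          { next }                 # skip blank lines
    !/@/             { section = ""; next }   # any other heading ends a section
    section == "disk"    { disk[++n] = $1 }
    section == "running" { running[$1] = 1 }
    END {
      for (i = 1; i <= n; i++)
        if (!(disk[i] in running)) print disk[i]
    }
  '
}
```

A typical invocation would be `rabbitmqctl cluster_status | offline_nodes`, run on any node that is still up.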
Step-by-Step Resolution
1. Identify the last-stopped node. Check shutdown timestamps in each node’s log. The node that logged its Stopping RabbitMQ / shutdown message last is authoritative. Search the logs:
grep -i "Stopping RabbitMQ\|stopped\|Successfully stopped" \
/var/log/rabbitmq/rabbit@$(hostname -s).log
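To compare nodes quickly, it can help to reduce each log to the timestamp of its final shutdown line. A minimal sketch, assuming log lines begin with a 23-character timestamp as in the excerpt at the top of this guide and that clocks are synchronised across nodes; last_stop_time is an illustrative name, and the exact shutdown wording may differ between RabbitMQ versions.

```shell
#!/bin/sh
# Read a RabbitMQ log on stdin and print the timestamp (first 23
# characters) of the last line that mentions a shutdown.
last_stop_time() {
  grep -iE 'Stopping RabbitMQ|Successfully stopped' | tail -n 1 | cut -c1-23
}
```

Run `last_stop_time < /var/log/rabbitmq/rabbit@$(hostname -s).log` on each node; because the timestamps share one format, the largest string belongs to the node that stopped last.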
2. Start nodes in the correct order. Stop the failing node, then start the last-stopped node first. Once it is fully up, start the remaining nodes. They will find the authoritative schema and sync immediately. In most cases this alone resolves the timeout — no data changes required.
3. If the last-stopped node is recoverable but slow, simply give it more time or bring it online before the others. You can verify it is up with rabbitmq-diagnostics check_running before starting peers.
4. If the last-stopped node is permanently gone (dead hardware, deleted VM), the survivors will wait forever for a node that will never return. In that case you must tell the cluster to forget the dead node so the survivors can elect a new authoritative schema. This is a deliberate recovery action — run it from a running, reachable node against the dead node name:
# RECOVERY ACTION (run intentionally, from a healthy node, targeting the DEAD node):
# Removes the unrecoverable node so survivors can boot without waiting on it.
rabbitmqctl forget_cluster_node rabbit@node2
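If no node is currently running (every node is stuck in the timeout), run forget_cluster_node with the --offline flag on one of the stopped survivors while the rest of the cluster is down. Offline removal is intended only for this situation, where every node is down and the last node to stop cannot be brought back; used in other circumstances it can lead to inconsistencies. Always target only the node you have confirmed is unrecoverable.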
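Because this step is destructive, one option is to build the command as a dry run and review it before executing anything. The sketch below only prints the command; forget_command and its arguments are hypothetical, and the guard simply refuses to target the node you are running on.

```shell
#!/bin/sh
# Print (do not run) the forget_cluster_node command for a dead node.
# $1: dead node name, $2: name of the node you are running on,
# $3: "online" (default) or "offline".
forget_command() {
  dead=$1
  here=$2
  mode=${3:-online}
  if [ -z "$dead" ] || [ "$dead" = "$here" ]; then
    echo "refusing: target must be a remote, confirmed-dead node" >&2
    return 1
  fi
  case $mode in
    online)  echo "rabbitmqctl forget_cluster_node $dead" ;;
    offline) echo "rabbitmqctl forget_cluster_node --offline $dead" ;;
    *)       echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```

For example, `forget_command rabbit@node2 rabbit@node1 offline` prints the offline variant for review, while passing the local node as the target returns a non-zero status.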
5. Rule out connectivity and authentication. If the peer is actually up but unreachable, confirm the Erlang cookie matches across nodes, hostnames resolve, and ports 4369 and 25672 are open. Use rabbitmq-diagnostics -n rabbit@node2 ping from the booting host to test the Erlang distribution path.
6. Verify recovery. Once the cluster is back, confirm membership and that all nodes are running:
rabbitmqctl cluster_status
rabbitmq-diagnostics check_running
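When starting nodes in sequence, it is convenient to block until one node reports healthy before starting the next. A small polling sketch; wait_until is a generic helper written for this guide, not a RabbitMQ tool, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempts run out.
# $1: maximum attempts, $2: seconds between attempts, rest: the command.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    # Sleep only if another attempt remains.
    [ "$attempts" -gt 0 ] && sleep "$delay"
  done
  return 1
}
```

For instance, `wait_until 30 5 rabbitmq-diagnostics -q check_running` would poll the local node for up to roughly two and a half minutes and return non-zero if it never comes up.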
Prevention and Best Practices
- Always shut down and start up in a known order. Document which node is your primary disc node and start it first. Scripts that stop nodes in reverse start-order make recovery predictable.
- Avoid full simultaneous shutdowns. Rolling restarts keep a quorum of nodes online so the schema is never ambiguous.
- Use three or more disc nodes rather than two, so the loss of one node never blocks the survivors on a single point of failure.
- Tune the wait if appropriate. You can raise the table-wait retry budget via the mnesia_table_loading_retry_timeout and mnesia_table_loading_retry_limit configuration keys to tolerate slower peers, but this is a band-aid, not a fix for a dead node.
- Pin hostnames. Use stable DNS names or /etc/hosts entries so node identity never drifts after a reboot.
- Back up the Mnesia directory and definitions (rabbitmqctl export_definitions) regularly so you can rebuild schema if a disc node is lost.
- Monitor cluster health continuously so you catch a node that stopped without a clean shutdown before the next restart turns it into a boot failure.
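For the tuning item above, the two keys can be rendered as configuration lines from a small helper. This is a sketch that assumes your RabbitMQ version accepts these keys in rabbitmq.conf (commonly /etc/rabbitmq/rabbitmq.conf); retry_config is an illustrative name, and the values you pass are your own choice.

```shell
#!/bin/sh
# Print rabbitmq.conf lines for the Mnesia table-wait budget.
# $1: per-retry timeout in milliseconds, $2: number of retries.
retry_config() {
  printf 'mnesia_table_loading_retry_timeout = %s\n' "$1"
  printf 'mnesia_table_loading_retry_limit = %s\n' "$2"
}
```

Calling `retry_config 60000 20` prints two lines that raise the budget from about 5 minutes to about 20; review them before adding them to the config file, and restart the node for the change to apply.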
Related Errors
- Mnesia network partition / partition detected: when nodes lose contact mid-run and each side keeps writing, you get a split-brain partition reported in rabbitmqctl cluster_status under a Network Partitions section. This is the running-cluster cousin of the boot-time timeout.
- inconsistent_database: Mnesia detects that two nodes have conflicting schema versions, often after a partition or an out-of-order restart. It surfaces as running_partitioned_network or inconsistent_database in the logs and usually requires choosing a winning node.
- Node rabbit@nodeX not running / nodedown: clustering commands fail because the target node is unreachable, frequently caused by the same epmd, cookie, or hostname problems behind a timeout_waiting_for_tables.
Frequently Asked Questions
Why does RabbitMQ care which node started first?
Because Mnesia has no central coordinator. The last node to stop holds the most recent schema, and every other node treats it as the source of truth on boot. Starting an older node first means it waits for the newer one to confirm state, and it times out if that node is absent.
Can I just delete the Mnesia data to make the node start?
No. Deleting the data directory destroys durable queues, exchanges, users, and vhosts, and on a clustered node can corrupt the shared schema. The supported path is correct boot order, or forget_cluster_node for a confirmed-dead peer.
My last-stopped node is dead. How do I recover the rest?
Run rabbitmqctl forget_cluster_node rabbit@<deadnode> from a healthy node, or with --offline on a stopped survivor if the whole cluster is down. The survivors then stop waiting on the dead node and elect a new authoritative schema.
How long does RabbitMQ wait before failing?
By default each retry waits 30 seconds with 10 retries, roughly 5 minutes total, before printing BOOT FAILED. You can extend this with mnesia_table_loading_retry_timeout and mnesia_table_loading_retry_limit, but extending it will not help if the peer is permanently gone.
Does this affect quorum queues the same way?
Quorum queues use the Raft protocol and tolerate node loss better at the queue level, but the cluster’s Mnesia metadata schema still uses the boot-order rules described here. The startup timeout is about cluster schema, not individual queue replication.