AI for RabbitMQ · By James Joyner IV · 9 min read

RabbitMQ Error Guide: 'timeout_waiting_for_tables' Cluster Startup Failure

Fix RabbitMQ timeout_waiting_for_tables on startup: node boot order, last-disc-node down, Mnesia table sync, and forget_cluster_node recovery.

  • #rabbitmq
  • #troubleshooting
  • #errors
  • #clustering

Exact Error Message

When a clustered RabbitMQ node cannot synchronise its Mnesia tables on boot, the broker refuses to start and logs a fatal error. The full crash typically looks like this:

2026-06-24 09:12:41.882 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2026-06-24 09:13:11.883 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,
    [rabbit@node2],
    [rabbit_durable_queue,rabbit_durable_exchange,rabbit_runtime_parameters]}
2026-06-24 09:18:12.014 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 0 retries left

BOOT FAILED
===========
Error during startup: {error,
    {timeout_waiting_for_tables,
        [rabbit@node1,rabbit@node2],
        [rabbit_durable_queue,rabbit_durable_exchange,
         rabbit_runtime_parameters]}}

2026-06-24 09:18:12.117 [error] <0.272.0> CRASH REPORT Process <0.272.0> with 0 neighbours crashed with reason:
    {{timeout_waiting_for_tables,[rabbit@node1,rabbit@node2],
        [rabbit_durable_queue,rabbit_durable_exchange]},...}
2026-06-24 09:18:12.119 [error] <0.271.0> Supervisor rabbit_prelaunch_sup had child prelaunch started with
    rabbit_prelaunch:run_prelaunch_first_phase() at undefined exit with reason
    {timeout_waiting_for_tables,...} in context start_error
2026-06-24 09:18:12.130 [error] <0.270.0> Runtime terminating during boot ({timeout_waiting_for_tables})

init terminating in do_boot ({timeout_waiting_for_tables,...})

You may also see this surfaced by systemd as rabbitmq-server.service: Main process exited, code=exited, status=70/... followed by the unit entering a failed state.

What the Error Means

RabbitMQ stores cluster metadata — durable queues, exchanges, bindings, users, vhosts, and runtime parameters — in Mnesia, Erlang’s distributed database. In a cluster, disc nodes each hold a copy of this schema. When a node boots, it must agree on the current state of these tables with the other nodes it remembers being clustered with.

The error timeout_waiting_for_tables means the booting node waited (by default 30 seconds per retry, 10 retries, so up to ~5 minutes) for one or more peer nodes to come online and share their Mnesia table copies — and they never appeared. Because the node cannot safely guess which copy is authoritative, it aborts the boot rather than risk diverging schema or losing data.
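As a quick check of that arithmetic, the wait budget is the per-attempt timeout multiplied by the number of attempts. A minimal sketch; the function name mnesia_wait_budget_seconds is made up for illustration, and the arguments are the default values quoted above:

```shell
# Hypothetical helper: total seconds a node waits for Mnesia tables
# before giving up. $1 = per-attempt timeout in ms, $2 = number of attempts.
mnesia_wait_budget_seconds() {
  echo $(( $1 * $2 / 1000 ))
}

mnesia_wait_budget_seconds 30000 10   # defaults: prints 300 (5 minutes)
```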

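The first list inside the error, [rabbit@node1,rabbit@node2], names the cluster members the booting node knows about and expects to sync with. Depending on the RabbitMQ version it can include the booting node itself, so cross-check it against which peers are actually down. The second list names the Mnesia tables that failed to sync.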
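To pull the node names out of a long crash report, a small filter is often enough. This is a sketch, not an official tool: nodes_in_error is a hypothetical helper name, and it assumes node names look like name@host, as in the sample log above.

```shell
# Hypothetical helper: print each node name (name@host) found on stdin,
# one per line, de-duplicated.
nodes_in_error() {
  grep -oE '[A-Za-z0-9_-]+@[A-Za-z0-9_.-]+' | sort -u
}

# Example (log path as used elsewhere in this guide):
#   grep -A6 'BOOT FAILED' /var/log/rabbitmq/rabbit@$(hostname -s).log | nodes_in_error
```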

The root principle: the last node to stop must be the first node to start. That node holds the most recent view of the cluster, and every other node defers to it on boot.

Common Causes

  • Wrong boot order after a full cluster shutdown. If node2 was the last to stop but you start node1 first, node1 waits for node2 to confirm the schema — and times out if node2 stays down.
  • The last-stopped disc node is down or unrecoverable. Hardware failure, a deleted VM, or a corrupted data directory on the authoritative node leaves survivors waiting forever.
  • Network connectivity or DNS issues between nodes, so the booting node cannot reach a peer that is actually running (blocked port 4369/epmd, 25672 for inter-node traffic, or unresolvable hostnames).
  • Erlang cookie mismatch. If /var/lib/rabbitmq/.erlang.cookie differs between nodes, they cannot authenticate to each other, so tables never sync.
  • Hostname changes. RabbitMQ node identity is tied to its hostname (rabbit@node1). Renaming the host orphans the old node name in the cluster schema.
  • A simultaneous power loss where no node recorded a clean shutdown, leaving ambiguity about which node is most current.

How to Reproduce the Error

You can recreate this safely in a lab to understand the boot-order dependency:

  1. Build a two-node cluster: rabbit@node1 and rabbit@node2, both as disc nodes.
  2. Declare a durable queue so there is schema to sync.
  3. Stop rabbit@node1 first, then stop rabbit@node2. Now node2 is the last-stopped (authoritative) node.
  4. Start rabbit@node1 first while node2 stays offline.

node1 will log Waiting for Mnesia tables, retry until exhausted, and finally print BOOT FAILED with timeout_waiting_for_tables,[rabbit@node2]. It is waiting for node2 because node2 held the newest schema view.

Diagnostic Commands

Start by confirming what each node thinks the cluster looks like and whether peers are reachable. These are all read-only.

# Cluster membership and which nodes are running (run on any reachable node)
rabbitmqctl cluster_status

# Lightweight health and Erlang distribution checks
rabbitmq-diagnostics ping
rabbitmq-diagnostics status
rabbitmq-diagnostics check_running

# Full broker status including listeners and partitions
rabbitmqctl status

# Tail the boot log for the timeout error and the node list it is waiting on
grep -i "timeout_waiting_for_tables\|Waiting for Mnesia\|BOOT FAILED" \
  /var/log/rabbitmq/rabbit@$(hostname -s).log

# systemd-level view of the failed start
journalctl -u rabbitmq-server --no-pager -n 100

# Confirm epmd (4369) and inter-node port (25672) are listening / reachable
ss -ltnp | grep -E ':4369|:25672'

Healthy cluster_status output on a working two-node cluster looks like this:

Cluster status of node rabbit@node1 ...
Basics

Cluster name: rabbit@node1

Disk Nodes

rabbit@node1
rabbit@node2

Running Nodes

rabbit@node1
rabbit@node2

Versions

rabbit@node1: RabbitMQ 3.13.2 on Erlang 26.2.5
rabbit@node2: RabbitMQ 3.13.2 on Erlang 26.2.5

Maintenance status

Node: rabbit@node1, status: not under maintenance
Node: rabbit@node2, status: not under maintenance

If node2 has failed, you would instead see it listed under Disk Nodes but absent from Running Nodes, confirming the survivor is waiting on an offline peer.
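That comparison can be automated. The sketch below assumes the plain-text layout shown above (section headings on their own lines, one node name per line); other RabbitMQ versions may format the output differently. The helper name offline_members is hypothetical.

```shell
# Hypothetical helper: read plain-text `rabbitmqctl cluster_status` output on
# stdin and print members listed under "Disk Nodes" but not "Running Nodes".
offline_members() {
  awk '
    /^Disk Nodes$/    { section = "disk"; next }
    /^Running Nodes$/ { section = "running"; next }
    /^[ \t]*$/        { next }                      # skip blank lines
    !/^[^ \t:]+@[^ \t:]+$/ { section = ""; next }   # any other text ends a section
    section == "disk"    { members[++n] = $0 }
    section == "running" { running[$0] = 1 }
    END {
      for (i = 1; i <= n; i++)
        if (!(members[i] in running)) print members[i]
    }
  '
}

# Usage: rabbitmqctl cluster_status | offline_members
```

An empty result means every disc node is also running.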

Step-by-Step Resolution

1. Identify the last-stopped node. Check shutdown timestamps in each node’s log. The node that logged its Stopping RabbitMQ / shutdown message last is authoritative. Search the logs:

grep -i "Stopping RabbitMQ\|stopped\|Successfully stopped" \
  /var/log/rabbitmq/rabbit@$(hostname -s).log
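Once you have the last shutdown timestamp from each node, picking the winner is a sort. A minimal sketch, assuming node clocks are synchronised (for example via NTP) and that you have collected one line per node in the form "node date time"; last_stopped_node is a hypothetical helper name.

```shell
# Hypothetical helper: read "<node> <date> <time>" lines on stdin and print
# the node with the latest timestamp. Timestamps in the log format shown in
# this guide (YYYY-MM-DD HH:MM:SS.mmm) sort correctly as plain text.
last_stopped_node() {
  LC_ALL=C sort -t ' ' -k2,3 | tail -n 1 | cut -d ' ' -f 1
}

# Example input, gathered by hand from each node's log:
#   rabbit@node1 2026-06-24 08:59:10.120
#   rabbit@node2 2026-06-24 09:01:44.007
```

If clocks are not trustworthy across hosts, compare the logs by hand instead of relying on the sort.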

2. Start nodes in the correct order. Stop the failing node, then start the last-stopped node first. Once it is fully up, start the remaining nodes. They will find the authoritative schema and sync immediately. In most cases this alone resolves the timeout — no data changes required.

3. If the last-stopped node is recoverable but slow, simply give it more time or bring it online before the others. You can verify it is up with rabbitmq-diagnostics check_running before starting peers.
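Steps 2 and 3 can be scripted as a simple poll: start the last-stopped node, wait until it reports as running, then start the others. The sketch below is one way to do the waiting. The function name, the retry defaults, and the RABBIT_DIAG override (there only so the command can be swapped out) are assumptions for illustration.

```shell
# Hypothetical helper: poll until the given node passes check_running.
# $1 = node name, $2 = attempts (default 30), $3 = seconds between attempts
# (default 10). Returns 0 once the node is up, 1 if it never comes up.
wait_until_running() {
  local node=$1 tries=${2:-30} delay=${3:-10}
  local diag=${RABBIT_DIAG:-rabbitmq-diagnostics}
  local i=1
  while [ "$i" -le "$tries" ]; do
    if "$diag" -n "$node" check_running >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage, on a peer, before starting it:
#   wait_until_running rabbit@node2 && systemctl start rabbitmq-server
```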

4. If the last-stopped node is permanently gone (dead hardware, deleted VM), the survivors will wait forever for a node that will never return. In that case you must tell the cluster to forget the dead node so the survivors stop waiting on it and continue from their own copy of the schema. This is a deliberate recovery action: run it from a running, reachable node against the dead node name:

# RECOVERY ACTION (run intentionally, from a healthy node, targeting the DEAD node):
# Removes the unrecoverable node so survivors can boot without waiting on it.
rabbitmqctl forget_cluster_node rabbit@node2

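If no node is currently running (every node is stuck in the timeout), forget the dead node in offline mode instead: run the same command with the --offline flag on one of the stopped survivors while the rest of the cluster is down. Use offline mode only in this situation, because removing a node while no cluster member is running can leave the remaining members inconsistent. Always target only the node you have confirmed is unrecoverable.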
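Because forget_cluster_node is destructive, it can help to build the command through a small guard and review it before running it. The sketch below only prints the command. The function name is hypothetical, and it assumes the local node is called rabbit@ plus the short hostname unless RABBITMQ_NODENAME is set in the shell.

```shell
# Hypothetical guard: print (never run) the forget_cluster_node command.
# $1 = dead node name, $2 = "online" (default) or "offline".
forget_dead_node_cmd() {
  local dead=$1 mode=${2:-online}
  # Assumption: local node name is rabbit@<short hostname> unless overridden.
  local self="${RABBITMQ_NODENAME:-rabbit@$(hostname -s)}"
  if [ -z "$dead" ] || [ "$dead" = "$self" ]; then
    echo "refusing: target must be a remote, confirmed-dead node" >&2
    return 1
  fi
  if [ "$mode" = "offline" ]; then
    echo "rabbitmqctl forget_cluster_node --offline $dead"
  else
    echo "rabbitmqctl forget_cluster_node $dead"
  fi
}

# Usage: forget_dead_node_cmd rabbit@node2 offline
```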

5. Rule out connectivity and authentication. If the peer is actually up but unreachable, confirm the Erlang cookie matches across nodes, hostnames resolve, and ports 4369 and 25672 are open. Use rabbitmq-diagnostics -n rabbit@node2 ping from the booting host to test the Erlang distribution path.
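One way to compare cookies without copying the secret between hosts is to compare a hash of it. A minimal sketch, assuming a Linux host with sha256sum and the cookie path mentioned earlier in this guide; cookie_fingerprint is a hypothetical helper name.

```shell
# Hypothetical helper: print a SHA-256 fingerprint of the Erlang cookie.
# $1 = cookie path (defaults to the path used in this guide).
cookie_fingerprint() {
  sha256sum "${1:-/var/lib/rabbitmq/.erlang.cookie}" | cut -d ' ' -f 1
}

# Run on every node as a user that can read the cookie, then compare:
#   cookie_fingerprint
# Check that a peer's hostname resolves from the booting host:
#   getent hosts node2
```

Identical fingerprints on every node rule out a cookie mismatch; any difference means the nodes cannot authenticate to each other.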

6. Verify recovery. Once the cluster is back, confirm membership and that all nodes are running:

rabbitmqctl cluster_status
rabbitmq-diagnostics check_running

For production incidents where you want a guided runbook and post-mortem timeline, our incident response workspace can capture the boot-order sequence and remediation steps automatically.

Prevention and Best Practices

  • Always shut down and start up in a known order. Document which node is your primary disc node and start it first. Scripts that stop nodes in reverse start-order make recovery predictable.
  • Avoid full simultaneous shutdowns. Rolling restarts keep at least one up-to-date node online, so there is never any doubt about which copy of the schema is current.
  • Use three or more disc nodes rather than two, so the loss of one node never blocks the survivors on a single point of failure.
  • Tune the wait if appropriate. You can raise the table-wait budget with the mnesia_table_loading_retry_timeout (milliseconds per attempt) and mnesia_table_loading_retry_limit (number of attempts) settings in rabbitmq.conf to tolerate slower peers, but this is a band-aid, not a fix for a dead node.
  • Pin hostnames. Use stable DNS names or /etc/hosts entries so node identity never drifts after a reboot.
  • Back up the Mnesia directory and definitions (rabbitmqctl export_definitions) regularly so you can rebuild schema if a disc node is lost.
  • Monitor cluster health continuously so you catch a node that stopped without a clean shutdown before the next restart turns it into a boot failure.

Related Errors

  • Mnesia network partition / partition detected: when nodes lose contact mid-run and each side keeps writing, you get a split-brain partition reported in rabbitmqctl cluster_status under a Network Partitions section. This is the running-cluster cousin of the boot-time timeout.
  • inconsistent_database: Mnesia detects that two nodes have conflicting schema versions, often after a partition or an out-of-order restart. It surfaces as running_partitioned_network or inconsistent_database in the logs and usually requires choosing a winning node.
  • Node rabbit@nodeX not running / nodedown: clustering commands fail because the target node is unreachable, frequently caused by the same epmd, cookie, or hostname problems behind a timeout_waiting_for_tables.

Frequently Asked Questions

Why does RabbitMQ care which node started first? Because Mnesia has no central coordinator. The last node to stop holds the most recent schema, and every other node treats it as the source of truth on boot. Starting an older node first means it waits for the newer one to confirm state — and times out if that node is absent.

Can I just delete the Mnesia data to make the node start? No. Deleting the data directory destroys durable queues, exchanges, users, and vhosts, and on a clustered node can corrupt the shared schema. The supported path is correct boot order, or forget_cluster_node for a confirmed-dead peer.

My last-stopped node is dead. How do I recover the rest? Run rabbitmqctl forget_cluster_node rabbit@<deadnode> from a healthy node (or with --offline on a stopped survivor if the whole cluster is down). The survivors then stop waiting on the dead node and continue from their own copy of the schema.

How long does RabbitMQ wait before failing? By default each retry waits 30 seconds with 10 retries, roughly 5 minutes total, before printing BOOT FAILED. You can extend this with mnesia_table_loading_retry_timeout and mnesia_table_loading_retry_limit, but extending it will not help if the peer is permanently gone.
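For reference, raising the budget comes down to two lines of configuration. The sketch below prints them for a chosen timeout and retry count. The helper name is hypothetical, the values are examples rather than recommendations, and it assumes the settings go in rabbitmq.conf.

```shell
# Hypothetical helper: print rabbitmq.conf lines for a custom wait budget.
# $1 = per-attempt timeout in ms, $2 = number of attempts.
render_table_wait_conf() {
  printf 'mnesia_table_loading_retry_timeout = %s\n' "$1"
  printf 'mnesia_table_loading_retry_limit = %s\n' "$2"
}

# 60 s per attempt, 15 attempts: about 15 minutes in total.
render_table_wait_conf 60000 15
```

Review the printed lines, add them to the node's configuration file, and restart the node for them to take effect.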

Does this affect quorum queues the same way? Quorum queues use the Raft protocol and tolerate node loss better at the queue level, but the cluster’s Mnesia metadata schema still uses the boot-order rules described here. The startup timeout is about cluster schema, not individual queue replication.
