RabbitMQ Error Guide: 'Mnesia network partition' Cluster Split Brain

Overview

A Mnesia network partition — “split brain” — happens when nodes in a RabbitMQ cluster lose Erlang distribution connectivity to each other but each keeps running. Each side believes the other is down, so both continue serving. When connectivity returns, Mnesia (the embedded database holding cluster metadata) detects that the two sides diverged and reports a partition. Classic mirrored queues on either side may have accepted conflicting writes, which is why a partition is a data-integrity event, not just a connectivity blip.

You will see it in the broker log:

=ERROR REPORT==== Mnesia('rabbit@mq-02'): Mnesia(rabbit@mq-02): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, 'rabbit@mq-01'}

And surfaced in node status:

Network Partitions

 * Node rabbit@mq-01 cannot communicate with rabbit@mq-02

The trigger is almost always a network event (a flap, an overloaded host pausing the VM, a too-aggressive net_ticktime, or a long GC pause) that breaks Erlang distribution between nodes long enough for Mnesia to declare them partitioned.

Symptoms

rabbitmqctl cluster_status lists entries under “Network Partitions”; the management UI shows a matching partition warning.
Broker log shows running_partitioned_network / inconsistent_database.
Queues appear on some nodes but not others; clients see inconsistent state depending on which node they hit.

rabbitmqctl cluster_status

Cluster status of node rabbit@mq-01 ...
Disk Nodes
rabbit@mq-01
rabbit@mq-02
rabbit@mq-03

Running Nodes
rabbit@mq-01
rabbit@mq-03

Network Partitions
 * rabbit@mq-02 cannot communicate with rabbit@mq-01
 * rabbit@mq-02 cannot communicate with rabbit@mq-03

mq-02 is partitioned away from the other two — a 2-vs-1 split.
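To pull the split topology out of that output in a script, a minimal sketch follows. It assumes the plain-text layout shown above (a “Network Partitions” heading followed by “cannot communicate with” lines); list_partitions is a hypothetical helper name, not part of rabbitmqctl. If your release supports --formatter json, that output is likely sturdier to parse.

```shell
# Print one "<node> <unreachable-peer>" pair per partition line found in
# `rabbitmqctl cluster_status` text output supplied on stdin.
# Assumes the "Network Partitions" layout shown in this guide.
list_partitions() {
  awk '
    /^Network Partitions/ { inside = 1; next }
    inside && /cannot communicate with/ {
      sub(/^[ *]*(Node )?/, "")   # strip the bullet and optional "Node " prefix
      print $1, $NF
      next
    }
    inside && /^[A-Za-z]/ { inside = 0 }   # a new section heading ends the block
  '
}
```

Run it as rabbitmqctl cluster_status | list_partitions. Empty output means no partition lines were found.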

Common Root Causes

1. A transient network flap broke Erlang distribution

A brief loss of connectivity exceeds net_ticktime and Mnesia declares a partition.

rabbitmqctl eval 'net_kernel:get_net_ticktime().'

With the default 60-second net_ticktime, a peer that stays silent for about a minute is dropped from Erlang distribution. By the time the underlying network recovers, Mnesia has already recorded the partition.
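The detection time is a window rather than a single value. The Erlang kernel documentation gives it as net_ticktime plus or minus net_ticktime divided by the tick intensity. The sketch below works that arithmetic out, assuming the default tick intensity of 4; tick_window is a hypothetical helper name.

```shell
# Print the earliest and latest time (in seconds) at which a silent peer is
# declared down, for a given net_ticktime.
# Assumes the default Erlang net_tickintensity of 4, i.e. ticktime +/- 25%.
tick_window() {
  ticktime=$1
  quarter=$(( ticktime / 4 ))
  echo "$(( ticktime - quarter )) $(( ticktime + quarter ))"
}

tick_window 60   # prints: 45 75
```

So with the default setting, an outage shorter than about 45 seconds should not sever distribution, while one longer than about 75 seconds will.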

2. A node was paused or GC-stalled long enough to miss ticks

An overloaded host (CPU starvation, a long stop-the-world GC, or a paused VM) stops responding to Erlang ticks and looks “gone” to peers.

sudo journalctl -u rabbitmq-server --since '-30min' | grep -i 'tick\|partition\|heartbeat' | tail

=WARNING REPORT==== Mnesia: missed too many ticks from rabbit@mq-02, considering it down

The node was alive but unresponsive — a classic “false partition” from resource starvation.

3. Partition handling set to ignore

With cluster_partition_handling = ignore, RabbitMQ lets both sides run and never auto-heals, leaving you with a manual split brain.

rabbitmqctl environment | grep -A2 'cluster_partition_handling'

{cluster_partition_handling,ignore},

With ignore, no node is paused or restarted to preserve consistency, so you must resolve every partition by hand.

4. autoheal could not pick a winner (or chose the smaller side)

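autoheal waits for the partition to end, picks a winning partition, and restarts every node outside it. The winner is the side with the most client connections (ties go to the side with the most nodes), so it can be the smaller side. If nodes don’t all return, it stalls.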

rabbitmqctl environment | grep -A2 'cluster_partition_handling'

{cluster_partition_handling,autoheal},

Autoheal needs every node reachable to decide a winner; a still-down node blocks healing until it returns.
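The RabbitMQ partitions guide describes the autoheal winner as the partition with the most connected clients, with the node count as the tie-breaker. The sketch below applies that rule to hand-collected numbers to show how a smaller side can win. The input format and the autoheal_winner name are assumptions for illustration; RabbitMQ makes this decision internally.

```shell
# Read "<partition> <client_connections> <node_count>" lines on stdin and
# print the partition that would win under the documented autoheal rule:
# most client connections first, then most nodes.
# A full tie is unspecified in the docs; here sort's fallback order decides.
autoheal_winner() {
  sort -k2,2nr -k3,3nr | head -n 1 | awk '{ print $1 }'
}
```

For example, printf 'majority 3 2\nminority 40 1\n' | autoheal_winner prints minority: the single node carrying 40 client connections beats the two-node side carrying 3.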

5. pause_minority sacrificed the minority side

With pause_minority, nodes that find themselves in the minority pause to preserve consistency — correct behavior, but it manifests as a node that “won’t accept connections”.

rabbitmqctl cluster_status | grep -A5 'Running Nodes'

Running Nodes
rabbit@mq-01
rabbit@mq-03

mq-02 is paused (not running) because it was in the minority. It will resume when it can rejoin the majority — this is the intended outcome, not a failure.
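The RabbitMQ documentation defines a minority as half of the cluster or fewer, which is why an even split pauses both sides. A minimal sketch of that rule follows; would_pause is a hypothetical helper name, and the counts are supplied by you, not read from the broker.

```shell
# Succeeds (exit 0) when a node that can see `visible` nodes, itself
# included, out of `total` cluster members would pause under pause_minority.
would_pause() {
  total=$1
  visible=$2
  [ $(( visible * 2 )) -le "$total" ]
}

would_pause 3 1 && echo "mq-02 alone in a 3-node cluster: pauses"
would_pause 3 2 || echo "mq-01 + mq-03 in a 3-node cluster: keep serving"
would_pause 2 1 && echo "either node of a 2-node cluster: pauses"
```

The last case shows why pause_minority is a poor fit for two-node clusters: neither side holds more than half, so both pause.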

6. Asymmetric / partial connectivity between specific nodes

A firewall or routing change blocks the Erlang distribution port (25672 by default) between two nodes while others stay connected.

sudo ss -ltnp | grep 25672
nc -vz mq-02 25672

LISTEN 0 128 [::]:25672 [::]:*
nc: connect to mq-02 port 25672 (tcp) failed: Connection timed out

The distribution port is unreachable between these two nodes specifically — a partial partition that won’t heal until the path is restored.

Diagnostic Workflow

Step 1: Confirm the partition and which nodes split

rabbitmqctl cluster_status

The “Network Partitions” and “Running Nodes” sections show the split topology — note which side holds the majority and which holds your most authoritative data.

Step 2: Determine the configured partition handling strategy

rabbitmqctl environment | grep -A2 'cluster_partition_handling'

ignore = you resolve manually. pause_minority = the minority paused itself (often already self-correcting). autoheal = it will pick a winner once all nodes return.
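When this check is part of a runbook script, the strategy can be extracted and mapped to the next action. The sketch below assumes the Erlang-term output format shown in this guide; partition_strategy and next_action are hypothetical helper names, and the action texts simply restate the mapping above.

```shell
# Extract the strategy atom from `rabbitmqctl environment` output on stdin.
partition_strategy() {
  sed -n 's/.*{cluster_partition_handling,[[:space:]]*\([a-z_]*\).*/\1/p' | head -n 1
}

# Map a strategy to the operator action described in this guide.
next_action() {
  case "$1" in
    ignore)         echo "manual: pick a winner and restart the losing nodes" ;;
    pause_minority) echo "wait: the minority side resumes once it can reach the majority" ;;
    autoheal)       echo "wait: healing starts after all nodes reconnect" ;;
    *)              echo "unknown strategy: $1" ;;
  esac
}
```

Typical use would be next_action "$(rabbitmqctl environment | partition_strategy)".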

Step 3: Verify Erlang distribution connectivity between nodes

nc -vz <OTHER_NODE_HOST> 25672      # Erlang distribution
nc -vz <OTHER_NODE_HOST> 4369       # epmd
rabbitmqctl eval 'net_adm:ping('"'"'rabbit@mq-02'"'"').'

pang (not pong) confirms the nodes still can’t reach each other — fix the network path before attempting to heal.
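On clusters with more than two nodes, the port checks are easier to read as a loop. This is a sketch under assumptions: it uses nc with the -z and -w flags (supported by common netcat variants, but check yours), it uses the default ports from this guide, and check_distribution and probe are hypothetical helper names.

```shell
# Probe one TCP port; swap this out if nc is unavailable on your hosts.
probe() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# For every host given, report whether epmd (4369) and the Erlang
# distribution port (25672, the default) are reachable from this machine.
check_distribution() {
  status=0
  for host in "$@"; do
    for port in 4369 25672; do
      if probe "$host" "$port"; then
        echo "$host:$port reachable"
      else
        echo "$host:$port UNREACHABLE"
        status=1
      fi
    done
  done
  return "$status"
}
```

Run check_distribution mq-02 mq-03 from each node in turn, because a partial partition may only be visible from one side.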

Step 4: Choose the authoritative side and restart the loser

Pick the partition with the correct/most-recent data as the winner, then restart the losing nodes so they rejoin and re-sync from the winner.

# On a LOSING node:
rabbitmqctl stop_app
rabbitmqctl start_app

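Restarting the loser’s RabbitMQ app forces it to rejoin the cluster and discard its divergent state in favor of the winner’s, so any changes made only on the losing side are lost. If the partition warning persists afterwards, restart the nodes on the winning side as well; the RabbitMQ partitions guide recommends this to clear it.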

Step 5: Confirm the partition cleared and queues are consistent

rabbitmqctl cluster_status | grep -A3 'Network Partitions'
rabbitmqctl list_queues name node messages | sort

An empty “Network Partitions” section and consistent queue placement/counts mean the cluster has healed.
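One way to check “consistent queue placement” is to save the queue listing as seen from two nodes and compare them. The sketch below assumes the listings were captured with rabbitmqctl -q so that no header lines are included; same_queue_view is a hypothetical helper name. On a busy broker the message counts can differ between two captures, so comparing only the name column may be more practical.

```shell
# Compare two saved `rabbitmqctl -q list_queues ...` outputs, ignoring row
# order. Succeeds when both files contain the same rows.
same_queue_view() {
  left=$(sort "$1")
  right=$(sort "$2")
  [ "$left" = "$right" ]
}

# Example capture, run against two different nodes:
#   rabbitmqctl -q -n rabbit@mq-01 list_queues name > /tmp/mq-01.txt
#   rabbitmqctl -q -n rabbit@mq-02 list_queues name > /tmp/mq-02.txt
#   same_queue_view /tmp/mq-01.txt /tmp/mq-02.txt && echo "views match"
```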

Example Root Cause Analysis

Monitoring fires on mq-02 showing “no contact”. cluster_status from mq-01 reports a partition with mq-02 split from mq-01 and mq-03. Clients hitting mq-02 see stale queue state.

Checking the strategy:

rabbitmqctl environment | grep -A2 'cluster_partition_handling'

{cluster_partition_handling,ignore},

With ignore, nothing auto-healed — both sides kept running. Testing distribution connectivity from mq-01:

nc -vz mq-02 25672

Connection to mq-02 25672 port [tcp/*] succeeded!

The network path is now fine; this was a transient flap during a host migration that severed distribution long enough to partition, and ignore left it that way. mq-01/mq-03 form the majority and hold the authoritative state, so mq-02 is the loser. Recovery on mq-02:

rabbitmqctl stop_app
rabbitmqctl start_app
rabbitmqctl cluster_status | grep -A2 'Network Partitions'

Network Partitions

(none)

mq-02 rejoins, re-syncs metadata from the majority, and the partition clears. The durable fix was switching cluster_partition_handling to pause_minority so a future flap pauses the minority automatically instead of leaving a manual split brain.
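The strategy change itself is a single rabbitmq.conf key, cluster_partition_handling. The sketch below sets it idempotently; the helper name set_partition_handling is hypothetical, the config path depends on your installation (/etc/rabbitmq/rabbitmq.conf is a common location, assumed here), and the new value only takes effect after each node is restarted.

```shell
# Set cluster_partition_handling in a rabbitmq.conf-style file, replacing an
# existing entry or appending one. Only the three strategies covered in this
# guide are accepted. sed leaves a "<file>.bak" backup when it rewrites.
set_partition_handling() {
  conf_file=$1
  strategy=$2
  case "$strategy" in
    ignore|pause_minority|autoheal) ;;
    *) echo "unsupported strategy: $strategy" >&2; return 1 ;;
  esac
  if grep -q '^[[:space:]]*cluster_partition_handling[[:space:]]*=' "$conf_file" 2>/dev/null; then
    sed -i.bak "s/^[[:space:]]*cluster_partition_handling[[:space:]]*=.*/cluster_partition_handling = $strategy/" "$conf_file"
  else
    printf 'cluster_partition_handling = %s\n' "$strategy" >> "$conf_file"
  fi
}

# Example (path is an assumption):
#   set_partition_handling /etc/rabbitmq/rabbitmq.conf pause_minority
```

Apply the same setting on every node so the cluster does not run with mixed strategies.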

Prevention Best Practices

Choose a partition handling strategy deliberately: pause_minority for consistency-first clusters, or autoheal for availability-first. Never leave production on ignore.
Run an odd node count (3, 5) so pause_minority always has a clear majority to keep serving.
Use quorum queues instead of classic mirrored queues; their Raft-based replication tolerates partitions far more safely than mirrored queue sync.
Keep nodes on a low-latency, reliable network and avoid co-locating brokers with noisy neighbors that can pause the VM and trigger false partitions.
Tune net_ticktime to match your network’s real reliability: the default is 60 seconds, and lowering it speeds up failure detection at the cost of more false partitions.
Alert directly on the “Network Partitions” section of cluster_status so a split brain is caught immediately.
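The alerting item above can be wired into most monitoring systems with a small check. The sketch below follows the common Nagios exit-code convention (0 for OK, 2 for CRITICAL) and assumes the “cannot communicate with” wording shown in this guide; check_partitions is a hypothetical helper name.

```shell
# Nagios-style check: read `rabbitmqctl cluster_status` output on stdin,
# print a one-line summary, and return 2 (CRITICAL) if any partition is listed.
check_partitions() {
  count=$(grep -c 'cannot communicate with' || true)
  if [ "$count" -gt 0 ]; then
    echo "CRITICAL: $count partition entries reported"
    return 2
  fi
  echo "OK: no network partitions"
}
```

Invoke it as rabbitmqctl cluster_status | check_partitions from the monitoring agent on each node, since each side of a split reports its own view.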

Quick Command Reference

# Detect the partition and topology
rabbitmqctl cluster_status

# Current partition handling strategy
rabbitmqctl environment | grep -A2 'cluster_partition_handling'

# Erlang distribution / epmd reachability between nodes
nc -vz <NODE_HOST> 25672
nc -vz <NODE_HOST> 4369
rabbitmqctl eval 'net_adm:ping('"'"'rabbit@<NODE>'"'"').'

# Recover a losing node (rejoin + re-sync)
rabbitmqctl stop_app
rabbitmqctl start_app

# Confirm healed
rabbitmqctl cluster_status | grep -A3 'Network Partitions'

Conclusion

A Mnesia network partition is a split brain: nodes lost Erlang distribution, kept running, and diverged. The usual root causes:

A transient network flap exceeding net_ticktime.
A node paused or GC-stalled long enough to miss ticks (false partition).
cluster_partition_handling = ignore, which never auto-heals.
autoheal stalled because not all nodes returned.
pause_minority correctly pausing the minority side.
A blocked distribution port (25672) between specific nodes.

Confirm the split with cluster_status, restore the network path, choose the authoritative majority as the winner, restart the losing nodes to rejoin, and then move off ignore to a real partition handling strategy so the next flap heals itself.
