RabbitMQ Error Guide: 'Mnesia is overloaded' Metadata Churn Warning
Fix RabbitMQ 'Mnesia is overloaded' dump_log write_threshold warnings caused by queue churn, exclusive/auto-delete storms, binding churn and slow disk.
Exact Error Message
When RabbitMQ’s metadata store cannot keep up with the rate of changes you are throwing at it, the Erlang database behind it (Mnesia) starts emitting overload warnings to the log. They look like this:
=WARNING REPORT==== 24-Jun-2026::14:02:11.337201 ===
Mnesia('rabbit@node1'): ** WARNING ** Mnesia is overloaded: {dump_log, write_threshold}
=WARNING REPORT==== 24-Jun-2026::14:02:11.901744 ===
Mnesia('rabbit@node1'): ** WARNING ** Mnesia is overloaded: {dump_log, write_threshold}
=WARNING REPORT==== 24-Jun-2026::14:02:12.448901 ===
Mnesia('rabbit@node1'): ** WARNING ** Mnesia is overloaded: {dump_log, write_threshold}
=WARNING REPORT==== 24-Jun-2026::14:02:13.005112 ===
Mnesia('rabbit@node1'): ** WARNING ** Mnesia is overloaded: {dump_log, time_threshold}
You will usually see these warnings repeat many times per second in bursts. The two variants you should recognize are {dump_log, write_threshold} (the write threshold was reached again while the previous dump was still running) and {dump_log, time_threshold} (the dump timer fired while the previous dump was still running). Both point at the same underlying problem: Mnesia’s transaction log is growing faster than it can be dumped into the table files on disk.
What the Error Means
RabbitMQ stores all of its metadata (queues, exchanges, bindings, vhosts, users, and policies) in Mnesia, Erlang’s built-in distributed database. Mnesia does not write every change straight into its main table files. Instead, it appends each committed transaction to a transaction log on disk, then periodically dumps that log into the on-disk tables.
Two thresholds govern dumping. The dump_log_write_threshold is a count of accumulated transactions (default 1000); once that many commits pile up, a dump is forced. The dump_log_time_threshold is a time interval (default 180000 ms, or 3 minutes); a dump is forced when it elapses. When commits arrive faster than the dump can complete, Mnesia raises the overload warning to tell you it is falling behind.
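To make the write threshold concrete, here is a rough back-of-the-envelope sketch. It assumes each metadata commit counts as one log write, which is a simplification, and the function name dump_interval_ms is hypothetical. It estimates how often the write threshold would force a dump at a given commit rate.

```shell
# Estimate the time between forced dumps, in milliseconds.
# Assumption: one metadata commit is roughly one log write.
# $1 = log writes per second, $2 = dump_log_write_threshold (defaults to 1000)
dump_interval_ms() {
  writes_per_sec="$1"
  threshold="${2:-1000}"
  if [ "$writes_per_sec" -le 0 ]; then
    echo "writes_per_sec must be positive" >&2
    return 1
  fi
  echo $(( threshold * 1000 / writes_per_sec ))
}

# Example: 2,000 short-lived clients per second, each costing a declare and a delete.
dump_interval_ms 4000 1000   # prints 250
```

If a single dump takes longer than the printed interval on your disk, the next dump is requested while the previous one is still running, which is the condition the warning reports.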
Critically, this is a load problem, not a network partition. A partition ({running_partitioned_network, ...} or Mnesia(...): ** ERROR ** mnesia_event got {inconsistent_database, ...}) means cluster nodes disagree about state. The overload warning means a single node is being asked to commit metadata transactions faster than its disk can persist them. The warnings are not fatal on their own, but they signal that metadata operations are queuing up, which can cascade into slow connection setup, channel timeouts, and sluggish management UI.
Common Causes
Almost every metadata operation is a Mnesia transaction. High churn on metadata is what drives the log to overflow:
- Exclusive and auto-delete queue churn. Clients that declare a fresh exclusive or auto-delete queue on every connection or request, then disconnect, force a declare transaction and a delete transaction each cycle. Each cycle costs at least two commits, so a few thousand short-lived clients per second produce many thousands of Mnesia commits per second.
- Transient queue storms. RPC-style patterns (reply-to queues, temporary topic subscriptions) often create and destroy queues at request rate.
- Binding churn. Creating and tearing down bindings between exchanges and queues is also transactional. Fan-out topologies that rebind on every consumer reconnect hammer Mnesia.
- Connection/channel storms. Mass reconnects after a deploy or network blip mean every client redeclares its topology at once.
- Slow or contended disk. Even moderate churn overloads Mnesia if the data directory is on a slow EBS volume, a network filesystem, or a disk already saturated by message persistence. Slow disk does not cause the churn, but it lowers the threshold at which churn becomes overload.
How to Reproduce the Error
You can reproduce this safely on a non-production node with PerfTest (rabbitmq-perf-test) or a tiny script that rapidly declares and deletes auto-delete queues. Conceptually:
# Spin up many short-lived auto-delete queues in a tight loop.
# Each iteration = one declare + one delete transaction in Mnesia.
for i in $(seq 1 50000); do
  rabbitmqadmin declare queue name=churn-$i auto_delete=true durable=false >/dev/null
  rabbitmqadmin delete queue name=churn-$i >/dev/null
done
Run this against a node whose data directory sits on a slow disk and tail the log. Within seconds you will see Mnesia is overloaded: {dump_log, write_threshold} start scrolling. This demonstrates the mechanism: it is the rate of metadata transactions, not the steady-state queue count, that triggers the warning.
Diagnostic Commands
Start by confirming the warnings and their frequency, then look for the churn source. All commands below are read-only.
Count how often the warning fires in the log:
grep -c "Mnesia is overloaded" /var/log/rabbitmq/rabbit@node1.log
grep "Mnesia is overloaded" /var/log/rabbitmq/rabbit@node1.log | tail -20
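A raw count hides whether the warnings arrive in bursts. The sketch below groups them per minute and per variant. It assumes the log uses the report format shown in the Exact Error Message section, with the timestamp on the header line above each warning. Newer RabbitMQ releases log in a different layout, so treat the parsing as an assumption to adjust. The function name overload_per_minute is hypothetical.

```shell
# Summarize overload warnings per minute, split by variant.
# Usage: overload_per_minute < /var/log/rabbitmq/rabbit@node1.log
overload_per_minute() {
  awk '
    # Remember the timestamp of each report header, truncated to the minute.
    /^=WARNING REPORT====/ {
      ts = $3
      sub(/:[0-9.]+$/, "", ts)
      next
    }
    # Attribute each overload line to the most recent header.
    index($0, "Mnesia is overloaded") && ts != "" {
      if (!(ts in seen)) { seen[ts] = 1; order[++n] = ts }
      if (index($0, "write_threshold")) writes[ts]++
      if (index($0, "time_threshold")) times[ts]++
    }
    END {
      for (i = 1; i <= n; i++)
        printf "%s write_threshold=%d time_threshold=%d\n", order[i], writes[order[i]], times[order[i]]
    }
  '
}
```

Minutes are printed in the order they first appear in the log, so a burst right after a deploy stands out next to quiet minutes.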
Get a snapshot of node health and the disk path Mnesia uses:
rabbitmq-diagnostics status
rabbitmq-diagnostics status | grep -i "data directory"
Look for queues that are exclusive or auto-delete (the usual churn culprits):
rabbitmqctl list_queues name auto_delete exclusive durable
Example output:
Listing queues for vhost / ...
name auto_delete exclusive durable
amq.gen-Hf2kQ9pL... true true false
amq.gen-Z8wQ1aXr... true true false
orders.durable false false true
amq.gen-7yTpLm0v... true true false
A long run of amq.gen-* names that are true true false indicates server-generated, exclusive, transient queues — a churn signature. Count total queues and bindings to gauge metadata volume:
rabbitmqctl list_queues | wc -l
rabbitmqctl list_bindings | wc -l
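Scanning a long listing by eye is error-prone, so the churn signature can be counted instead. This is a minimal sketch that reads the four-column list_queues output shown above on standard input. It assumes queue names contain no whitespace, and the function name churn_signature is hypothetical.

```shell
# Count server-generated, exclusive, auto-delete queues against all queues.
# Usage: rabbitmqctl list_queues name auto_delete exclusive durable | churn_signature
churn_signature() {
  awk '
    # Data rows have four columns with true/false flags; banner and header lines do not.
    NF == 4 && ($2 == "true" || $2 == "false") {
      total++
      if ($1 ~ /^amq\.gen-/ && $2 == "true" && $3 == "true") churn++
    }
    END { printf "churn_candidates=%d total=%d\n", churn, total }
  '
}
```

Run it a few times a minute apart. A high churn_candidates share whose queue names keep changing between runs points at clients that declare and drop queues continuously.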
Check connection volume, which often correlates with churn. These commands give a snapshot, so run the count repeatedly to estimate the rate of new connections:
rabbitmqctl list_connections name peer_host state channels
rabbitmqctl list_connections | wc -l
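To see which clients account for the connection volume, the listing can be grouped by peer host. This is a small sketch that reads the output of rabbitmqctl list_connections peer_host on standard input. The function name connections_per_host is hypothetical.

```shell
# Group connections by client host, busiest first.
# Usage: rabbitmqctl list_connections peer_host | connections_per_host
connections_per_host() {
  awk '
    # Skip the banner, the column header, and blank lines.
    /^Listing / || $1 == "peer_host" || NF == 0 { next }
    { count[$1]++ }
    END { for (host in count) print count[host], host }
  ' | sort -k1,1nr -k2,2
}
```

A single host holding most connections right after a deploy is a good first place to look for redeclare-on-reconnect behavior.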
Finally, confirm whether disk is the bottleneck. Watch I/O on the Mnesia/data volume:
iostat -x 2 5
iotop -o -b -n 3
High %util or large await on the device backing the data directory means the disk cannot flush the dump log fast enough.
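The %util check can be scripted so it does not depend on reading columns by eye. The sketch below locates the %util column from the header line, because its position differs between iostat versions. The 80 percent default is an arbitrary illustrative threshold, the function name busy_devices is hypothetical, and the parsing assumes a locale that prints decimals with a dot.

```shell
# Print devices whose %util meets or exceeds a limit (default 80).
# Usage: iostat -x 2 5 | busy_devices 80
busy_devices() {
  limit="${1:-80}"
  awk -v limit="$limit" '
    # Find the %util column on each "Device" header line.
    $1 ~ /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    # Report device rows at or above the limit.
    col && NF >= col && $col + 0 >= limit { print $1, $col }
  '
}
```

Compare the reported device with the one backing the data directory. A busy but unrelated disk does not explain the warnings.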
Step-by-Step Resolution
- Confirm it is overload, not partition. Run rabbitmq-diagnostics status and check the log for inconsistent_database or running_partitioned_network. If those are absent and you only see dump_log warnings, you have a load problem.
- Identify the churn source. Use the list_queues name auto_delete exclusive durable output. If most queues are amq.gen-* and auto-delete/exclusive, find the application creating them. Compare connection counts before and after a suspected deploy.
- Fix the application pattern. This is the real cure. Replace per-request queue declaration with long-lived, reusable queues. Use direct reply-to (amq.rabbitmq.reply-to) for RPC instead of creating a temporary reply queue per call. Stop rebinding on every reconnect; declare topology once at startup.
- Relieve the disk. Move the RabbitMQ data directory to faster local SSD/NVMe. Ensure message persistence and Mnesia are not contending on the same saturated volume.
- Tune Mnesia dump thresholds (mitigation, not a fix). Raising dump_log_write_threshold lets more transactions batch per dump. Set it in the Erlang VM args, for example via RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS with -mnesia dump_log_write_threshold 5000. This smooths bursts but does not fix runaway churn.
- Add backpressure. Cap the connection/channel rate from misbehaving clients and consider per-vhost limits so one noisy app cannot saturate the cluster’s metadata path.
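As a sketch of the threshold-tuning step, the variable can be exported in the environment that starts the node. The value 5000 is the illustrative figure used above, not a recommendation. Where the variable is set (service unit, environment file, shell profile) depends on your installation and is an assumption here.

```shell
# Pass the Mnesia setting to the Erlang VM that runs RabbitMQ.
# Must be in the environment before the node starts; a restart is required.
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-mnesia dump_log_write_threshold 5000"
export RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS

# After the restart, one way to check the running value (assumes rabbitmqctl can reach the node):
#   rabbitmqctl eval 'application:get_env(mnesia, dump_log_write_threshold).'
```

If the check still shows the old value, the variable was probably set in a shell that is not the one launching the service.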
Prevention and Best Practices
- Treat metadata as long-lived. Declare queues, exchanges, and bindings at startup, not per message or per request.
- Prefer durable, named queues over server-generated exclusive ones for anything beyond genuinely ephemeral RPC.
- Use direct reply-to for request/response instead of temporary queues.
- Put the Mnesia data directory on fast, dedicated storage and monitor disk await/%util.
- Alert on grep -c "Mnesia is overloaded" crossing a baseline, so churn regressions are caught at deploy time.
- For very high topology-change rates, evaluate quorum queues and streams, which change the metadata and storage characteristics of your cluster.
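The baseline alert mentioned in the list above could be wired up roughly as follows. This is a minimal sketch: the function name check_overload_baseline, the output wording, and the exit codes are all hypothetical, and it counts warnings in the whole file rather than in a time window, so it works best on a freshly rotated log.

```shell
# Exit non-zero when the overload warning count exceeds a baseline.
# $1 = log file, $2 = highest acceptable count
check_overload_baseline() {
  log_file="$1"
  baseline="$2"
  if [ ! -r "$log_file" ]; then
    echo "cannot read $log_file" >&2
    return 2
  fi
  # grep -c prints 0 and exits 1 when nothing matches, so ignore its status.
  count="$(grep -c "Mnesia is overloaded" "$log_file" || true)"
  if [ "$count" -gt "$baseline" ]; then
    echo "ALERT count=$count baseline=$baseline"
    return 1
  fi
  echo "OK count=$count baseline=$baseline"
}
```

Called from cron or a deploy pipeline, for example as check_overload_baseline /var/log/rabbitmq/rabbit@node1.log 50, the non-zero exit status can fail the job or trigger a page.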
Related Errors
This warning is easy to confuse with other Mnesia and clustering messages. The Mnesia network partition error (inconsistent_database / running_partitioned_network) is about nodes disagreeing on state, not about load — see our existing partition guide. The timeout_waiting_for_tables error appears at boot when a node cannot sync its Mnesia tables from peers, often because another node is down or slow. And PRECONDITION_FAILED - inequivalent arg happens when a client redeclares an existing queue with different arguments, which is sometimes a side effect of the same churn-prone code that triggers overload. More guides live under the RabbitMQ category.
Frequently Asked Questions
Is “Mnesia is overloaded” dangerous, or can I ignore it?
It is a warning, not a crash, so the broker keeps running. But it means metadata operations are queuing, which degrades connection setup and channel performance. Treat sustained warnings as a real problem to investigate, not noise to suppress.
What is the difference between write_threshold and time_threshold?
{dump_log, write_threshold} fires when too many transactions accumulate before a dump completes (a volume signal). {dump_log, time_threshold} fires when the periodic dump timer elapses while a previous dump is still running (a timing signal). Both indicate Mnesia is behind on flushing its log.
Does raising dump_log_write_threshold fix it?
Only partially. A larger threshold batches more commits per dump and smooths bursts, so the warnings appear less often. It does not reduce the underlying transaction rate or speed up the disk, so runaway churn will still overload the node eventually.
How is this different from a network partition?
Overload is a single-node load problem driven by metadata churn and slow disk. A partition is a cluster consistency problem where nodes lose contact and disagree on state. Check the log for inconsistent_database to tell them apart — that string means partition, not overload.
Will switching to quorum queues stop the warnings?
It can help if your churn comes from classic mirrored queues, since quorum queues use a different replication model. But if the churn is from rapidly declaring and deleting exclusive or auto-delete queues, the fix is changing that application pattern, regardless of queue type.