RabbitMQ Error Guide: 'statistics database could not be contacted' Metrics Failure
Fix RabbitMQ statistics database unavailable and metrics timeout errors: overloaded stats collector, rates mode, and large topologies stalling the management UI.
Exact Error Message
The management UI banner and the HTTP API both report that metrics cannot be produced:
Statistics database could not be contacted. Message rates and queue lengths
may be temporarily unavailable.
The API returns a 503 or 500 with a body like:
HTTP/1.1 503 Service Unavailable
{"error":"Internal Server Error","reason":"{error,{timeout,...}}"}
The broker log shows the metrics collector falling behind or timing out:
2026-06-29 11:42:18.004 [warning] <0.987.0> Statistics database could not be contacted.
2026-06-29 11:42:48.219 [error] <0.991.0> Error generating metrics: {timeout,
{gen_server,call,[rabbit_mgmt_db,{get_overview,...},30000]}}
2026-06-29 11:43:01.560 [warning] <0.987.0> Management DB: queue events backlog 184213,
dropping older samples
What the Error Means
The management plugin keeps an in-memory statistics database (the rabbit_mgmt_db / metrics collector) that aggregates events emitted by every connection, channel, queue, and node — message rates, queue depths, delivery counts. The UI and /api/* endpoints query this aggregator. When the aggregator cannot keep up with the volume of events, queries to it time out, and the plugin reports “statistics database could not be contacted” or “error generating metrics.”
Crucially, this is a metrics-plane failure, not a data-plane failure. Messages keep flowing through queues normally; only the dashboard and the metrics API are degraded. The collector is a single process per node, so on a large topology with high event rates it becomes a bottleneck while AMQP itself stays healthy.
Common Causes
- Too many objects emitting events. Tens of thousands of queues/connections/channels generate more samples than the collector can aggregate.
- Fine-grained rates mode under load. detailed/basic rates with short sample intervals multiply the per-object work.
- A burst of short-lived connections/channels. Connection churn floods the collector with create/delete events.
- Undersized node. CPU-starved or memory-pressured nodes cannot drain the event backlog.
- A long, expensive API query. Fetching /api/queues for the whole cluster with no pagination forces a huge aggregation in one call.
- Stats collection interval misconfigured. A very low collect_statistics_interval increases overhead disproportionately.
How to Reproduce the Error
Create churn and a large topology, then hammer the metrics API:
# create many queues to inflate the event volume
for i in $(seq 1 20000); do
rabbitmqadmin declare queue name=load-$i durable=false
done
# repeatedly request the full, unpaginated queue list
while true; do curl -s -u admin:admin \
http://localhost:15672/api/queues >/dev/null; done
On a modest node the collector backlog grows, /api/overview starts timing out, and the UI shows “statistics database could not be contacted.”
Diagnostic Commands
# Is the metrics/stats collector overloaded? Check the management DB process
rabbitmq-diagnostics observer --interval 5 # watch rabbit_mgmt_db / metrics procs
# How large is the topology the collector must aggregate?
rabbitmqctl list_queues --no-table-headers name | wc -l
rabbitmqctl list_connections --no-table-headers name | wc -l
rabbitmqctl list_channels --no-table-headers name | wc -l
# Example output, in the same order as the three commands above:
# 48213   (queues)
# 9120    (connections)
# 26540   (channels)
# Pull metrics/timeout errors from the log
sudo grep -iE 'statistics database|generating metrics|Management DB' \
/var/log/rabbitmq/rabbit@$(hostname -s).log | tail -15
# Check the configured rates mode and collection interval
rabbitmq-diagnostics environment | grep -iE 'rates_mode|collect_statistics'
# Time a lightweight vs heavyweight API call to see where it stalls
time curl -s -u admin:admin http://localhost:15672/api/overview >/dev/null
time curl -s -u admin:admin 'http://localhost:15672/api/queues?page=1&page_size=100' >/dev/null
A fast /api/overview combined with a slow or failing /api/queues, particularly when the queues call is unpaginated, points squarely at aggregation volume.
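The timing comparison can be scripted so the result is labelled rather than read off by eye. The following is a minimal sketch, not part of the original guide: the function names, the MGMT_USER/MGMT_PASS variables, and the 5-second SLOW_SECONDS threshold are all illustrative assumptions.

```shell
# Sketch: time a management API call and label the outcome.
# SLOW_SECONDS is an illustrative threshold, not a RabbitMQ default.
SLOW_SECONDS="${SLOW_SECONDS:-5}"

# classify_probe HTTP_CODE SECONDS -> prints ok | slow | failing
classify_probe() {
  local code="$1" secs="$2"
  case "$code" in
    2??) ;;                          # any 2xx status counts as a success
    *) echo "failing"; return ;;
  esac
  # awk compares the fractional seconds that curl reports
  if awk -v s="$secs" -v t="$SLOW_SECONDS" 'BEGIN { exit !(s + 0 > t + 0) }'; then
    echo "slow"
  else
    echo "ok"
  fi
}

# probe URL -> prints "<label> <http_code> <seconds> <url>"
probe() {
  local url="$1" out code secs
  out=$(curl -s -o /dev/null --max-time 60 \
        -u "${MGMT_USER:-admin}:${MGMT_PASS:-admin}" \
        -w '%{http_code} %{time_total}' "$url")
  code=${out%% *}
  secs=${out##* }
  echo "$(classify_probe "$code" "$secs") $code $secs $url"
}
```

For example, `probe http://localhost:15672/api/overview` followed by `probe 'http://localhost:15672/api/queues?page=1&page_size=100'` prints one labelled line per endpoint. A request that hits the curl timeout reports status 000 and is therefore labelled failing.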
Step-by-Step Resolution
- Confirm it is metrics-only. Verify AMQP is healthy (rabbitmq-diagnostics check_running, queues still draining). If publishing/consuming works, the problem is the stats collector, not the broker.
- Reduce rates granularity. In rabbitmq.conf, set a lighter rates mode:
  management.rates_mode = basic
  Use none if you only need static topology and not per-object rate charts; this dramatically cuts collector work.
- Paginate every API query. Stop fetching the whole cluster at once; request ?page=1&page_size=100 and select only needed columns with ?columns=name,messages so the aggregator returns less.
- Cut connection/channel churn. Make clients use long-lived connections and channels instead of opening one per operation; churn is a top driver of collector backlog.
- Lengthen the collection interval if it was set aggressively, trading dashboard freshness for headroom.
- Move metrics off the management DB entirely. For large clusters, scrape with rabbitmq_prometheus (port 15692), which reads native per-object metrics without the aggregating collector, and keep the UI for ad-hoc use.
- Scale or rebalance the node if it is simply CPU/memory starved, so the collector can drain its backlog.
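The two configuration changes in the steps above (lighter rates mode, longer collection interval) could be captured as one rabbitmq.conf fragment. This is a sketch under stated assumptions: the helper name mgmt_tuning_fragment is hypothetical, and the 10000 ms interval is an illustrative value rather than a recommendation from this guide.

```shell
# Print a candidate tuning fragment for rabbitmq.conf.
# The 10000 ms interval is an illustrative assumption; pick a value that
# matches how fresh the dashboard needs to be.
mgmt_tuning_fragment() {
  cat <<'EOF'
# Lighter rates mode: basic, or none to disable rate charts entirely
management.rates_mode = basic
# Emit statistics every 10 s instead of the 5 s default (value in milliseconds)
collect_statistics_interval = 10000
EOF
}
```

One way to use it is to review the output of `mgmt_tuning_fragment`, merge it into the node's rabbitmq.conf by hand, and restart the node so the settings are picked up. Check that neither key is already set elsewhere in the file before adding it.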
Verify recovery by re-running the timed /api/overview call and confirming the log backlog warnings stop.
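For automation that previously pulled the full queue list, the pagination step might look like the sketch below. It assumes the paginated response carries a top-level page_count field and falls back to a single page if it does not; the function names, the credentials variables, and the one-second pause between pages are illustrative assumptions.

```shell
# queues_page_url BASE PAGE PAGE_SIZE COLUMNS -> prints the paginated URL
queues_page_url() {
  printf '%s/api/queues?page=%s&page_size=%s&columns=%s\n' "$1" "$2" "$3" "$4"
}

# page_count_of JSON -> prints the page_count field, or 1 if it is absent
page_count_of() {
  local n
  n=$(printf '%s' "$1" | grep -o '"page_count": *[0-9]*' | head -n 1 \
      | cut -d: -f2 | tr -d ' ') || true
  echo "${n:-1}"
}

# fetch_all_queues BASE -> prints one JSON page per line
fetch_all_queues() {
  local base="$1" page=1 pages=1 body
  while [ "$page" -le "$pages" ]; do
    body=$(curl -s --max-time 30 \
           -u "${MGMT_USER:-admin}:${MGMT_PASS:-admin}" \
           "$(queues_page_url "$base" "$page" 100 name,messages)") || return 1
    printf '%s\n' "$body"
    pages=$(page_count_of "$body")
    page=$((page + 1))
    sleep 1   # assumed pause so the collector is not queried back-to-back
  done
}
```

Calling `fetch_all_queues http://localhost:15672` walks the queue list 100 entries at a time, so no single request asks the collector to assemble every queue's stats at once.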
Prevention and Best Practices
- Prefer rabbitmq_prometheus for ongoing monitoring; reserve the management DB for interactive debugging.
- Always paginate and column-filter management API calls in automation; never pull /api/queues for the whole cluster unbounded.
- Keep connection and channel counts down with pooling and long-lived clients; churn is the silent killer of the stats collector.
- Right-size rates mode: basic or none on large topologies, detailed only on small clusters or short investigations.
- Alert on “statistics database could not be contacted” and on collector backlog log lines so degradation is caught before the UI goes dark.
- Size broker nodes with CPU headroom; the collector competes with AMQP for cores under load.
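The log-based alert could start as a simple line counter run from cron or a monitoring agent. This is a rough sketch: the function names and the threshold argument are hypothetical, and the patterns are the same ones used in the diagnostic grep earlier in this guide.

```shell
# count_stats_warnings LOGFILE -> number of lines that signal collector trouble
count_stats_warnings() {
  grep -ciE 'statistics database could not be contacted|generating metrics|Management DB' "$1" || true
}

# should_alert LOGFILE [THRESHOLD] -> exit status 0 when the count reaches
# the threshold (default 1); the threshold is an illustrative assumption
should_alert() {
  local n
  n=$(count_stats_warnings "$1")
  [ "${n:-0}" -ge "${2:-1}" ]
}
```

A usage example: `should_alert "/var/log/rabbitmq/rabbit@$(hostname -s).log" 5 && echo "stats collector degraded"`. Note that this counts matches across the whole file rather than a time window, so a real alert would need to look only at recent lines or at a freshly rotated log.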
Related Errors
- operation timed out (RPC): rabbitmqctl calls time out against a busy node; a related overload symptom on the control plane.
- management listener failed to start: the UI never comes up at all, versus loading but failing on metrics.
- memory/disk alarm (resource alarm set): node-level pressure that also starves the collector.
- HTTP access denied: a 401/403 auth failure, not a 500/503 metrics failure.
Frequently Asked Questions
Are my messages being lost when I see this error?
No. This is a metrics-plane failure. Queues keep delivering; only the dashboard and metrics API are degraded.
What is the fastest mitigation?
Lower management.rates_mode to basic or none and paginate API calls. Both cut collector load immediately.
Should I keep using the management API for monitoring at scale?
No. Use rabbitmq_prometheus on port 15692 for production monitoring; it bypasses the aggregating stats database.
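As a quick check that the Prometheus endpoint can answer a question you would otherwise ask the management API, the scrape output can be summed with awk. This sketch assumes the plugin is enabled, that the samples carry no trailing timestamp, and that rabbitmq_queue_messages_ready is among the exposed metrics; the sum_metric helper is hypothetical.

```shell
# sum_metric NAME -> reads Prometheus text format on stdin and prints the
# sum of all samples of NAME (assumes the value is the last field on a line)
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # drop the label set, if any
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}
```

For example, `curl -s http://localhost:15692/metrics | sum_metric rabbitmq_queue_messages_ready` prints the total number of ready messages that the endpoint reports, without touching the management stats database.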
Why does /api/overview work but /api/queues time out?
overview is a small aggregate; an unpaginated queues call forces the collector to assemble every queue’s stats in one request.
Does connection churn really matter that much?
Yes. Each create/delete emits events the collector must process. Long-lived connections and channels are one of the biggest reductions you can make.