By James Joyner IV · 9 min read

RabbitMQ Error Guide: 'ra: command timeout' Quorum Queue Raft Timeout

Fix RabbitMQ quorum queue Raft timeouts: diagnose 'ra command timeout' and 'failed to start Raft' from slow disk, overloaded nodes, and network latency.

  • #rabbitmq
  • #troubleshooting
  • #errors
  • #quorum-queues

Exact Error Message

Quorum queues use the Ra Raft library. When a Raft command (an enqueue, ack, or membership change) cannot complete in time, you see Ra timeouts in the log and as client errors:

[error] <0.1142.0> ra: command timeout for {orders, rabbit@mq-02}
[warning] <0.1142.0> ra: leader rabbit@mq-01 for {orders} not responding,
 election timeout, starting pre-vote
operation queue.declare caused internal error:
 {timeout, {ra, '%2F_orders', command}}
[error] <0.1190.0> failed to start Raft server {stream_coordinator, rabbit@mq-03}:
 {error, {timeout, ...}}

On the client side, this surfaces as a publish or declare that hangs and then fails with a channel or operation error wrapping {timeout, {ra, ...}}.

What the Error Means

A quorum queue is a Raft consensus group: every operation that changes state (enqueue, ack, settle, add/remove member) is a Ra command that the leader must replicate to and get acknowledged by a majority of members before it is committed. ra: command timeout means the leader issued a command but did not receive majority confirmation within Ra’s command timeout. failed to start Raft means a Ra server could not initialize or recover its log in time.

These are liveness/latency failures, not necessarily quorum-loss failures: a majority may still be reachable, but commands are taking too long to commit because the disk is slow, the node is CPU-starved, the Raft log (WAL) is backed up, or the network between members is laggy. The cluster is trying to make progress and failing to do so promptly.
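
To make the distinction concrete, here is a minimal sketch of when a command commits. It is a toy model, not Ra's actual implementation: the function names, the millisecond figures, and the 5000 ms timeout are all illustrative assumptions. The commit happens once the majority-th fastest member has persisted the entry, so one slow follower is harmless while two slow followers in a three-member group stall every command.

```python
def commit_latency_ms(leader_fsync_ms, follower_ack_ms):
    """Toy model: time until a majority of members has persisted the entry.

    leader_fsync_ms: time for the leader to persist the entry locally.
    follower_ack_ms: per-follower time until the leader receives its ack
                     (network round trip plus the follower's own fsync).
    """
    members = 1 + len(follower_ack_ms)
    majority = members // 2 + 1
    # The entry commits when the majority-th fastest member is done.
    done_times = sorted([leader_fsync_ms] + list(follower_ack_ms))
    return done_times[majority - 1]


def times_out(leader_fsync_ms, follower_ack_ms, timeout_ms=5000):
    # timeout_ms is an illustrative value, not a documented Ra default.
    return commit_latency_ms(leader_fsync_ms, follower_ack_ms) > timeout_ms
```

With all three members online, `times_out(2, [3, 9000])` is false because the leader and one fast follower form a majority, while `times_out(2, [7000, 9000])` is true even though no member is down.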

Common Causes

  • Slow disk / fsync latency. Ra fsyncs its write-ahead log on commit. A slow or contended disk makes every command slow and trips timeouts under load.
  • Overloaded node (CPU/scheduler starvation). The Ra leader process cannot run promptly, delaying command processing and heartbeats.
  • WAL or segment writer backlog. A flood of small messages overwhelms the shared Ra WAL, queuing commands.
  • Network latency or packet loss between members. Replication round-trips to followers exceed the command timeout, common across availability zones or WAN links.
  • Too many quorum queues per node. Thousands of Ra servers contend for the shared WAL and schedulers, increasing latency for all of them.
  • Large message bodies on quorum queues. Big payloads inflate log entries and replication time.
  • A recovering node replaying a huge log. On restart, failed to start Raft / slow recovery occurs while a large Ra log is replayed.

How to Reproduce the Error

Put a quorum queue on slow storage and flood it with small persistent messages from multiple publishers while the disk is saturated:

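import pika  # assumption: the pika client library and a broker on localhost

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="orders", durable=True, arguments={"x-queue-type": "quorum"})
# saturate the data volume's I/O (e.g., concurrent fio job), then run this loop in many publishers:
persistent = pika.BasicProperties(delivery_mode=2)  # persistent delivery mode
while True:
    ch.basic_publish(exchange="", routing_key="orders", body=b"x" * 64, properties=persistent)
# leader cannot fsync/replicate commands fast enough ->
#   ra: command timeout for {orders, rabbit@...}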

Adding cross-AZ members with added network latency reproduces the replication-side timeout even on fast disks.

Diagnostic Commands

# Quorum queue status: leader, members, and which are online
rabbitmqctl list_queues name type leader members online | grep -i quorum

# Detailed Raft status for a specific queue (term, commit index, members)
rabbitmq-queues quorum_status orders

# Cluster + node health and partition status
rabbitmqctl cluster_status

# Disk free space for the data directory that holds the Ra files (df does not show I/O latency)
df -h $(rabbitmqctl eval 'rabbit_mnesia:dir().' | tr -d '"')

# Erlang scheduler / runtime pressure (is the node CPU-starved?)
rabbitmq-diagnostics runtime_thread_stats 2>/dev/null | head -20

# Count quorum queues in the current vhost, cluster-wide (too many = WAL contention);
# add the leader column to see how they are spread across nodes
rabbitmqctl list_queues name type | grep -c quorum

# Ra / timeout / election events in the log
journalctl -u rabbitmq-server --since "30 min ago" | grep -iE 'ra:|command timeout|election|failed to start raft'

quorum_status showing frequent term increments (leader churn) plus command timeout lines points to a node or disk that cannot keep the leadership stable.
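
When the log is large, counting events per queue is easier than reading grep output. The sketch below does that in Python. The two patterns are derived only from the sample log lines in this guide; real log formats differ between RabbitMQ versions, so treat them as assumptions to adjust.

```python
import re
from collections import Counter

# Patterns based on this guide's sample lines; adjust for your version.
TIMEOUT_RE = re.compile(r"ra: command timeout for \{(?P<queue>[^,}]+),")
ELECTION_RE = re.compile(
    r"ra: leader \S+ for \{(?P<queue>[^,}]+)\} not responding"
)


def summarize_ra_events(lines):
    """Count command timeouts and leader-not-responding events per queue."""
    timeouts, elections = Counter(), Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group("queue").strip()] += 1
            continue
        match = ELECTION_RE.search(line)
        if match:
            elections[match.group("queue").strip()] += 1
    return timeouts, elections
```

One way to use it is to pipe the journalctl output into a small script that calls `summarize_ra_events(sys.stdin)` and prints both counters. A queue that appears in both is a likely candidate for leader churn.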

Step-by-Step Resolution

  1. Confirm a majority is actually online. Run list_queues ... online and quorum_status. If a majority is online but commands still time out, this is a latency problem (this guide). If a majority is down, it is a quorum-loss problem instead.

  2. Check disk latency first. Ra fsyncs on every commit. Run df -h on the data directory and check disk I/O/fsync latency. Move the Ra data volume to faster local storage (NVMe) rather than network/EBS-style disks if fsync latency is high.

  3. Check node load. Use runtime_thread_stats. If schedulers are saturated, the node is CPU-starved; add vCPUs or rebalance queues across more nodes so the leader process runs promptly.

  4. Reduce quorum queue count per node. Thousands of Ra servers share one WAL. Consolidate queues, spread them across more nodes, or remove unused quorum queues to relieve WAL contention.

  5. Address network latency. For multi-AZ clusters, keep members within low-latency links; quorum queues are sensitive to replication round-trip time. Avoid stretching a single quorum group across high-latency WAN.

  6. Shrink message size and burst rate. Move large payloads out-of-band and smooth publish bursts so the WAL is not overwhelmed.

  7. For failed to start Raft on restart, allow time for log recovery, ensure the node has disk and memory headroom, and confirm cluster_status shows no network partition blocking the member from rejoining.

  8. Verify. Re-run quorum_status; a stable leader, advancing commit index, and no further command timeout lines confirm recovery.
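
For step 2, a rough fsync latency probe can be sketched in Python as below. It is a quick sanity check rather than a benchmark (a dedicated tool such as fio is more rigorous). The function names, sample count, and payload size are assumptions chosen for illustration, and the probe should be pointed at a scratch directory on the same volume as the Ra data.

```python
import math
import os
import statistics
import tempfile
import time


def fsync_latency_ms(directory, samples=50, payload=b"x" * 4096):
    """Write small chunks to a temp file in `directory`, fsync each one,
    and return the per-write latencies in milliseconds."""
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(samples):
            start = time.perf_counter()
            f.write(payload)
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to persist to the device
            latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def summarize(latencies):
    """Median, nearest-rank p99, and max of a list of latencies."""
    ordered = sorted(latencies)
    p99_index = math.ceil(len(ordered) * 99 / 100) - 1
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }
```

Compare the numbers before and after moving the volume, and while the node is under publish load. The tail (p99 and max) matters more than the median, because a few slow fsyncs are enough to delay a batch of commands.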

Prevention and Best Practices

  • Use fast, low-latency local disk for quorum queue data; benchmark fsync latency before production.
  • Keep quorum group members within a low-latency network; do not stretch a single group across high-latency links.
  • Cap the number of quorum queues per node and spread them across the cluster to limit WAL contention.
  • Keep message bodies small and smooth publish bursts.
  • Size nodes with CPU headroom so the Ra leader process never starves.
  • Alert on quorum leader churn (term increments) and on command timeout log lines.

Related Errors

  • quorum cannot reach majority: the harder failure where a majority of members is down, not merely slow.
  • publisher nack received: Ra command timeouts can escalate into publisher nacks when commits cannot complete.
  • resource alarm: disk pressure on the Ra volume can both alarm and slow Raft commits.
  • flow control active: a quorum queue that cannot commit quickly applies back-pressure that surfaces as flow on publishers.
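
The leader-churn alert recommended under Prevention and Best Practices can be approximated by sampling a queue's Raft term at intervals and flagging a rapid rise. This is a minimal sketch: the sample format, the five-minute window, and the threshold of three increments are assumptions, and how you collect the term (for example, by parsing quorum_status output) is left out.

```python
def term_churn(samples, window_s=300, max_increments=3):
    """samples: list of (unix_timestamp, raft_term) tuples, oldest first.

    Returns True when the term rose by more than `max_increments`
    within the most recent `window_s` seconds. Raft terms never
    decrease, so last-minus-first equals the number of increments.
    """
    if not samples:
        return False
    cutoff = samples[-1][0] - window_s
    recent = [term for ts, term in samples if ts >= cutoff]
    return recent[-1] - recent[0] > max_increments
```

A stable queue keeps the same term for long periods, so any sustained climb is worth an alert even if the threshold needs tuning for your cluster.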

Frequently Asked Questions

What is Ra? Ra is the Erlang Raft consensus library that RabbitMQ uses to implement quorum queues and the stream coordinator. Every state change in a quorum queue is a Ra command replicated to a majority of members.

Does a command timeout mean my data is lost? No. A timeout means the command was not confirmed in time; it is retried or surfaces as a publisher nack. Committed entries are durable. Treat timed-out publishes as failed and retry idempotently, because the original attempt may still have been committed.
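
Because a timed-out publish may still have been committed, a blind retry can create a duplicate. One common pattern is to reuse a single message ID across attempts and deduplicate on the consumer. The sketch below assumes a hypothetical `publish(message_id, body)` callable that returns True on a broker confirm, returns False on a nack, and raises TimeoutError when no answer arrives; it is not tied to any particular client library.

```python
import time
import uuid


def publish_with_retry(publish, body, attempts=3, backoff_s=0.5):
    """Retry a publish whose outcome is unknown, reusing one message ID."""
    message_id = str(uuid.uuid4())  # the same ID is sent on every attempt
    for attempt in range(1, attempts + 1):
        try:
            if publish(message_id, body):
                return message_id
        except TimeoutError:
            pass  # outcome unknown: the message may already be enqueued
        if attempt < attempts:
            time.sleep(backoff_s * attempt)  # simple linear backoff
    raise RuntimeError(f"publish failed after {attempts} attempts")


class Deduplicator:
    """Consumer-side guard: process each message ID at most once."""

    def __init__(self):
        self._seen = set()

    def first_time(self, message_id):
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        return True
```

In a real consumer the seen-ID set would need to be bounded and persisted (for example in a database with a TTL); the in-memory set here only illustrates the idea.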

Why is disk speed so important for quorum queues? Ra fsyncs its write-ahead log on commit. Slow fsync latency directly slows every command, so disk performance dominates quorum queue throughput.

Is this the same as losing quorum? No. Command timeouts often happen with a majority online but slow. Losing quorum means a majority of members is actually down, which is a separate failure.
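
The two cases can be told apart from the members and online columns of list_queues. A minimal sketch of that check, with a function name and return labels that are purely illustrative:

```python
def classify(members, online):
    """Classify a quorum queue from its member and online-member lists."""
    majority = len(members) // 2 + 1
    online_members = set(members) & set(online)
    if len(online_members) < majority:
        return "quorum-loss"
    # A majority is reachable, so timeouts point to latency instead.
    return "majority-online"
```

For a three-member queue, two online members still form a majority, so command timeouts in that state are a latency problem rather than a membership problem.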

How many quorum queues can a node handle? There is no hard limit, but thousands of Ra servers contend for the shared WAL and schedulers. Spread queues across nodes and remove unused ones if you see WAL-related timeouts.
