Kafka Error Guide: 'Metadata quorum unavailable' Controller Majority Down

Exact Error Message

When a client, broker, or admin tool tries to reach the KRaft controller quorum and a majority of controllers are down, you see the quorum reported as unavailable. From a broker’s server.log:

[2026-06-29 09:41:55,002] WARN [BrokerLifecycleManager id=12] Unable to send heartbeat to the active controller; quorum controller unavailable (kafka.server.BrokerLifecycleManager)
[2026-06-29 09:41:57,118] ERROR [RaftManager id=12] Metadata quorum unavailable: no active leader among voters [1, 2, 3] after 3 attempts (org.apache.kafka.raft.KafkaRaftClient)
[2026-06-29 09:42:00,540] WARN [BrokerServer id=12] Waiting for the controller quorum to become available before publishing metadata (kafka.server.BrokerServer)

Admin tools surface it more directly:

org.apache.kafka.common.errors.TimeoutException: The metadata quorum is unavailable;
could not reach a majority of controllers [1@c1:9093, 2@c2:9093, 3@c3:9093]

What the Error Means

The KRaft controller quorum commits metadata only when a majority of voters are alive and in contact with an elected leader. “Metadata quorum unavailable” (and the closely related “Quorum controller unavailable”) means the caller, whether a broker, a client, or an admin tool, could not find a reachable, active controller leader, usually because the majority needed to sustain one is gone. With 2 of 3 controllers down, or with every controller reachable but unable to elect a leader, nothing can commit new records to __cluster_metadata.

Unlike “Raft leader election failed,” which describes the internal consensus mechanism churning, this error is the outward symptom: brokers cannot heartbeat, topic admin calls time out, and the cluster cannot register new brokers or accept metadata changes. Existing produce/consume traffic may continue briefly on cached metadata, but anything requiring the controller stalls.

Common Causes

A majority of controllers are stopped or crashed — the most common cause. 2-of-3 or 3-of-5 down leaves no quorum.
Controllers are up but cannot elect a leader, typically because a network partition on the controller listener leaves no group of voters holding a majority, so there is no active controller.
Client points at the wrong bootstrap controllers — --bootstrap-controller or controller.quorum.bootstrap.servers lists hosts that are decommissioned or unreachable.
All controllers restarted at once and are still replaying the metadata log, so no leader has been published yet.
Disk-full or hung controllers that appear “up” to systemd but cannot serve Raft, so the quorum is functionally dead.
Firewall/security-group change that severs broker-to-controller (9093) connectivity even though controllers are healthy among themselves.
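Several of these causes come down to whether the controller listener can be reached at all from the host reporting the error. A minimal sketch of a TCP-level probe follows, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available. The function names probe_port and report_controllers are hypothetical helpers, and the hosts c1 to c3 and port 9093 are the examples used in this guide.

```shell
#!/usr/bin/env bash
# Sketch: TCP-level reachability probe for the controller listener.
# Assumes bash (for /dev/tcp) and coreutils `timeout`.

# probe_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened, non-zero otherwise.
probe_port() {
  local host=$1 port=$2 wait=${3:-3}
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# report_controllers PORT HOST...
# Prints one line per host: "<host>:<port> reachable" or "<host>:<port> unreachable".
report_controllers() {
  local port=$1 host
  shift
  for host in "$@"; do
    if probe_port "$host" "$port"; then
      echo "$host:$port reachable"
    else
      echo "$host:$port unreachable"
    fi
  done
}

# Example (run from a broker host):
#   report_controllers 9093 c1 c2 c3
```

An open port only shows that something is listening. A hung or disk-full controller can still accept TCP connections, so treat "reachable" as necessary rather than sufficient.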

How to Reproduce the Error

On a three-controller cluster, take down two controllers so the majority is lost, then issue an admin command:

# Lab only: stop two of three controllers
sudo systemctl stop kafka   # on controller 2 and controller 3

# From a broker host, attempt a metadata operation
kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status

With only one of three voters alive, the surviving controller cannot collect the two votes it needs to become leader, so the command times out with the quorum-unavailable error. Brokers begin logging failed heartbeats to the controller.

Diagnostic Commands

All read-only.

# Try every controller as bootstrap to find ANY that responds
for h in c1 c2 c3; do
  echo "== $h =="
  kafka-metadata-quorum.sh --bootstrap-controller $h:9093 describe --status 2>&1 | head -8
done

# Replication view: which voters have recent LastFetchTimestamp
kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --replication

# Are the controller processes actually running and healthy?
sudo systemctl status kafka --no-pager | head -8
journalctl -u kafka --since "15 min ago" | grep -iE 'quorum|controller|heartbeat|disk'

# Controller-side log evidence
grep -iE 'quorum unavailable|no active leader|becoming inactive|out of disk' \
  /var/log/kafka/controller.log | tail -40

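# Confirm the controller listener is bound on 9093, then list established
# connections on that port to see which peers are actually connected (read-only)
ss -ltnp | grep 9093
ss -tn state established | grep 9093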

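If at least one controller answers describe --status with a valid LeaderId, the quorum itself is healthy and the problem is most likely the client’s bootstrap list or the network path from the client. If none answer, either a majority is genuinely down or the host you are testing from cannot reach the controller listener.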
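The decision rule above can be sketched as a small script that takes one captured describe --status output per controller tried. It assumes the status output carries a "LeaderId:" line followed by an integer; check that against your Kafka version. The helper names reports_leader and classify_quorum and the sample captures are hypothetical and for illustration only.

```shell
#!/usr/bin/env bash
# reports_leader: succeed if the describe --status text on stdin has a
# non-negative integer LeaderId field (assumed layout: "LeaderId: <int>").
reports_leader() {
  awk '$1 == "LeaderId:" && $2 ~ /^[0-9]+$/ { found = 1 } END { exit !found }'
}

# classify_quorum STATUS_TEXT...
# Takes one captured describe --status output per controller tried.
classify_quorum() {
  local text answered=0
  for text in "$@"; do
    if reports_leader <<<"$text"; then
      answered=$((answered + 1))
    fi
  done
  if (( answered > 0 )); then
    echo "leader reported by $answered of $# bootstrap targets: check the client bootstrap list"
  else
    echo "no leader reported by any of $# bootstrap targets: treat the majority as down"
  fi
}

# Illustrative captures only: c1 answers, c2 and c3 time out.
ok='ClusterId:      example-cluster-id
LeaderId:       1
LeaderEpoch:    7
HighWatermark:  4821'
err='org.apache.kafka.common.errors.TimeoutException: The metadata quorum is unavailable;'

classify_quorum "$ok" "$err" "$err"
```

In practice each argument would be the output of the per-controller loop shown earlier, captured with 2>&1 so that timeouts are included.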

Step-by-Step Resolution

Count how many controllers are actually alive and serving. A 3-voter quorum needs 2; a 5-voter quorum needs 3. Identify the minimum set you must restore.
Bring stopped controllers back up. Start the crashed/stopped controller processes. As soon as a majority is in contact, an election completes and a leader is published.
If controllers are “up” but unresponsive, check disk and hangs. A full data disk or a stuck JVM keeps the process registered but unable to serve Raft. Free space or restart the hung node.
Restore controller-to-controller and broker-to-controller networking on the controller listener port. Verify firewalls and security groups allow 9093 among all controllers and from brokers.
Fix the client bootstrap list if controllers are healthy: point --bootstrap-controller / controller.quorum.bootstrap.servers at currently valid controller hosts.
Confirm recovery with describe --status showing a valid LeaderId and an advancing HighWatermark, then verify brokers resume heartbeating in server.log.
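The final confirmation step can be sketched as a comparison of two describe --status captures taken a few seconds apart. This assumes the output contains "LeaderId:" and "HighWatermark:" lines with integer values. The helper names status_field and quorum_recovered are hypothetical, and c1:9093 is this guide's example controller.

```shell
#!/usr/bin/env bash
# status_field NAME: print the value of the "NAME:" line from
# describe --status text on stdin.
status_field() {
  awk -v key="$1:" '$1 == key { print $2; exit }'
}

# quorum_recovered BEFORE AFTER
# BEFORE and AFTER are two describe --status captures taken a few seconds apart.
# Succeeds when AFTER has a non-negative LeaderId and a HighWatermark
# strictly greater than the one in BEFORE.
quorum_recovered() {
  local leader hwm_before hwm_after
  leader=$(status_field LeaderId <<<"$2")
  hwm_before=$(status_field HighWatermark <<<"$1")
  hwm_after=$(status_field HighWatermark <<<"$2")
  [[ $leader =~ ^[0-9]+$ && $hwm_before =~ ^[0-9]+$ && $hwm_after =~ ^[0-9]+$ ]] || return 1
  (( hwm_after > hwm_before ))
}

# Example against a live cluster:
#   before=$(kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status)
#   sleep 10
#   after=$(kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status)
#   quorum_recovered "$before" "$after" && echo "quorum healthy"
```

Comparing two samples rather than reading one guards against a stale leader value on a quorum that has stopped committing.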

Prevention and Best Practices

Run 3 or 5 controllers across separate failure domains so losing one (or two, with 5) never breaks the majority.
Never restart all controllers simultaneously; roll them one at a time, waiting for each to rejoin the quorum.
Monitor each controller’s disk usage and alert well before the metadata log/snapshot directory fills.
Alert on brokers logging “quorum controller unavailable” and on describe --status failing across all controllers.
Keep controller.quorum.bootstrap.servers lists current; remove decommissioned controllers from client configs.
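As a rough illustration of the disk-usage alert, the sketch below reads the used-space percentage from df -P and compares it to a threshold. The helper names, the 80% threshold, and the path /var/lib/kafka/metadata are assumptions; substitute the directory your controllers actually use for the metadata log.

```shell
#!/usr/bin/env bash
# disk_used_pct PATH: print the used-space percentage (integer) of the
# filesystem holding PATH, taken from the Capacity column of `df -P`.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# disk_alert PATH THRESHOLD
# Exit 0 when usage is at or above THRESHOLD percent, 1 when below,
# 2 when the usage could not be read.
disk_alert() {
  local pct
  pct=$(disk_used_pct "$1")
  [[ $pct =~ ^[0-9]+$ ]] || return 2
  (( pct >= $2 ))
}

# Example: warn at 80% on a hypothetical metadata log directory.
#   disk_alert /var/lib/kafka/metadata 80 && echo "WARN: metadata disk above 80%"
```

A cron job or monitoring agent would normally own this check; the point of the sketch is only that the alert should fire well before the filesystem reaches 100%.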

Related Errors

Raft leader election failed: the internal reason a quorum stays unavailable when controllers are up but cannot agree.
Unable to fetch metadata log: a controller that is up but too far behind to count toward the active quorum.
Metadata loader failed: a broker that reaches the quorum but cannot apply the metadata it receives.

Frequently Asked Questions

Will producers and consumers stop immediately? Not necessarily. Clients with cached metadata can often keep producing and consuming briefly, but any operation needing the controller (new topics, leader changes, broker registration) fails until the quorum recovers.

How many controllers can I lose? Tolerance is floor((N-1)/2). With 3 you tolerate 1; with 5 you tolerate 2. An even voter count adds no tolerance: 4 voters still tolerate only 1. Losing more than that makes the quorum unavailable.
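The arithmetic behind these numbers, as a short sketch: a majority is floor(N/2) + 1 voters, and the quorum survives as many failures as remain once that majority is set aside. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# quorum_majority N: smallest number of voters that forms a majority of N.
quorum_majority() { echo $(( $1 / 2 + 1 )); }

# quorum_tolerance N: how many voters can fail while a majority remains.
quorum_tolerance() { echo $(( ($1 - 1) / 2 )); }

# has_quorum N ALIVE: succeed if ALIVE voters out of N still form a majority.
has_quorum() { (( $2 >= $1 / 2 + 1 )); }

for n in 3 4 5; do
  echo "voters=$n majority=$(quorum_majority "$n") tolerated_failures=$(quorum_tolerance "$n")"
done
# Prints:
#   voters=3 majority=2 tolerated_failures=1
#   voters=4 majority=3 tolerated_failures=1
#   voters=5 majority=3 tolerated_failures=2
```

The 4-voter row shows why odd counts are preferred: the fourth controller raises the majority requirement without adding any failure tolerance.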

The processes are running but the quorum is still unavailable — why? “Running” is not “serving.” A full disk, a hung JVM, or a controller-network partition leaves processes up but the Raft quorum dead. Check disk and inter-controller connectivity.

Is this the same as ZooKeeper being down in old Kafka? Conceptually yes — the metadata layer is unreachable — but in KRaft the metadata lives in the controllers themselves, so the fix is restoring a controller majority, not a separate ensemble.

Can I lower the timeout to fail faster? You can tune client request timeouts, but the real fix is restoring quorum availability, not masking it with shorter timeouts.
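If a faster failure is still wanted for scripted health checks, one possible approach is to give the admin tool a properties file with shorter timeouts. The sketch below assumes your version of kafka-metadata-quorum.sh accepts --command-config. request.timeout.ms and default.api.timeout.ms are standard Admin client settings, but the file path and the millisecond values here are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Write a small Admin client properties file with shorter timeouts.
# The path and the values are illustrative; tune them for your environment.
cfg=${TMPDIR:-/tmp}/quorum-check.properties
cat >"$cfg" <<'EOF'
# Per-request timeout for the Admin client
request.timeout.ms=5000
# Overall deadline for each Admin API call, including retries
default.api.timeout.ms=10000
EOF
echo "wrote $cfg"

# Then pass it to the tool (c1:9093 is this guide's example controller):
#   kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
#     --command-config "$cfg" describe --status
```

Keep default.api.timeout.ms at or above request.timeout.ms. This only changes how quickly the check reports the outage, not whether the quorum is available.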
