By James Joyner IV · 9 min read

Kafka Error Guide: 'Raft leader election failed' No Quorum Leader Elected

Fix KRaft 'Raft leader election failed': diagnose missing quorum leader, bad controller.quorum.voters, network partitions, and clock/epoch issues between controllers.

  • #kafka
  • #troubleshooting
  • #errors
  • #kraft

Exact Error Message

In KRaft mode, the controllers run an internal Raft quorum that manages the __cluster_metadata topic. When the voters cannot agree on a leader, you see repeated election cycles in controller.log and never reach a stable leader:

[2026-06-29 14:02:11,318] INFO [RaftManager id=1] Completed transition to CandidateState(localId=1, epoch=487, retries=12, ...) (org.apache.kafka.raft.QuorumState)
[2026-06-29 14:02:11,820] INFO [RaftManager id=1] Re-elect as candidate after election backoff has completed (org.apache.kafka.raft.KafkaRaftClient)
[2026-06-29 14:02:12,114] WARN [RaftManager id=1] Raft leader election failed; received 0 votes from voters [2, 3], rejected by [] (org.apache.kafka.raft.KafkaRaftClient)
[2026-06-29 14:02:12,640] INFO [QuorumController id=1] Becoming inactive: no quorum leader elected within timeout (org.apache.kafka.controller.QuorumController)

The epoch number keeps climbing (487, 488, 489…) because each failed election bumps the term, but no node ever wins a majority of votes.

What the Error Means

KRaft replaces ZooKeeper with a built-in Raft consensus group made up of the controller nodes listed in controller.quorum.voters. To make progress, this group must elect a single leader, and a candidate only becomes leader if it receives votes from a strict majority (for example, 2 of 3 voters). “Raft leader election failed” means a candidate started an election, requested votes, and did not gather a majority before the election timeout, so it retries with a higher epoch.
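The majority rule is plain integer arithmetic. The sketch below is illustrative only: majority and can_elect are hypothetical helper names, not part of any Kafka tooling.

```shell
# Smallest number of votes that forms a strict majority of N voters.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds when a quorum of $1 voters can elect a leader with $2 voters reachable.
can_elect() {
  [ "$2" -ge "$(majority "$1")" ]
}

for n in 3 4 5; do
  echo "voters=$n majority=$(majority "$n")"
done
# voters=3 majority=2
# voters=4 majority=3
# voters=5 majority=3
```

Note that 4 voters need 3 votes, so a fourth voter adds no failure tolerance over 3.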

While there is no leader, the controller quorum is read-only at best and usually fully unavailable: no metadata changes (topic creation, partition reassignment, broker registration) can be committed. Brokers that depend on this metadata may refuse to start or begin fencing partitions. This is distinct from “metadata quorum unavailable,” which describes the symptom from the client side; this error is the underlying election failing to converge.

Common Causes

  • Misconfigured controller.quorum.voters: the voter set differs between nodes, or the IDs/ports do not match the actual node.id and controller listener of each host. A node that receives a vote request from a peer it cannot identify will reject or ignore it.
  • Network partition between controllers — the controller listener port (commonly 9093) is blocked or flapping, so vote requests never reach a majority.
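  • Too few reachable voters, often made worse by an even voter count: a 4-voter quorum needs 3 votes, so with 2 of 4 voters down there is no majority, and with an even count a split vote can persist across several rounds.
  • Persistent epoch confusion after a messy restart, where a node comes back with stale state and its vote requests are rejected as outdated. Clock skew does not change how epochs are compared (they are logical counters, not timestamps), but it makes log timestamps from different controllers hard to correlate.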
  • A single voter trying to elect itself in a multi-voter quorum after the others were removed or never started.
  • Mismatched cluster ID from a partially re-formatted node, so its votes are rejected by the rest of the quorum.

How to Reproduce the Error

On a three-controller KRaft cluster, stop two of the three controller processes (or block the controller listener port between them) so no node can reach a majority:

# On controller 2 and controller 3, stop the process (lab only)
sudo systemctl stop kafka

# Watch controller 1 fail to elect a leader
sudo journalctl -u kafka -f | grep -i 'election\|candidate\|quorum'

Controller 1 transitions to candidate, bumps its epoch on every cycle, and logs “Raft leader election failed; received 0 votes” because the other two voters are unreachable. A configuration variant: set different controller.quorum.voters strings on each node and restart — votes are rejected and elections never converge.
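The climbing epoch can be read straight out of the log. The following is a minimal sketch that assumes the CandidateState line format shown in the sample log above; candidate_epochs and count_candidate_epochs are hypothetical helper names.

```shell
# Print the distinct epochs seen in CandidateState transitions, ascending.
# Reads controller log text on stdin.
candidate_epochs() {
  grep 'CandidateState' \
    | sed -n 's/.*epoch=\([0-9][0-9]*\).*/\1/p' \
    | sort -n | uniq
}

# Count the distinct candidate epochs on stdin.
count_candidate_epochs() {
  candidate_epochs | wc -l | tr -d ' '
}

# Usage (not run here):
#   tail -n 500 /var/log/kafka/controller.log | candidate_epochs
```

Many distinct epochs within a short log window, with no leader elected in between, matches the churn pattern described above.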

Diagnostic Commands

All commands below are read-only.

# Quorum status: who (if anyone) is the leader, current epoch, lag
kafka-metadata-quorum.sh --bootstrap-controller controller1:9093 \
  describe --status

# Per-voter replication state (LastFetchTimestamp shows reachability)
kafka-metadata-quorum.sh --bootstrap-controller controller1:9093 \
  describe --replication

# Storage format and cluster ID recorded for THIS node's log directories
kafka-storage.sh info -c /etc/kafka/controller.properties

# Configured identity and voter set on THIS node
grep -E '^(node.id|process.roles|controller.quorum.voters|controller.quorum.bootstrap.servers|listeners)' \
  /etc/kafka/controller.properties

# Election churn in the controller log
grep -iE 'election failed|becoming inactive|transition to Candidate|received .* votes' \
  /var/log/kafka/controller.log | tail -40

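# Is the controller listener bound on THIS node? (read-only probe)
ss -ltnp | grep 9093

# Is a peer's controller listener reachable from this node? (read-only probe;
# controller2 is a placeholder peer hostname)
nc -vz controller2 9093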

If describe --status errors with “no leader” against every bootstrap controller, the quorum genuinely has no leader. Compare controller.quorum.voters across all nodes and confirm each id@host:port is correct and reachable.
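For scripted checks, the status text can be reduced to a yes/no answer. This sketch assumes describe --status prints one "Name: value" pair per line, including the LeaderId and HighWatermark fields, and it treats an empty or non-numeric LeaderId (such as -1) as "no leader". That is an assumption about the output, so confirm it against your Kafka version. status_field and has_leader are hypothetical helper names.

```shell
# Extract one "Name: value" field from status text on stdin.
status_field() {
  awk -v key="$1" -F':' '
    $1 == key { gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2; exit }'
}

# Succeeds only when LeaderId is present and a non-negative integer.
has_leader() {
  leader=$(status_field LeaderId)
  case "$leader" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage (not run here):
#   kafka-metadata-quorum.sh --bootstrap-controller controller1:9093 \
#     describe --status | has_leader || echo "no quorum leader"
```

If the tool fails and prints nothing, has_leader also returns non-zero, so a timeout is treated as "no leader".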

Step-by-Step Resolution

  1. Verify the voter set is identical and correct on every controller. The controller.quorum.voters string (e.g. 1@c1:9093,2@c2:9093,3@c3:9093) must be byte-for-byte the same on all static-quorum nodes, and each ID/host/port must match that node’s actual node.id and controller listeners. Fix mismatches and restart only the misconfigured node.
  2. Confirm a majority of voters are actually running. A 3-voter quorum needs 2 up; a 5-voter quorum needs 3. Start any stopped controllers so a majority exists, then watch the election converge.
  3. Restore network reachability on the controller listener. Ensure the controller port (9093) is open between all controllers in both directions: every controller both sends vote requests to its peers and receives them, so a one-way firewall rule is not enough.
  4. Check for cluster ID mismatch. Run kafka-storage.sh info on each node; a node formatted with a different cluster ID will be rejected. Re-format only that node against the correct cluster ID following your runbook (not covered here, as it is destructive).
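  5. Verify clocks are in sync with NTP/chrony. Raft epochs are logical counters and do not depend on wall-clock time, but synchronized clocks keep log timestamps comparable across controllers while you trace an election.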
  6. Restart cleanly, one node at a time, watching describe --status until LeaderId is populated and HighWatermark advances.
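Step 1 can be checked mechanically once you have collected the controller.quorum.voters value from each node. The sketch below covers only the comparison; how you gather the strings (ssh, configuration management) is left open, and voters_identical and voters_include_id are hypothetical helper names.

```shell
# Succeeds when every voter string passed as an argument is identical.
voters_identical() {
  first=$1
  for v in "$@"; do
    [ "$v" = "$first" ] || return 1
  done
}

# Succeeds when node id $2 appears as an id@host:port entry in voter string $1.
voters_include_id() {
  case ",$1," in
    *",$2@"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (not run here), reading the value from the local properties file:
#   v=$(grep '^controller.quorum.voters=' /etc/kafka/controller.properties | cut -d= -f2-)
#   voters_include_id "$v" 1 || echo "node.id 1 is not in the voter set"
```

The comparison is an exact string match, in line with the byte-for-byte requirement in step 1, so a reordered but equivalent list is also flagged.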

Prevention and Best Practices

  • Always run an odd number of controllers (3 or 5). An even count adds no failure tolerance (4 voters still survive only 1 failure, the same as 3) and makes split votes more likely.
  • Manage controller.quorum.voters (or controller.quorum.bootstrap.servers for KIP-853 dynamic quorums) through configuration management so every node shares an identical, version-controlled voter definition.
  • Spread controllers across failure domains (racks/AZs) but keep the controller listener latency low — Raft is latency-sensitive.
  • Alert on kafka-metadata-quorum.sh describe --status returning no leader or a stalled HighWatermark.
  • Keep NTP enabled and monitor clock skew on controller hosts.
  • Never change a node’s node.id or cluster ID in place; treat those as immutable identity.

Related Errors

  • Metadata quorum unavailable: the client-facing symptom when this election never succeeds. See the Kafka guides for that walkthrough.
  • Unable to fetch metadata log: a follower controller too far behind to participate effectively in elections.
  • Leader epoch mismatch: fencing after a contested election bumps the epoch.

Frequently Asked Questions

Why does the epoch keep increasing? Each failed election starts a new term (epoch) in Raft. A climbing epoch with no leader is the signature of an election that cannot reach majority.

Can a single surviving controller elect itself? No. One voter out of three is a minority and can never win. You need a majority up to elect any leader.

Does this affect brokers immediately? Brokers can run on cached metadata briefly, but no metadata mutations commit and new brokers cannot register, so the cluster effectively freezes for changes.

Is controller.quorum.bootstrap.servers a replacement for controller.quorum.voters? With KIP-853 dynamic quorums you bootstrap with controller.quorum.bootstrap.servers and manage membership via kafka-metadata-quorum.sh add-controller/remove-controller rather than a static voter list. Mixing the two inconsistently is itself a common cause of election failure.

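How fast should an election complete? On a healthy LAN-latency quorum, the vote round itself completes within a few hundred milliseconds. How long a node waits before starting or retrying an election is governed by controller.quorum.election.timeout.ms and controller.quorum.election.backoff.max.ms. An election that drags on for many seconds points at network or configuration problems.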
