AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Error while electing or becoming controller on broker 1' Election Failure

Fix Kafka 'Error while electing or becoming controller on broker 1': diagnose ZooKeeper session loss, quorum problems, znode conflicts, and stuck controller election.

  • #kafka
  • #troubleshooting
  • #errors
  • #controller

Exact Error Message

A broker tries to take the controller role and fails partway through the handoff. The server.log (or controller.log) shows:

[2026-06-28 09:41:03,221] ERROR [Controller id=1] Error while electing or becoming controller on broker 1. Trigger controller movement immediately (kafka.controller.KafkaController)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /controller
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
        at kafka.zk.KafkaZkClient.registerControllerAndIncrementControllerEpoch(KafkaZkClient.scala:189)
        at kafka.controller.KafkaController.elect(KafkaController.scala:1402)
[2026-06-28 09:41:03,402] WARN [Controller id=1] Unable to become controller, resigning (kafka.controller.KafkaController)
[2026-06-28 09:41:03,540] ERROR [Controller id=1] Error processing controller request (kafka.controller.ControllerEventManager)

A KRaft cluster has no /controller znode. There, the analogous failure appears on a controller node that cannot win or keep leadership of the metadata quorum.

What the Error Means

Becoming the controller is a multi-step transaction. In ZooKeeper mode the candidate broker must create the /controller znode, increment the controller epoch in /controller_epoch, and then load full cluster state. If any step fails — the ZooKeeper session expires mid-election, the znode is held by another broker, or reading metadata throws — Kafka aborts with “Error while electing or becoming controller on broker N” and resigns so another broker can try.

“Unable to become controller, resigning” and “Error processing controller request” are the follow-on lines: the broker backed out of a partial election to avoid a corrupt half-controller state. The cluster is designed to retry election, so a single occurrence followed by a successful election elsewhere is recoverable. Repeated occurrences mean the underlying dependency — ZooKeeper connectivity, the KRaft quorum, or the metadata store — is unhealthy, and no broker can complete election, which leaves the cluster without a working controller.
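
One rough way to tell a recoverable failover from a stuck election is to compare failures and successes in the logs. The sketch below assumes the message wording quoted in this guide, which may differ between Kafka versions. `election_summary` is a hypothetical helper name, not part of Kafka.

```shell
#!/bin/sh
# Hypothetical helper: summarize controller election outcomes from log text on stdin.
# Assumes the message wording quoted in this guide; adjust the patterns for your version.
election_summary() {
  awk '
    /Error while electing or becoming controller/ { failed++ }
    /successfully elected as the controller/      { won++ }
    END {
      printf "failures=%d successes=%d\n", failed, won
      if (failed > 0 && won == 0) { print "verdict=stuck" }
      else if (failed > 0)        { print "verdict=recovered" }
      else                        { print "verdict=clean" }
    }'
}

# Usage against a real log (the path is an assumption):
#   election_summary < /var/log/kafka/server.log
```

Because the successful election may happen on a different broker, the verdict is only meaningful when the input combines the logs of every broker in the cluster.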

Common Causes

  • ZooKeeper session expiry during election (the snippet above): the broker’s ZK session timed out before it finished writing /controller, usually due to GC pauses, network blips, or an overloaded ZK ensemble.
  • Stale /controller znode held by a crashed broker. The previous controller died, but its ZooKeeper session has not timed out yet, so its ephemeral znode still exists and the candidate cannot create a new one.
  • ZooKeeper ensemble unhealthy / no quorum. Fewer than a majority of ZK nodes are up, so writes (epoch increment) fail.
  • KRaft quorum loss. In KRaft mode, if fewer than a majority of controller voters are reachable, no node can win leadership, producing repeated election failures.
  • Epoch conflicts (and, less commonly, clock skew). If another broker updates /controller_epoch first, this candidate's epoch increment fails or is rejected.
  • Disk or metadata corruption on the candidate that throws while loading cluster state after winning the znode.
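
Both quorum causes above come down to the same arithmetic: an ensemble of N voters needs floor(N/2) + 1 of them reachable. The sketch below works through that calculation. `majority` and `has_quorum` are illustrative helper names.

```shell
#!/bin/sh
# majority N: smallest number of nodes that forms a quorum in an N-node ensemble.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL UP: succeeds when UP nodes are enough for a majority of TOTAL.
has_quorum() {
  [ "$2" -ge "$(majority "$1")" ]
}

# Show how many failures each ensemble size can tolerate.
for total in 3 4 5; do
  need=$(majority "$total")
  echo "ensemble=$total majority=$need tolerated_failures=$(( total - need ))"
done
```

The loop shows that a 4-node ensemble tolerates one failure, the same as a 3-node ensemble, so an even-sized ensemble adds no fault tolerance.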

How to Reproduce the Error

On a disposable ZK-based test cluster:

  1. Reduce zookeeper.session.timeout.ms to a small value (e.g. 3000) in server.properties.
  2. Introduce latency/packet loss between a broker and the ZooKeeper ensemble (tc qdisc add dev eth0 root netem delay 4000ms).
  3. Stop the current controller so the latency-impaired broker tries to win election.
  4. Its session expires mid-election and server.log logs “Error while electing or becoming controller” followed by “Unable to become controller, resigning”.

Remove the network impairment to let election succeed normally. (Reproduce only in a throwaway environment.)
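
A minimal sketch of adding the impairment and reliably removing it afterwards follows. The interface name, the delay value, and the helper names `run`, `impair`, and `restore` are assumptions for illustration. With `DRY_RUN=1`, the default here, the commands are only printed.

```shell
#!/bin/sh
# Sketch: add a netem delay and remove it again.
# DRY_RUN=1 (the default here) only prints the commands. Set DRY_RUN=0 and run
# as root on a throwaway host to apply them. IFACE and DELAY are assumptions.
IFACE="${IFACE:-eth0}"
DELAY="${DELAY:-4000ms}"
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

impair()  { run tc qdisc add dev "$IFACE" root netem delay "$DELAY"; }
restore() { run tc qdisc del dev "$IFACE" root netem; }

# Typical use:
#   trap restore EXIT INT TERM   # remove the impairment even if interrupted
#   impair
#   sleep 60                     # window in which to stop the current controller
```

Registering `restore` with `trap` before calling `impair` means the delay is removed even if the script is interrupted.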

Diagnostic Commands

All read-only.

Check the KRaft quorum health (is there a leader and a majority of voters?):

kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --replication
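
For scripted checks, the leader can be pulled out of the `describe --status` output. This sketch assumes the output contains a `LeaderId:` line, as referenced later in this guide. The sample text is illustrative, not captured from a real cluster. Treating a missing or negative `LeaderId` as "no leader" is also an assumption, and `quorum_leader` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the LeaderId value from `describe --status` output on stdin.
quorum_leader() {
  awk '$1 == "LeaderId:" { print $2 }'
}

# Illustrative sample only; real output has more fields and may vary by version.
sample='ClusterId:              EXAMPLE-CLUSTER-ID
LeaderId:               3002
LeaderEpoch:            2
CurrentVoters:          [3000,3001,3002]'

leader=$(printf '%s\n' "$sample" | quorum_leader)
if [ -n "$leader" ] && [ "$leader" -ge 0 ]; then
  echo "leader is node $leader"
else
  echo "no leader elected"
fi

# Against a live cluster:
#   kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status | quorum_leader
```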

For ZooKeeper mode, inspect the controller and epoch znodes (legacy, read-only):

zookeeper-shell.sh localhost:2181 get /controller
zookeeper-shell.sh localhost:2181 get /controller_epoch
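
The /controller znode holds a small JSON document that includes a `brokerid` field. The sketch below extracts that id so it can be compared against the list of live brokers. The sample JSON is illustrative, and `controller_broker_id` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: extract the brokerid number from /controller znode JSON on stdin.
# Lines without a brokerid field (such as shell connection messages) are ignored.
controller_broker_id() {
  sed -n 's/.*"brokerid":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Illustrative sample of the znode payload.
printf '%s\n' '{"version":1,"brokerid":1,"timestamp":"1700000000000"}' | controller_broker_id

# Against a live ensemble:
#   zookeeper-shell.sh localhost:2181 get /controller | controller_broker_id
```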

Confirm the broker can talk to the cluster at all:

kafka-broker-api-versions.sh --bootstrap-server localhost:9092 | head -3

Pull the election failures and their root exception from the logs:

grep -E "Error while electing or becoming controller|Unable to become controller|SessionExpired|Error processing controller request" \
  /var/log/kafka/server.log | tail -30
journalctl -u kafka --since "1 hour ago" | grep -iE "controller|zookeeper|quorum"

Check for GC pauses that can expire ZK sessions:

grep -E "Pause|Full GC" /var/log/kafka/kafkaServer-gc.log | tail -20
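
To find only the pauses long enough to threaten a ZK session, the log can be filtered by duration. This sketch assumes the JVM's unified logging format (JDK 9 and later), where pause lines end in a duration such as `23.456ms`. Older JDK 8 logs use a different layout and would need a different pattern. `long_pauses` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print GC pause lines whose duration exceeds a threshold in ms.
# Assumes unified JVM logging, where pause lines end in a field like "23.456ms".
long_pauses() {
  awk -v limit="$1" '
    /Pause/ && $NF ~ /^[0-9.]+ms$/ {
      ms = $NF
      sub(/ms$/, "", ms)          # strip the unit, leaving the number
      if (ms + 0 > limit + 0) print
    }'
}

# Usage (the path and the 1000 ms threshold are assumptions):
#   long_pauses 1000 < /var/log/kafka/kafkaServer-gc.log
```

A threshold well below zookeeper.session.timeout.ms is a reasonable starting point, since a pause does not have to reach the full timeout to contribute to a missed heartbeat.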

Step-by-Step Resolution

  1. Identify the failing dependency from the stack trace. SessionExpiredException points at ZooKeeper; a quorum/leadership message points at KRaft; a metadata read exception points at local state.
  2. For ZooKeeper session expiry: verify the ZK ensemble has quorum (echo srvr | nc <zk-host> 2181 or check ZK logs), then address the cause of the expiry — GC tuning, network stability, or raising zookeeper.session.timeout.ms (commonly 18000) so transient blips don’t abort election.
  3. For a stale /controller znode: confirm via get /controller that it points at a dead broker. The ephemeral znode normally expires automatically once the ZK session times out; if it does not, the dead broker’s session is still alive — stop that broker process fully.
  4. For KRaft quorum loss: ensure a majority of controller voters are running and reachable. describe --status must show a LeaderId and CurrentVoters with a majority online. Recover offline voters before expecting election to succeed.
  5. For local corruption: if one broker repeatedly fails to load state after winning the znode, inspect its metadata/log dirs for I/O errors (dmesg, journalctl) and address the disk; the cluster can elect another broker meanwhile.
  6. Confirm recovery: a clean run logs “successfully elected as the controller. Epoch incremented to N” with no follow-on errors.
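
Step 1 can be approximated with a small classifier over the log text. The patterns are assumptions drawn from the categories above. Only `SessionExpiredException` appears in this guide's trace. The quorum and I/O patterns are illustrative guesses that may need tuning for your logs. `classify_failure` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: guess which dependency failed from log text on stdin.
# The first line that matches any pattern decides the result.
classify_failure() {
  awk '
    /SessionExpiredException/            { print "zookeeper-session"; found = 1; exit }
    /[Qq]uorum/                          { print "kraft-quorum";      found = 1; exit }
    /IOException|CorruptRecordException/ { print "local-state";       found = 1; exit }
    END { if (!found) print "unknown" }'
}

# Usage (the path is an assumption):
#   tail -200 /var/log/kafka/server.log | classify_failure
```

An `unknown` result means only that none of these patterns matched, and the full stack trace still needs to be read.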

Prevention and Best Practices

  • Set zookeeper.session.timeout.ms generously (often 18000+) so brief network or GC hiccups don’t expire sessions mid-election.
  • Run ZooKeeper / KRaft controllers on dedicated, well-provisioned nodes with fast disks and low-latency networking; election is latency-sensitive.
  • Tune broker GC to avoid long stop-the-world pauses that look like session loss; monitor kafkaServer-gc.log.
  • Deploy an odd number of ZK nodes or KRaft voters (for example 3 or 5) and keep a majority of them online at all times. During maintenance, take down only as many nodes as the quorum can afford to lose.
  • Alert on the controller election rate and on “Unable to become controller” — repeated entries mean election is stuck, not just failing over.

Related Errors

  • This is not the correct controller for this cluster: a benign stale-controller view after a successful failover, not an election failure.
  • Controller not available: election keeps failing, so no controller exists and clients see no controller at all.
  • Controller epoch X is older than Y: a stale controller acting after losing leadership.
  • org.apache.zookeeper.KeeperException$SessionExpiredException: the underlying ZK cause often seen in this trace.

Frequently Asked Questions

Is one occurrence of this error fatal? No. A single “Error while electing or becoming controller” followed by a successful election on another broker is recoverable — the broker deliberately resigned a partial election. Repeated occurrences indicate a persistent dependency problem.

The stack trace says SessionExpired. What do I fix? The ZooKeeper session for that broker expired mid-election. Stabilize ZK connectivity, reduce GC pauses, and raise zookeeper.session.timeout.ms. The expiry is the cause; the election error is the symptom.

We are on KRaft, not ZooKeeper. Why do we see election failures? In KRaft the controller is the metadata quorum leader. If a majority of voters is unreachable, no node can win or hold leadership, producing repeated election failures. Restore quorum first.

Why does the broker “resign” right after trying to become controller? Resigning is the safe response to a partially completed election. Backing out prevents two half-initialized controllers and lets another broker attempt a clean election.

How do I tell which broker should be controller? Any eligible broker can be. Use kafka-metadata-quorum.sh ... describe --status (KRaft) or get /controller (ZK) to see who currently holds it once election succeeds.
