AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Controller epoch is older than the current controller epoch' (Stale Epoch)

Fix Kafka 'controller epoch is older than the current controller epoch': understand epoch fencing, split brain after a network partition, and how to confirm the live controller.

  • #kafka
  • #troubleshooting
  • #errors
  • #controller

Exact Error Message

A broker receives a controller request carrying an epoch lower than the one it already knows, and rejects it. In server.log:

[2026-06-28 22:17:44,902] WARN [Broker id=2] Received controller request with controller epoch 18
 which is older than the current controller epoch 21. Ignoring the request from a stale controller.
 (kafka.server.ReplicaManager)
[2026-06-28 22:17:44,905] ERROR [Controller id=1] Aborting controller startup: controller epoch 18
 is older than the controller epoch 21 stored in ZooKeeper. Resigning. (kafka.controller.KafkaController)
[2026-06-28 22:17:45,118] WARN [Controller id=1] Controller has been fenced. Stale controller epoch.
 (kafka.controller.KafkaController)

KRaft fences a stale controller the same way, using the metadata leader epoch:

[2026-06-28 22:17:44,950] WARN [Controller id=1] Ignoring request with stale controller epoch 18,
 current epoch is 21 (org.apache.kafka.controller.QuorumController)

What the Error Means

Every time a new controller is elected, Kafka increments the controller epoch, a monotonically increasing counter (stored in the /controller_epoch znode in ZK mode, or tracked as the metadata leader epoch in KRaft). The controller stamps each request it sends with its epoch, and brokers validate that epoch on receipt. The rule is simple: a request stamped with an epoch lower than the broker’s current epoch is rejected, because it can only come from a controller that has already been superseded.

“Controller epoch X is older than Y / stale controller epoch” therefore means a broker that used to be the controller (epoch 18) is still trying to act as one, while the cluster has since elected a newer controller (epoch 21). The fencing mechanism intentionally ignores the stale controller so it cannot corrupt cluster state — this is split-brain protection working as designed. The classic trigger is a network partition: the old controller was isolated, a new one was elected on the majority side, and when the old controller reconnects it discovers it has been fenced. The correct outcome is that the stale controller resigns; the warnings are evidence the safety net caught it, not that data was lost.
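
The fencing rule can be sketched in a few lines of shell. This is an illustration of the comparison only, not Kafka's implementation: check_controller_epoch is a hypothetical helper, and the epochs 18 and 21 in the usage comment come from the example log above.

```shell
#!/usr/bin/env bash
# Sketch of epoch fencing: a request is accepted only if its epoch is not
# lower than the highest controller epoch the broker has already seen.
# check_controller_epoch is a hypothetical helper, not a Kafka tool.
check_controller_epoch() {
  local request_epoch="$1" current_epoch="$2"
  if [ "$request_epoch" -lt "$current_epoch" ]; then
    echo "reject: stale controller epoch $request_epoch (current $current_epoch)"
    return 1
  fi
  echo "accept: controller epoch $request_epoch"
}

# Example:
#   check_controller_epoch 18 21   # rejected, the sender was superseded
#   check_controller_epoch 21 21   # accepted, the sender is the live controller
```

Note that an equal epoch is accepted: only a strictly lower epoch identifies a superseded controller.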

Common Causes

  • Network partition / split brain. The controller broker was isolated from ZooKeeper or the KRaft quorum; a new controller was elected; the old one rejoins with a stale epoch and is fenced.
  • Long GC pause or freeze on the controller. A multi-second stop-the-world pause expires the controller’s session; election proceeds; the frozen broker wakes up still believing it is controller.
  • Delayed/replayed controller requests. In-flight requests from the previous controller arrive after a failover and are rejected by the new epoch.
  • Clock or VM pause (e.g., live migration, host stall) that suspends the controller long enough to lose leadership.
  • Manual or buggy intervention that restarts an old controller process which had cached an outdated epoch.

How to Reproduce the Error

On a throwaway 3-broker ZK cluster:

  1. Identify the controller and current epoch (get /controller and get /controller_epoch in zookeeper-shell.sh).
  2. Network-isolate the controller from ZooKeeper only: sudo iptables -A OUTPUT -p tcp --dport 2181 -j DROP on that broker.
  3. Wait past zookeeper.session.timeout.ms; the remaining brokers elect a new controller and the epoch increments.
  4. Restore connectivity: sudo iptables -D OUTPUT -p tcp --dport 2181 -j DROP.
  5. The previously isolated broker rejoins, tries to act as controller, and logs “controller epoch … is older than … Resigning / Controller has been fenced.”

Use only in an isolated environment — this deliberately partitions a broker.

Diagnostic Commands

All read-only.

Check the current epoch and leader in KRaft (LeaderEpoch is the fencing epoch):

kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

Example output (abbreviated):

LeaderId:               2
LeaderEpoch:            21
CurrentVoters:          [1,2,3]

For ZooKeeper mode, read the authoritative epoch and controller (legacy, read-only):

zookeeper-shell.sh localhost:2181 get /controller_epoch
zookeeper-shell.sh localhost:2181 get /controller

Confirm cluster reachability:

kafka-broker-api-versions.sh --bootstrap-server localhost:9092 | head -3

Find the fencing events and the two epochs involved:

grep -E "older than the current controller epoch|stale controller|has been fenced|Aborting controller startup" \
  /var/log/kafka/server.log | tail -30
journalctl -u kafka --since "2 hours ago" | grep -iE "epoch|fenced|partition|controller"
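
To pull just the two epochs out of the matching lines, a small sed filter may help. This is a sketch: extract_epochs is a hypothetical helper, and it assumes each fencing warning sits on one physical line in server.log and is worded like the broker-side example at the top of this guide.

```shell
#!/usr/bin/env bash
# extract_epochs: hypothetical helper. Reads log lines on stdin and prints
# "<stale_epoch> <current_epoch>" for every broker-side fencing warning.
# Assumption: the message is on a single line, worded as in the sample log.
extract_epochs() {
  sed -nE 's/.*controller epoch ([0-9]+) +which is older than the current controller epoch ([0-9]+).*/\1 \2/p'
}

# Example (not run here):
#   extract_epochs < /var/log/kafka/server.log | sort -u
```

A wide gap between the two numbers suggests several elections happened while the old controller was unreachable.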

Check for GC pauses or host stalls that preceded the fencing:

grep -E "Pause|Full GC" /var/log/kafka/kafkaServer-gc.log | tail -20
dmesg | grep -iE "hung task|stall|clocksource" | tail -10
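
To see whether any single pause was long enough to expire the controller's session, the GC log can be filtered by duration. The sketch below assumes JDK unified GC logging (JDK 9+), where a pause line ends in a duration such as 35.120ms; long_gc_pauses is a hypothetical helper, and the 18000 ms default is only a stand-in for your actual session timeout.

```shell
#!/usr/bin/env bash
# long_gc_pauses: hypothetical helper. Reads a GC log on stdin and prints the
# pause lines whose duration exceeds a threshold in milliseconds
# (first argument, default 18000).
# Assumption: unified GC logging, where each pause line ends in "<number>ms".
long_gc_pauses() {
  local threshold_ms="${1:-18000}"
  awk -v limit="$threshold_ms" '
    /Pause/ && $NF ~ /^[0-9.]+ms$/ {
      ms = $NF
      sub(/ms$/, "", ms)          # strip the unit, keep the number
      if (ms + 0 > limit + 0) print
    }'
}

# Example (not run here):
#   long_gc_pauses 18000 < /var/log/kafka/kafkaServer-gc.log
```

Any line printed here whose timestamp falls just before the fencing event is a strong candidate for the root cause.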

Step-by-Step Resolution

  1. Confirm the real controller and epoch with kafka-metadata-quorum.sh ... describe --status (KRaft) or get /controller + get /controller_epoch (ZK). The higher epoch is authoritative.
  2. Verify the stale controller resigned. The fenced broker’s log should show “Resigning” / “has been fenced”. Once it resigns, it rejoins as an ordinary broker — no action needed for the cluster’s correctness; fencing already protected state.
  3. Identify why it was fenced. Correlate the timestamp with network logs, dmesg, and GC logs. A partition, a long GC pause, or a VM stall is the usual root cause.
  4. Fix the underlying instability: repair the network path to ZooKeeper/quorum, tune GC to eliminate multi-second pauses, or address host-level stalls (noisy-neighbor VM, oversubscription).
  5. If the fenced broker keeps trying to act as controller (it didn’t fully resign), restart that broker process cleanly so it discovers the current epoch on startup.
  6. Validate: after recovery only one broker holds the controller role, the epoch is stable, and the fencing warnings stop. Check ISR and under-replicated partitions recovered to zero.
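
The validation in step 6 can be partly scripted for KRaft by sampling the leader epoch twice and comparing the values. This is a sketch under assumptions: leader_epoch and epoch_is_stable are hypothetical helpers, the 60-second gap is arbitrary, and the parsing relies on the LeaderEpoch line shown in the diagnostic output above.

```shell
#!/usr/bin/env bash
# leader_epoch: hypothetical helper. Reads `describe --status` output on stdin
# and prints the value of the LeaderEpoch field.
leader_epoch() {
  awk '$1 == "LeaderEpoch:" { print $2 }'
}

# epoch_is_stable: succeeds when two non-empty epoch samples are identical.
epoch_is_stable() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example against a live cluster (not run here):
#   first=$(kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status | leader_epoch)
#   sleep 60
#   second=$(kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status | leader_epoch)
#   epoch_is_stable "$first" "$second" && echo "epoch stable at $second"
```

For the replication check, kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions should print nothing once every partition is fully replicated.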

Prevention and Best Practices

  • Eliminate long GC pauses on controllers (G1/ZGC tuning, adequate heap); a pause longer than the session timeout is a common way for a healthy controller to lose leadership without noticing.
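  • Keep zookeeper.session.timeout.ms high enough that brief network blips don’t cause needless failovers and stale epochs. The default has been 18000 ms since Kafka 2.5 (6000 ms before that), so check for overrides that lower it.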
  • Run controllers on stable, low-latency networks; avoid placing controller traffic across flaky links or oversubscribed hosts.
  • Trust the fencing: never try to “force” an old controller back into the role. The epoch check exists precisely to prevent two controllers from writing conflicting state.
  • Alert on controller epoch increments and on “stale controller epoch” / “has been fenced” warnings — frequent fencing means an unstable controller layer.
  • For incident triage, the free incident assistant can interpret the two epochs in the log and point at the likely partition or pause.
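
The alerting advice above can be approximated with a simple log count until proper monitoring is in place. This is a minimal sketch: count_fencing_events and fencing_alert are hypothetical helpers, the patterns are the messages quoted in this guide, and the threshold and log window are yours to choose.

```shell
#!/usr/bin/env bash
# count_fencing_events: hypothetical helper. Counts log lines on stdin that
# match the fencing messages quoted in this guide.
count_fencing_events() {
  # grep -c exits non-zero when the count is 0, so mask the status.
  grep -cE 'older than the current controller epoch|[Ss]tale controller epoch|has been fenced' || true
}

# fencing_alert: prints ALERT and returns 1 when the number of fencing lines
# on stdin exceeds the threshold (first argument, default 0).
fencing_alert() {
  local threshold="${1:-0}" count
  count="$(count_fencing_events)"
  if [ "$count" -gt "$threshold" ]; then
    echo "ALERT: $count fencing events"
    return 1
  fi
  echo "OK: $count fencing events"
}

# Example (not run here):
#   tail -n 5000 /var/log/kafka/server.log | fencing_alert 0
```

The non-zero return code makes the function easy to wire into cron or a monitoring check.
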

Related Errors

  • This is not the correct controller for this cluster: a stale view on a client after normal failover, without an epoch conflict.
  • Error while electing or becoming controller on broker N: election itself failing rather than a post-election fencing.
  • Controller not available: no current controller at all (quorum loss).
  • org.apache.zookeeper.KeeperException / session expiry: frequent precursor to epoch fencing.

Frequently Asked Questions

Did I lose data because of split brain? Almost certainly not. The whole point of the controller epoch is to fence the stale controller before it can apply conflicting changes. The warnings show the protection worked. Verify ISR and under-replicated partitions returned to zero to be sure.

Which epoch is correct, 18 or 21? The higher one (21). Epoch increases monotonically with each election. Any request carrying a lower epoch is from a superseded controller and is rejected.

Why did a healthy controller suddenly get fenced? It lost leadership while it was unaware — typically a network partition from ZooKeeper/quorum or a stop-the-world GC pause longer than the session timeout. When it returned, the cluster had already moved on.

Do I need to restart the fenced broker? Only if it fails to resign on its own. Normally it self-resigns and rejoins as a regular broker. A clean restart forces it to re-read the current epoch if it is stuck.

How is this different from KRaft? The mechanism is the same; KRaft uses the metadata quorum leader epoch instead of the /controller_epoch znode. A node that lost quorum leadership is fenced by the higher leader epoch identically.
