Skip to content
DevOps AI ToolKit
Newsletter
All guides
AI for Kafka By James Joyner IV · · 9 min read

Kafka Error Guide: 'Controller heartbeat timeout' Broker Fenced in KRaft

Fix Kafka KRaft 'controller heartbeat timeout / broker fenced': tune broker.heartbeat.interval.ms and broker.session.timeout.ms, and diagnose missed broker heartbeats.

  • #kafka
  • #troubleshooting
  • #errors
  • #controller

Exact Error Message

In a KRaft cluster a broker stops sending timely heartbeats to the active controller and gets fenced. On the broker in server.log:

[2026-06-28 03:12:55,604] WARN [BrokerLifecycleManager id=5] Broker 5 sent a heartbeat to the
 controller but did not receive a response in time. (kafka.server.BrokerLifecycleManager)
[2026-06-28 03:13:09,881] ERROR [BrokerLifecycleManager id=5] Heartbeat to controller timed out;
 the broker has been fenced for not sending heartbeats within broker.session.timeout.ms=9000
 (kafka.server.BrokerLifecycleManager)

On the active controller:

[2026-06-28 03:13:09,902] INFO [QuorumController id=1] Fencing broker 5 because its session has timed out;
 last heartbeat was 9213 ms ago (org.apache.kafka.controller.BrokerHeartbeatManager)
[2026-06-28 03:13:09,905] INFO [QuorumController id=1] Removing broker 5 from ISR for 142 partitions
 (org.apache.kafka.controller.ReplicationControlManager)

What the Error Means

KRaft replaces ZooKeeper-based broker registration with a heartbeat protocol: every broker periodically sends a heartbeat to the active (quorum) controller every broker.heartbeat.interval.ms (default 2000 ms). The controller tracks the last heartbeat per broker. If a broker fails to heartbeat within broker.session.timeout.ms (default 9000 ms), the controller declares its session expired and fences it — marking it offline, removing it from the ISR of every partition it hosts, and triggering leader election for partitions it led.

“Controller heartbeat timeout” / “broker fenced for not sending heartbeats” therefore means the controller stopped hearing from a broker in time. The broker may still be alive and serving local sockets, but from the cluster’s control plane it is considered dead. This is the KRaft equivalent of a ZooKeeper session expiry. It is a liveness/connectivity problem, not a controller bug: the controller is doing exactly what it should — evicting a broker it can no longer confirm is healthy, to keep replication safe. The fallout is real, though: under-replicated partitions, leadership churn, and a fenced broker that must re-register and catch up. Fixing it means finding why the heartbeats stopped (GC pause, network, overload, or too-tight timeouts) rather than touching the controller.

Common Causes

  • Long GC pauses on the broker that suspend the heartbeat thread past broker.session.timeout.ms.
  • Network latency or packet loss between the broker and the active controller’s listener.
  • Broker overload (CPU saturation, request-handler thread exhaustion, disk I/O stalls) delaying heartbeats.
  • Session timeout set too low relative to real-world jitter (broker.session.timeout.ms too small or broker.heartbeat.interval.ms too large).
  • Controller-side overload or quorum instability so heartbeat responses are slow.
  • Host-level stalls — VM live migration, hypervisor pause, or clock jumps that freeze the broker briefly.

How to Reproduce the Error

On a disposable KRaft cluster:

  1. Lower broker.session.timeout.ms to its floor and shrink the margin (test only): e.g. broker.session.timeout.ms=4000, broker.heartbeat.interval.ms=2000.
  2. Inject a stall on one broker that exceeds the timeout — pause the JVM with kill -STOP <pid> for 6 seconds, then kill -CONT <pid>.
  3. The controller logs “Fencing broker N because its session has timed out” and removes it from ISRs; the broker logs “has been fenced for not sending heartbeats”.
  4. After kill -CONT, the broker re-registers and rejoins, catching up replicas.

Use only in a throwaway environment — this deliberately stalls a broker.

Diagnostic Commands

All read-only.

Confirm controller/quorum health (a slow controller can delay heartbeats):

kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --replication

See which partitions lost the fenced broker from their ISR:

kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

Confirm the broker’s effective timeout settings and reachability:

kafka-broker-api-versions.sh --bootstrap-server localhost:9092 | head -3
grep -E "broker.session.timeout.ms|broker.heartbeat.interval.ms" /etc/kafka/server.properties

Pull the heartbeat/fencing events:

grep -E "did not receive a response in time|has been fenced|sent a heartbeat|Fencing broker" \
  /var/log/kafka/server.log | tail -40
journalctl -u kafka --since "1 hour ago" | grep -iE "heartbeat|fenc|session.*time"

Check for the usual culprits — GC pauses and host stalls:

grep -E "Pause|Full GC|Total time for which" /var/log/kafka/kafkaServer-gc.log | tail -20
dmesg | grep -iE "hung task|stall|clocksource|soft lockup" | tail -10

Step-by-Step Resolution

  1. Confirm the broker actually missed heartbeats (logs show “fenced for not sending heartbeats”) and identify the timing — how long the gap was vs broker.session.timeout.ms.
  2. Find the cause of the gap. Correlate the fencing timestamp with GC logs (a pause longer than the session timeout), dmesg (host stalls), and network metrics (latency/loss to the controller). The GC log is the single most common smoking gun.
  3. Address the root cause: tune broker GC to eliminate multi-second pauses, relieve CPU/disk overload, or repair the network path to the controller listener.
  4. Right-size the timeouts if jitter is legitimate. Increasing broker.session.timeout.ms (e.g. to 18000) and/or lowering broker.heartbeat.interval.ms gives more margin — but only after fixing genuine stalls; loosening timeouts to mask real problems delays failure detection.
  5. Let the broker rejoin. Once heartbeats resume, the controller un-fences it and it re-enters ISRs. Watch under-replicated partitions return to zero.
  6. Validate: --under-replicated-partitions is empty, the broker is back in describe --replication, and no further fencing appears in the logs.

Prevention and Best Practices

  • Eliminate long GC pauses on brokers (G1/ZGC, right-sized heap, off-heap page cache); a pause exceeding broker.session.timeout.ms is the top cause of heartbeat fencing.
  • Keep broker.heartbeat.interval.ms well below broker.session.timeout.ms (default 2000 vs 9000 gives ~4 missed beats of margin); raise the session timeout for high-jitter environments rather than shrinking the interval.
  • Run brokers and controllers on low-latency networks; monitor RTT to the controller listener and alert on spikes.
  • Avoid host-level stalls — pin away from noisy neighbors, beware VM live migration on Kafka hosts, and keep NTP/clocksource healthy.
  • Alert on broker fencing events and on under-replicated partitions; repeated fencing of the same broker points to a host or GC problem.
  • For fast triage, the free incident assistant can correlate the fencing timestamp with GC pauses and suggest the likely cause.
  • Controller not available — no controller to heartbeat to at all (quorum loss), versus a reachable controller that fenced the broker.
  • This is not the correct controller for this cluster — a broker heartbeating the wrong/old controller after failover.
  • org.apache.zookeeper.KeeperException$SessionExpiredException — the ZooKeeper-mode analog of session expiry/fencing.
  • Under-replicated partitions alerts — the direct consequence of a fenced broker leaving ISRs.

Frequently Asked Questions

Is the controller at fault when my broker gets fenced? Usually not. The controller fences a broker because it stopped receiving heartbeats in time. The fix is almost always on the broker side (GC, overload, network) or in the timeout sizing — not the controller.

What’s the difference between broker.heartbeat.interval.ms and broker.session.timeout.ms? The interval (default 2000 ms) is how often the broker heartbeats. The session timeout (default 9000 ms) is how long the controller waits before fencing a silent broker. The session timeout must be comfortably larger than the interval to tolerate a few missed beats.

My broker was alive the whole time — why was it fenced? From the control plane’s view it went silent. A GC pause, host stall, or network blip longer than broker.session.timeout.ms makes a live broker look dead. Check the GC log first.

Does fencing cause data loss? No, but it causes under-replication and leader elections for the broker’s partitions. As long as other in-sync replicas exist, data is safe; the broker re-replicates after rejoining.

Should I just raise broker.session.timeout.ms to stop the fencing? Only after ruling out real stalls. A larger timeout adds jitter tolerance but also slows detection of genuinely dead brokers. Fix GC/network/host issues first, then size the timeout to realistic worst-case latency.

Free download · 368-page PDF

Download the Free 500-Prompt DevOps AI Toolkit

500 battle-tested, copy-paste AI prompts engineered by a senior systems engineer — every one with fill-in placeholders and safety/back-out notes. Drop your email and it's yours.

  • 500 prompts: Linux · Kubernetes · Terraform · OpenStack · GitLab · Docker · Monitoring · Incident Response
  • Instant PDF download — yours free, forever
  • Plus one practical AI-workflow email a week (no spam)

Single opt-in · unsubscribe anytime · no spam.