DevOps AI ToolKit
AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Unable to fetch metadata log' Follower Far Behind

Fix KRaft 'Unable to fetch metadata log' / 'Unable to catch up to metadata log': diagnose a follower controller or broker lagging the __cluster_metadata leader.

  • #kafka
  • #troubleshooting
  • #errors
  • #kraft

Exact Error Message

Any KRaft node other than the active controller (another controller acting as a follower, or a broker consuming metadata as an observer) replicates the __cluster_metadata log from the leader. When it cannot keep up, or cannot fetch at all, its offset stalls far behind the leader’s high watermark:

[2026-06-29 11:18:43,771] WARN [RaftManager id=3] Unable to fetch metadata log from leader 1 at offset 9,842,556: request timed out after 30000 ms (org.apache.kafka.raft.KafkaRaftClient)
[2026-06-29 11:18:44,902] WARN [BrokerMetadataPublisher id=21] Unable to catch up to metadata log; current offset 9,801,233 trails leader high watermark 9,842,560 by 41,327 records (kafka.server.metadata.BrokerMetadataPublisher)
[2026-06-29 11:18:46,015] INFO [RaftManager id=3] Follower 3 fetch returned NOT_LEADER_OR_FOLLOWER; resetting fetch position (org.apache.kafka.raft.KafkaRaftClient)

The follower’s offset (9,801,233) stays well below the leader high watermark (9,842,560) and the gap does not shrink.
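To make the gap concrete, the arithmetic behind the excerpt can be sketched in shell. This is only an illustration: metadata_lag is a hypothetical helper (not a Kafka tool), and it assumes the offsets may carry the thousands separators shown in the excerpt above.

```shell
# metadata_lag LEADER_HIGH_WATERMARK FOLLOWER_OFFSET
# Hypothetical helper: prints how many records the follower trails the leader by.
# Commas are stripped so values copied from the excerpt can be used directly.
metadata_lag() {
  lag_hw=$(printf '%s' "$1" | tr -d ',')
  lag_follower=$(printf '%s' "$2" | tr -d ',')
  echo $((lag_hw - lag_follower))
}

metadata_lag 9,842,560 9,801,233   # prints 41327, matching the WARN line
```

Running the same subtraction on two readings taken a minute apart is a quick way to tell whether the gap is shrinking or holding.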

What the Error Means

In KRaft, the active controller is the Raft leader for __cluster_metadata; every other controller is a voting follower and every broker is a non-voting observer, and both fetch new records from the leader and apply them (this guide uses “follower” for either). “Unable to fetch metadata log” means the follower’s fetch requests to the leader are failing or timing out. “Unable to catch up to metadata log” means fetches are at least partially working, but the leader is appending records faster than the follower can fetch and apply them, so its applied offset trails the leader’s high watermark by a growing margin.

A follower that is too far behind cannot serve up-to-date metadata. A lagging controller cannot help advance the high watermark and is a poor failover candidate, which weakens quorum resilience; a lagging broker acts on stale partition leadership and can misroute client requests. If a follower lags past the point where the leader has already snapshotted and truncated the log, it must fetch a snapshot to recover.
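The snapshot rule can be written as a one-line comparison. The sketch below is a simplification under stated assumptions: needs_snapshot and its two arguments are hypothetical names, and it treats the leader's log start offset (the oldest record still retained after truncation) as the only deciding factor.

```shell
# needs_snapshot FOLLOWER_FETCH_OFFSET LEADER_LOG_START_OFFSET
# Hypothetical helper; both arguments are plain integers.
needs_snapshot() {
  if [ "$1" -lt "$2" ]; then
    echo "snapshot"   # the follower's position was truncated away on the leader
  else
    echo "log"        # records from this offset are still retained
  fi
}
```

For example, `needs_snapshot 100 500` prints `snapshot`, while a follower positioned at or after the leader's log start offset can resume from log records.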

Common Causes

  • Network problems to the leader — packet loss, saturated links, or intermittent connectivity on the controller listener cause fetch timeouts.
  • Slow disk on the follower — applying metadata records or writing the local replica is I/O-bound; a saturated or failing disk makes apply lag grow.
  • The follower was offline for a long time and the leader has since snapshotted and truncated the log past the follower’s position, forcing a snapshot fetch that is large or failing.
  • Leadership changed mid-fetch (NOT_LEADER_OR_FOLLOWER), so the follower keeps resetting its fetch position against a stale leader.
  • Undersized fetch settings or overloaded leader, where a high metadata change rate (massive topic/partition counts) outpaces the follower.
  • Clock skew or GC pauses on the follower causing repeated fetch timeouts.

How to Reproduce the Error

Take a follower controller offline long enough for the leader to advance and snapshot, then bring it back while throttling its network to the leader:

# Lab only: stop a follower controller for a while, generate metadata churn
sudo systemctl stop kafka   # on follower controller 3
# ...create/delete many topics against the active controller...

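# Degrade the follower's network, then restart (netem on eth0 affects all traffic on that interface)
sudo tc qdisc add dev eth0 root netem delay 800ms loss 5%
sudo systemctl start kafka

kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --replication

# Lab cleanup: remove the netem rule once you have seen the error
sudo tc qdisc del dev eth0 root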

The restarted follower logs “Unable to fetch metadata log” / “Unable to catch up” and its LogEndOffset in the replication view stays far below the leader.

Diagnostic Commands

All read-only.

# How far behind is each follower? Watch LogEndOffset vs leader, and Lag
kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --replication

# Leader/high-watermark/epoch overview
kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status

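# Inspect the newest local metadata segment on the lagging node (read-only decode).
# --files takes a comma-separated list, so pass one segment rather than a multi-file glob.
kafka-dump-log.sh --cluster-metadata-decoder \
  --files "$(ls /var/lib/kafka/__cluster_metadata-0/*.log | tail -1)" | tail -40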

# Is the follower fetching at all? Look for timeouts / NOT_LEADER
grep -iE 'unable to fetch|catch up|fetch.*timed out|NOT_LEADER_OR_FOLLOWER|snapshot' \
  /var/log/kafka/controller.log | tail -50

# Disk and GC pressure on the follower
iostat -x 1 3
journalctl -u kafka --since "20 min ago" | grep -iwE 'gc|pauses?|paused|disk|i/o|ioexception'

In describe --replication, compare each row’s LogEndOffset and Lag to the leader. A single follower with large, non-shrinking Lag is the one failing to catch up.
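The comparison can be automated with a small filter. This is a sketch, not part of Kafka: lagging_nodes is a hypothetical helper, and it assumes the output has a header row containing NodeId and Lag columns and that Status is the last field on each row. Columns are located by header name because some Kafka versions add extra columns.

```shell
# lagging_nodes THRESHOLD
# Hypothetical helper: reads `describe --replication` output on stdin and
# prints "NODE_ID LAG STATUS" for every row whose Lag exceeds THRESHOLD.
lagging_nodes() {
  awk -v max="$1" '
    # Header row: remember the position of each named column.
    $1 == "NodeId" {
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Data rows: report nodes whose Lag is above the threshold.
    ("Lag" in col) && $(col["Lag"]) + 0 > max + 0 {
      printf "%s %s %s\n", $(col["NodeId"]), $(col["Lag"]), $NF
    }
  '
}
```

A possible invocation is `kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --replication | lagging_nodes 1000`, where 1000 is an arbitrary example threshold; an empty result means no node exceeds it.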

Step-by-Step Resolution

  1. Confirm which node lags and by how much via describe --replication. A Lag that holds steady or grows confirms it is not catching up.
  2. Test connectivity from the follower to the leader on the controller listener. Fix packet loss, MTU, or firewall issues degrading fetches.
  3. Check follower disk and GC. If iostat shows the disk saturated or GC pauses are long, address I/O (faster disk, less contention) or heap/GC tuning so applies keep pace.
  4. If the follower is past the log’s truncation point, let it fetch a snapshot. KRaft will transfer the latest snapshot automatically; ensure the follower has disk space and a clean network to receive it. Do not delete the local metadata dir blindly.
  5. If leadership is flapping (NOT_LEADER_OR_FOLLOWER repeatedly), stabilize the leader first (see leader-election and quorum guides) so the follower has a steady target.
  6. Restart the follower cleanly after the underlying cause is fixed and watch Lag shrink to near zero, then confirm it is back in-sync.
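For the final check, "watch Lag shrink" can be turned into a simple classification of successive Lag readings. The sketch below is an assumption-laden illustration: lag_trend is a hypothetical helper, the samples are passed oldest first, and NEAR_ZERO (default 10 records) is an arbitrary cutoff for "near zero" that you would tune for your cluster.

```shell
# lag_trend SAMPLE [SAMPLE...]
# Hypothetical helper: classifies a series of Lag readings (oldest first).
lag_trend() {
  [ "$#" -gt 0 ] || return 2          # need at least one sample
  trend_first=$1
  trend_last=$1
  for trend_sample in "$@"; do
    trend_last=$trend_sample          # ends up holding the newest reading
  done
  if [ "$trend_last" -le "${NEAR_ZERO:-10}" ]; then
    echo "caught-up"                  # newest reading is near zero
  elif [ "$trend_last" -lt "$trend_first" ]; then
    echo "shrinking"                  # still behind, but the gap is closing
  else
    echo "stalled"                    # gap is holding steady or growing
  fi
}
```

For example, `lag_trend 41327 30000 12000` prints `shrinking`, while `lag_trend 41327 41500 42010` prints `stalled`, which points back to steps 2 through 5.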

Prevention and Best Practices

  • Put controller metadata on fast, dedicated storage (low-latency SSD/NVMe) so apply never becomes the bottleneck.
  • Keep controller-to-controller and broker-to-controller networks low-latency and lossless; metadata replication is sensitive to both.
  • Avoid leaving controllers offline for long; the longer the outage, the larger the catch-up (and possible snapshot) on return.
  • Monitor follower Lag from describe --replication and alert when any follower trails persistently.
  • Keep metadata change rate sane — extreme topic/partition counts inflate the metadata log and make catch-up harder.
  • Tune GC and heap on controllers so stop-the-world pauses do not trigger fetch timeouts.

Related Errors

  • Snapshot generation failed: if the leader cannot snapshot, followers that need a snapshot to catch up are stuck.
  • Raft leader election failed: leadership churn that keeps resetting follower fetch positions.
  • Metadata loader failed: a broker that fetches metadata fine but cannot apply it.

See more in the Kafka guides.

Frequently Asked Questions

What is the difference between “unable to fetch” and “unable to catch up”? “Unable to fetch” means the fetch request itself is failing/timing out. “Unable to catch up” means fetches succeed but the follower applies slower than the leader produces, so lag grows.

When does a follower need a snapshot instead of log records? When its position is older than the oldest retained log offset — the leader has snapshotted and truncated past it, so it must load a snapshot then resume from the log.

Does a lagging broker serve stale data? It can serve stale metadata (partition leadership, configs) until it catches up, which can misroute client requests. Catching it up resolves this.

Is deleting the local __cluster_metadata-0 dir a fix? Only as a last resort and per a tested runbook — it forces a full re-replication and can be destructive if done on the wrong node. Fix network/disk first.

How much lag is acceptable? Near zero in steady state. Brief lag during heavy metadata churn is fine; persistent, growing lag is the problem.
