Kafka Error Guide: 'Error for partition topic-0 at offset 12345' ReplicaFetcherThread Failure
Decode ReplicaFetcherThread errors when a Kafka follower can't fetch from the leader: NOT_LEADER, OFFSET_OUT_OF_RANGE, fetch size, and TLS causes.
- #kafka
- #troubleshooting
- #errors
- #replication
A ReplicaFetcherThread error means a follower broker tried to replicate a partition from its leader and failed. Unlike a slow follower that merely lags out of the ISR, this is an outright fetch failure: the follower cannot make progress at all until the underlying problem is fixed. The error text varies, but every variant originates from kafka.server.ReplicaFetcherThread and points at the leader–follower fetch path.
Exact Error Message
On the follower broker (here, broker 1 fetching from leader broker 2), server.log shows:
[2026-06-29 16:41:07,233] ERROR [ReplicaFetcherThread-0-2] Error for partition topic-0 at offset 12345 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.NotLeaderOrFollowerException: This server is not the leader for that topic-partition.
[2026-06-29 16:41:07,233] WARN [ReplicaFetcherThread-0-2] Error processing fetch request for partition topic-0 (kafka.server.ReplicaFetcherThread)
[2026-06-29 16:41:12,560] WARN [ReplicaFetcherThread-0-2] Replica fetch failed for partition topic-0; unable to fetch partition (kafka.server.ReplicaFetcherThread)
The thread name encodes the topology: ReplicaFetcherThread-0-2 is fetcher 0 pulling from broker id 2. This broker is therefore the follower, and broker 2 is the leader it either cannot reach or cannot get a clean response from. The exception class on the line after the ERROR line is the most important detail.
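As a minimal sketch of that naming rule, the helper below splits a thread name into its two numeric parts. The function name `parse_fetcher_thread` and the output format are illustrative choices, and it assumes the plain `ReplicaFetcherThread-<fetcher>-<broker>` form shown in the log above.

```shell
# Split a fetcher thread name into fetcher index and source broker id.
# Hypothetical helper; prints nothing when the name does not match.
parse_fetcher_thread() {
  printf '%s\n' "$1" |
    sed -n 's/^ReplicaFetcherThread-\([0-9][0-9]*\)-\([0-9][0-9]*\)$/fetcher=\1 source_broker=\2/p'
}

parse_fetcher_thread 'ReplicaFetcherThread-0-2'
```

For the name in the log above, this prints `fetcher=0 source_broker=2`.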
What the Error Means
Followers replicate by sending fetch requests to the partition leader, using the same Fetch API as consumers. When a fetch returns an error code, or the connection fails, ReplicaFetcherThread logs it and backs off (replica.fetch.backoff.ms) before retrying.
The error class tells you the category:
- NotLeaderOrFollowerException (NOT_LEADER_OR_FOLLOWER): the follower’s metadata is stale; broker 2 is no longer the leader for that partition.
- UnknownTopicOrPartitionException: broker 2 does not have that partition (deleted, or metadata skew).
- OffsetOutOfRangeException (OFFSET_OUT_OF_RANGE): the follower asked for an offset the leader no longer has or has not yet reached, triggering log truncation.
- RecordTooLargeException: a record on the leader exceeds the follower’s replica.fetch.max.bytes, so the fetch can never return it.
- Connection/auth failures: TLS handshake or SASL mismatch on the inter-broker listener, or the leader is simply unreachable.
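The mapping from exception class to category can be sketched as a small lookup. The function name and the category labels (`stale-metadata`, `fetch-size`, and so on) are invented here for illustration, and the TLS and network class names are assumptions about what typically appears in such stack traces rather than an exhaustive list.

```shell
# Map the exception line that follows the ERROR line to a rough category.
# Category labels are this sketch's own; they are not Kafka terminology.
classify_fetch_error() {
  case "$1" in
    *NotLeaderOrFollowerException*|*UnknownTopicOrPartitionException*)
      echo "stale-metadata" ;;
    *OffsetOutOfRangeException*)
      echo "offset-out-of-range" ;;
    *RecordTooLargeException*)
      echo "fetch-size" ;;
    *SslAuthenticationException*|*SaslAuthenticationException*|*SSLHandshakeException*)
      echo "auth-or-tls" ;;
    *ConnectException*|*SocketTimeoutException*)
      echo "network" ;;
    *)
      echo "unknown" ;;
  esac
}

classify_fetch_error 'org.apache.kafka.common.errors.NotLeaderOrFollowerException: This server is not the leader for that topic-partition.'
```

An `unknown` result is a prompt to read the full stack trace rather than a diagnosis in itself.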
Common Causes
- Leader unreachable / network. The follower cannot open or hold a connection to the leader’s inter-broker listener (firewall, DNS, leader down, NIC issue).
- Stale leadership (NOT_LEADER_OR_FOLLOWER). Leadership moved during a controller change or reassignment, and the follower is briefly fetching from the old leader. Usually self-healing; persistent occurrences indicate a metadata problem.
- OFFSET_OUT_OF_RANGE. The follower’s log diverged or fell so far behind that its requested offset is below the leader’s log start offset (after retention deletion) or above its log end offset. The follower must truncate to realign.
- Inter-broker auth / TLS mismatch. A rotated certificate, wrong truststore, or mismatched SASL mechanism on the listener named by inter.broker.listener.name breaks the fetch connection.
- Fetch size too small (RecordTooLargeException). replica.fetch.max.bytes on the follower is smaller than message.max.bytes on the leader, so an oversized record cannot be replicated and the partition stalls. Since Kafka 0.10.1 (KIP-74) the fetch limit is soft and the first oversized batch is still returned so the follower can make progress, so a hard stall from this cause is mainly seen on older brokers.
- Leader log dir offline. With JBOD, the leader’s disk holding the partition went offline; the leader cannot serve the fetch.
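The two OFFSET_OUT_OF_RANGE cases in the list above reduce to a comparison against the leader's log boundaries. The sketch below only illustrates that comparison; `offset_range_check` and its labels are hypothetical, the offsets are passed in by hand, and it does not reproduce Kafka's internal truncation logic.

```shell
# Args: follower fetch offset, leader log start offset, leader log end offset.
# Reports which side of the leader's log the requested offset fell off.
offset_range_check() {
  fetch=$1; log_start=$2; log_end=$3
  if [ "$fetch" -lt "$log_start" ]; then
    echo "below-log-start"   # retention already deleted what the follower wants
  elif [ "$fetch" -gt "$log_end" ]; then
    echo "above-log-end"     # follower is ahead of the leader and must truncate
  else
    echo "in-range"
  fi
}

# Example values: the 12345 from the log above against a made-up leader range.
offset_range_check 12345 20000 50000
```

With these example numbers the result is `below-log-start`, the retention-driven case.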
How to Reproduce the Error
- NOT_LEADER: trigger a controlled leader change (reassignment or controlled shutdown) and tail the follower log during the transition window.
- OFFSET_OUT_OF_RANGE: stop a follower, let leader retention delete segments past the follower’s offset, then start the follower; it requests a now-deleted offset.
- RecordTooLargeException: set message.max.bytes high on the leader, produce a large record, and set replica.fetch.max.bytes smaller on the follower.
- TLS mismatch: point one broker at a truststore missing the CA used by the others and restart.
Diagnostic Commands
Confirm who the current leader and ISR actually are:
kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --topic topic
Topic: topic Partition: 0 Leader: 2 Replicas: 1,2,3 Isr: 2,3
Here broker 1 is in Replicas but missing from Isr — consistent with a follower that cannot fetch. Pull the fetcher errors from the follower:
journalctl -u kafka --since "20 min ago" \
| grep "ReplicaFetcherThread"
ERROR [ReplicaFetcherThread-0-2] Error for partition topic-0 at offset 12345
... NotLeaderOrFollowerException: This server is not the leader ...
The exception class on the line after the ERROR is the root-cause signal. Check connectivity to the leader’s inter-broker listener (leader is broker 2):
ss -tnp | grep ':9093'
ESTAB 0 0 10.0.0.11:51234 10.0.0.12:9093 users:(("java",pid=4412,fd=212))
No established connection to the leader’s inter-broker port means a network or TLS problem. Inspect the listener and security config:
grep -E "listeners|advertised.listeners|inter.broker|security" \
/etc/kafka/server.properties
Confirm whether a log dir is offline on the leader and check each replica's size and offset lag:
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
--describe --broker-list 2
Compare replica contents directly:
kafka-replica-verification.sh --broker-list localhost:9092 \
--topic-white-list 'topic'
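The first check in this section, spotting replicas that are assigned but missing from the ISR, can be scripted. This is a rough sketch that assumes the per-partition line format shown in the kafka-topics.sh output above; `out_of_isr` is a hypothetical helper name.

```shell
# Reads `kafka-topics.sh --describe` output on stdin and prints, for each
# partition, the replica ids that are absent from the Isr column.
out_of_isr() {
  awk '
    /Partition:/ {
      part = ""; replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      n = split(replicas, r, ",")
      m = split(isr, s, ",")
      missing = ""
      for (a = 1; a <= n; a++) {
        found = 0
        for (b = 1; b <= m; b++) if (r[a] == s[b]) found = 1
        if (!found) missing = (missing == "" ? r[a] : missing "," r[a])
      }
      if (missing != "") print "partition " part " missing from ISR: " missing
    }'
}

printf '%s\n' 'Topic: topic Partition: 0 Leader: 2 Replicas: 1,2,3 Isr: 2,3' | out_of_isr
```

For the sample line this prints `partition 0 missing from ISR: 1`. In practice the input would be piped straight from `kafka-topics.sh --describe`.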
Step-by-Step Resolution
- Read the exception class first. The fix differs entirely depending on whether it is NOT_LEADER_OR_FOLLOWER, OFFSET_OUT_OF_RANGE, RecordTooLargeException, or a connection error. Do not skip this.
- NOT_LEADER / UNKNOWN_TOPIC (transient). If kafka-topics.sh --describe shows a healthy leader and the error stopped, this was a metadata transition and needs no action. If it persists, the follower has stale metadata; check controller health and the broker’s connectivity to the controller.
- Connection / TLS / auth. If ss shows no connection to the leader’s inter-broker port, fix the path. Verify that advertised.listeners resolves to the right address from the follower, that the firewall allows the inter-broker port, and that the truststore/keystore and SASL mechanism on the listener named by inter.broker.listener.name match across brokers. In server.properties:
  inter.broker.listener.name=INTERNAL
  advertised.listeners=INTERNAL://broker2.internal:9093,EXTERNAL://broker2.example.com:9092
- RecordTooLargeException. Align the size limits so the follower can pull the largest record the leader accepts. The follower’s fetch ceiling must be at least the leader’s message ceiling:
  message.max.bytes=10485760
  replica.fetch.max.bytes=10485760
- OFFSET_OUT_OF_RANGE. Kafka normally handles this automatically: the follower truncates to the leader’s log start or end offset and resumes. If it loops, the follower’s local log is corrupt or has diverged. The standard recovery is to let it truncate or, in severe cases, to stop the follower broker and remove that partition’s local dir so it re-replicates from the leader on restart. Verify retention is not so short that followers cannot keep up.
- Leader log dir offline. If kafka-log-dirs.sh shows an offline dir on the leader, the underlying disk failed. Leadership should move to an in-sync replica; restore or replace the disk, then let replication backfill.
After the fix, re-run kafka-topics.sh --describe and confirm broker 1 rejoins the ISR.
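For the size-limit step above, a quick sanity check of one broker's properties file might look like the following. This is a sketch under assumptions: `prop` and `check_fetch_ceiling` are hypothetical helpers, both keys are expected to be set explicitly in the file, and the check compares the two values within a single file on the premise that every broker carries the same settings.

```shell
# Print the value of a key from a Java properties file (last occurrence wins).
prop() {
  key=$(printf '%s' "$1" | sed 's/\./\\./g')   # escape dots for the regex
  sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Compare the follower fetch ceiling with the message ceiling in one file.
check_fetch_ceiling() {
  file=$1
  msg_max=$(prop message.max.bytes "$file")
  fetch_max=$(prop replica.fetch.max.bytes "$file")
  if [ -z "$msg_max" ] || [ -z "$fetch_max" ]; then
    echo "UNSET one or both keys missing in $file"
    return 2
  fi
  if [ "$fetch_max" -lt "$msg_max" ]; then
    echo "MISMATCH replica.fetch.max.bytes=$fetch_max < message.max.bytes=$msg_max"
    return 1
  fi
  echo "OK replica.fetch.max.bytes=$fetch_max >= message.max.bytes=$msg_max"
}
```

Running `check_fetch_ceiling /etc/kafka/server.properties` on each broker gives a one-line verdict per host. A non-zero exit status makes it easy to wire into a configuration check.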
Prevention and Best Practices
- Keep message.max.bytes and replica.fetch.max.bytes consistent across every broker; a mismatch can stall replication silently.
- Pin advertised.listeners to stable, resolvable names and test inter-broker reachability after every network change.
- Automate certificate rotation for the inter-broker listener and validate it in staging before production.
- Alert on UnderReplicatedPartitions and on ReplicaFetcherThread ERROR lines, not just on broker-down events.
- Size retention so a briefly offline follower can resume without hitting OFFSET_OUT_OF_RANGE.
Surfacing these fetch errors in an incident response dashboard shortens time-to-diagnosis.
Related Errors
- Shrinking ISR / Expanding ISR: a follower that is merely lagging rather than failing outright. See the dedicated ISR flapping guide.
- NotEnoughReplicasException: surfaced to producers once fetch failures drop the ISR below min.insync.replicas.
- OffsetOutOfRangeException on consumers: the client-side analogue of the follower truncation case.
Frequently Asked Questions
Q: What does the -0-2 in ReplicaFetcherThread-0-2 mean?
It is fetcher thread index 0 fetching from source broker id 2. So the broker logging the error is the follower, and broker 2 is the leader it is replicating from. This immediately tells you which two brokers and which network path to investigate.
Q: Are NOT_LEADER_OR_FOLLOWER errors during a reassignment dangerous?
Usually not. Leadership moves during reassignments and controlled shutdowns, and followers briefly fetch from the old leader before metadata refreshes. A short burst that clears on its own is expected. A continuous stream means stale metadata or a controller problem.
Q: How do I fix OFFSET_OUT_OF_RANGE without losing data?
Let the follower truncate and re-replicate from the leader; the leader is the source of truth, so the follower realigning is safe for that replica. Data loss only occurs if you force an unclean leader election to a replica that never had the records. Address the cause too: retention shorter than your worst-case follower downtime.
Q: Why is the partition still serving traffic if a follower can’t fetch?
The leader and any remaining in-sync replicas keep the partition online. The failing follower just drops out of the ISR. Risk rises only when failures accumulate and the ISR shrinks below min.insync.replicas, at which point acks=all produces start failing.
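The threshold described in that last answer is a simple count comparison. As a minimal sketch, with a hypothetical helper name and the ISR passed in the comma-separated form that kafka-topics.sh prints:

```shell
# Args: comma-separated ISR list, min.insync.replicas value.
# Reports whether acks=all produces would still be accepted.
acks_all_ok() {
  isr=$1; min_isr=$2
  if [ -z "$isr" ]; then
    size=0
  else
    size=$(printf '%s\n' "$isr" | awk -F, '{ print NF }')
  fi
  if [ "$size" -ge "$min_isr" ]; then
    echo "ok isr=$size min=$min_isr"
  else
    echo "failing isr=$size min=$min_isr"
    return 1
  fi
}

acks_all_ok 2,3 2
```

With the ISR of 2,3 from the earlier describe output and an assumed min.insync.replicas of 2, this prints `ok isr=2 min=2`. Losing one more replica would flip it to `failing`.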