Kafka Error Guide: 'Error for partition topic-0 at offset 12345' ReplicaFetcherThread Failure

A ReplicaFetcherThread error means a follower broker tried to replicate a partition from its leader and failed. Unlike a slow follower that merely lags out of the ISR, this is an outright fetch failure: the follower cannot make progress at all until the underlying problem is fixed. The error text varies, but every variant originates from kafka.server.ReplicaFetcherThread and points at the leader–follower fetch path.

Exact Error Message

On the follower broker (here, broker 1 fetching from leader broker 2), server.log shows:

[2026-06-29 16:41:07,233] ERROR [ReplicaFetcherThread-0-2] Error for partition topic-0 at offset 12345 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.NotLeaderOrFollowerException: This server is not the leader for that topic-partition.
[2026-06-29 16:41:07,233] WARN [ReplicaFetcherThread-0-2] Error processing fetch request for partition topic-0 (kafka.server.ReplicaFetcherThread)
[2026-06-29 16:41:12,560] WARN [ReplicaFetcherThread-0-2] Replica fetch failed for partition topic-0; unable to fetch partition (kafka.server.ReplicaFetcherThread)

The thread name encodes the topology: ReplicaFetcherThread-0-2 is fetcher 0 pulling from broker id 2. So this broker is the follower, and broker 2 is the leader it cannot reach or get a clean response from. The exception class on the line after the first ERROR is the most important detail.
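The naming rule above can be applied mechanically when scanning many log lines. The following is a minimal sketch, assuming the thread name has exactly the ReplicaFetcherThread-<fetcher>-<broker> form shown in this guide; the function name parse_fetcher_thread is hypothetical.

```shell
# Split a fetcher thread name such as "ReplicaFetcherThread-0-2"
# into its fetcher index and source (leader) broker id.
parse_fetcher_thread() {
  name=$1
  rest=${name#ReplicaFetcherThread-}   # strip the fixed prefix, leaving "0-2"
  if [ "$rest" = "$name" ]; then       # prefix was not present
    echo "not a ReplicaFetcherThread name: $name" >&2
    return 1
  fi
  fetcher_id=${rest%%-*}               # text before the first "-"
  source_broker=${rest#*-}             # text after the first "-"
  echo "fetcher=$fetcher_id source_broker=$source_broker"
}

parse_fetcher_thread "ReplicaFetcherThread-0-2"   # prints "fetcher=0 source_broker=2"
```

The source broker id is the one to carry into the connectivity and leadership checks later in this guide.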

What the Error Means

Followers replicate by sending fetch requests to the partition leader, much as a consumer does. When that fetch returns an error code, or the connection fails, ReplicaFetcherThread logs it and backs off (replica.fetch.backoff.ms) before retrying.

The error class tells you the category:

NotLeaderOrFollowerException (NOT_LEADER_OR_FOLLOWER) — the follower’s metadata is stale; broker 2 is no longer the leader for that partition.
UnknownTopicOrPartitionException — broker 2 does not have that partition (deleted, or metadata skew).
OffsetOutOfRangeException (OFFSET_OUT_OF_RANGE) — the follower asked for an offset the leader no longer has or has not yet reached, triggering log truncation.
RecordTooLargeException — a record on the leader exceeds the follower’s replica.fetch.max.bytes, so the fetch can never return it.
Connection/auth failures — TLS handshake or SASL mismatch on the inter-broker listener, or the leader is simply unreachable.
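The categories above can be turned into a small triage helper. This is only a sketch: the function name classify_fetch_error and the category labels are hypothetical, and the patterns used for connection and authentication failures are assumptions about what typically appears in the log, not an exhaustive list.

```shell
# Map the exception text from a fetcher log line to a triage category.
classify_fetch_error() {
  case "$1" in
    *NotLeaderOrFollowerException*)     echo "stale-leadership" ;;
    *UnknownTopicOrPartitionException*) echo "missing-partition" ;;
    *OffsetOutOfRangeException*)        echo "offset-out-of-range" ;;
    *RecordTooLargeException*)          echo "fetch-size" ;;
    # Assumed patterns for network, TLS, and SASL failures.
    *AuthenticationException*|*ConnectException*|*DisconnectException*)
                                        echo "connection-or-auth" ;;
    *)                                  echo "unknown" ;;
  esac
}

classify_fetch_error "org.apache.kafka.common.errors.NotLeaderOrFollowerException: This server is not the leader for that topic-partition."
# prints "stale-leadership"
```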

Common Causes

Leader unreachable / network. The follower cannot open or hold a connection to the leader’s inter-broker listener (firewall, DNS, leader down, NIC issue).
Stale leadership (NOT_LEADER_OR_FOLLOWER). Leadership moved during a controller change or reassignment, and the follower is briefly fetching from the old leader. Usually self-healing; persistent occurrences indicate a metadata problem.
OFFSET_OUT_OF_RANGE. The follower’s log diverged or fell so far behind that its requested offset is below the leader’s log start offset (after retention deletion) or above its log end offset. The follower must truncate to realign.
Inter-broker auth / TLS mismatch. A rotated certificate, wrong truststore, or mismatched SASL mechanism on the inter.broker.listener.name breaks the fetch connection.
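Fetch size too small (RecordTooLargeException). replica.fetch.max.bytes on the follower is smaller than message.max.bytes on the leader. On brokers older than 0.10.1, an oversized record could then never be replicated and the partition stalled. Kafka 0.10.1 and later treat replica.fetch.max.bytes as a soft limit and still return the first oversized batch so the follower can make progress, but keeping the two limits aligned remains the safe configuration.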
Leader log dir offline. With JBOD, the leader’s disk holding the partition went offline; the leader cannot serve the fetch.
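The OFFSET_OUT_OF_RANGE condition described above comes down to comparing the requested offset with the leader's log start and log end offsets. A minimal sketch of that comparison follows; the function name check_fetch_offset and the example offsets 20000 and 50000 are hypothetical, chosen only to illustrate the 12345 offset from the log sample.

```shell
# Decide how a requested fetch offset relates to the leader's log range.
# Arguments: requested offset, leader log start offset, leader log end offset.
check_fetch_offset() {
  requested=$1
  log_start=$2
  log_end=$3
  if [ "$requested" -lt "$log_start" ]; then
    echo "below-log-start"   # data already removed on the leader, e.g. by retention
  elif [ "$requested" -gt "$log_end" ]; then
    echo "above-log-end"     # follower log is ahead of, or diverged from, the leader
  else
    echo "in-range"
  fi
}

check_fetch_offset 12345 20000 50000   # prints "below-log-start"
```

Either out-of-range result means the follower has to truncate or reset before it can resume fetching.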

How to Reproduce the Error

NOT_LEADER: trigger a controlled leader change (reassignment or controlled shutdown) and tail the follower log during the transition window.
OFFSET_OUT_OF_RANGE: stop a follower, let leader retention delete segments past the follower’s offset, then start the follower; it requests a now-deleted offset.
RecordTooLargeException: set message.max.bytes high on the leader, produce a large record, and set replica.fetch.max.bytes smaller on the follower.
TLS mismatch: point one broker at a truststore missing the CA used by the others and restart.

Diagnostic Commands

Confirm who the current leader and ISR actually are:

kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic topic

Topic: topic   Partition: 0   Leader: 2   Replicas: 1,2,3   Isr: 2,3
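On a cluster with many partitions, comparing the Replicas and Isr columns by eye does not scale. The sketch below filters kafka-topics.sh --describe output and prints every replica that is absent from the ISR. It assumes the whitespace-separated "Label: value" layout shown above and a non-empty Isr value; the function name missing_from_isr is hypothetical.

```shell
# Read kafka-topics.sh --describe output on stdin and print each replica id
# that is listed under "Replicas:" but not under "Isr:".
missing_from_isr() {
  awk '
    /Replicas:/ && /Isr:/ {
      replicas = ""; isr = ""; topic = ""; part = ""
      # Each label is followed by its value in the next field.
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      n = split(replicas, r, ",")
      m = split(isr, s, ",")
      for (j = 1; j <= n; j++) {
        found = 0
        for (k = 1; k <= m; k++) if (r[j] == s[k]) found = 1
        if (!found) print topic "-" part " broker " r[j] " not in ISR"
      }
    }'
}

echo "Topic: topic   Partition: 0   Leader: 2   Replicas: 1,2,3   Isr: 2,3" | missing_from_isr
# prints "topic-0 broker 1 not in ISR"
```

In practice the input would be piped from the kafka-topics.sh command shown above rather than from echo.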

Here broker 1 is in Replicas but missing from Isr — consistent with a follower that cannot fetch. Pull the fetcher errors from the follower:

journalctl -u kafka --since "20 min ago" \
  | grep -A 1 "ReplicaFetcherThread"

ERROR [ReplicaFetcherThread-0-2] Error for partition topic-0 at offset 12345
... NotLeaderOrFollowerException: This server is not the leader ...
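To pair each fetcher ERROR with its exception class without reading the log by hand, a short awk filter can help. This is a sketch under assumptions: it expects the server.log layout shown in the Exact Error Message section, where log lines start with a bracketed timestamp and the exception sits alone on the following line. For journald, `journalctl -o cat` prints messages without the syslog prefix. The function name fetcher_errors is hypothetical.

```shell
# For each ReplicaFetcherThread ERROR line on stdin, print the partition
# and the exception class found on the line that follows it.
fetcher_errors() {
  awk '
    # A follow-up line that does not start with a timestamp holds the exception.
    want && $0 !~ /^\[/ {
      cls = $1
      sub(/:$/, "", cls)        # drop the trailing colon after the class name
      print partition, cls
      want = 0
      next
    }
    { want = 0 }
    / ERROR \[ReplicaFetcherThread-[0-9]+-[0-9]+\] Error for partition / {
      for (i = 1; i < NF; i++) if ($i == "partition") partition = $(i + 1)
      want = 1
    }'
}

# Usage (the log path is an assumption; adjust to your installation):
#   fetcher_errors < /var/log/kafka/server.log | sort | uniq -c
```

Counting the pairs with sort and uniq shows at a glance whether one partition or one error class dominates.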

The exception class on the line after the ERROR is the root-cause signal. Check connectivity to the leader’s inter-broker listener (leader is broker 2):

ss -tnp | grep ':9093'

ESTAB 0 0 10.0.0.11:51234 10.0.0.12:9093  users:(("java",pid=4412,fd=212))

No established connection to the leader’s inter-broker port means a network or TLS problem. Inspect the listener and security config:

grep -E "listeners|advertised.listeners|inter.broker|security" \
  /etc/kafka/server.properties

Confirm whether a log dir is offline on the leader and check each replica’s size and offset lag:

kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --describe --broker-list 2

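Compare replica contents directly (newer Kafka releases rename --topic-white-list to --topics-include and deprecate this tool, so check --help on your version):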

kafka-replica-verification.sh --broker-list localhost:9092 \
  --topic-white-list 'topic'

Step-by-Step Resolution

Read the exception class first. The fix differs entirely depending on whether it is NOT_LEADER_OR_FOLLOWER, OFFSET_OUT_OF_RANGE, RecordTooLargeException, or a connection error. Do not skip this.
NOT_LEADER / UNKNOWN_TOPIC (transient). If kafka-topics.sh --describe shows a healthy leader and the error stopped, this was a metadata transition and needs no action. If it persists, the follower has stale metadata — check controller health and the broker’s connectivity to the controller.
Connection / TLS / auth. If ss shows no connection to the leader’s inter-broker port, fix the path. Verify advertised.listeners resolves to the right address from the follower, that the firewall allows the inter-broker port, and that the truststore/keystore and SASL mechanism on inter.broker.listener.name match across brokers. In server.properties:
```
inter.broker.listener.name=INTERNAL
advertised.listeners=INTERNAL://broker2.internal:9093,EXTERNAL://broker2.example.com:9092
```
RecordTooLargeException. Align the size limits so the follower can pull the largest record the leader accepts. The follower’s fetch ceiling must be at least the leader’s message ceiling:
```
message.max.bytes=10485760
replica.fetch.max.bytes=10485760
```
OFFSET_OUT_OF_RANGE. Kafka normally handles this automatically: the follower truncates to the leader’s log start or end offset and resumes. If it loops, the follower’s local log is corrupt or diverged. The standard recovery is to let it truncate or, in severe cases, to stop the follower broker and remove that partition’s local dir so it re-replicates from the leader on restart. Verify retention is not so short that followers cannot keep up.
Leader log dir offline. If kafka-log-dirs.sh shows an offline dir on the leader, the underlying disk failed. Leadership should move to an in-sync replica; restore or replace the disk, then let replication backfill.
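The size-limit alignment from the RecordTooLargeException step can be checked per broker before a restart. The sketch below reads both keys from a properties file and compares them. The function names prop and check_fetch_limits are hypothetical, and the script deliberately refuses to guess when a key is unset rather than assuming broker defaults. It does not account for topic-level max.message.bytes overrides.

```shell
# Print the value of one key from a Java-style properties file.
# Whitespace around "=" is tolerated; the last assignment wins.
prop() {
  awk -F= -v key="$2" '
    { k = $1; gsub(/[ \t]/, "", k) }
    k == key { v = $2; gsub(/[ \t]/, "", v); val = v }
    END { print val }' "$1"
}

# Compare the follower fetch ceiling with the broker message ceiling.
check_fetch_limits() {
  msg=$(prop "$1" message.max.bytes)
  fetch=$(prop "$1" replica.fetch.max.bytes)
  if [ -z "$msg" ] || [ -z "$fetch" ]; then
    echo "unset: set both keys explicitly before comparing"
    return 2
  fi
  if [ "$fetch" -ge "$msg" ]; then
    echo "ok: replica.fetch.max.bytes=$fetch >= message.max.bytes=$msg"
  else
    echo "mismatch: replica.fetch.max.bytes=$fetch < message.max.bytes=$msg"
    return 1
  fi
}

# Usage: check_fetch_limits /etc/kafka/server.properties
```

Running the same check against every broker's file catches the case where only one node was updated.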

After the fix, re-run kafka-topics.sh --describe and confirm broker 1 rejoins the ISR.

Prevention and Best Practices

Keep message.max.bytes and replica.fetch.max.bytes consistent across every broker; a mismatch is a silent replication stall.
Pin advertised.listeners to stable, resolvable names and test inter-broker reachability after every network change.
Automate certificate rotation for the inter-broker listener and validate it in staging before production.
Alert on UnderReplicatedPartitions and on ReplicaFetcherThread ERROR lines, not just on broker-down events.
Size retention so a briefly offline follower can resume without hitting OFFSET_OUT_OF_RANGE. Surfacing these fetch errors in an incident response dashboard shortens time-to-diagnosis.

Related Errors

Shrinking ISR / Expanding ISR: a follower that is merely lagging rather than failing outright. See the dedicated ISR flapping guide.
NotEnoughReplicasException: surfaced to producers once fetch failures drop the ISR below min.insync.replicas.
OffsetOutOfRangeException on consumers: the client-side analogue of the follower truncation case.

Frequently Asked Questions

Q: What does the -0-2 in ReplicaFetcherThread-0-2 mean? It is fetcher thread index 0 fetching from source broker id 2. So the broker logging the error is the follower, and broker 2 is the leader it is replicating from. This immediately tells you which two brokers and which network path to investigate.

Q: Are NOT_LEADER_OR_FOLLOWER errors during a reassignment dangerous? Usually not. Leadership moves during reassignments and controlled shutdowns, and followers briefly fetch from the old leader before metadata refreshes. A short burst that clears on its own is expected. A continuous stream means stale metadata or a controller problem.

Q: How do I fix OFFSET_OUT_OF_RANGE without losing data? Let the follower truncate and re-replicate from the leader; the leader is the source of truth, so the follower realigning is safe for that replica. Data loss only occurs if you force an unclean leader election to a replica that never had the records. Address the cause too: retention shorter than your worst-case follower downtime.

Q: Why is the partition still serving traffic if a follower can’t fetch? The leader and any remaining in-sync replicas keep the partition online. The failing follower just drops out of the ISR. Risk rises only when failures accumulate and the ISR shrinks below min.insync.replicas, at which point acks=all produces start failing.
