AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Shrinking ISR from 1,2,3 to 1,2' Replica Lag Flapping

Why Kafka logs 'Shrinking ISR' and 'Expanding ISR' for a partition, how replica.lag.time.max.ms drives it, and how to stabilize a flapping follower.

  • #kafka
  • #troubleshooting
  • #errors
  • #replication

If you tail a Kafka broker’s server.log long enough, you will eventually see a partition’s in-sync replica set (ISR) shrink and then expand a moment later. A single occurrence is normal. A replica that drops out of the ISR and rejoins it over and over is a warning sign: that follower is marginal and is repeatedly falling behind the leader, then catching back up.

Exact Error Message

The leader broker logs ISR membership changes through kafka.cluster.Partition. A shrink followed quickly by an expand looks like this:

[2026-06-29 14:02:11,418] INFO [Partition topic-0 broker=1] Shrinking ISR from 1,2,3 to 1,2. Leader: (highWatermark: 884213, endOffset: 884221). Out of sync replicas: (brokerId: 3, endOffset: 883740, lastCaughtUpTimeMs: 1719669700912). (kafka.cluster.Partition)
[2026-06-29 14:02:11,420] INFO [Partition topic-0 broker=1] ISR updated to 1,2 and version updated to 47 (kafka.cluster.Partition)
[2026-06-29 14:02:39,902] INFO [Partition topic-0 broker=1] Expanding ISR from 1,2 to 1,2,3 (kafka.cluster.Partition)
[2026-06-29 14:02:39,905] INFO [Partition topic-0 broker=1] ISR updated to 1,2,3 and version updated to 48 (kafka.cluster.Partition)

These are logged at INFO, not ERROR, so they are easy to miss. The signal is the rate. Note that broker 3 is the replica being dropped, its endOffset trails the leader’s by 481 offsets (884221 - 883740), and lastCaughtUpTimeMs is the epoch-millisecond timestamp at which it was last fully caught up.
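
To pull those fields out programmatically, a minimal sketch follows. The regular expressions are modeled on the sample line above and are an assumption: other Kafka versions may word the message differently. The function name parse_shrink is hypothetical.

```python
import re

# Patterns modeled on the sample log line in this guide (an assumption;
# adjust them if your Kafka version formats the message differently).
SHRINK_RE = re.compile(
    r"\[Partition (?P<partition>\S+) broker=(?P<leader>\d+)\] "
    r"Shrinking ISR from (?P<old>[\d,]+) to (?P<new>[\d,]+)\. "
    r"Leader: \(highWatermark: (?P<hw>\d+), endOffset: (?P<leo>\d+)\)"
)
REPLICA_RE = re.compile(
    r"\(brokerId: (?P<id>\d+), endOffset: (?P<leo>\d+), "
    r"lastCaughtUpTimeMs: (?P<ts>\d+)\)"
)


def parse_shrink(line):
    """Return partition, leader, and per-replica lag for a shrink line, or None."""
    m = SHRINK_RE.search(line)
    if m is None:
        return None
    leader_leo = int(m.group("leo"))
    dropped = []
    for r in REPLICA_RE.finditer(line):
        dropped.append({
            "broker": int(r.group("id")),
            # How many offsets the follower trails the leader's log end offset.
            "offsets_behind": leader_leo - int(r.group("leo")),
            "last_caught_up_ms": int(r.group("ts")),
        })
    return {
        "partition": m.group("partition"),
        "leader": int(m.group("leader")),
        "dropped": dropped,
    }
```

Fed the first sample line, this reports broker 3 as dropped and 481 offsets behind.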

What the Error Means

A partition leader maintains the ISR: the set of replicas that are sufficiently caught up to the leader’s log. A follower stays in the ISR as long as it has fetched up to the leader’s log end offset within replica.lag.time.max.ms (default 30000, 30 seconds).

When a follower’s last fully-caught-up time falls outside that window, the leader removes it from the ISR (“Shrinking ISR”). When the same follower catches back up to the leader’s log end offset, the leader re-adds it (“Expanding ISR”). The version number increments on every change and is propagated to the controller.
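
The membership rule can be modeled in a few lines. This is a simplified sketch of the leader-side check, not Kafka's actual code; the function names and the dictionary of last-caught-up timestamps are hypothetical.

```python
REPLICA_LAG_TIME_MAX_MS = 30_000  # broker default cited in this guide


def is_out_of_sync(now_ms, last_caught_up_ms,
                   lag_time_max_ms=REPLICA_LAG_TIME_MAX_MS):
    """True when the follower has gone longer than the window without catching up."""
    return now_ms - last_caught_up_ms > lag_time_max_ms


def next_isr(isr, leader, last_caught_up, now_ms,
             lag_time_max_ms=REPLICA_LAG_TIME_MAX_MS):
    """ISR after one leader-side check; the leader itself always stays in."""
    return [
        broker for broker in isr
        if broker == leader
        or not is_out_of_sync(now_ms, last_caught_up[broker], lag_time_max_ms)
    ]


# Broker 3 last caught up 40 s ago, broker 2 only 1 s ago.
print(next_isr([1, 2, 3], leader=1,
               last_caught_up={2: 99_000, 3: 60_000}, now_ms=100_000))
```

With a 30-second window, only broker 3 is removed in this example.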

The consequence matters: with min.insync.replicas set to 2 and a shrink to a single replica, producers using acks=all will start getting NotEnoughReplicasException. Frequent shrink/expand also increases controller and metadata churn.
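
As a rough illustration of that interaction, the sketch below models only the rule described above. produce_outcome is a hypothetical helper, not a Kafka API.

```python
def produce_outcome(acks, isr_size, min_insync_replicas):
    """Model the broker's decision for one produce request (simplified)."""
    # Only acks=all (also written as -1) enforces min.insync.replicas.
    if acks in ("all", "-1") and isr_size < min_insync_replicas:
        return "NotEnoughReplicasException"
    return "accepted"


for acks, isr_size in [("all", 2), ("all", 1), ("1", 1)]:
    print(acks, isr_size, produce_outcome(acks, isr_size, min_insync_replicas=2))
```

With min.insync.replicas=2, a shrink from three replicas to two is still safe for acks=all producers; the failure starts only when the ISR drops to one.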

Common Causes

  1. Slow follower disk or long GC pauses. If broker 3 stalls on a flush or a stop-the-world GC pause longer than replica.lag.time.max.ms, its fetch thread cannot keep up and it drops out, then rejoins after the pause.
  2. Network saturation between brokers. Inter-broker replication competes with client traffic. A saturated NIC or a noisy neighbor on the replication path delays fetches.
  3. Follower broker overloaded. Too many partitions per ReplicaFetcherThread, CPU starvation, or page-cache pressure means the follower simply cannot fetch fast enough during traffic peaks.
  4. replica.lag.time.max.ms set too low. Lowering the default makes the ISR intolerant of brief, harmless hiccups, producing flapping where none is warranted.
  5. Produce bursts on the leader. A sudden spike in produce throughput temporarily outpaces the follower; the lag window trips before the follower drains the backlog.
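
For cause 5, a back-of-the-envelope estimate shows whether a burst is large enough to trip the window. The sketch assumes constant fetch and produce rates, which real traffic will not have; the function names and the example numbers are hypothetical.

```python
def catch_up_seconds(backlog_bytes, fetch_bytes_per_s, produce_bytes_per_s):
    """Seconds for the follower to drain a backlog while writes continue.

    Returns None when the follower fetches no faster than the leader
    receives data, because it then never catches up.
    """
    headroom = fetch_bytes_per_s - produce_bytes_per_s
    if headroom <= 0:
        return None
    return backlog_bytes / headroom


def burst_trips_lag_window(backlog_bytes, fetch_bytes_per_s,
                           produce_bytes_per_s, lag_time_max_ms=30_000):
    """True when draining the backlog takes longer than the lag window."""
    seconds = catch_up_seconds(backlog_bytes, fetch_bytes_per_s,
                               produce_bytes_per_s)
    return seconds is None or seconds * 1000 > lag_time_max_ms


# 600 MB backlog, follower fetches 50 MB/s while the leader still takes 40 MB/s.
print(catch_up_seconds(600_000_000, 50_000_000, 40_000_000))
print(burst_trips_lag_window(600_000_000, 50_000_000, 40_000_000))
```

The useful number is the headroom between fetch and produce rates: a follower with little headroom needs a long time to drain even a modest backlog.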

How to Reproduce the Error

In a lab, you can induce a shrink reliably:

  • Create a 3-replica topic and drive a steady producer load.
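  • On one follower, throttle replication bandwidth or pause the JVM for longer than replica.lag.time.max.ms (kill -STOP <pid>, wait, then kill -CONT <pid>). A pause of only a few seconds will not trip the default 30-second window.
  • Once the follower has gone longer than replica.lag.time.max.ms without catching up, the leader logs Shrinking ISR. The leader checks periodically, so the shrink can appear up to roughly 1.5 times the window after the pause begins. After the follower resumes and catches up, you see Expanding ISR.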

Alternatively, set replica.lag.time.max.ms=5000 on a busy test cluster and watch a borderline follower start flapping during normal peaks.
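
The pause step can be scripted so the follower is always resumed. This is a lab-only sketch for Unix hosts. pause_process and min_pause_seconds are hypothetical helpers, and the 1.5x budget assumes the leader runs its ISR check at half the lag window.

```python
import os
import signal
import time


def min_pause_seconds(lag_time_max_ms=30_000):
    """Pause length that should outlast the lag window plus the check interval."""
    # Assumption: the leader checks every lag_time_max_ms / 2, so allow 1.5x.
    return lag_time_max_ms * 1.5 / 1000


def pause_process(pid, seconds):
    """Equivalent of: kill -STOP <pid>; sleep <seconds>; kill -CONT <pid>."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)


# Example (lab only): pause_process(follower_pid, min_pause_seconds())
```

Run it against the follower broker's JVM pid and tail the leader's log while it sleeps.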

Diagnostic Commands

Find which partitions are currently under-replicated:

kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
Topic: topic   Partition: 0   Leader: 1   Replicas: 1,2,3   Isr: 1,2

If nothing is listed at the moment you run it, the partition has already re-expanded; the log history is the real evidence. Count the flap rate from the broker log:

journalctl -u kafka --since "30 min ago" \
  | grep -E "Shrinking ISR|Expanding ISR" | grep "topic-0"
... [Partition topic-0 broker=1] Shrinking ISR from 1,2,3 to 1,2 ...
... [Partition topic-0 broker=1] Expanding ISR from 1,2 to 1,2,3 ...
... [Partition topic-0 broker=1] Shrinking ISR from 1,2,3 to 1,2 ...
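
Counting by eye gets tedious over a long window. The sketch below tallies, per partition and broker, how often a replica was dropped. The pattern is modeled on the log lines shown in this guide, and count_flaps is a hypothetical name.

```python
import re
from collections import Counter

# Modeled on the shrink lines shown in this guide (an assumption).
SHRINK_EVENT_RE = re.compile(
    r"\[Partition (?P<partition>\S+) broker=\d+\] "
    r"Shrinking ISR from (?P<old>[\d,]+) to (?P<new>[\d,]+)"
)


def count_flaps(lines):
    """Count drops from the ISR, keyed by (partition, broker id)."""
    drops = Counter()
    for line in lines:
        m = SHRINK_EVENT_RE.search(line)
        if m is None:
            continue
        # The dropped replicas are those in the old set but not the new one.
        removed = set(m.group("old").split(",")) - set(m.group("new").split(","))
        for broker in removed:
            drops[(m.group("partition"), int(broker))] += 1
    return drops


# Example: pass sys.stdin as `lines` and pipe the journalctl output into the script.
```

A single key with a high count, such as ("topic-0", 3), is the signature of one marginal follower.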

Repeated pairs for the same replica (broker 3) point at one marginal follower. Confirm the follower’s disk and partition distribution:

kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --describe --broker-list 3
{"brokers":[{"broker":3,"logDirs":[{"logDir":"/var/lib/kafka","partitions":[
  {"partition":"topic-0","size":214748364,"offsetLag":481,"isFuture":false}]}]}]}
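
On a broker with many partitions the JSON is easier to filter than to read. A minimal sketch, assuming the output shape shown above and that any status lines the tool prints before the JSON contain no braces; lagging_partitions is a hypothetical name.

```python
import json


def lagging_partitions(cli_output, min_lag=1):
    """Return (broker, log_dir, partition, offset_lag) for lagging replicas."""
    # Skip any status text before the JSON document (assumed brace-free).
    doc = json.loads(cli_output[cli_output.index("{"):])
    hits = []
    for broker in doc["brokers"]:
        for log_dir in broker["logDirs"]:
            for part in log_dir["partitions"]:
                # Future replicas belong to in-progress log dir moves; skip them.
                if not part["isFuture"] and part["offsetLag"] >= min_lag:
                    hits.append((broker["broker"], log_dir["logDir"],
                                 part["partition"], part["offsetLag"]))
    return hits
```

Capture the command's stdout into a string and pass it in; run it several times a few seconds apart to see whether the lag persists.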

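A persistent non-zero offsetLag on broker 3 supports the diagnosis. Kafka computes offsetLag against the high watermark known to that broker rather than the leader’s log end offset, so it can read 0 on a follower that is still behind; treat it as supporting evidence and the shrink log lines as the primary signal. Check whether the fetcher thread is healthy on the follower: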

journalctl -u kafka --since "30 min ago" \
  | grep "ReplicaFetcherThread"

You can also spot-check end-to-end replica consistency:

kafka-replica-verification.sh --broker-list localhost:9092 \
  --topic-white-list 'topic'

Step-by-Step Resolution

  1. Identify the offending follower. From the log pairs above, the replica that is present in the “from” set but missing from the “to” set of each shrink line is the marginal one (broker 3 here). Focus remediation there.

  2. Rule out a too-aggressive lag window. If replica.lag.time.max.ms was lowered below the 30000 default, raise it back. A modest increase tolerates brief GC or burst hiccups without churning the ISR. In server.properties:

    replica.lag.time.max.ms=30000
  3. Give the follower more fetch parallelism. If one broker hosts many partitions, a single fetcher per source broker becomes the bottleneck. Increase fetcher threads:

    num.replica.fetchers=4
  4. Fix follower IO and GC. Confirm the data disk is not saturated (iostat), that the JVM heap is sized so GC pauses stay well under the lag window, and that page cache is not being thrashed. Long GC pauses are a frequent root cause of flapping.

  5. Relieve the network path. If inter-broker links are saturated, separate replication traffic onto a dedicated listener/NIC, or apply a replication quota during peaks so client and replication traffic do not starve each other.

  6. Smooth produce bursts. If a leader spike causes the drop, tune replica.fetch.max.bytes and replica.fetch.min.bytes so followers pull larger batches efficiently, and consider client-side batching/linger to flatten spikes.

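Because replica.lag.time.max.ms is evaluated by the partition leader, apply it on every broker rather than only on the slow follower. After changing config and restarting the affected brokers, re-run the journalctl | grep check. A healthy cluster shows occasional, isolated shrink/expand events, not a steady drumbeat for one replica.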
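
Steps 2 and 3 can be checked mechanically before a restart. The sketch below reads server.properties-style text and flags the two settings discussed above. The helper names are hypothetical, and the defaults are the ones this guide assumes (30000 ms and one fetcher thread).

```python
def parse_properties(text):
    """Parse simple key=value lines, ignoring blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def review_replication_settings(props):
    """Return human-readable findings for the settings in steps 2 and 3."""
    findings = []
    lag_ms = int(props.get("replica.lag.time.max.ms", 30000))
    if lag_ms < 30000:
        findings.append(
            f"replica.lag.time.max.ms={lag_ms} is below the 30000 default")
    fetchers = int(props.get("num.replica.fetchers", 1))
    if fetchers == 1:
        findings.append(
            "num.replica.fetchers=1: one fetcher thread per source broker")
    return findings


with_low_lag = parse_properties(
    "# lab override\nreplica.lag.time.max.ms=5000\nnum.replica.fetchers=4\n")
print(review_replication_settings(with_low_lag))
```

A finding is a prompt to look closer, not proof of a misconfiguration: a single fetcher is fine on a lightly loaded broker.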

Prevention and Best Practices

  • Alert on UnderReplicatedPartitions > 0 sustained for more than a minute, and on the shrink/expand log rate per partition.
  • Keep GC pause time (P99) at least an order of magnitude below replica.lag.time.max.ms.
  • Balance partitions across brokers so no single follower is overloaded.
  • Use replication quotas during reassignments so backfill does not push healthy followers out of the ISR.
  • Track ISR membership trends over time. A follower that flaps under load today will fail under more load tomorrow. Centralizing these signals in an incident response dashboard makes the pattern obvious before it becomes an outage.

Related Errors

  • ReplicaFetcherThread-0-N Error for partition: a follower that cannot fetch at all (not just lagging). See the related guide on fetcher failures.
  • NotEnoughReplicasException: returned to acks=all producers when a shrink drops the ISR below min.insync.replicas.
  • Unavailable partitions (Leader: none): seen when every replica in the ISR is offline and unclean leader election is disabled.
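
Returning to the alerting rule under Prevention and Best Practices (UnderReplicatedPartitions above zero for more than a minute), the "sustained" part is what keeps isolated shrink/expand pairs from paging anyone. A minimal sketch of that condition, assuming you already collect timestamped samples of the metric; the function name and the sample format are hypothetical.

```python
def sustained_under_replication(samples, min_duration_s=60):
    """True if the metric stayed above zero for more than min_duration_s.

    samples: iterable of (timestamp_seconds, under_replicated_count),
    ordered by time.
    """
    run_start = None
    for ts, count in samples:
        if count > 0:
            if run_start is None:
                run_start = ts  # first non-zero sample of this run
            if ts - run_start > min_duration_s:
                return True
        else:
            run_start = None  # metric recovered; reset the run
    return False


blip = [(0, 1), (30, 0), (60, 1), (90, 1)]
stuck = [(0, 1), (30, 1), (60, 1), (90, 1)]
print(sustained_under_replication(blip), sustained_under_replication(stuck))
```

Most monitoring systems express the same idea as a "for" duration on the alert rule, which is usually preferable to custom code.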


Frequently Asked Questions

Q: Is a single “Shrinking ISR” message a problem? No. Isolated shrink/expand pairs are normal during deploys, brief GC pauses, or transient network blips. Treat the rate, not a single event, as the signal. A partition that flaps continuously for one replica is the real concern.

Q: Should I just raise replica.lag.time.max.ms to stop the flapping? Raising it from a too-low value back toward the 30000 default is correct. But pushing it far above the default only masks a slow follower. A follower that is genuinely behind then stays in the ISR longer, and because acks=all waits for every in-sync replica, produce latency rises and a truly stuck follower takes longer to be detected.

Q: Why does the ISR shrink but the partition stays online? As long as one in-sync replica (the leader) remains, the partition serves reads and writes. Producers with acks=all and min.insync.replicas=2 will fail once the ISR drops to one, but consumers and acks=1 producers continue.

Q: How do I know which broker is the bad follower? Read the shrink line: the replica that disappears from the “from X to Y” set is the one being removed, and the “Out of sync replicas” detail names its broker id and trailing offset. Confirm with kafka-log-dirs.sh --describe showing persistent offsetLag on that broker.
