Kafka Error Guide: 'Shrinking ISR' replica lagging and under-replicated partitions
Fix Kafka followers that lag and drop out of ISR causing under-replicated partitions: slow disk, NIC saturation, fetchers, and leftover replication throttles.
A follower replica that cannot keep up with its leader gets ejected from the in-sync replica set (ISR), and the partition becomes under-replicated. A brief shrink-then-expand is routine. A follower that lags for minutes or hours is a durability risk and almost always points at a resource bottleneck or, surprisingly often, a replication throttle that was set during a reassignment and never removed. This guide focuses on that sustained lag.
Exact Error Message
Sustained lag shows up in the leader broker’s server.log:
[2026-06-29 11:02:17,884] WARN [Partition payments-3 broker=1] Shrinking ISR from 1,2,3 to 1,2. Leader: (highWatermark: 884512230, endOffset: 884693001). Out of sync replicas: (brokerId: 3, endOffset: 884419772, lastCaughtUpTimeMs: 1719658927112). (kafka.cluster.Partition)
[2026-06-29 11:02:17,901] INFO [Partition payments-3 broker=1] ISR updated to 1,2 (under-replicated) (kafka.cluster.Partition)
[2026-06-29 11:05:44,233] WARN [ReplicaManager broker=1] Replica 3 for partition payments-3 is lagging behind by 273229 messages; lastCaughtUpTimeMs ago: 211000ms. (kafka.server.ReplicaManager)
And in the UnderReplicatedPartitions JMX metric:
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Value = 6
The tell is the gap between the leader’s endOffset and the follower’s endOffset, plus a lastCaughtUpTimeMs that keeps receding. Unlike a healthy ISR shrink that immediately expands back, this follower stays out and the under-replicated count stays above zero.
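To put a number on that gap, you can subtract the follower's endOffset from the leader's directly from the log line. The snippet below is a minimal sketch: isr_offset_gap is a hypothetical helper name, and it assumes the line lists exactly one out-of-sync replica, as in the excerpt above.

```shell
# Sample line copied from the log excerpt above.
line='[2026-06-29 11:02:17,884] WARN [Partition payments-3 broker=1] Shrinking ISR from 1,2,3 to 1,2. Leader: (highWatermark: 884512230, endOffset: 884693001). Out of sync replicas: (brokerId: 3, endOffset: 884419772, lastCaughtUpTimeMs: 1719658927112). (kafka.cluster.Partition)'

# isr_offset_gap is a hypothetical helper, not a Kafka tool.
# Assumes one out-of-sync replica per line: the first endOffset is the
# leader's, the second is the follower's.
isr_offset_gap() {
  offsets=$(printf '%s\n' "$1" | grep -o 'endOffset: [0-9]*' | cut -d' ' -f2)
  leader=$(printf '%s\n' "$offsets" | sed -n '1p')
  follower=$(printf '%s\n' "$offsets" | sed -n '2p')
  echo $((leader - follower))
}

isr_offset_gap "$line"   # prints 273229
```

Running it against successive Shrinking ISR lines for the same partition shows whether the gap is growing or closing.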
What the Error Means
Every follower runs fetcher threads that pull records from the partition leader. A follower is “in sync” if it has caught up to the leader’s log end offset at some point within the last replica.lag.time.max.ms. When a follower goes longer than that window without catching up, the leader removes it from the ISR and logs Shrinking ISR. The partition is now under-replicated: it has fewer in-sync copies than its replication factor.
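The time-based rule can be sketched as a simple comparison. This is an illustration only, not broker code: replica_out_of_sync is a hypothetical helper, the timestamps are made up, and the 30000 ms window assumes the default replica.lag.time.max.ms of recent Kafka versions.

```shell
# replica_out_of_sync is a hypothetical helper that mirrors the time-based rule.
# Usage: replica_out_of_sync NOW_MS LAST_CAUGHT_UP_MS MAX_LAG_MS
replica_out_of_sync() {
  [ $(( $1 - $2 )) -gt "$3" ]
}

# 211000 ms since the follower last caught up, against an assumed 30000 ms window.
if replica_out_of_sync 1000211000 1000000000 30000; then
  echo "out of sync: eligible for ISR shrink"
else
  echo "in sync"
fi
```

With the 211000 ms figure from the log excerpt, the follower is far outside the window, which is why it stays out of the ISR.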
This matters for two reasons. First, durability: if you set min.insync.replicas=2 and the ISR drops to 1, producers using acks=all start failing with NOT_ENOUGH_REPLICAS. Second, availability: a leader can only fail over cleanly to an in-sync replica, so a shrunken ISR narrows your safe-failover options. Persistent lag means the follower cannot read from the leader as fast as the leader is writing.
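The durability arithmetic is just a size comparison between the ISR and min.insync.replicas. A minimal sketch, where acks_all_accepted is a hypothetical helper and the ISR lists are illustrative:

```shell
# acks_all_accepted is a hypothetical helper: it succeeds when the ISR is
# large enough for acks=all writes.
# Usage: acks_all_accepted ISR_CSV MIN_INSYNC_REPLICAS
acks_all_accepted() {
  # Count the comma-separated broker IDs in the ISR.
  isr_size=$(printf '%s\n' "$1" | tr ',' '\n' | grep -c .)
  [ "$isr_size" -ge "$2" ]
}

for isr in 1,2,3 1,2 1; do
  if acks_all_accepted "$isr" 2; then
    echo "ISR=$isr: acks=all writes accepted"
  else
    echo "ISR=$isr: NOT_ENOUGH_REPLICAS"
  fi
done
```

With min.insync.replicas=2, losing one follower is tolerated; losing a second is what starts failing producers.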
Common Causes
- Slow or failing disk on the follower. The follower writes fetched data to its own log. If that disk is saturated or degrading, fetch-then-write throughput drops below the incoming write rate and the follower falls permanently behind.
- NIC saturation / network bottleneck. Replication traffic competes with client traffic on the same interface. A saturated link or a noisy cross-AZ path starves the fetchers.
- Under-provisioned num.replica.fetchers. With too few fetcher threads, a broker that is a follower for many busy partitions cannot parallelize enough to keep up.
- Leftover replication throttle. A past partition reassignment set leader.replication.throttled.rate / follower.replication.throttled.rate to protect client traffic. If the reassignment was verified but the throttle was never cleared, replication is permanently rate-limited and followers can never catch up under load. This is the most commonly missed cause.
- Long GC pauses. Stop-the-world pauses on the follower stall its fetcher threads; the follower repeatedly drops out and rejoins.
- Hot partition. A single partition receiving disproportionate write volume can outrun a follower whose resources are shared across many partitions.
How to Reproduce the Error
The throttle case is the most instructive and the most realistic in production. Start a reassignment with an aggressive throttle (the 1 MB/s below is an illustrative value, in bytes per second), let it finish, and “forget” to remove the throttle:
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
--reassignment-json-file plan.json --execute --throttle 1000000
With a low throttle still active and steady producer load, the moved replicas never finish catching up and the partition stays under-replicated indefinitely. To reproduce the disk variant, throttle the follower’s disk I/O (for example with cgroups or by colocating a heavy workload) while producing at a high sustained rate.
Diagnostic Commands
Confirm the scope first. List every under-replicated partition cluster-wide:
kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --under-replicated-partitions
Topic: payments Partition: 3 Leader: 1 Replicas: 1,2,3 Isr: 1,2
Topic: payments Partition: 7 Leader: 2 Replicas: 2,3,1 Isr: 2,1
The broker missing from Isr (broker 3 here) is your lagging follower. Compare log sizes and offsets across brokers to see how far behind it is:
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
--describe --topic-list payments
broker 1: payments-3 size: 91234881024 offsetLag: 0
broker 3: payments-3 size: 88112440832 offsetLag: 273229
Critically, check whether a throttle is still in force. --verify reports it explicitly:
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
--reassignment-json-file plan.json --verify
Status of partition reassignment:
Reassignment of partition payments-3 is completed.
Clearing broker-level throttles on brokers 1,2,3
Throttle was not removed: leader.replication.throttled.rate is still set on broker 1
That last line means a leftover throttle is capping replication. Pull supporting evidence from the leader’s log and the journal:
grep -E "Shrinking ISR|lagging behind" /opt/kafka/logs/server.log | tail -n 20
journalctl -u kafka --since "30 min ago" | grep -iE "gc|disk|i/o error"
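You can also look for the throttle configs directly instead of inferring them. The sketch below filters any config listing down to the throttle keys; throttle_lines is a hypothetical helper, and the sample input is simplified key=value text rather than real tool output.

```shell
# throttle_lines is a hypothetical helper: it keeps only throttle-related
# config keys from whatever text arrives on stdin.
throttle_lines() {
  grep -E '(leader|follower)\.replication\.throttled\.(rate|replicas)'
}

# Intended use against a live cluster (not run here):
#   kafka-configs.sh --bootstrap-server localhost:9092 \
#     --entity-type brokers --entity-name 3 --describe | throttle_lines
#   kafka-configs.sh --bootstrap-server localhost:9092 \
#     --entity-type topics --entity-name payments --describe | throttle_lines

# Demo with illustrative input:
printf '%s\n' 'leader.replication.throttled.rate=1000000' \
  'log.retention.ms=604800000' | throttle_lines
```

The rate keys live on brokers and the throttled.replicas keys live on topics, so check both entity types. No output means no throttle is configured on that entity.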
Step-by-Step Resolution
A worked example. UnderReplicatedPartitions has sat at 6 for an hour following a rebalance the night before.
- Scope it. kafka-topics.sh --describe --under-replicated-partitions shows all six partitions are followed by broker 3, and broker 3 is the one consistently missing from each ISR. So this is a single-follower problem, not a cluster-wide overload.
- Measure the lag. kafka-log-dirs.sh --describe shows broker 3’s copies trailing by hundreds of thousands of offsets and the gap not shrinking over successive runs.
- Check for a throttle. kafka-reassign-partitions.sh --verify against last night’s plan prints “Throttle was not removed” and shows follower.replication.throttled.rate still set. There it is: last night’s reassignment left the throttle behind, and broker 3 is being rate-limited below the live write rate.
- Remove the throttle. Clear the leftover leader.replication.throttled.rate and follower.replication.throttled.rate (and the throttled-replicas configs) at the broker and topic level using kafka-configs.sh to delete those dynamic configs. With the cap gone, broker 3’s fetchers run at full speed, the offset gap closes within minutes, the ISR expands back to 1,2,3, and UnderReplicatedPartitions returns to 0.
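The cleanup in the last step might look like the following. This is a dry-run sketch: print_throttle_cleanup is a hypothetical helper that only prints the kafka-configs.sh commands so you can review them before running anything. The bootstrap address, topic name, and broker IDs are taken from the example and are assumptions about your cluster.

```shell
# print_throttle_cleanup is a hypothetical helper. It prints, but does not run,
# the commands that delete throttle configs on each broker and on the topic.
# Usage: print_throttle_cleanup BOOTSTRAP TOPIC BROKER_ID...
print_throttle_cleanup() {
  bootstrap=$1; topic=$2; shift 2
  for broker in "$@"; do
    echo "kafka-configs.sh --bootstrap-server $bootstrap --entity-type brokers --entity-name $broker --alter --delete-config leader.replication.throttled.rate,follower.replication.throttled.rate"
  done
  echo "kafka-configs.sh --bootstrap-server $bootstrap --entity-type topics --entity-name $topic --alter --delete-config leader.replication.throttled.replicas,follower.replication.throttled.replicas"
}

print_throttle_cleanup localhost:9092 payments 1 2 3
```

Depending on the Kafka version, deleting a key that is not set on an entity may return an error, so it is worth describing the entity first and deleting only what is actually there.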
If --verify shows no throttle, move to resources: confirm follower disk health and free I/O headroom, check NIC utilization, and if the broker follows many busy partitions, raise num.replica.fetchers so it can fetch in parallel (restart the broker if you change it in server.properties). For GC pauses, tune heap and the garbage collector.
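Before raising num.replica.fetchers, it helps to know what the broker is currently configured with. A minimal sketch, assuming plain key=value lines with no spaces around the equals sign; configured_fetchers is a hypothetical helper, and it reads only the static properties file, so a dynamic override set through kafka-configs.sh would not show up here.

```shell
# configured_fetchers is a hypothetical helper: it prints num.replica.fetchers
# from a properties file, or 1 (Kafka's default) when the key is absent.
configured_fetchers() {
  value=$(sed -n 's/^num\.replica\.fetchers=//p' "$1" | tail -n 1)
  echo "${value:-1}"
}

# Demo against a throwaway file; on a broker, point it at the real server.properties.
props=$(mktemp)
printf '%s\n' 'broker.id=3' 'num.replica.fetchers=4' > "$props"
configured_fetchers "$props"   # prints 4
rm -f "$props"
```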
Prevention and Best Practices
- Always confirm --verify reports the throttle as removed after a reassignment; treat a lingering throttle as an incident, not a cosmetic issue.
- Alarm on UnderReplicatedPartitions > 0 sustained for more than a few minutes; brief blips during deploys are fine, sustained values are not.
- Right-size num.replica.fetchers for brokers that follow many partitions, and keep replication and client traffic on adequately provisioned NICs.
- Monitor per-broker disk latency and I/O wait; a degrading disk shows up as replica lag long before it fully fails.
- Keep replica.lag.time.max.ms at a sane value so transient GC pauses do not eject healthy followers, and keep min.insync.replicas=2 with replication factor 3 for durability.
- Wire under-replication alerts into an automated runbook via the incident response dashboard.
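The "sustained, not transient" alarm rule can be sketched as a streak check over recent metric samples. This is an illustration of the logic, not a monitoring integration: urp_sustained is a hypothetical helper, and the one-sample-per-minute cadence and five-minute threshold are assumptions.

```shell
# urp_sustained is a hypothetical helper.
# Usage: urp_sustained REQUIRED_COUNT SAMPLE... (samples oldest first)
# Succeeds only if the most recent REQUIRED_COUNT samples are all above zero.
urp_sustained() {
  required=$1; shift
  streak=0
  for sample in "$@"; do
    if [ "$sample" -gt 0 ]; then
      streak=$((streak + 1))
    else
      streak=0   # any zero sample resets the streak
    fi
  done
  [ "$streak" -ge "$required" ]
}

# Assumed: one UnderReplicatedPartitions sample per minute, alert after 5 minutes.
urp_sustained 5 0 2 0 6 6 6 6 6 && echo "ALERT: sustained under-replication"
urp_sustained 5 0 0 3 0 0 0 1 0 || echo "ok: transient blip"
```

Most monitoring systems express the same idea as a "for" duration on the alert rule, which avoids paging on the brief shrink-then-expand seen during deploys.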
Related Errors
- NOT_ENOUGH_REPLICAS / NOT_ENOUGH_REPLICAS_AFTER_APPEND - producers with acks=all fail when the ISR drops below min.insync.replicas because of this lag.
- OfflinePartitionsCount > 0 - the worse outcome when every replica, including the leader, is unavailable.
- Shrinking ISR followed immediately by Expanding ISR - the benign, transient version of these same log lines.
- LEADER_NOT_AVAILABLE - what clients see when a leaderless or recovering partition cannot serve requests.
Frequently Asked Questions
Q: How do I know if a leftover throttle is causing my lag?
Run kafka-reassign-partitions.sh --verify against the most recent reassignment plan. If it reports that a throttle “was not removed” or that leader.replication.throttled.rate is still set, a stale throttle is rate-limiting replication and must be cleared.
Q: Is a brief ISR shrink something to worry about?
No. A shrink that immediately expands back during a deploy, restart, or short GC pause is normal. Worry when the follower stays out of the ISR and UnderReplicatedPartitions remains above zero for several minutes.
Q: Will adding more replica fetchers always fix lag?
Only when fetcher parallelism is the bottleneck, typically on brokers that follow many busy partitions. If the real limit is a saturated disk, a saturated NIC, or a leftover throttle, more fetchers will not help and may add contention.
Q: What is the difference between this and an offline partition?
A lagging follower means the partition is under-replicated but still has a working leader and can serve traffic. An offline partition has no available leader at all.