AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Shrinking ISR' replica lagging and under-replicated partitions

Fix Kafka followers that lag and drop out of the ISR, causing under-replicated partitions: slow disks, NIC saturation, too few replica fetchers, and leftover replication throttles.

  • #kafka
  • #troubleshooting
  • #errors
  • #replication

A follower replica that cannot keep up with its leader gets ejected from the in-sync replica set (ISR), and the partition becomes under-replicated. A brief shrink-then-expand is routine. A follower that lags for minutes or hours is a durability risk and almost always points at a resource bottleneck or, surprisingly often, a replication throttle that was set during a reassignment and never removed. This guide focuses on that sustained lag.

Exact Error Message

Sustained lag shows up on the leader broker’s server.log:

[2026-06-29 11:02:17,884] WARN [Partition payments-3 broker=1] Shrinking ISR from 1,2,3 to 1,2. Leader: (highWatermark: 884512230, endOffset: 884693001). Out of sync replicas: (brokerId: 3, endOffset: 884419772, lastCaughtUpTimeMs: 1719658927112). (kafka.cluster.Partition)
[2026-06-29 11:02:17,901] INFO [Partition payments-3 broker=1] ISR updated to 1,2 (under-replicated) (kafka.cluster.Partition)
[2026-06-29 11:05:44,233] WARN [ReplicaManager broker=1] Replica 3 for partition payments-3 is lagging behind by 273229 messages; lastCaughtUpTimeMs ago: 211000ms. (kafka.server.ReplicaManager)

And in the JMX metric that each broker reports for the partitions it leads:

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions  Value = 6

The tell is the gap between the leader’s endOffset and the follower’s endOffset, plus a lastCaughtUpTimeMs that keeps receding. Unlike a healthy ISR shrink that immediately expands back, this follower stays out and the under-replicated count stays above zero.
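The gap described above can be computed straight from a Shrinking ISR line. The following is a minimal sketch, assuming the log format shown in the excerpt and a single out-of-sync replica per line; the function name isr_gap is hypothetical.

```shell
# Sample line copied from the log excerpt above.
line='[2026-06-29 11:02:17,884] WARN [Partition payments-3 broker=1] Shrinking ISR from 1,2,3 to 1,2. Leader: (highWatermark: 884512230, endOffset: 884693001). Out of sync replicas: (brokerId: 3, endOffset: 884419772, lastCaughtUpTimeMs: 1719658927112). (kafka.cluster.Partition)'

# Print "<leader endOffset> <follower endOffset> <gap>" for one Shrinking ISR line.
# Assumption: only one out-of-sync replica is listed on the line.
isr_gap() {
  printf '%s\n' "$1" |
    sed -n 's/.*Leader: (highWatermark: [0-9]*, endOffset: \([0-9]*\)).*brokerId: [0-9]*, endOffset: \([0-9]*\),.*/\1 \2/p' |
    awk '{ print $1, $2, $1 - $2 }'
}

isr_gap "$line"
```

For the sample line the third field is 273229, the same figure that appears in the later "lagging behind" message. A gap that grows or holds steady across successive log lines is the sign of sustained lag.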

What the Error Means

Each broker runs replica fetcher threads that pull records from the leaders of the partitions it follows. A follower is “in sync” if it has caught up to the leader’s log end offset at some point within the last replica.lag.time.max.ms. When a follower goes longer than that window without catching up, the leader removes it from the ISR and logs Shrinking ISR. The partition is now under-replicated: it has fewer in-sync copies than its replication factor.
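The time-based rule can be illustrated with plain arithmetic. This is a sketch only: the function is_in_sync is hypothetical, the "now" timestamp is made up for the example, and 30000 ms is an assumed value for replica.lag.time.max.ms (check your own broker configuration).

```shell
# Succeed (exit 0) if the follower still counts as in sync under the time-based rule.
# Arguments: now_ms last_caught_up_ms max_lag_ms (all in milliseconds).
is_in_sync() {
  [ $(( $1 - $2 )) -le "$3" ]
}

max_lag_ms=30000            # assumed replica.lag.time.max.ms
last_caught_up=1719658927112 # lastCaughtUpTimeMs from the log excerpt
now=1719658960000            # hypothetical "current" time, about 33 s later

if is_in_sync "$now" "$last_caught_up" "$max_lag_ms"; then
  echo "in sync"
else
  echo "out of sync"
fi
```

With these numbers the follower last caught up 32888 ms ago, which exceeds the assumed 30000 ms window, so the sketch prints "out of sync".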

This matters for two reasons. First, durability: if you set min.insync.replicas=2 and the ISR drops to 1, producers using acks=all start failing with NOT_ENOUGH_REPLICAS. Second, availability: a leader can only fail over cleanly to an in-sync replica, so a shrunken ISR narrows your safe-failover options. Persistent lag means the follower cannot read from the leader as fast as the leader is writing.
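The min.insync.replicas check amounts to counting ISR members. A minimal sketch, with hypothetical helper names isr_size and meets_min_isr, taking the ISR as the comma-separated list that kafka-topics.sh prints:

```shell
# Count brokers in a comma-separated ISR list such as "1,2".
isr_size() {
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Succeed when the ISR is large enough for an acks=all write to be accepted.
# Arguments: isr_list min_insync_replicas
meets_min_isr() {
  [ "$(isr_size "$1")" -ge "$2" ]
}

meets_min_isr "1,2" 2 && echo "ISR 1,2: acks=all writes accepted"
meets_min_isr "1" 2 || echo "ISR 1: acks=all writes rejected (NOT_ENOUGH_REPLICAS)"
```

With replication factor 3 and min.insync.replicas=2, one lagging follower leaves writes working but removes all remaining margin: a second follower dropping out is enough to start rejecting acks=all producers.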

Common Causes

  1. Slow or failing disk on the follower. The follower writes fetched data to its own log. If that disk is saturated or degrading, fetch-then-write throughput drops below the incoming write rate and the follower falls permanently behind.
  2. NIC saturation / network bottleneck. Replication traffic competes with client traffic on the same interface. A saturated link or a noisy cross-AZ path starves the fetchers.
  3. Under-provisioned num.replica.fetchers. With too few fetcher threads, a broker that is a follower for many busy partitions cannot parallelize enough to keep up.
  4. Leftover replication throttle. A past partition reassignment set leader.replication.throttled.rate / follower.replication.throttled.rate to protect client traffic. If the reassignment finished but the throttle was never cleared (for example, because the final --verify step was skipped or did not remove it), replication for the throttled replicas stays rate-limited and those followers cannot catch up under load. This is the most commonly missed cause.
  5. Long GC pauses. Stop-the-world pauses on the follower stall its fetcher threads; the follower repeatedly drops out and rejoins.
  6. Hot partition. A single partition receiving disproportionate write volume can outrun a follower whose resources are shared across many partitions.

How to Reproduce the Error

The throttle case is the most instructive and the most realistic in production. Start a reassignment with an aggressive throttle and “forget” to remove it afterwards:

kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file plan.json --execute --throttle 1000000

With that low throttle (1,000,000 bytes/s here, an illustrative value) still active under steady producer load, the throttled replicas cannot keep pace with the leader and the partition stays under-replicated indefinitely. To reproduce the disk variant, throttle the follower’s disk I/O (for example with cgroups or by colocating a heavy workload) while producing at a high sustained rate.

Diagnostic Commands

Confirm the scope first. List every under-replicated partition cluster-wide:

kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
Topic: payments  Partition: 3  Leader: 1  Replicas: 1,2,3  Isr: 1,2
Topic: payments  Partition: 7  Leader: 2  Replicas: 2,3,1  Isr: 2,1
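On a cluster with many under-replicated partitions it helps to tally which broker is absent from the ISR most often. The following is a sketch, assuming the describe output carries Replicas: and Isr: fields as in the sample lines; the function name missing_from_isr is hypothetical.

```shell
# Read `kafka-topics.sh --describe --under-replicated-partitions` output on stdin and
# print "<count> <brokerId>" for every broker listed in Replicas but absent from Isr.
missing_from_isr() {
  awk '
    {
      replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        if ($i == "Isr:")      isr = $(i + 1)
      }
      if (replicas == "") next
      n = split(replicas, r, ",")
      for (j = 1; j <= n; j++)
        if (index("," isr ",", "," r[j] ",") == 0) missing[r[j]]++
    }
    END { for (b in missing) print missing[b], b }
  ' | sort -rn
}

# Demonstration on the two sample lines above.
missing_from_isr <<'EOF'
Topic: payments  Partition: 3  Leader: 1  Replicas: 1,2,3  Isr: 1,2
Topic: payments  Partition: 7  Leader: 2  Replicas: 2,3,1  Isr: 2,1
EOF
```

Against a live cluster the same function can take the command's output on a pipe. One broker dominating the tally points at a single-follower problem; counts spread evenly across brokers suggest cluster-wide overload instead.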

The broker missing from Isr (broker 3 here) is your lagging follower. Compare log sizes and offsets across brokers to see how far behind it is (kafka-log-dirs.sh prints JSON; the output below is condensed to the relevant fields):

kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --describe --topic-list payments
broker 1: payments-3  size: 91234881024  offsetLag: 0
broker 3: payments-3  size: 88112440832  offsetLag: 273229

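Critically, check whether a throttle is still in force. Running --verify against the plan reports on throttles (the exact wording varies by Kafka version), and for a completed reassignment it also attempts to clear them: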

kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file plan.json --verify
Status of partition reassignment:
Reassignment of partition payments-3 is completed.
Clearing broker-level throttles on brokers 1,2,3
Throttle was not removed: leader.replication.throttled.rate is still set on broker 1

That last line means a leftover throttle is capping replication. Pull supporting evidence from the leader’s log and the journal:

grep -E "Shrinking ISR|lagging behind" /opt/kafka/logs/server.log | tail -n 20
journalctl -u kafka --since "30 min ago" | grep -iE "gc|disk|i/o error"
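As a cross-check on the throttle state, the dynamic configs can be read directly with kafka-configs.sh --describe and filtered for throttle keys. This is a minimal sketch: the filter name throttle_settings is hypothetical, and the demonstration input is illustrative text, not output captured from a real broker.

```shell
# Keep only replication-throttle settings from `kafka-configs.sh --describe` output.
# grep exits non-zero when nothing matches, i.e. when no throttle is configured.
throttle_settings() {
  grep -E '(leader|follower)\.replication\.throttled\.(rate|replicas)'
}

# Against a live cluster (broker id 1 and topic payments come from the examples above):
#   kafka-configs.sh --bootstrap-server localhost:9092 \
#     --entity-type brokers --entity-name 1 --describe | throttle_settings
#   kafka-configs.sh --bootstrap-server localhost:9092 \
#     --entity-type topics --entity-name payments --describe | throttle_settings

# Offline demonstration on illustrative text:
printf '%s\n' \
  'Dynamic configs for broker 1 are:' \
  '  leader.replication.throttled.rate=1000000 sensitive=false' \
  '  log.cleaner.threads=2 sensitive=false' | throttle_settings
```

The rate keys live on brokers and the throttled.replicas keys live on topics, so both entity types need checking. Any surviving line is a throttle that is still in force.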

Step-by-Step Resolution

A worked example. UnderReplicatedPartitions has sat at 6 for an hour following a partition reassignment the night before.

  1. Scope it. kafka-topics.sh --describe --under-replicated-partitions shows that all six partitions have a replica on broker 3, and that broker 3 is the one missing from every ISR. So this is a single-follower problem, not a cluster-wide overload.
  2. Measure the lag. kafka-log-dirs.sh --describe shows broker 3’s copies trailing by hundreds of thousands of offsets and the gap not shrinking over successive runs.
  3. Check for a throttle. kafka-reassign-partitions.sh --verify against last night’s plan prints “Throttle was not removed” and shows follower.replication.throttled.rate still set. There it is: last night’s reassignment left the throttle behind, and broker 3 is being rate-limited below the live write rate.
  4. Remove the throttle. Delete the leftover leader.replication.throttled.rate and follower.replication.throttled.rate broker configs, and the matching leader.replication.throttled.replicas and follower.replication.throttled.replicas topic configs, with kafka-configs.sh --alter --delete-config. With the cap gone, broker 3’s fetchers run at full speed, the offset gap closes within minutes, the ISR expands back to 1,2,3, and UnderReplicatedPartitions returns to 0.
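Step 4 might look like the following. This sketch prints the kafka-configs.sh commands rather than running them, so they can be reviewed first; the helper names are hypothetical, and the broker ids and topic name are taken from the example. Depending on the Kafka version, deleting a key that is not set on an entity may be rejected as an invalid config, in which case drop that key from the list.

```shell
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"

# Print the command that deletes the broker-level throttle rates, once per broker id.
broker_unthrottle_cmds() {
  for id in "$@"; do
    echo "kafka-configs.sh --bootstrap-server $BOOTSTRAP --entity-type brokers --entity-name $id --alter --delete-config leader.replication.throttled.rate,follower.replication.throttled.rate"
  done
}

# Print the command that deletes the throttled-replica lists on one topic.
topic_unthrottle_cmd() {
  echo "kafka-configs.sh --bootstrap-server $BOOTSTRAP --entity-type topics --entity-name $1 --alter --delete-config leader.replication.throttled.replicas,follower.replication.throttled.replicas"
}

# Review the printed commands first; append `| sh` to each call to run them.
broker_unthrottle_cmds 1 2 3
topic_unthrottle_cmd payments
```

Printing first is a deliberate choice here: removing a throttle on the wrong cluster during an active reassignment would let replication traffic compete with clients, so a dry run is a cheap safeguard.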

If --verify shows no throttle, move to resources: confirm follower disk health and free I/O headroom, check NIC utilization, and if the broker follows many busy partitions, raise num.replica.fetchers (restarting the broker if you change it in server.properties) so it can fetch in parallel. For GC pauses, tune the heap size and the garbage collector settings.

Prevention and Best Practices

  • Always confirm --verify reports the throttle as removed after a reassignment; treat a lingering throttle as an incident, not a cosmetic issue.
  • Alarm on UnderReplicatedPartitions > 0 sustained for more than a few minutes; brief blips during deploys are fine, sustained values are not.
  • Right-size num.replica.fetchers for brokers that follow many partitions, and keep replication and client traffic on adequately provisioned NICs.
  • Monitor per-broker disk latency and I/O wait; a degrading disk shows up as replica lag long before it fully fails.
  • Keep replica.lag.time.max.ms at or near its default (30 seconds in recent Kafka releases) so transient GC pauses do not eject healthy followers, and keep min.insync.replicas=2 with replication factor 3 for durability.
  • Wire under-replication alerts into an automated runbook via the incident response dashboard.

Related Errors

  • NOT_ENOUGH_REPLICAS / NOT_ENOUGH_REPLICAS_AFTER_APPEND - producers with acks=all fail when the ISR drops below min.insync.replicas because of this lag.
  • OfflinePartitionsCount > 0 - the worse outcome when every replica, including the leader, is unavailable.
  • Shrinking ISR followed immediately by Expanding ISR - the benign, transient version of these same log lines.
  • LEADER_NOT_AVAILABLE - what clients see when a leaderless or recovering partition cannot serve requests.

Frequently Asked Questions

Q: How do I know if a leftover throttle is causing my lag? Run kafka-reassign-partitions.sh --verify against the most recent reassignment plan. If it reports that a throttle “was not removed” or that leader.replication.throttled.rate is still set, a stale throttle is rate-limiting replication and must be cleared.

Q: Is a brief ISR shrink something to worry about? No. A shrink that immediately expands back during a deploy, restart, or short GC pause is normal. Worry when the follower stays out of the ISR and UnderReplicatedPartitions remains above zero for several minutes.
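The distinction between a blip and sustained under-replication can be expressed as a streak check over periodic samples of the metric. A minimal sketch, assuming one UnderReplicatedPartitions sample per minute and a five-sample threshold (both are illustrative choices); the function name sustained_urp is hypothetical.

```shell
# Read one UnderReplicatedPartitions sample per line (oldest first) and succeed
# only if the most recent N samples are all above zero.
sustained_urp() {
  awk -v n="$1" '
    { streak = ($1 > 0) ? streak + 1 : 0 }
    END { exit (streak >= n ? 0 : 1) }
  '
}

# Hypothetical one-minute samples: a brief blip, then a sustained run at 6.
if printf '%s\n' 0 2 0 6 6 6 6 6 | sustained_urp 5; then
  echo "sustained under-replication: investigate"
else
  echo "transient: keep watching"
fi
```

The single sample of 2 early in the series resets to zero and does not count, while the trailing run of five non-zero samples trips the check.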

Q: Will adding more replica fetchers always fix lag? Only when fetcher parallelism is the bottleneck, typically on brokers that follow many busy partitions. If the real limit is a saturated disk, a saturated NIC, or a leftover throttle, more fetchers will not help and may add contention.

Q: What is the difference between this and an offline partition? A lagging follower means the partition is under-replicated but still has a working leader and can serve traffic. An offline partition has no available leader at all.
