Kafka Error Guide: 'Partition marked offline'

An offline partition is one of the most disruptive states a Kafka cluster can reach. When the controller metric OfflinePartitionsCount climbs above zero, one or more partitions have no leader, so producers cannot write to them and consumers cannot read from them. This guide explains why partitions go offline, how to confirm the partition state, and how to recover by restoring the replicas that hold the data.

Exact Error Message

The controller and state-change logs make the offline transition explicit. On the active controller you will typically see entries like this in controller.log and state-change.log:

[2026-06-24 14:02:11,873] INFO [Controller id=2] Partition orders-0 marked offline because no live replica is in the ISR (kafka.controller.KafkaController)
[2026-06-24 14:02:11,901] WARN [Controller id=2] Cannot elect leader for partition orders-0: no eligible replica online and unclean.leader.election.enable=false (kafka.controller.PartitionStateMachine)
[2026-06-24 14:02:11,902] TRACE [Controller id=2 epoch=18] Changed state of partition orders-0 from OnlinePartition to OfflinePartition (state.change.logger)

A consuming or producing client sees a downstream symptom such as:

[2026-06-24 14:02:13,440] WARN [Producer clientId=order-svc] Got error produce response with correlation id 884 on topic-partition orders-0, retrying. Error: LEADER_NOT_AVAILABLE (org.apache.kafka.clients.producer.internals.Sender)

The defining signal is the partition state itself: OfflinePartition, no leader, and a non-zero OfflinePartitionsCount JMX gauge.
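
To see when each partition went offline, the state-change entries can be filtered down to a timestamp and a partition name. The following is a minimal sketch that assumes the log lines look like the samples above; the exact wording varies between Kafka versions, so treat the pattern as a starting point. The function name offline_transitions is hypothetical.

```shell
# Print "<timestamp> <partition>" for every transition into OfflinePartition.
# Reads state-change log lines on stdin; the pattern is based on the sample
# lines in this guide and may need adjusting for other Kafka versions.
offline_transitions() {
  sed -n 's/^\[\([^]]*\)\].*Changed state of partition \([^ ]*\) from [A-Za-z]* to OfflinePartition.*/\1 \2/p'
}

# Usage (path taken from the grep example in this guide):
#   offline_transitions < /var/log/kafka/state-change.log
```

Lines that do not describe a move into OfflinePartition are dropped, so the output is a short timeline to correlate with broker or disk failures.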

What the Error Means

Every Kafka partition has a single leader replica that handles all reads and writes. The controller elects that leader from the in-sync replica set (ISR). A partition becomes offline when the controller cannot find any eligible replica to serve as leader. That is fundamentally a partition state problem, not a transient routing problem.

This distinguishes it from a stale-metadata or “not leader” situation on the client side, where a leader does exist but the client is talking to the wrong broker. With an offline partition, there is genuinely no leader anywhere in the cluster. kafka-topics.sh --describe will literally show Leader: none, and frequently an empty or shrunken Isr list. Until a replica is restored or promoted, the partition stays dark.

Common Causes

All replicas for the partition are down. Every broker hosting a replica of the partition is offline (crashed, stopped, or partitioned from the cluster). With no live replica, there is nothing to elect.
The log directory holding the partition went offline. A failed disk or a KafkaStorageException earlier in the broker’s life marks a log dir offline. The broker stays up, but the partitions on that dir are dropped from the ISR and can become leaderless.
unclean.leader.election.enable=false after leader loss. The previous leader failed and no remaining replica was in the ISR. With unclean election disabled (the safe default), the controller refuses to promote an out-of-sync replica, so the partition stays offline to avoid data loss.
Broker crash on the leader with a lagging follower set. The leader died while followers were behind, so the ISR collapsed to just the now-dead leader.
Controller cannot find an eligible leader. Even with brokers up, ISR bookkeeping or a stuck controller can leave the partition without a usable candidate.

How to Reproduce the Error

In a lab cluster you can deterministically create an offline partition:

Create a topic with replication factor 1: a single replica means a single point of failure.
Identify which broker hosts the partition.
Stop that broker. Because there is no other replica, the partition immediately has no leader.

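With a replication factor greater than 1, stop every broker in the partition’s replica list. Alternatively, keep unclean.leader.election.enable=false, stop the followers first so the ISR shrinks to the leader alone, and then stop the leader. Note that min.insync.replicas does not shrink the ISR; it only controls how many in-sync replicas an acks=all write requires. In either case, OfflinePartitionsCount moves above zero as soon as the controller detects the broker failure.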
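
The replication-factor-1 steps above can be sketched as follows. The topic name offline-lab, the broker address, and the helper leader_of are assumptions for illustration; the helper only parses kafka-topics.sh --describe output, so the lab commands are left as comments to run by hand.

```shell
# Print the leader broker id (or "none") for one partition.
# usage: leader_of TOPIC PARTITION < output-of-kafka-topics-describe
leader_of() {
  awk -v topic="$1" -v part="$2" '
    {
      t = p = l = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     t = $(i + 1)
        if ($i == "Partition:") p = $(i + 1)
        if ($i == "Leader:")    l = $(i + 1)
      }
      if (t == topic && p == part) print l
    }'
}

# Lab-only sketch (placeholder topic and broker address):
#   kafka-topics.sh --bootstrap-server broker1:9092 --create \
#     --topic offline-lab --partitions 1 --replication-factor 1
#   kafka-topics.sh --bootstrap-server broker1:9092 --describe \
#     --topic offline-lab | leader_of offline-lab 0
#   # Stop the broker with the printed id, then repeat the describe from a
#   # surviving broker: leader_of should now print "none".
```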

Diagnostic Commands

Start by listing partitions that have no leader. This is the single most useful command:

kafka-topics.sh --bootstrap-server broker1:9092 \
  --describe --unavailable-partitions

Topic: orders   Partition: 0   Leader: none   Replicas: 3,4   Isr:

Leader: none with an empty Isr confirms an offline partition whose replicas are all unavailable. Compare against under-replicated partitions to understand the broader blast radius:

kafka-topics.sh --bootstrap-server broker1:9092 \
  --describe --under-replicated-partitions

Describe the full topic to see the replica assignment versus the ISR:

kafka-topics.sh --bootstrap-server broker1:9092 \
  --describe --topic orders

Topic: orders  PartitionCount: 6  ReplicationFactor: 2
  Topic: orders  Partition: 0  Leader: none  Replicas: 3,4  Isr:
  Topic: orders  Partition: 1  Leader: 1     Replicas: 1,2  Isr: 1,2
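
On a topic with many partitions, comparing Replicas against Isr by eye gets tedious. A rough sketch of a filter that labels each partition as offline, under-replicated, or ok is shown below. It assumes the label layout shown in the describe output above; classify_partitions is a hypothetical helper name, not part of Kafka.

```shell
# Label each partition from `kafka-topics.sh --describe` output read on stdin.
classify_partitions() {
  awk '
    {
      topic = part = leader = replicas = isr = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Leader:")    leader = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        # An empty Isr is followed by nothing or by the next "Label:" token.
        if ($i == "Isr:" && i < NF && $(i + 1) !~ /:$/) isr = $(i + 1)
      }
      if (part == "") next                     # skip the per-topic summary line
      nr = split(replicas, r, ",")
      ni = split(isr, s, ",")
      # Assumption: some older releases print -1 instead of "none".
      if (leader == "none" || leader == "-1") { state = "offline" }
      else if (ni < nr) { state = "under-replicated" }
      else { state = "ok" }
      print topic "-" part, state, "isr=" ni "/" nr
    }'
}

# Usage:
#   kafka-topics.sh --bootstrap-server broker1:9092 --describe --topic orders \
#     | classify_partitions
```

For the two partitions shown above this prints "orders-0 offline isr=0/2" and "orders-1 ok isr=2/2".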

Check whether a log directory has gone offline on the brokers that host the replicas (3 and 4 here):

kafka-log-dirs.sh --bootstrap-server broker1:9092 \
  --describe --broker-list 3,4

{"brokers":[{"broker":3,"logDirs":[
  {"logDir":"/var/kafka/data-1","error":"KafkaStorageException","partitions":[]}
]}]}

An error field on a log dir means the disk or directory is offline. Confirm the brokers are actually running and inspect their logs:

ss -ltnp | grep 9092
journalctl -u kafka --since "20 min ago" | grep -Ei "storage|offline|shutdown"
grep -E "marked offline|OfflinePartition|no eligible" \
  /var/log/kafka/controller.log /var/log/kafka/state-change.log
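
The log-dir check can also be reduced to a list of failed directories per broker. The sketch below assumes the JSON shape shown in the kafka-log-dirs.sh output above, where a healthy directory carries a null error and a failed one carries an exception name. It uses plain awk pattern matching rather than a JSON parser, so it is a convenience filter rather than a robust implementation; offline_log_dirs is a hypothetical name.

```shell
# Print "<broker-id> <log-dir> <error>" for every log dir with a non-null error.
# Reads the output of `kafka-log-dirs.sh --describe` on stdin.
offline_log_dirs() {
  tr -d '\n' | awk '
    {
      # Split the document into one chunk per broker object.
      n = split($0, brokers, "\"broker\":")
      for (i = 2; i <= n; i++) {
        id = brokers[i]
        sub(/,.*/, "", id)                     # keep only the broker id
        rest = brokers[i]
        # A quoted error value means the dir is offline; null does not match.
        while (match(rest, /"logDir":"[^"]*","error":"[^"]*"/)) {
          entry = substr(rest, RSTART, RLENGTH)
          rest = substr(rest, RSTART + RLENGTH)
          split(entry, f, "\"")
          print id, f[4], f[8]
        }
      }
    }'
}

# Usage:
#   kafka-log-dirs.sh --bootstrap-server broker1:9092 \
#     --describe --broker-list 3,4 | offline_log_dirs
```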

Step-by-Step Resolution

A worked recovery for orders-0 with replicas on brokers 3 and 4:

Confirm the state. kafka-topics.sh --describe --unavailable-partitions shows Leader: none, Isr: empty. The replica list is 3,4.
Locate the replicas. Both brokers 3 and 4 host the only copies of the data. The fix is to bring at least one of them back into a healthy state.
Check why they are down. journalctl -u kafka on broker 3 shows a KafkaStorageException for /var/kafka/data-1. Broker 4 is fully stopped.
Recover the offline log dir. If the data is intact and only the mount flaked, remount it cleanly and restart broker 3 so it re-registers the log dir and reloads the partition from disk. A replica with intact data rejoins the ISR after startup. If the disk itself failed and has to be replaced, the replica on broker 3 comes back empty and can only be rebuilt by replicating from broker 4 once that broker is leading the partition.
Restart the stopped broker. Bring broker 4 back online. As soon as one in-sync replica is available, the controller elects a leader and the partition transitions from OfflinePartition back to OnlinePartition.
Verify recovery. Re-run kafka-topics.sh --describe --topic orders. Leader should now show a broker id and Isr should repopulate. OfflinePartitionsCount returns to zero.
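
The verification step can be automated as a small polling loop. This is a sketch under stated assumptions: describe_topic wraps the describe command used throughout this guide (broker address and topic are the guide's examples), and wait_for_leader is a hypothetical helper that re-checks until the partition reports a leader or the attempts run out.

```shell
# Command whose output is inspected; redefine it to point at your cluster.
describe_topic() {
  kafka-topics.sh --bootstrap-server broker1:9092 --describe --topic orders
}

# usage: wait_for_leader PARTITION [ATTEMPTS] [SLEEP_SECONDS]
wait_for_leader() {
  part=$1; attempts=${2:-30}; pause=${3:-10}
  n=0
  while [ "$n" -lt "$attempts" ]; do
    leader=$(describe_topic 2>/dev/null | awk -v want="$part" '
      {
        p = l = ""
        for (i = 1; i < NF; i++) {
          if ($i == "Partition:") p = $(i + 1)
          if ($i == "Leader:")    l = $(i + 1)
        }
        if (p != "" && p == want) print l
      }')
    case $leader in
      ''|none|-1) ;;                           # still leaderless, keep waiting
      *) echo "partition $part has leader $leader"; return 0 ;;
    esac
    n=$((n + 1))
    sleep "$pause"
  done
  echo "partition $part still has no leader after $attempts checks" >&2
  return 1
}

# Usage: wait_for_leader 0 30 10
```

A non-zero exit status means the partition is still offline, which makes the function easy to chain into a restart runbook.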

If both replicas are permanently lost, the only ways to restore availability are restoring the brokers’ data from backup, or enabling unclean leader election to promote a stale replica at the cost of losing the most recent records. Unclean election is a data-loss decision: weigh it against your durability requirements before touching unclean.leader.election.enable. Prefer restoring replicas whenever the data still exists.
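
If unclean election is the chosen path, one way to limit the blast radius is a topic-level override that is removed again once a leader has been elected. The sketch below wraps kafka-configs.sh in a dry-run guard so the data-loss decision has to be made explicitly. The function name unclean_override and the EXECUTE variable are assumptions for illustration, and how quickly the controller acts on the change may depend on the Kafka version.

```shell
# usage: unclean_override enable|disable TOPIC [BOOTSTRAP]
# Prints the command unless EXECUTE=yes is set in the environment.
unclean_override() {
  action=$1; topic=$2; bootstrap=${3:-broker1:9092}
  case $action in
    enable)  change="--add-config unclean.leader.election.enable=true" ;;
    disable) change="--delete-config unclean.leader.election.enable" ;;
    *) echo "usage: unclean_override enable|disable TOPIC [BOOTSTRAP]" >&2
       return 2 ;;
  esac
  # $change is left unquoted on purpose so it splits into flag and value.
  set -- kafka-configs.sh --bootstrap-server "$bootstrap" --alter \
    --entity-type topics --entity-name "$topic" $change
  if [ "${EXECUTE:-no}" = "yes" ]; then
    "$@"
  else
    echo "DRY RUN: $*"
  fi
}

# Usage:
#   unclean_override enable orders                # prints the command only
#   EXECUTE=yes unclean_override enable orders    # accepts possible data loss
#   EXECUTE=yes unclean_override disable orders   # remove the override afterwards
```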

Prevention and Best Practices

Run replication factor 3 with min.insync.replicas=2 so a single broker or disk failure never leaves a partition leaderless.
Spread replicas across racks or availability zones using rack awareness so a correlated failure does not take out every replica of a partition.
Alert directly on OfflinePartitionsCount (a controller metric) and UnderReplicatedPartitions (reported by each broker); both should sit at zero at steady state.
Monitor disk health and offline log directories (for example, the OfflineLogDirectoryCount metric) so storage failures are caught before they cascade.
Keep unclean.leader.election.enable=false as the default and treat any need for it as an incident, not routine operation.
Track these signals on a single pane such as /dashboard/incident-response/ so on-call sees offline partitions the moment they appear.
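
The replication-factor recommendation above can be audited from the per-topic summary line of kafka-topics.sh --describe. A minimal sketch follows, assuming the summary line carries a ReplicationFactor: label as in the describe output shown earlier; audit_replication is a hypothetical helper name.

```shell
# Print topics whose replication factor is below a threshold (default 3).
# usage: audit_replication [MIN_RF] < output-of-kafka-topics-describe
audit_replication() {
  awk -v min_rf="${1:-3}" '
    {
      topic = rf = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")             topic = $(i + 1)
        if ($i == "ReplicationFactor:") rf = $(i + 1)
      }
      # Only summary lines carry ReplicationFactor; partition lines are skipped.
      if (rf != "" && rf + 0 < min_rf + 0) print topic, "ReplicationFactor=" rf
    }'
}

# Usage:
#   kafka-topics.sh --bootstrap-server broker1:9092 --describe | audit_replication 3
#
# New topics can be created with the recommended settings directly
# (topic name and partition count are placeholders):
#   kafka-topics.sh --bootstrap-server broker1:9092 --create --topic orders-v2 \
#     --partitions 6 --replication-factor 3 --config min.insync.replicas=2
```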

Related Errors

NOT_LEADER_OR_FOLLOWER / stale client metadata: A leader exists but the client targets the wrong broker. Self-heals via metadata refresh, unlike an offline partition.
KafkaStorageException: The upstream cause when a log dir fails; it removes partitions from the ISR and can lead directly to offline partitions.
UnderReplicatedPartitions > 0: A warning state where the partition still has a leader but fewer ISR members than configured. Left unaddressed it can degrade into offline partitions.
LEADER_NOT_AVAILABLE: The client-visible symptom returned while a partition is offline or mid-election.

Frequently Asked Questions

Q: What exactly does OfflinePartitionsCount measure? It is a controller JMX gauge counting partitions that currently have no leader. Any value above zero means those partitions are unavailable for produce and consume until a leader is elected. It should always be zero at steady state.

Q: Why won’t Kafka just pick a new leader automatically? The controller only elects from in-sync replicas by default. If every ISR member is down, there is no safe candidate. With unclean.leader.election.enable=false, it deliberately refuses to promote a stale replica rather than risk losing committed data.

Q: Will restarting the brokers always fix an offline partition? It fixes the common cases where brokers crashed or were stopped, because restarting brings the replicas back into the ISR. It will not help if the underlying disk that held the only copies is permanently destroyed; that requires restoring from backup or accepting unclean election.

Q: How is this different from a NOT_LEADER_OR_FOLLOWER error on my producer? NOT_LEADER_OR_FOLLOWER means a leader exists but your client cached the wrong one and will retry after a metadata refresh. An offline partition means no leader exists anywhere, so retries alone cannot recover it; you must restore or promote a replica.
