Kafka Error Guide: 'Stopping serving logs in dir /var/lib/kafka' Log Directory Failure
Fix Kafka KafkaStorageException log directory failures: diagnose disk errors, full volumes, bad permissions, and offline JBOD log dirs marked dead by the broker.
- #kafka
- #troubleshooting
- #errors
- #storage
Exact Error Message
When a Kafka broker can no longer write to one of its configured log.dirs, the LogManager marks that directory offline and you will see this in the broker server.log:
[2026-06-29 03:14:52,118] ERROR Error while writing to checkpoint file /var/lib/kafka/replication-offset-checkpoint (kafka.server.LogDirFailureChannel)
java.io.IOException: Input/output error
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
at kafka.server.checkpoints.CheckpointFile.write(CheckpointFile.scala:84)
[2026-06-29 03:14:52,121] ERROR Disk error while writing to recovery point file in directory /var/lib/kafka (kafka.log.LogManager)
[2026-06-29 03:14:52,134] ERROR Stopping serving logs in dir /var/lib/kafka (kafka.log.LogManager)
org.apache.kafka.common.errors.KafkaStorageException: Error while writing to checkpoint file /var/lib/kafka/replication-offset-checkpoint
[2026-06-29 03:14:52,901] WARN Stopping serving replicas in dir /var/lib/kafka (kafka.server.ReplicaManager)
[2026-06-29 03:14:53,005] ERROR Shutdown broker because all log dirs in /var/lib/kafka have failed (kafka.log.LogManager)
The signature line is Stopping serving logs in dir ... paired with a KafkaStorageException. If every directory in log.dirs fails, the broker shuts down entirely with Shutdown broker because all log dirs ... have failed.
What the Error Means
Kafka treats each path in log.dirs as an independent storage volume (the JBOD model). The LogDirFailureChannel watches for IO failures on each directory. When a write to a segment, index, or checkpoint file throws an IOException, Kafka does not retry forever — it raises a KafkaStorageException, marks the entire directory offline, and stops serving every partition replica that lived on it.
If at least one other healthy log dir remains, the broker keeps running with reduced capacity and the affected partitions become under-replicated (the controller elects new leaders from other brokers). If the failed directory was the only one, the broker process exits, because a broker with no usable storage cannot function.
This is a storage-layer fault, not a Kafka logic bug. The broker is telling you the underlying filesystem rejected a write.
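The per-directory behaviour described above can be pictured with a toy model. This is only an illustrative sketch, not Kafka's actual implementation: the class name `BrokerStorage`, its method, and the `/data/kafka-a` and `/data/kafka-b` paths are all hypothetical.

```python
class BrokerStorage:
    """Toy model of per-directory failure handling (not Kafka's real code)."""

    def __init__(self, log_dirs):
        # Every configured directory starts online.
        self.online = set(log_dirs)
        self.offline = set()
        self.running = True

    def report_io_failure(self, log_dir):
        """Mark one directory offline; stop the broker if none remain online."""
        if log_dir in self.online:
            self.online.remove(log_dir)
            self.offline.add(log_dir)
        if not self.online:
            # No usable storage left, so the broker cannot keep running.
            self.running = False
        return self.running


storage = BrokerStorage(["/data/kafka-a", "/data/kafka-b"])
storage.report_io_failure("/data/kafka-a")
print(storage.running, sorted(storage.offline))  # one dir lost, broker still up
storage.report_io_failure("/data/kafka-b")
print(storage.running)  # last dir lost, broker stops
```

The point of the model is that the unit of failure is the whole directory, and that the broker's fate depends only on whether any directory is still online.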
Common Causes
- Physical disk failure. A failing or failed block device returns Input/output error (EIO) on write. SMART pending-sector or reallocated-sector counts are climbing.
- Full volume. The filesystem hit 100% and writes fail with No space left on device (ENOSPC). Kafka surfaces this as a storage exception just like a hardware fault.
- Wrong ownership or permissions. After a manual restore, package upgrade, or a volume remount, the log.dirs path is no longer owned by the kafka user, so writes fail with Permission denied.
- Volume unmounted or remounted read-only. A flapping EBS/SAN volume or an ext4 filesystem that flipped to ro after detecting corruption stops accepting writes.
- Filesystem corruption. Underlying ext4/xfs metadata corruption causes intermittent IO errors on specific inodes.
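As a rough aid for sorting a failure into one of the causes above, the sketch below maps the operating-system error text found in a log line to a likely cause. It assumes the exception message carries the standard Linux error string, as the Input/output error example in the log excerpt does; the function and dictionary names are hypothetical, and EIO alone cannot distinguish a dying disk from filesystem corruption.

```python
import errno

# errno value -> likely cause, following the categories listed above.
CAUSE_BY_ERRNO = {
    errno.EIO: "physical disk failure or filesystem corruption",
    errno.ENOSPC: "full volume",
    errno.EACCES: "wrong ownership or permissions",
    errno.EROFS: "volume remounted read-only",
}

# The same faults as they appear in log text (standard Linux error strings).
CAUSE_BY_MESSAGE = {
    "Input/output error": CAUSE_BY_ERRNO[errno.EIO],
    "No space left on device": CAUSE_BY_ERRNO[errno.ENOSPC],
    "Permission denied": CAUSE_BY_ERRNO[errno.EACCES],
    "Read-only file system": CAUSE_BY_ERRNO[errno.EROFS],
}


def classify_log_line(line):
    """Return a likely cause for one log line, or 'unclassified'."""
    for message, cause in CAUSE_BY_MESSAGE.items():
        if message in line:
            return cause
    return "unclassified"


print(classify_log_line("java.io.IOException: Input/output error"))
```

Treat the result as a starting hypothesis to confirm with the diagnostics, not as a verdict.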
How to Reproduce the Error
In a disposable test environment you can trigger the same offline-directory path safely. Fill the volume that backs a single non-critical log dir:
# Test broker only. Fill the filesystem backing one log dir until writes fail.
fallocate -l $(df --output=avail -B1 /var/lib/kafka | tail -1) /var/lib/kafka/fillfile
Then produce to a topic whose partition lives in that directory. The next segment roll or checkpoint write fails with ENOSPC, the LogManager logs Stopping serving logs in dir /var/lib/kafka, and the partition goes offline. A permissions variant is equally illustrative: chown root:root on the directory and restart the broker — startup write checks fail immediately. Remove the fill file or restore ownership to recover the test box.
Diagnostic Commands
All commands below are read-only inspections. Start with the broker’s own view of which directories are alive.
# Ask the broker which log dirs are online/offline and their sizes.
# The tool prints status lines before the JSON, so keep only the JSON line.
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
--describe --broker-list 1 | grep '^{' | python3 -m json.tool | head -40
# Capacity and inode usage on each log dir
df -h /var/lib/kafka
df -i /var/lib/kafka
du -sh /var/lib/kafka/* 2>/dev/null | sort -rh | head
# Ownership and mount state of the log dir
ls -la /var/lib/kafka
findmnt -T /var/lib/kafka
# Pull the storage failure lines from the broker log
grep -E "Stopping serving logs|KafkaStorageException|all log dirs|Input/output error|No space left" \
/var/log/kafka/server.log | tail -30
# Kernel and SMART evidence of a failing device
journalctl -k --since "1 hour ago" | grep -iE "EXT4-fs error|I/O error|remount|blk_update"
sudo smartctl --health /dev/nvme1n1
sudo smartctl -A /dev/nvme1n1 | grep -iE "reallocated|pending|crc|media"
If smartctl --health reports FAILED or pending sectors are nonzero, treat the disk as bad. If df shows 100%, it is a capacity problem. If findmnt shows ro in the options, the filesystem went read-only.
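To pick out offline directories from the kafka-log-dirs.sh output without reading the JSON by eye, a small parser along these lines may help. It is a sketch under assumptions: it expects the JSON document on a single line after the tool's status lines, with a `brokers` list whose `logDirs` entries carry `logDir` and `error` fields, where a non-null `error` is taken to mean the directory is offline. The function name and the `/data/kafka-b` path are hypothetical.

```python
import json


def offline_log_dirs(describe_output):
    """Return (broker_id, log_dir, error) for every dir reporting an error."""
    # Skip the informational lines and parse the first line that looks like JSON.
    for line in describe_output.splitlines():
        line = line.strip()
        if line.startswith("{"):
            doc = json.loads(line)
            break
    else:
        raise ValueError("no JSON document found in the output")

    failed = []
    for broker in doc.get("brokers", []):
        for log_dir in broker.get("logDirs", []):
            if log_dir.get("error"):  # null/absent means the dir is healthy
                failed.append(
                    (broker.get("broker"), log_dir.get("logDir"), log_dir["error"])
                )
    return failed


# Hand-written sample in the assumed shape, not captured from a real broker.
sample = (
    "Querying brokers for log directories information\n"
    "Received log directory information from brokers 1\n"
    '{"version":1,"brokers":[{"broker":1,"logDirs":['
    '{"logDir":"/var/lib/kafka",'
    '"error":"org.apache.kafka.common.errors.KafkaStorageException",'
    '"partitions":[]},'
    '{"logDir":"/data/kafka-b","error":null,"partitions":[]}]}]}\n'
)
print(offline_log_dirs(sample))
```

In practice the string would come from capturing the command's standard output rather than from a literal.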
Step-by-Step Resolution
- Identify the offline directory. Use kafka-log-dirs.sh --describe and the server.log grep to confirm exactly which path failed and which partitions it held.
- Classify the fault from the diagnostics: full volume (ENOSPC), bad disk (EIO + SMART), permissions (Permission denied), or read-only remount.
- For a full volume: extend the volume or expire data faster (lower retention.ms / retention.bytes on the largest topics). Confirm with df -h that headroom returns before restarting the broker.
- For permissions: restore ownership with sudo chown -R kafka:kafka /var/lib/kafka and ensure the mount is not read-only.
- For a bad disk (JBOD): drain the broker. Replace the device, recreate and mount the directory with correct ownership, then let Kafka re-replicate the partitions from healthy brokers on restart.
- Bring the directory back online. A directory marked offline is only re-evaluated on broker restart, so once the underlying issue is fixed, restart the broker. Kafka rescans log.dirs, recovers clean directories, and refills the recovered partitions from replicas.
- Verify recovery. Re-run kafka-log-dirs.sh --describe to confirm the directory is online and check that kafka-topics.sh --describe --under-replicated-partitions clears once replication catches up.
Note the log.dir.failure.timeout.ms broker setting (default 30000 ms), which was introduced with JBOD support for KRaft clusters: if the broker cannot successfully tell the controller about a failed log directory within this time, the broker itself fails and shuts down, so leadership is not left stranded on replicas the controller still believes are healthy. It does not bring the directory back automatically; recovery still requires a restart after the fault is fixed.
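Before restarting the broker, it can be worth confirming that the repaired directory actually passes the basic checks from the steps above. The following is a minimal read-only sketch for Linux; the function name, the `kafka` service user default, and the 10% free-space floor are assumptions to adjust for your environment.

```python
import os
import pwd
import shutil


def preflight(log_dir, expected_user="kafka", min_free_fraction=0.10):
    """Return a list of problems that would likely take the dir offline again."""
    if not os.path.isdir(log_dir):
        return [f"{log_dir} is missing or not a directory"]

    problems = []

    # Ownership: resolve the owning uid to a user name where possible.
    uid = os.stat(log_dir).st_uid
    try:
        owner = pwd.getpwuid(uid).pw_name
    except KeyError:
        owner = str(uid)
    if owner != expected_user:
        problems.append(f"{log_dir} is owned by {owner}, expected {expected_user}")

    # Mount state: ST_RDONLY is set when the filesystem is mounted read-only.
    if os.statvfs(log_dir).f_flag & os.ST_RDONLY:
        problems.append(f"{log_dir} is on a read-only filesystem")

    # Capacity: require some headroom before Kafka starts writing again.
    usage = shutil.disk_usage(log_dir)
    if usage.total and usage.free / usage.total < min_free_fraction:
        problems.append(f"{log_dir} has less than {min_free_fraction:.0%} free")

    return problems


for problem in preflight("/var/lib/kafka"):
    print(problem)
```

An empty result only means these three checks passed; it says nothing about the health of the underlying device, which still needs the SMART and kernel-log evidence.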
Prevention and Best Practices
- Alert on df -h headroom (page at 80%, hard-stop policy at 90%) and on kafka.server:type=ReplicaManager,name=OfflineReplicaCount so you see a failed log dir before the broker shuts down.
- Run SMART monitoring (smartctl / smartd) on the data devices and replace disks proactively at the first pending-sector growth.
- Size retention so a single hot topic cannot fill the whole volume; set per-topic retention.bytes on high-throughput topics.
- Keep replication factor at 3 with min.insync.replicas=2 so a single offline log dir never causes data loss, only temporary under-replication.
- Pin ownership and mount options in configuration management so a remount or package upgrade cannot silently change them; mount data volumes with explicit, persistent /etc/fstab entries.
- For triage help turning a KafkaStorageException and the broker log into a likely cause, the free incident assistant can summarize the failure pattern.
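The 80% and 90% headroom thresholds above translate directly into a small check that a cron job or monitoring agent could run. This is a sketch with hypothetical function names; note that it computes used/total, which can differ slightly from the Use% column of df because df accounts for blocks reserved for root.

```python
import shutil


def usage_level(used_fraction, page_at=0.80, stop_at=0.90):
    """Map a used-space fraction to 'ok', 'page', or 'hard-stop'."""
    if used_fraction >= stop_at:
        return "hard-stop"
    if used_fraction >= page_at:
        return "page"
    return "ok"


def check_volume(path):
    """Classify the filesystem backing `path` against the thresholds."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo-filesystems can report zero size; treat as not applicable.
        return "ok"
    return usage_level(usage.used / usage.total)


print(check_volume("/"))
```

Pointing `check_volume` at each path in log.dirs gives one result per volume, which matters on JBOD hosts where a single full disk is enough to take a directory offline.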
Related Errors
- Found a corrupted index/segment: segment-level corruption discovered on load, distinct from a whole-directory IO failure.
- Error while flushing log for topic-0: an fsync failure that may precede a full directory failure if the disk is stalling.
- Could not recover log: recovery failure after a crash, often on a disk that is also throwing IO errors.
Frequently Asked Questions
Does a single failed log dir take the whole broker down?
Only if it is the last healthy directory. With multiple log.dirs, Kafka keeps serving the others and shuts down only when all have failed.
Will the directory come back on its own once the disk recovers?
No. An offline log dir is re-scanned only on broker restart. Fix the underlying storage, then restart the broker.
What does log.dir.failure.timeout.ms actually control?
It is how long a broker may go without successfully notifying the controller that one of its log directories has failed. If that time (30000 ms by default) passes without the notification getting through, the broker fails and shuts down. It does not auto-recover the directory.
How do I avoid data loss from a bad disk?
Run replication factor 3 with min.insync.replicas=2. A lost log dir then only causes under-replication, and partitions refill from other brokers after you replace the disk and restart.