AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Stopping serving logs in dir /var/lib/kafka' Log Directory Failure

Fix Kafka KafkaStorageException log directory failures: diagnose disk errors, full volumes, bad permissions, and offline JBOD log dirs marked dead by the broker.

  • #kafka
  • #troubleshooting
  • #errors
  • #storage

Exact Error Message

When a Kafka broker can no longer write to one of its configured log.dirs, the LogManager marks that directory offline and you will see this in the broker server.log:

[2026-06-29 03:14:52,118] ERROR Error while writing to checkpoint file /var/lib/kafka/replication-offset-checkpoint (kafka.server.LogDirFailureChannel)
java.io.IOException: Input/output error
        at java.base/java.io.FileOutputStream.writeBytes(Native Method)
        at kafka.server.checkpoints.CheckpointFile.write(CheckpointFile.scala:84)
[2026-06-29 03:14:52,121] ERROR Disk error while writing to recovery point file in directory /var/lib/kafka (kafka.log.LogManager)
[2026-06-29 03:14:52,134] ERROR Stopping serving logs in dir /var/lib/kafka (kafka.log.LogManager)
org.apache.kafka.common.errors.KafkaStorageException: Error while writing to checkpoint file /var/lib/kafka/replication-offset-checkpoint
[2026-06-29 03:14:52,901] WARN Stopping serving replicas in dir /var/lib/kafka (kafka.server.ReplicaManager)
[2026-06-29 03:14:53,005] ERROR Shutdown broker because all log dirs in /var/lib/kafka have failed (kafka.log.LogManager)

The signature line is Stopping serving logs in dir ... paired with a KafkaStorageException. If every directory in log.dirs fails, the broker shuts down entirely with Shutdown broker because all log dirs ... have failed.
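
As a rough sketch of how that signature line can be used, the helper below pulls the affected directory out of each Stopping serving logs in dir line. The function name offline_dirs_from_log is made up for illustration (it is not a Kafka tool), and it assumes the path contains no spaces.

```shell
# Hypothetical helper, not part of Kafka: list each distinct directory
# that the broker log reports as offline. Reads log text on stdin.
# Assumes directory paths contain no spaces.
offline_dirs_from_log() {
  sed -n 's/.*Stopping serving logs in dir \([^ ]*\).*/\1/p' | sort -u
}
```

It could be run as offline_dirs_from_log < /var/log/kafka/server.log, using the log path that appears elsewhere in this guide.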

What the Error Means

Kafka treats each path in log.dirs as an independent storage volume (the JBOD model). The LogDirFailureChannel watches for IO failures on each directory. When a write to a segment, index, or checkpoint file throws an IOException, Kafka does not retry forever — it raises a KafkaStorageException, marks the entire directory offline, and stops serving every partition replica that lived on it.

If at least one other healthy log dir remains, the broker keeps running with reduced capacity and the affected partitions become under-replicated (the controller elects new leaders from other brokers). If the failed directory was the only one, the broker process exits, because a broker with no usable storage cannot function.
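
Whether a failure degrades the broker or stops it therefore depends on how many entries log.dirs has. A minimal sketch for checking that from a broker properties file follows; the function name count_log_dirs is hypothetical, and the file location is whatever your installation uses. It falls back to log.dir when log.dirs is not set, on the understanding that the broker does the same.

```shell
# Hypothetical helper: count the comma-separated entries of log.dirs in a
# broker properties file, falling back to log.dir when log.dirs is absent.
count_log_dirs() {
  local props="$1" value
  value=$(sed -n 's/^[[:space:]]*log\.dirs[[:space:]]*=[[:space:]]*//p' "$props" | tail -n 1)
  if [ -z "$value" ]; then
    value=$(sed -n 's/^[[:space:]]*log\.dir[[:space:]]*=[[:space:]]*//p' "$props" | tail -n 1)
  fi
  # One entry per line, then count the non-blank lines.
  printf '%s\n' "$value" | tr ',' '\n' | grep -c '[^[:space:]]'
}
```

A result of 1 means any log dir failure takes the broker down; 2 or more means the broker can keep serving the surviving directories.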

This is a storage-layer fault, not a Kafka logic bug. The broker is telling you the underlying filesystem rejected a write.

Common Causes

  • Physical disk failure. A failing or failed block device returns Input/output error (EIO) on write. In this case SMART pending-sector or reallocated-sector counts are usually climbing.
  • Full volume. The filesystem hit 100% and writes fail with No space left on device (ENOSPC). Kafka surfaces this as a storage exception just like a hardware fault.
  • Wrong ownership or permissions. After a manual restore, package upgrade, or a volume remount, the log.dirs path is no longer owned by the kafka user, so writes fail with Permission denied.
  • Volume unmounted or remounted read-only. A flapping EBS/SAN volume or an ext4 filesystem that flipped to ro after detecting corruption stops accepting writes.
  • Filesystem corruption. Underlying ext4/xfs metadata corruption causes intermittent IO errors on specific inodes.
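
The causes above mostly announce themselves through the OS error text inside the exception. The sketch below maps that text to a rough category. The function name and the category labels are invented for illustration, and Read-only file system is the standard Linux message for a read-only mount rather than something quoted in this guide. An Input/output error can mean either a failing device or filesystem corruption, so that label is only a starting point.

```shell
# Hypothetical helper: map the OS error text found in broker log lines
# (read from stdin) to a rough fault category. Labels are illustrative.
classify_storage_fault() {
  local text
  text=$(cat)
  case "$text" in
    *"No space left on device"*) echo "full-volume" ;;
    *"Permission denied"*)       echo "permissions" ;;
    *"Read-only file system"*)   echo "read-only" ;;
    *"Input/output error"*)      echo "io-error" ;;
    *)                           echo "unknown" ;;
  esac
}
```

For example, grep -A2 KafkaStorageException /var/log/kafka/server.log | classify_storage_fault would print one label for the matched lines.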

How to Reproduce the Error

In a disposable test environment you can trigger the same offline-directory path safely. Fill the volume that backs a single non-critical log dir:

# Test broker only. Fill the filesystem backing one log dir until writes fail.
fallocate -l $(df --output=avail -B1 /var/lib/kafka | tail -1) /var/lib/kafka/fillfile

Then produce to a topic whose partition lives in that directory. The next segment roll or checkpoint write fails with ENOSPC, the LogManager logs Stopping serving logs in dir /var/lib/kafka, and the partition goes offline. A permissions variant is equally illustrative: chown root:root on the directory and restart the broker — startup write checks fail immediately. Remove the fill file or restore ownership to recover the test box.

Diagnostic Commands

All commands below are read-only inspections. Start with the broker’s own view of which directories are alive.

# Ask the broker which log dirs are online/offline and their sizes.
# The tool prints status lines before the JSON, so keep only the JSON line.
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --describe --broker-list 1 | grep '^{' | python3 -m json.tool | head -40
# Capacity and inode usage on each log dir
df -h /var/lib/kafka
df -i /var/lib/kafka
du -sh /var/lib/kafka/* 2>/dev/null | sort -rh | head
# Ownership and mount state of the log dir
ls -la /var/lib/kafka
findmnt -T /var/lib/kafka
# Pull the storage failure lines from the broker log
grep -E "Stopping serving logs|KafkaStorageException|all log dirs|Input/output error|No space left" \
  /var/log/kafka/server.log | tail -30
# Kernel and SMART evidence of a failing device
journalctl -k --since "1 hour ago" | grep -iE "EXT4-fs error|I/O error|remount|blk_update"
sudo smartctl --health /dev/nvme1n1
sudo smartctl -A /dev/nvme1n1 | grep -iE "reallocated|pending|crc|media"

If smartctl --health reports FAILED or pending sectors are nonzero, treat the disk as bad. If df shows 100%, it is a capacity problem. If findmnt shows ro in the options, the filesystem went read-only.
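
Two of those checks are easy to script. The sketch below is one possible shape, with hypothetical function names: mount_is_readonly tests for ro as an exact mount option (so errors=remount-ro on a healthy rw mount does not match), and fs_used_percent relies on GNU df's --output option.

```shell
# Hypothetical helper: succeed when a comma-separated mount option string
# contains the exact option "ro" (and not merely e.g. "errors=remount-ro").
mount_is_readonly() {
  case ",$1," in
    *,ro,*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Hypothetical helper: print the used-space percentage (digits only) of the
# filesystem backing a path. Requires GNU df for --output.
fs_used_percent() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}
```

A possible invocation is mount_is_readonly "$(findmnt -n -o OPTIONS -T /var/lib/kafka)" && echo "read-only mount", alongside fs_used_percent /var/lib/kafka for the capacity check.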

Step-by-Step Resolution

  1. Identify the offline directory. Use kafka-log-dirs.sh --describe and the server.log grep to confirm exactly which path failed and which partitions it held.
  2. Classify the fault from the diagnostics: full volume (ENOSPC), bad disk (EIO + SMART), permissions (Permission denied), or read-only remount.
  3. For a full volume: extend the volume or expire data faster (lower retention.ms/retention.bytes on the largest topics). Confirm with df -h that headroom returns before restarting the broker.
  4. For permissions: restore ownership with sudo chown -R kafka:kafka /var/lib/kafka and ensure the mount is not read-only.
  5. For a bad disk (JBOD): drain the broker. Replace the device, recreate and mount the directory with correct ownership, then let Kafka re-replicate the partitions from healthy brokers on restart.
  6. Bring the directory back online. A directory marked offline is only re-evaluated on broker restart, so once the underlying issue is fixed, restart the broker. Kafka rescans log.dirs, recovers clean directories, and refills the recovered partitions from replicas.
  7. Verify recovery. Re-run kafka-log-dirs.sh --describe to confirm the directory is online and check kafka-topics.sh --describe --under-replicated-partitions clears once replication catches up.
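
The verification in the last step can be turned into a simple wait loop. The sketch below is a generic poller with a made-up name, wait_until_empty: it re-runs whatever command it is given until that command succeeds and prints nothing, which is the state the under-replicated check reaches once replication has caught up. A failing command (for example, a broker that is still down) counts as not yet clear.

```shell
# Hypothetical helper: run a check command up to max_tries times, pausing
# delay seconds between attempts, and succeed once the command exits 0
# with empty output. Usage: wait_until_empty MAX_TRIES DELAY CMD [ARGS...]
wait_until_empty() {
  local max_tries="$1" delay="$2" i=1 out
  shift 2
  while [ "$i" -le "$max_tries" ]; do
    if out=$("$@" 2>/dev/null) && [ -z "$out" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For instance, wait_until_empty 60 10 kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions would poll every 10 seconds for up to roughly 10 minutes; the counts are arbitrary and should be sized to how much data has to be re-replicated.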

Note the log.dir.failure.timeout.ms broker setting (default 30000 ms): if the broker cannot successfully tell the controller that a log dir has failed within this window, the broker fails and shuts down rather than keep running with replicas the controller still believes are healthy. It does not bring the directory back automatically; recovery still requires a restart after the fault is fixed.

Prevention and Best Practices

  • Alert on df -h headroom (page at 80%, hard-stop policy at 90%) and on kafka.server:type=ReplicaManager,name=OfflineReplicaCount so you see a failed log dir before the broker shuts down.
  • Run SMART monitoring (smartctl/smartd) on the data devices and replace disks proactively at the first pending-sector growth.
  • Size retention so a single hot topic cannot fill the whole volume; set per-topic retention.bytes on high-throughput topics.
  • Keep replication factor at 3 with min.insync.replicas=2 so a single offline log dir never causes data loss, only temporary under-replication.
  • Pin ownership and mount options in configuration management so a remount or package upgrade cannot silently change them; mount data volumes with explicit, persistent /etc/fstab entries.
  • For triage help turning a KafkaStorageException and the broker log into a likely cause, the free incident assistant can summarize the failure pattern.

Related Errors

  • Found a corrupted index/segment: segment-level corruption discovered on load, distinct from a whole-directory IO failure.
  • Error while flushing log for topic-0: an fsync failure that may precede a full directory failure if the disk is stalling.
  • Could not recover log: recovery failure after a crash, often on a disk that is also throwing IO errors.

Frequently Asked Questions

Does a single failed log dir take the whole broker down? Only if it is the last healthy directory. With multiple log.dirs, Kafka keeps serving the others and shuts down only when all have failed.

Will the directory come back on its own once the disk recovers? No. An offline log dir is re-scanned only on broker restart. Fix the underlying storage, then restart the broker.

What does log.dir.failure.timeout.ms actually control? It is how long a broker may go without successfully telling the controller that one of its log dirs has failed; past that window the broker fails and shuts down. It does not auto-recover the directory.

How do I avoid data loss from a bad disk? Run replication factor 3 with min.insync.replicas=2. A lost log dir then only causes under-replication, and partitions refill from other brokers after you replace the disk and restart.
