DevOps AI ToolKit
AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Failed to append records to topic-0 in dir /var/lib/kafka/data' Offline Log Dir

Fix Kafka's KafkaStorageException when a broker fails to append to its local log and marks the data directory offline due to disk, IO, or permission faults.

  • #kafka
  • #troubleshooting
  • #errors
  • #partitions

A KafkaStorageException is one of the more alarming errors a Kafka operator can see, because the broker does not just fail a single request — it takes the entire log directory offline. Every partition stored on that directory immediately loses its leader or replica on that broker, and producers start seeing failures. This guide walks through why the error happens, how to diagnose it with read-only commands, and how to bring the broker back cleanly.

Exact Error Message

The error appears in the broker’s server.log (and kafkaServer.out), typically logged by ReplicaManager or LogManager at ERROR level. A representative snippet:

[2026-06-29 14:22:11,304] ERROR [ReplicaManager broker=3] Error processing append operation on partition topic-0 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Failed to append records to topic-0 in dir /var/lib/kafka/data
Caused by: java.io.IOException: No space left on device
	at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at java.base/sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62)
	at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:90)
	at kafka.log.LogSegment.append(LogSegment.scala:158)
[2026-06-29 14:22:11,318] ERROR Uncaught exception in scheduled task 'flush-log' (kafka.log.LogManager)
[2026-06-29 14:22:11,341] WARN  Stopping serving logs in dir /var/lib/kafka/data (kafka.log.LogManager)
[2026-06-29 14:22:11,402] ERROR Shutdown broker because all log dirs in /var/lib/kafka/data have failed (kafka.log.LogDirFailureChannel)

The chained Caused by: java.io.IOException is the key line — it tells you the underlying operating-system fault.
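
As a rough sketch of how that line can be pulled out of a long server.log, the Python below scans log text for KafkaStorageException and pairs it with the first Caused by that follows. The function name and the regular expressions are illustrative assumptions, not part of Kafka.

```python
import re

# The Kafka-level exception line and the chained OS-level cause that follows it.
STORAGE_RE = re.compile(r"KafkaStorageException: (?P<msg>.*)")
CAUSE_RE = re.compile(r"^Caused by: (?P<exc>[\w.$]+)(?:: (?P<msg>.*))?$")


def storage_failures(log_text):
    """Return (kafka_message, cause_exception, cause_message) tuples."""
    failures = []
    pending = None  # KafkaStorageException message still waiting for its cause
    for raw in log_text.splitlines():
        line = raw.strip()
        storage = STORAGE_RE.search(line)
        if storage:
            pending = storage.group("msg")
            continue
        cause = CAUSE_RE.match(line)
        if cause and pending is not None:
            failures.append((pending, cause.group("exc"), cause.group("msg") or ""))
            pending = None
    return failures


sample = """\
org.apache.kafka.common.errors.KafkaStorageException: Failed to append records to topic-0 in dir /var/lib/kafka/data
Caused by: java.io.IOException: No space left on device
\tat java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
"""
for kafka_msg, exc, os_msg in storage_failures(sample):
    print(f"{exc}: {os_msg}  <-  {kafka_msg}")
```

Stack frames between the two lines are skipped, so the pairing still works when the trace is long.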

What the Error Means

Kafka writes every produced record to an on-disk log segment. When a write (or the periodic flush/recovery checkpoint) throws an IOException, Kafka cannot guarantee durability for that directory, so it raises KafkaStorageException and marks the directory offline via the LogDirFailureChannel.

Consequences:

  • All partitions hosted in the failed directory go offline on this broker.
  • The broker drops out of the ISR for those partitions; leadership moves elsewhere if another in-sync replica exists.
  • If the directory holds the only in-sync replica, the partition becomes unavailable. If losing it drops the ISR below min.insync.replicas, a leader remains but producers using acks=all are rejected.
  • With a single directory configured in log.dirs, the broker shuts down entirely (as in the snippet above). With multiple log dirs (JBOD), only the bad directory goes offline.
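
The consequences above can be made concrete with a deliberately simplified model. The sketch assumes unclean leader election is disabled and that every other broker in the ISR is healthy; the function name and the return labels are invented for illustration.

```python
def state_after_dir_failure(isr, failed_broker, min_insync_replicas=1):
    """Classify one partition after failed_broker loses its log directory.

    isr is the list of in-sync broker ids just before the failure.
    """
    survivors = [b for b in isr if b != failed_broker]
    if not survivors:
        # No in-sync copy is left to take leadership.
        return "offline"
    if len(survivors) < min_insync_replicas:
        # A leader exists, but acks=all producers are rejected.
        return "below_min_isr"
    return "available"


print(state_after_dir_failure([3, 1], failed_broker=3))
print(state_after_dir_failure([3], failed_broker=3))
print(state_after_dir_failure([3, 1], failed_broker=3, min_insync_replicas=2))
```

The three calls cover the three outcomes in turn: failover with no impact, an offline partition, and a partition that still has a leader but refuses acks=all writes.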

Common Causes

  1. Disk full: No space left on device. The most frequent trigger. Retention has not reclaimed space fast enough, or a partition grew unexpectedly.
  2. Disk hardware / IO failure: Input/output error (EIO). A failing drive or controller. The kernel usually logs matching errors in dmesg.
  3. Filesystem remounted read-only: after an IO error, ext4/xfs may switch to ro to protect data. Writes then fail with Read-only file system.
  4. Permissions: the data directory is not owned by the Kafka user (often after a manual chown, restore, or volume remount). Fails with Permission denied.
  5. File descriptor exhaustion: Too many open files. Kafka keeps file descriptors open per segment; a low nofile limit exhausts them on brokers with many partitions.
  6. Corrupt segment: a truncated index or .log segment that fails CRC/recovery checks on startup.
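
Because each cause in the list has a distinctive operating-system message, the Caused by text can be mapped to a likely cause with a small lookup. This is only a sketch: the cause labels are made up for this example, and the substrings are the messages quoted in the list.

```python
# Substring of the OS error message -> likely cause from the list above.
CAUSES = [
    ("No space left on device", "disk full"),
    ("Input/output error", "disk hardware / IO failure"),
    ("Read-only file system", "filesystem remounted read-only"),
    ("Permission denied", "permissions"),
    ("Too many open files", "file descriptor limit"),
]


def likely_cause(message):
    """Return the first matching cause, or 'unknown' (for example a corrupt segment)."""
    lowered = message.lower()
    for needle, cause in CAUSES:
        if needle.lower() in lowered:
            return cause
    return "unknown"


print(likely_cause("java.io.IOException: No space left on device"))
```

An "unknown" result is a cue to read the full stack trace rather than a diagnosis in itself.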

How to Reproduce the Error

To reproduce safely in a test cluster, fill the data volume until no space remains, then confirm with df:

# Inspect free space on the Kafka data volume (read-only)
df -h /var/lib/kafka/data
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    100G  100G   20K 100% /var/lib/kafka/data

Once Avail hits zero and a producer writes to a partition on that volume, the next append or flush throws IOException: No space left on device and the directory is marked offline. The permission and read-only variants reproduce by changing the data directory's owner or mode (chown, chmod) or by remounting the volume read-only (mount -o remount,ro). Do this only in a lab.

Diagnostic Commands

Start with Kafka’s own view of log directory health:

kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --describe --broker-list 3
Querying log directories on brokers [3].
{"brokers":[{"broker":3,"logDirs":[
  {"logDir":"/var/lib/kafka/data","error":"org.apache.kafka.common.errors.KafkaStorageException",
   "partitions":[]}]}]}

A non-null "error" field and empty "partitions" confirm the directory is offline. Next, find which partitions lost availability:

kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions
Topic: topic   Partition: 0   Leader: none   Replicas: 3,1   Isr: 1
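
When many brokers need checking, both outputs above can be filtered by script. The sketch below assumes the output shapes shown in this guide (status text followed by one JSON document, and one Topic/Partition/Leader line per partition); the function names are hypothetical.

```python
import json
import re


def offline_log_dirs(log_dirs_output):
    """Return (broker, logDir, error) for every log dir with a non-null error.

    The tool prints status text before the JSON, so decoding starts at the
    first '{' and ignores anything after the document.
    """
    start = log_dirs_output.index("{")
    doc, _end = json.JSONDecoder().raw_decode(log_dirs_output, start)
    return [
        (broker["broker"], d["logDir"], d["error"])
        for broker in doc["brokers"]
        for d in broker["logDirs"]
        if d.get("error")
    ]


PARTITION_RE = re.compile(r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(\S+)")


def leaderless_partitions(topics_output):
    """Return (topic, partition) pairs whose Leader column is 'none'."""
    found = []
    for line in topics_output.splitlines():
        m = PARTITION_RE.search(line)
        if m and m.group(3) == "none":
            found.append((m.group(1), int(m.group(2))))
    return found


log_dirs_sample = """Querying log directories on brokers [3].
{"brokers":[{"broker":3,"logDirs":[
  {"logDir":"/var/lib/kafka/data","error":"org.apache.kafka.common.errors.KafkaStorageException",
   "partitions":[]}]}]}"""
topics_sample = "Topic: topic   Partition: 0   Leader: none   Replicas: 3,1   Isr: 1"

print(offline_log_dirs(log_dirs_sample))
print(leaderless_partitions(topics_sample))
```

In practice the two strings would come from capturing the commands' standard output; they are inlined here so the sketch runs without a cluster.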

Check free space and the OS-level fault:

df -h /var/lib/kafka/data
ls -ld /var/lib/kafka/data
dmesg | grep -iE "I/O error|EXT4-fs error|read-only|nvme"
[ 9123.55] blk_update_request: I/O error, dev nvme1n1, sector 204800
[ 9123.56] EXT4-fs error (device nvme1n1): ext4_journal_check_start: Detected aborted journal
[ 9123.57] EXT4-fs (nvme1n1): Remounting filesystem read-only

Pull the stack trace and file-descriptor limits:

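# Broker-side stack trace, from the journal and from server.log
journalctl -u kafka --since "30 min ago" | grep -i KafkaStorageException
grep -i "KafkaStorageException\|No space left\|Too many open files" \
  /var/lib/kafka/logs/server.log
# ulimit -n reports the limit of the current shell, not of the broker process
ulimit -n
# Effective limit of the running broker (first matching PID)
grep "open files" /proc/$(pgrep -f kafka.Kafka | head -n 1)/limits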
Max open files            100000               100000               files
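
The limit only matters relative to how many descriptors the broker actually holds. As a sketch, the Python below parses the Max open files row shown above and, on Linux, counts the entries under /proc/<pid>/fd. It assumes numeric limits (a value of unlimited would need special handling), and the function names are illustrative.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max", "open", "files", soft, hard, units
            soft, hard = line.split()[3:5]
            return int(soft), int(hard)
    raise ValueError("no 'Max open files' row found")


def fd_usage(pid):
    """Return (open_fds, soft_limit) for a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        soft, _hard = parse_max_open_files(f.read())
    return len(os.listdir(f"/proc/{pid}/fd")), soft


row = "Max open files            100000               100000               files"
print(parse_max_open_files(row))
```

Calling fd_usage with the broker's PID needs to run as the Kafka user or root, since /proc/<pid>/fd is not readable by other users.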

Step-by-Step Resolution

A worked example for the most common case — a full disk:

  1. Confirm the root cause from the Caused by line. Here it was No space left on device, corroborated by df -h showing 100% usage.

  2. Reclaim space without deleting Kafka data by hand. Never rm segment files manually — you risk corruption and offset gaps. Instead, lower retention so Kafka deletes its own old segments. Edit server.properties (or set per-topic):

    # server.properties — reduce until disk pressure clears
    log.retention.hours=72
    log.retention.bytes=53687091200

    Apply per-topic with kafka-configs.sh --alter, using the topic-level names retention.ms and retention.bytes (run during a maintenance window), or move the volume to larger storage.

  3. For IO / read-only faults, the disk itself must be fixed: run fsck (xfs_repair for XFS) on the unmounted volume, replace the drive if it is failing, then remount read-write. For JBOD, you can decommission just the bad directory by removing it from log.dirs and restarting the broker.

  4. For permission faults, restore ownership: chown -R kafka:kafka /var/lib/kafka/data so the broker process can write.

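  5. For Too many open files, raise the limit. Set LimitNOFILE=200000 in the systemd unit (/etc/systemd/system/kafka.service), or nofile in /etc/security/limits.conf if the broker is started from a login session rather than by systemd. Then run systemctl daemon-reload and restart the broker, because limits are fixed when a process starts.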

  6. Recover the offline directory. Kafka does not re-online a failed directory at runtime. Once the underlying fault is resolved, restart the broker. On startup it runs log recovery, rebuilds indexes for the affected segments, catches up from the current leaders, and rejoins the ISR, after which it can lead those partitions again. Verify with kafka-log-dirs.sh --describe (error should be null) and kafka-topics.sh --describe --under-replicated-partitions (should be empty once caught up).
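
For step 2, it helps to check whether a given log.retention.bytes can fit on the volume at all, because the setting applies per partition rather than per broker. The arithmetic below is a rough upper bound. It assumes the default log.segment.bytes of 1 GiB and that every partition fills its retention budget; the partition counts are hypothetical.

```python
GIB = 1024 ** 3


def worst_case_bytes(partitions_on_disk, retention_bytes, segment_bytes=GIB):
    """Rough upper bound on log data held in one directory.

    Size-based retention removes whole segments only, so each partition can
    overshoot retention_bytes by about one segment.
    """
    return partitions_on_disk * (retention_bytes + segment_bytes)


retention = 53687091200  # value from the server.properties example (50 GiB)
for partitions in (1, 2, 4):
    need = worst_case_bytes(partitions, retention)
    print(f"{partitions} partition(s): up to {need / GIB:.0f} GiB")
```

Under these assumptions the bound is 51, 102, and 204 GiB respectively, so on the 100G volume from the df output two full partitions would already exceed the disk. The retention value has to shrink as the number of partitions on the directory grows.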

Prevention and Best Practices

  • Alert on disk usage early — page at 75% on the Kafka data volume so retention or capacity changes happen before 100%.
  • Use JBOD with multiple log.dirs so a single bad disk takes only that directory offline instead of the whole broker.
  • Set generous file-descriptor limits (LimitNOFILE 100k+) on brokers with thousands of partitions.
  • Size retention to disk with log.retention.bytes per partition, not just time-based retention, to bound worst-case growth.
  • Replication factor >= 3 with min.insync.replicas=2 so one offline directory never causes data loss or unavailability.
  • Monitor OfflineLogDirectoryCount and OfflinePartitionsCount JMX metrics, and consider routing storage faults into an incident response workflow for faster triage.

Related Errors

  • org.apache.kafka.common.errors.NotEnoughReplicasException: produced when a directory going offline drops the ISR below min.insync.replicas.
  • Halting because log truncation is not allowed: a different storage/recovery refusal during follower truncation.
  • ERROR Disk error while ... (kafka.log.LogManager): sibling message for flush failures.
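
The early disk-usage alert recommended under Prevention and Best Practices can be sketched with the Python standard library alone. The 75% default mirrors the threshold suggested there; the function names are hypothetical, and a real deployment would more likely express this as a rule in its monitoring system.

```python
import shutil


def used_percent(total_bytes, free_bytes):
    """Percentage of the volume in use, from total and free byte counts."""
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def disk_alert(path, threshold_percent=75.0):
    """Return a warning string when path's volume is at or above the threshold, else None."""
    usage = shutil.disk_usage(path)
    pct = used_percent(usage.total, usage.free)
    if pct >= threshold_percent:
        return f"{path} is {pct:.1f}% full (threshold {threshold_percent:.0f}%)"
    return None


# "." stands in for the Kafka data directory so the sketch runs anywhere.
print(disk_alert("."))
```

The percentage can differ slightly from the Use% column of df, which excludes blocks reserved for root from its calculation.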

Frequently Asked Questions

Q: Does a KafkaStorageException mean I lost data? Not necessarily. If replication factor is >= 2 and another replica was in the ISR, leadership fails over and writes acknowledged with acks=all are preserved. Loss only occurs if the offline directory held the only in-sync copy of a partition and that copy cannot be recovered from the disk.

Q: Can I bring the failed log directory back online without restarting the broker? No. Kafka marks the directory offline for the broker’s lifetime and only re-attempts recovery on startup. After fixing the disk, permission, or limit problem, restart the broker so it runs log recovery.

Q: Is it safe to delete .log files to free space? No. Manually deleting segments corrupts the offset index and can break consumers. Reduce log.retention.bytes/log.retention.hours and let Kafka delete its own segments, or add storage.

Q: Why did the whole broker shut down instead of just one partition? With a single entry in log.dirs, losing that directory means all log dirs have failed, so Kafka shuts down. Configure multiple log dirs (JBOD) to isolate a single disk failure.
