Kafka Error Guide: 'Failed to append records to topic-0 in dir /var/lib/kafka/data' Offline Log Dir
Fix Kafka's KafkaStorageException when a broker fails to append to its local log and marks the data directory offline due to disk, IO, or permission faults.
- #kafka
- #troubleshooting
- #errors
- #partitions
A KafkaStorageException is one of the more alarming errors a Kafka operator can see, because the broker does not just fail a single request — it takes the entire log directory offline. Every partition stored on that directory immediately loses its leader or replica on that broker, and producers start seeing failures. This guide walks through why the error happens, how to diagnose it with read-only commands, and how to bring the broker back cleanly.
Exact Error Message
The error appears in the broker’s server.log (and kafkaServer.out), typically logged by ReplicaManager or LogManager at ERROR level. A representative snippet:
[2026-06-29 14:22:11,304] ERROR [ReplicaManager broker=3] Error processing append operation on partition topic-0 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Failed to append records to topic-0 in dir /var/lib/kafka/data
Caused by: java.io.IOException: No space left on device
at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at java.base/sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62)
at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:90)
at kafka.log.LogSegment.append(LogSegment.scala:158)
[2026-06-29 14:22:11,318] ERROR Uncaught exception in scheduled task 'flush-log' (kafka.log.LogManager)
[2026-06-29 14:22:11,341] WARN Stopping serving logs in dir /var/lib/kafka/data (kafka.log.LogManager)
[2026-06-29 14:22:11,402] ERROR Shutdown broker because all log dirs in /var/lib/kafka/data have failed (kafka.log.LogDirFailureChannel)
The chained Caused by: java.io.IOException is the key line — it tells you the underlying operating-system fault.
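As a minimal sketch of pulling that key line out of a long log, the helper below prints the first Caused by entry that follows a KafkaStorageException. The function name storage_root_cause is hypothetical, and the log path in the usage comment is an assumption taken from the commands later in this guide.

```shell
# Print the first "Caused by:" message that follows a KafkaStorageException.
# Reads broker log text on stdin. The function name is illustrative only.
storage_root_cause() {
  awk '
    /KafkaStorageException/ { seen = 1; next }
    seen && /^[ \t]*Caused by:/ {
      sub(/^[ \t]*Caused by:[ \t]*/, "")
      print
      exit
    }
  '
}

# Usage (log path is an assumption; adjust to your installation):
#   storage_root_cause < /var/lib/kafka/logs/server.log
```

Reading from stdin keeps the helper read-only and lets you pipe in journalctl output instead of a file.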
What the Error Means
Kafka writes every produced record to an on-disk log segment. When a write (or the periodic flush/recovery checkpoint) throws an IOException, Kafka cannot guarantee durability for that directory, so it raises KafkaStorageException and marks the directory offline via the LogDirFailureChannel.
Consequences:
- All partitions hosted in the failed directory go offline on this broker.
- The broker drops out of the ISR for those partitions; leadership moves elsewhere if another in-sync replica exists.
- If the directory holds the only replica, the partition becomes unavailable. If losing it drops the ISR below min.insync.replicas, producers using acks=all are rejected until another replica is back in sync.
- With a single configured log.dirs entry, the broker shuts down entirely (as in the snippet above). With multiple log dirs (JBOD), only the bad directory goes offline.
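To see which of the last two outcomes applies to a given broker, a rough sketch is to count the entries in log.dirs. The helper name count_log_dirs is hypothetical, and the parsing assumes a plain key=value line with comma-separated paths; a result of 1 means a single directory failure stops the whole broker.

```shell
# Count the comma-separated entries of log.dirs in a server.properties file.
# Prints 0 when the key is absent (Kafka then falls back to log.dir).
# $1: path to server.properties. The function name is illustrative only.
count_log_dirs() {
  awk '
    /^[ \t]*log\.dirs[ \t]*=/ {
      line = $0
      sub(/^[ \t]*log\.dirs[ \t]*=/, "", line)
      n = split(line, parts, ",")
      count = 0
      for (i = 1; i <= n; i++) {
        gsub(/[ \t]/, "", parts[i])
        if (parts[i] != "") count++
      }
    }
    END { print count + 0 }
  ' "$1"
}

# Usage (config path is an assumption):
#   count_log_dirs /etc/kafka/server.properties
```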
Common Causes
- Disk full: No space left on device. The most frequent trigger. Retention has not reclaimed space fast enough, or a partition grew unexpectedly.
- Disk hardware / IO failure: Input/output error (EIO). A failing drive or controller. The kernel usually logs matching errors in dmesg.
- Filesystem remounted read-only: after an IO error, ext4/xfs may switch to ro to protect data. Writes then fail with Read-only file system.
- Permissions: the data directory is not owned by the Kafka user (often after a manual chown, restore, or volume remount). Fails with Permission denied.
- Too many open files: fails with Too many open files. Kafka keeps file descriptors open per segment; a low nofile limit exhausts them on brokers with many partitions.
- Corrupt segment: a truncated index or .log segment that fails CRC/recovery checks on startup.
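The causes above map fairly directly onto the operating-system message in the Caused by line. The sketch below shows one way to turn that message into a short label for triage; the function name and the label strings are hypothetical, and a corrupt segment has no single OS message, so it falls through to unknown.

```shell
# Map an OS error message to a coarse cause label.
# $1: text of the Caused by line. Name and labels are illustrative only.
classify_storage_fault() {
  case "$1" in
    *"No space left on device"*) echo "disk-full" ;;
    *"Input/output error"*)      echo "io-failure" ;;
    *"Read-only file system"*)   echo "read-only-fs" ;;
    *"Permission denied"*)       echo "permissions" ;;
    *"Too many open files"*)     echo "fd-limit" ;;
    *)                           echo "unknown" ;;
  esac
}

# Usage:
#   classify_storage_fault "java.io.IOException: No space left on device"
```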
How to Reproduce the Error
To reproduce safely in a test cluster, exhaust disk on the data volume:
# Inspect free space on the Kafka data volume (read-only)
df -h /var/lib/kafka/data
Filesystem Size Used Avail Use% Mounted on
/dev/nvme1n1 100G 100G 20K 100% /var/lib/kafka/data
Once Avail hits zero and a producer writes to a partition on that volume, the next flush throws IOException: No space left on device and the directory is marked offline. The permission and read-only variants reproduce by chmod-ing the data dir or remounting -o ro (do this only in a lab).
Diagnostic Commands
Start with Kafka’s own view of log directory health:
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
--describe --broker-list 3
Querying log directories on brokers [3].
{"brokers":[{"broker":3,"logDirs":[
{"logDir":"/var/lib/kafka/data","error":"org.apache.kafka.common.errors.KafkaStorageException",
"partitions":[]}]}]}
A non-null "error" field and empty "partitions" confirm the directory is offline. Next, find which partitions lost availability:
kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --unavailable-partitions
Topic: topic Partition: 0 Leader: none Replicas: 3,1 Isr: 1
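Both checks above can be scripted for alerting or a runbook. The following is a rough sketch, not a robust parser: the function names are hypothetical, offline_log_dirs assumes the "logDir" and "error" fields appear adjacent and in that order (as in the sample output), and a JSON tool such as jq is a safer choice where it is available.

```shell
# List log directories that report an error.
# stdin: output of kafka-log-dirs.sh --describe
# Prints "<logDir> <error class>" per offline directory; healthy
# directories ("error":null) produce no output.
offline_log_dirs() {
  { grep -o '"logDir":"[^"]*","error":"[^"]*"' || true; } |
    sed 's/^"logDir":"\([^"]*\)","error":"\([^"]*\)"$/\1 \2/'
}

# List partitions without a leader as "<topic>-<partition>".
# stdin: output of kafka-topics.sh --describe
leaderless_partitions() {
  awk '{
    topic = ""; part = ""; leader = ""
    for (i = 1; i < NF; i++) {
      if ($i == "Topic:") topic = $(i + 1)
      else if ($i == "Partition:") part = $(i + 1)
      else if ($i == "Leader:") leader = $(i + 1)
    }
    if (topic != "" && part != "" && leader == "none") print topic "-" part
  }'
}

# Usage:
#   kafka-log-dirs.sh --bootstrap-server localhost:9092 \
#     --describe --broker-list 3 | offline_log_dirs
#   kafka-topics.sh --bootstrap-server localhost:9092 \
#     --describe --unavailable-partitions | leaderless_partitions
```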
Check free space and the OS-level fault:
df -h /var/lib/kafka/data
ls -ld /var/lib/kafka/data
dmesg | grep -iE "I/O error|EXT4-fs error|read-only|nvme"
[ 9123.55] blk_update_request: I/O error, dev nvme1n1, sector 204800
[ 9123.56] EXT4-fs error (device nvme1n1): ext4_journal_check_start: Detected aborted journal
[ 9123.57] EXT4-fs (nvme1n1): Remounting filesystem read-only
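If dmesg reports a read-only remount, the current state can be confirmed from the kernel's mount table. This is a small sketch assuming the data directory is itself a mount point (as in the df output earlier); the function name mount_mode is hypothetical.

```shell
# Report whether a mount point is currently mounted ro or rw.
# $1: mount point; stdin: contents of /proc/mounts
# Prints ro, rw, unknown, or not-mounted. The name is illustrative only.
mount_mode() {
  awk -v mp="$1" '
    $2 == mp {
      n = split($4, opts, ",")
      found = "unknown"
      for (i = 1; i <= n; i++) {
        if (opts[i] == "ro" || opts[i] == "rw") found = opts[i]
      }
    }
    END { print (found == "" ? "not-mounted" : found) }
  '
}

# Usage:
#   mount_mode /var/lib/kafka/data < /proc/mounts
```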
Pull the stack trace and file-descriptor limits:
journalctl -u kafka --since "30 min ago" | grep -i KafkaStorageException
grep -i "KafkaStorageException\|No space left\|Too many open files" \
/var/lib/kafka/logs/server.log
ulimit -n
# Use the first matching PID so the path stays valid if pgrep returns several
grep "open files" /proc/$(pgrep -f kafka.Kafka | head -n 1)/limits
Max open files 100000 100000 files
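The limit only matters relative to how many descriptors the broker actually holds. As a sketch, the helpers below extract the soft limit from a limits file and compute usage as a percentage; the function names are hypothetical, and the arithmetic assumes a numeric limit (it does not handle unlimited).

```shell
# Extract the soft "Max open files" limit.
# stdin: contents of /proc/<pid>/limits. The name is illustrative only.
fd_soft_limit() {
  awk '$1 == "Max" && $2 == "open" && $3 == "files" { print $4 }'
}

# Integer percentage of the limit in use.
# $1: open descriptor count, $2: soft limit (must be numeric)
fd_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Usage (run as the Kafka user or root so /proc/<pid>/fd is readable):
#   pid=$(pgrep -f kafka.Kafka | head -n 1)
#   limit=$(fd_soft_limit < "/proc/$pid/limits")
#   used=$(ls "/proc/$pid/fd" | wc -l)
#   fd_usage_pct "$used" "$limit"
```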
Step-by-Step Resolution
A worked example for the most common case — a full disk:
- Confirm the root cause from the Caused by line. Here it was No space left on device, corroborated by df -h showing 100% usage.
- Reclaim space without deleting Kafka data by hand. Never rm segment files manually; you risk corruption and offset gaps. Instead, lower retention so Kafka deletes its own old segments. Edit server.properties (or set per-topic):
# server.properties: reduce until disk pressure clears
log.retention.hours=72
log.retention.bytes=53687091200
Apply per-topic with kafka-configs.sh --alter (run during a maintenance window), or move the volume to larger storage.
- For IO / read-only faults, the disk itself must be fixed: run fsck on an unmounted volume, replace the failing drive, then remount read-write. For JBOD, you can decommission just the bad directory by removing it from log.dirs.
- For permission faults, restore ownership: chown -R kafka:kafka /var/lib/kafka/data so the broker process can write.
- For Too many open files, raise the limit. Set LimitNOFILE=200000 in the systemd unit (/etc/systemd/system/kafka.service) or nofile in /etc/security/limits.conf, then reload systemd.
- Recover the offline directory. Kafka does not re-online a failed directory at runtime. Once the underlying fault is resolved, restart the broker. On startup it runs log recovery, rebuilds indexes for the affected segments, rejoins the ISR, and the partitions return to Leader status. Verify with kafka-log-dirs.sh --describe (error should be null) and kafka-topics.sh --describe --under-replicated-partitions (should be empty once caught up).
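For the per-topic retention change in the second step, one cautious approach is to print the kafka-configs.sh command for review before running it. The sketch below does only that; the function names are hypothetical, and it uses the topic-level retention.ms and retention.bytes settings, which take milliseconds and bytes per partition rather than the broker-level hours value.

```shell
# Convert hours to milliseconds for the topic-level retention.ms setting.
hours_to_ms() {
  echo $(( $1 * 60 * 60 * 1000 ))
}

# Print (do not run) a kafka-configs.sh command that lowers retention.
# $1: bootstrap server, $2: topic, $3: retention hours, $4: retention bytes
# The function name is illustrative only.
retention_alter_cmd() {
  printf 'kafka-configs.sh --bootstrap-server %s --alter --entity-type topics --entity-name %s --add-config retention.ms=%s,retention.bytes=%s\n' \
    "$1" "$2" "$(hours_to_ms "$3")" "$4"
}

# Usage: review the printed command, then run it in a maintenance window.
#   retention_alter_cmd localhost:9092 topic 72 53687091200
```

Printing instead of executing keeps the helper side-effect free, which fits the guide's preference for read-only diagnosis before any change.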
Prevention and Best Practices
- Alert on disk usage early — page at 75% on the Kafka data volume so retention or capacity changes happen before 100%.
- Use JBOD with multiple log.dirs so a single bad disk takes only that directory offline instead of the whole broker.
- Set generous file-descriptor limits (LimitNOFILE 100k+) on brokers with thousands of partitions.
- Size retention to disk with log.retention.bytes per partition, not just time-based retention, to bound worst-case growth.
- Replication factor >= 3 with min.insync.replicas=2 so one offline directory never causes data loss or unavailability.
- Monitor OfflineLogDirectoryCount and OfflinePartitionsCount JMX metrics, and consider routing storage faults into an incident response workflow for faster triage.
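The disk-usage alert in the first bullet can be approximated with a small check when no monitoring agent is in place. This is a sketch under assumptions: the function names are hypothetical, the input is the POSIX-format output of df -P, and 75 is the paging threshold suggested above.

```shell
# Extract the usage percentage as a bare integer.
# stdin: output of `df -P <path>` (header line plus one data line)
disk_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Decide whether to page.
# $1: usage percentage, $2: paging threshold
# Prints "page" at or above the threshold, otherwise "ok".
disk_alert_level() {
  if [ "$1" -ge "$2" ]; then echo "page"; else echo "ok"; fi
}

# Usage:
#   pct=$(df -P /var/lib/kafka/data | disk_use_pct)
#   disk_alert_level "$pct" 75
```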
Related Errors
- org.apache.kafka.common.errors.NotEnoughReplicasException: produced when a directory going offline drops the ISR below min.insync.replicas.
- Halting because log truncation is not allowed: a different storage/recovery refusal during follower truncation.
- ERROR Disk error while ... (kafka.log.LogManager): sibling message for flush failures.
Frequently Asked Questions
Q: Does a KafkaStorageException mean I lost data?
Not necessarily. If replication factor is >= 2 and another replica was in the ISR, leadership fails over with no loss. Loss only occurs if the offline directory held the only in-sync copy of a partition.
Q: Can I bring the failed log directory back online without restarting the broker?
No. Kafka marks the directory offline for the broker’s lifetime and only re-attempts recovery on startup. After fixing the disk, permission, or limit problem, restart the broker so it runs log recovery.
Q: Is it safe to delete .log files to free space?
No. Manually deleting segments corrupts the offset index and can break consumers. Reduce log.retention.bytes/log.retention.hours and let Kafka delete its own segments, or add storage.
Q: Why did the whole broker shut down instead of just one partition?
With a single entry in log.dirs, losing that directory means all log dirs have failed, so Kafka shuts down. Configure multiple log dirs (JBOD) to isolate a single disk failure.