AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Error while flushing log' fsync Failure on Broker

Fix Kafka 'Error while flushing log for topic-0' fsync failures: diagnose disk stalls, IO errors, and storage latency that mark a log directory offline via KafkaStorageException.

  • #kafka
  • #troubleshooting
  • #errors
  • #storage

Exact Error Message

When the broker calls fsync on a segment and the OS returns an error, the broker logs a flush failure in server.log through LogDirFailureChannel:

[2026-06-29 11:52:03,778] ERROR Error while flushing log for orders-0 in dir /var/lib/kafka with offset 5242880 (exclusive) and recovery point 5242880 (kafka.server.LogDirFailureChannel)
java.io.IOException: Input/output error
        at java.base/sun.nio.ch.FileDispatcherImpl.force0(Native Method)
        at java.base/sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:466)
        at org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:209)
        at kafka.log.LogSegment.flush(LogSegment.scala:475)
[2026-06-29 11:52:03,901] ERROR Uncaught exception in scheduled task 'flush-log' (kafka.utils.KafkaScheduler)
org.apache.kafka.common.errors.KafkaStorageException: Error while flushing log for orders-0 in dir /var/lib/kafka
[2026-06-29 11:52:04,010] ERROR Stopping serving replicas in dir /var/lib/kafka (kafka.server.ReplicaManager)

The signature is Error while flushing log for <topic-partition> in dir ... raised as a KafkaStorageException from a force0/fsync call.
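To pull the partition and directory out of that signature programmatically, a regular expression over the first log line is usually enough. The following is a minimal sketch: the pattern is modelled only on the sample line above, the names `FLUSH_ERROR` and `parse_flush_error` are illustrative, and other Kafka versions may word the message differently.

```python
import re

# Pattern modelled on the sample log line above (an assumption, not a Kafka API).
FLUSH_ERROR = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] ERROR Error while flushing log for "
    r"(?P<partition>\S+) in dir (?P<log_dir>\S+)"
    r"(?: with offset (?P<offset>\d+))?"
)


def parse_flush_error(line):
    """Return timestamp, partition, log dir, and offset, or None if no match."""
    match = FLUSH_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    if fields["offset"] is not None:
        fields["offset"] = int(fields["offset"])
    return fields


sample = (
    "[2026-06-29 11:52:03,778] ERROR Error while flushing log for orders-0 "
    "in dir /var/lib/kafka with offset 5242880 (exclusive) and recovery point "
    "5242880 (kafka.server.LogDirFailureChannel)"
)
print(parse_flush_error(sample))
```

The offset group is optional so that a line without the offset wording would still yield the partition and directory.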

What the Error Means

Kafka writes records into the active segment via the page cache and periodically forces them to stable storage with fsync (FileChannel.force). Flushes happen on segment roll, on the background flush-log scheduler, and when flush.messages/flush.ms thresholds are crossed. fsync is the moment Kafka demands the OS actually persist data to the device.

When that fsync returns an IOException, the broker cannot guarantee durability for that partition, so it raises a KafkaStorageException. As with any storage exception, Kafka marks the containing log directory offline and stops serving the replicas on it. A flush failure is therefore a precursor to — or a form of — a log directory failure, distinguished by the flushing log wording and the fsync stack frame.

This is a disk-layer fault: the device or filesystem rejected or could not complete the durable write.
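The gap between a buffered write and a durable one can be illustrated outside Kafka. The sketch below is not Kafka code: it uses Python's `os.write` and `os.fsync` as stand-ins for the broker's append and `FileChannel.force`, and the helper name `durable_append` and the segment file name are made up for illustration.

```python
import errno
import os
import tempfile


def durable_append(path, payload):
    """Append payload and fsync it; return None on success or the errno name on failure."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)  # lands in the page cache; often succeeds even on a failing disk
        os.fsync(fd)           # forces data to the device; a device fault surfaces here
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    segment = os.path.join(tmp, "00000000000000000000.log")  # illustrative name
    print(durable_append(segment, b"record-1\n"))
```

On a healthy filesystem the call returns `None`. On a faulted device the same call would be expected to return a name such as `EIO`, which is the condition Kafka turns into a KafkaStorageException.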

Common Causes

  • Disk IO error (EIO). The device returned an error on force0 — a failing or disconnected disk, a degraded RAID member, or a flapping network-attached volume.
  • Disk stall / extreme latency. A saturated or stalling device makes fsync block for seconds. A slow fsync on its own only blocks; the flush fails once the kernel gives up on the stalled request and returns an IO error. On cloud volumes this often means burst-balance/IOPS exhaustion.
  • Filesystem turned read-only. ext4 mounted with errors=remount-ro remounts read-only when it detects an error, so the next fsync fails.
  • Full volume. ENOSPC during a flush of newly rolled data.
  • Storage controller or driver fault. A controller reset or driver bug aborts the in-flight fsync with an IO error.
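The causes above can often be told apart by the message on the `java.io.IOException` line, because the JVM reports the operating system's error text. The sketch below assumes the standard Linux error strings. The mapping and the function name `classify_flush_failure` are illustrative. Stalls and controller faults generally also show up as `Input/output error`, so kernel logs are still needed to separate them.

```python
# Hypothetical mapping from the OS error text to the causes listed above.
CAUSE_BY_MESSAGE = {
    "Input/output error": "disk IO error (EIO): check device, RAID member, controller, or network volume",
    "Read-only file system": "filesystem is read-only (EROFS): check kernel log for a remount",
    "No space left on device": "volume is full (ENOSPC): extend it or expire data",
}


def classify_flush_failure(exception_line):
    """Map an IOException log line to a likely cause from the list above."""
    for message, cause in CAUSE_BY_MESSAGE.items():
        if message in exception_line:
            return cause
    return "unclassified: inspect kernel logs"


print(classify_flush_failure("java.io.IOException: Input/output error"))
```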

How to Reproduce the Error

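Reproducing a true fsync failure requires injecting a storage fault, which you should only do on a disposable test box. The cleanest method is a Linux device-mapper “error” or “delay” target placed under a test broker’s data volume so fsync returns EIO or stalls past timeouts. Because the error target fails every IO, including the writes mkfs needs, start with a pass-through linear mapping and swap in the error table once the broker is running:

# Test box only. Start with a pass-through (linear) mapping, then mkfs and mount it.
SIZE=$(sudo blockdev --getsz /dev/nvme1n1)
sudo dmsetup create kafka-fault --table "0 $SIZE linear /dev/nvme1n1 0"
# Once the broker is writing, swap in the error target so every IO, including fsync, fails.
sudo dmsetup suspend --nolockfs kafka-fault
sudo dmsetup load kafka-fault --table "0 $SIZE error" && sudo dmsetup resume kafka-fault

After the first dmsetup command, create and mount a filesystem on /dev/mapper/kafka-fault, point a test broker’s log.dirs at that mount, and produce messages. After the swap, the next flush fails with Error while flushing log ... and marks the dir offline. To recover, stop the broker, unmount the filesystem, and tear the mapping down (dmsetup remove kafka-fault). Never apply this to production storage.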

Diagnostic Commands

All commands below are read-only. Confirm whether the disk is erroring or merely slow.

# Pull flush failures and the partition/dir involved
grep -E "Error while flushing log|KafkaStorageException|force0|Stopping serving" \
  /var/log/kafka/server.log | tail -30
# Kernel-level IO errors, controller resets, and read-only remounts
journalctl -k --since "1 hour ago" | grep -iE "I/O error|EXT4-fs error|remounting.*read-only|nvme.*reset|task abort"
# Device health and error counters
sudo smartctl --health /dev/nvme1n1
sudo smartctl -A /dev/nvme1n1 | grep -iE "media|error|reallocated|pending|crc"
# Capacity and mount state of the log dir
df -h /var/lib/kafka
findmnt -T /var/lib/kafka
ls -la /var/lib/kafka
# Broker's own view of online/offline dirs and partition sizes
# (the tool prints status lines before the JSON, so keep only the JSON line)
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --describe --broker-list 1 | grep '^{' | python3 -m json.tool | head -40
du -sh /var/lib/kafka/orders-0/

If kernel logs show nvme ... reset or I/O error around the flush timestamp, the device faulted. If findmnt shows ro, the filesystem went read-only. If smartctl --health is FAILED, replace the disk.
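The broker's view from kafka-log-dirs.sh can also be checked in a script. The sketch below assumes the tool prints status lines followed by one JSON document with `brokers`, `logDirs`, `logDir`, and `error` fields. The example payload is hypothetical and only shaped like that output, so verify the field names against your Kafka version.

```python
import json


def offline_log_dirs(describe_output):
    """Return (broker, logDir, error) for every log dir that reports an error."""
    # Skip any status lines and pick the first line that looks like JSON.
    json_line = next(
        (line for line in describe_output.splitlines() if line.lstrip().startswith("{")),
        None,
    )
    if json_line is None:
        return []
    report = json.loads(json_line)
    offline = []
    for broker in report.get("brokers", []):
        for log_dir in broker.get("logDirs", []):
            if log_dir.get("error"):
                offline.append((broker.get("broker"), log_dir.get("logDir"), log_dir["error"]))
    return offline


# Hypothetical output, shaped like what kafka-log-dirs.sh --describe prints.
example = """Querying brokers for log directories information
Received log directory information from brokers 1
{"version":1,"brokers":[{"broker":1,"logDirs":[{"logDir":"/var/lib/kafka","error":"org.apache.kafka.common.errors.KafkaStorageException","partitions":[]}]}]}"""
print(offline_log_dirs(example))
```

An empty list means no directory reported an error at the time of the query.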

Step-by-Step Resolution

  1. Correlate timestamps. Match the Error while flushing log time to kernel IO errors or a read-only remount in journalctl -k. That confirms it is a storage fault, not a Kafka bug.
  2. Classify the fault: hard IO error (replace disk), read-only remount (fsck/remount), full volume (extend/expire), or stall/latency (relieve IO pressure).
  3. For a hard disk failure (JBOD), drain the broker, replace the device, recreate and mount the log dir with kafka:kafka ownership, and let the broker re-replicate the offline partitions on restart.
  4. For a read-only filesystem, stop the broker, unmount, run fsck, and remount read-write after confirming the device is healthy; if the device is failing, replace it instead.
  5. For latency/stall on cloud volumes, move to a higher-IOPS/throughput volume tier; the flush will keep failing under sustained IOPS exhaustion.
  6. Bring the directory back online. An offline log dir is only re-evaluated on broker restart, so restart after fixing the storage and let recovery and re-replication run.
  7. Verify with kafka-log-dirs.sh --describe (dir online) and kafka-topics.sh --describe --under-replicated-partitions (clears as replicas catch up).
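Step 1, correlating the flush error with kernel events, can be sketched as a simple time-window match. The helper names, the 60-second window, and the kernel event strings below are illustrative placeholders, and the sketch assumes both logs use the same time zone.

```python
from datetime import datetime, timedelta


def parse_kafka_time(stamp):
    """Parse a server.log timestamp such as '2026-06-29 11:52:03,778'."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def kernel_events_near(flush_time, kernel_events, window_seconds=60):
    """Return kernel events whose time falls within the window around flush_time."""
    window = timedelta(seconds=window_seconds)
    return [
        (when, message)
        for when, message in kernel_events
        if abs(when - flush_time) <= window
    ]


flush_time = parse_kafka_time("2026-06-29 11:52:03,778")
# Hypothetical kernel events; in practice these come from journalctl -k output.
events = [
    (datetime(2026, 6, 29, 11, 51, 58), "hypothetical: I/O error on the data device"),
    (datetime(2026, 6, 29, 9, 15, 0), "hypothetical: unrelated earlier message"),
]
print(kernel_events_near(flush_time, events))
```

A match inside the window supports the storage-fault diagnosis. No match suggests widening the window or checking that the clocks agree before concluding anything.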

The log.dir.failure.timeout.ms setting (default 30000 ms) does not smooth over a transient stall. In KRaft mode it bounds how long a broker may go without successfully telling the controller that a log directory has failed; past that limit the broker shuts itself down. It does not auto-recover the directory; a restart is still required once the disk is fixed.

Prevention and Best Practices

  • Use durable, predictable storage: provisioned-IOPS volumes or local NVMe sized for your sustained flush load, and avoid burst-credit volumes for high-throughput brokers.
  • Alert on disk IO error counters and on kafka.server:type=ReplicaManager,name=OfflineReplicaCount so you see a flush-driven offline dir immediately.
  • Monitor fsync latency and device queue depth; rising latency is the early warning before an outright flush failure.
  • Rely on replication (RF 3, min.insync.replicas=2, acks=all) rather than aggressive flush.ms for durability — let other brokers be your safety net so one disk fault never loses acknowledged data.
  • Run SMART monitoring and replace disks at first sign of media errors.

Related Errors

  • KafkaStorageException: Stopping serving logs in dir: the broader log-directory failure that a flush error triggers.
  • Could not recover log: recovery failure on restart, often on the same failing disk.
  • Found a corrupted segment: segment corruption that can result if a crash interrupts an in-flight flush.

Frequently Asked Questions

Did I lose acknowledged data when a flush failed? Not if you used acks=all with min.insync.replicas=2. The data exists on in-sync replicas; the partition just goes offline on this broker until storage is fixed.
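The reasoning behind that answer can be written as a small check. This is a simplified sketch with a made-up function name. It only asks whether an acknowledged write must exist on at least two replicas, and it ignores other factors such as unclean leader election.

```python
def survives_single_disk_fault(replication_factor, min_insync_replicas, acks):
    """Return True if every acknowledged write is held by at least two replicas."""
    if str(acks) not in ("all", "-1"):
        return False  # leader-only or no acknowledgement: the failed disk may hold the only copy
    if min_insync_replicas < 2:
        return False  # acks=all with a single in-sync replica still means one copy
    return replication_factor >= min_insync_replicas


print(survives_single_disk_fault(3, 2, "all"))
```

With RF 3, min.insync.replicas=2, and acks=all the check returns True; dropping to acks=1 or min.insync.replicas=1 makes it return False.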

Is this a Kafka bug or a disk problem? Almost always a disk problem. The fsync stack frame (force0) and matching kernel IO errors point at storage, not Kafka.

Why did one flush failure take partitions offline? A failed fsync means Kafka cannot guarantee durability for that directory, so it marks the whole log dir offline rather than risk silent data loss.

Will the broker recover on its own after the disk recovers? No. The offline directory is only re-scanned on restart. Fix the storage, then restart the broker so it recovers and re-replicates.
