AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Could not recover log' Recovery Failure After Crash

Fix Kafka 'Could not recover log' errors: diagnose crash recovery failures, 'Unable to allocate log segment', disk-full recovery, and brokers stuck on startup.

  • #kafka
  • #troubleshooting
  • #errors
  • #storage

Exact Error Message

When a broker restarts after a crash and cannot finish recovering a partition’s log, it fails startup with a recovery error in server.log:

[2026-06-29 09:30:44,211] INFO [Log partition=events-12, dir=/var/lib/kafka] Recovering unflushed segment 3145728 (kafka.log.UnifiedLog)
[2026-06-29 09:30:44,889] ERROR There was an error in one of the threads during logs loading: org.apache.kafka.common.errors.KafkaStorageException: Could not recover log for partition events-12 in dir /var/lib/kafka (kafka.log.LogManager)
[2026-06-29 09:30:44,902] ERROR Error while loading log dir /var/lib/kafka (kafka.log.LogManager)
java.io.IOException: No space left on device
        at java.base/sun.nio.ch.FileDispatcherImpl.truncate0(Native Method)
        at kafka.log.LogSegment.recover(LogSegment.scala:391)
[2026-06-29 09:30:45,114] ERROR Unable to allocate log segment for events-12 due to insufficient disk space (kafka.log.LogManager)
[2026-06-29 09:30:45,330] ERROR Shutdown broker because all log dirs in /var/lib/kafka have failed (kafka.log.LogManager)

The signatures are Could not recover log for partition ... in dir ..., There was an error in one of the threads during logs loading, and Unable to allocate log segment.

What the Error Means

After an unclean shutdown, the broker recovers each partition’s unflushed segments on startup: it re-reads the active (and any unflushed) segments, validates and truncates partial batches, and rebuilds indexes. Recovery requires writing — truncating files, rebuilding .index/.timeindex, and sometimes allocating a fresh segment. If any of those writes fails, recovery cannot complete and Kafka raises a KafkaStorageException: Could not recover log.

Unlike a self-healing index rebuild, this is a hard failure of the recovery process itself. The most common reason is that the disk is full, so the truncate/allocate operations needed to recover fail with ENOSPC. It can also stem from genuine data corruption that recovery cannot resolve, or from a disk that is throwing IO errors during the recovery writes. When recovery fails on the last healthy log dir, the broker shuts down.

Common Causes

  • Disk full during recovery. Recovery needs to write (truncate, rebuild indexes, allocate a segment). On a full volume these fail with No space left on device, and Unable to allocate log segment follows. This is the classic post-crash trap: the broker crashed because the disk filled, and now it cannot recover for the same reason.
  • Unrecoverable segment corruption. The data segment is damaged beyond what truncation can fix, so recovery throws rather than completing.
  • Disk IO errors during recovery. A failing device returns EIO on the recovery writes.
  • Permissions changed. The log.dirs path lost kafka ownership, so recovery writes are denied.
  • Read-only filesystem. The volume remounted read-only, blocking all recovery writes.
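
Each cause above tends to leave a recognizable error string in server.log or the kernel log. The sketch below maps a single log line to a cause label. The function name and the labels are hypothetical, and the matched strings are the usual Linux/JVM messages, which may be worded differently on other platforms.

```shell
# Sketch only: classify one log line into the causes listed above.
# classify_recovery_cause and its labels are illustrative, not Kafka terms.
classify_recovery_cause() {
  # $1: a single line from server.log or the kernel log
  case "$1" in
    *"No space left on device"*|*"Unable to allocate log segment"*)
      echo "disk-full" ;;
    *"Read-only file system"*)
      echo "read-only-fs" ;;
    *"Permission denied"*|*"AccessDeniedException"*)
      echo "permissions" ;;
    *"Input/output error"*|*"I/O error"*)
      echo "io-error" ;;
    *[Cc]orrupt*)
      echo "corruption" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example call (not executed here):
#   classify_recovery_cause "java.io.IOException: No space left on device"
```

A first-match-wins case statement keeps the disk-full check ahead of the others, which matches the ordering of likelihood in the list above.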

How to Reproduce the Error

On a disposable test broker, the disk-full recovery trap is easy to stage. Fill the volume, then restart the broker so recovery must write into a full filesystem:

# Test broker only. Fill the data volume, then restart to force a failed recovery.
fallocate -l $(df --output=avail -B1 /var/lib/kafka | tail -1) /var/lib/kafka/fillfile

With the volume at 100%, hard-kill and restart the broker. Recovery attempts to truncate/rebuild segments, the writes fail with ENOSPC, and the broker logs Could not recover log ... Unable to allocate log segment and shuts down. Delete the fill file to free space and recovery succeeds on the next start. Do not run this on production storage.

Diagnostic Commands

All commands below are read-only. First find out why recovery failed.

# Pull the recovery failure and its proximate cause
grep -E "Could not recover log|error in one of the threads|Unable to allocate log segment|No space left|KafkaStorageException" \
  /var/log/kafka/server.log | tail -30

# Capacity and inode headroom: the most common cause
df -h /var/lib/kafka
df -i /var/lib/kafka
du -sh /var/lib/kafka/*/ 2>/dev/null | sort -rh | head

# Ownership, mount state, and read-only check
ls -la /var/lib/kafka
findmnt -T /var/lib/kafka

# Inspect the partition that failed to recover, offline
kafka-dump-log.sh --files /var/lib/kafka/events-12/00000000000003145728.log \
  --deep-iteration --print-data-log | tail -20
ls -la /var/lib/kafka/events-12/

# Disk health and kernel IO errors around the crash/restart
sudo smartctl --health /dev/nvme1n1
journalctl -k --since "3 hours ago" | grep -iE "I/O error|EXT4-fs error|remounting.*read-only"
journalctl -u kafka --since "1 hour ago" | grep -iE "Recovering|loading log|Shutdown broker"

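If df -h shows 100% or df -i shows exhausted inodes, it is a space problem. If kafka-dump-log.sh reports corruption, the segment is the problem. If findmnt shows ro, the filesystem is read-only. If ls -la shows an owner other than kafka, it is a permissions problem, and if the kernel log shows I/O errors, suspect the device.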
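
The interpretation rules for df and findmnt can be written down as a small function. This is a minimal sketch: interpret_diagnostics is a hypothetical name, and it takes the byte-usage percentage, the inode-usage percentage, and the mount options as plain arguments so the logic can be checked without touching a real broker.

```shell
# Sketch only: turn df/findmnt readings into a verdict.
# Arguments: used% (integer), inode used% (integer), mount options (comma list).
interpret_diagnostics() {
  used_pct="${1:-0}"
  inode_pct="${2:-0}"
  mount_opts="${3:-}"

  # A read-only mount blocks every recovery write, so report it first.
  case ",$mount_opts," in
    *,ro,*) echo "read-only filesystem"; return 0 ;;
  esac

  if [ "$used_pct" -ge 100 ] || [ "$inode_pct" -ge 100 ]; then
    echo "space problem"
  else
    echo "not space or mount state: check the segment and kernel IO errors"
  fi
}

# Possible wiring with GNU df and util-linux findmnt (not executed here):
#   used=$(df --output=pcent  /var/lib/kafka | tail -1 | tr -dc '0-9')
#   inodes=$(df --output=ipcent /var/lib/kafka | tail -1 | tr -dc '0-9')
#   opts=$(findmnt -no OPTIONS -T /var/lib/kafka)
#   interpret_diagnostics "$used" "$inodes" "$opts"
```

Some filesystems report no inode percentage, in which case the second argument is empty and is treated as 0.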

Step-by-Step Resolution

  1. Determine the proximate cause from the log: No space left (full disk), corruption (kafka-dump-log.sh confirms), IO error (kernel logs), permissions, or read-only mount.
  2. For a full disk (the common case), free space first. Retention cannot be lowered while the broker is down, so reclaim space at the OS level: remove non-Kafka files from the volume (old logs, temp files) or extend the volume. Even a few gigabytes of headroom lets recovery complete.
  3. Restart and let recovery finish. With space available, the broker truncates, rebuilds indexes, and brings partitions online. Do not interrupt it.
  4. For unrecoverable corruption, if the topic has replicas, remove this broker’s copy of the partition directory and let it re-replicate from the leader after startup. For a single replica, move the corrupt segment files aside to let the broker start, accepting the loss.
  5. For permissions or read-only mount, restore kafka:kafka ownership (sudo chown -R kafka:kafka /var/lib/kafka) and remount read-write (after fsck if the FS flagged errors), then restart.
  6. For a failing device, replace it, recreate the log dir, and re-replicate from healthy brokers.
  7. Verify with kafka-log-dirs.sh --describe (dir online) and kafka-topics.sh --describe --under-replicated-partitions once replication settles. Both tools need --bootstrap-server pointing at a live broker.
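
For the corruption case in step 4, a more conservative variant of "remove this broker's copy" is to move the partition directory aside rather than delete it, so it can still be inspected later. The sketch below assumes the broker is stopped and the topic has healthy replicas elsewhere. quarantine_partition_dir is a hypothetical helper, and the guard against a destination inside the log dir reflects the assumption that the broker tries to load every subdirectory of a log dir as a partition.

```shell
# Sketch only: move one partition directory out of the Kafka log dir.
# Run with the broker stopped. Arguments: log dir, partition dir name, destination.
quarantine_partition_dir() {
  log_dir="$1"
  partition="$2"
  dest="$3"
  src="$log_dir/$partition"

  [ -d "$src" ] || { echo "no such partition dir: $src" >&2; return 1; }

  # Refuse a destination inside the log dir (assumption: the broker would
  # try to load it as a partition on the next start).
  case "$dest/" in
    "$log_dir"/*) echo "destination must be outside $log_dir" >&2; return 1 ;;
  esac

  # Timestamp the moved copy so repeated runs do not collide.
  mkdir -p -- "$dest" &&
    mv -- "$src" "$dest/$partition.$(date +%Y%m%dT%H%M%S)"
}

# Example call (not executed here):
#   quarantine_partition_dir /var/lib/kafka events-12 /var/tmp/kafka-quarantine
```

Moving within the same filesystem frees no space, so this fits the corruption case, not the full-disk case.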

Because recovery failure raises a KafkaStorageException and marks the log dir offline, the directory will only be re-evaluated on the next restart. The log.dir.failure.timeout.ms setting applies to a running broker (it bounds how long the broker may go without successfully reporting a failed log dir to the controller before it shuts itself down), not to a broker stuck on startup recovery. There the fix is purely to clear the blocker and restart.
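
Before restarting, it can help to confirm that the blocker is actually gone. A minimal sketch, assuming a 1 MiB probe write is a fair stand-in for the writes recovery needs: check_log_dir_writable is a hypothetical helper that writes and removes a small hidden file in the log dir. Run it as the kafka user, since root can bypass ownership problems that would still stop the broker.

```shell
# Sketch only: verify that the log dir accepts writes before restarting.
# Writes a 1 MiB probe file and removes it again.
check_log_dir_writable() {
  dir="$1"
  probe="$dir/.recovery-write-probe.$$"

  # The braces let 2>/dev/null also silence a failed redirection
  # (full disk, read-only mount, permission denied, missing directory).
  if { head -c 1048576 /dev/zero > "$probe"; } 2>/dev/null; then
    rm -f -- "$probe"
    echo "writable: $dir"
  else
    rm -f -- "$probe" 2>/dev/null
    echo "NOT writable: $dir" >&2
    return 1
  fi
}

# Example call as the kafka user (not executed here):
#   check_log_dir_writable /var/lib/kafka
```

A passing probe shows only that small writes succeed. It does not guarantee enough headroom for a full recovery.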

Prevention and Best Practices

  • Keep a hard disk-headroom policy (page at 80%, stop-the-bleed at 90%) so brokers never crash into a full-disk recovery trap.
  • Reserve enough free space for recovery overhead; do not run volumes to the brim, because recovery itself needs room to truncate and allocate.
  • Use replication factor 3 with min.insync.replicas=2 so an unrecoverable partition on one broker does not mean data loss; the broker re-replicates it from the leader.
  • Shut down gracefully (SIGTERM) so most restarts skip recovery entirely.
  • Monitor SMART health and inode usage, not just byte capacity; inode exhaustion also breaks recovery writes.
  • Browse related startup and storage failures in the Kafka guides.
  • KafkaStorageException: Stopping serving logs in dir — a runtime log-directory failure, versus this startup recovery failure.
  • Found a corrupted segment / CorruptRecordException — the corruption that can make recovery unrecoverable.
  • Corrupted index found — an index rebuild that succeeds, contrasted with a recovery that fails outright.
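
The headroom policy from the first bullet above (page at 80%, stop-the-bleed at 90%) can be expressed as a small check suitable for cron or a monitoring hook. This is a sketch: headroom_status is a hypothetical name, and the way the result is routed to paging is left out.

```shell
# Sketch only: map a disk-usage percentage to the policy levels above.
# Argument: used% as an integer (for example 83).
headroom_status() {
  used_pct="$1"
  if [ "$used_pct" -ge 90 ]; then
    echo "stop-the-bleed"
  elif [ "$used_pct" -ge 80 ]; then
    echo "page"
  else
    echo "ok"
  fi
}

# Possible wiring with GNU df (not executed here):
#   headroom_status "$(df --output=pcent /var/lib/kafka | tail -1 | tr -dc '0-9')"
```

The same function can be fed the inode percentage from df --output=ipcent, since inode exhaustion breaks recovery writes as well.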

Frequently Asked Questions

Why does the broker fail to recover right after it crashed on a full disk? Recovery must write (truncate, rebuild indexes, allocate segments). On a still-full volume those writes fail with ENOSPC, so recovery cannot complete. Free space, then restart.

Is Could not recover log data loss? Not by itself. With replicas you re-replicate the partition from the leader. Loss only occurs for single-replica topics with truly unrecoverable corruption.

What does Unable to allocate log segment mean? The broker tried to create a new segment file during recovery and the filesystem refused, almost always due to no free space (or no free inodes).

Can I just delete the partition to get the broker up? For replicated topics, removing this broker’s copy and re-replicating is the clean approach. Never delete a partition that is the only replica unless you accept losing its data.
