AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Found a corrupted segment' Corrupt Log Segment on Load

Fix Kafka corrupted log segment errors: diagnose unclean shutdowns, truncated segments, and 'Unexpected EOF while reading log' so a broker can finish startup recovery.

  • #kafka
  • #troubleshooting
  • #errors
  • #storage

Exact Error Message

During startup the LogManager loads every partition and validates its segments. A damaged data segment produces a corruption error in the broker server.log:

[2026-06-29 06:41:09,512] WARN [Log partition=orders-3, dir=/var/lib/kafka] Found a corrupted segment with base offset 1048576 due to truncated data (kafka.log.LogSegment)
[2026-06-29 06:41:09,517] WARN [Log partition=orders-3, dir=/var/lib/kafka] Recovering unflushed segment 1048576 (kafka.log.UnifiedLog)
[2026-06-29 06:41:09,640] ERROR Encountered error while recovering segment for orders-3 (kafka.log.LogSegment)
org.apache.kafka.common.errors.CorruptRecordException: Found record size -1 smaller than minimum record overhead at offset 1182041 in segment 00000000000001048576.log
[2026-06-29 06:41:09,701] WARN [Log partition=orders-3] Unexpected EOF while reading log segment 00000000000001048576.log; truncating to valid size 81993728 (kafka.log.LogSegment)

Common variants are Found a corrupted segment, Corrupt message, CorruptRecordException, and Unexpected EOF while reading log. They all mean the on-disk record bytes do not match the expected format or checksum.
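
To see which of these variants a broker actually logged, a small counting helper can be handy. The following is a minimal sketch, not part of Kafka: the function name count_corruption_variants is hypothetical, and the default log path is an assumption taken from the examples in this guide.

```shell
# Sketch: count how often each corruption variant appears in a broker log.
# count_corruption_variants is a hypothetical helper; the default path is an assumption.
count_corruption_variants() {
  local log_file="${1:-/var/log/kafka/server.log}"
  local pattern
  for pattern in \
    "Found a corrupted segment" \
    "Corrupt message" \
    "CorruptRecordException" \
    "Unexpected EOF while reading log"; do
    # -c prints the number of matching lines; -F matches the text literally.
    printf '%s\t%s\n' "$(grep -cF -- "$pattern" "$log_file")" "$pattern"
  done
}
```

Calling count_corruption_variants /var/log/kafka/server.log prints one tab-separated line per variant, count first. It only reads the log file.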

What the Error Means

Each Kafka log segment is a .log file of length-prefixed, CRC-checksummed record batches. When the broker starts after an unclean shutdown, it recovers every segment that was not known to be flushed: it reads batches sequentially and validates each one's size and CRC. If it finds a batch whose declared size is impossible (negative or absurdly large), a batch with a bad CRC, or a batch that runs past the end of the file, it reports a corrupted segment.

For an unflushed (recoverable) segment, Kafka truncates the file back to the last valid batch boundary and continues. You lose only the partially written tail; with acks=all and healthy replicas, that tail was either never acknowledged or still exists on another replica. For a segment that should have been fully flushed, corruption indicates real damage, and recovery may fail and block startup until the bad segment is dealt with.
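
To put a number on the dropped tail, the valid size that the broker logs can be compared with the file size recorded before the restart. This is a rough sketch under two assumptions: the log line contains the phrase "truncating to valid size N" as in the sample output above, and you noted the segment's pre-recovery size yourself (for example from ls). Both function names are hypothetical.

```shell
# Sketch: estimate how many bytes recovery dropped from a segment.
# Both helpers are hypothetical and rely on the "truncating to valid size N"
# wording seen in the sample log; other Kafka versions may word it differently.
valid_size_from_log_line() {
  # Prints N, or nothing when the line does not contain the phrase.
  printf '%s\n' "$1" | sed -n 's/.*truncating to valid size \([0-9][0-9]*\).*/\1/p'
}

dropped_tail_bytes() {
  local pre_recovery_size="$1" log_line="$2" valid
  valid="$(valid_size_from_log_line "$log_line")"
  # Fail rather than guess when no valid size could be parsed.
  [ -n "$valid" ] || return 1
  printf '%s\n' "$((pre_recovery_size - valid))"
}
```

A small result (at most a few batches) is consistent with a partial tail write; a large gap suggests the damage started well before the end of the file and deserves a closer look.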

This is a data-integrity event at the segment level, not a whole-disk failure.

Common Causes

  • Unclean shutdown. A power loss, OOM kill, or kill -9 during a write leaves the active segment with a half-written final batch and no clean shutdown marker, so recovery is required.
  • Truncated segment. The process died mid-write, or a copy/restore (rsync, snapshot) captured a segment file while it was being appended.
  • Page-cache loss without fsync. Data acknowledged only in the OS page cache was lost on crash because it was never flushed to disk.
  • Underlying storage bit-rot or IO error. A bad block flips bytes inside an already-flushed segment, failing the CRC check.
  • Manual tampering. Someone edited, partially deleted, or truncated files under log.dirs directly.

How to Reproduce the Error

In a throwaway test cluster, an unclean shutdown reliably produces recoverable corruption. Produce a stream of messages, then hard-kill the broker mid-write:

# Test broker only. Hard-kill while actively producing to leave a partial tail batch.
# pkill signals every matching PID; a quoted "$(pgrep ...)" fails if more than one process matches.
pkill -9 -f 'kafka\.Kafka'

On the next start, the broker logs Found a corrupted segment ... due to truncated data for the active segment and truncates it to the last valid offset. To simulate damage to a flushed segment, stop a test broker and append garbage to a .log file (printf '\377\377\377\377' >> segment.log; octal escapes work in every POSIX printf, whereas \xff does not). Startup can then report a CorruptRecordException; whether the broker truncates past the damage or halts depends on where the bytes land and whether that segment is re-validated. Never do either on production data.
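
A lower-risk way to study the garbage-append case is to damage a copy of a segment in a scratch directory and inspect that copy offline. The helper below is a sketch only; make_damaged_copy is a hypothetical name, and the scratch directory is assumed to exist and to sit outside log.dirs.

```shell
# Sketch: append four garbage bytes to a *copy* of a segment, never the original.
# make_damaged_copy is a hypothetical helper; scratch_dir is assumed to exist
# and to be outside the broker's log.dirs.
make_damaged_copy() {
  local src="$1" scratch_dir="$2" dst
  dst="$scratch_dir/$(basename "$src")"
  # Refuse to overwrite anything, which also blocks scratch_dir == source dir.
  [ ! -e "$dst" ] || return 1
  cp -- "$src" "$dst" || return 1
  # Octal escapes are portable across printf implementations.
  printf '\377\377\377\377' >> "$dst"
  printf '%s\n' "$dst"
}
```

The printed path can then be passed to kafka-dump-log.sh --files to see how the tool reports the damaged tail, without any broker ever loading the file.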

Diagnostic Commands

All commands here only read files. The most useful tool is kafka-dump-log.sh, which parses a segment offline and validates each batch.

# Inspect a suspect segment offline; --deep-iteration validates every record CRC
kafka-dump-log.sh --files /var/lib/kafka/orders-3/00000000000001048576.log \
  --deep-iteration --print-data-log | tail -40

# Locate the corruption/EOF lines and the partition involved
grep -E "corrupted segment|CorruptRecordException|Unexpected EOF|Recovering unflushed" \
  /var/log/kafka/server.log | tail -30

# Find the offending segment files (data plus both indexes) and their sizes
ls -la /var/lib/kafka/orders-3/ | grep -E "\.(log|index|timeindex)$"
du -sh /var/lib/kafka/orders-3/

# Did the broker shut down uncleanly? Check for the clean-shutdown marker while the broker
# is still stopped; the broker removes the marker during startup, so check before restarting
ls -la /var/lib/kafka/.kafka_cleanshutdown 2>/dev/null
journalctl -u kafka --since "2 hours ago" | grep -iE "SIGKILL|out of memory|power|terminated"

# Rule out a failing device underneath the corruption
sudo smartctl --health /dev/nvme1n1
journalctl -k --since "2 hours ago" | grep -iE "I/O error|EXT4-fs error"

kafka-dump-log.sh --deep-iteration will report the exact offset where parsing breaks, telling you whether only the tail is bad (recoverable) or a mid-file batch is damaged (real loss).

Step-by-Step Resolution

  1. Read the log carefully. If you see Recovering unflushed segment followed by truncating to valid size, recovery is handling it automatically — let startup finish. Only the unacknowledged tail is dropped.
  2. If recovery throws and startup halts, identify the exact segment from the error and inspect it with kafka-dump-log.sh --deep-iteration to find where parsing fails.
  3. Prefer replication-based recovery. If replication factor is greater than 1 and other replicas are healthy, the cleanest fix is to remove that broker’s copy of the damaged partition directory and let it re-fetch a clean copy from the leader. Stop the broker, move the partition directory aside, restart, and let it re-replicate.
  4. For a single-replica topic with hard corruption, the damaged segment cannot be reconstructed. You can move the corrupt .log/.index/.timeindex triplet aside to let the broker start and serve the surviving segments, accepting the loss of that segment’s records.
  5. Restart and watch recovery. The broker rebuilds indexes for any truncated segment and resumes serving.
  6. Confirm health. Re-run kafka-dump-log.sh on the recovered segment, then run kafka-topics.sh --bootstrap-server <broker> --describe --under-replicated-partitions until it lists no partitions, which means replication has caught up.
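
Step 4 can be sketched as a helper that moves one segment's three files out of the partition directory. This is an illustration under assumptions, not a supported Kafka procedure: both function names are hypothetical, the quarantine directory is assumed to live outside log.dirs so the broker does not scan it, and the helper does not check that the broker is stopped.

```shell
# Sketch: move a corrupt segment's .log/.index/.timeindex files aside.
# Hypothetical helpers; run only with the broker stopped, and only after
# accepting the loss of that segment's records.
segment_basename() {
  # Segment files are named after the base offset, zero-padded to 20 digits.
  printf '%020d' "$1"
}

quarantine_segment() {
  local partition_dir="$1" base_offset="$2" quarantine_dir="$3"
  local base ext moved=0
  base="$(segment_basename "$base_offset")"
  mkdir -p -- "$quarantine_dir" || return 1
  for ext in log index timeindex; do
    if [ -e "$partition_dir/$base.$ext" ]; then
      mv -- "$partition_dir/$base.$ext" "$quarantine_dir/" || return 1
      moved=$((moved + 1))
    fi
  done
  # Report how many of the three files were actually moved.
  printf '%s\n' "$moved"
}
```

For example, quarantine_segment /var/lib/kafka/orders-3 1048576 /var/tmp/kafka-quarantine would move the segment named in the sample log. Moving rather than deleting keeps the files available for later inspection.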

Note that a startup recovery that fails with a KafkaStorageException will also mark the log dir offline; if that happens, work through the log-directory failure path as well once the segment is fixed.

Prevention and Best Practices

  • Run replication factor 3 with min.insync.replicas=2 and acks=all so an unacknowledged tail truncated during recovery is never data your producers believed was durable.
  • Avoid kill -9 on brokers; use graceful shutdown so segments flush and the clean-shutdown marker is written, skipping recovery entirely.
  • Never copy live log.dirs with rsync/snapshots while the broker is writing; back up via replication or stop the broker first.
  • Size the JVM heap to leave memory headroom so the kernel does not OOM-kill the broker, and tune vm.dirty_background_ratio and vm.dirty_ratio so the OS flushes dirty pages predictably.
  • Monitor smartctl health on data devices to catch bit-rot before it corrupts flushed segments.

Related Errors

  • Browse more storage-layer failure patterns in the Kafka guides for related index and recovery errors.
  • Corrupted index found / Found invalid offset index: index-file corruption, which Kafka rebuilds, versus data-segment corruption here.
  • Could not recover log: when segment recovery itself fails and blocks startup.
  • KafkaStorageException: Stopping serving logs in dir: a whole log-directory failure that can accompany unrecoverable corruption.

Frequently Asked Questions

Does a corrupted segment mean I lost data? Usually only the unacknowledged tail. With acks=all and replication, the lost bytes were never confirmed durable. Mid-file corruption of a flushed segment on a single replica is real loss.

Why does this appear after a crash? A hard stop leaves the active segment partially written with no clean-shutdown marker, so the broker must recover it on the next start and truncates the partial final batch.

Is it safe to delete the corrupt segment files? Only as a last resort for single-replica topics; you lose that segment’s records. With replicas, prefer removing the broker’s partition copy and re-replicating from the leader.

What does kafka-dump-log.sh --deep-iteration do? It parses the segment offline and validates every record batch CRC, pinpointing the offset where corruption begins without modifying the file.
