Kafka Error Guide: 'Found a corrupted segment' Corrupt Log Segment on Load
Fix Kafka corrupted log segment errors: diagnose unclean shutdowns, truncated segments, and 'Unexpected EOF while reading log' so a broker can finish startup recovery.
- #kafka
- #troubleshooting
- #errors
- #storage
Exact Error Message
During startup the LogManager loads every partition and validates its segments. A damaged data segment produces a corruption error in the broker server.log:
[2026-06-29 06:41:09,512] WARN [Log partition=orders-3, dir=/var/lib/kafka] Found a corrupted segment with base offset 1048576 due to truncated data (kafka.log.LogSegment)
[2026-06-29 06:41:09,517] WARN [Log partition=orders-3, dir=/var/lib/kafka] Recovering unflushed segment 1048576 (kafka.log.UnifiedLog)
[2026-06-29 06:41:09,640] ERROR Encountered error while recovering segment for orders-3 (kafka.log.LogSegment)
org.apache.kafka.common.errors.CorruptRecordException: Found record size -1 smaller than minimum record overhead at offset 1182041 in segment 00000000000001048576.log
[2026-06-29 06:41:09,701] WARN [Log partition=orders-3] Unexpected EOF while reading log segment 00000000000001048576.log; truncating to valid size 81993728 (kafka.log.LogSegment)
Common variants are Found a corrupted segment, Corrupt message, CorruptRecordException, and Unexpected EOF while reading log. They all mean the on-disk record bytes do not match the expected format or checksum.
What the Error Means
Each Kafka log segment is a .log file of length-prefixed, CRC-checksummed record batches. On startup after an unclean shutdown, the broker recovers any segment that was not cleanly flushed: it reads batches sequentially, validating size and CRC. If it finds a batch whose declared size is impossible (negative or absurdly large), a batch with a bad CRC, or a batch that runs off the end of the file, it reports a corrupted segment.
For an unflushed (recoverable) segment, Kafka truncates the file back to the last valid batch boundary and continues. You lose only the partially written tail, which was never acknowledged to producers that use durable settings (acks=all with replication). For a segment that should have been fully flushed, corruption indicates real damage and recovery may fail, blocking startup until the bad segment is dealt with.
This is a data-integrity event at the segment level, not a whole-disk failure.
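To make the "last valid batch boundary" idea concrete, here is a minimal sketch that walks only the length-prefixed framing of a segment file. It assumes each entry starts with a 12-byte header (an 8-byte big-endian base offset followed by a 4-byte big-endian size) and uses 14 bytes as the smallest plausible size; both the function name valid_bytes and that default are illustrative assumptions, not broker code. It does not check CRCs, so treat kafka-dump-log.sh as the authoritative tool.

```shell
# Sketch only: report how many leading bytes of a segment form complete,
# plausibly sized entries. CRCs are NOT validated here.
# Assumed entry layout: 8-byte base offset, 4-byte big-endian size, payload.
# valid_bytes FILE [MIN_SIZE]   (MIN_SIZE defaults to 14, an assumption)
valid_bytes() {
  local file=$1 min_size=${2:-14}
  local total pos=0 size hex
  total=$(( $(wc -c < "$file") ))
  while (( pos + 12 <= total )); do
    # Read the 4-byte size field that follows the 8-byte base offset.
    hex=$(od -An -v -tx1 -j $((pos + 8)) -N 4 "$file" | tr -d ' \n')
    size=$((16#$hex))
    # Reinterpret the unsigned value as a signed 32-bit integer.
    (( size >= 2147483648 )) && size=$((size - 4294967296))
    # Stop at an impossible size or at an entry that runs past end of file.
    if (( size < min_size || pos + 12 + size > total )); then
      break
    fi
    pos=$((pos + 12 + size))
  done
  echo "$pos"
}
```

Run it against a copy of the segment, for example valid_bytes /tmp/copy-of-segment.log. A result smaller than the file size suggests a damaged or partial entry starting at that byte position. The loop forks od once per entry, so it is slow on large segments.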
Common Causes
- Unclean shutdown. A power loss, OOM kill, or kill -9 during a write leaves the active segment with a half-written final batch and no clean shutdown marker, so recovery is required.
- Truncated segment. The process died mid-write, or a copy/restore (rsync, snapshot) captured a segment file while it was being appended.
- Page-cache loss without fsync. Data acknowledged only in the OS page cache was lost on crash because it was never flushed to disk.
- Underlying storage bit-rot or IO error. A bad block flips bytes inside an already-flushed segment, failing the CRC check.
- Manual tampering. Someone edited, partially deleted, or truncated files under log.dirs directly.
How to Reproduce the Error
In a throwaway test cluster, an unclean shutdown reliably produces recoverable corruption. Produce a stream of messages, then hard-kill the broker mid-write:
# Test broker only. Hard-kill while actively producing to leave a partial tail batch.
# pkill copes with zero or several matching PIDs, unlike kill with a quoted pgrep result.
pkill -9 -f 'kafka\.Kafka'
On the next start, the broker logs Found a corrupted segment ... due to truncated data for the active segment and truncates it to the last valid offset. To simulate hard damage to a flushed segment, append garbage to a .log file (printf '\xff\xff\xff\xff' >> segment.log) on a stopped test broker; startup then reports a CorruptRecordException that cannot be auto-recovered. Never do either on production data.
Diagnostic Commands
All commands here only read files. The most useful tool is kafka-dump-log.sh, which parses a segment offline and validates each batch.
# Inspect a suspect segment offline; --deep-iteration validates every record CRC
kafka-dump-log.sh --files /var/lib/kafka/orders-3/00000000000001048576.log \
--deep-iteration --print-data-log | tail -40
# Locate the corruption/EOF lines and the partition involved
grep -E "corrupted segment|CorruptRecordException|Unexpected EOF|Recovering unflushed" \
/var/log/kafka/server.log | tail -30
# Find the offending segment files and their sizes
ls -la /var/lib/kafka/orders-3/ | grep -E "\.log$|\.index$"
du -sh /var/lib/kafka/orders-3/
# Did the broker shut down uncleanly? Check for a missing clean-shutdown marker
ls -la /var/lib/kafka/.kafka_cleanshutdown 2>/dev/null
journalctl -u kafka --since "2 hours ago" | grep -iE "SIGKILL|out of memory|power|terminated"
# Rule out a failing device underneath the corruption
sudo smartctl --health /dev/nvme1n1
journalctl -k --since "2 hours ago" | grep -iE "I/O error|EXT4-fs error"
kafka-dump-log.sh --deep-iteration will report the exact offset where parsing breaks, telling you whether only the tail is bad (recoverable) or a mid-file batch is damaged (real loss).
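When several partitions are affected, it can help to turn the warnings into a list of partition and segment pairs before running the dump tool. The sketch below assumes log lines shaped like the "Found a corrupted segment with base offset" example earlier in this guide; the function names corrupt_segments and segment_file are hypothetical helpers, and the pattern may need adjusting for other Kafka versions.

```shell
# Sketch: list "partition base_offset" pairs from corruption warnings.
# Assumes the line format shown in the Exact Error Message section.
# corrupt_segments SERVER_LOG
corrupt_segments() {
  sed -nE 's/.*\[Log partition=([^,]+),.*Found a corrupted segment with base offset ([0-9]+).*/\1 \2/p' "$1" | sort -u
}

# Segment files are named after their base offset, zero-padded to 20 digits.
# Pass the offset without leading zeros so printf reads it as decimal.
# segment_file BASE_OFFSET
segment_file() {
  printf '%020d.log\n' "$1"
}
```

For example, corrupt_segments /var/log/kafka/server.log prints one pair per damaged segment, and segment_file 1048576 gives the file name to pass to kafka-dump-log.sh --files.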
Step-by-Step Resolution
- Read the log carefully. If you see Recovering unflushed segment followed by truncating to valid size, recovery is handling it automatically, so let startup finish. Only the unacknowledged tail is dropped.
- If recovery throws and startup halts, identify the exact segment from the error and inspect it with kafka-dump-log.sh --deep-iteration to find where parsing fails.
- Prefer replication-based recovery. If replication factor is greater than 1 and other replicas are healthy, the cleanest fix is to remove that broker’s copy of the damaged partition directory and let it re-fetch a clean copy from the leader. Stop the broker, move the partition directory aside, restart, and let it re-replicate.
- For a single-replica topic with hard corruption, the damaged segment cannot be reconstructed. You can move the corrupt .log/.index/.timeindex triplet aside to let the broker start and serve the surviving segments, accepting the loss of that segment’s records.
- Restart and watch recovery. The broker rebuilds indexes for any truncated segment and resumes serving.
- Confirm health. Re-run kafka-dump-log.sh on the recovered segment and check kafka-topics.sh --describe --under-replicated-partitions until replication catches up.
Note that if startup recovery fails with a KafkaStorageException, the broker also marks the log directory offline. In that case, work through the log-directory failure procedure as well once the segment is fixed.
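The two "move aside" steps above can be sketched as small helpers that move files rather than delete them, so the originals stay available for later inspection. Both function names and the destination paths are hypothetical, and the sketch assumes the broker is already stopped; it performs no check of its own. Only the .log/.index/.timeindex triplet named in the steps is handled.

```shell
# Sketch only: run with the broker STOPPED. Nothing is deleted, only moved.

# quarantine_partition LOG_DIR PARTITION DEST_ROOT
# Moves a whole partition directory aside and prints its new location.
quarantine_partition() {
  local src="$1/$2" dest_root=$3 dest
  [ -d "$src" ] || { echo "no such partition dir: $src" >&2; return 1; }
  dest="$dest_root/$2.$(date +%Y%m%dT%H%M%S)"
  mkdir -p "$dest_root" && mv "$src" "$dest" && echo "$dest"
}

# quarantine_segment PARTITION_DIR BASE_OFFSET DEST_DIR
# Moves one segment's .log/.index/.timeindex files aside, if present.
quarantine_segment() {
  local part_dir=$1 dest=$3 name ext
  name=$(printf '%020d' "$2")
  mkdir -p "$dest" || return 1
  for ext in log index timeindex; do
    if [ -e "$part_dir/$name.$ext" ]; then
      mv "$part_dir/$name.$ext" "$dest/" || return 1
    fi
  done
}
```

A replicated partition would use something like quarantine_partition /var/lib/kafka orders-3 /var/tmp/kafka-quarantine, while a single-replica topic would use quarantine_segment /var/lib/kafka/orders-3 1048576 /var/tmp/kafka-quarantine/orders-3. Keeping the quarantine directory on the same filesystem makes the move a rename instead of a copy.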
Prevention and Best Practices
- Run replication factor 3 with min.insync.replicas=2 and acks=all so an unacknowledged tail truncated during recovery is never data your producers believed was durable.
- Avoid kill -9 on brokers; use graceful shutdown so segments flush and the clean-shutdown marker is written, skipping recovery entirely.
- Never copy live log.dirs with rsync/snapshots while the broker is writing; back up via replication or stop the broker first.
- Provision adequate heap and set sane vm.dirty_* so the broker is not OOM-killed and the OS flushes dirty pages predictably.
- Monitor smartctl health on data devices to catch bit-rot before it corrupts flushed segments.
- Browse more storage-layer failure patterns in the Kafka guides for related index and recovery errors.
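As a rough way to audit the first recommendation, the sketch below reads kafka-topics.sh --describe output on stdin and lists topics with a replication factor under 3 or without a topic-level min.insync.replicas of at least 2. The function name weak_topics is hypothetical, and the parsing assumes the usual summary line containing "Topic:", "ReplicationFactor:" and "Configs:" fields. A topic that relies on a broker-wide min.insync.replicas default may be reported as unset, so check the broker configuration before acting on the result.

```shell
# Sketch: flag topics whose describe summary line shows weak durability.
# Usage: kafka-topics.sh --bootstrap-server HOST:PORT --describe | weak_topics
weak_topics() {
  awk '
    /ReplicationFactor:/ {
      topic = ""; rf = 0; isr = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "Topic:") topic = $(i + 1)
        if ($i == "ReplicationFactor:") rf = $(i + 1) + 0
      }
      # "min.insync.replicas=" is 20 characters long; keep the digits after it.
      if (match($0, /min\.insync\.replicas=[0-9]+/))
        isr = substr($0, RSTART + 20, RLENGTH - 20) + 0
      if (rf < 3 || isr == "" || isr < 2)
        printf "%s rf=%d min.insync.replicas=%s\n", topic, rf, (isr == "" ? "unset" : isr)
    }'
}
```

Topics that print nothing already match the recommended settings at the topic level.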
Related Errors
- Corrupted index found / Found invalid offset index: index-file corruption, which Kafka rebuilds, versus data-segment corruption here.
- Could not recover log: when segment recovery itself fails and blocks startup.
- KafkaStorageException: Stopping serving logs in dir: a whole log-directory failure that can accompany unrecoverable corruption.
Frequently Asked Questions
Does a corrupted segment mean I lost data?
Usually only the unacknowledged tail. With acks=all and replication, the lost bytes were never confirmed durable. Mid-file corruption of a flushed segment on a single replica is real loss.
Why does this appear after a crash?
A hard stop leaves the active segment partially written with no clean-shutdown marker, so the broker must recover it on the next start and truncates the partial final batch.
Is it safe to delete the corrupt segment files?
Only as a last resort for single-replica topics; you lose that segment’s records. With replicas, prefer removing the broker’s partition copy and re-replicating from the leader.
What does kafka-dump-log.sh --deep-iteration do?
It parses the segment offline and validates every record batch CRC, pinpointing the offset where corruption begins without touching the file.