Kafka Error Guide: 'Snapshot generation failed' Metadata Snapshot Write Error
Fix KRaft 'Snapshot generation failed': diagnose disk-full, permissions, and I/O errors when the controller writes a __cluster_metadata snapshot to checkpoint state.
- #kafka
- #troubleshooting
- #errors
- #kraft
Exact Error Message
KRaft periodically writes a snapshot of the __cluster_metadata log so the log can be truncated and new followers can bootstrap quickly. When the snapshot write fails, the controller logs it and keeps the old snapshot:
[2026-06-29 03:12:09,447] ERROR [SnapshotEmitter id=1] Snapshot generation failed for snapshot 00000000000009850000-0000000487.checkpoint (org.apache.kafka.image.publisher.SnapshotEmitter)
java.io.IOException: No space left on device
at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at org.apache.kafka.snapshot.FileRawSnapshotWriter.freeze(FileRawSnapshotWriter.java:113)
at org.apache.kafka.image.publisher.SnapshotEmitter.maybeEmit(SnapshotEmitter.java:131)
[2026-06-29 03:12:09,460] WARN [QuorumController id=1] Error while generating snapshot; will retry; metadata log will continue to grow (org.apache.kafka.controller.QuorumController)
You may also see permission variants:
java.nio.file.AccessDeniedException: /var/lib/kafka/__cluster_metadata-0/00000000000009850000-0000000487.checkpoint.tmp
What the Error Means
A KRaft snapshot is a point-in-time materialization of cluster metadata at a specific offset and epoch (encoded in the filename, e.g. <endOffset>-<epoch>.checkpoint). The controller writes it to a temporary file, fsyncs, then atomically renames it into place (“freeze”). “Snapshot generation failed” means this write/freeze did not complete — almost always because the underlying filesystem rejected the write (no space, no permission, or an I/O error).
The immediate consequence is not data loss: the previous valid snapshot is retained and the metadata log simply keeps growing because it cannot be truncated past an offset that has not been snapshotted. Over time this is dangerous: the metadata log grows unbounded, disk pressure worsens (often the very cause), and new/lagging followers cannot use a recent snapshot to catch up quickly.
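The offset and epoch can be read straight off a snapshot filename, which helps when comparing the snapshot a node holds against its log growth. The following is a minimal sketch, assuming the zero-padded <endOffset>-<epoch>.checkpoint layout seen in the log line above; parse_checkpoint_name is a hypothetical helper, not a Kafka tool.

```shell
# parse_checkpoint_name is a hypothetical helper, not part of Kafka.
# Prints "<endOffset> <epoch>" as plain decimals for a snapshot filename.
parse_checkpoint_name() (
  name="${1##*/}"              # drop any leading directory
  name="${name%.checkpoint}"   # drop the suffix
  case "$name" in
    *[!0-9-]* | *-*-* | -* | *- ) exit 1 ;;  # not <digits>-<digits>
    *-*) ;;
    *) exit 1 ;;
  esac
  offset="${name%%-*}"         # digits before the dash
  epoch="${name##*-}"          # digits after the dash
  # Strip the zero padding as text, so nothing is interpreted as octal.
  offset="${offset#"${offset%%[!0]*}"}"
  epoch="${epoch#"${epoch%%[!0]*}"}"
  echo "${offset:-0} ${epoch:-0}"
)
```

Calling it with the filename from the error message prints the end offset followed by the epoch; a name that does not match the layout (for example a .checkpoint.tmp file) makes the function return non-zero.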
Common Causes
- Disk full on the controller’s metadata/log directory: the most common cause, and frequently self-reinforcing because failed snapshots prevent log truncation.
- Filesystem permissions: the Kafka process user cannot write the .checkpoint.tmp file (wrong ownership after a manual restore or volume remount).
- Read-only filesystem: a disk error or failed mount remounted the volume read-only.
- I/O errors / failing disk: bad sectors or a degraded volume causing write failures.
- Inode exhaustion: space appears free but no inodes remain to create the temp file.
- Insufficient log.dir capacity planning for clusters with very large metadata (huge topic/partition counts).
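As a rough triage aid, the exception text under the error line can be mapped to the causes above. This is a minimal sketch, assuming the usual Linux error strings surfaced by the JVM; classify_snapshot_failure and its output labels are hypothetical names chosen for illustration.

```shell
# classify_snapshot_failure is a hypothetical helper. It matches the usual
# Linux/JVM error strings and prints a coarse cause label.
classify_snapshot_failure() {
  case "$1" in
    # ENOSPC is reported both for a full disk and for exhausted inodes,
    # so this label means: check df -h and df -i.
    *"No space left on device"*)                   echo "disk-full-or-no-inodes" ;;
    *"Read-only file system"*)                     echo "read-only-filesystem" ;;
    *AccessDeniedException*|*"Permission denied"*) echo "permissions" ;;
    *"Input/output error"*)                        echo "io-error" ;;
    *)                                             echo "unknown" ;;
  esac
}
```

Feed it the exception line that follows "Snapshot generation failed" in the controller log. An "unknown" result only means the text did not match these patterns, not that the disk is healthy.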
How to Reproduce the Error
Fill or write-protect the controller’s metadata directory and force snapshot pressure:
# Lab only: make the metadata dir unwritable to the kafka user
sudo chown -R root:root /var/lib/kafka/__cluster_metadata-0
sudo chmod -R 500 /var/lib/kafka/__cluster_metadata-0
# Generate metadata churn so a snapshot is attempted, then watch the log
sudo journalctl -u kafka -f | grep -i 'snapshot'
Alternatively, fill the volume (e.g. with a large temp file) so the snapshot write hits “No space left on device.” The controller logs “Snapshot generation failed” and warns the metadata log will keep growing.
Diagnostic Commands
All read-only.
# Disk space and inodes on the metadata/log directory
df -h /var/lib/kafka
df -i /var/lib/kafka
# Ownership / permissions of the metadata dir and existing snapshots
ls -ld /var/lib/kafka/__cluster_metadata-0
ls -la /var/lib/kafka/__cluster_metadata-0/*.checkpoint 2>/dev/null
# Is the filesystem mounted read-only?
# Double quotes are required so the $(...) expands to the backing device name
mount | grep -E "kafka|$(df --output=source /var/lib/kafka | tail -1)"
# Snapshot errors and growth warnings in the controller log
grep -iE 'snapshot generation failed|error while generating snapshot|No space|AccessDenied|Read-only' \
/var/log/kafka/controller.log | tail -40
# How big has the metadata log grown (symptom of failed truncation)?
du -sh /var/lib/kafka/__cluster_metadata-0
# Disk health / I/O errors
dmesg | grep -iE 'i/o error|ext4-fs error|xfs|remount' | tail -20
# Quorum health as the controllers see it: leader, high watermark, follower lag (read-only)
kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status
If df shows the volume full or df -i shows zero free inodes, that is your cause. If space is fine, check ownership and read-only mounts.
Step-by-Step Resolution
- Free disk space on the metadata volume if it is full. Reclaim space from other directories on the same filesystem; do not hand-delete files inside __cluster_metadata-0. Once space is available, the controller retries the snapshot and then truncates the grown log.
- Fix ownership and permissions so the Kafka service user owns the metadata directory and can write temp files: restore the correct user/group and mode used by your packaging (typically the kafka user).
- Remount the filesystem read-write if a disk error forced read-only, after checking the disk (dmesg, SMART). Replace failing disks.
- Resolve inode exhaustion by clearing small-file clutter elsewhere on the volume or growing/reformatting with more inodes.
- Confirm a snapshot succeeds by watching controller.log for a successful emit and checking that du -sh __cluster_metadata-0 stops growing and a newer .checkpoint appears.
- Right-size storage for the metadata volume so future snapshots always have headroom.
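The "newer .checkpoint appears" check can be scripted by recording the latest snapshot name before the fix and comparing it afterwards. This is a minimal sketch that relies on the zero-padded filenames sorting in offset order; newest_checkpoint is a hypothetical helper, and the directory you pass is whatever your metadata log directory is.

```shell
# newest_checkpoint is a hypothetical helper: it prints the name of the
# latest *.checkpoint file in a directory, or returns non-zero if none exist.
newest_checkpoint() (
  dir="$1"
  latest=""
  # Zero-padded names sort lexicographically in offset order, and the shell
  # expands globs in sorted order, so the last match is the newest snapshot.
  for f in "$dir"/*.checkpoint; do
    [ -e "$f" ] && latest="${f##*/}"
  done
  [ -n "$latest" ] && echo "$latest"
)
```

Run it once before remediation and again after the controller has had time to retry. If the printed name has not changed, the snapshot is still failing. In-progress .checkpoint.tmp files are ignored because they do not match the glob.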
Prevention and Best Practices
- Put __cluster_metadata on a volume with generous free headroom and alert at 75–80% usage; failed snapshots are usually a downstream symptom of disk pressure.
- Monitor inodes, not just bytes, on the metadata filesystem.
- Lock down ownership of log.dir/metadata.log.dir in configuration management so restores and remounts cannot strip write access from the Kafka user.
- Alert specifically on controller.log lines matching “Snapshot generation failed”; they signal the metadata log will grow until fixed.
- Watch disk SMART/dmesg for I/O errors so a failing disk is replaced before it forces read-only.
- Track __cluster_metadata-0 directory size as a metric; steady growth without truncation indicates snapshots are failing.
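Where no monitoring agent is available, the usage alert can be approximated with a small cron-style check. This is a sketch under stated assumptions: disk_used_pct and usage_level are hypothetical helpers, the 75% warning default follows the guidance above, and the 90% critical default is an arbitrary illustrative value.

```shell
# Hypothetical helpers for a cron or monitoring check.
disk_used_pct() {
  # POSIX "df -P" prints one data line; column 5 is the capacity, e.g. "42%".
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# usage_level <used_pct> [warn_pct] [crit_pct]
# Defaults (75 and 90) are assumptions; tune them to your environment.
usage_level() (
  pct="$1"; warn="${2:-75}"; crit="${3:-90}"
  if   [ "$pct" -ge "$crit" ]; then echo "critical"
  elif [ "$pct" -ge "$warn" ]; then echo "warning"
  else                              echo "ok"
  fi
)
```

A typical call would be usage_level "$(disk_used_pct /var/lib/kafka)". The same usage_level function can be fed an inode percentage, so bytes and inodes are checked against the same thresholds.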
Related Errors
- Metadata log corruption detected — a damaged segment, sometimes following a disk-full or I/O-error event.
- Unable to fetch metadata log — followers that cannot catch up because no recent snapshot exists to bootstrap from.
- Failed to append metadata record — the same disk/permission problems can block appends, not just snapshots. See the Kafka guides.
Frequently Asked Questions
Do I lose metadata when a snapshot fails? No. The prior valid snapshot and the full metadata log are intact. The risk is the log growing unbounded until snapshots succeed again.
Why does the disk keep filling after a failed snapshot? Because the metadata log cannot be truncated past an un-snapshotted offset, it keeps growing — worsening the disk pressure that likely caused the failure. It is a vicious cycle; free space promptly.
Can I delete old .checkpoint files to free space? No. Deleting snapshots by hand can corrupt the node’s metadata state. Reclaim space elsewhere on the volume instead.
Is one controller failing snapshots a cluster outage? Not immediately — other controllers still snapshot. But that node’s log grows and it becomes a weak quorum member, so fix it promptly.
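How often does KRaft snapshot? It snapshots when either of two configured thresholds is crossed: the bytes of metadata records written since the last snapshot (metadata.log.max.record.bytes.between.snapshots) or the time elapsed since the last snapshot (metadata.log.max.snapshot.interval.ms). Heavy metadata change rates trigger snapshots more often, making disk headroom more important.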