AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Snapshot generation failed' Metadata Snapshot Write Error

Fix KRaft 'Snapshot generation failed': diagnose disk-full, permissions, and I/O errors when the controller writes a __cluster_metadata snapshot to checkpoint state.

  • #kafka
  • #troubleshooting
  • #errors
  • #kraft

Exact Error Message

KRaft periodically writes a snapshot of the __cluster_metadata log so the log can be truncated and new followers can bootstrap quickly. When the snapshot write fails, the controller logs it and keeps the old snapshot:

[2026-06-29 03:12:09,447] ERROR [SnapshotEmitter id=1] Snapshot generation failed for snapshot 00000000000009850000-0000000487.checkpoint (org.apache.kafka.image.publisher.SnapshotEmitter)
java.io.IOException: No space left on device
    at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at org.apache.kafka.snapshot.FileRawSnapshotWriter.freeze(FileRawSnapshotWriter.java:113)
    at org.apache.kafka.image.publisher.SnapshotEmitter.maybeEmit(SnapshotEmitter.java:131)
[2026-06-29 03:12:09,460] WARN [QuorumController id=1] Error while generating snapshot; will retry; metadata log will continue to grow (org.apache.kafka.controller.QuorumController)

You may also see permission variants:

java.nio.file.AccessDeniedException: /var/lib/kafka/__cluster_metadata-0/00000000000009850000-0000000487.checkpoint.part

What the Error Means

A KRaft snapshot is a point-in-time materialization of cluster metadata at a specific offset and epoch (encoded in the filename, e.g. <endOffset>-<epoch>.checkpoint). The controller writes it to a temporary file, fsyncs, then atomically renames it into place (“freeze”). “Snapshot generation failed” means this write/freeze did not complete — almost always because the underlying filesystem rejected the write (no space, no permission, or an I/O error).
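Because the offset and epoch are encoded in the file name, they can be read back without opening the snapshot. The following is a minimal sketch, assuming the <endOffset>-<epoch>.checkpoint layout described above; the function name parse_snapshot_name is made up for this example and is not part of Kafka.

```shell
# Split a snapshot file name of the form <endOffset>-<epoch>.checkpoint
# into its numeric parts and print "<endOffset> <epoch>".
# parse_snapshot_name is a hypothetical helper, not a Kafka tool.
parse_snapshot_name() (
  name=$(basename "$1")
  # Strip the zero padding from both numbers; sed prints nothing on a mismatch.
  parsed=$(printf '%s\n' "$name" |
    sed -n 's/^0*\([0-9][0-9]*\)-0*\([0-9][0-9]*\)\.checkpoint.*$/\1 \2/p')
  if [ -z "$parsed" ]; then
    echo "not a snapshot file: $name" >&2
    exit 1
  fi
  echo "$parsed"
)

parse_snapshot_name 00000000000009850000-0000000487.checkpoint
# prints: 9850000 487
```

Applied to the file name in the error above, this shows the failed snapshot was for end offset 9850000 in epoch 487.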

The immediate consequence is not data loss: the previous valid snapshot is retained and the metadata log simply keeps growing because it cannot be truncated past an offset that has not been snapshotted. Over time this is dangerous: the metadata log grows unbounded, disk pressure worsens (often the very cause), and new/lagging followers cannot use a recent snapshot to catch up quickly.

Common Causes

  • Disk full on the controller’s metadata/log directory — the most common cause, and frequently self-reinforcing because failed snapshots prevent log truncation.
  • Filesystem permissions: the Kafka process user cannot create or write the in-progress snapshot file (.checkpoint.part), typically because of wrong ownership after a manual restore or volume remount.
  • Read-only filesystem — a disk error or failed mount remounted the volume read-only.
  • I/O errors / failing disk — bad sectors or a degraded volume causing write failures.
  • Inode exhaustion — space appears free but no inodes remain to create the temp file.
  • Undersized metadata volume: the log.dirs (or metadata.log.dir) volume was not sized for clusters with very large metadata (huge topic/partition counts).
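Most of these causes can be told apart by the exception text under the "Snapshot generation failed" line. Here is a rough triage sketch: the function name and category labels are invented for this example, and the "Read-only file system" and "Input/output error" strings are the standard Linux error texts rather than lines quoted from a Kafka log.

```shell
# Map one exception line from the controller log to a likely cause.
# classify_snapshot_failure and its labels are hypothetical, for illustration.
classify_snapshot_failure() {
  case "$1" in
    # The kernel reports the same error for a full disk and for exhausted inodes.
    *"No space left on device"*)                   echo "disk-full-or-no-inodes" ;;
    *AccessDeniedException*|*"Permission denied"*) echo "permissions" ;;
    *"Read-only file system"*)                     echo "read-only-filesystem" ;;
    *"Input/output error"*)                        echo "io-error" ;;
    *)                                             echo "unknown" ;;
  esac
}

classify_snapshot_failure 'java.io.IOException: No space left on device'
# prints: disk-full-or-no-inodes
```

A "disk-full-or-no-inodes" result still needs both df -h and df -i to tell the two cases apart.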

How to Reproduce the Error

Fill or write-protect the controller’s metadata directory and force snapshot pressure:

# Lab only: make the metadata dir unwritable to the kafka user
sudo chown -R root:root /var/lib/kafka/__cluster_metadata-0
sudo chmod -R 500 /var/lib/kafka/__cluster_metadata-0

# Watch the controller log while metadata churn (for example, creating topics) triggers a snapshot attempt
sudo journalctl -u kafka -f | grep -i 'snapshot'

Alternatively, fill the volume (e.g. with a large temp file) so the snapshot write hits “No space left on device.” The controller logs “Snapshot generation failed” and warns the metadata log will keep growing.

Diagnostic Commands

All read-only.

# Disk space and inodes on the metadata/log directory
df -h /var/lib/kafka
df -i /var/lib/kafka

# Ownership / permissions of the metadata dir and existing snapshots
ls -ld /var/lib/kafka/__cluster_metadata-0
ls -la /var/lib/kafka/__cluster_metadata-0/*.checkpoint 2>/dev/null

# Is the filesystem mounted read-only?
findmnt -n -o TARGET,SOURCE,OPTIONS -T /var/lib/kafka   # OPTIONS starting with "ro" means read-only

# Snapshot errors and growth warnings in the controller log
grep -iE 'snapshot generation failed|error while generating snapshot|No space|AccessDenied|Read-only' \
  /var/log/kafka/controller.log | tail -40

# How big has the metadata log grown (symptom of failed truncation)?
du -sh /var/lib/kafka/__cluster_metadata-0

# Disk health / I/O errors
dmesg | grep -iE 'i/o error|ext4-fs error|xfs|remount' | tail -20

# Quorum state: leader, high watermark, and follower lag (read-only; c1:9093 is a placeholder controller address)
kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status

If df shows the volume full or df -i shows zero free inodes, that is your cause. If space is fine, check ownership and read-only mounts.
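The two df checks can be folded into one pass that flags either kind of exhaustion. This is a sketch under a few assumptions: it relies on the POSIX df -P column layout (use% in the fifth column), it assumes your df accepts -i together with -P as GNU df does, and the 80 limit is only an illustrative threshold. The helper names used_pct and check_usage are invented here.

```shell
# Read `df -P` (or `df -Pi`) output on stdin and print the use% as a bare number.
used_pct() {
  awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Print a one-line verdict.  usage: check_usage <label> <used_pct> <limit_pct>
check_usage() {
  case "$2" in
    # Empty or non-numeric input (missing path, filesystem without inode counts).
    ''|*[!0-9]*) echo "$1: usage unknown"; return 0 ;;
  esac
  if [ "$2" -ge "$3" ]; then
    echo "$1: $2% used, at or above the $3% limit"
  else
    echo "$1: $2% used, ok"
  fi
}

dir=/var/lib/kafka   # path used in this guide; adjust to your layout
check_usage bytes  "$(df -P  "$dir" 2>/dev/null | used_pct)" 80
check_usage inodes "$(df -Pi "$dir" 2>/dev/null | used_pct)" 80
```

Both commands only read filesystem statistics, so this stays within the read-only diagnostics above.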

Step-by-Step Resolution

  1. Free disk space on the metadata volume if it is full. Reclaim space from other directories on the same filesystem; do not hand-delete files inside __cluster_metadata-0. Once space is available, the controller retries the snapshot and then truncates the grown log.
  2. Fix ownership and permissions so the Kafka service user owns the metadata directory and can write temp files: restore the correct user/group and mode used by your packaging (typically the kafka user).
  3. Remount the filesystem read-write if a disk error forced read-only, after checking the disk (dmesg, SMART). Replace failing disks.
  4. Resolve inode exhaustion by clearing small-file clutter elsewhere on the volume or growing/reformatting with more inodes.
  5. Confirm a snapshot succeeds by watching controller.log for a successful emit and checking that du -sh __cluster_metadata-0 stops growing and a newer .checkpoint appears.
  6. Right-size storage for the metadata volume so future snapshots always have headroom.
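For step 5, one way to confirm that a newer .checkpoint appeared is to note the newest snapshot name before the fix and compare it afterwards. A minimal sketch, assuming completed snapshots end in .checkpoint and use the zero-padded names shown earlier; newest_checkpoint is a made-up helper name.

```shell
# Print the file name of the newest completed snapshot in a metadata directory.
# Returns non-zero when the directory holds no *.checkpoint file.
# newest_checkpoint is a hypothetical helper, not a Kafka tool.
newest_checkpoint() (
  newest=
  for f in "$1"/*.checkpoint; do
    [ -e "$f" ] || continue      # the glob matched nothing
    # Zero-padded names expand in ascending order, so the last match
    # is the snapshot with the highest end offset.
    newest=$(basename "$f")
  done
  [ -n "$newest" ] && echo "$newest"
)
```

Run newest_checkpoint /var/lib/kafka/__cluster_metadata-0 before and after the fix; a different, higher-numbered name means the controller has written a fresh snapshot. It only lists file names, so it is safe to run against a live node.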

Prevention and Best Practices

  • Put __cluster_metadata on a volume with generous free headroom and alert at 75–80% usage; failed snapshots are usually a downstream symptom of disk pressure.
  • Monitor inodes, not just bytes, on the metadata filesystem.
  • Lock down ownership of log.dir/metadata.log.dir in configuration management so restores and remounts cannot strip write access from the Kafka user.
  • Alert specifically on controller.log lines matching “Snapshot generation failed” — they signal the metadata log will grow until fixed.
  • Watch disk SMART/dmesg for I/O errors so a failing disk is replaced before it forces read-only.
  • Track __cluster_metadata-0 directory size as a metric; steady growth without truncation indicates snapshots are failing.

Related Errors

  • Metadata log corruption detected: a damaged segment, sometimes following a disk-full or I/O-error event.
  • Unable to fetch metadata log: followers that cannot catch up because no recent snapshot exists to bootstrap from.
  • Failed to append metadata record: the same disk/permission problems can block appends, not just snapshots. See the Kafka guides.

Frequently Asked Questions

Do I lose metadata when a snapshot fails? No. The prior valid snapshot and the full metadata log are intact. The risk is the log growing unbounded until snapshots succeed again.

Why does the disk keep filling after a failed snapshot? Because the metadata log cannot be truncated past an un-snapshotted offset, it keeps growing — worsening the disk pressure that likely caused the failure. It is a vicious cycle; free space promptly.

Can I delete old .checkpoint files to free space? No. Deleting snapshots by hand can corrupt the node’s metadata state. Reclaim space elsewhere on the volume instead.

Is one controller failing snapshots a cluster outage? Not immediately — other controllers still snapshot. But that node’s log grows and it becomes a weak quorum member, so fix it promptly.

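How often does KRaft snapshot? It emits a snapshot when a configured threshold is crossed: the bytes of metadata records written since the last snapshot (metadata.log.max.record.bytes.between.snapshots) or the time elapsed since it (metadata.log.max.snapshot.interval.ms). Heavy metadata change rates trigger snapshots more often, making disk headroom more important.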
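To see whether a node overrides the snapshot thresholds, you can read them from its properties file. This sketch assumes the thresholds are the broker settings metadata.log.max.record.bytes.between.snapshots and metadata.log.max.snapshot.interval.ms, which is my understanding for recent Kafka releases and worth checking against the configuration reference for your version. The helper name prop_value is invented for this example.

```shell
# Print the value of one key from a Java-style properties file, or a note
# when the key is absent.  usage: prop_value <file> <key>
# prop_value is a hypothetical helper, not a Kafka tool.
prop_value() (
  value=$(awk -F= -v key="$2" '
    { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }                # trimmed key
    k == key { sub(/^[^=]*=[ \t]*/, ""); found = $0 }         # last match wins
    END { if (found != "") print found }
  ' "$1")
  echo "${value:-(not set, broker default applies)}"
)
```

For example, prop_value /etc/kafka/server.properties metadata.log.max.snapshot.interval.ms prints the configured interval, where the file path is a placeholder for wherever your packaging keeps the controller configuration.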
