DevOps AI ToolKit
AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Failed to append metadata record' Raft Append Failure

Fix KRaft 'Failed to append metadata record' to __cluster_metadata: diagnose lost leadership, no quorum, disk-full, and timeout failures on the Raft write path.

  • #kafka
  • #troubleshooting
  • #errors
  • #kraft

Exact Error Message

When the active controller cannot durably append a record to the __cluster_metadata Raft log, the write fails and the operation that triggered it (topic create, config change, partition reassignment, broker registration) errors out:

[2026-06-29 15:07:22,884] ERROR [QuorumController id=1] Failed to append metadata record to __cluster_metadata-0 at offset 9,902,118 (org.apache.kafka.controller.QuorumController)
org.apache.kafka.common.errors.NotLeaderException: Append failed because the replica is no longer the leader of __cluster_metadata-0 (current leaderEpoch 491)
    at org.apache.kafka.raft.KafkaRaftClient.scheduleAtomicAppend(KafkaRaftClient.java:2284)
[2026-06-29 15:07:22,890] WARN [QuorumController id=1] Unable to commit batch; reverting in-memory metadata delta (org.apache.kafka.controller.QuorumController)

A disk-full variant on the leader:

java.io.IOException: No space left on device
    at org.apache.kafka.raft.internals.BatchAccumulator.append(BatchAccumulator.java:201)
[2026-06-29 15:07:23,002] ERROR [RaftManager id=1] Failed to append metadata record; log write rejected (org.apache.kafka.raft.KafkaRaftClient)

What the Error Means

Every metadata mutation in KRaft becomes a record that the active controller appends to its __cluster_metadata Raft log, replicates to the other voters, and commits once a majority of voters acknowledge it. “Failed to append metadata record” means that write path broke at some point. There are three ways this happens: the node lost leadership mid-append (it is no longer allowed to write), the local log write failed (disk full, I/O error), or the record could not be committed because acknowledgments from a quorum never arrived.

The controller reverts the in-memory metadata delta so it does not apply an uncommitted change, and the triggering admin operation fails or retries. No partial metadata is applied: each batch is all-or-nothing. Persistent append failures freeze all metadata changes, even though existing produce/consume traffic can keep flowing on cached metadata.

Common Causes

  • Lost leadership mid-append (NotLeaderException / NOT_LEADER_OR_FOLLOWER) — an election happened and this node is no longer the leader, so it may not append. Usually transient.
  • No quorum to commit — a majority of voters is down or unreachable, so appended records can never be acknowledged and committed.
  • Disk full or I/O error on the leader’s metadata volume, rejecting the local log write.
  • Filesystem permissions / read-only mount preventing the leader from writing the active segment.
  • Slow followers so far behind that acknowledgments time out before commit.
  • Network latency/loss on the controller listener delaying or dropping the replication acks needed to commit.
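The "no quorum to commit" cause comes down to simple arithmetic: a record commits only when a majority of voters (floor(n/2) + 1, counting the leader) has it. The following is a minimal sketch of that check; the function names `majority` and `can_commit` are illustrative helpers, not part of any Kafka tool.

```shell
#!/bin/sh
# Majority needed to commit for a given voter count: floor(n/2) + 1.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) when enough voters are reachable to commit an append.
# The leader itself counts as one reachable voter.
can_commit() {
  voters="$1"
  reachable="$2"
  [ "$reachable" -ge "$(majority "$voters")" ]
}

# Walk a 3-voter quorum through losing zero, one, and two controllers.
for reachable in 3 2 1; do
  if can_commit 3 "$reachable"; then
    echo "3 voters, $reachable reachable: can commit"
  else
    echo "3 voters, $reachable reachable: cannot commit"
  fi
done
```

A 3-voter quorum tolerates one failed controller and a 5-voter quorum tolerates two. An even voter count raises the majority without raising the number of failures tolerated, which is why odd sizes are preferred.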

How to Reproduce the Error

Trigger leadership loss during metadata writes, or exhaust disk on the leader:

# Variant A: kill leadership during churn
kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status   # note leader
# In one shell, create/delete many topics; in another, stop the leader (lab only)
sudo systemctl stop kafka   # on the current leader
sudo journalctl -u kafka -f | grep -iE 'append|NotLeader|commit'

# Variant B: fill the leader's metadata volume (lab only), then attempt a topic create
#   -> append is rejected with "No space left on device"

Admin operations issued during the gap fail with “Failed to append metadata record,” and the controller log shows either a NotLeaderException or an I/O error depending on the variant.

Diagnostic Commands

All of the commands below are read-only.

# Who is leader now, and is the high watermark advancing (can it commit)?
kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status

# Follower acks: are enough voters in-sync to commit a quorum?
kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --replication

# Append failures and their reasons in the controller log
grep -iE 'failed to append|NotLeader|log write rejected|No space|unable to commit' \
  /var/log/kafka/controller.log | tail -50

# Disk space / inodes / mount state on the leader's metadata volume
df -h /var/lib/kafka
df -i /var/lib/kafka
mount | grep -i kafka

# Tail of the metadata log to confirm where appends stopped (read-only decode)
kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/lib/kafka/__cluster_metadata-0/*.log | tail -30

# Disk health
dmesg | grep -iE 'i/o error|read-only|ext4-fs error' | tail -20

If describe --status shows a stable leader and an advancing HighWatermark, an isolated append failure was transient (leadership blip) and retried successfully. A stalled HighWatermark means commits cannot complete — look at quorum or disk.
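The HighWatermark comparison described above can be scripted by sampling `describe --status` twice. This sketch assumes the status output contains a line of the form `HighWatermark: <number>`; the helper names, the `c1:9093` address, and the 10-second interval are placeholders to adapt.

```shell
#!/bin/sh
# Extract the HighWatermark value from `describe --status` output on stdin.
parse_high_watermark() {
  awk -F':' '$1 == "HighWatermark" { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# Compare two samples and print "advancing", "stalled", or "unknown".
hwm_verdict() {
  first="$1"
  second="$2"
  if [ -z "$first" ] || [ -z "$second" ]; then
    echo "unknown"
  elif [ "$second" -gt "$first" ]; then
    echo "advancing"
  else
    echo "stalled"
  fi
}

# Usage against a live cluster (address and interval are placeholders):
#   a=$(kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status | parse_high_watermark)
#   sleep 10
#   b=$(kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status | parse_high_watermark)
#   hwm_verdict "$a" "$b"
```

Treat "stalled" as a prompt to check quorum and disk rather than as proof of a fault. A flat value is only meaningful if metadata records were being written between the two samples.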

Step-by-Step Resolution

  1. Determine the failure flavor from the controller log: NotLeaderException (leadership) vs No space/AccessDenied/Read-only (disk) vs commit timeout (quorum/network).
  2. For leadership loss: confirm a single stable leader via describe --status. The admin operation simply needs to be retried against the new leader; KRaft and most admin clients retry automatically. No further action if a leader is stable.
  3. For disk-full or I/O errors: free space on the leader’s metadata volume (do not delete files inside __cluster_metadata-0), fix permissions or remount read-write, or replace a failing disk. Appends resume once the volume accepts writes.
  4. For no-quorum/commit timeouts: restore a majority of voters and the controller-listener network so appended records can be acknowledged and committed. The HighWatermark will start advancing again.
  5. Address slow followers (see the catch-up guide) if acks are timing out because a voter cannot keep pace.
  6. Verify recovery by performing a benign metadata operation (e.g. describe/create in a test namespace per your process) and confirming HighWatermark advances with no new append errors.
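Step 1 of the procedure above (sorting a failure into leadership, disk, or quorum/network) can be roughed out as a log filter. This is a sketch only: `classify_append_failure` is a hypothetical helper, the leadership and disk patterns mirror the strings quoted in this guide, and the timeout patterns are an assumption about how commit timeouts appear in your logs.

```shell
#!/bin/sh
# Classify controller-log text on stdin into one failure flavor.
# Disk faults are checked first because they need manual intervention.
classify_append_failure() {
  text=$(cat)
  if printf '%s\n' "$text" | grep -iE 'No space left|Read-only file system|Permission denied|AccessDenied|I/O error' >/dev/null; then
    echo "disk"
  elif printf '%s\n' "$text" | grep -iE 'NotLeaderException|NOT_LEADER_OR_FOLLOWER|no longer the leader' >/dev/null; then
    echo "leadership"
  elif printf '%s\n' "$text" | grep -iE 'timed out|timeout' >/dev/null; then
    # Assumed wording; adjust to the timeout messages your version emits.
    echo "quorum-or-network"
  else
    echo "unknown"
  fi
}

# Usage (log path is the example used in this guide):
#   tail -200 /var/log/kafka/controller.log | classify_append_failure
```

The result only points to which of steps 2 through 5 to start with. Confirm it against describe --status and the disk checks before acting.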

Prevention and Best Practices

  • Keep an odd number of healthy voters (typically 3 or 5) so the quorum can lose a minority of nodes and still have a majority available to commit appends.
  • Provision the metadata volume with ample free space and alert at 75–80%; disk-full is a top cause of append failures.
  • Lock down ownership/permissions of log.dir/metadata.log.dir so writes are never blocked after restores or remounts.
  • Keep followers in-sync (fast disks, low-latency network) so commit acks are never the bottleneck.
  • Expect occasional transient NotLeaderException around failovers and rely on client retries rather than alerting on single events.
  • Monitor HighWatermark progression; a flat high watermark under load means commits are stuck.
  • For triage, the free incident assistant can classify an append failure from the controller log quickly.

Related Errors

  • Metadata quorum unavailable: no majority to commit appends at all.
  • Snapshot generation failed: the same disk/permission faults that block appends often block snapshots.
  • Leader epoch mismatch: the fencing that turns a stale leader’s append into a NotLeaderException. See the Kafka guides.
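Returning to the disk-space alert recommended under Prevention and Best Practices, a threshold check on the metadata volume could look roughly like the sketch below. It assumes POSIX `df -P` output, where the fifth column of the second line is the use percentage; `disk_used_pct` and `disk_alert` are illustrative names, and in practice this job usually belongs in your monitoring system.

```shell
#!/bin/sh
# Read `df -P <path>` output on stdin and print the use percentage as a bare number.
disk_used_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Print OK or ALERT for a usage percentage against a threshold (default 75).
disk_alert() {
  used="$1"
  threshold="${2:-75}"
  if [ "$used" -ge "$threshold" ]; then
    echo "ALERT: metadata volume at ${used}% (threshold ${threshold}%)"
  else
    echo "OK: metadata volume at ${used}%"
  fi
}

# Usage (path is the example used in this guide):
#   disk_alert "$(df -P /var/lib/kafka | disk_used_pct)" 75
```

The same pattern applies to inode usage by feeding it `df -P -i` output where your platform supports that flag.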

Frequently Asked Questions

Did my topic get half-created when the append failed?

No. Metadata appends are atomic per batch and reverted on failure, so you never get a partially applied change. The operation either committed or did not.

Is a single NotLeaderException something to worry about?

Usually not. It typically means an election just occurred; the client retries against the new leader and succeeds. Worry only if it persists.

Why can’t the leader commit even though it appended locally?

Commit requires a majority of voters to acknowledge. If quorum is lost or followers lag, the local append never commits and is eventually reverted.

Could this lose committed metadata?

No. Only uncommitted appends are reverted. Anything already committed (acknowledged by a majority) is durable.

Will produce/consume traffic stop?

Existing traffic on cached metadata may continue, but no new metadata changes (topics, configs, reassignments, broker registration) succeed until appends work again.
