AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: '[KafkaServer id=1] shutting down' Graceful vs Crash

Read Kafka '[KafkaServer id=1] shutting down' and 'started' lifecycle lines: tell a graceful controlled shutdown from an abnormal crash and trace the real trigger.

  • #kafka
  • #troubleshooting
  • #errors
  • #lifecycle

Exact Error Message

A clean broker lifecycle in server.log looks like this — a startup, a normal run, then an orderly shutdown:

[2026-06-29 08:00:11,204] INFO [KafkaServer id=1] started (kafka.server.KafkaServer)
...
[2026-06-29 14:32:07,551] INFO [KafkaServer id=1] Starting controlled shutdown (kafka.server.KafkaServer)
[2026-06-29 14:32:08,002] INFO [KafkaServer id=1] Controlled shutdown request returned successfully after 1 retries (kafka.server.KafkaServer)
[2026-06-29 14:32:08,210] INFO [KafkaServer id=1] shutting down (kafka.server.KafkaServer)
[2026-06-29 14:32:09,884] INFO [KafkaServer id=1] shut down completed (kafka.server.KafkaServer)

An abnormal shutdown looks different — shutting down appears with no preceding Starting controlled shutdown, often right after an error:

[2026-06-29 14:32:07,118] ERROR [ReplicaManager broker=1] Error processing append operation (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: I/O exception in append to log
[2026-06-29 14:32:07,140] INFO [KafkaServer id=1] shutting down (kafka.server.KafkaServer)

What the Error Means

[KafkaServer id=1] shutting down is not, by itself, an error — it is a lifecycle event. The diagnostic value is in the context. A graceful shutdown is preceded by Starting controlled shutdown and Controlled shutdown request returned successfully, followed by shut down completed. During controlled shutdown the broker asks the controller to move leadership of its partitions to other replicas first, so clients see minimal disruption.

An abnormal shutdown skips the controlled-shutdown handshake. You see shutting down with no controlled-shutdown lines, often right after a FATAL or ERROR entry, or the log simply stops (a hard kill or OOM kill leaves no shut down completed). Telling these apart is the first step in any “the broker restarted” investigation. Likewise, [KafkaServer id=1] started marks the moment the broker finished initialization and began serving; its absence after a restart means startup itself failed.
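
The graceful-versus-abnormal distinction can be checked mechanically. The following is a minimal sketch, not an official Kafka tool: classify_shutdown is a hypothetical helper name, the labels it prints are invented for illustration, and it assumes the log-line wording shown in the examples in this guide.

```shell
# classify_shutdown LOGFILE   (hypothetical helper, not part of Kafka)
# Prints a label for the most recent stop found in the log:
#   graceful    controlled shutdown started and "shut down completed" was logged
#   incomplete  controlled shutdown started, "shutting down" logged, no completion
#   abnormal    "shutting down" with no preceding "Starting controlled shutdown"
#   truncated   a new "started" follows a run that never logged "shutting down"
#   none        no stop found in the file
classify_shutdown() {
  awk '
    /\[KafkaServer id=[0-9]+\] started/ {
      # A second start without any shutdown line means the previous run just stopped.
      if (up && !shutting) last = "truncated"
      up = 1; controlled = 0; shutting = 0
    }
    /\[KafkaServer id=[0-9]+\] Starting controlled shutdown/ { controlled = 1 }
    /\[KafkaServer id=[0-9]+\] shutting down/ {
      shutting = 1
      last = controlled ? "incomplete" : "abnormal"
    }
    /\[KafkaServer id=[0-9]+\] shut down completed/ {
      if (controlled) last = "graceful"
      up = 0
    }
    END { print (last == "" ? "none" : last) }
  ' "$1"
}
```

A call such as classify_shutdown /var/log/kafka/server.log prints one label. A log that ends abruptly with no later restart still reports the previous stop, so treat the result as a starting point and read the surrounding lines yourself.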

Common Causes

  • Operator or orchestrator stop: A systemctl stop, deployment rollout, or autoscaler termination sends SIGTERM, which Kafka handles as a graceful controlled shutdown.
  • JVM OutOfMemoryError: Heap exhaustion can trigger shutdown (commonly via an ExitOnOutOfMemoryError-style handler), usually with an OOM line just before.
  • Fatal runtime errors: A KafkaStorageException (disk failure), unrecoverable log corruption, or controller-side fencing can drive the broker to shut down.
  • SIGKILL / OOM killer: The Linux OOM killer or a kill -9 terminates the JVM instantly, leaving the log truncated with no completion line.
  • Failed controlled shutdown: If the broker cannot reach the controller to hand off leadership, controlled shutdown retries and, once the retries are exhausted, the broker may stop anyway without migrating leadership.

How to Reproduce the Error

A graceful shutdown is trivial to reproduce on a test broker:

# This produces the full controlled-shutdown sequence in server.log:
#   Starting controlled shutdown -> returned successfully -> shutting down -> shut down completed
sudo systemctl stop kafka

To see an abnormal pattern, send SIGKILL to the JVM (kill -9 <pid>); the log stops without shut down completed, and the next start logs recovery of unflushed segments. To see an error-driven shutdown, fill the log dir’s disk so an append triggers KafkaStorageException.

Diagnostic Commands

Look at the lifecycle lines together to classify the shutdown:

grep -nE "started|Starting controlled shutdown|Controlled shutdown request returned|shutting down|shut down completed" /var/log/kafka/server.log | tail -30

Check whether any error preceded the shutdown:

grep -nE "FATAL|ERROR|OutOfMemory|KafkaStorageException" /var/log/kafka/server.log | tail -40
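
The two greps above can be combined so that only errors logged shortly before a shutting down line are shown. This is a rough sketch under stated assumptions: errors_before_shutdown is a hypothetical helper name, the default window of 20 lines is arbitrary, and it relies on a grep that supports -B (GNU and BSD grep both do).

```shell
# errors_before_shutdown LOGFILE [CONTEXT_LINES]   (hypothetical helper)
# Prints error-looking lines found within CONTEXT_LINES (default 20)
# before each "[KafkaServer id=N] shutting down" line.
errors_before_shutdown() {
  grep -B "${2:-20}" -E '\[KafkaServer id=[0-9]+\] shutting down' "$1" \
    | grep -E 'FATAL|ERROR|OutOfMemory|KafkaStorageException'
}
```

Empty output from errors_before_shutdown /var/log/kafka/server.log is consistent with a graceful stop; any output deserves a closer look at the full log around those lines.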

Use the service journal to see whether the OS or service manager initiated the stop (and to catch OOM-killer events):

journalctl -u kafka --no-pager -n 100 | grep -iE "stopp|signal|kill|terminat|oom|main process exited"

Check the kernel log for the OOM killer specifically:

journalctl -k --no-pager | grep -iE "killed process|out of memory" | tail -20

Confirm whether the broker came back and is serving:

kafka-broker-api-versions.sh --bootstrap-server localhost:9092 | head -5

In KRaft clusters, confirm the quorum and which controller is active after a restart:

kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

Step-by-Step Resolution

  1. Classify the shutdown. If you see Starting controlled shutdown -> returned successfully -> shut down completed, it was graceful, which almost always means an intentional stop or rolling restart. Move to “who stopped it” rather than “what broke.”
  2. If graceful, find the initiator. The journalctl -u kafka output shows whether systemd stopped it (deploy, manual stop) or the process exited on SIGTERM from an orchestrator.
  3. If abnormal, find the trigger. Look at the lines immediately before shutting down. An OOM, KafkaStorageException, or FATAL names the cause; act on that specific error.
  4. If the log just stops with no completion line, suspect SIGKILL or the OOM killer. Confirm with the kernel-log command; if OOM, review heap sizing and host memory pressure.
  5. For failed controlled shutdown, check controller/quorum reachability. The broker could not hand off leadership, so the stop either stalled on retries or finished without migrating leadership.
  6. Verify recovery. After restart, confirm [KafkaServer id=1] started, that the broker answers kafka-broker-api-versions.sh, and that under-replicated partitions have recovered.
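
The last step, waiting for under-replicated partitions to recover, can be scripted. This is a sketch only: count_urp and wait_for_zero_urp are hypothetical helper names, the 10-second pause and 30 attempts are arbitrary, and it assumes kafka-topics.sh is on the PATH and prints one line containing "Partition: " per under-replicated partition.

```shell
# count_urp: reads `kafka-topics.sh --describe` output on stdin and prints
# how many partition lines it contains (0 when there are none).
count_urp() {
  grep -c 'Partition: ' || true   # grep exits non-zero on zero matches
}

# wait_for_zero_urp [BOOTSTRAP] [ATTEMPTS]
# Returns 0 once no under-replicated partitions are reported, 1 on timeout.
wait_for_zero_urp() {
  bootstrap="${1:-localhost:9092}"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    # Only trust the count when the CLI itself succeeded (broker reachable).
    if out=$(kafka-topics.sh --bootstrap-server "$bootstrap" \
               --describe --under-replicated-partitions); then
      [ "$(printf '%s\n' "$out" | count_urp)" -eq 0 ] && return 0
    fi
    tries=$((tries - 1))
    sleep 10
  done
  return 1
}
```

Calling wait_for_zero_urp localhost:9092 after the restart, and acting on its exit status, keeps a rolling restart from moving on while replication is still catching up.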

Prevention and Best Practices

  • Always stop brokers via the service manager (which sends SIGTERM) so controlled shutdown runs and leadership migrates cleanly; never default to kill -9.
  • Ensure controlled.shutdown.enable=true (the default) and give brokers enough shutdown timeout in your orchestrator for leadership handoff to finish.
  • Size the JVM heap conservatively and leave memory for the page cache; monitor for OOM so you fix it before it forces shutdowns.
  • During rolling restarts, wait for under-replicated partitions to return to zero between brokers so a graceful stop never coincides with a degraded cluster.
  • Alert on unexpected shutting down lines that lack a preceding controlled-shutdown sequence — those are the ones worth paging on.
  • For ambiguous restart incidents, the free incident assistant can correlate the lifecycle and error lines into a likely cause.
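
The controlled.shutdown.enable recommendation above can be verified against a broker's configuration file. The sketch below is illustrative: controlled_shutdown_enabled is a hypothetical helper name, the properties path in the usage comment is an assumption about your install, and the parser only handles simple key=value lines.

```shell
# controlled_shutdown_enabled PROPERTIES_FILE   (hypothetical helper)
# Prints the effective controlled.shutdown.enable value. If the key is not
# set, prints "true", which is the default. The last occurrence wins, and
# commented-out lines are ignored.
controlled_shutdown_enabled() {
  value=$(sed -n 's/^[[:space:]]*controlled\.shutdown\.enable[[:space:]]*=[[:space:]]*//p' "$1" \
            | tail -n 1 | tr -d '[:space:]')
  printf '%s\n' "${value:-true}"
}

# Example usage (path is an assumption; adjust to your layout):
#   controlled_shutdown_enabled /opt/kafka/config/server.properties
# To see how long systemd waits before escalating a stop to SIGKILL:
#   systemctl show kafka -p TimeoutStopUSec
```

If the stop timeout is shorter than a leadership handoff usually takes on your cluster, the service manager can turn an otherwise graceful stop into a hard kill.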

Related Errors

  • Fatal error during KafkaServer startup: when the next start after a shutdown fails to come back.
  • KafkaStorageException: a common disk-level trigger for abnormal shutdown.
  • java.lang.OutOfMemoryError: heap exhaustion that can force the broker down.
  • Controlled shutdown failed: leadership handoff could not complete before stop.

Frequently Asked Questions

Is “shutting down” an error I need to fix? Not on its own. It is a normal lifecycle line. What matters is whether it was preceded by the controlled-shutdown handshake (graceful) or by an error/nothing (abnormal). Only abnormal shutdowns need root-causing.

How do I know the shutdown was clean? Look for the full sequence: Starting controlled shutdown, Controlled shutdown request returned successfully, shutting down, and finally shut down completed. All four present and in order means a clean, intentional stop.

The log ends with no “shut down completed” — what happened? That signature means the JVM died without running its shutdown hooks, typically a SIGKILL or the Linux OOM killer. Check the kernel log; the broker had no chance to migrate leadership, so expect a recovery pass on next start.

What does “[KafkaServer id=1] started” tell me? It marks the instant the broker finished initialization and began serving requests. If a restart never logs started, startup itself failed — investigate that as a startup error, not a shutdown.

Why did a graceful stop still cause client errors? If the cluster was already under-replicated, or controlled shutdown could not reach the controller to hand off leadership, clients can briefly hit NotLeaderOrFollowerException (named NotLeaderForPartitionException before Kafka 2.6). Pacing rolling restarts and waiting for full replication between brokers prevents this.
