AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'ConnectionLoss for /brokers/ids' ZooKeeper Connection Loss

Fix Kafka ZooKeeper ConnectionLoss for /brokers/ids: diagnose a downed ensemble, lost quorum, port 2181 firewall blocks, bad zookeeper.connect, and GC pauses.

  • #kafka
  • #troubleshooting
  • #errors
  • #zookeeper

Exact Error Message

[2026-06-29 14:02:18,441] WARN Session 0x10000a2b3c40001 for server zk-01/10.0.6.11:2181,
unexpected error, closing socket connection and attempting reconnect
(org.apache.zookeeper.ClientCnxn)
java.io.IOException: Connection reset by peer

[2026-06-29 14:02:21,883] INFO Opening socket connection to server zk-02/10.0.6.12:2181
(org.apache.zookeeper.ClientCnxn)
[2026-06-29 14:02:24,884] WARN Client session timed out, have not heard from server in 6003ms
for sessionid 0x10000a2b3c40001 (org.apache.zookeeper.ClientCnxn)

[2026-06-29 14:02:30,512] ERROR Error while creating ephemeral at /brokers/ids/3
(kafka.zk.KafkaZkClient)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /brokers/ids
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
        at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:158)
        at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1947)
        at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:108)
        at kafka.server.KafkaServer.startup(KafkaServer.scala:312)

What the Error Means

This error applies only to legacy, ZooKeeper-based Kafka clusters (Kafka 2.x and earlier, or 3.x clusters not yet migrated). KRaft-mode clusters do not use ZooKeeper at all, so a ConnectionLoss for /brokers/ids exception cannot occur there — if you are on KRaft, your metadata lives in the controller quorum and this guide does not apply.

On a ZooKeeper-based cluster, every broker maintains a persistent client session with the ZooKeeper ensemble. Brokers register themselves as ephemeral znodes under /brokers/ids, and the controller is tracked under /controller. ConnectionLossException means the Kafka ZooKeeper client lost its TCP connection to the ensemble in the middle of a request and could not immediately reconnect to complete it.

Crucially, this is a transient connectivity failure, not a session-level failure. The client will keep retrying. You will often see it interleaved with related messages such as Unable to connect to ZooKeeper, Client session timed out, ZooKeeper client timeout, or ZooKeeper request failed. All of these describe the same underlying condition: the broker cannot reliably reach a ZooKeeper server long enough to finish an operation. When this happens during startup (as in registerBroker), the broker fails to come up; when it happens at runtime, the broker may temporarily drop out of the cluster until it reconnects.

Common Causes

  • The ZooKeeper ensemble is down, or enough nodes are down to lose quorum. A 3-node ensemble tolerates one failure; lose two and there is no quorum, so no node will serve requests.
  • A network partition or firewall blocks port 2181 between the broker and one or more ZooKeeper servers.
  • A wrong or stale zookeeper.connect string in server.properties points the broker at hosts/ports that are unreachable or no longer host ZooKeeper.
  • ZooKeeper is overloaded or stuck in a long JVM GC pause, so it stops responding within the client timeout window.
  • DNS resolution problems — a hostname in the connect string resolves to the wrong IP or fails intermittently.
  • zookeeper.connection.timeout.ms is set too low for the real network latency, so normal blips are treated as hard failures.
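
Several of the causes above (a stale connect string, DNS failures, a blocked port) can be checked per host once the zookeeper.connect value is split into its parts. The following is a minimal sketch, not part of Kafka: the helper names zk_hosts and zk_chroot are hypothetical, it assumes plain hostnames or IPv4 addresses (no IPv6 literals), and it assumes 2181 for any entry that omits the port.

```shell
#!/bin/sh
# Hypothetical helpers: split a zookeeper.connect value into its parts.

# Print one "host port" pair per line, ignoring any chroot suffix.
zk_hosts() {
  connect=${1%%/*}   # drop everything from the first "/" (the chroot)
  printf '%s\n' "$connect" | tr ',' '\n' | while IFS=: read -r host port; do
    if [ -n "$host" ]; then
      # Assumption of this sketch: fall back to 2181 when no port is given.
      printf '%s %s\n' "$host" "${port:-2181}"
    fi
  done
}

# Print the chroot path, or "/" when the connect string has none.
zk_chroot() {
  case $1 in
    */*) printf '/%s\n' "${1#*/}" ;;
    *)   printf '/\n' ;;
  esac
}

# Example use on the broker host (requires getent and nc):
# zk_hosts "zk-01:2181,zk-02:2181,zk-03:2181/kafka" | while read -r host port; do
#   getent hosts "$host" >/dev/null || echo "DNS FAIL $host"
#   nc -z -w 3 "$host" "$port"      || echo "TCP FAIL $host:$port"
# done
```

Running the commented loop from the broker host separates a name-resolution problem from a blocked port, and zk_chroot makes a missing or unexpected chroot easy to spot.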

How to Reproduce the Error

In a lab, you can deliberately trigger ConnectionLoss to confirm your diagnostics:

  1. Stand up a single Kafka broker pointed at a 3-node ZooKeeper ensemble via zookeeper.connect=zk-01:2181,zk-02:2181,zk-03:2181.
  2. Stop two of the three ZooKeeper nodes so the ensemble loses quorum: sudo systemctl stop zookeeper on two of the hosts.
  3. Restart the Kafka broker (or wait for its next ZooKeeper request). The broker can no longer complete the ephemeral registration under /brokers/ids and logs ConnectionLossException.

Alternatively, leave the ensemble healthy but block the port from the broker to simulate a firewall, which produces the same Client session timed out / ConnectionLoss sequence. Restoring quorum or the network path makes the error disappear, confirming the cause.

Diagnostic Commands

All commands below are read-only — they observe state without changing ZooKeeper or Kafka.

# Is each ZooKeeper node alive and serving? Expect "imok".
echo ruok | nc zk-01 2181
echo ruok | nc zk-02 2181
echo ruok | nc zk-03 2181

# Server stats: mode (leader/follower/standalone), connections, latency.
echo stat | nc zk-01 2181

# Inspect registered brokers and the active controller (read-only ls/get).
zookeeper-shell.sh zk-01:2181 ls /brokers/ids
zookeeper-shell.sh zk-01:2181 get /controller

# Is the broker serving the Kafka API once it is up?
kafka-broker-api-versions.sh --bootstrap-server localhost:9092

# Show recent ZooKeeper and Kafka log lines and look for the connection loss pattern.
journalctl -u zookeeper -n 100 --no-pager
journalctl -u kafka -n 200 --no-pager | grep -i zookeeper

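If echo ruok | nc returns nothing (instead of imok), that node is down or unreachable. On ZooKeeper 3.5.3 and later, four-letter commands other than srvr must be enabled through 4lw.commands.whitelist, so a reply saying the command "is not in the whitelist" means the command is disabled, not that the node is unhealthy. If echo stat | nc shows no leader across the ensemble, you have lost quorum. An empty or stale ls /brokers/ids confirms brokers cannot register.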
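
To apply the "no leader means no quorum" rule across the whole ensemble at once, the Mode: lines from each node can be counted. This is a rough sketch under stated assumptions: zk_quorum_state is a hypothetical helper, it reads concatenated stat (or srvr) output on stdin, and it takes the configured ensemble size as its only argument. A node that is up but not serving prints no Mode: line, so it is counted as not serving.

```shell
#!/bin/sh
# Hypothetical helper: classify ensemble health from concatenated stat/srvr output.
# $1 = configured ensemble size (for example 3).
zk_quorum_state() {
  awk -v size="$1" '
    /^Mode: leader/   { leaders++ }
    /^Mode: follower/ { followers++ }
    END {
      serving = leaders + followers      # nodes that reported a serving mode
      need = int(size / 2) + 1           # strict majority of the ensemble
      if (leaders > 1)                            print "multiple-leaders"
      else if (leaders == 1 && serving >= need)   print "quorum-ok"
      else                                        print "no-quorum"
    }'
}

# Example use (requires nc; hostnames are the ones used in this guide):
# for h in zk-01 zk-02 zk-03; do
#   echo stat | nc -w 3 "$h" 2181
# done | zk_quorum_state 3
```

A result of no-quorum with all nodes reachable points at the ensemble itself rather than the network path from the broker.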

Step-by-Step Resolution

The corrective steps below change service state or configuration — apply them deliberately as the fix.

  1. Restore ZooKeeper quorum first. Identify down nodes with the ruok/stat checks above, then start them:

    sudo systemctl start zookeeper
    sudo systemctl status zookeeper --no-pager | head -5

    Confirm a leader is elected (echo stat | nc <host> 2181 shows Mode: leader on exactly one node).

  2. Fix the network path on port 2181. If a firewall or security group is blocking the broker, allow the broker subnet to reach 2181 (and 2888/3888 between ZK peers). Verify with nc -vz zk-01 2181 from the broker host.

  3. Correct the zookeeper.connect string. In server.properties, ensure it lists the real, resolvable ensemble members, for example:

    zookeeper.connect=zk-01:2181,zk-02:2181,zk-03:2181/kafka

    Watch for a missing/mismatched chroot (/kafka) — that silently sends the broker to the wrong path. Restart the broker after fixing it.

  4. Tune the connection timeout if blips are normal in your environment:

    zookeeper.connection.timeout.ms=18000

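    Raise it if you have measured latency or GC pauses approaching the current value. When unset, it falls back to zookeeper.session.timeout.ms, whose default is 18000 ms since Kafka 2.5 (6000 ms before that), so on newer brokers 18000 is already the effective value.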

  5. Ensure an odd number of ZooKeeper nodes (3 or 5) so the ensemble can form quorum and tolerate failures. An even count buys no extra fault tolerance and complicates elections.

  6. Restart the broker and confirm registration:

    sudo systemctl restart kafka
    zookeeper-shell.sh zk-01:2181 ls /brokers/ids

    The broker’s ID should reappear under /brokers/ids.
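
The final confirmation step can be scripted so a restart is only declared successful once the broker ID is actually listed. The sketch below is an illustration, not a Kafka tool: broker_registered is a hypothetical helper, and it assumes zookeeper-shell.sh prints the children of /brokers/ids as a bracketed list such as [1, 2, 3] among its other connection messages.

```shell
#!/bin/sh
# Hypothetical helper: succeed if broker ID $1 appears in the output of
# "zookeeper-shell.sh <host:port> ls /brokers/ids" supplied on stdin.
broker_registered() {
  id=$1
  # Keep the last "[...]" line, strip brackets and spaces,
  # put one ID per line, then require an exact match.
  grep '^\[.*\]$' | tail -n 1 | sed 's/[][ ]//g' | tr ',' '\n' | grep -qx "$id"
}

# Example use: poll for up to 30 seconds after restarting broker 3.
# for attempt in 1 2 3 4 5 6; do
#   if zookeeper-shell.sh zk-01:2181 ls /brokers/ids 2>/dev/null | broker_registered 3; then
#     echo "broker 3 registered"; break
#   fi
#   sleep 5
# done
```

The exact match matters: a substring check would report broker 3 as registered when only broker 13 is present.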

Prevention and Best Practices

  • Run ZooKeeper as a dedicated 3- or 5-node ensemble on separate hosts from Kafka, so a broker problem cannot starve ZooKeeper of resources.
  • Monitor ensemble health continuously with the four-letter words (ruok, stat, mntr) and alert on loss of a leader or a node count below quorum.
  • Keep zookeeper.connect under configuration management and validate it on deploy so a typo never reaches production.
  • Size zookeeper.connection.timeout.ms and zookeeper.session.timeout.ms against measured network latency and GC behavior — never leave defaults if you see regular reconnects.
  • Tune ZooKeeper’s JVM heap to avoid long GC pauses, and isolate its transaction log on fast, dedicated disks.
  • Plan a migration to KRaft mode, which removes ZooKeeper entirely and eliminates this class of error. New clusters should start on KRaft.

Related Errors

  • SessionExpiredException: KeeperErrorCode = Session expired for /controller — the next stage of severity, where the session is gone rather than just temporarily disconnected, forcing re-registration and possible controller re-election.
  • Unable to connect to ZooKeeper — broker startup failure when no ensemble member is reachable at all.
  • Timed out waiting for connection while in state: CONNECTING — the client never establishes a session within the timeout.
  • KeeperErrorCode = NoNode for /brokers/ids — a chroot or path mismatch, often from a wrong zookeeper.connect. More patterns are collected in the Kafka guides.

Frequently Asked Questions

Is ConnectionLoss the same as a session expiry? No. ConnectionLoss is transient — the client lost the TCP connection mid-request and will retry on the same session. A session expiry means the session itself is dead and all ephemeral nodes (broker registrations) are gone, which is a more disruptive event.

Does this error happen on KRaft clusters? No. KRaft-mode Kafka has no ZooKeeper dependency, so there is no /brokers/ids znode and no ZooKeeper client to lose its connection. This error is specific to ZooKeeper-based clusters.

Will increasing zookeeper.connection.timeout.ms alone fix it? Only if the root cause is genuinely a marginal timeout. If the ensemble has lost quorum or the port is blocked, a higher timeout just delays the same failure. Always restore quorum and the network path first.

Why must the ensemble have an odd number of nodes? Quorum is a strict majority. Three nodes tolerate one failure; five tolerate two. An even count (e.g., four) tolerates only as many failures as the odd number below it, yet costs more to run and adds one more node that can fail, so odd sizing is the standard.
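
The majority rule behind this answer is simple integer arithmetic, sketched here for a few ensemble sizes. The function names zk_quorum_size and zk_tolerated_failures are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Quorum is a strict majority: floor(n / 2) + 1 nodes must be up.
zk_quorum_size() { echo $(( $1 / 2 + 1 )); }

# Failures tolerated while still keeping a majority: floor((n - 1) / 2).
zk_tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 3 4 5 6; do
  printf 'nodes=%s quorum=%s tolerated_failures=%s\n' \
    "$n" "$(zk_quorum_size "$n")" "$(zk_tolerated_failures "$n")"
done
# nodes=3 quorum=2 tolerated_failures=1
# nodes=4 quorum=3 tolerated_failures=1
# nodes=5 quorum=3 tolerated_failures=2
# nodes=6 quorum=4 tolerated_failures=2
```

Four nodes tolerate one failure, the same as three, and six tolerate two, the same as five.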

The broker logs ConnectionLoss but recovers on its own — should I act? Frequent transient ConnectionLoss messages are a warning sign of an overloaded ensemble, GC pauses, or a flaky network. Even if the broker recovers, investigate with mntr/stat before it escalates to session expiry and broker churn.
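
One way to do that investigation is to scan mntr output for early signs of overload. The following is a sketch only: zk_mntr_warnings is a hypothetical helper, the thresholds (100 ms average latency, 10 outstanding requests) are placeholder assumptions to be replaced with values measured on your own ensemble, and mntr may need to be enabled in 4lw.commands.whitelist on newer ZooKeeper versions.

```shell
#!/bin/sh
# Hypothetical helper: read "echo mntr | nc <host> 2181" output on stdin
# and print a line for each value that exceeds a threshold.
# $1 = max average latency in ms (assumed default 100)
# $2 = max outstanding requests (assumed default 10)
zk_mntr_warnings() {
  awk -v max_lat="${1:-100}" -v max_out="${2:-10}" '
    $1 == "zk_avg_latency" && $2 + 0 > max_lat + 0 {
      print "high avg latency: " $2 " ms"
    }
    $1 == "zk_outstanding_requests" && $2 + 0 > max_out + 0 {
      print "request backlog: " $2
    }
    $1 == "zk_server_state" { state = $2 }
    END { if (state == "") print "no server state reported" }'
}

# Example use (requires nc):
# echo mntr | nc -w 3 zk-01 2181 | zk_mntr_warnings 100 10
```

No output means none of the checked values crossed its threshold. Repeated warnings alongside ConnectionLoss messages in the broker log suggest the ensemble, not the broker, is the place to look.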
