Kafka Error Guide: 'NetworkException: The server disconnected before a response was received'

Exact Error Message

org.apache.kafka.common.errors.NetworkException: The server disconnected before a
  response was received.

# Surrounding client logs that frame it:
[2026-06-29 14:22:08,455] WARN [Producer clientId=orders-svc] Connection to node 3
  (broker-3.kafka.svc:9092/10.0.4.31:9092) terminated during authentication.
  This may happen due to any of the following reasons: (1) the broker is not up,
  (2) ... (org.apache.kafka.common.network.Selector)
[2026-06-29 14:22:08,461] WARN [Producer clientId=orders-svc] Received invalid
  metadata error in produce request on partition orders-5 due to
  org.apache.kafka.common.errors.NetworkException: The server disconnected before a
  response was received.; will retry (org.apache.kafka.clients.producer.internals.Sender)

What the Error Means

NetworkException means exactly what it says: the TCP connection between client and broker dropped after the request was sent but before a response came back. The producer cannot know whether the broker processed the request, so from the client’s perspective the outcome is unknown, not failed.

Crucially, it is a retriable exception. The Kafka client treats it as transient and, by default, refreshes metadata and retries on another connection. You usually only see it bubble up to application code when retries are exhausted or disabled. That makes NetworkException fundamentally different from InvalidRecordException or RecordTooLargeException, where the broker deliberately said no — here, the broker never got a chance to answer.

The “unknown outcome” property is why idempotence matters: without it, a retry of a request that actually succeeded server-side would create a duplicate.
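
To make the duplicate risk concrete, here is a minimal, Kafka-free simulation of a send whose acknowledgement is lost and then retried. `ToyBroker`, `UnknownOutcomeDemo`, and their methods are hypothetical names invented for this sketch; a real broker tracks producer id, epoch, and per-partition sequence numbers with considerably more state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UnknownOutcomeDemo {
    /** Sends one record, loses the ack, retries; returns how many copies land in the log. */
    static int copiesAfterRetry(boolean idempotent) {
        ToyBroker broker = new ToyBroker();
        for (int attempt = 0; attempt < 2; attempt++) {
            // Attempt 0 commits, but the connection drops before the ack arrives,
            // so the client cannot tell and sends the same record again.
            if (idempotent) {
                broker.appendIdempotent(42L, 0, "order-1");
            } else {
                broker.append("order-1");
            }
        }
        return broker.size();
    }

    public static void main(String[] args) {
        System.out.println("without idempotence: " + copiesAfterRetry(false)); // 2
        System.out.println("with idempotence:    " + copiesAfterRetry(true));  // 1
    }
}

/** Toy model of a partition log; not Kafka's real implementation. */
class ToyBroker {
    private final List<String> log = new ArrayList<>();
    // Highest sequence number appended so far, per producer id.
    private final Map<Long, Integer> lastSeq = new HashMap<>();

    /** Non-idempotent append: every request that arrives is written. */
    void append(String value) {
        log.add(value);
    }

    /** Idempotent append: a sequence number already seen is not written again. */
    void appendIdempotent(long producerId, int sequence, String value) {
        Integer last = lastSeq.get(producerId);
        if (last != null && sequence <= last) {
            return; // duplicate of an already committed request
        }
        lastSeq.put(producerId, sequence);
        log.add(value);
    }

    int size() {
        return log.size();
    }
}
```

The retry is identical in both runs; only the broker-side bookkeeping decides whether the second arrival is appended.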

Common Causes

Rolling restart / broker bounce. When a broker shuts down for an upgrade, in-flight connections to it are severed. Clients see NetworkException until they discover the new leader via a metadata refresh.
Broker OOM-kill or crash. A broker killed by the OOM killer or a hard JVM crash drops every connection instantly and without a FIN handshake.
Long GC pause. A stop-the-world GC pause on the broker can stall it past request.timeout.ms or trip connections.max.idle.ms, after which the connection is torn down and the client observes a disconnect.
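Idle connection close. The broker closes connections that have been idle longer than its connections.max.idle.ms (default 10 minutes). The client has its own connections.max.idle.ms (default 9 minutes), so with defaults the client normally closes first. The problem appears when the broker or an intermediary has the shorter timeout: a bursty producer that goes quiet, then sends, can write to a socket the other side has already closed.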
Load balancer or proxy in front of the brokers. Putting an L4 LB or proxy between clients and brokers is a classic foot-gun. If advertised.listeners points clients back through the LB and the LB has its own idle/connection timeouts — or routes a connection to the wrong broker — connections get dropped mid-request. Misconfigured advertised.listeners is the single most common structural cause.
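
For the idle-close and proxy causes above, the question is which party closes an idle connection first. The sketch below is a plain arithmetic check, not a Kafka API: `IdleTimeoutCheck` and `clientClosesFirst` are hypothetical names, the 9-minute and 10-minute figures are the client and broker defaults for connections.max.idle.ms, and the 350-second proxy timeout is an assumed value for illustration.

```java
class IdleTimeoutCheck {
    /** True when the client retires an idle connection before the broker or any proxy does. */
    static boolean clientClosesFirst(long clientIdleMs, long brokerIdleMs, long proxyIdleMs) {
        return clientIdleMs < Math.min(brokerIdleMs, proxyIdleMs);
    }

    public static void main(String[] args) {
        long clientDefault = 540_000L; // client connections.max.idle.ms default (9 minutes)
        long brokerDefault = 600_000L; // broker connections.max.idle.ms default (10 minutes)
        long proxyIdle = 350_000L;     // hypothetical load balancer idle timeout

        // No proxy in the path: the client closes first.
        System.out.println(clientClosesFirst(clientDefault, brokerDefault, Long.MAX_VALUE)); // true
        // A proxy with a shorter idle timeout closes the connection under the client.
        System.out.println(clientClosesFirst(clientDefault, brokerDefault, proxyIdle));      // false
    }
}
```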

How to Reproduce the Error

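The most controllable reproduction is the idle-close path: make the client's idle timeout longer than the broker's, open a producer, sit idle past the broker's timeout, then send. The client often notices the close while idle and reconnects cleanly, so treat this as a best-effort reproduction; the broker-restart method below is more reliable.

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IdleCloseRepro {
    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker:9092");
        p.put("acks", "all");
        p.put("retries", 0);                    // disable retries so the exception surfaces
        p.put("enable.idempotence", "false");
        // The client default is 9 minutes; raise it so the broker closes the socket first.
        p.put("connections.max.idle.ms", "900000");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("orders", "k", "v")).get();
            Thread.sleep(Duration.ofMinutes(11).toMillis());   // exceed the broker's connections.max.idle.ms
            // If the client has not yet noticed the close, get() throws an
            // ExecutionException whose cause is NetworkException:
            producer.send(new ProducerRecord<>("orders", "k2", "v2")).get();
        }
    }
}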

Alternatively, in a disposable test cluster, stop and restart the leader broker for a partition (for example with kafka-server-stop.sh) while a tight produce loop is running, and watch the WARN logs.

Diagnostic Commands

Confirm broker reachability, listener advertisement, and whether a restart or GC pause lines up with the disconnects. All read-only.

# Does the broker answer at all, and what API versions does it advertise?
kafka-broker-api-versions.sh --bootstrap-server broker:9092

# Inspect advertised.listeners / listeners, the usual structural culprit
# (--all includes static server.properties values; without it only dynamic overrides are listed)
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type brokers --entity-name 3 --describe --all | grep -i listener

# Check the broker idle-timeout setting
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type brokers --entity-name 3 --describe --all | grep -i "connections.max.idle.ms"

# Correlate disconnects with broker restarts or GC pauses
journalctl -u kafka --since "30 min ago" | grep -iE "shutting down|started|Out of memory|killed"

# Look for long GC pauses in the broker GC log (JDK 8 log format; with JDK 9+ unified logging, grep for "Pause" instead)
grep -iE "Total time for which application threads were stopped" /var/log/kafka/gc.log | tail

# Confirm the topic's leadership so you know which broker the client was talking to
kafka-topics.sh --bootstrap-server broker:9092 --describe --topic orders

A common smell is kafka-broker-api-versions.sh succeeding from your jump host but the producer still failing — that mismatch usually means advertised.listeners hands clients an address that only works from one network path.
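
One way to test the advertised address from the producer's own network path is a plain TCP connect run on the producer host. This is a minimal sketch using only the JDK: `ListenerProbe` and `canConnect` are hypothetical names, and a successful connect only shows that the address resolves and accepts connections, not that the Kafka protocol or authentication works.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class ListenerProbe {
    /** Attempts a plain TCP connect; true if it completes within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // unresolved host, connection refused, or timed out
        }
    }

    public static void main(String[] args) {
        // Pass the advertised host and port, e.g. broker-3.kafka.svc 9092.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9092;
        System.out.println(host + ":" + port + " reachable=" + canConnect(host, port, 3000));
    }
}
```

Run it once from the jump host and once from the producer host; a difference between the two results points at the listener or proxy path rather than at the broker itself.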

Step-by-Step Resolution

Decide if it is transient or structural. If the NetworkException bursts line up with a deploy or restart in journalctl, it is transient — the right fix is making the client retry cleanly, not chasing a phantom outage.
For the transient case, configure the producer to absorb it. The error is retriable, so let the client retry and dedupe:
```
enable.idempotence=true
acks=all
retries=2147483647
max.in.flight.requests.per.connection=5
request.timeout.ms=30000
retry.backoff.ms=100
delivery.timeout.ms=120000
```
With idempotence on, retries of an ambiguous send cannot produce duplicates even if the original request actually committed. max.in.flight.requests.per.connection=5 is safe under idempotence because the broker deduplicates by sequence number.
For the GC-pause case, fix the broker, not the client. Tune broker heap and use G1/ZGC appropriately; sustained pauses longer than request.timeout.ms will keep generating disconnects no matter how the client is tuned.
For the proxy / advertised.listeners case, this is the structural fix that actually ends the problem. Brokers must advertise addresses that clients can reach directly, and any proxy must preserve Kafka’s broker-aware routing. Verify advertised.listeners resolves and connects from the client’s network, and ensure the proxy’s idle timeout exceeds the client’s expected idle period.
For idle-close, either keep the connection warm with a steadier produce cadence or accept that the first send after a long idle may retry once — which idempotent retries handle transparently.
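
The producer settings from the transient-case step can be sanity-checked before deployment. The sketch below is a hypothetical helper, not part of the Kafka client: `ProducerConfigCheck`, `resilientSettings`, and `problems` are invented names, and it treats unset keys as not configured instead of applying Kafka's defaults. The rules it encodes are that idempotence needs acks=all, retries greater than zero, and at most 5 in-flight requests per connection. The timeout check is simplified: Kafka itself requires delivery.timeout.ms to be at least linger.ms plus request.timeout.ms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class ProducerConfigCheck {
    /** The settings from the resolution step, as plain string properties. */
    static Properties resilientSettings() {
        Properties p = new Properties();
        p.setProperty("enable.idempotence", "true");
        p.setProperty("acks", "all");
        p.setProperty("retries", "2147483647");
        p.setProperty("max.in.flight.requests.per.connection", "5");
        p.setProperty("request.timeout.ms", "30000");
        p.setProperty("retry.backoff.ms", "100");
        p.setProperty("delivery.timeout.ms", "120000");
        return p;
    }

    /** Returns human-readable problems; an empty list means every check passed. */
    static List<String> problems(Properties p) {
        List<String> out = new ArrayList<>();
        if (!Boolean.parseBoolean(p.getProperty("enable.idempotence", "false"))) {
            out.add("enable.idempotence is off: a retry after an unknown outcome can duplicate");
        } else {
            String acks = p.getProperty("acks", "");
            if (!acks.equals("all") && !acks.equals("-1")) {
                out.add("idempotence requires acks=all");
            }
            if (Long.parseLong(p.getProperty("retries", "0")) < 1) {
                out.add("idempotence requires retries > 0");
            }
            if (Integer.parseInt(p.getProperty("max.in.flight.requests.per.connection", "0")) > 5) {
                out.add("idempotence allows at most 5 in-flight requests per connection");
            }
        }
        long delivery = Long.parseLong(p.getProperty("delivery.timeout.ms", "0"));
        long request = Long.parseLong(p.getProperty("request.timeout.ms", "0"));
        if (delivery < request) {
            out.add("delivery.timeout.ms is shorter than request.timeout.ms");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(problems(resilientSettings())); // prints []
    }
}
```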

Prevention and Best Practices

Always run producers with enable.idempotence=true. It converts the inherent ambiguity of a disconnect into a safe, duplicate-free retry.
Keep retries high with a bounded delivery.timeout.ms rather than failing fast; transient disconnects during deploys should never reach application code.
Get advertised.listeners right from day one. Most “flaky Kafka network” tickets are a listener/proxy misconfiguration, not a network fault.
Perform rolling restarts one broker at a time, waiting for ISR to stabilize, so only a small fraction of connections churn at once.
Monitor broker GC pause time and OOM events; both manifest first as client-side NetworkException spikes. An incident assistant can correlate a producer’s disconnect burst with the broker restart or GC pause that caused it.
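Keep idle timeouts ordered: the client's connections.max.idle.ms should be lower than the broker's connections.max.idle.ms and lower than any intermediary proxy's idle timeout, so the client retires an idle connection before anything else closes it underneath a request.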
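
For the GC monitoring point above, a small scanner can flag pauses long enough to outlast a client's request timeout. This sketch assumes the JDK 8 log line "Total time for which application threads were stopped"; unified logging on newer JDKs uses a different format and would need a different pattern. `GcPauseScan` and `pausesOver` are hypothetical names, and the sample lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GcPauseScan {
    // JDK 8 safepoint line; newer JDKs log pauses in a different format.
    private static final Pattern STOPPED = Pattern.compile(
            "Total time for which application threads were stopped: (\\d+\\.\\d+) seconds");

    /** Returns every stopped-time value, in seconds, that exceeds thresholdSeconds. */
    static List<Double> pausesOver(List<String> lines, double thresholdSeconds) {
        List<Double> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                double seconds = Double.parseDouble(m.group(1));
                if (seconds > thresholdSeconds) {
                    hits.add(seconds);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not output from a real broker.
        List<String> sample = Arrays.asList(
                "1234.567: Total time for which application threads were stopped: 0.0123456 seconds",
                "1300.001: Total time for which application threads were stopped: 31.5000000 seconds",
                "unrelated line");
        // A 30 s threshold mirrors request.timeout.ms=30000.
        System.out.println(pausesOver(sample, 30.0)); // prints [31.5]
    }
}
```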

Related Errors

TimeoutException — the close cousin: where NetworkException is an observed disconnect, a TimeoutException is the absence of any response within request.timeout.ms/delivery.timeout.ms. Broker instability frequently produces both.
NotEnoughReplicasException — during a rolling restart you may see disconnects (NetworkException) and ISR shrink rejections (NotEnoughReplicasException) in the same window.
UnknownProducerIdException and OutOfOrderSequenceException — both can surface when idempotent producers retry across connection churn and the producer-id/sequence state is lost or reordered.

The broader Kafka guides cover these failure modes together.

Frequently Asked Questions

Is NetworkException safe to retry? Yes — it is a retriable exception and the client retries it automatically by default. The only caution is duplicates: because the request outcome is unknown, a retry of a send that actually committed can duplicate unless enable.idempotence=true. Turn idempotence on and retries are both safe and free of duplicates.

Why does it appear right after a deploy or restart? Restarting a broker severs its open connections. Clients connected to that broker see the disconnect, refresh metadata, and reconnect to the new leader. A short burst of NetworkException around every rolling restart is expected; it is only a problem if retries are disabled or the bursts are large and prolonged.

My producer fails but kafka-broker-api-versions.sh works fine — why? That mismatch almost always points at advertised.listeners. The broker advertises an address that works from your admin host but is unreachable (or routes through a misconfigured proxy) from the producer’s network. Check the advertised address from the producer’s vantage point, not the bastion.

Can a load balancer in front of Kafka cause this? Yes, and it commonly does. Kafka clients need to address individual brokers by their advertised hostnames; an L4 load balancer that hides brokers behind one VIP, or that imposes its own idle timeout, will drop connections mid-request. If you must front Kafka, the proxy has to preserve per-broker routing and out-live the client idle window.

How is this different from TimeoutException? NetworkException means the connection was actively dropped before a reply arrived. TimeoutException means no reply arrived within the configured timeout, with the connection possibly still open. Both are retriable and often co-occur during broker stress, but they point at slightly different root causes — a torn socket versus a slow or stalled broker.
