AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'Fetch request timed out' Consumer & Replica Fetch Timeout

Fix Kafka 'Fetch request timed out' / request.timeout.ms exceeded on fetch: resolve slow brokers, overlarge fetch sizes, network latency, and replica fetcher stalls.

  • #kafka
  • #troubleshooting
  • #errors
  • #replication

Exact Error Message

A consumer whose fetch did not return within the request timeout logs a disconnect:

[2026-06-29 13:05:44,612] WARN [Consumer clientId=consumer-1, groupId=orders-service] Disconnecting from node 3 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2026-06-29 13:05:44,613] INFO [Consumer clientId=consumer-1, groupId=orders-service] Cancelled in-flight FETCH request with correlation id 8842 due to node 3 being disconnected (elapsed time since creation: 30041ms, request timeout: 30000ms) (org.apache.kafka.clients.NetworkClient)

On a follower broker, the replica fetcher logs the same shape against a leader:

[2026-06-29 13:05:44,701] WARN [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Error in response for fetch request (org.apache.kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.TimeoutException: Fetch request timed out: failed to get response within request.timeout.ms = 30000
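When many of these lines pile up, it can help to extract the node id and elapsed time instead of reading them by eye. The following is a minimal sketch, assuming the consumer-side message keeps the shape of the sample above; the regular expression and the helper names `parse_cancelled_fetch` and `timeouts_per_node` are illustrative, not part of Kafka.

```python
import re

# Pattern derived from the sample consumer log line above (an assumption:
# other Kafka versions may word this message differently).
CANCELLED_FETCH = re.compile(
    r"Cancelled in-flight FETCH request with correlation id (?P<correlation_id>\d+) "
    r"due to node (?P<node>-?\d+) being disconnected "
    r"\(elapsed time since creation: (?P<elapsed_ms>\d+)ms, "
    r"request timeout: (?P<timeout_ms>\d+)ms\)"
)


def parse_cancelled_fetch(line):
    """Return the integer fields of a cancelled-FETCH log line, or None."""
    match = CANCELLED_FETCH.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def timeouts_per_node(lines):
    """Count cancelled FETCH requests per broker node id."""
    counts = {}
    for line in lines:
        fields = parse_cancelled_fetch(line)
        if fields is not None:
            counts[fields["node"]] = counts.get(fields["node"], 0) + 1
    return counts
```

Feeding the application log through `timeouts_per_node` shows quickly whether the timeouts cluster on one broker or are spread across the cluster.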

What the Error Means

Both consumers and follower brokers read data with fetch requests. A consumer sends a fetch to the partition leader to get new records; a follower replica sends a fetch to the leader to replicate. Each side waits a bounded time for the response: a consumer waits up to request.timeout.ms (default 30000 ms), and a broker's replica fetcher takes its request timeout from replica.socket.timeout.ms (also 30000 ms by default). If the leader does not respond in time, the requester cancels the in-flight fetch, disconnects from that node, and retries, logging “Fetch request timed out” or “request timeout.”

This is a response-latency problem, not a “no data” situation. A normal idle fetch returns empty after fetch.max.wait.ms (well under the request timeout). A timeout means the broker that should respond is too slow, too busy, or unreachable for longer than the request timeout. For consumers it shows as stalled consumption and reconnects; for replica fetchers it shows as growing under-replicated partitions and replication lag.
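The relationship between the long-poll wait and the request timeout can be checked mechanically. This is a small sketch, assuming the consumer configuration is available as a plain dict of property names to values; `check_fetch_timing` is a hypothetical helper, and the fallback values are the documented consumer defaults (30000 ms and 500 ms).

```python
def check_fetch_timing(config):
    """Return a list of problems found in a consumer config dict.

    Keys are Kafka consumer property names; values may be ints or strings.
    """
    request_timeout_ms = int(config.get("request.timeout.ms", 30000))
    fetch_max_wait_ms = int(config.get("fetch.max.wait.ms", 500))

    problems = []
    # An idle fetch is held by the broker for up to fetch.max.wait.ms, so it
    # only returns "empty but on time" if that wait is below the timeout.
    if fetch_max_wait_ms >= request_timeout_ms:
        problems.append(
            "fetch.max.wait.ms (%d) is not below request.timeout.ms (%d); "
            "idle fetches can hit the request timeout"
            % (fetch_max_wait_ms, request_timeout_ms)
        )
    return problems
```

An empty list means an idle fetch has room to return before the timeout, so any timeout that still appears points at broker or network latency rather than at an empty topic.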

Common Causes

  • Overloaded leader broker: The leader is CPU-, disk-, or request-saturated and cannot service the fetch within the timeout.
  • Fetch size too large for the link: A big fetch.max.bytes / max.partition.fetch.bytes (or replica.fetch.max.bytes on followers) means a single response can’t transfer within the timeout on a slow link.
  • Network latency or loss: High RTT or packet loss between requester and leader stretches the round trip past request.timeout.ms.
  • Slow disk / page-cache misses: Reads served from disk (cold data, undersized page cache) are far slower than cache hits, delaying fetch responses.
  • Long GC pauses on the leader: Stop-the-world pauses freeze request handling past the timeout.
  • Timeout set too low: A request.timeout.ms smaller than realistic fetch latency causes spurious timeouts even on a healthy cluster.

How to Reproduce the Error

Pair a very large fetch with a tight timeout on a bandwidth-limited path:

# consumer config
request.timeout.ms=5000
fetch.max.bytes=104857600          # 100 MB
max.partition.fetch.bytes=52428800 # 50 MB
fetch.min.bytes=52428800           # force a large response

Asking for up to 100 MB per fetch while only allowing 5 s, across a constrained link, means the response cannot arrive in time and the consumer logs Cancelled in-flight FETCH request ... request timeout: 5000ms. Throttling the leader’s disk or saturating its network reproduces the same timeout under realistic conditions.
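The arithmetic behind this reproduction is easy to sketch. The helper below estimates the ideal transfer time of one fetch response; it is a simplification that ignores latency, TCP ramp-up, and protocol overhead, so real transfers take longer. The function names and the link speeds used afterwards are illustrative assumptions.

```python
def transfer_time_ms(response_bytes, link_mbit_per_s):
    """Ideal time in milliseconds to move response_bytes over the link."""
    bits = response_bytes * 8
    seconds = bits / (link_mbit_per_s * 1_000_000)
    return seconds * 1000


def fits_in_timeout(response_bytes, link_mbit_per_s, request_timeout_ms):
    """True if the ideal transfer time is below the request timeout."""
    return transfer_time_ms(response_bytes, link_mbit_per_s) < request_timeout_ms
```

With the settings above, a full 104857600-byte response over a 100 Mbit/s path needs about 8.4 s even in the ideal case, so `fits_in_timeout(104857600, 100, 5000)` is false, while the same response on a 1 Gbit/s path takes under one second.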

Diagnostic Commands

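Find fetch timeouts and replica-fetcher errors in the broker log. The consumer-side lines (“Cancelled in-flight FETCH”) are written by the client, so run the same grep against the consuming application's log as well: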

grep -nE "Fetch request timed out|due to request timeout|Cancelled in-flight FETCH|Error in response for fetch request" /var/log/kafka/server.log | tail -40

Check for under-replicated partitions (the cluster-side symptom of replica fetch timeouts):

kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

Describe the affected topic to see leaders and ISR membership:

kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic orders
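To see which followers have dropped out, the describe output can be compared replica by replica. This is a sketch that assumes the usual per-partition line layout (`Topic: … Partition: … Leader: … Replicas: … Isr: …`); the exact columns vary between Kafka versions, and `under_replicated` is a hypothetical helper name.

```python
import re

# Assumed layout of one partition line in `kafka-topics.sh --describe` output.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:[ \t]*(?P<isr>[\d,]*)"
)


def _ids(csv):
    """Turn '3,1,2' into [3, 1, 2]; an empty string gives []."""
    return [int(part) for part in csv.split(",") if part]


def under_replicated(describe_output):
    """List partitions whose ISR is missing at least one assigned replica."""
    result = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # topic summary line or unrelated output
        isr = _ids(match["isr"])
        missing = [b for b in _ids(match["replicas"]) if b not in isr]
        if missing:
            result.append({
                "topic": match["topic"],
                "partition": int(match["partition"]),
                "leader": match["leader"],
                "missing": missing,
            })
    return result
```

The `missing` broker ids are the followers whose fetches are not keeping up, and `leader` is the broker they are fetching from, which is usually the one to investigate first.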

Check consumer-group lag to quantify how far behind the timing-out consumers fell:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group orders-service
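The lag column can be totalled per partition with a few lines of parsing. This sketch assumes the tabular output with a header row containing TOPIC, PARTITION, and LAG; column sets differ between Kafka versions, so the code looks columns up by header name instead of position. `lag_by_partition` is a hypothetical helper.

```python
def lag_by_partition(describe_output):
    """Map (topic, partition) -> lag from kafka-consumer-groups.sh --describe text."""
    header = None
    lags = {}
    for line in describe_output.splitlines():
        cells = line.split()
        if not cells:
            continue
        if "TOPIC" in cells and "PARTITION" in cells and "LAG" in cells:
            header = cells  # remember column order for the rows that follow
            continue
        if header is None or len(cells) < len(header):
            continue
        row = dict(zip(header, cells))
        if not row["LAG"].isdigit():
            continue  # "-" means the lag is unknown for this partition
        lags[(row["TOPIC"], int(row["PARTITION"]))] = int(row["LAG"])
    return lags
```

Summing the values (`sum(lag_by_partition(text).values())`) before and after a fix gives a simple check that lag is actually draining.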

Look for GC pauses on the leader broker (a common latency source):

grep -niE "Total time for which application threads were stopped|Full GC|pause" /var/log/kafka/gc.log | tail -20
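The grep output still has to be read for pause durations. The sketch below filters for pauses above a threshold, assuming JDK 9+ unified GC logging where a pause line ends in a duration such as `35.123ms`; the older JDK 8 "Total time for which application threads were stopped" format is not handled, and `long_pauses` is a hypothetical helper.

```python
import re

# Assumed unified-logging shape: "... GC(42) Pause Young (...) 512M->128M(1024M) 35.123ms"
PAUSE_LINE = re.compile(r"(?P<kind>Pause \w+).*?(?P<ms>\d+(?:\.\d+)?)ms\s*$")


def long_pauses(lines, threshold_ms=1000.0):
    """Return (kind, duration_ms) for each GC pause at or above threshold_ms."""
    found = []
    for line in lines:
        match = PAUSE_LINE.search(line)
        if match is None:
            continue  # concurrent phases and other non-pause lines
        duration_ms = float(match["ms"])
        if duration_ms >= threshold_ms:
            found.append((match["kind"], duration_ms))
    return found
```

The default threshold of one second is an arbitrary starting point; pauses that approach the request timeout are the ones that matter for fetch timeouts.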

Check the TCP connections to the leader from the requester host. With -i, ss prints rtt and retransmit counters on an indented line under each connection, so keep that line in the output:

ss -tinp | grep -A1 -E ':9092' | head -20

Step-by-Step Resolution

  1. Confirm it’s latency, not idleness. “Fetch request timed out” with elapsed time at/over request.timeout.ms means responses are too slow. An empty fetch returning normally is fine — don’t confuse the two.
  2. Identify the slow node. The log names the node/leaderId. Investigate that broker’s CPU, disk I/O, request-handler saturation, and GC.
  3. Relieve broker load. If the leader is overloaded, rebalance leadership/partitions off it, add brokers, or reduce request pressure so fetches complete in time.
  4. Right-size fetch settings. If responses are huge relative to the link, lower fetch.max.bytes / max.partition.fetch.bytes (consumers) or replica.fetch.max.bytes (followers) so a response transfers within the timeout.
  5. Fix GC pauses. Tune the JVM/heap on the leader to eliminate multi-second stop-the-world pauses that stall request handling.
  6. Address the network/disk. Resolve packet loss/latency on the path, and ensure enough page cache / fast disks so reads aren’t pathologically slow.
  7. Only then adjust the timeout. If fetch latency is legitimately high, raise request.timeout.ms on consumers (replica.socket.timeout.ms for follower fetchers) to fit, but treat that as a last resort after fixing the underlying slowness. Confirm under-replicated partitions return to zero and consumer lag drains.
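For step 4, a rough upper bound on the fetch size can be derived from the link speed and the timeout. This is a back-of-the-envelope sketch, not a tuning rule: `max_fetch_bytes` is a hypothetical helper, and the `budget_fraction` of 0.5 is an assumed safety margin that leaves half the timeout for broker processing, queuing, and network jitter.

```python
def max_fetch_bytes(link_mbit_per_s, request_timeout_ms, budget_fraction=0.5):
    """Largest response size that transfers within a fraction of the timeout.

    Assumes the full link bandwidth is available to this one response,
    which is optimistic on a shared link.
    """
    bytes_per_second = link_mbit_per_s * 1_000_000 / 8
    budget_seconds = request_timeout_ms / 1000 * budget_fraction
    return int(bytes_per_second * budget_seconds)
```

For example, `max_fetch_bytes(100, 30000)` gives 187500000 bytes, while with a 5000 ms timeout the same link supports only 31250000 bytes. Treat the result as a ceiling for fetch.max.bytes or replica.fetch.max.bytes, and go lower when the link is shared.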

Prevention and Best Practices

  • Size fetch.max.bytes/max.partition.fetch.bytes and replica.fetch.max.bytes to your network so a single fetch response always fits comfortably within request.timeout.ms.
  • Keep leadership balanced across brokers so no single node becomes a fetch bottleneck.
  • Provision enough RAM for the OS page cache so most reads are cache hits, not disk reads.
  • Tune the JVM to avoid long GC pauses; pauses near the request timeout cause intermittent fetch timeouts that are hard to chase.
  • Monitor under-replicated partitions and consumer lag; both rise early when replica or consumer fetches start timing out.
  • Set request.timeout.ms to a realistic value for your workload rather than an aggressively low one.

Related Errors

  • Disconnecting from node N due to request timeout: the NetworkClient-level event behind a fetch timeout.
  • Under-replicated partitions / replication lag: the cluster-side symptom of replica fetcher timeouts.
  • NotLeaderOrFollowerException (NotLeaderForPartitionException before Kafka 2.6): the requester targeted a broker that is no longer the leader.
  • session timeout expired: a related consumer-side timeout, but on heartbeats/poll rather than fetch.

Frequently Asked Questions

Does this mean there’s no data to consume? No. An idle fetch with no new records returns empty after fetch.max.wait.ms, well within the request timeout. “Fetch request timed out” means the broker that should respond was too slow or unreachable for longer than request.timeout.ms — a latency problem, not an empty-topic situation.

Why do replica fetchers hit this and not just consumers? Follower brokers replicate by fetching from the partition leader using the same fetch mechanism. If the leader is slow, replica fetches time out, ISR shrinks, and partitions become under-replicated — making this an availability issue, not just a client annoyance.

Should I increase request.timeout.ms? Only after ruling out broker overload, oversized fetches, GC, and network problems. A higher timeout hides slowness rather than fixing it and delays detection of a genuinely failing broker. Fix the latency source first.

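Could a large message cause this? Indirectly, yes. A large fetch.max.bytes/max.partition.fetch.bytes lets a single response grow big enough that it cannot transfer within the timeout on a constrained link. Neither setting is a hard cap: the first record batch of a fetch is returned even if it exceeds them, so that the consumer can make progress, which means one oversized batch can still produce an oversized response. Reducing the fetch sizes often resolves timeouts without touching the timeout value.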

How do I find which broker is slow? The log names the node or leaderId for the timed-out fetch. Investigate that broker specifically — CPU, disk I/O, request-handler saturation, and GC pauses — rather than treating it as a whole-cluster problem.
