DevOps AI ToolKit
AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'left group due to expired session timeout' Consumer Drop

Fix Kafka consumers leaving the group on expired session timeout: tune session.timeout.ms and max.poll.interval.ms, cut GC pauses, and fix network and heartbeat stalls.

  • #kafka
  • #troubleshooting
  • #errors
  • #consumer

Exact Error Message

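On the broker, the group coordinator logs the member leaving and the rebalance that follows. The reason text distinguishes an explicit LeaveGroup after a poll timeout from a heartbeat expiration: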

[2026-06-29 11:47:03,221] INFO [GroupCoordinator 1]: Member consumer-orders-1-7c2f9a3e-1b44-4d2e-9a0c-3f6b1e8d5a90 in group orders-service has left group orders-service through explicit `LeaveGroup`; client reason: consumer poll timeout has expired (kafka.coordinator.group.GroupCoordinator)
[2026-06-29 11:47:03,222] INFO [GroupCoordinator 1]: Preparing to rebalance group orders-service in state PreparingRebalance with old generation 42 (reason: removing member consumer-orders-1-... on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)

On the client side you see the matching warning before the rebalance:

[Consumer clientId=consumer-orders-1, groupId=orders-service] Member consumer-orders-1-... sending LeaveGroup request to coordinator broker-1:9092 due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms
[Consumer clientId=consumer-orders-1, groupId=orders-service] Marking the coordinator broker-1:9092 (id: 2147483646) dead because the session has expired

What the Error Means

Kafka consumers stay in a group by proving liveness two ways. A background thread sends heartbeats every heartbeat.interval.ms; if the coordinator receives no heartbeat within session.timeout.ms, it considers the member dead and removes it. Separately, your application must call poll() at least every max.poll.interval.ms; if processing a batch takes longer, the consumer proactively leaves the group (“poll timeout has expired”). Either path ends with the member leaving and the group rebalancing, with its partitions reassigned to the surviving members.

The two triggers look similar but have different fixes. “session has expired” / “heartbeat expiration” means heartbeats stopped (often a stall, GC pause, or network blip). “consumer poll timeout has expired” / “max.poll.interval.ms” means your processing loop took too long between polls.
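
The distinction above can be sketched as a small log classifier. This is a minimal illustration, not part of Kafka: the class, enum, and method names are hypothetical, and the matched phrases are only the ones quoted in this guide, so the exact wording may differ between Kafka versions.

```java
import java.util.Locale;

// Hypothetical helper: names are illustrative and not part of any Kafka API.
final class LeaveReasonClassifier {

    enum Reason { POLL_INTERVAL, SESSION_HEARTBEAT, UNKNOWN }

    // Classify one log line using the phrases quoted in this guide.
    static Reason classify(String logLine) {
        String line = logLine.toLowerCase(Locale.ROOT);
        // Processing loop too slow between poll() calls.
        if (line.contains("poll timeout has expired") || line.contains("max.poll.interval.ms")) {
            return Reason.POLL_INTERVAL;
        }
        // Heartbeats stopped reaching the coordinator.
        if (line.contains("session has expired") || line.contains("heartbeat expiration")) {
            return Reason.SESSION_HEARTBEAT;
        }
        return Reason.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("sending LeaveGroup request due to consumer poll timeout has expired")); // POLL_INTERVAL
        System.out.println(classify("removing member on heartbeat expiration"));                             // SESSION_HEARTBEAT
        System.out.println(classify("Successfully joined group"));                                           // UNKNOWN
    }
}
```

The poll-interval check runs first because a single line can mention both the leave and the poll timeout; in that case the processing loop is the thing to fix.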

Common Causes

  • Slow message processing: A batch takes longer than max.poll.interval.ms to process (heavy work per record, downstream calls, large max.poll.records), so the consumer leaves before the next poll.
  • Long GC pauses: Stop-the-world JVM pauses freeze the heartbeat thread past session.timeout.ms.
  • session.timeout.ms too low: A tight session timeout relative to network jitter or heartbeat.interval.ms causes spurious expirations.
  • Network instability: Packet loss or latency between consumer and coordinator drops heartbeats.
  • Coordinator overload or restart: A broker hosting __consumer_offsets partitions is slow or restarting, so heartbeats are not processed in time.
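  • Thread starvation: The application blocks the poll thread (e.g., synchronous I/O) so poll() is not called within max.poll.interval.ms. Heartbeats come from a separate background thread, but severe CPU starvation or a frozen process can stall that thread too.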

How to Reproduce the Error

Set an aggressive interval and do slow work in the loop:

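# consumer config
max.poll.interval.ms=10000
max.poll.records=500

// consumer loop; assumes `consumer` is a KafkaConsumer<String, String> already subscribed to a topic
while (true) {
  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
  for (ConsumerRecord<String, String> r : records) {
    Thread.sleep(100); // 500 records * 100 ms = 50 s > 10 s max.poll.interval.ms
  }
}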

With enough backlog to fill each batch, processing 500 records at 100 ms each takes 50 s, well over the 10 s max.poll.interval.ms. The consumer logs “consumer poll timeout has expired” and leaves the group, triggering a rebalance.
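
The same arithmetic can be turned into a quick sizing check. The sketch below is illustrative only: the class and method names are hypothetical, and the 50% headroom figure is an assumption chosen for the example, not a Kafka recommendation.

```java
// Hypothetical sizing helper: names and the headroom value are illustrative assumptions.
final class PollBudget {

    // Worst-case time to process one full batch, in milliseconds.
    static long batchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    // Largest max.poll.records whose worst-case batch fits inside
    // (headroom * max.poll.interval.ms); headroom is a fraction in (0, 1].
    static int maxSafeRecords(long maxPollIntervalMs, long perRecordMillis, double headroom) {
        if (perRecordMillis <= 0 || headroom <= 0 || headroom > 1) {
            throw new IllegalArgumentException("perRecordMillis must be > 0 and headroom in (0, 1]");
        }
        long budgetMs = (long) (maxPollIntervalMs * headroom);
        return (int) Math.min(Integer.MAX_VALUE, Math.max(1, budgetMs / perRecordMillis));
    }

    public static void main(String[] args) {
        System.out.println(batchMillis(500, 100));              // 50000 ms, far above the 10000 ms limit
        System.out.println(maxSafeRecords(10_000, 100, 0.5));   // 50 records keeps a batch within 5 s
    }
}
```

Measure the per-record time under realistic load, including slow downstream calls, since the worst case is what triggers the timeout.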

Diagnostic Commands

Find the expiration and rebalance events on the broker:

grep -nE "has left group|heartbeat expiration|poll timeout has expired|Preparing to rebalance" /var/log/kafka/server.log | tail -40

Inspect the group's partition assignment, owning members, and especially lag:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group orders-service

Show the group's state and watch for it staying in PreparingRebalance or CompletingRebalance instead of returning to Stable:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group orders-service --state

On the consumer host, check for GC pressure in the client/application log:

grep -niE "GC pause|Full GC|stop-the-world|OutOfMemory" /var/log/myapp/consumer.log | tail -20

Confirm the coordinator brokers (hosting __consumer_offsets) are healthy:

kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic __consumer_offsets | head -10

Check that the consumer host holds established TCP connections to the brokers (this confirms reachability, not latency):

ss -tnp | grep -E ':9092' | head

Step-by-Step Resolution

  1. Identify which timeout fired. “poll timeout has expired” / max.poll.interval.ms is a processing problem; “session has expired” / “heartbeat expiration” is a liveness/heartbeat problem. The log wording tells you which.
  2. For poll-interval timeouts: Reduce max.poll.records so each batch is processed well within max.poll.interval.ms, and/or raise max.poll.interval.ms to fit genuinely long processing. Move heavy work off the poll thread if possible.
  3. For session/heartbeat timeouts: Investigate GC and network. Tune the JVM (heap, collector) to eliminate long stop-the-world pauses, and confirm stable connectivity to the coordinator.
  4. Right-size the heartbeat settings. Keep heartbeat.interval.ms at roughly one-third of session.timeout.ms, and set session.timeout.ms within the broker-allowed range (group.min.session.timeout.ms/group.max.session.timeout.ms).
  5. Reduce rebalance churn. Use cooperative-sticky assignment and group.instance.id (static membership) so brief absences do not trigger full reassignments.
  6. Verify the coordinator is healthy — if __consumer_offsets brokers are overloaded, fix that before tuning clients.
  7. Redeploy and confirm the group stays stable with no repeated leave/rebalance lines and lag draining normally.
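
The settings named in the steps above could be assembled roughly as follows. This is a sketch under stated assumptions: the class name and every numeric value are illustrative, the keys are standard consumer configuration names, and the result still needs bootstrap.servers and deserializers before it can be passed to a KafkaConsumer.

```java
import java.util.Properties;

// Hypothetical config builder: the class name and all values are illustrative assumptions.
final class StableConsumerConfig {

    static Properties build(String instanceId) {
        int sessionTimeoutMs = 45_000; // must sit inside the broker's group.min/max.session.timeout.ms range

        Properties props = new Properties();
        props.setProperty("group.id", "orders-service");
        // Smaller batches so a full batch finishes well inside max.poll.interval.ms.
        props.setProperty("max.poll.records", "100");
        props.setProperty("max.poll.interval.ms", "300000");
        // Heartbeat at one-third of the session timeout.
        props.setProperty("session.timeout.ms", Integer.toString(sessionTimeoutMs));
        props.setProperty("heartbeat.interval.ms", Integer.toString(sessionTimeoutMs / 3));
        // Cooperative rebalancing plus static membership to reduce churn.
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        props.setProperty("group.instance.id", instanceId); // should be unique and stable per instance
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("orders-service-0");
        System.out.println(props.getProperty("heartbeat.interval.ms")); // 15000
    }
}
```

Changing the assignment strategy on a running group may need a staged rollout; check the documentation for your client version before switching.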

Prevention and Best Practices

  • Size max.poll.records against your real per-record processing time so a full batch always finishes well inside max.poll.interval.ms.
  • Keep heartbeat.interval.ms at roughly session.timeout.ms / 3, and avoid setting session.timeout.ms too aggressively low for your network.
  • Tune the JVM to avoid multi-second GC pauses; long pauses are a leading cause of “session expired” drops.
  • Use static group membership (group.instance.id) and cooperative-sticky assignment to minimize rebalances during transient blips and deploys.
  • Never block the poll thread on slow synchronous I/O; offload heavy work or use an async pattern that still polls regularly.
  • Monitor consumer lag and rebalance frequency; a rising rebalance rate is an early warning. For triage, the free incident assistant can map the leave-group logs to a likely cause.

Related Errors

  • CommitFailedException: offset commit rejected because the member was already removed during a rebalance.
  • RebalanceInProgressException: operations failing while the group is mid-rebalance.
  • CoordinatorNotAvailableException: the group coordinator itself is unavailable.
  • Member ... fenced: a stale generation member fenced after rejoining late.

Frequently Asked Questions

What’s the difference between session.timeout.ms and max.poll.interval.ms? session.timeout.ms governs heartbeats from a background thread (liveness). max.poll.interval.ms governs how long your application can go between poll() calls (processing). A member can be dropped for violating either, and the log message tells you which.

Should I just increase the timeouts? Increase max.poll.interval.ms only if your processing legitimately needs more time, and session.timeout.ms only within broker limits. Raising them masks GC or processing problems and slows detection of genuinely dead consumers. Prefer fixing the root cause (batch size, GC, network).

Why does one slow consumer disrupt the whole group? When a member leaves, the group rebalances and partitions are reassigned across all members, briefly pausing consumption for everyone. Frequent drops cause “rebalance storms.” Static membership and cooperative-sticky assignment reduce the blast radius.

Can GC pauses really cause this? Yes — a multi-second stop-the-world pause freezes the heartbeat thread, so the coordinator sees no heartbeat within session.timeout.ms and expires the member. Long GC pauses are one of the most common causes of “session has expired” drops.
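
One way to see whether the JVM is freezing for long stretches is a simple stall detector: sleep in short ticks and record how far each sleep overshoots. The sketch below is an illustration, not a Kafka feature. The class and method names are hypothetical, the 45 s session timeout is an assumed value, and an overshoot can come from GC, CPU throttling, or scheduler delay alike.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stall detector: names and thresholds are illustrative assumptions.
final class StallDetector {

    // Sleep `ticks` times for `tickMillis` each and return the largest overshoot in milliseconds.
    static long maxStallMillis(long tickMillis, int ticks) {
        long worst = 0;
        for (int i = 0; i < ticks; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(tickMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop measuring
                break;
            }
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            worst = Math.max(worst, elapsedMs - tickMillis);
        }
        return worst;
    }

    // True when an observed stall is long enough to miss the session timeout on its own.
    static boolean exceedsSession(long stallMillis, long sessionTimeoutMs) {
        return stallMillis >= sessionTimeoutMs;
    }

    public static void main(String[] args) {
        long stall = maxStallMillis(10, 100);
        long sessionTimeoutMs = 45_000; // assumed value; use your consumer's setting
        System.out.println("max stall ms: " + stall
                + ", exceeds session timeout: " + exceedsSession(stall, sessionTimeoutMs));
    }
}
```

Run inside the consumer process on a daemon thread, a detector like this can log stalls next to the consumer's own warnings, which helps correlate freezes with session expirations. JVM GC logs remain the more direct evidence.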

How do I confirm it’s the coordinator and not my client? Check the health of the brokers hosting __consumer_offsets. If those brokers are overloaded or restarting, heartbeats aren’t processed in time regardless of client tuning, and multiple groups will show simultaneous expirations.
