AI for Kafka · By James Joyner IV · 9 min read

Kafka Error Guide: 'CommitFailedException' Offset Commit Cannot Be Completed

Fix Kafka CommitFailedException when the consumer falls out of an active group: diagnose slow processing, max.poll.interval.ms, and rebalance-driven commit rejection.

  • #kafka
  • #troubleshooting
  • #errors
  • #consumer

Exact Error Message

org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:1226)
	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:1093)
	at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1517)
	at com.example.orders.OrderConsumer.run(OrderConsumer.java:78)

You will often see a preceding WARN from the consumer that names the real trigger:

WARN  o.a.k.c.c.internals.ConsumerCoordinator - [Consumer clientId=order-worker-3, groupId=order-processing]
  consumer poll timeout has expired. This means the time between subsequent calls to poll()
  was longer than the configured max.poll.interval.ms, which typically implies that the poll
  loop is spending too much time processing messages. You can address this either by increasing
  max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

What the Error Means

CommitFailedException is thrown when a consumer tries to commit offsets but the group coordinator no longer considers it a member of the active group. Between the previous poll() and the commit, the consumer’s membership was revoked, a rebalance occurred, and the partitions it was about to commit were reassigned to a different member. Committing for partitions you no longer own would corrupt another consumer’s progress, so the broker rejects the commit and the client surfaces this exception.

The overwhelmingly common trigger is exceeding max.poll.interval.ms (default 300000 ms / 5 minutes). The consumer’s poll loop must call poll() at least once per interval. If processing a single batch takes longer than that, the consumer is treated as failed: its background heartbeat thread logs the poll-timeout warning and leaves the group, and the coordinator starts a rebalance. When your code finally finishes and calls commitSync(), the membership is already gone.

Common Causes

  • Processing slower than the poll interval. A batch of max.poll.records messages takes longer to process than max.poll.interval.ms, so the client’s background heartbeat thread gives up on the stalled poll loop and leaves the group, and the coordinator reassigns its partitions.
  • max.poll.records too high. Pulling 500 records and doing a slow downstream call (database write, HTTP request) per record can easily blow past 5 minutes.
  • A single poison-pill record stalls the loop. One message triggers a long retry/backoff or an external timeout, freezing the whole batch.
  • GC pauses or thread starvation. Long stop-the-world pauses delay the next poll() call past the interval.
  • max.poll.interval.ms set too low for legitimately heavy processing.
  • Manual commit after a known rebalance. Committing from a ConsumerRebalanceListener once ownership has already moved (for example in onPartitionsLost()), or committing after partitions were revoked.
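
The first two causes come down to simple arithmetic: worst-case batch time is roughly max.poll.records multiplied by the slowest per-record cost. The sketch below is a minimal illustration of that calculation, not part of any Kafka API. The class name PollBudget, both method names, and the headroom parameter are assumptions made up for this example.

```java
public class PollBudget {

    /** Worst-case time one full batch spends in processing, in milliseconds. */
    public static long worstCaseBatchMillis(int maxPollRecords, long slowestPerRecordMillis) {
        return Math.multiplyExact((long) maxPollRecords, slowestPerRecordMillis);
    }

    /**
     * Largest max.poll.records value that keeps a worst-case batch inside a
     * fraction (headroom, between 0 and 1) of max.poll.interval.ms.
     */
    public static int maxRecordsFor(long maxPollIntervalMillis,
                                    long slowestPerRecordMillis,
                                    double headroom) {
        if (slowestPerRecordMillis <= 0 || headroom <= 0 || headroom > 1) {
            throw new IllegalArgumentException("per-record cost must be > 0 and headroom in (0, 1]");
        }
        long budgetMillis = (long) (maxPollIntervalMillis * headroom);
        long records = budgetMillis / slowestPerRecordMillis;
        // Never return less than 1: the consumer has to fetch something.
        return (int) Math.max(1, Math.min(Integer.MAX_VALUE, records));
    }

    public static void main(String[] args) {
        // 500 records at 1 s each take 500000 ms, well past the 300000 ms default.
        System.out.println(worstCaseBatchMillis(500, 1_000));
        // Same per-record cost, using half of the default interval as the budget.
        System.out.println(maxRecordsFor(300_000, 1_000, 0.5));
    }
}
```

Measure the per-record cost at a high percentile rather than the average, since one slow downstream call is enough to push a batch over the interval.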

How to Reproduce the Error

Set a deliberately short poll interval and sleep longer than it inside the loop:

props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processing");
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 10000); // 10s
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> r : records) {
        Thread.sleep(1000); // 50 records * 1s = 50s > 10s interval
    }
    consumer.commitSync(); // throws CommitFailedException
}

The first batch processes for ~50 seconds, the consumer drops out of the group about 10 seconds in, and commitSync() throws.

Diagnostic Commands

Confirm the group is rebalancing or has churning members. All commands below are read-only.

# Group state and coordinator — look for PreparingRebalance / CompletingRebalance
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processing --state

# Per-member assignment and client host — repeated runs show members coming and going
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processing --members --verbose

# Lag per partition — growing lag confirms the group cannot keep up
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processing

# Inspect the consumer's effective interval/records config in app logs at startup
grep -E "max.poll.interval.ms|max.poll.records|session.timeout.ms" /var/log/order-worker/app.log

# Look for the poll-timeout WARN that precedes the exception
journalctl -u order-worker --since "1 hour ago" | grep -i "poll timeout has expired"

Step-by-Step Resolution

  1. Confirm the trigger. Find the “poll timeout has expired” WARN in the consumer log immediately before the exception. Its presence means max.poll.interval.ms was exceeded; its absence points to a different rebalance cause (deploy, scaling).

  2. Reduce batch size first. Lower max.poll.records (for example from 500 to 50) so each poll loop iteration finishes well inside the interval. This is the safest fix because it does not extend how long a dead consumer is tolerated.

    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);

  3. Raise max.poll.interval.ms only if processing is legitimately long. If a batch genuinely needs more time, increase it (for example to 600000). Be aware this also increases how long the group waits before reassigning a truly stuck consumer.

  4. Move slow work off the poll thread. For heavy per-record processing, hand records to a worker pool and use consumer.pause() / consumer.resume() so poll() keeps being called (and membership stays alive) while processing proceeds asynchronously.

  5. Handle the exception gracefully. Catch CommitFailedException, do not treat the in-flight batch as committed, and let the next poll() rejoin the group. The records will be redelivered to whichever member now owns the partitions, so your processing must be idempotent.

  6. Verify recovery. Re-run kafka-consumer-groups.sh with --describe --state and confirm the group is Stable, then watch lag trend downward.
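
The control flow behind step 4 can be sketched without the Kafka client on the classpath. In the example below, RecordSource is a hypothetical stand-in for the handful of KafkaConsumer calls involved; its methods and the PausedPollLoop class are assumptions for illustration, with the real call each one represents noted in a comment. Treat it as a sketch of the pattern rather than a drop-in implementation.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

public class PausedPollLoop {

    /** Hypothetical stand-in for the KafkaConsumer calls this pattern relies on. */
    public interface RecordSource<R> {
        List<R> poll(Duration timeout); // consumer.poll(timeout)
        void pauseAll();                // consumer.pause(consumer.assignment())
        void resumeAll();               // consumer.resume(consumer.paused())
        void commit();                  // consumer.commitSync()
    }

    /**
     * Fetches one batch, processes it on a worker thread, and keeps calling
     * poll() on the calling thread while the work runs. Returns how many
     * keep-alive polls were issued while waiting.
     */
    public static <R> int processOneBatch(RecordSource<R> source,
                                          Executor workers,
                                          Consumer<R> handler,
                                          Duration pollTimeout) {
        List<R> batch = source.poll(pollTimeout);
        if (batch.isEmpty()) {
            return 0;
        }
        // Stop fetching so the keep-alive polls below return no records.
        source.pauseAll();
        CompletableFuture<Void> done =
                CompletableFuture.runAsync(() -> batch.forEach(handler), workers);
        int keepAlivePolls = 0;
        try {
            while (!done.isDone()) {
                source.poll(pollTimeout); // keeps the poll interval from expiring
                keepAlivePolls++;
            }
            done.join();     // rethrows a handler failure as CompletionException
            source.commit(); // reached only if every record was handled
        } finally {
            source.resumeAll();
        }
        return keepAlivePolls;
    }
}
```

With the real client, partitions assigned by a rebalance during the keep-alive polls are not paused automatically, so a production version would also pause them from a ConsumerRebalanceListener and would still need to handle CommitFailedException as described in step 5.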

Prevention and Best Practices

  • Keep per-loop processing time comfortably under max.poll.interval.ms; size max.poll.records against your slowest realistic per-record cost.
  • Make message handling idempotent so redelivery after eviction is harmless.
  • Offload long or unbounded work (external calls, retries) to a separate executor and use pause/resume to keep the poll loop heartbeating.
  • Add a metric on the time between poll() calls and alert before it approaches the interval.
  • Avoid unbounded retry loops inside the consumer thread; cap retries and dead-letter poison records.

Related Errors

  • RebalanceInProgressException: a commit raced an in-progress rebalance; closely related but recoverable by retrying the poll cycle.
  • IllegalGenerationException: the commit used a stale group generation after a rebalance completed.
  • UnknownMemberIdException: the coordinator no longer recognizes the member id, often the layer beneath a CommitFailedException.
  • WakeupException: unrelated to commits; raised by consumer.wakeup() for shutdown.
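
The poll-gap metric suggested under Prevention and Best Practices could be tracked with a small helper like the one below. This is a minimal sketch: PollGapMonitor, its constructor parameters, and the warnFraction threshold are assumptions for illustration, and the boolean result would be wired to whatever metrics or alerting library you already use.

```java
import java.util.function.LongSupplier;

public class PollGapMonitor {
    private final long maxPollIntervalMillis;
    private final double warnFraction;   // e.g. 0.8 warns at 80% of the interval
    private final LongSupplier clockMillis;
    private long lastPollMillis = -1;    // -1 means poll() has not been called yet
    private long worstGapMillis = 0;

    public PollGapMonitor(long maxPollIntervalMillis, double warnFraction, LongSupplier clockMillis) {
        this.maxPollIntervalMillis = maxPollIntervalMillis;
        this.warnFraction = warnFraction;
        this.clockMillis = clockMillis;
    }

    /**
     * Call on the poll thread immediately before each consumer.poll().
     * Returns true when the gap since the previous call reached the warning threshold.
     */
    public boolean beforePoll() {
        long now = clockMillis.getAsLong();
        boolean warn = false;
        if (lastPollMillis >= 0) {
            long gap = now - lastPollMillis;
            worstGapMillis = Math.max(worstGapMillis, gap);
            warn = gap >= maxPollIntervalMillis * warnFraction;
        }
        lastPollMillis = now;
        return warn;
    }

    /** Largest gap observed so far, suitable for exporting as a gauge. */
    public long worstGapMillis() {
        return worstGapMillis;
    }
}
```

In an application the clock would typically be System::currentTimeMillis; taking it as a parameter keeps the helper easy to test. The class is meant to be used from the single poll thread and is not thread-safe.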

Frequently Asked Questions

Is it safe to ignore CommitFailedException? It is safe to catch it and carry on, as long as your processing is idempotent. The exception means the just-processed batch was not committed and will be redelivered, so do not advance application state as if the commit succeeded.
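
One common way to get that idempotency is to remember the highest offset already applied per partition and skip anything at or below it on redelivery. The sketch below shows only the bookkeeping. OffsetDeduplicator and its methods are hypothetical names, and the in-memory map is an assumption standing in for a durable store shared by every group member, ideally updated in the same transaction as the side effect, because redelivered records usually land on a different process.

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetDeduplicator {
    // Stand-in for a durable, shared store keyed by topic-partition.
    private final Map<String, Long> lastAppliedOffset = new HashMap<>();

    private static String key(String topic, int partition) {
        return topic + "-" + partition;
    }

    /** True if this record has not been applied yet and should be processed. */
    public boolean shouldProcess(String topic, int partition, long offset) {
        Long last = lastAppliedOffset.get(key(topic, partition));
        return last == null || offset > last;
    }

    /** Record that the side effect for this offset has been applied. */
    public void markApplied(String topic, int partition, long offset) {
        lastAppliedOffset.merge(key(topic, partition), offset, Long::max);
    }
}
```

A handler would call shouldProcess() before doing the work and markApplied() after it succeeds, so a batch replayed after an eviction becomes a series of skips rather than duplicate writes.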

Should I just keep increasing max.poll.interval.ms? No. A very large interval delays detection of genuinely stuck consumers, so a hung worker holds its partitions for minutes. Fix processing speed and batch size first; raise the interval only when the work is legitimately long.

Does this affect heartbeats? max.poll.interval.ms is separate from session.timeout.ms/heartbeat.interval.ms. Heartbeats run on a background thread and keep the session alive; the poll interval governs liveness of your processing loop. Exceeding the poll interval evicts you even while heartbeats are healthy.

Why does lag spike when this happens? Eviction triggers a rebalance, partitions pause during reassignment, and the uncommitted batch is reprocessed, so consumed-but-uncommitted offsets reappear as lag until the group restabilizes.
