Kafka Consumer Rebalance Storm Triage Prompt
Diagnose frequent or looping consumer-group rebalances by working through session, heartbeat, and poll timeouts, static membership, and the rebalance protocol in use.
- Target user: SRE and backend engineers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Kafka engineer triaging a consumer group that is rebalancing repeatedly, producing a diagnosis and fix plan to review before changing any configuration.

I will provide:
- The symptom: rebalance frequency, log excerpts (e.g. "Attempt to heartbeat failed", "leaving group", "member ... has failed"), and which group/topic is affected
- Consumer configuration: session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms, max.poll.records, group.instance.id (if any), and partition.assignment.strategy
- Client library and version, number of instances, and whether instances are being restarted/scaled (autoscaling, deploys, OOM kills)
- What each poll loop does between polls (processing time, blocking calls, external I/O)

Your job:
1. **Classify the rebalance trigger**: distinguish membership changes (instances joining/leaving, restarts, crashes) from liveness failures (missed heartbeats vs. exceeding max.poll.interval.ms), using the log signatures to decide which.
2. **Find the timeout that is firing**: reason about whether slow processing exceeds max.poll.interval.ms (poll loop too slow) or heartbeats are missed (session timeout), and identify the misconfigured knob.
3. **Recommend protocol-level fixes**: advise on static group membership (group.instance.id) to survive restarts, and cooperative/incremental rebalancing to avoid stop-the-world reassignment, noting client-version requirements.
4. **Tune timeouts and batch size**: propose concrete values for poll interval and max.poll.records that match real processing time, with the reasoning.
5. **Address the deploy pattern**: if rolling deploys or autoscaling cause churn, recommend graceful shutdown and rollout pacing.

Output: (a) rebalance trigger classification, (b) the specific timeout/config at fault, (c) protocol-level fixes (static membership, cooperative rebalancing), (d) tuned config values, (e) deploy/scaling recommendations.
Advisory only; roll out config changes to a canary consumer instance before applying group-wide.
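To sanity-check the values the prompt returns for step 4, the arithmetic behind the poll-loop budget can be sketched in a few lines of Python. This is a minimal illustration, not part of the prompt and not a Kafka client API: `ConsumerTimeouts`, `check_timeouts`, and `max_records_for_budget` are hypothetical helper names, and the 800 ms per-record cost and 50% headroom are assumed figures chosen only for the example. The one-third heartbeat rule follows the guidance in the Kafka consumer configuration documentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConsumerTimeouts:
    """Hypothetical container for the four settings the prompt asks for."""
    session_timeout_ms: int
    heartbeat_interval_ms: int
    max_poll_interval_ms: int
    max_poll_records: int


def check_timeouts(cfg: ConsumerTimeouts, worst_case_ms_per_record: float) -> List[str]:
    """Return findings for settings that are likely to trigger rebalances."""
    findings = []

    # Worst case: poll() returns a full batch and every record is slow.
    worst_batch_ms = cfg.max_poll_records * worst_case_ms_per_record
    if worst_batch_ms >= cfg.max_poll_interval_ms:
        findings.append(
            f"A full batch may take {worst_batch_ms:.0f} ms, which reaches "
            f"max.poll.interval.ms ({cfg.max_poll_interval_ms} ms)"
        )

    # Kafka docs: heartbeat.interval.ms should typically be no higher
    # than one third of session.timeout.ms.
    if cfg.heartbeat_interval_ms * 3 > cfg.session_timeout_ms:
        findings.append(
            f"heartbeat.interval.ms ({cfg.heartbeat_interval_ms} ms) is more than "
            f"1/3 of session.timeout.ms ({cfg.session_timeout_ms} ms)"
        )
    return findings


def max_records_for_budget(max_poll_interval_ms: int,
                           worst_case_ms_per_record: float,
                           headroom: float = 0.5) -> int:
    """Largest max.poll.records that uses only `headroom` of the poll interval.

    The 0.5 default is an assumed safety margin, not a Kafka recommendation.
    """
    budget_ms = max_poll_interval_ms * headroom
    return max(1, int(budget_ms / worst_case_ms_per_record))


# Example: 500 records per poll at an assumed 800 ms each is 400 s of work,
# which exceeds a 300 s max.poll.interval.ms.
cfg = ConsumerTimeouts(
    session_timeout_ms=45_000,
    heartbeat_interval_ms=3_000,
    max_poll_interval_ms=300_000,
    max_poll_records=500,
)
for finding in check_timeouts(cfg, worst_case_ms_per_record=800.0):
    print(finding)
print(max_records_for_budget(300_000, 800.0))  # 187
```

Measure the per-record cost from real processing traces before relying on the result; a check like this only covers the poll-loop and heartbeat arithmetic, not membership churn from deploys or autoscaling.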
Related prompts
- Kafka Consumer Lag Investigation Prompt: Investigate and reduce growing consumer lag by isolating the root cause (slow processing, partition skew, GC pauses, or broker-side bottlenecks), then prescribe targeted fixes.
- Kafka Producer Throughput & Latency Tuning Prompt: Tune Kafka producer batching, compression, acks, linger, and idempotence to hit a throughput or latency target while keeping the durability guarantees you actually need.