AI for Kafka

Kafka Consumer Rebalance Storm Triage Prompt

Diagnose frequent or looping consumer-group rebalances by working through session, heartbeat, and poll timeouts, static membership, and the rebalance protocol in use.

Target user: SRE and backend engineers
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Kafka engineer triaging a consumer group that is rebalancing repeatedly. Produce a diagnosis and fix plan that I can review before changing any configuration.

I will provide:
- The symptom: rebalance frequency, log excerpts (e.g. "Attempt to heartbeat failed", "leaving group", "member ... has failed"), and which group/topic is affected
- Consumer configuration: session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms, max.poll.records, group.instance.id (if any), and partition.assignment.strategy
- Client library and version, number of instances, and whether instances are being restarted/scaled (autoscaling, deploys, OOM kills)
- What each poll loop does between polls (processing time, blocking calls, external I/O)

Your job:

1. **Classify the rebalance trigger**: distinguish membership changes (instances joining or leaving, restarts, crashes) from liveness failures (missed heartbeats vs. exceeding max.poll.interval.ms), using the log signatures to decide which applies.
2. **Find the timeout that is firing**: determine whether processing between polls exceeds max.poll.interval.ms (poll loop too slow) or heartbeats go missing for longer than session.timeout.ms (session timeout), and name the misconfigured setting.
3. **Recommend protocol-level fixes**: advise on static group membership (group.instance.id) to survive restarts, and on cooperative (incremental) rebalancing to avoid stop-the-world reassignment, noting the client and broker versions each requires.
4. **Tune timeouts and batch size**: propose concrete values for max.poll.interval.ms and max.poll.records that match the measured processing time, and show the reasoning behind each value.
5. **Address the deploy pattern**: if rolling deploys or autoscaling cause churn, recommend graceful shutdown and rollout pacing.
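
Before pasting log excerpts, it can help to pre-sort them along the lines of steps 1 and 2. The following is a minimal sketch, not a definitive parser: the patterns for "Attempt to heartbeat failed", "leaving group", and "member ... has failed" come from the signatures listed above, while "poll timeout has expired" and the category names are assumptions chosen for illustration. Exact log wording differs between client libraries and versions, so adjust the patterns to your own logs.

```python
import re
from collections import Counter

# (pattern, category) pairs, checked in order. Category names are hypothetical
# labels. The "poll timeout has expired" pattern is an assumed signature; the
# other three come from the symptom list in the prompt.
SIGNATURES = [
    (re.compile(r"poll timeout has expired", re.I), "poll_interval_exceeded"),
    (re.compile(r"member .* has failed", re.I), "session_timeout"),
    (re.compile(r"leaving group", re.I), "member_left"),
    # Usually a symptom that a rebalance is already under way, not its cause.
    (re.compile(r"attempt to heartbeat failed", re.I), "rebalance_in_progress"),
]


def classify_line(line):
    """Return the first matching category for a log line, or None."""
    for pattern, category in SIGNATURES:
        if pattern.search(line):
            return category
    return None


def summarize(lines):
    """Count how many lines fall into each category, ignoring unmatched lines."""
    counts = Counter()
    for line in lines:
        category = classify_line(line)
        if category is not None:
            counts[category] += 1
    return counts
```

A summary dominated by `poll_interval_exceeded` points at the poll loop, one dominated by `session_timeout` points at missed heartbeats, and a high `member_left` count with few liveness signatures suggests deploy or scaling churn.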

Output: (a) rebalance trigger classification, (b) the specific timeout/config at fault, (c) protocol-level fixes (static membership, cooperative rebalancing), (d) tuned config values, (e) deploy/scaling recommendations.

Advisory only; roll out config changes to a canary consumer instance before applying group-wide.
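
To sanity-check the values the model proposes for step 4, the arithmetic can be reproduced by hand. The sketch below assumes you have measured a worst-case per-record processing time. The function names and the `safety_factor` headroom multiplier are hypothetical choices for illustration, not Kafka settings. The heartbeat check reflects the common guidance that heartbeat.interval.ms should be no more than one third of session.timeout.ms.

```python
import math


def interval_for_records(max_poll_records, per_record_ms, safety_factor=2.0):
    """max.poll.interval.ms needed to process a full batch, with headroom.

    safety_factor is an assumed headroom multiplier, not a Kafka setting.
    """
    if per_record_ms <= 0 or safety_factor < 1:
        raise ValueError("per_record_ms must be > 0 and safety_factor >= 1")
    return math.ceil(max_poll_records * per_record_ms * safety_factor)


def max_records_for_interval(max_poll_interval_ms, per_record_ms, safety_factor=2.0):
    """Largest max.poll.records whose worst-case batch fits the interval."""
    if per_record_ms <= 0 or safety_factor < 1:
        raise ValueError("per_record_ms must be > 0 and safety_factor >= 1")
    return max(1, math.floor(max_poll_interval_ms / (per_record_ms * safety_factor)))


def heartbeat_ok(session_timeout_ms, heartbeat_interval_ms):
    """True if the heartbeat interval is at most one third of the session timeout."""
    return heartbeat_interval_ms * 3 <= session_timeout_ms


# Example: 1.2 s per record in the worst case, 500 records per poll.
print(interval_for_records(500, 1200))         # 1200000 ms needed (20 minutes)
print(max_records_for_interval(300000, 1200))  # 125 records fit a 300000 ms interval
print(heartbeat_ok(45000, 3000))               # True
```

In this example a batch of 500 records can take 600 seconds, so a 300000 ms poll interval would be exceeded. Lowering max.poll.records to about 125 or raising max.poll.interval.ms are the two levers, and which one fits depends on how much lag the group can tolerate.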

Related prompts

Kafka Consumer Lag Investigation Prompt

Kafka Producer Throughput & Latency Tuning Prompt
