DevOps AI ToolKit
AI for Kafka

Kafka Consumer Rebalance Storm Triage Prompt

Diagnose frequent or looping consumer-group rebalances by working through session, heartbeat, and poll timeouts, static membership, and the rebalance protocol in use.

Target user: SRE and backend engineers
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Kafka engineer triaging a consumer group that rebalances repeatedly. Produce a diagnosis and a fix plan that I can review before changing any configuration.

I will provide:
- The symptom: rebalance frequency, log excerpts (e.g. "Attempt to heartbeat failed", "leaving group", "member ... has failed"), and which group/topic is affected
- Consumer configuration: session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms, max.poll.records, group.instance.id (if any), and partition.assignment.strategy
- Client library and version, number of instances, and whether instances are being restarted/scaled (autoscaling, deploys, OOM kills)
- What each poll loop does between polls (processing time, blocking calls, external I/O)

Your job:

1. **Classify the rebalance trigger** — distinguish membership changes (instances joining/leaving, restarts, crashes) from liveness failures (missed heartbeats vs. exceeding max.poll.interval.ms), using the log signatures to decide which.
2. **Find the timeout that is firing** — decide whether processing between polls exceeds max.poll.interval.ms (poll loop too slow) or heartbeats stop arriving within session.timeout.ms (process stalled or network trouble), and identify the misconfigured setting.
3. **Recommend protocol-level fixes** — advise on static group membership (group.instance.id) to survive restarts, and cooperative/incremental rebalancing to avoid stop-the-world reassignment, noting client-version requirements.
4. **Tune timeouts and batch size** — propose concrete values for max.poll.interval.ms, session.timeout.ms, heartbeat.interval.ms, and max.poll.records that match measured processing time, and explain the reasoning behind each value.
5. **Address the deploy pattern** — if rolling deploys or autoscaling cause churn, recommend graceful shutdown and rollout pacing.
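
As a rough illustration of steps 1 and 2, the log signatures can be bucketed before any deeper analysis. The following is a minimal Python sketch, not a definitive classifier: the bucket names and sample lines are made up for illustration, the patterns are assumptions based on the excerpts quoted above, and exact log wording varies by client library and version.

```python
import re
from collections import Counter

# Ordered (pattern, bucket) pairs; the first match wins.
# Patterns are assumptions based on the excerpts quoted in the prompt;
# bucket names are hypothetical labels, not Kafka terminology.
SIGNATURES = [
    (re.compile(r"max\.poll\.interval\.ms", re.I), "poll_timeout"),
    (re.compile(r"member .* has failed", re.I), "session_timeout"),
    (re.compile(r"leaving group", re.I), "membership_change"),
    (re.compile(r"attempt to heartbeat failed", re.I), "rebalance_in_progress"),
]


def classify_line(line: str) -> str:
    """Return the bucket of the first matching signature, or 'unclassified'."""
    for pattern, bucket in SIGNATURES:
        if pattern.search(line):
            return bucket
    return "unclassified"


def summarize(lines: list[str]) -> Counter:
    """Count how many log lines fall into each bucket."""
    return Counter(classify_line(line) for line in lines)


# Made-up sample lines, for illustration only (not real broker or client output).
SAMPLE = [
    "poll() was not called within max.poll.interval.ms",
    "Member worker-3 in group orders has failed",
    "Member worker-5 leaving group orders",
    "Attempt to heartbeat failed since group is rebalancing",
    "Committed offsets for partition orders-0",
]

if __name__ == "__main__":
    for bucket, count in summarize(SAMPLE).most_common():
        print(f"{bucket}: {count}")
```

The poll-timeout pattern is checked first because a client that exceeds the poll interval may also log that it is leaving the group. A heartbeat failure line on its own often shows only that a rebalance is already under way, so it is kept in a separate bucket rather than treated as a root cause.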

Output: (a) rebalance trigger classification, (b) the specific timeout/config at fault, (c) protocol-level fixes (static membership, cooperative rebalancing), (d) tuned config values, (e) deploy/scaling recommendations.
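
Outputs (c) and (d) could be derived along the following lines. This is a sketch under stated assumptions rather than a recommended formula: `propose_config`, the 50% safety fraction, and the sample numbers are hypothetical, and the property names follow the Java client settings listed in the prompt. The heartbeat value follows the common guidance of keeping heartbeat.interval.ms at or below one third of session.timeout.ms.

```python
def propose_config(
    p99_record_ms: int,
    max_poll_interval_ms: int,
    session_timeout_ms: int,
    instance_id: str,
    safety_fraction: float = 0.5,  # assumption: spend at most half the interval
) -> dict[str, object]:
    """Build a candidate consumer config from measured processing time.

    p99_record_ms is the measured 99th-percentile processing time per record.
    instance_id must be unique per instance and stable across restarts.
    """
    if p99_record_ms <= 0:
        raise ValueError("p99_record_ms must be positive")

    # Time budget for processing one polled batch, leaving headroom
    # below max.poll.interval.ms.
    budget_ms = int(max_poll_interval_ms * safety_fraction)

    # Largest batch whose worst-case processing time fits the budget.
    max_poll_records = max(1, budget_ms // p99_record_ms)

    return {
        "max.poll.interval.ms": max_poll_interval_ms,
        "max.poll.records": max_poll_records,
        "session.timeout.ms": session_timeout_ms,
        # Several heartbeats fit inside one session timeout.
        "heartbeat.interval.ms": session_timeout_ms // 3,
        # Static membership: a restarted instance rejoins under the same id.
        "group.instance.id": instance_id,
        # Cooperative (incremental) rebalancing, Java client class name.
        "partition.assignment.strategy":
            "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
    }


if __name__ == "__main__":
    # Hypothetical measurements: 2 s per record at p99, 5 min poll interval.
    for key, value in propose_config(2000, 300_000, 45_000, "orders-consumer-0").items():
        print(f"{key}={value}")
```

With these sample inputs the batch size comes out at 75 records, since 75 records at 2 s each use 150 s of the 300 s interval. Note that moving a live group from an eager assignor to the cooperative one generally needs a staged rolling upgrade, so the assignment strategy line should not be applied to a single instance in isolation.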

Advisory only; roll out config changes to a canary consumer instance before applying group-wide.
