Kafka Consumer Lag Investigation Prompt

Investigate growing consumer lag: isolate the root cause (slow processing, partition skew, GC pauses, or broker-side bottlenecks), then prescribe targeted fixes to reduce it.

Target user: SRE and backend engineers
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior Kafka engineer investigating consumer lag. Produce a root-cause analysis and remediation plan that I can review before any changes are made.

I will provide:
- The lag picture: total and per-partition lag over time, whether it is growing or stable, and which group/topic is affected
- Consumer details: instance count, partitions per instance, max.poll.records, and what processing each record involves (CPU, external I/O, DB writes)
- Resource signals: consumer CPU/memory, GC pause times, and any throttling
- Producer side: whether produce rate recently increased or is spiky
- Broker signals: under-replicated partitions, request latency, disk utilization

Your job:

1. **Establish lag shape** — determine whether lag is steadily growing (consumers permanently slower than producers), spiky (bursts the consumers eventually drain), or concentrated on specific partitions, since each points to a different cause.
2. **Check for partition skew** — if lag is concentrated, look for a hot key or uneven partition assignment overloading one consumer while others idle, and recommend rekeying or rebalancing.
3. **Profile processing** — estimate required vs. actual per-record processing throughput, and identify whether slow downstream I/O, lock contention, or synchronous calls are the bottleneck.
4. **Rule out GC and resources** — correlate lag spikes with GC pauses or CPU saturation, and recommend heap/GC tuning or vertical scaling if the consumer itself is starved.
5. **Rule out the broker** — check whether under-replicated partitions or broker latency are throttling consumption rather than the consumer being slow.
6. **Prescribe the fix** — choose among scaling out consumers (up to partition count), increasing parallelism within the consumer, fixing skew, or unblocking downstream, with the order to try them.

Output: (a) lag-shape classification, (b) skew check, (c) processing-throughput analysis, (d) GC/resource and broker rule-outs, (e) prioritized remediation plan.

Advisory only; apply scaling or config changes to a canary first and confirm lag drains before fleet-wide rollout.
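
Before pasting data into the prompt, it can help to pre-classify the lag yourself so you can sanity-check the model's answer to steps 1 and 2. The following is a minimal sketch, not part of the prompt: the function names and the thresholds (`skew_ratio`, `growth_frac`, the 2x spike test) are illustrative assumptions, and it expects lag samples taken at a fixed interval.

```python
from statistics import mean


def lag_slope(samples):
    """Least-squares slope of lag, in messages per sample interval."""
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den if den else 0.0


def classify_lag_shape(total_lag, per_partition_lag,
                       skew_ratio=3.0, growth_frac=0.05):
    """Return 'concentrated', 'growing', 'spiky', or 'stable'.

    total_lag: total group lag sampled at a fixed interval.
    per_partition_lag: {partition: current lag}.
    All thresholds are assumptions; tune them for your workload.
    """
    lags = list(per_partition_lag.values())
    avg = mean(lags)
    # Skew check: one partition far above the average suggests a hot key
    # or uneven assignment rather than a fleet-wide slowdown.
    if avg > 0 and max(lags) / avg >= skew_ratio:
        return "concentrated"
    baseline = mean(total_lag) or 1
    # Sustained upward trend: consumers are slower than producers.
    if lag_slope(total_lag) / baseline >= growth_frac:
        return "growing"
    # No trend, but a peak well above the average: bursts that drain.
    if max(total_lag) >= 2 * baseline:
        return "spiky"
    return "stable"
```

For example, `classify_lag_shape([100, 200, 300, 400, 500], {0: 100, 1: 110, 2: 90})` returns `"growing"`, while the same series with one partition holding most of the lag returns `"concentrated"`.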

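The throughput comparison in step 3 and the scale-out cap in step 6 come down to simple arithmetic. The sketch below assumes a steady produce rate and roughly uniform per-consumer throughput, which real workloads rarely satisfy exactly; `remediation_numbers` and its parameter names are hypothetical. It relies on the fact that, within a consumer group, each partition is read by at most one member, so consumers beyond the partition count sit idle.

```python
import math


def remediation_numbers(produce_rate, per_consumer_rate,
                        consumers, partitions, backlog):
    """Rough capacity figures; rates are in records per second.

    Assumes a steady produce rate and uniform per-consumer throughput.
    """
    # Consumers beyond the partition count receive no assignment.
    effective = min(consumers, partitions)
    consume_rate = per_consumer_rate * effective
    # Minimum consumers to keep pace; draining a backlog needs headroom on top.
    needed = math.ceil(produce_rate / per_consumer_rate)
    surplus = consume_rate - produce_rate
    return {
        "consume_rate": consume_rate,
        "deficit": max(0, produce_rate - consume_rate),
        "consumers_needed": needed,
        # False means scaling out alone cannot help: add partitions,
        # speed up processing, or parallelize within each consumer.
        "scale_out_sufficient": needed <= partitions,
        "drain_seconds": backlog / surplus if surplus > 0 else None,
    }
```

With 5,000 records/s produced, 600 records/s per consumer, 6 consumers, and 12 partitions, this reports a deficit of 1,400 records/s and 9 consumers needed, so scaling out is viable. If `scale_out_sufficient` comes back `False`, that points toward the other fixes listed in step 6.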