Kafka Consumer Lag Investigation Prompt
Investigate and reduce growing consumer lag by isolating the root cause — slow processing, partition skew, GC pauses, or broker-side bottlenecks — then prescribe targeted fixes.
- Target user: SRE and backend engineers
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Kafka engineer investigating consumer lag, producing a root-cause analysis and remediation plan to review before changes are made.

I will provide:
- The lag picture: total and per-partition lag over time, whether it is growing or stable, and which group/topic is affected
- Consumer details: instance count, partitions per instance, max.poll.records, and what processing each record involves (CPU, external I/O, DB writes)
- Resource signals: consumer CPU/memory, GC pause times, and any throttling
- Producer side: whether produce rate recently increased or is spiky
- Broker signals: under-replicated partitions, request latency, disk utilization

Your job:
1. **Establish lag shape**: determine whether lag is steadily growing (consumers permanently slower than producers), spiky (bursts the consumers eventually drain), or concentrated on specific partitions, since each points to a different cause.
2. **Check for partition skew**: if lag is concentrated, look for a hot key or uneven partition assignment overloading one consumer while others idle, and recommend rekeying or rebalancing.
3. **Profile processing**: estimate required vs. actual per-record processing throughput, and identify whether slow downstream I/O, lock contention, or synchronous calls are the bottleneck.
4. **Rule out GC and resources**: correlate lag spikes with GC pauses or CPU saturation, and recommend heap/GC tuning or vertical scaling if the consumer itself is starved.
5. **Rule out the broker**: check whether under-replicated partitions or broker latency are throttling consumption rather than the consumer being slow.
6. **Prescribe the fix**: choose among scaling out consumers (up to partition count), increasing parallelism within the consumer, fixing skew, or unblocking downstream, with the order to try them.

Output: (a) lag-shape classification, (b) skew check, (c) processing-throughput analysis, (d) GC/resource and broker rule-outs, (e) prioritized remediation plan.
Advisory only; apply scaling or config changes to a canary first and confirm lag drains before fleet-wide rollout.
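Before pasting numbers into the prompt, it can help to sanity-check them yourself. The following is a minimal Python sketch of the lag-shape classification (step 1), the consumer sizing implied by step 6, and the drain check from the advisory note. It is not part of the prompt and does not call any Kafka API: the function names, the input shapes (a list of equally spaced total-lag samples, a dict of latest per-partition lag), and every threshold are illustrative assumptions that you would tune to your own workload.

```python
import math
from statistics import mean, median


def lag_slope(total_lag):
    """Least-squares slope of total lag per sample interval."""
    n = len(total_lag)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(total_lag)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(total_lag))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def classify_lag_shape(total_lag, per_partition_lag,
                       skew_threshold=0.5, growth_threshold=0.05, peak_ratio=3.0):
    """Return 'concentrated', 'growing', 'spiky', or 'stable'.

    total_lag: non-empty list of total-lag samples, oldest first.
    per_partition_lag: dict of partition id -> latest lag.
    All thresholds are assumed values for illustration, not Kafka defaults.
    """
    total = sum(per_partition_lag.values())
    if total > 0 and len(per_partition_lag) > 1:
        # One partition holding most of the lag suggests a hot key or bad assignment.
        if max(per_partition_lag.values()) / total >= skew_threshold:
            return "concentrated"
    avg = mean(total_lag)
    if avg == 0:
        return "stable"
    # Slope relative to mean lag: a sustained upward trend means consumers cannot keep up.
    if lag_slope(total_lag) / avg >= growth_threshold:
        return "growing"
    # A peak far above the median, with no trend, means bursts that drain again.
    if max(total_lag) > peak_ratio * median(total_lag):
        return "spiky"
    return "stable"


def consumers_needed(produce_rate, per_consumer_rate, partition_count, headroom=1.3):
    """Return (consumer count, capped flag); count is capped at the partition count.

    Rates are records per second; headroom is an assumed safety margin.
    capped is True when scaling out alone cannot cover the produce rate.
    """
    needed = math.ceil(produce_rate * headroom / per_consumer_rate)
    return min(needed, partition_count), needed > partition_count


def drain_seconds(current_lag, consume_rate, produce_rate):
    """Seconds until lag reaches zero, or None if consumers never catch up."""
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return None
    return current_lag / surplus
```

When `consumers_needed` reports the capped flag, adding instances beyond the partition count would leave them idle, so the remaining options are in-consumer parallelism, faster downstream calls, or more partitions. `drain_seconds` gives a rough expectation to compare against what the canary actually does.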
Related prompts
- Kafka Consumer Rebalance Storm Triage Prompt
Diagnose frequent or looping consumer-group rebalances by working through session, heartbeat, and poll timeouts, static membership, and the rebalance protocol in use.
- Kafka Topic Design & Partitioning Strategy Prompt
Design a Kafka topic from first principles — partition count, keying, replication factor, min.insync.replicas, and retention vs. compaction — to match ordering, scale, and durability needs.