AI for Kafka · By James Joyner IV · 11 min read

Debugging Kafka Consumer Lag with AI

Measure Kafka consumer lag correctly, find the real root cause with AI-assisted analysis, and apply durable fixes — from poison messages to under-provisioned groups.

  • #kafka
  • #consumer-lag
  • #ai
  • #troubleshooting
  • #performance
  • #sre

Consumer lag is the metric every Kafka operator watches and the one most often misread. Lag is the difference between the latest offset written to a partition (the log-end offset) and the offset a consumer group has committed. A small, stable lag is healthy. A growing lag means your consumers are falling behind the producers, and left unchecked it ends in breached SLAs, records aging out of retention before they are ever read, or a cascade of timeouts. The hard part is never seeing the lag: kafka-consumer-groups.sh shows it in one command. The hard part is explaining why it is growing and which fix actually addresses the cause. This guide walks through measuring lag correctly, using AI to narrow the root cause, and applying durable fixes.

Measuring lag correctly before you diagnose anything

The single most important habit is to measure lag as a rate, not a snapshot. A group sitting at 50,000 messages of lag that is shrinking is fine. The same number growing by 5,000 per second is an incident. Start with the canonical command:

kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group orders-enricher

The output gives one row per partition:

GROUP           TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID         HOST
orders-enricher orders  0          1048120         1052310         4190   consumer-1-a1b2c3   /10.0.1.4
orders-enricher orders  1          1047995         1098770         50775  consumer-1-a1b2c3   /10.0.1.4
orders-enricher orders  2          1051200         1051640         440    consumer-2-d4e5f6   /10.0.1.5

Three things to read here immediately:

  • Lag is uneven across partitions. Partition 1 has 50,775 lag while partition 2 has 440. That points to a partition-specific problem such as a hot partition, skewed keys, or a record the consumer is stuck on, not a uniformly slow group.
  • One consumer owns multiple partitions. consumer-1 owns partitions 0 and 1. If the group has fewer consumers than partitions, some consumers are doing double duty.
  • A CONSUMER-ID shown as - means no member is assigned to that partition. The group may be rebalancing, or it has fewer members than it should.

To get the rate, capture the same output twice with a gap and compute the delta:

kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group orders-enricher > /tmp/lag-t0.txt
sleep 30
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group orders-enricher > /tmp/lag-t1.txt
diff /tmp/lag-t0.txt /tmp/lag-t1.txt

Pro Tip: Feed both snapshots to an AI assistant and ask it to compute per-partition lag velocity and flag any partition whose consumed offset did not advance at all between the two captures. A stalled offset on a single partition is the classic fingerprint of a poison message.
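If you would rather compute the velocity yourself before handing anything to a model, the comparison is a few lines of Python. This is a minimal sketch that assumes the column layout shown in the sample output above (GROUP, TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET first); other Kafka versions may order the columns differently. The names parse_describe and lag_velocity are illustrative, not part of any Kafka tooling.

```python
from dataclasses import dataclass


@dataclass
class PartitionLag:
    topic: str
    partition: int
    current_offset: int
    log_end_offset: int


def parse_describe(text: str) -> dict[tuple[str, int], PartitionLag]:
    """Parse data rows of `kafka-consumer-groups.sh --describe` output.

    Assumes columns start with GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET.
    """
    rows = {}
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < 6 or cols[0] == "GROUP":
            continue  # header or blank line
        if not (cols[2].isdigit() and cols[3].isdigit() and cols[4].isdigit()):
            continue  # e.g. "-" where no offset has been committed yet
        row = PartitionLag(cols[1], int(cols[2]), int(cols[3]), int(cols[4]))
        rows[(row.topic, row.partition)] = row
    return rows


def lag_velocity(t0: str, t1: str, seconds: float) -> dict:
    """Per-partition lag change per second, plus a stalled-offset flag."""
    before, after = parse_describe(t0), parse_describe(t1)
    report = {}
    for key in sorted(before.keys() & after.keys()):
        b, a = before[key], after[key]
        lag_before = b.log_end_offset - b.current_offset
        lag_after = a.log_end_offset - a.current_offset
        report[key] = {
            "lag_per_sec": (lag_after - lag_before) / seconds,
            # Stalled: the committed offset did not move and there is work waiting.
            "stalled": a.current_offset == b.current_offset
            and a.log_end_offset > a.current_offset,
        }
    return report
```

Call it with the contents of /tmp/lag-t0.txt and /tmp/lag-t1.txt and the gap in seconds (30 in the commands above). A partition with a positive lag_per_sec and stalled set to True is the one to look at first.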

The root causes AI helps you separate

The value of AI in lag debugging is disambiguation. The same growing-lag symptom has at least five distinct causes, and they call for opposite fixes. Give the model the lag snapshots, the partition count, the consumer count, and recent broker metrics, and it can rank which cause fits the evidence.

| Root cause | Telltale signal | Wrong fix that wastes time |
| --- | --- | --- |
| Slow consumer processing | Lag grows uniformly across all partitions | Adding partitions (no extra consumers to use them) |
| Hot/skewed partition | Lag grows on one or two partitions only | Scaling the whole group up |
| Under-provisioned group | Fewer consumers than partitions, all busy | Tuning poll settings |
| Poison message | One partition’s consumed offset frozen | Restarting consumers (they stall again) |
| Producer spike | Log-end offset accelerating, consumers fine | Touching consumer config at all |

Slow consumer processing

If every partition’s lag grows at a similar rate and consumers are healthy, the processing logic itself is too slow — an expensive database call per message, a synchronous external API, or unbatched writes. The fix is making the work faster or more parallel, not adding partitions you have no consumers to read from.

Hot or skewed partitions

When lag concentrates on one or two partitions, the producer is keying records unevenly so most traffic lands on a few partitions. No amount of consumer scaling helps, because a single partition is consumed by a single member within a group. The fix lives on the producer side: a better partition key or a higher partition count with rebalanced keys.

Under-provisioned consumer groups

Kafka assigns each partition to exactly one consumer in a group. If you have 12 partitions and 4 consumers, each consumer handles 3 partitions. If 4 consumers cannot keep up, you can scale to at most 12 consumers — beyond that, extra consumers sit idle. Check the math first:

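# Partition count for the topic
kafka-topics.sh --bootstrap-server kafka:9092 --describe --topic orders | grep PartitionCount
# Active members in the group, to compare against the partition count
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group orders-enricher --members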
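The ceiling arithmetic can be written down as a small helper. This is a sketch for a group reading a single topic, assuming the assignor spreads partitions as evenly as it can; group_capacity is a hypothetical name used only for illustration.

```python
import math


def group_capacity(partitions: int, consumers: int) -> dict[str, int]:
    """Summarize how a consumer group maps onto a topic's partitions."""
    active = min(partitions, consumers)
    return {
        # Consumers that can be assigned at least one partition.
        "active_consumers": active,
        # Consumers beyond the partition count receive nothing to read.
        "idle_consumers": max(0, consumers - partitions),
        # Heaviest load any single consumer carries under an even spread.
        "max_partitions_per_consumer": math.ceil(partitions / active) if active else 0,
        # How many more consumers could still be put to work.
        "scale_out_headroom": max(0, partitions - consumers),
    }


print(group_capacity(12, 4))
print(group_capacity(12, 16))
```

With 12 partitions and 4 consumers, each consumer carries 3 partitions and there is room for 8 more consumers. With 16 consumers on the same topic, 4 of them sit idle.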

Poison messages

A poison message is a record your consumer cannot process — it throws on deserialization or hits a code path that loops or crashes. The partition’s consumed offset freezes while the log-end offset keeps climbing, so that one partition’s lag rises steeply and the others stay flat. Restarting the consumer just stalls on the same offset again. The durable fix is error handling that routes failed records to a dead letter topic and advances the offset.

Producer spikes

Sometimes consumers are fine and producers simply got faster — a batch job, a traffic surge, a backfill. Here the log-end offset is accelerating while the consumed offset advances at its normal, healthy rate. Touching consumer config does nothing useful; you either let the group catch up or temporarily scale it out.
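The five signals above are regular enough to encode as a rough first-pass check, which is also a useful way to structure what you send to a model. The sketch below is a heuristic, not a diagnosis: the rule order, the "fewer than half of partitions" test for concentration, and the spike_factor threshold are all assumptions chosen for illustration, and rank_cause is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class PartitionDelta:
    partition: int
    consumed: int  # records the group committed during the window
    produced: int  # records appended to the partition during the window


def rank_cause(deltas, consumers, baseline_produced, spike_factor=2.0):
    """Map one observation window to the most likely cause.

    baseline_produced is the normal number of records written to the whole
    topic in a window of the same length (a value you supply).
    """
    growing = [d for d in deltas if d.produced > d.consumed]
    if not growing:
        return "healthy"
    frozen = [d for d in growing if d.consumed == 0]
    if frozen and len(frozen) == len(deltas):
        return "group not consuming"  # nothing advanced anywhere
    if frozen:
        return "poison message"  # some offsets frozen, others moving
    if len(growing) * 2 <= len(deltas):
        return "hot partition"  # growth concentrated on a minority
    if sum(d.produced for d in deltas) > spike_factor * baseline_produced:
        return "producer spike"  # input well above normal
    if consumers < len(deltas):
        return "under-provisioned group"
    return "slow processing"
```

The ordering matters: frozen offsets are checked before concentration because a poison message also produces lag on a single partition, and only the stalled consumed offset tells the two apart.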

Applying durable fixes

Once AI has narrowed the cause, the fixes fall into a few categories. Apply the one that matches the diagnosis — not the one that is easiest to reach for.

Scale the consumer group (when partitions allow)

If you are under-provisioned and have spare partitions, add consumers. With Kubernetes, scale the deployment; the group rebalances automatically. Remember the ceiling: you cannot have more active consumers in a group than partitions.

Tune consumer fetch and poll settings

For throughput-bound consumers, larger fetches reduce per-batch overhead:

# consumer.properties — pull more per round trip
max.poll.records=1000
fetch.min.bytes=1048576
fetch.max.wait.ms=500
max.partition.fetch.bytes=10485760
# Give slow processing more headroom before triggering a rebalance
max.poll.interval.ms=600000

Raising max.poll.records lets each poll() return more work, while fetch.min.bytes lets the broker accumulate data before responding, cutting request overhead. Be careful: if you raise max.poll.records, make sure you can process the batch within max.poll.interval.ms, or you will trigger the very rebalances you are trying to avoid.
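The warning about max.poll.interval.ms comes down to simple arithmetic: a full batch must be processed before the interval expires. A minimal sketch, where per_record_ms is a processing time you measure yourself and the 0.5 safety factor is an arbitrary illustrative margin rather than a Kafka recommendation:

```python
def batch_fits(max_poll_records: int, per_record_ms: float, max_poll_interval_ms: int) -> bool:
    """True when a full batch finishes before the poll interval expires."""
    return max_poll_records * per_record_ms < max_poll_interval_ms


def max_safe_poll_records(
    max_poll_interval_ms: int, per_record_ms: float, safety_factor: float = 0.5
) -> int:
    """Largest batch size that uses only `safety_factor` of the poll interval."""
    return int(max_poll_interval_ms * safety_factor / per_record_ms)


# Settings from the properties above: 1000 records, 600000 ms interval.
print(batch_fits(1000, 50, 600_000))    # 50 s of work per batch
print(batch_fits(1000, 700, 600_000))   # 700 s of work per batch
print(max_safe_poll_records(600_000, 700))
```

At 50 ms per record, a 1000-record batch takes 50 seconds and fits comfortably. At 700 ms per record the same batch needs 700 seconds, which overruns the 600-second interval, so the batch size would have to come down to a few hundred records.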

Add partitions to raise the parallelism ceiling

If the group is maxed at one consumer per partition and still behind, increase the partition count so you can add more consumers:

kafka-topics.sh --bootstrap-server kafka:9092 \
  --alter --topic orders --partitions 24

Two warnings. Adding partitions changes key-to-partition mapping for keyed topics, which breaks per-key ordering for keys that move. And you can only increase partition count, never decrease it, so size with headroom rather than chasing the number repeatedly.
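The ordering warning is easier to appreciate with numbers. The sketch below counts how many keys change partition when a topic grows from 12 to 24 partitions. It uses CRC32 as a stand-in hash purely for illustration; the Java client's default partitioner hashes keys with murmur2, so the exact keys that move will differ, but the hash-modulo-partition-count shape is the same. The key names are made up.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (assumption): real clients use murmur2, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count: int, new_count: int) -> list[str]:
    """Keys whose target partition changes when the topic grows."""
    return [
        k for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    ]


keys = [f"customer-{i}" for i in range(10_000)]  # hypothetical keys
moved = moved_keys(keys, 12, 24)
print(f"{len(moved)} of {len(keys)} keys land on a different partition")
```

When the partition count doubles, a key either stays where it was or moves to its old partition number plus 12, and with a uniform hash about half the keys move. Records written for a moved key before the change stay on the old partition, so a consumer can see that key's old and new records out of order until the old partition is drained.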

Handle poison messages with a dead letter pattern

Wrap record processing so a record that fails repeatedly is published to a dead letter topic and its offset committed, letting the partition advance. This is application-level work, but it is the only fix that actually unsticks a frozen partition.
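The shape of that wrapper is roughly as follows. This is a client-agnostic sketch: handle, send_to_dlt, and commit are hypothetical callables you would wire to your own processing code and Kafka client, and the retry count of 3 is an arbitrary default.

```python
from typing import Callable, Iterable, NamedTuple


class Record(NamedTuple):
    partition: int
    offset: int
    value: bytes


def consume_with_dead_letter(
    records: Iterable[Record],
    handle: Callable[[Record], None],
    send_to_dlt: Callable[[Record, Exception], None],
    commit: Callable[[int, int], None],
    max_attempts: int = 3,
) -> int:
    """Process records, dead-lettering any that fail max_attempts times.

    Returns the number of records routed to the dead letter topic.
    """
    dead_lettered = 0
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                handle(record)
                break  # processed successfully, stop retrying
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this record, but keep it for later inspection.
                    send_to_dlt(record, exc)
                    dead_lettered += 1
        # A committed offset is the next offset to read, so commit offset + 1
        # whether the record succeeded or was dead-lettered.
        commit(record.partition, record.offset + 1)
    return dead_lettered
```

The important property is that the commit happens either way, so the partition keeps advancing. In a real consumer you would confirm the dead letter publish succeeded before committing, otherwise a failed publish silently drops the record.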

Pro Tip: Never reach for --reset-offsets --to-latest to “fix” lag in production. It makes the dashboard green by skipping unprocessed messages — that is data loss, not a fix. Reserve offset resets for genuine reprocessing scenarios and always run with --dry-run first.

Watching lag continuously

Manual --describe runs are fine for an incident, but you want lag on a dashboard before the incident. Export consumer group lag into Prometheus with a lag exporter and alert on rate of change, not absolute value. A good alert fires when lag has grown continuously for several minutes, which filters out the harmless spikes that self-correct. From there, an AI summarization layer can turn the alert plus the recent metric window into a first-pass diagnosis automatically, so the on-call engineer opens the page already knowing which of the five causes is most likely.
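The "grown continuously for several minutes" condition can be expressed as a check over evenly spaced lag samples. This is a minimal sketch of the idea in Python, not an alerting rule for any particular monitoring system; sustained_growth and the sample values are illustrative.

```python
def sustained_growth(samples: list[int], min_consecutive: int) -> bool:
    """True when lag rose on each of the last `min_consecutive` intervals.

    `samples` are total-lag readings taken at a fixed interval, oldest first.
    """
    if len(samples) < min_consecutive + 1:
        return False  # not enough history to decide
    recent = samples[-(min_consecutive + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# One reading per minute, alert after 4 consecutive increases.
print(sustained_growth([100, 120, 150, 200, 260], 4))  # steady climb
print(sustained_growth([100, 900, 300, 280, 260], 4))  # spike that self-corrects
```

The first series climbs on every interval and would fire. The second contains a much larger absolute value but is already draining, so it stays quiet, which is the behavior an absolute-threshold alert cannot give you.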

Key takeaways

| Point | Details |
| --- | --- |
| Measure rate, not snapshots | Capture lag twice and compute velocity; a shrinking 50k lag is healthy. |
| Read the per-partition shape | Uniform growth, single-partition growth, and frozen offsets each mean a different cause. |
| Match the fix to the cause | Scaling, poll tuning, partition increases, and dead-letter handling fix different problems. |
| Mind the parallelism ceiling | A group cannot have more active consumers than the topic has partitions. |
| Never reset offsets to hide lag | --reset-offsets --to-latest is silent data loss, not a fix. |

What actually moves the needle on lag

After enough lag incidents, the pattern I trust most is the per-partition shape of the lag. The instinct under pressure is to scale the consumer group, and it is wrong about half the time. If lag is piled onto one partition, scaling does nothing because a single partition is read by a single consumer in the group. The fix is on the producer’s keying, not the consumer’s replica count. AI has been genuinely useful here precisely because it forces the question “is this uniform or concentrated?” before anyone touches a deployment.

The other lesson I keep relearning is that offset resets are a trap. They make the chart green and the problem invisible, which is the worst possible outcome for anything carrying real data. I would rather page someone and let the group grind through the backlog than quietly skip records.

My read: AI is an excellent lag triage partner because the diagnosis is mostly disambiguation, which is exactly what a model given good structured input does well. Keep the destructive operations human-gated and you have a fast, safe loop.

— James

Build your AI Kafka workflow with DevOps AI ToolKit

DevOps AI ToolKit publishes prompts and automation guides for engineers running production streaming systems. Browse the full AI prompt library for prompts that help you triage consumer lag, summarize metric windows, and turn alerts into first-pass diagnoses.

FAQ

What is Kafka consumer lag?

It is the difference between a partition’s log-end offset (latest produced message) and the offset a consumer group has committed. Growing lag means consumers are falling behind producers.

Why does adding consumers not always reduce lag?

A partition is consumed by exactly one member within a group. If lag is concentrated on a single hot partition, or if you already have one consumer per partition, adding more consumers leaves them idle and changes nothing.

How do I spot a poison message causing lag?

Look for a single partition whose consumed offset stops advancing while its log-end offset keeps climbing. That frozen offset is the signature of a record the consumer cannot process.

Should I reset offsets to fix lag?

No. Resetting to latest skips unprocessed messages, which is data loss disguised as a fix. Use offset resets only for deliberate reprocessing, and always run with --dry-run first.
