DevOps AI ToolKit
GCP with AI · By James Joyner IV · 10 min read

Debugging Pub/Sub With AI: Delivery, Ordering, and Dead Letters

Pub/Sub duplicates, lost messages, and growing backlogs trace back to its delivery semantics. Here's how I use AI to match the symptom to the real cause and fix it.

  • #gcp
  • #ai
  • #pubsub
  • #event-driven
  • #debugging

A service kept processing the same order twice. Not often — maybe one in a few thousand — but enough that finance noticed duplicate charges. The team’s first theory was a publisher bug, then a database race, then a “Pub/Sub is sending duplicates” shrug. The actual cause was that the consumer took about 70 seconds to process a message while the subscription’s ack deadline was 60, so Pub/Sub assumed the message was lost and redelivered it. The system was behaving exactly as designed — at-least-once delivery — and the bug was in expecting otherwise. Almost every Pub/Sub incident I’ve debugged is some version of this: a mismatch between what the platform guarantees and what the consumer assumes. AI is good at closing that gap fast, because the platform’s semantics are precise even when they’re surprising.

Pull the config and the metrics first

The temptation is to start reading consumer code. Don’t — start with what the subscription is actually configured to do and what its metrics show. The config tells you the ack deadline, retry policy, and whether ordering and dead-lettering are even enabled; the metrics tell you whether messages are piling up or being redelivered.

gcloud pubsub subscriptions describe orders-sub --format=yaml

Prompt: “Here’s a Pub/Sub subscription config and the symptom: occasional duplicate processing. The consumer takes about 70 seconds per message. Walk through the delivery semantics and tell me whether the ack deadline explains the duplicates. If so, give me the options — extend the deadline, use lease extension in the client, or make the consumer idempotent — and which you’d reach for first.”

The model lands on the ack-deadline mismatch immediately because the arithmetic is unambiguous: 70 seconds of processing against a 60-second deadline means that, unless the client extends the lease, the deadline expires before the ack arrives and the message is redelivered. What’s useful is that it doesn’t stop at “raise the deadline.” It points out that the durable fix is an idempotent consumer, because at-least-once means duplicates are always possible and fighting them is a losing game.
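Here is a minimal sketch of both halves of that answer: the deadline arithmetic and an idempotent handler. It is plain Python with no Pub/Sub client involved, and every name in it (will_redeliver, IdempotentOrderHandler, the order ID and amount) is hypothetical. It also assumes each message carries a stable business key such as an order ID.

```python
from dataclasses import dataclass, field


def will_redeliver(processing_seconds: float, ack_deadline_seconds: float,
                   lease_extended: bool = False) -> bool:
    """True when the ack would land after the deadline and nothing extends the lease."""
    return processing_seconds > ack_deadline_seconds and not lease_extended


@dataclass
class IdempotentOrderHandler:
    """Applies each order's charge at most once, however often it is delivered."""
    processed_order_ids: set = field(default_factory=set)
    charges: list = field(default_factory=list)

    def handle(self, order_id: str, amount_cents: int) -> bool:
        """Return True if the charge was applied, False if this was a duplicate."""
        if order_id in self.processed_order_ids:
            return False  # redelivery: safe to ack without charging again
        self.charges.append((order_id, amount_cents))
        self.processed_order_ids.add(order_id)
        return True


handler = IdempotentOrderHandler()
first = handler.handle("order-1001", 4999)
second = handler.handle("order-1001", 4999)  # simulated redelivery of the same order
```

The in-memory set is only there to keep the sketch runnable. In a real consumer the "already processed" check would have to live in a durable store and be atomic with the charge itself (a unique constraint on the order ID is one common way), otherwise two concurrent deliveries can both pass the check.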

Classify the symptom against the semantics

Each Pub/Sub symptom maps to a specific guarantee, and naming the mapping is most of the diagnosis. I keep the model anchored to that table:

  • Duplicates — missed-ack redelivery (deadline too short, or no lease extension) or a non-idempotent consumer.
  • Missing messages — expiration (retention or expiration policy too short) or messages routed to a dead-letter topic.
  • Out of order — ordering isn’t actually enabled, or the publisher isn’t setting an ordering key.
  • Growing backlog — consume rate is below publish rate, or a nack loop is reprocessing the same messages.

Prompt: “Our subscriber consumes messages out of order. We assumed Pub/Sub preserves order. Here’s the subscription config and the publisher code. Check whether message ordering is enabled on the subscription AND whether the publisher sets an ordering key, and explain what ordering Pub/Sub actually guarantees versus what we assumed.”

This catches the silent failure where a team relies on ordering they never configured. Ordering only holds within an ordering key on an ordering-enabled subscription, and never across keys — so a consumer assuming global order is broken by design, not by a bug.
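To make the "within a key, never across keys" rule concrete, here is a small sketch of the check I would run against a consumer log. It assumes the publisher stamps each message with a per-key sequence number, which is a hypothetical field and not something Pub/Sub adds for you. The function name and sample keys are illustrative too.

```python
def out_of_order_keys(deliveries):
    """Return the ordering keys whose sequence numbers went backwards.

    deliveries: iterable of (ordering_key, sequence_number) in arrival order.
    Interleaving between different keys is never counted as a violation,
    because ordering is only promised within a single key.
    """
    highest_seen = {}
    violated = set()
    for key, seq in deliveries:
        if key in highest_seen and seq < highest_seen[key]:
            violated.add(key)
        highest_seen[key] = max(seq, highest_seen.get(key, seq))
    return violated


# Keys interleave freely, but each key's own sequence is intact.
interleaved = [("cust-a", 1), ("cust-b", 1), ("cust-a", 2), ("cust-b", 2)]
# cust-a's second message arrived before its first.
broken = [("cust-a", 2), ("cust-b", 1), ("cust-a", 1)]

interleaved_result = out_of_order_keys(interleaved)
broken_result = out_of_order_keys(broken)
```

One caveat on reading the result: a redelivered older message also shows up as a backwards step, so dedupe the log first if the consumer has seen redeliveries.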

Dead letters: stop the poison message

A message that always fails processing gets redelivered again and again until its retention window runs out, unless you give it somewhere to go. Worse, on an ordered subscription, a poison message can block its entire ordering key. The fix is a dead-letter topic with a delivery-attempt limit.

gcloud pubsub subscriptions update orders-sub \
  --dead-letter-topic=orders-dlq \
  --max-delivery-attempts=5

Prompt: “I’m adding a dead-letter topic to this subscription. Beyond the update command, what IAM does the Pub/Sub service agent need to publish to the DLQ and to ack from the source subscription? Give me the exact bindings, and tell me how to monitor the DLQ so messages landing there get noticed.”

That IAM detail trips everyone up: the Pub/Sub service agent (not your app’s service account) needs roles/pubsub.publisher on the dead-letter topic and roles/pubsub.subscriber on the source subscription. Without both, messages are never forwarded to the DLQ and simply keep being redelivered.
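As a sketch of what those bindings look like, this helper builds the two gcloud commands from a project number. The helper name and the project number below are placeholders I made up. The service agent address follows the usual service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com pattern, so check it against your own project before running anything.

```python
def dead_letter_iam_commands(project_number: str, dlq_topic: str,
                             source_subscription: str) -> list[str]:
    """Build the gcloud commands that grant the Pub/Sub service agent DLQ access."""
    member = (
        f"serviceAccount:service-{project_number}"
        "@gcp-sa-pubsub.iam.gserviceaccount.com"
    )
    return [
        # The service agent publishes the failed message to the dead-letter topic.
        f"gcloud pubsub topics add-iam-policy-binding {dlq_topic} "
        f'--member="{member}" --role="roles/pubsub.publisher"',
        # It then acks the message on the source subscription once it is forwarded.
        f"gcloud pubsub subscriptions add-iam-policy-binding {source_subscription} "
        f'--member="{member}" --role="roles/pubsub.subscriber"',
    ]


# 123456789012 is a placeholder project number, not a real project.
commands = dead_letter_iam_commands("123456789012", "orders-dlq", "orders-sub")
for command in commands:
    print(command)
```

Generating the commands rather than running them keeps the review step in human hands, which is where I want any IAM change to sit.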

Draining a backlog without making it worse

When num_undelivered_messages is climbing, the question is whether to scale consumers, raise flow-control limits, or fix a nack loop that’s reprocessing the same messages. I have the model read the metric shape before touching anything, because scaling consumers into a nack loop just burns money faster.

Prompt: “Our backlog is growing and oldest_unacked_message_age keeps rising. Ack and nack counts are both high. Here are the flow-control settings. Help me tell whether this is genuine under-provisioning or a nack loop reprocessing failures, and what to change for each case.”
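The triage behind that prompt can be sketched as a rough heuristic. Everything here is an assumption for illustration: the function name, the idea of comparing rates over one shared window, and especially the 20% nack-ratio threshold, which you would tune against your own traffic.

```python
def classify_backlog(publish_rate: float, ack_rate: float, nack_rate: float,
                     nack_ratio_threshold: float = 0.2) -> str:
    """Rough triage of a growing backlog.

    All rates are messages per second over the same time window.
    The threshold is an illustrative guess, not a Pub/Sub default.
    """
    handled = ack_rate + nack_rate
    nack_ratio = nack_rate / handled if handled else 0.0
    if nack_ratio >= nack_ratio_threshold:
        # A large share of deliveries is being rejected and retried.
        return "nack-loop"
    if ack_rate < publish_rate:
        # Few failures, but consumers simply ack slower than publishers publish.
        return "under-provisioned"
    return "keeping-up"


slow_consumers = classify_backlog(publish_rate=100, ack_rate=60, nack_rate=2)
failing_consumers = classify_backlog(publish_rate=100, ack_rate=90, nack_rate=80)
healthy = classify_backlog(publish_rate=100, ack_rate=120, nack_rate=1)
```

The nack check comes first on purpose: if a big fraction of deliveries is being nacked, adding consumers only retries the same failures faster, so that case has to be ruled out before scaling.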

The honest division of labor

AI is fast at the part that matters most here: mapping a confusing symptom to the precise delivery semantic behind it, and producing the exact config or IAM change to fix it. Those semantics are well-defined, which is why the model is reliable on them. What it can’t do is decide whether a backlog is safe to drop or whether a message in the DLQ represents money that has to be recovered by hand. So I never let it shorten a retention or expiration policy without confirming the backlog is truly disposable — that’s how you lose messages permanently.

These prompts live in reusable form in my prompts library, and the GCP with AI series covers the surrounding pieces an event-driven incident pulls in, like the Cloud Run failure debugging you’ll need when the consumer itself is the problem. Pub/Sub is predictable once you stop fighting its guarantees and start designing for them.
