Debugging Azure Service Bus With AI: Read the Dead-Letter Reason First

The dead-letter queue had thirty thousand messages in it and the team’s first instinct was to bump the max delivery count and replay them. I asked one question before they did: what’s the dead-letter reason? It was MaxDeliveryCountExceeded on a single malformed message type that failed deterministically every time — a poison message. Raising the delivery count would have retried thirty thousand broken messages more times and then dead-lettered them again. The reason told us everything, and almost nobody reads it first. That habit alone makes Service Bus debugging dramatically faster and safer.

Service Bus problems present identically on the surface — messages pile up, work runs twice, throughput stalls — but the causes are unrelated, and the fastest discriminator is the dead-letter reason plus the relationship between lock duration and processing time. AI is good here because it reasons about entity configuration and consumer behavior together, which is exactly the join you need: the queue config alone never explains a duplicate-processing problem, and the consumer code alone never explains an expiry problem.

The dead-letter reason is the diagnosis

Every dead-lettered message carries a reason, and that reason maps cleanly to a class of problem. MaxDeliveryCountExceeded means a message kept failing — poison message or failing consumer. TTLExpiredException means messages aged out before anyone processed them — a throughput or downtime problem. An application-set reason means your code rejected them on purpose. Start there, not with a config change.

# Check how many messages are sitting in the dead-letter queue
az servicebus queue show --name jobs --namespace-name sb-prod \
  --resource-group rg-msg --query "countDetails.deadLetterMessageCount"
# The CLI reports only the count; use the SDK or Service Bus Explorer to read DeadLetterReason on each message
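
Once the DeadLetterReason values have been read off the messages, a quick tally usually makes the dominant class of problem obvious. The sketch below is a minimal illustration that assumes the reasons have already been collected into a list of strings. The function name `summarize_reasons` and the problem-class labels are hypothetical and not part of any SDK, and treating every unrecognized reason as application-set is a simplifying assumption.

```python
from collections import Counter
from typing import Iterable

# The two system reasons discussed in this article.
# Assumption: anything else is treated as application-set, although
# Service Bus defines a few more system reasons that are not covered here.
SYSTEM_REASONS = {
    "MaxDeliveryCountExceeded": "poison message or failing consumer",
    "TTLExpiredException": "throughput or downtime problem",
}


def summarize_reasons(reasons: Iterable[str]) -> list[tuple[str, int, str]]:
    """Return (reason, count, likely problem class), most frequent first."""
    counts = Counter(reasons)
    return [
        (reason, count, SYSTEM_REASONS.get(reason, "rejected by application code"))
        for reason, count in counts.most_common()
    ]
```

If one reason accounts for nearly all of the backlog, that is the class of problem to investigate first.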

Prompt: “My Service Bus queue has 30,000 messages in the dead-letter queue. Most have DeadLetterReason MaxDeliveryCountExceeded. The consumer uses PeekLock with a 30-second lock duration and processing takes about 45 seconds per message. Diagnose the most likely root cause and tell me what NOT to do before I understand it.”

A good answer connects the dots immediately: processing (45s) exceeds the lock (30s), so locks expire mid-processing, messages redeliver, the delivery count climbs, and they eventually dead-letter — even though the consumer logic is fine. That’s a lock-duration problem masquerading as a poison-message problem, and the reason plus the timing numbers reveal it. This evidence-first habit runs through all the Azure reliability work.
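
The timing argument is simple enough to write down. This is a rough sketch, assuming the lock is never renewed and each expired lock leads to an immediate redelivery. The function names are hypothetical, and the max delivery count of 10 used in the usage note is an assumed value, so check the setting on your own queue.

```python
def lock_outlasted(lock_seconds: float, processing_seconds: float) -> bool:
    """True when processing runs past the lock, so the message gets redelivered."""
    return processing_seconds > lock_seconds


def seconds_until_dead_letter(lock_seconds: float, max_delivery_count: int) -> float:
    """Rough lower bound on the time before a message that never completes is dead-lettered.

    Assumes each delivery holds the message for one full lock period and
    that redelivery happens immediately after the lock expires.
    """
    return lock_seconds * max_delivery_count
```

With the numbers from the prompt, `lock_outlasted(30, 45)` is true, and with an assumed max delivery count of 10, `seconds_until_dead_letter(30, 10)` gives roughly 300 seconds before each message lands in the DLQ.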

Lock duration versus processing time

The most common Service Bus incident is processing that outlasts the lock. When the lock expires, Service Bus assumes the consumer died, makes the message visible again, and another consumer picks it up — so you get duplicates, redelivery, and DLQ buildup all at once. You have three fixes: lengthen the lock, renew the lock during processing, or make the work faster.

Prompt: “Processing takes 45 seconds but my lock duration is 30 seconds, causing redelivery and duplicates. Compare three fixes — increasing lock duration, renewing the lock during processing in code, and breaking the work into smaller units — and recommend which fits a long-running, non-splittable job. Show the lock-renewal approach.”

For genuinely long work, renewing the lock during processing is usually the right answer. Raising the lock duration applies to every message on the entity, including ones a crashed consumer should release quickly, and the setting tops out at five minutes, so it cannot cover arbitrarily long jobs anyway. AI lays out the trade-off; you pick based on whether the work can be split. The matching debug prompt is in the prompts library.
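
The renewal idea can be sketched without tying it to a particular SDK. In the example below, `renew` is a hypothetical stand-in for whatever lock-renewal call your client library exposes, and `keep_lock_alive` is an illustrative name rather than an Azure API. The Azure SDKs ship their own renewal helpers, which are the better choice in real code.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def keep_lock_alive(renew: Callable[[], None], interval_seconds: float) -> Iterator[None]:
    """Call renew() every interval_seconds until the with-block exits.

    renew is a placeholder for the SDK's lock-renewal call (assumption).
    If renew() raises, the background thread stops and renewal ends.
    """
    stop = threading.Event()

    def _loop() -> None:
        # Event.wait returns False on timeout (time to renew)
        # and True once stop has been set (processing finished).
        while not stop.wait(interval_seconds):
            renew()

    worker = threading.Thread(target=_loop, daemon=True)
    worker.start()
    try:
        yield
    finally:
        stop.set()
        worker.join()
```

For a 30-second lock, wrapping the 45-second job in `with keep_lock_alive(renew, interval_seconds=20):` would renew once before the first expiry. Pick an interval comfortably shorter than the lock duration so that a slow renewal call still lands in time.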

Don’t raise the delivery count to clear a poison message

When the dead-letter reason is MaxDeliveryCountExceeded and the message fails deterministically, raising the count is pure waste — it just retries a broken message more times before dead-lettering it again. The right move is to pull the poison message out of band, fix or discard it, and leave the count alone for the genuinely transient failures it’s meant to handle.

Prompt: “Some messages dead-letter with MaxDeliveryCountExceeded because they’re malformed and fail every time, but others dead-letter from transient downstream timeouts. Recommend how to separate the poison messages from the transient ones in the DLQ, handle each appropriately, and why blanket-raising max delivery count is the wrong response.”
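
One way to separate the two groups is to have the consumer record why it failed, then split the DLQ on that field. The sketch below assumes a hypothetical `last_error` value that your consumer attaches before a message is dead-lettered; Service Bus does not supply it on its own. The `DeadLetter` record and the set of transient error names are also illustrative.

```python
from dataclasses import dataclass


@dataclass
class DeadLetter:
    message_id: str
    reason: str      # the DeadLetterReason on the message
    last_error: str  # hypothetical: error name recorded by the consumer


# Hypothetical list of error names the team considers transient.
TRANSIENT_ERRORS = {"TimeoutError", "ConnectionError"}


def triage(messages: list[DeadLetter]) -> tuple[list[DeadLetter], list[DeadLetter]]:
    """Split dead letters into (transient, poison) lists."""
    transient: list[DeadLetter] = []
    poison: list[DeadLetter] = []
    for message in messages:
        if message.last_error in TRANSIENT_ERRORS:
            transient.append(message)
        else:
            poison.append(message)
    return transient, poison
```

The transient list is a candidate for replay once the downstream dependency is healthy. The poison list goes to a person or a repair job, and the delivery count stays where it is.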

Replay the DLQ only when consumers are idempotent

Once you’ve fixed the root cause, you’ll want to replay the dead-letter backlog. This is the step that turns a messaging incident into a data-integrity incident if you’re not careful: replaying resubmits messages, and if a side effect already happened on an earlier attempt — a charge, an email, an inventory decrement — replaying does it again.
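
A minimal picture of what an idempotent consumer looks like: it remembers which message IDs it has already acted on and skips repeats. This is a sketch under stated assumptions. The class name `IdempotentHandler` is hypothetical, and the in-memory set stands in for a durable store such as a database table with a unique key on the message ID.

```python
from typing import Callable


class IdempotentHandler:
    """Run a side effect at most once per message ID (in-memory sketch)."""

    def __init__(self, side_effect: Callable[[str, dict], None]) -> None:
        self._side_effect = side_effect
        # Assumption: a real consumer would keep this in durable storage.
        self._done: set[str] = set()

    def handle(self, message_id: str, body: dict) -> bool:
        """Return True if the side effect ran, False for a skipped duplicate."""
        if message_id in self._done:
            return False
        self._side_effect(message_id, body)
        self._done.add(message_id)
        return True
```

In this sketch the ID is recorded only after the side effect succeeds, so a crash between the two steps can still produce a duplicate. A production version would make the side effect and the bookkeeping atomic, for example in one database transaction.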

Prompt: “I’ve fixed the consumer and want to replay 30,000 dead-lettered messages back to the main queue. Before I do, give me a checklist to confirm the consumers are idempotent (or that duplicate detection is enabled), and a safe approach to replay in batches with monitoring rather than all at once.”

The guardrail is non-negotiable: before replaying, confirm that consumers are idempotent or that duplicate detection is already enabled on the target queue. Duplicate detection has to be chosen when the queue is created and matches on MessageId within a configured time window, so it is not a switch you can flip mid-incident. Replay in monitored batches rather than dumping all thirty thousand back at once. AI drafts the replay plan; you verify idempotency and own the go decision. The same caution applies to choosing PeekLock over ReceiveAndDelete: the latter deletes the message before processing, so a crash loses it, which is only acceptable for loss-tolerant streams.
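
Batched replay with a health gate can be sketched as a small loop. The example assumes two hypothetical callables that you supply: `send_batch`, which resubmits a batch to the main queue, and `healthy`, which checks whatever signal you monitor (error rate, queue depth, downstream latency). Neither is an Azure API.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def replay(
    items: Sequence[T],
    size: int,
    send_batch: Callable[[Sequence[T]], None],  # hypothetical: resubmits one batch
    healthy: Callable[[], bool],                # hypothetical: monitoring check
) -> int:
    """Replay batch by batch; stop as soon as healthy() returns False.

    Returns the number of items sent so the operator knows where to resume.
    """
    sent = 0
    for batch in batches(items, size):
        send_batch(batch)
        sent += len(batch)
        if not healthy():
            break
    return sent
```

Starting with a small batch size and raising it only after the first batches process cleanly keeps the blast radius small if idempotency turns out to be weaker than expected.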

The loop

Service Bus rewards reading the evidence before reaching for a knob. Read the dead-letter reason first — it names the class of problem. Compare lock duration to processing time before blaming the consumer. Triage poison messages out instead of raising the delivery count. Confirm idempotency before replaying the DLQ. AI joins the entity config and consumer behavior and recalls the failure patterns; you verify the timing numbers and own the replay. Do that and a thirty-thousand-message dead-letter queue becomes a diagnosis instead of a panic. There’s more reliability material in the Azure category, and the Service Bus debug prompt is ready to copy from the prompts library.
