Automation Error Guide: 'Poison Message' Dead-Letter Queue Redelivery Loop
Fix dead-letter queue poison messages and infinite redelivery loops: diagnose deserialization failures, missing ack, visibility timeout, max-receive, and no DLQ configured.
- #automation
- #troubleshooting
- #errors
- #queues
Overview
A poison message is a message a consumer can never successfully process — it fails every time it’s delivered. In a queue with at-least-once delivery, a message that is received but not acknowledged becomes visible again and is redelivered. If the consumer keeps throwing on it, the broker keeps redelivering it: an infinite redelivery loop that burns CPU, blocks the partition/queue head, and starves healthy messages. A correctly configured queue eventually routes the message to a dead-letter queue (DLQ) after N attempts; a misconfigured one loops forever.
You will see the consumer log the same message ID repeatedly:
ERROR consumer failed to process message id=msg-7f3a deliveryCount=58 error=JSON parse: unexpected token at position 0
WARN redelivering id=msg-7f3a (nack, requeue=true)
Or the broker reporting redelivery counts climbing on one message:
queue=orders message=msg-7f3a redelivered=true delivery_count=58
The loop starts whenever a consumer nacks/rejects with requeue, or simply fails to ack within the visibility/lock window. A single malformed event published upstream can stall an entire queue until the poison message is dead-lettered or removed.
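The difference between a looping queue and one with a redrive threshold can be seen in a small in-memory simulation. This is only a sketch, not broker code: `run_queue`, `handler`, and the `max_deliveries` safety cap are hypothetical names, and the threshold check only approximates how a real broker counts receives.

```python
from collections import deque


def run_queue(messages, handler, max_receive=None, max_deliveries=50):
    """Simulate at-least-once delivery from a single FIFO queue.

    Returns (processed, dead_lettered, deliveries). The max_deliveries cap
    exists only so that an endless loop cannot hang this demo.
    """
    queue = deque((body, 0) for body in messages)  # (body, receive_count)
    processed, dead_lettered, deliveries = [], [], 0
    while queue and deliveries < max_deliveries:
        body, count = queue.popleft()
        count += 1
        deliveries += 1
        try:
            handler(body)
            processed.append(body)  # ack: the message leaves the queue
        except ValueError:
            if max_receive is not None and count >= max_receive:
                dead_lettered.append(body)  # redrive: move to the DLQ
            else:
                queue.appendleft((body, count))  # requeue at the head
    return processed, dead_lettered, deliveries


def handler(body):
    # Stand-in for a consumer that can only process JSON objects.
    if not body.startswith("{"):
        raise ValueError("not JSON")


msgs = ["bad-bytes", '{"order": 1}', '{"order": 2}']
no_dlq = run_queue(msgs, handler)                   # poison message blocks the head
with_dlq = run_queue(msgs, handler, max_receive=5)  # poison message is dead-lettered
```

Without a threshold, every delivery is spent on the poison message and the two healthy orders are never reached. With `max_receive=5`, the poison message exits after five attempts and the rest of the queue drains.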
Symptoms
- The same message ID appears in consumer error logs over and over with a rising delivery/receive count.
- Queue depth stays flat or grows while throughput drops to near zero (head-of-line blocking).
- CPU on the consumer is high but no useful work completes.
- The DLQ is either empty (the loop never terminates) or filling rapidly (the loop terminates but the root cause is unfixed).
# RabbitMQ: see redelivery and the stuck head message
# (newer rabbitmqadmin releases take ackmode=ack_requeue_true instead of requeue=true)
rabbitmqadmin get queue=orders count=1 requeue=true
| routing_key | redelivered | message_count | payload |
| orders | True | 1041 | \x00\x01bad... |
# SQS: inspect approximate receive count on a message
aws sqs receive-message --queue-url "$Q" \
--attribute-names ApproximateReceiveCount --max-number-of-messages 1 \
--query 'Messages[0].Attributes.ApproximateReceiveCount'
"62"
Common Root Causes
1. Deserialization failure on a malformed payload
The consumer can’t parse the body (bad JSON, wrong schema version, binary garbage), throws before doing any work, nacks, and the message comes right back.
# Receive one copy without deleting it (this still increments ApproximateReceiveCount and hides the message for the visibility timeout)
aws sqs receive-message --queue-url "$Q" --max-number-of-messages 1 \
--query 'Messages[0].Body' --output text | jq . 2>&1 | head -2
parse error: Invalid numeric literal at line 1, column 1
A parse error here means no retry count will ever help — the message is structurally bad.
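The same check can live in the consumer so that a structurally bad payload is recognized before any work starts. A minimal sketch, assuming JSON bodies; `classify_payload` and its `"ok"`/`"poison"` labels are hypothetical names, and a real consumer would likely add schema validation on top.

```python
import json


def classify_payload(body):
    """Return ("ok", parsed) or ("poison", reason) for a raw message body."""
    if isinstance(body, bytes):
        try:
            body = body.decode("utf-8")
        except UnicodeDecodeError as exc:
            return "poison", f"not UTF-8: {exc.reason}"
    try:
        return "ok", json.loads(body)
    except json.JSONDecodeError as exc:
        # Retrying cannot fix a body that does not parse.
        return "poison", f"invalid JSON at position {exc.pos}: {exc.msg}"
```

A `"poison"` result is the signal to dead-letter the message immediately rather than throw into a requeue.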
2. No DLQ / redrive policy configured
Without a max-receive threshold and a DLQ target, the broker has nowhere to send a repeatedly failing message, so it loops indefinitely.
aws sqs get-queue-attributes --queue-url "$Q" \
--attribute-names RedrivePolicy --query 'Attributes.RedrivePolicy'
null
A null RedrivePolicy means no DLQ: a poison message stays in the main queue until it is deleted or its retention period expires.
3. Visibility timeout / lock shorter than processing time
If processing legitimately takes longer than the visibility timeout, the message becomes visible and is redelivered before the first attempt finishes — then both attempts may “fail,” inflating the count and duplicating work.
aws sqs get-queue-attributes --queue-url "$Q" \
--attribute-names VisibilityTimeout
{ "Attributes": { "VisibilityTimeout": "30" } }
A 30s visibility timeout on a job that takes 45s guarantees redelivery before completion.
4. Missing or failed acknowledgement
The consumer processes successfully but throws or crashes before sending the ack, so the broker assumes failure and redelivers.
grep -RniE "channel.ack|message.ack|deleteMessage|basic_ack|commitSync" ./consumer | head
consumer/worker.ts:41: // TODO: ack after handler <- ack never sent on the happy path
No ack on success = guaranteed redelivery even of good messages.
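One way to make the happy-path ack hard to forget is to route every message through a small wrapper that acks as soon as the handler returns. This is a sketch under assumptions: `process_and_ack`, `ack`, and `on_failure` are hypothetical callables standing in for whatever the broker client provides.

```python
def process_and_ack(message, handler, ack, on_failure):
    """Run handler, then ack only if it returned without raising."""
    try:
        handler(message)
    except Exception as exc:
        on_failure(message, exc)  # caller decides: requeue or dead-letter
        return False
    ack(message)  # happy path: ack right after successful processing
    return True


acked, failed = [], []


def bad_handler(message):
    raise ValueError("boom")


ok = process_and_ack("m1", lambda m: None, acked.append,
                     lambda m, e: failed.append(m))
not_ok = process_and_ack("m2", bad_handler, acked.append,
                         lambda m, e: failed.append(m))
```

Keeping the ack outside the `try` block means an error raised by the ack call itself is not mistaken for a handler failure.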
5. Requeue-on-every-error instead of routing to DLQ
The handler catches all errors and nacks with requeue=true, even for permanent failures that will never succeed. The broker faithfully loops.
grep -RniE "nack|reject|requeue|negativeAck" ./consumer | head
consumer/worker.ts:55: channel.nack(msg, false, true) // requeue=true for ALL errors
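Permanent errors must be nacked with requeue=false (or dead-lettered), not requeued. In RabbitMQ, requeue=false only dead-letters the message if the queue has a dead-letter exchange configured; otherwise the message is dropped.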
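The transient-versus-permanent split can be sketched as a single decision function that the catch block calls instead of always requeueing. The exception types listed as permanent, the `max_attempts` default, and the name `nack_decision` are assumptions for illustration; the right set depends on the handler.

```python
import json

# Assumed set of errors that no retry can fix (bad bytes, bad JSON, missing field).
PERMANENT = (json.JSONDecodeError, UnicodeDecodeError, KeyError)


def nack_decision(exc, delivery_count, max_attempts=5):
    """Return "requeue" or "dead-letter" for a failed delivery."""
    if isinstance(exc, PERMANENT):
        return "dead-letter"  # structurally bad: do not requeue
    if delivery_count >= max_attempts:
        return "dead-letter"  # retries exhausted
    return "requeue"  # possibly transient: try again
```

With an AMQP client, `"requeue"` would map to a nack with requeue=true and `"dead-letter"` to a nack with requeue=false or an explicit publish to the DLQ.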
6. Max-receive set too high or never reached
A redrive policy exists but maxReceiveCount is enormous, so the message effectively loops for a very long time before dead-lettering.
aws sqs get-queue-attributes --queue-url "$Q" \
--attribute-names RedrivePolicy --query 'Attributes.RedrivePolicy' --output text | jq .
{ "deadLetterTargetArn": "arn:aws:sqs:...:orders-dlq", "maxReceiveCount": 1000 }
maxReceiveCount: 1000 means a poison message loops ~1000 times before help arrives.
Diagnostic Workflow
Step 1: Identify the looping message and its count
# SQS
aws sqs receive-message --queue-url "$Q" --attribute-names ApproximateReceiveCount \
--max-number-of-messages 1 --query 'Messages[0].{id:MessageId,recv:Attributes.ApproximateReceiveCount}'
# RabbitMQ
rabbitmqadmin get queue=<q> count=1 requeue=true
A high, climbing receive/delivery count on one message confirms a loop.
Step 2: Capture and inspect the payload
aws sqs receive-message --queue-url "$Q" --max-number-of-messages 1 \
--query 'Messages[0].Body' --output text | jq . 2>&1 | head
A parse error means a permanent (poison) failure; valid JSON means look at processing/ack instead.
Step 3: Confirm a DLQ / redrive policy exists
aws sqs get-queue-attributes --queue-url "$Q" --attribute-names RedrivePolicy
null → add a DLQ and maxReceiveCount. Present but huge → lower the threshold.
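The same check can be scripted against the attribute map that SQS returns, where RedrivePolicy is a JSON string. A hedged sketch: `assess_redrive_policy` is a hypothetical helper, and the default ceiling of 10 is an assumption, not an AWS recommendation.

```python
import json


def assess_redrive_policy(attributes, max_sane_receive_count=10):
    """Classify a queue's redrive setup from its attribute map.

    `attributes` is the dict of queue attributes (string keys and values),
    e.g. the "Attributes" field of a get-queue-attributes response.
    """
    raw = attributes.get("RedrivePolicy")
    if not raw:
        return "missing: no DLQ configured"
    policy = json.loads(raw)
    count = int(policy["maxReceiveCount"])
    if count > max_sane_receive_count:
        return f"too high: maxReceiveCount={count}"
    return f"ok: maxReceiveCount={count}"
```

Passing an empty dict models the null case above; a policy with maxReceiveCount of 1000 is reported as too high.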
Step 4: Check visibility timeout vs actual processing time
aws sqs get-queue-attributes --queue-url "$Q" --attribute-names VisibilityTimeout
# Compare to your handler's p99 processing duration from logs
If processing > visibility timeout, raise the timeout (or extend it heartbeat-style mid-processing).
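The comparison can be made concrete with a nearest-rank p99 over handler durations pulled from logs. This is a sketch under assumptions: `p99`, `visibility_timeout_ok`, and the 1.5x headroom factor are illustrative choices, not values from any broker documentation.

```python
def p99(durations):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    if not durations:
        raise ValueError("need at least one duration")
    ordered = sorted(durations)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def visibility_timeout_ok(durations, visibility_timeout_s, headroom=1.5):
    """True if the timeout covers p99 processing time with some headroom."""
    return p99(durations) * headroom <= visibility_timeout_s


# 98 fast jobs and two slow ones, in seconds
samples = [10] * 98 + [45, 45]
```

For these samples the p99 is 45 seconds, so a 30 second timeout fails the check while a 90 second timeout passes.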
Step 5: Audit ack and nack/requeue logic
grep -RniE "ack|nack|reject|requeue|deleteMessage|commitSync" ./consumer
Ensure success paths ack, and permanent errors nack with requeue=false (or publish to the DLQ explicitly).
Example Root Cause Analysis
The orders queue's throughput drops to zero while depth holds at ~1000. The consumer log shows one message ID failing endlessly:
ERROR failed to process id=msg-7f3a deliveryCount=58 error=JSON parse: unexpected token at position 0
WARN redelivering id=msg-7f3a (nack, requeue=true)
Capturing the payload reveals it isn’t JSON at all:
aws sqs receive-message --queue-url "$Q" --max-number-of-messages 1 \
--query 'Messages[0].Body' --output text | head -c 40 | xxd | head -1
00000000: 0001 0203 6261 6420 6269 6e61 7279 ... ....bad binary...
An upstream publisher wrote a raw binary blob instead of JSON. The consumer’s catch-all nacks with requeue=true, and because the queue has no RedrivePolicy, the message loops forever and blocks the head.
Fix: add a DLQ with a sane max-receive so poison messages exit the loop, and change permanent-error handling to not requeue:
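# RedrivePolicy is itself a JSON string, so pass --attributes as JSON with the inner quotes escaped
aws sqs set-queue-attributes --queue-url "$Q" --attributes \
'{"RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:...:orders-dlq\",\"maxReceiveCount\":\"5\"}"}'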
# In the consumer: on a parse error, nack with requeue=false (or publish to DLQ)
The poison message is dead-lettered after 5 attempts, the queue head unblocks, and healthy orders flow again. The binary publisher is then fixed at the source.
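When the policy is applied from code rather than the shell, building the attribute map programmatically avoids quoting mistakes, because RedrivePolicy must be a JSON string nested inside the attribute map. A minimal sketch: `build_redrive_attributes` is a hypothetical helper, the ARN in the example is a made-up placeholder, and the 1 to 1000 bound is the assumed valid range for maxReceiveCount.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the attribute map for setting a redrive policy on a queue."""
    if not 1 <= max_receive_count <= 1000:
        raise ValueError("max_receive_count must be between 1 and 1000")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # The policy is serialized to a string; the outer map stays a dict.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN for illustration only
attrs = build_redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq")
```

The resulting dict has the shape expected by an SDK call such as boto3's `set_queue_attributes(QueueUrl=..., Attributes=attrs)`.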
Prevention Best Practices
- Always configure a DLQ with a small maxReceiveCount (e.g., 3–5) so poison messages exit the main queue quickly instead of looping.
- Distinguish transient from permanent errors in the consumer: retry transient (requeue), dead-letter permanent (no requeue); never blanket-requeue everything.
- Validate/deserialize defensively and route schema/parse failures straight to the DLQ rather than throwing into a requeue.
- Set the visibility timeout / lock duration above the handler’s p99 processing time, and extend it for long jobs instead of letting it expire.
- Ack only after successful processing, and make handlers idempotent so an unavoidable redelivery doesn’t double-apply effects.
- Alert on rising per-message receive counts and DLQ depth so a stuck head is caught in minutes.
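Idempotency from the list above can be sketched as a handler that remembers which message IDs it has already applied. This is an illustration only: `IdempotentHandler` is a hypothetical class, and the in-memory set would need to be a durable store (a database table or cache) to survive consumer restarts.

```python
class IdempotentHandler:
    """Apply each message's effect at most once, keyed by message ID."""

    def __init__(self, apply_effect):
        self._apply = apply_effect
        self._seen = set()  # assumption: in-memory only, lost on restart

    def handle(self, message_id, body):
        """Return True if the effect ran, False if this was a duplicate."""
        if message_id in self._seen:
            return False  # redelivery: safe to ack without re-applying
        self._apply(body)
        self._seen.add(message_id)  # record only after the effect succeeded
        return True
```

A redelivered message then returns `False` and can be acked without applying its effect a second time.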
Quick Command Reference
# Find the looping message and its receive count (SQS)
aws sqs receive-message --queue-url "$Q" --attribute-names ApproximateReceiveCount \
--max-number-of-messages 1 --query 'Messages[0].{id:MessageId,recv:Attributes.ApproximateReceiveCount}'
# RabbitMQ head inspection (newer rabbitmqadmin releases take ackmode=ack_requeue_true instead of requeue=true)
rabbitmqadmin get queue=<q> count=1 requeue=true
# Inspect the payload for poison
aws sqs receive-message --queue-url "$Q" --max-number-of-messages 1 \
--query 'Messages[0].Body' --output text | jq . 2>&1 | head
# Is a DLQ configured?
aws sqs get-queue-attributes --queue-url "$Q" --attribute-names RedrivePolicy
# Visibility timeout (compare to handler p99 processing time)
aws sqs get-queue-attributes --queue-url "$Q" --attribute-names VisibilityTimeout
# Add / tighten a redrive policy
aws sqs set-queue-attributes --queue-url "$Q" --attributes \
'{"RedrivePolicy":"{\"deadLetterTargetArn\":\"<dlq-arn>\",\"maxReceiveCount\":\"5\"}"}'
# Audit ack / requeue logic
grep -RniE "ack|nack|reject|requeue|deleteMessage" ./consumer
Conclusion
A poison message and its redelivery loop come down to a message that fails every attempt with nowhere to go. The usual root causes:
- A deserialization/schema failure on a structurally bad payload.
- No DLQ / redrive policy, so the broker loops indefinitely.
- A visibility timeout shorter than processing time, redelivering before completion.
- A missing or failed ack that redelivers even good messages.
- Blanket requeue=true on permanent errors instead of dead-lettering.
- A maxReceiveCount set so high the loop runs for a very long time.
Identify the looping message ID and inspect its payload first — a parse failure means dead-letter it and fix the publisher, while valid data points you at visibility timeout or ack logic.