
Automation Dead-Letter and Poison-Message Triage Design Prompt

Design a dead-letter queue triage workflow for an event-driven automation pipeline — classifying failures, isolating poison messages, and defining safe replay-vs-discard decisions so the DLQ becomes an actionable backlog instead of an ignored graveyard.

Target user: Platform engineers operating event-driven automation pipelines
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior automation engineer who has just discovered a dead-letter queue holding 40,000 silently failed messages that nobody has looked at in months.

I will provide:
- The pipeline and the broker/DLQ in use (e.g. SQS, Kafka, RabbitMQ, Pub/Sub)
- How messages land in the DLQ (max-receive count, processing errors, parse failures)
- The side effects of reprocessing a message
- Current alerting (or lack of it) on DLQ depth

Your job:

1. **Failure taxonomy** — classify DLQ entries into transient (retryable), poison (will always fail), and bad-data (needs upstream fix), and explain how to tell them apart from the failure metadata.
2. **Capture context** — define what diagnostic context must be attached when a message is dead-lettered (error, attempt count, original timestamp, correlation ID) so triage doesn't require guesswork.
3. **Poison isolation** — describe how to detect and quarantine messages that loop (max-receive thresholds) before they consume worker capacity.
4. **Replay decision** — define the criteria and procedure for safely replaying transient failures, including re-checking idempotency/dedupe so replay does not double-apply side effects.
5. **Discard policy** — specify when a message is safe to drop, who approves it, and how the discard is audited.
6. **Alerting and SLOs** — set DLQ-depth and DLQ-age alerts with thresholds, so the queue can never silently grow unnoticed.
7. **Upstream feedback loop** — define how recurring poison patterns feed bug fixes upstream instead of being endlessly replayed.

Output as: a failure-taxonomy table, a triage decision tree (replay / quarantine / discard), an alert spec, and a replay runbook with idempotency checks.

Require an idempotency re-check before any bulk replay and explicit human approval before discarding messages, and document how to reverse the effect of any message that is replayed in error.
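
Example of the kind of output to expect

As a rough illustration of the triage decision tree the prompt asks for, the Python sketch below classifies a dead-lettered message from its captured context and picks replay, quarantine, or discard. Everything in it is an assumption made for illustration: the `DeadLetter` fields, the error labels, and the `already_applied` set standing in for a real dedupe store. It is not tied to any particular broker, and a real pipeline would substitute its own error mapping and idempotency lookup.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    TRANSIENT = "transient"   # retryable
    POISON = "poison"         # will always fail
    BAD_DATA = "bad-data"     # needs an upstream fix


class Action(Enum):
    REPLAY = "replay"
    QUARANTINE = "quarantine"
    DISCARD = "discard"


@dataclass(frozen=True)
class DeadLetter:
    """Diagnostic context attached when a message is dead-lettered (hypothetical shape)."""
    message_id: str
    correlation_id: str
    error_type: str            # hypothetical label, e.g. "timeout" or "parse_error"
    attempt_count: int
    original_timestamp: float  # epoch seconds of the first publish


# Hypothetical error labels; map your own exception types here.
TRANSIENT_ERRORS = {"timeout", "throttled", "dependency_unavailable"}
BAD_DATA_ERRORS = {"parse_error", "schema_mismatch"}


def classify(entry: DeadLetter) -> FailureClass:
    """Place a DLQ entry in the failure taxonomy using its failure metadata."""
    if entry.error_type in BAD_DATA_ERRORS:
        return FailureClass.BAD_DATA
    if entry.error_type in TRANSIENT_ERRORS:
        return FailureClass.TRANSIENT
    # Unrecognised errors are treated as poison until a human reviews them.
    return FailureClass.POISON


def decide(entry: DeadLetter, already_applied: set[str],
           discard_approved: bool = False) -> Action:
    """Pick replay / quarantine / discard for one DLQ entry.

    already_applied stands in for a dedupe store: the IDs of messages whose
    side effects have already been applied downstream.
    """
    # Discard only ever happens with explicit human approval.
    if discard_approved:
        return Action.DISCARD
    # Replay only transient failures, and only after the idempotency re-check.
    if (classify(entry) is FailureClass.TRANSIENT
            and entry.message_id not in already_applied):
        return Action.REPLAY
    # Poison, bad-data, and already-applied messages wait in quarantine.
    return Action.QUARANTINE


if __name__ == "__main__":
    entry = DeadLetter("m1", "c1", "timeout", 3, 1_700_000_000.0)
    print(decide(entry, already_applied=set()).value)
```

The sketch never returns a discard on its own: anything that is not a clean transient failure is parked in quarantine until a person approves dropping it, which mirrors the approval and idempotency requirements stated in the prompt.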
