Automation Dead-Letter and Poison-Message Triage Design Prompt
Design a dead-letter queue triage workflow for an event-driven automation pipeline — classifying failures, isolating poison messages, and defining safe replay-vs-discard decisions so the DLQ becomes an actionable backlog instead of an ignored graveyard.
- Target user: Platform engineers operating event-driven automation pipelines
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior automation engineer who has discovered a dead-letter queue with 40,000 silently failed messages that nobody had looked at in months.

I will provide:
- The pipeline and the broker/DLQ in use (SQS, Kafka, RabbitMQ, Pub/Sub)
- How messages land in the DLQ (max-receive count, processing errors, parse failures)
- The side effects of reprocessing a message
- Current alerting (or lack of it) on DLQ depth

Your job:
1. **Failure taxonomy**: classify DLQ entries into transient (retryable), poison (will always fail), and bad-data (needs upstream fix), and explain how to tell them apart from the failure metadata.
2. **Capture context**: define what diagnostic context must be attached when a message is dead-lettered (error, attempt count, original timestamp, correlation ID) so triage doesn't require guesswork.
3. **Poison isolation**: describe how to detect and quarantine messages that loop (max-receive thresholds) before they consume worker capacity.
4. **Replay decision**: define the criteria and procedure for safely replaying transient failures, including re-checking idempotency/dedupe so replay does not double-apply side effects.
5. **Discard policy**: specify when a message is safe to drop, who approves it, and how the discard is audited.
6. **Alerting and SLOs**: set DLQ-depth and DLQ-age alerts with thresholds, so the queue can never silently grow unnoticed.
7. **Upstream feedback loop**: define how recurring poison patterns feed bug fixes upstream instead of being endlessly replayed.

Output as: a failure-taxonomy table, a triage decision tree (replay / quarantine / discard), an alert spec, and a replay runbook with idempotency checks.

Require an idempotency re-check before any bulk replay and explicit human approval before discarding messages, and document how to reverse the effect of any message that is replayed in error.
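For orientation, the kind of triage decision tree this prompt asks for might look roughly like the Python sketch below. It is illustrative only and not tied to any broker: the `DeadLetter` fields mirror the diagnostic context named in the prompt, while the error labels, the `already_applied` set (standing in for an idempotency/dedupe store), and the `discard_approved_by` parameter are all assumptions made up for the example. Treating unrecognised errors as poison is likewise just one conservative choice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Action(Enum):
    REPLAY = "replay"
    QUARANTINE = "quarantine"
    DISCARD = "discard"


@dataclass(frozen=True)
class DeadLetter:
    # Diagnostic context attached at dead-letter time.
    message_id: str
    correlation_id: str
    error_class: str          # hypothetical label set by the consumer
    attempt_count: int
    original_timestamp: float


# Hypothetical error labels; a real pipeline would define its own.
TRANSIENT_ERRORS = {"timeout", "throttled", "dependency_unavailable"}
BAD_DATA_ERRORS = {"parse_error", "schema_mismatch"}


def classify(entry: DeadLetter) -> str:
    """Map failure metadata to a taxonomy bucket."""
    if entry.error_class in BAD_DATA_ERRORS:
        return "bad-data"
    if entry.error_class in TRANSIENT_ERRORS:
        return "transient"
    # Assumption: anything unrecognised is treated as poison.
    return "poison"


def triage(
    entry: DeadLetter,
    already_applied: Set[str],
    discard_approved_by: Optional[str] = None,
) -> Action:
    """Decide replay / quarantine / discard for one DLQ entry."""
    category = classify(entry)
    # Idempotency re-check: replay only if the side effect has not landed.
    if category == "transient" and entry.message_id not in already_applied:
        return Action.REPLAY
    # Discard only with a named human approver; otherwise keep the message.
    if discard_approved_by:
        return Action.DISCARD
    return Action.QUARANTINE
```

In this sketch a transient failure whose side effect has already been applied is never replayed; it stays quarantined until someone approves dropping it, which keeps the "human approval before discard" rule in one place.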