Poison-Message Quarantine and Replay Design Prompt

Design how a queue consumer detects poison messages, quarantines them after bounded retries, and supports safe operator-driven replay — so one bad message can't block the queue or be lost.

Target user

Platform engineers operating message-driven automation pipelines

Difficulty

Advanced

Tools

Claude, ChatGPT, Cursor

You are a senior platform engineer who has watched one malformed message wedge a queue and stall an entire automation pipeline.

I will provide:
- The broker (SQS, Kafka, RabbitMQ, Pub/Sub) and its redelivery/visibility semantics
- The consumer, what processing can fail, and which failures are transient vs. permanent
- Ordering requirements and whether head-of-line blocking is a concern
- Message volume and how poison messages currently behave

Your job:
1. **Failure classification**: distinguish transient failures (retry) from poison ones (will never succeed: bad schema, missing reference) for [CONSUMER], so retries aren't wasted on hopeless messages.
2. **Bounded retry**: set a retry/redelivery limit with backoff before a message is treated as poison, tuned to the broker's redelivery model and visibility timeout.
3. **Quarantine**: route exhausted messages to a DLQ/quarantine with full context (original payload, error, attempt count, timestamps) instead of dropping or infinitely retrying them.
4. **Head-of-line protection**: ensure a poison message at the front of an ordered partition doesn't block everything behind it; define the trade-off if strict ordering must hold.
5. **Replay path**: design operator-driven replay: inspect, optionally fix, and re-enqueue quarantined messages, with dedup so replay doesn't double-process already-handled work.
6. **Observability and alerting**: alert on DLQ growth and surface poison patterns, since a rising DLQ is often the first sign of a deploy that broke message handling.

Output as: a failure-classification table, the consumer retry/quarantine pseudocode, a DLQ message schema, and a replay runbook with dedup guarantees.

Reproduce a poison message in staging and walk the full path (retry, quarantine, alert, replay) before production; the failure to design for is the one bad message among millions of good ones, which only appears under real traffic.

Why this prompt works

In a message-driven pipeline, almost every message is fine — and the one that isn’t can take down everything. A poison message is one that will never process successfully no matter how many times it’s redelivered: a malformed schema, a reference to something that was deleted, a payload your consumer’s new version can’t parse. The prompt’s first move is to make the consumer distinguish these from transient failures, because the two demand opposite responses. A transient failure wants retry; a poison message wants quarantine. Retrying a poison message is pure waste, and on an ordered stream it’s worse than waste — it blocks every message behind it indefinitely, turning one bad payload into a full pipeline stall.
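To make the transient-versus-poison split concrete, here is a minimal Python sketch of a consumer decision function. Everything in it is an illustrative assumption rather than part of the prompt: the `TransientError` and `PoisonError` classes, the `order_id` field, the `simulate_timeout` flag, and the default limit of 5 attempts are hypothetical, and a real consumer would take the attempt count from the broker's delivery metadata.

```python
import json
from dataclasses import dataclass


class TransientError(Exception):
    """Failure that may succeed on a later attempt (timeout, throttling)."""


class PoisonError(Exception):
    """Failure that will never succeed (bad schema, missing reference)."""


@dataclass
class Outcome:
    status: str  # "ok", "retry", or "quarantine"
    attempts: int
    error: str = ""


def handle(payload: str) -> dict:
    """Hypothetical handler: parse and validate one message."""
    try:
        body = json.loads(payload)
    except json.JSONDecodeError as exc:
        raise PoisonError(f"unparseable payload: {exc}") from exc
    if not isinstance(body, dict) or "order_id" not in body:
        raise PoisonError("missing required field: order_id")
    if body.get("simulate_timeout"):  # stand-in for a flaky downstream call
        raise TransientError("downstream timed out")
    return body


def process(payload: str, attempts: int, max_attempts: int = 5) -> Outcome:
    """Decide what to do with one delivery; `attempts` includes this one."""
    try:
        handle(payload)
        return Outcome("ok", attempts)
    except PoisonError as exc:
        # Never retry: another delivery cannot change the result.
        return Outcome("quarantine", attempts, str(exc))
    except TransientError as exc:
        # Retry until the budget is spent, then quarantine anyway.
        if attempts >= max_attempts:
            return Outcome("quarantine", attempts, str(exc))
        return Outcome("retry", attempts, str(exc))


def backoff_seconds(attempts: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Capped exponential delay before the next attempt (no jitter here)."""
    return min(cap, base * 2 ** (attempts - 1))
```

The point of the sketch is that a poison failure skips the retry budget entirely, while a transient failure is still quarantined once the budget runs out, so neither kind can hold the queue forever.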

The prompt insists on quarantine with full context rather than the two failure modes teams actually ship: infinite retry or silent drop. Infinite retry wedges the consumer; silent drop loses work that no one can recover or even diagnose. A proper dead-letter path captures the original payload, the error, and the attempt history, which is what makes the next step — operator-driven replay — possible at all. And replay is where the prompt enforces the non-obvious safety requirement: a quarantined message may have been partially processed before it failed, so re-enqueuing it must be idempotent and deduped, or replay double-applies side effects. Replay has to be as safe as first delivery, not a blind re-send.
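One way the quarantine record and a deduplicated replay could look is sketched below in Python. The field names, the `Replayer` class, and the in-memory `done` set are assumptions made for illustration; a production version would keep the idempotency keys in a durable store shared by the consumer and the replay tool.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class QuarantineRecord:
    """Context stored alongside a quarantined message (hypothetical schema)."""
    message_id: str
    payload: str
    error: str
    attempts: int
    first_seen: float
    quarantined_at: float = field(default_factory=time.time)


class Replayer:
    """In-memory stand-in for a replay tool backed by a durable dedup store."""

    def __init__(self, apply_effect):
        self.apply_effect = apply_effect  # hypothetical side-effecting handler
        self.done = set()                 # idempotency keys already applied

    @staticmethod
    def key(record: QuarantineRecord) -> str:
        # Key on the stable message ID, not the payload, so a payload that an
        # operator has fixed still dedupes against the original message.
        return hashlib.sha256(record.message_id.encode()).hexdigest()

    def replay(self, record: QuarantineRecord) -> str:
        k = self.key(record)
        if k in self.done:
            return "skipped"
        self.apply_effect(record.payload)
        # Record the key only after the effect succeeds, so a failed replay
        # can be attempted again.
        self.done.add(k)
        return "applied"
```

Replaying the same record twice applies the effect once and reports `"skipped"` the second time. This sketch only dedupes whole replays; if the original attempt failed partway through, the handler itself still has to be idempotent for the replay to be safe.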

The observability requirement turns the DLQ from a graveyard into a signal. A rising dead-letter count is frequently the first visible symptom of a deploy that broke message handling, and alerting on it catches the regression before the backlog explodes. The model can draft the classification table, retry logic, and replay runbook quickly, but you verify by reproducing an actual poison message in staging and walking the entire path — retry, quarantine, alert, replay. That bad-message-among-millions only shows up under real traffic, so you make it show up on purpose first.
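As a rough illustration of alerting on DLQ growth, the Python sketch below checks a series of sampled queue depths and summarizes the most common error strings. The function names, the window of 5 samples, and the growth threshold of 10 are assumptions chosen for the example; in practice this logic usually lives in a monitoring system's alert rules rather than in application code.

```python
from collections import Counter


def dlq_growth_alert(depth_samples, window=5, max_growth=10):
    """True if DLQ depth rose by more than `max_growth` across the last `window` samples."""
    if len(depth_samples) < window:
        return False  # not enough history to judge a trend
    recent = depth_samples[-window:]
    return recent[-1] - recent[0] > max_growth


def top_error_patterns(errors, n=3):
    """Most frequent error strings among quarantined messages, as (error, count) pairs."""
    return Counter(errors).most_common(n)
```

Alerting on growth rather than absolute depth keeps a long-standing backlog from masking a fresh regression, and grouping by error string makes a single bad deploy stand out as one dominant pattern.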

Related prompts

Automation Dead-Letter and Poison-Message Triage Design Prompt

Webhook Dedupe and Replay-Protection Receiver Design Prompt
