SQS and SNS Messaging Patterns and Triage Prompt
Design fan-out, FIFO, and DLQ topologies with SNS and SQS, then diagnose stuck, duplicated, out-of-order, or lost messages.
- Target user: Backend and cloud engineers building asynchronous messaging on AWS
- Difficulty: Intermediate
- Tools: Claude, ChatGPT, Cursor
The prompt
You are a senior AWS messaging engineer. You reason about SQS/SNS by separating delivery semantics (at-least-once vs exactly-once, ordered vs unordered) from operational mechanics (visibility timeout, redrive, DLQ, filtering), and you confirm message flow with queue metrics before changing config.
I will provide:
- The topology (SNS topic -> SQS queues fan-out, direct queue, FIFO or standard, subscription filter policies): [TOPOLOGY]
- The consumer behavior (batch size, max receives, processing time, idempotency handling): [CONSUMER]
- The queue/topic config (visibility timeout, message retention, DLQ + maxReceiveCount, FIFO dedup/group settings): [CONFIG]
- The symptom (messages stuck, duplicated, out of order, never delivered to a subscriber, piling up, landing in DLQ): [SYMPTOM]
- Relevant metrics (ApproximateNumberOfMessagesVisible/NotVisible, ApproximateAgeOfOldestMessage, NumberOfMessagesSent/Deleted, DLQ depth): [METRICS]
Do the following, numbered:
1. State the intended delivery semantics: at-least-once or FIFO exactly-once, ordered per group or unordered, fan-out or point-to-point. This frames every later decision.
2. For SNS fan-out, verify each subscription: protocol, the SQS queue access policy that allows the SNS service principal (`sns.amazonaws.com`) to call `sqs:SendMessage` with an `aws:SourceArn` condition naming the topic, the raw message delivery setting, and any subscription `FilterPolicy`. Confirm whether a missing subscriber is a filter mismatch versus a permission/confirmation problem.
3. Diagnose "stuck" or "redelivered" messages via the visibility timeout. The timeout must exceed the worst-case processing time plus retries; if processing outlives it, the message reappears and is processed twice. Compare timeout to actual processing duration and to ApproximateAgeOfOldestMessage.
4. For duplicates, distinguish standard-queue at-least-once delivery (consumer MUST be idempotent) from FIFO content-based or explicit deduplication-ID behavior within the 5-minute dedup window. Recommend the right dedup strategy rather than assuming the queue will guarantee uniqueness.
5. For ordering, confirm FIFO with a sensible MessageGroupId (ordering is per group, and a single group serializes throughput); explain that standard queues make no ordering guarantee and the fix is FIFO or a sequence number, not a config tweak.
6. Inspect the DLQ path. Confirm a DLQ is attached with an appropriate maxReceiveCount, read poison-message payloads to find the failing record, and decide between redrive (after fixing the consumer bug) versus discarding. Note that the DLQ retention must be long enough to investigate.
7. For backlog/throttling, separate a slow or failing consumer (rising oldest-message age, low Deleted count) from under-provisioned consumers (high visible count, healthy delete rate) and recommend scaling, batching, or long polling accordingly.
Output as: (a) the intended semantics, (b) the failing mechanism with its evidence metric, (c) the minimal config or consumer change (visibility timeout, filter policy, dedup, DLQ/redrive), (d) the idempotency/ordering note the consumer must honor, (e) a verification step (send a test message, watch it through to delete or DLQ).
Scope every queue and topic access policy to the specific source ARN; never leave `Principal: *` without a `SourceArn`/`SourceAccount` condition. Test redrive on a copy and review all policy changes before applying in production.
Why this prompt works
SQS and SNS look simple until a message is delivered twice, arrives out of order, or vanishes — and each of those symptoms maps to a different layer of the system. The most common mistake is treating a semantics problem as a config problem: engineers reach for a longer retention or a bigger batch size when the real issue is that a standard queue is at-least-once by design and the consumer was never made idempotent. This prompt forces the intended delivery semantics to be stated first, so every later recommendation is anchored to what the queue can actually guarantee rather than what the engineer wishes it guaranteed.
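To make "the consumer was never made idempotent" concrete, here is a minimal sketch of an idempotent handler wrapper. It is an illustration under stated assumptions, not a production design: the `IdempotentHandler` class and the `idempotency_key` field are hypothetical names, and the in-memory set stands in for a durable, shared store that a real consumer fleet would need.

```python
from typing import Callable, Dict, Set


class IdempotentHandler:
    """Wraps a side-effecting handler so redelivered messages are skipped."""

    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self._handler = handler
        # Assumption: an in-memory set for illustration only. Real consumers
        # need a durable store shared by every consumer instance.
        self._seen: Set[str] = set()

    def handle(self, message: Dict) -> bool:
        """Return True if the handler ran, False if the message was a duplicate."""
        # Assumption: the producer attaches a stable business key to each message.
        key = message["idempotency_key"]
        if key in self._seen:
            return False  # duplicate delivery: safe to delete without reprocessing
        self._handler(message)
        # Record the key only after success so a failed attempt is retried.
        self._seen.add(key)
        return True
```

Calling `handle` twice with the same `idempotency_key` runs the wrapped handler once and returns `False` on the second call, which is the behavior a standard queue's at-least-once delivery requires of the consumer.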
The visibility-timeout interaction is one of the most expensive bugs in this space. When processing time creeps past the visibility timeout, the message becomes visible again and is received a second time, by another consumer or by the same one, while the first attempt is still running. The result is duplicate work that looks like a code bug rather than a queue setting. By comparing the timeout directly against real processing duration and the age of the oldest message, the prompt isolates this class of redelivery precisely, and by pairing it with the at-least-once idempotency note it prevents the engineer from “fixing” duplicates with a config knob that cannot solve them.
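The comparison described above can be sketched as a small check. This is a rough illustration, not a tuning rule: the function names and the 1.5x headroom factor are assumptions chosen for the example, while the 12-hour ceiling is the SQS maximum for a visibility timeout.

```python
import math

SQS_MAX_VISIBILITY_TIMEOUT_S = 43_200  # 12 hours, the SQS upper limit


def is_redelivery_risk(visibility_timeout_s: float, worst_case_processing_s: float) -> bool:
    """True when processing can outlive the timeout, so the message may reappear."""
    return worst_case_processing_s >= visibility_timeout_s


def recommend_visibility_timeout(worst_case_processing_s: float, headroom: float = 1.5) -> int:
    """Suggest a timeout with headroom above worst-case processing time.

    Assumption: 1.5x headroom is an illustrative default, not an AWS recommendation.
    """
    recommended = math.ceil(worst_case_processing_s * headroom)
    return min(recommended, SQS_MAX_VISIBILITY_TIMEOUT_S)
```

For example, with the default 30-second timeout and a 45-second worst-case handler, `is_redelivery_risk(30, 45)` is true and `recommend_visibility_timeout(45)` suggests 68 seconds. A longer timeout reduces redelivery during processing but does not remove the need for an idempotent consumer.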
Fan-out and dead-letter handling round out the design view. SNS-to-SQS fan-out fails silently from the publisher's point of view when a subscription filter policy excludes a message or the queue access policy does not allow the topic to send to the queue, so the prompt checks both before blaming the publisher. A DLQ with a tuned maxReceiveCount isolates poison messages so they can be inspected and replayed instead of being retried until retention expires, while oldest-message age and delete rate separate a slow consumer from an under-provisioned one. The prompt therefore treats the DLQ and redrive path as a deliberate output, keeping the engineer able to investigate and replay rather than lose failed messages.
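The two fan-out checks can be sketched in a few lines. The first function builds a queue access policy scoped to a single topic, following the commonly documented SNS-to-SQS pattern. The second is a deliberately simplified filter check: it handles only exact string matching on flat message attributes, whereas real SNS filter policies also support numeric, prefix, anything-but, and exists operators. Both function names and the example attribute names are hypothetical.

```python
from typing import Dict, List


def sns_to_sqs_policy(queue_arn: str, topic_arn: str) -> Dict:
    """Queue access policy allowing one SNS topic, and only that topic, to send."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowTopicToSend",
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Scope the grant to the specific topic rather than any SNS caller.
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }


def matches_filter(filter_policy: Dict[str, List[str]], attributes: Dict[str, str]) -> bool:
    """Simplified check: every policy key must be present with an allowed value.

    Assumption: exact string matching only; other SNS operators are not modeled.
    """
    for name, allowed_values in filter_policy.items():
        if attributes.get(name) not in allowed_values:
            return False
    return True
```

With a filter policy of `{"event_type": ["order_placed"]}`, a message whose `event_type` attribute is `order_cancelled`, or that lacks the attribute entirely, is not delivered to that subscription. That is a filter mismatch, not a permission problem, and it leaves the queue empty without any error at the publisher.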