Idempotency Receipt Stores: Making Retries Safe by Construction

Retries are everywhere in automation, and most of them are unsafe. A workflow engine retries a failed activity. A client retries a request that timed out, even though the server processed it and only the response was lost. A queue redelivers a message. Each of these can be a second execution of an operation that already happened, and if the operation has a side effect (a charge, a provision, a notification), the retry duplicates it. The customer gets billed twice. Two servers get provisioned. The alert fires again.

An idempotency key is the standard fix: the caller attaches a unique key, and the server promises that the same key produces the same result without re-running the side effect. But the key is only a promise; the receipt store behind it is what keeps the promise, and that store is where most implementations quietly fail. Building it correctly is a small amount of code wrapped around one genuinely hard requirement: atomicity. AI drafts the structure well; you verify the atomicity under real concurrency, because that’s the part that fails silently.

The Atomic Claim Is Everything

The naive receipt store checks whether the key exists, sees it doesn’t, and proceeds to run the operation. This is wrong, and it’s wrong in a way that passes every single-threaded test. Two concurrent retries with the same key both run the check, both see “not found,” and both execute. The side effect happens twice. The fix is to make the transition from “key absent” to “key claimed” a single atomic operation — a conditional insert or compare-and-set — so exactly one request wins.

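import hashlib

# store, now, Conflict and InProgress are assumed to be defined by the surrounding application

def body_digest(request_body):
    # built-in hash() is salted per process for str/bytes, so replicas would disagree;
    # a SHA-256 digest is stable (assumes request_body is bytes; encode str bodies first)
    return hashlib.sha256(request_body).hexdigest()

def with_idempotency(key, request_body, operation):
    digest = body_digest(request_body)
    # atomic claim: only ONE concurrent request transitions absent -> in_progress
    claimed = store.insert_if_absent(key, {
        "status": "in_progress",
        "body_hash": digest,
        "created_at": now(),
    })
    if not claimed:
        existing = store.get(key)
        if existing["body_hash"] != digest:
            raise Conflict("key reused with different body")   # client bug: surface it
        if existing["status"] == "in_progress":
            raise InProgress("retry shortly")                  # first attempt still running
        return existing["result"]                              # replay original outcome

    result = operation()                                       # we won the claim: run once
    store.update(key, {"status": "succeeded", "result": result})
    return result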

The insert_if_absent is the linchpin. In Postgres it’s an INSERT ... ON CONFLICT DO NOTHING, where the affected-row count (1 or 0) tells the caller whether it won; in Redis it’s SET key value NX, which returns nil when the key already exists; in DynamoDB it’s a conditional write with an attribute_not_exists condition. What it must never be is a separate read followed by a write, because the gap between them is the race that lets two requests both think they’re first. When a model drafts a receipt store, this is the line to scrutinize: a check-then-write disguised as a claim is the most common and most dangerous bug.
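
As a rough illustration of a claim that really is a single statement, the sketch below issues an INSERT ... ON CONFLICT DO NOTHING and reads the affected-row count to learn who won. It uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs anywhere; the receipts table, the idem_key column, and the claim function are hypothetical names chosen for this example.

```python
import sqlite3
import time

def open_store() -> sqlite3.Connection:
    # In-memory SQLite stands in for Postgres here (an assumption for portability).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE receipts ("
        " idem_key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " body_hash TEXT NOT NULL,"
        " created_at REAL NOT NULL)"
    )
    return conn

def claim(conn: sqlite3.Connection, key: str, body_hash: str) -> bool:
    """Return True only for the caller whose insert actually created the row."""
    cur = conn.execute(
        "INSERT INTO receipts (idem_key, status, body_hash, created_at)"
        " VALUES (?, 'in_progress', ?, ?)"
        " ON CONFLICT (idem_key) DO NOTHING",
        (key, body_hash, time.time()),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the conflict clause skipped it.
    return cur.rowcount == 1

conn = open_store()
first = claim(conn, "order-123", "abc")   # wins the claim
second = claim(conn, "order-123", "abc")  # same key: the insert is skipped
```

There is no SELECT before the INSERT: the primary-key constraint does the checking, so the database decides the winner rather than application code.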

Decide the Awkward Cases Explicitly

A receipt store forces decisions that teams usually leave implicit and regret. The code above handles three of them deliberately. A key reused with a different body is a conflict, not a silent replay — returning the original result for a genuinely different request hides a client bug and can return the wrong answer entirely. A key that’s still in_progress means a concurrent first attempt hasn’t finished, so the second request waits or gets told to retry, rather than racing it. And a succeeded key replays the stored result so the retry gets the same answer the original did.

The one case the snippet leaves open is failure: if the first attempt failed, is the key burned or retryable? As written, an exception from operation() leaves the receipt in in_progress, so every retry is told to retry shortly until the receipt expires or is cleaned up. There’s no universal answer, which is exactly why it must be a conscious choice. Burn it and a transient failure permanently blocks a request that would succeed on a fresh try. Make it retryable and you must be sure the failed attempt left no partial side effect. This mirrors the poison-message and DLQ distinction between transient and permanent failure, applied at the single-request level.
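
One way to make that choice explicit is sketched below. It assumes the application can tell transient failures from permanent ones; TransientError, InMemoryReceipts, and run_claimed are hypothetical names for this example, and the in-memory store only illustrates the interface. Transient failures release the key so a fresh attempt can claim it, while any other exception burns the key by recording a failed receipt.

```python
import threading

class TransientError(Exception):
    """Hypothetical marker for failures assumed to leave no side effect behind."""

class InMemoryReceipts:
    """Toy store; a lock makes each method atomic within one process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}

    def insert_if_absent(self, key, row):
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = dict(row)
            return True

    def get(self, key):
        with self._lock:
            row = self._rows.get(key)
            return dict(row) if row is not None else None

    def update(self, key, fields):
        with self._lock:
            self._rows[key].update(fields)

    def delete(self, key):
        with self._lock:
            self._rows.pop(key, None)

def run_claimed(store, key, operation):
    """Call after winning the claim for `key`; records the outcome either way."""
    try:
        result = operation()
    except TransientError:
        # Retryable policy: release the key so a fresh attempt can claim it.
        store.delete(key)
        raise
    except Exception as exc:
        # Burned policy: keep the receipt so later retries see the recorded failure.
        store.update(key, {"status": "failed", "error": repr(exc)})
        raise
    store.update(key, {"status": "succeeded", "result": result})
    return result
```

The retryable branch is only safe if a TransientError really does mean nothing was charged or provisioned; when that cannot be guaranteed, burning the key is the more conservative default.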

Prompt: “Design an idempotency receipt store for a payment-capture endpoint. The datastore is Postgres. Write the atomic claim as an INSERT … ON CONFLICT, model receipt states (in_progress, succeeded, failed), and specify behavior for: a duplicate key, a concurrent same-key request, a reused key with a different body, and a retry after a failed first attempt. Add a TTL strategy and a concurrency test plan that fires N simultaneous requests and asserts exactly one capture.”

What it returns: a schema, an ON CONFLICT claim, a state diagram, explicit handling for each awkward case, and a concurrency test plan. The failed-attempt semantics and the conflict-on-different-body are the parts to read carefully — they’re the decisions, not the boilerplate.

TTL and Cleanup

A receipt store that never forgets grows forever. Set a TTL appropriate to how long retries can plausibly arrive — a few hours for a synchronous API, longer for a queue that might redeliver after an outage. The consistency level matters too: the claim’s atomicity guarantee only holds if the store provides it, so a cache with eventual consistency is the wrong backing store for the claim even if it’s fine for the cached result. Choose a store whose conditional write is genuinely atomic, and set the TTL long enough that no legitimate retry outlives its receipt.
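
A cleanup job can be as small as one DELETE keyed on the receipt's creation time. The sketch below again uses sqlite3 as a stand-in datastore; the six-hour RECEIPT_TTL_SECONDS value, the table layout, and purge_expired are assumptions for illustration, not recommendations.

```python
import sqlite3

# Assumed value; size it to the longest window in which a retry can still arrive.
RECEIPT_TTL_SECONDS = 6 * 60 * 60

def purge_expired(conn, now, ttl=RECEIPT_TTL_SECONDS):
    """Delete receipts older than the TTL and return how many were removed."""
    cur = conn.execute("DELETE FROM receipts WHERE created_at < ?", (now - ttl,))
    conn.commit()
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE receipts (idem_key TEXT PRIMARY KEY, status TEXT, created_at REAL)"
)
now = 1_000_000.0  # fixed clock value so the example is reproducible
conn.executemany(
    "INSERT INTO receipts VALUES (?, ?, ?)",
    [
        ("old", "succeeded", now - RECEIPT_TTL_SECONDS - 1),  # past the TTL
        ("fresh", "succeeded", now - 60),                     # still inside it
    ],
)
conn.commit()

removed = purge_expired(conn, now)
remaining = [row[0] for row in conn.execute("SELECT idem_key FROM receipts")]
```

Note that this version deletes by age regardless of status, so an in_progress receipt older than the TTL is removed too; whether that is acceptable depends on how long an operation can legitimately run.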

Verify Under Real Concurrency

This is the verification that actually matters, and it’s the one a quick test skips. Fire many simultaneous requests with the same idempotency key at the endpoint in staging, and assert that exactly one side effect occurred — one charge, one provision, one row. A single-threaded test will pass even on a broken check-then-write store, giving you false confidence; only real concurrency exposes the race. An idempotency layer that isn’t atomic is worse than none, because it makes everyone believe retries are safe right up until the retry storm that double-charges.
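
The shape of such a test can be sketched in-process before pointing it at staging. In the example below, threads stand in for simultaneous requests, a dict stands in for the receipt store, and a barrier deliberately forces the worst-case interleaving on the check-then-write variant; all names are hypothetical, and a real test would fire HTTP requests at the endpoint instead.

```python
import threading

N = 8  # number of simultaneous same-key requests to simulate

def _run_threads(n, target):
    threads = [threading.Thread(target=target) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def run_naive(n=N):
    """Check-then-write: returns how many times the side effect ran."""
    receipts, effects = {}, []
    gap = threading.Barrier(n)  # every thread finishes its check before any write

    def handler():
        seen = "k" in receipts       # separate read...
        gap.wait()
        if not seen:
            receipts["k"] = "done"   # ...followed by a write
            effects.append(1)        # the side effect

    _run_threads(n, handler)
    return len(effects)

def run_atomic(n=N):
    """Atomic claim: returns how many times the side effect ran."""
    receipts, effects = {}, []
    lock = threading.Lock()
    start = threading.Barrier(n)     # release all threads together

    def handler():
        start.wait()
        with lock:                   # check and write happen as one step
            claimed = "k" not in receipts
            if claimed:
                receipts["k"] = "in_progress"
        if claimed:
            effects.append(1)        # the side effect

    _run_threads(n, handler)
    return len(effects)

naive_count = run_naive()    # 8: every request saw "not found"
atomic_count = run_atomic()  # 1: exactly one request won the claim
assert atomic_count == 1
```

The same handlers called one at a time would each report a single side effect, which is why the single-threaded test cannot tell the two stores apart.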

The pattern holds with the rest of AI for Automation: the model drafts the schema, the claim, the state machine, and the test plan quickly and competently, but the atomic claim and the failed-attempt semantics are load-bearing, and the only proof is concurrent traffic. For the design checklist, see the idempotency receipt store prompt and the webhook dedupe companion.
