CloudOps
AI for Automation · By James Joyner IV · 10 min read

The Transactional Outbox Pattern for Reliable Event Automation

Your automation wrote to the database but the event publish failed — now downstream is out of sync. The outbox pattern makes state changes and events atomic.

  • #automation
  • #outbox
  • #event-driven
  • #messaging
  • #reliability

The bug that taught me the outbox pattern was maddening because the code looked obviously correct. The service updated an order in Postgres, then published an order.shipped event to the message broker so downstream automation could notify the customer and update inventory. Update, then publish. Two lines. What could go wrong?

What went wrong: the database commit succeeded, and then — before the publish — the process crashed. The order was shipped in our database, but no event ever fired. The customer was never notified and inventory never decremented. The database and the event stream had silently diverged, and there was no error anywhere to tell us. This is the dual-write problem, and the transactional outbox pattern is the clean fix.

Why “write then publish” is broken

The trap is that your database and your broker are two separate systems with no shared transaction. Any time you make two independent writes that must either both happen or both not happen, you have a dual-write problem, and there’s no ordering that saves you:

  • Publish then write: the publish succeeds, the write fails. Downstream acts on an event for a state change that never persisted.
  • Write then publish: the write commits, the publish fails (or the process dies). The state changed but nobody downstream hears about it.

Either way, a crash in the gap leaves your database and your event stream inconsistent, with no error to alert you. Retries do not fix this, because the process that would retry is the one that crashed, and reordering only trades one failure for the other. You fix it by making the two writes one atomic write.
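To make the gap concrete, here is a minimal simulation of the "write then publish" ordering. Everything in it is a stand-in: the database is a dict, the broker is a list, and the `Crash` exception and `crash_before_publish` flag are hypothetical names used only to model the process dying between the two writes.

```python
class Crash(Exception):
    """Stands in for the process dying between the two writes."""


def ship_order(db, broker, order_id, crash_before_publish=False):
    # Write 1: the state change lands in the "database" (a plain dict here).
    db[order_id] = "shipped"
    if crash_before_publish:
        raise Crash("process died after commit, before publish")
    # Write 2: the event goes to the "broker" (a plain list here).
    broker.append({"topic": "order.shipped", "order_id": order_id})


db, broker = {}, []
try:
    ship_order(db, broker, 42, crash_before_publish=True)
except Crash:
    pass  # in production nothing catches this; the process is simply gone

print(db)      # {42: 'shipped'}
print(broker)  # []
```

The database says the order shipped and the broker holds nothing, which is the silent divergence described above.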

The outbox: one transaction, two effects

The insight is simple. Instead of publishing to the broker directly from your business code, you insert the event into an outbox table in the same database transaction as your state change. Because it is one transaction, it is atomic: either both the state change and the event row commit, or neither does.

BEGIN;
  UPDATE orders SET status = 'shipped' WHERE id = 42;
  INSERT INTO outbox (id, topic, payload, created_at)
    VALUES (gen_random_uuid(), 'order.shipped',
            '{"order_id": 42}', now());
COMMIT;

Now there’s no gap. If the transaction commits, the event is durably recorded alongside the state change. If anything fails, both roll back. The database’s own atomicity guarantee — the thing you already trust — now covers the event too.
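The same idea can be tried locally without a Postgres instance. The sketch below uses Python's built-in `sqlite3` module as a stand-in database; the table layout (including the `published_at` column the relay needs) and the `fail_midway` flag are assumptions for illustration, not a prescribed schema.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        published_at TEXT
    );
    INSERT INTO orders (id, status) VALUES (42, 'pending'), (43, 'pending');
""")


def ship_order(conn, order_id, fail_midway=False):
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE orders SET status = 'shipped' WHERE id = ?", (order_id,)
        )
        if fail_midway:
            raise RuntimeError("simulated failure before the outbox insert")
        conn.execute(
            "INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "order.shipped", json.dumps({"order_id": order_id})),
        )


ship_order(conn, 42)
try:
    ship_order(conn, 43, fail_midway=True)
except RuntimeError:
    pass

statuses = dict(conn.execute("SELECT id, status FROM orders ORDER BY id"))
outbox_count = conn.execute("SELECT count(*) FROM outbox").fetchone()[0]
print(statuses)      # {42: 'shipped', 43: 'pending'}
print(outbox_count)  # 1
```

Order 43 fails halfway and its status update rolls back with it, so there is never a shipped order without a matching outbox row.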

The relay: getting events out of the outbox

The event is safely in the table; now a separate process — the relay — reads unpublished outbox rows and publishes them to the broker, marking them done:

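def relay():
    # One pass over the backlog; run this in a loop with a short sleep between passes.
    rows = db.query(
        "SELECT id, topic, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY created_at LIMIT 100"
    )
    for row in rows:
        # Send the outbox row id with the message so consumers can dedupe on it.
        broker.publish(row.topic, {"id": str(row.id), "payload": row.payload})
        # Mark the row only after the publish succeeded.
        db.execute("UPDATE outbox SET published_at = now() WHERE id = %s", (row.id,))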

The relay runs continuously. If it crashes mid-batch, it just resumes — rows it didn’t mark stay NULL and get picked up next pass. This is where the guarantee shifts: the outbox gives you at-least-once delivery. The relay might publish a row, crash before marking it, and publish it again on restart.
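The duplicate-on-restart behavior is easy to reproduce. This is a sketch under stated assumptions: `sqlite3` stands in for Postgres, a plain list stands in for the broker, and `relay_once` and its `crash_before_mark` flag are hypothetical names. Attaching the outbox row id to each message is also an assumption of this sketch, made so that consumers have something to dedupe on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, topic TEXT, payload TEXT,"
    " created_at TEXT DEFAULT CURRENT_TIMESTAMP, published_at TEXT)"
)
with conn:
    conn.execute(
        "INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
        ("evt-1", "order.shipped", '{"order_id": 42}'),
    )

published = []  # stand-in for the broker


def relay_once(conn, publish, crash_before_mark=False):
    """Publish every unpublished row once, oldest first; return rows seen."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY created_at, id LIMIT 100"
    ).fetchall()
    for event_id, topic, payload in rows:
        publish({"id": event_id, "topic": topic, "payload": payload})
        if crash_before_mark:
            raise RuntimeError("simulated crash after publish, before mark")
        with conn:  # mark the row only after the publish succeeded
            conn.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
                (event_id,),
            )
    return len(rows)


try:
    relay_once(conn, published.append, crash_before_mark=True)
except RuntimeError:
    pass
relay_once(conn, published.append)  # restart: the row is still unpublished
relay_once(conn, published.append)  # nothing left to publish

print([m["id"] for m in published])  # ['evt-1', 'evt-1']
```

The event reaches the broker twice and never zero times, which is what at-least-once means in practice.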

Pro Tip: For production, prefer change-data-capture (Debezium tailing the Postgres WAL) over polling the outbox table. Polling adds latency and database load; CDC streams committed outbox rows to the broker in near-real-time with no SELECT loop. Start with polling to understand the pattern, then graduate to CDC when latency or load matters.

At-least-once means consumers must be idempotent

The outbox guarantees every event is published at least once — never zero times, possibly twice. That pushes a requirement downstream: every consumer of these events must be idempotent, keyed on the event’s id. A redelivered order.shipped must not notify the customer twice or decrement inventory twice.

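def on_order_shipped(event):
    # put_if_absent is assumed to return False when the id is already recorded.
    if not processed.put_if_absent(event["id"]):
        return        # already handled this event id
    # Caveat: a crash after the id is recorded but before these calls finish
    # means the redelivery is skipped; where the store allows it, record the
    # id in the same transaction as the effects.
    notify_customer(event)
    decrement_inventory(event)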

The outbox solves the producer-side dual-write; consumer idempotency solves the at-least-once duplication it introduces. You need both. The outbox without idempotent consumers just moves the inconsistency downstream.
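One way to get consumer idempotency is to record the event id in the same database transaction as the side effect, so the dedupe record and the effect commit or roll back together. The sketch below assumes a `processed_events` table, an `inventory` table, and a `sku` field on the event; all three are hypothetical, and `sqlite3` again stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER NOT NULL);
    INSERT INTO inventory (sku, qty) VALUES ('widget', 10);
""")


def on_order_shipped(conn, event):
    """Return True if the event was applied, False if it was a duplicate."""
    with conn:  # dedupe record and side effect commit or roll back together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["id"],),
        )
        if cur.rowcount == 0:
            return False  # this event id was already handled
        conn.execute(
            "UPDATE inventory SET qty = qty - 1 WHERE sku = ?", (event["sku"],)
        )
        return True


event = {"id": "evt-1", "sku": "widget"}
first = on_order_shipped(conn, event)
second = on_order_shipped(conn, event)  # redelivery of the same event
qty = conn.execute("SELECT qty FROM inventory WHERE sku = 'widget'").fetchone()[0]
print(first, second, qty)  # True False 9
```

Effects that live outside the database, such as sending the customer notification, cannot join this transaction, so they need their own idempotency key or must tolerate an occasional repeat.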

Where AI helps

The outbox plumbing (the table schema, the relay loop, the consumer dedupe) is mechanical and well-documented, which makes it the kind of work you would hand to a fast junior engineer. I describe my stack to Claude or Copilot and it drafts the migration, the relay, and the idempotent consumer scaffolding cleanly.

The judgment that stays human is which state changes deserve an outbox event at all, and what each event’s contract is. An event is a public API for your downstream automation; getting its shape wrong ripples through every consumer. The model can draft the relay, but it doesn’t own the event schema decisions or the operational rollout — and it never touches the production database credentials. Generated relay code runs against a local Postgres first. I keep my outbox prompts in the prompt workspace so they start from reviewed patterns.

Operating the outbox

The outbox has one failure mode you must watch: the relay falling behind or stopping. If the relay dies, events accumulate in the table unpublished — your database is correct, but downstream is silently starving. So the metric that matters is outbox lag: the age of the oldest unpublished row.

SELECT now() - min(created_at) AS lag FROM outbox WHERE published_at IS NULL;

I alert on that lag crossing a threshold through the monitoring-alerts dashboard. A growing outbox is the early warning that your event automation has quietly stalled, well before any customer notices the missing notifications.
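A lag check along these lines can run on a schedule and feed whatever alerting you already have. This is a sketch, not a finished monitor: `outbox_lag` and `lag_alert` are hypothetical helpers, the five-minute threshold is an arbitrary example value, `sqlite3` stands in for Postgres, and `created_at` is assumed to be stored as an ISO 8601 UTC timestamp.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def outbox_lag(conn, now=None):
    """Age of the oldest unpublished row, or zero if the outbox is drained."""
    now = now or datetime.now(timezone.utc)
    (oldest,) = conn.execute(
        "SELECT min(created_at) FROM outbox WHERE published_at IS NULL"
    ).fetchone()
    if oldest is None:  # min() over no rows is NULL: nothing is waiting
        return timedelta(0)
    return now - datetime.fromisoformat(oldest)


def lag_alert(lag, threshold=timedelta(minutes=5)):
    # The threshold is an example value; tune it to your delivery expectations.
    return lag > threshold


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, created_at TEXT, published_at TEXT)"
)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
with conn:
    conn.execute(
        "INSERT INTO outbox VALUES (?, ?, NULL)",
        ("evt-1", (now - timedelta(minutes=12)).isoformat()),
    )

lag = outbox_lag(conn, now=now)
print(lag, lag_alert(lag))  # 0:12:00 True
```

Treating an empty result as zero lag matters: with no unpublished rows the query returns NULL, and the check should read that as healthy rather than as an error.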

The back-out and recovery story is clean too: because every event is durably in the table, a relay outage is fully recoverable. Fix the relay and it drains the backlog, idempotent consumers absorb any duplicates, and the systems re-converge. No events were ever lost — they were just waiting.

Conclusion

The transactional outbox pattern kills the dual-write problem by making your state change and its event a single atomic database transaction. A relay drains the outbox to the broker at-least-once, idempotent consumers absorb the duplicates, and outbox lag is the metric that tells you the pipeline is healthy. Let AI draft the plumbing, but own the event contracts and keep production credentials out of the model.

The automation category covers the companion patterns — idempotency keys, webhook fan-out, and DLQ triage — and the prompts library has reviewed templates for outbox relays and idempotent consumers.
