The Transactional Outbox With Change Data Capture: No More Ghost Events

The bug looked impossible. An order existed in the database, fully committed, but no fulfillment had started — as if the order had been placed and then forgotten. Days later we found its twin: a fulfillment that had started for an order the database had no record of, because the transaction had rolled back after the event already fired. Both bugs had the same root cause, the oldest mistake in event-driven systems: the dual write. The service wrote the order to its database and, separately, published an event to the broker. Two writes, two systems, no shared transaction. When one succeeded and the other didn’t, reality and the event stream diverged.

The transactional outbox pattern fixes this by refusing to do two writes. Instead, the service writes the business change and the event into the same database transaction, and a separate relay later forwards committed events to the broker. Either both writes commit or neither does, so the divergence is structurally impossible. AI is a strong drafting partner here because the pattern is well-defined, but the delivery semantics have sharp consequences, so you still have to verify them yourself.

One Transaction, Two Rows

The core move is writing the outbox row alongside the business change, atomically:

BEGIN;
  INSERT INTO orders (id, customer_id, total, status)
  VALUES ('ord_88213', 'cus_42', 19900, 'placed');

  INSERT INTO outbox (id, aggregate_id, event_type, payload, status, created_at)
  VALUES ('evt_1', 'ord_88213', 'order.placed',
          '{"order_id":"ord_88213","total":19900}', 'pending', now());
COMMIT;

If the transaction rolls back, the outbox row vanishes with the order — no ghost event. If it commits, the event is durably recorded, waiting to be published — no lost event. The discipline that makes this work is non-negotiable: the outbox insert must be inside the same transaction as the business change. The moment someone “optimizes” by inserting the outbox row after the commit, you have reintroduced the exact dual-write gap the pattern exists to close. This is the one thing to check first in any outbox implementation, including one a model drafts for you.
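
The same discipline in application code might look like the sketch below. It uses Python’s built-in sqlite3 as a stand-in for the production database; the place_order helper, the simplified table layout, and the fail_before_commit flag are illustrative assumptions, not part of any particular implementation.

```python
import json
import sqlite3

def make_db():
    """Create an in-memory database with simplified orders and outbox tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id TEXT PRIMARY KEY, customer_id TEXT, "
        "total INTEGER, status TEXT)"
    )
    conn.execute(
        "CREATE TABLE outbox (id TEXT PRIMARY KEY, aggregate_id TEXT, "
        "event_type TEXT, payload TEXT, status TEXT, "
        "created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.commit()
    return conn

def place_order(conn, order_id, customer_id, total, fail_before_commit=False):
    """Write the order and its event in one transaction; return True on commit."""
    payload = json.dumps({"order_id": order_id, "total": total})
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "INSERT INTO orders (id, customer_id, total, status) "
                "VALUES (?, ?, ?, 'placed')",
                (order_id, customer_id, total),
            )
            conn.execute(
                "INSERT INTO outbox (id, aggregate_id, event_type, payload, status) "
                "VALUES (?, ?, 'order.placed', ?, 'pending')",
                ("evt_" + order_id, order_id, payload),
            )
            if fail_before_commit:  # hypothetical flag simulating a mid-transaction failure
                raise RuntimeError("simulated failure before commit")
    except RuntimeError:
        return False
    return True
```

Calling place_order with fail_before_commit=True leaves both tables untouched, which is the property to assert in a test: a rolled-back order never has an outbox row.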

The Relay: Polling vs. Change Data Capture

A separate relay reads committed outbox rows and publishes them to the broker. There are two ways to build it, and the choice is a real trade-off. A polling relay queries WHERE status = 'pending' on an interval — simple, no extra infrastructure, but it adds query load and polling latency. A change-data-capture relay tails the database’s write-ahead log (via something like Debezium) and streams new outbox rows the instant they commit — lower latency and no polling load, at the cost of running and operating CDC.

Prompt: “I have an outbox table populated inside my order-write transaction, publishing to Kafka. Compare a polling relay versus a Debezium CDC relay for latency, ordering, and operational cost. Recommend one for a system doing ~500 writes/sec where 2-3s event latency is acceptable, and outline the relay’s crash-recovery behavior so an unpublished row is never skipped.”

What it returns: a trade-off table and a recommendation — typically polling for this throughput, since the latency budget is generous and CDC’s operational weight isn’t justified. The crash-recovery section is the part to read closely: it should confirm the relay marks a row published only after the broker acks.

For most teams a polling relay is the right starting point, and CDC is the upgrade you reach for when latency or query load becomes a real constraint. Whichever you choose, the relay records progress only after the broker acknowledges the event: a polling relay marks the row published after the ack, and a CDC relay should advance its position in the log only after the ack. Marking first and publishing second would drop events if the relay crashed in between.
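
A minimal sketch of one polling pass, assuming a publish callable that blocks until the broker acknowledges and raises on failure. The relay_once name, the publish signature, and ordering by SQLite’s implicit rowid (a stand-in for a monotonically increasing sequence column) are assumptions for illustration.

```python
import sqlite3

def relay_once(conn, publish, batch_size=100):
    """Publish one batch of pending rows; return how many were marked published."""
    rows = conn.execute(
        "SELECT rowid, id, aggregate_id, payload FROM outbox "
        "WHERE status = 'pending' ORDER BY rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for rowid, event_id, aggregate_id, payload in rows:
        # Assumed contract: publish returns only after the broker ack and
        # raises otherwise, so a failure leaves this row pending.
        publish(key=aggregate_id, event_id=event_id, payload=payload)
        with conn:  # mark the row only after the ack
            conn.execute(
                "UPDATE outbox SET status = 'published' WHERE rowid = ?",
                (rowid,),
            )
        sent += 1
    return sent
```

If publish raises partway through a batch, the exception propagates, the remaining rows stay pending, and the next pass picks them up. A crash between the ack and the UPDATE republishes that one row, which is the duplicate consumers have to tolerate.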

At-Least-Once Means Consumers Must Dedup

Here is the honest limitation the pattern does not hide: the outbox guarantees the event is delivered, not that it’s delivered exactly once. A relay that crashes after publishing but before marking the row will republish it on restart. So delivery is at-least-once end to end, and every consumer must be idempotent, deduping on a stable event ID.

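def handle(event):
    # seen, db, and process stand for the consumer's dedup store, database
    # handle, and business logic; a unique constraint on the stored event ID
    # is what stops two concurrent deliveries from both passing the check.
    if seen.exists(event["id"]):
        return                                    # already processed: duplicate delivery
    with db.transaction():
        process(event)
        seen.record(event["id"])                  # commits or rolls back with process()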

This is the same reasoning as idempotency keys for safe automation: delivery you can’t make exactly-once, you make safe-to-repeat. Recording the seen ID inside the same transaction as the processing keeps the dedup itself consistent. Ordering is preserved per aggregate by keying the broker partition on aggregate_id and having the relay publish rows in the order they were written, so all events for one order arrive in order even though global ordering across orders is neither guaranteed nor usually needed.
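
One way to make the dedup concrete is to let a primary key do the duplicate check inside the processing transaction. The sketch below again uses sqlite3 as a stand-in; the processed_events and fulfillments tables and the process callback signature are hypothetical.

```python
import sqlite3

def make_consumer_db():
    """Create an in-memory consumer database with a dedup table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE fulfillments (order_id TEXT)")
    conn.commit()
    return conn

def handle(conn, event, process):
    """Process an event at most once; return False for a duplicate delivery."""
    with conn:  # the dedup record and the side effects commit or roll back together
        try:
            # The primary key rejects a second insert of the same event ID.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["id"],),
            )
        except sqlite3.IntegrityError:
            return False  # already processed: duplicate delivery
        process(conn, event)  # must write through the same connection
    return True
```

Because the event ID is recorded in the same transaction as the side effects, a failure inside process rolls the ID back too, so a redelivery of that event is processed rather than skipped.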

Watch the Backlog

A subtle outbox failure is silent: the relay stalls, the application keeps committing orders and outbox rows, and from the application’s perspective everything is fine — while no events are being published and the backlog quietly grows. By the time anyone notices downstream automation has stopped, there’s a mountain of unpublished rows. So the most important metric is outbox lag: the count of pending rows and the age of the oldest one. Alert when either climbs, because a stalled relay is invisible from every other angle. Pair this with retention that archives published rows so the table doesn’t grow without bound.
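
A sketch of the lag check, assuming created_at is stored as Unix epoch seconds. The function names and the alert thresholds are placeholders to tune per system, not recommendations.

```python
import sqlite3
import time

def outbox_lag(conn, now=None):
    """Return (pending_count, age_in_seconds_of_oldest_pending_row)."""
    now = time.time() if now is None else now
    count, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox WHERE status = 'pending'"
    ).fetchone()
    # MIN() is NULL when nothing is pending, so report zero age in that case.
    age = 0.0 if oldest is None else max(0.0, now - oldest)
    return count, age

def should_alert(count, age, max_pending=1000, max_age_seconds=30.0):
    """Alert when either the backlog size or its oldest row exceeds a limit."""
    return count > max_pending or age > max_age_seconds
```

The age of the oldest pending row is usually the more telling of the two numbers: a burst of writes inflates the count briefly, while a stalled relay makes the age climb without stopping.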

Verify by Forcing Each Failure

The outbox earns its complexity only if it survives the failures it’s meant to survive, so test them on purpose. Roll back a transaction and confirm no event publishes. Kill the relay mid-batch and confirm, on restart, that every committed-but-unpublished row eventually publishes with no gaps. Stop the broker and confirm the backlog grows and alerts fire rather than events being lost. Deliver a duplicate and confirm the consumer ignores it. None of these appear on the happy path, which is precisely why ghost and lost events are so baffling when they finally surface in production.
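
The relay-crash case can be forced in a unit test. The sketch below uses an in-memory list as a stand-in broker and a crash_after parameter to simulate the relay dying between the broker ack and the row update; all names here are hypothetical.

```python
import sqlite3

class CrashAfterPublish(Exception):
    """Simulates the relay dying between the broker ack and the row update."""

def run_relay(conn, broker, crash_after=None):
    """Publish pending rows in order, optionally crashing after the Nth publish."""
    rows = conn.execute(
        "SELECT id FROM outbox WHERE status = 'pending' ORDER BY rowid"
    ).fetchall()
    for n, (event_id,) in enumerate(rows, start=1):
        broker.append(event_id)  # stand-in for publish plus broker ack
        if crash_after == n:
            raise CrashAfterPublish(event_id)
        with conn:
            conn.execute(
                "UPDATE outbox SET status = 'published' WHERE id = ?",
                (event_id,),
            )

def test_relay_crash_leaves_no_gaps():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, status TEXT)")
    with conn:
        conn.executemany(
            "INSERT INTO outbox VALUES (?, 'pending')",
            [("evt_1",), ("evt_2",), ("evt_3",)],
        )
    broker = []
    try:
        run_relay(conn, broker, crash_after=2)  # dies after publishing evt_2
    except CrashAfterPublish:
        pass
    run_relay(conn, broker)  # restart: picks up whatever is still pending
    # No gaps: every event reached the broker, and evt_2 arrived twice.
    assert broker == ["evt_1", "evt_2", "evt_2", "evt_3"]
    pending = conn.execute(
        "SELECT COUNT(*) FROM outbox WHERE status = 'pending'"
    ).fetchone()[0]
    assert pending == 0
    return broker
```

The duplicate evt_2 is the expected outcome rather than a defect: it is the at-least-once behavior that the consumer-side dedup exists to absorb.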

The model drafts the schema, the relay, and the consumer dedup quickly and often correctly, but the load-bearing guarantees (same-transaction insert, ack-before-mark, consumer idempotency, backlog alerting) are the ones you must verify, because each fails silently. For the design checklist see the transactional outbox prompt and the broader AI for Automation library.
