The Saga Pattern: Compensating Transactions for Ops Automation
Multi-step automation has no rollback button. Here's how the saga pattern and compensating transactions let your workflows unwind cleanly when step four fails.
The first time an automation half-completed a tenant provisioning run — created the namespace, attached the storage, then died before wiring up DNS — I learned that “just roll it back” is a fantasy. There was no transaction to roll back. Three real side effects had landed across three different systems, and the database had no idea any of it happened. I spent the next hour unwinding it by hand.
That mess is exactly what the saga pattern exists to prevent. A saga breaks a long, distributed operation into a sequence of steps, where each step has a matching compensating action that undoes it. If step four fails, you run the compensations for steps three, two, and one in reverse. No global transaction, no two-phase commit — just disciplined unwinding. In 25 years of automating infrastructure, this is the pattern I reach for whenever a workflow touches more than one system.
Why distributed ops can’t use a database transaction
When everything lives in one database, you wrap it in BEGIN ... COMMIT and the engine handles atomicity. Ops automation almost never has that luxury. A single provisioning flow might call the Kubernetes API, a cloud provider, a DNS service, and a billing system. None of them share a transaction boundary.
So you get partial failures: some steps committed, some did not, and nothing automatically reverses the committed ones. The saga pattern accepts this reality. Instead of pretending you have atomicity, you make every forward step reversible.
Anatomy of a saga
A saga is a list of (action, compensation) pairs:
saga = [
    Step(action=create_namespace, compensate=delete_namespace),
    Step(action=attach_storage, compensate=detach_storage),
    Step(action=configure_dns, compensate=remove_dns),
    Step(action=enable_billing, compensate=disable_billing),
]
The executor runs actions forward, recording each one as it succeeds. On failure it walks the completed steps backward and runs their compensations:
def run_saga(saga, ctx):
    completed = []
    try:
        for step in saga:
            step.action(ctx)
            completed.append(step)
    except Exception as e:
        # Capture the failing step now; the unwind loop uses its own variable
        # so it cannot overwrite which step actually failed.
        failed = step
        for done in reversed(completed):
            try:
                done.compensate(ctx)
            except Exception as ce:
                # Compensation itself failed: escalate, do not swallow
                page_human(done, ce, ctx)
        raise SagaAborted(failed_at=failed, cause=e) from e
The subtlety is in that inner except: a compensation can fail too. When it does, you cannot just log and move on. You have an orphaned side effect and you need a human. Silent compensation failures are how sagas turn into the exact mess they were meant to avoid.
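To try the executor end to end, here is a minimal self-contained sketch. The Step dataclass, the SagaAborted fields, the unwind helper, and the orphaned list are all my stand-ins for illustration (the article does not define them), and collecting orphaned step names takes the place of actually paging someone.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    # Hypothetical shape: a name plus the (action, compensation) pair.
    name: str
    action: Callable[[dict], None]
    compensate: Callable[[dict], None]


class SagaAborted(Exception):
    def __init__(self, failed_at, cause, orphaned):
        super().__init__(f"saga aborted at {failed_at}: {cause}")
        self.failed_at = failed_at
        self.cause = cause
        self.orphaned = orphaned  # steps whose compensation also failed


def unwind(completed, ctx):
    """Run compensations newest-first; return the names that failed to undo."""
    orphaned = []
    for done in reversed(completed):
        try:
            done.compensate(ctx)
        except Exception:
            orphaned.append(done.name)  # stand-in for paging a human
    return orphaned


def run_saga(saga, ctx):
    completed = []
    for step in saga:
        try:
            step.action(ctx)
        except Exception as exc:
            orphaned = unwind(completed, ctx)
            raise SagaAborted(step.name, exc, orphaned) from exc
        completed.append(step)


def record(label):
    """Build a fake action/compensation that just notes it ran."""
    return lambda ctx: ctx["trace"].append(label)


def fail_dns(ctx):
    raise RuntimeError("DNS API timeout")


saga = [
    Step("namespace", record("create_namespace"), record("delete_namespace")),
    Step("storage", record("attach_storage"), record("detach_storage")),
    Step("dns", fail_dns, record("remove_dns")),
]

ctx = {"trace": []}
try:
    run_saga(saga, ctx)
except SagaAborted as aborted:
    print(aborted.failed_at, ctx["trace"])
    # dns ['create_namespace', 'attach_storage', 'detach_storage', 'delete_namespace']
```

Note that remove_dns never runs: the failing step is not in the completed list, so if a failed action can leave partial side effects, its compensation needs to be invoked separately or the action needs to clean up after itself.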
Compensations are not rollbacks
A rollback restores the previous state byte-for-byte. A compensation is a new forward action that semantically reverses the original. They are not the same thing, and conflating them causes bugs.
If your action sent a “tenant created” email, the compensation is not “un-send the email” — that is impossible. It is “send a tenant-creation-cancelled email.” If the action charged a card, the compensation is a refund, which itself is a new transaction with its own ID and its own failure modes.
This means compensations must be idempotent and tolerant of “the thing I’m undoing was never fully done.” Your delete_namespace compensation should succeed even if the namespace is already gone or was only partly created, because compensations get retried after timeouts and crashes, and a step that reported failure may still have left side effects behind.
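As a rough illustration of that tolerance, the sketch below treats "already gone" as success. FakeCluster and NotFound are in-memory stand-ins I made up for the example, not a real client library, and the string return values exist only to make the behaviour visible.

```python
class NotFound(Exception):
    """Raised by the hypothetical FakeCluster when a namespace is missing."""


class FakeCluster:
    """In-memory stand-in for a cluster API; not a real client."""

    def __init__(self):
        self.namespaces = set()

    def create(self, name):
        self.namespaces.add(name)

    def delete(self, name):
        if name not in self.namespaces:
            raise NotFound(name)
        self.namespaces.remove(name)


def delete_namespace(cluster, ctx):
    """Compensation that is safe to run twice, or after a partial create."""
    name = ctx.get("namespace")
    if name is None:
        # The action never got far enough to record what it created.
        return "nothing-recorded"
    try:
        cluster.delete(name)
    except NotFound:
        # Already gone counts as done, not as an error.
        return "already-absent"
    return "deleted"


cluster = FakeCluster()
cluster.create("tenant-42")
ctx = {"namespace": "tenant-42"}

print(delete_namespace(cluster, ctx))  # deleted
print(delete_namespace(cluster, ctx))  # already-absent
print(delete_namespace(cluster, {}))   # nothing-recorded
```

Only the "not found" case is swallowed. Any other error still propagates, so a real outage during compensation reaches the escalation path instead of being mistaken for success.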
Pro Tip: Write the compensation in the same commit as the action, never as a follow-up ticket. A forward step without a tested compensation is a landmine. If you can’t describe how to undo a step, you don’t understand the step well enough to automate it.
Where AI fits — and where it absolutely does not
This is the kind of mechanical, pattern-heavy code an LLM drafts well. I treat tools like Claude or Cursor as a fast junior engineer here: hand it the action functions and ask it to draft the matching compensations and the executor loop. It will get the structure right and save you an hour of boilerplate.
What it cannot do is decide what reversing a step means in your business. Whether a failed billing step should refund, void, or credit is a domain decision a human owns. Whether a half-provisioned tenant should be torn down or left for manual cleanup is a judgment call with real money attached. The model proposes the mechanics; you own the semantics. And it never gets prod credentials — generated compensation code runs in CI against a sandbox first, every time.
I keep a reviewed set of saga prompts in my prompt workspace so the drafting starts from patterns I already trust rather than a blank page.
Approval gates and blast-radius scoping
Compensations are destructive by nature: they delete, detach, and refund. That makes them prime candidates for an approval gate. In any saga where compensation touches production data, I require the executor to pause for approval before running any compensation whose blast radius exceeds an auto-approve threshold:
def compensate_with_gate(step, ctx):
    if step.blast_radius > ctx.auto_threshold:
        approval = request_approval(step, ctx, timeout="15m")
        if not approval.granted:
            park_for_manual_cleanup(step, ctx)
            return
    step.compensate(ctx)
Scope the blast radius too. A saga that provisions one tenant should never be able to compensate other tenants’ resources because of a loose label selector. Constrain every compensation to the exact IDs the saga created, captured in the saga’s own context — not a broad query that could sweep up unrelated resources.
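A small sketch of the difference between a selector-driven cleanup and an ID-scoped one. The inventory dict, the SagaContext fields, and both function names are hypothetical, chosen only to make the contrast visible.

```python
from dataclasses import dataclass, field


@dataclass
class SagaContext:
    tenant: str
    # Exact IDs this saga created, appended as each action succeeds.
    created_volume_ids: list = field(default_factory=list)


def volumes_matching(inventory, **labels):
    """The risky approach: a selector returns everything that matches."""
    return sorted(
        vid for vid, lbls in inventory.items()
        if all(lbls.get(k) == v for k, v in labels.items())
    )


def detach_storage(inventory, ctx):
    """Scoped compensation: touch only the IDs recorded in the context."""
    removed = []
    for vid in ctx.created_volume_ids:
        if inventory.pop(vid, None) is not None:
            removed.append(vid)
    return removed


inventory = {
    "vol-a1": {"app": "db", "tenant": "acme"},    # created by this saga
    "vol-a2": {"app": "db", "tenant": "acme"},    # pre-existing, made by hand
    "vol-b1": {"app": "db", "tenant": "globex"},  # another tenant entirely
}
ctx = SagaContext(tenant="acme", created_volume_ids=["vol-a1"])

print(volumes_matching(inventory, app="db"))  # ['vol-a1', 'vol-a2', 'vol-b1']
print(detach_storage(inventory, ctx))         # ['vol-a1']
print(sorted(inventory))                      # ['vol-a2', 'vol-b1']
```

Even a tenant-scoped selector would have swept up vol-a2 here, which is why the recorded ID list is the safer source of truth for what to undo.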
Orchestrated vs. choreographed sagas
There are two ways to run sagas. Orchestrated sagas have a central coordinator (the executor above) that drives every step and knows the whole plan. Choreographed sagas have each service emit an event when it finishes, and the next service reacts — no central brain.
For ops automation I almost always prefer orchestration. You get one place to see the saga’s state, one place to add approval gates, and one log to read at 3am. Choreography scales better for very high-throughput, loosely-coupled systems, but it scatters the failure-handling logic across services and makes “what state is this saga in?” genuinely hard to answer. A durable execution engine like Temporal persists workflow state for you, so an orchestrated saga survives a worker crash and resumes where it stopped, though you still write the compensations yourself; for simpler flows, a single executor with a persisted step log is plenty.
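To make the contrast concrete, here is a toy choreographed flow. The EventBus class, the event names, and the two wiring functions are invented for this sketch; a real system would sit on a message broker rather than in-process callbacks.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process bus that also records every event it sees."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []

    def subscribe(self, event, handler):
        self.handlers[event].append(handler)

    def publish(self, event, payload):
        self.log.append(event)
        for handler in list(self.handlers[event]):
            handler(payload)


def wire_namespace_service(bus):
    bus.subscribe("tenant.requested",
                  lambda p: bus.publish("namespace.created", p))
    # The compensation lives here, as a reaction to another service's failure.
    bus.subscribe("storage.failed",
                  lambda p: bus.publish("namespace.deleted", p))


def wire_storage_service(bus, healthy=True):
    def on_namespace_created(payload):
        event = "storage.attached" if healthy else "storage.failed"
        bus.publish(event, payload)

    bus.subscribe("namespace.created", on_namespace_created)


bus = EventBus()
wire_namespace_service(bus)
wire_storage_service(bus, healthy=False)
bus.publish("tenant.requested", {"tenant": "acme"})
print(bus.log)
# ['tenant.requested', 'namespace.created', 'storage.failed', 'namespace.deleted']
```

No single function in this sketch contains the plan. The unwind order only exists as the sum of the subscriptions, which is the property that makes choreographed sagas harder to reason about during an incident.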
A back-out path that humans can drive
Even with perfect compensations, you need an escape hatch. Persist the saga’s step log so that if automated compensation fails, a human can see exactly which steps completed and run the remaining compensations manually. I expose this through the same operational surface I use for everything else, alongside the incident-response dashboard, so the on-call engineer can pick up a stuck saga without reverse-engineering it from logs.
The log entry per step should record the inputs, the output IDs, and the timestamp. That’s what turns “the provisioning broke, good luck” into “step three created storage volume vol-abc123; run detach on that and you’re clean.”
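One possible shape for that log, sketched as JSON lines. The field names, the saga ID, and the pending_compensations helper are assumptions for illustration; the only requirement taken from the text is that each entry carries inputs, output IDs, and a timestamp.

```python
import json
from datetime import datetime, timezone


def log_entry(saga_id, step, inputs, output_ids):
    """Serialize one completed step as a single JSON line."""
    return json.dumps({
        "saga_id": saga_id,
        "step": step,
        "inputs": inputs,
        "output_ids": output_ids,
        "completed_at": datetime.now(timezone.utc).isoformat(),
    })


def pending_compensations(log_lines, compensated):
    """Completed steps that still need undoing, newest first."""
    entries = [json.loads(line) for line in log_lines]
    return [e for e in reversed(entries) if e["step"] not in compensated]


log_lines = [
    log_entry("saga-7", "create_namespace", {"tenant": "acme"}, ["ns-acme"]),
    log_entry("saga-7", "attach_storage", {"size_gb": 50}, ["vol-abc123"]),
    log_entry("saga-7", "configure_dns", {"host": "acme.example.com"}, ["rec-9"]),
]

# Suppose automation already removed the DNS record, then got stuck.
for entry in pending_compensations(log_lines, compensated={"configure_dns"}):
    print(f"undo {entry['step']} on {entry['output_ids']}")
# undo attach_storage on ['vol-abc123']
# undo create_namespace on ['ns-acme']
```

In practice the lines would be appended to durable storage as each step completes, so the log survives the executor crashing mid-saga.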
Conclusion
The saga pattern is not glamorous, but it is the difference between automation you trust across system boundaries and automation that leaves you cleaning up by hand. Pair every forward action with a tested, idempotent compensation. Gate the destructive ones. Scope their blast radius to exactly what the saga created. Let AI draft the boilerplate, but keep the semantics — and the credentials — firmly in human hands.
If you’re building out this kind of resilient workflow, the automation category has more on orchestration and safe remediation, and the prompt packs include reviewed templates for generating compensation logic safely.