Building Reconciliation Loops for Self-Correcting Automation
Imperative scripts fire once and forget. Reconciliation loops continuously converge reality to desired state, so automation heals drift instead of hoping it never happens.
- #automation
- #reconciliation
- #controllers
- #drift
- #sre
Most automation I wrote early in my career was imperative and fire-once: a script that created three users, opened two firewall rules, and exited. It worked the moment it ran. Then someone manually deleted a firewall rule a week later, and nothing put it back, because the script had no concept of “make sure these rules exist” — only “create these rules now.” The system drifted, silently, until something broke.
Kubernetes controllers popularized a better model: don’t do a thing once; continuously converge reality toward a declared desired state. A reconciliation loop reads the desired state, observes the actual state, computes the diff, and makes only the changes needed to close the gap, over and over, for as long as it runs. This is the pattern that turns brittle one-shot automation into something that self-corrects, and you can apply it far beyond Kubernetes.
Imperative vs. declarative reconciliation
The mental shift is from steps to state. An imperative script says “create rule X.” A reconciliation loop says “rule X should exist” and figures out the steps each cycle.
The difference shows up exactly when something drifts. The imperative script already ran; it has no opinion about the rule being gone. The reconciliation loop notices the rule is missing on its next pass and recreates it. Drift correction isn’t a feature you bolt on — it’s an emergent property of the loop. That’s why this pattern is the backbone of self-healing automation.
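To make the contrast concrete, here is a minimal sketch using an in-memory set as a stand-in for a firewall API. The names `firewall_rules`, `create_rule_once`, and `ensure_rule` are hypothetical and exist only for illustration; a real provider SDK will look different.

```python
# Hypothetical in-memory stand-in for a firewall API; not a real SDK.
firewall_rules = set()

def create_rule_once(rule):
    """Imperative: perform the step now and hold no opinion afterwards."""
    firewall_rules.add(rule)

def ensure_rule(rule):
    """Declarative: compare state first, act only on a difference.

    Returns True when it had to change something.
    """
    if rule in firewall_rules:
        return False
    firewall_rules.add(rule)
    return True

create_rule_once("allow-443")
firewall_rules.discard("allow-443")  # someone deletes the rule by hand

# The imperative script has already exited, so the rule stays gone.
# A loop that calls ensure_rule on every pass repairs it, then goes quiet.
first_pass = ensure_rule("allow-443")   # drift found and repaired
second_pass = ensure_rule("allow-443")  # already matches, nothing to do
```

The first pass after the manual deletion restores the rule; the second finds nothing to change, which is the behavior you want from something that runs continuously.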
The four-step loop
Every reconciliation loop, whether it’s a Kubernetes controller or a cron job managing DNS records, has the same skeleton:
def reconcile(desired, observed_fn, apply_fn, delete_fn):
    observed = observed_fn()  # what actually exists
    # Index both sides by key so the diff is plain set arithmetic.
    desired_set = {r.key: r for r in desired}
    observed_set = {r.key: r for r in observed}
    to_create = desired_set.keys() - observed_set.keys()
    to_delete = observed_set.keys() - desired_set.keys()
    # Dicts do not support &, so intersect the key views instead.
    to_update = {k for k in desired_set.keys() & observed_set.keys()
                 if desired_set[k] != observed_set[k]}
    for k in to_create:
        apply_fn(desired_set[k])
    for k in to_update:
        apply_fn(desired_set[k])
    for k in to_delete:
        delete_fn(observed_set[k])
Read desired, observe actual, diff, converge. Run it on a timer or trigger it on events, and the system continuously self-corrects. The whole thing hinges on apply_fn being idempotent — applying a resource that already matches must be a no-op, because the loop will call it again and again.
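As a rough illustration of the idempotency requirement, the sketch below wraps a hypothetical in-memory DNS backend that counts writes. `DnsRecord`, `FakeDnsBackend`, `make_apply_fn`, and `run_loop` are all assumptions made for this example, and the `max_cycles` parameter exists only so the demo terminates.

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class DnsRecord:
    key: str    # e.g. "www.example.com/A"
    value: str

class FakeDnsBackend:
    """Hypothetical in-memory backend that counts how often it is written to."""
    def __init__(self):
        self.records = {}
        self.writes = 0

    def put(self, record):
        self.records[record.key] = record
        self.writes += 1

def make_apply_fn(backend):
    def apply_fn(record):
        # Idempotent: skip the write when the backend already matches.
        if backend.records.get(record.key) == record:
            return
        backend.put(record)
    return apply_fn

def run_loop(reconcile_once, interval_seconds, max_cycles=None):
    """Call reconcile_once on a fixed interval; max_cycles is for demos and tests."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        reconcile_once()
        cycles += 1
        time.sleep(interval_seconds)

backend = FakeDnsBackend()
apply_fn = make_apply_fn(backend)
record = DnsRecord("www.example.com/A", "192.0.2.10")

# Three cycles against an unchanged desired state produce a single write.
run_loop(lambda: apply_fn(record), interval_seconds=0, max_cycles=3)
```

Counting writes is a cheap way to test that repeated cycles are genuinely no-ops rather than silent rewrites.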
The delete branch is where it gets dangerous
That to_delete set is the scary part, and it’s where reconciliation loops bite people. The loop will delete anything that exists but isn’t in the desired state. If your desired list is computed wrong — a config file fails to load and comes back empty, an API returns a partial list — the loop concludes that everything should be deleted and cheerfully tears down your whole environment.
This is a real and recurring outage pattern: a reconciliation loop that “helpfully” deleted production because its desired state momentarily evaluated to empty. Guard against it explicitly:
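class SafetyAbort(Exception):
    """Raised when a cycle looks too destructive to run unattended."""

# Run these checks after computing the diff and before calling delete_fn.
if len(desired) == 0 and len(observed) > 0:
    raise SafetyAbort("desired state empty but resources exist; refusing to delete all")
if len(to_delete) > MAX_DELETES_PER_CYCLE and not request_approval(to_delete):
    raise SafetyAbort("too many deletes in one cycle and approval was not granted")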
A loop that can delete unbounded resources with no sanity check is not automation — it’s a loaded gun on a timer.
Pro Tip: Run the loop in observe-only mode first. Have it log the create/update/delete sets it would apply, without applying them, for a few days. The diffs it reports are your reality check: if it’s proposing to delete things you know should stay, your desired-state computation is wrong, and you found out from a log instead of an outage.
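One way to build the observe-only mode is a `dry_run` flag that defaults to on, sketched below. This version assumes desired and observed state are plain dicts mapping a key to a spec, which is a simplification for the example; the `plan` helper and the log wording are likewise illustrative.

```python
import logging

logger = logging.getLogger("reconcile")

def plan(desired, observed):
    """Compute the diff as sorted key lists without touching anything."""
    shared = desired.keys() & observed.keys()
    return {
        "create": sorted(desired.keys() - observed.keys()),
        "update": sorted(k for k in shared if desired[k] != observed[k]),
        "delete": sorted(observed.keys() - desired.keys()),
    }

def reconcile(desired, observed, apply_fn, delete_fn, dry_run=True):
    changes = plan(desired, observed)
    for action, keys in changes.items():
        logger.info("%s %s: %s", "would" if dry_run else "applying", action, keys)
    if not dry_run:
        for key in changes["create"] + changes["update"]:
            apply_fn(key, desired[key])
        for key in changes["delete"]:
            delete_fn(key)
    return changes

applied = []
changes = reconcile(
    desired={"a": 1, "b": 2},
    observed={"b": 3, "c": 4},
    apply_fn=lambda key, spec: applied.append(("apply", key)),
    delete_fn=lambda key: applied.append(("delete", key)),
)
```

Defaulting `dry_run` to `True` means going live is a deliberate act: someone has to pass `dry_run=False` explicitly.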
Blast-radius scoping and rate limiting
A reconciliation loop touches everything in its scope every cycle, so scope it tightly. One loop per bounded domain — DNS records for one zone, firewall rules for one VPC — not one omni-loop that reconciles your entire infrastructure. Narrow scope means a bug in one loop can’t cascade across unrelated systems.
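A sketch of what tight scoping can look like in the diff itself: filter both sides down to the loop's domain before computing deletes, so resources owned elsewhere never appear as candidates. The `zone` field and the dict-shaped resources are assumptions for this example.

```python
def in_scope(resource, zone):
    """A resource belongs to this loop only if it lives in the loop's zone."""
    return resource["zone"] == zone

def scoped_deletes(desired, observed, zone):
    """Compute delete candidates using only resources inside the loop's scope."""
    desired_keys = {r["key"] for r in desired if in_scope(r, zone)}
    observed_keys = {r["key"] for r in observed if in_scope(r, zone)}
    return sorted(observed_keys - desired_keys)

observed = [
    {"key": "www", "zone": "example.com"},
    {"key": "old", "zone": "example.com"},
    {"key": "api", "zone": "example.org"},  # managed by a different loop
]
desired = [{"key": "www", "zone": "example.com"}]

deletes = scoped_deletes(desired, observed, zone="example.com")
```

The `api` record in the other zone is never considered for deletion, even though this loop's desired state does not mention it.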
Rate-limit the convergence too. When a loop detects a large diff, applying all of it at once can overwhelm downstream APIs or, if the diff is wrong, do maximum damage instantly. Cap changes per cycle and let the loop converge over several passes. Slower convergence with a bounded blast radius beats instant convergence that can’t be stopped.
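Capping changes per cycle can be as simple as slicing an ordered list of operations, as in this sketch. The `cap_changes` helper is hypothetical, and ordering creates and updates ahead of deletes is a design choice made here, not a rule.

```python
def cap_changes(to_create, to_update, to_delete, max_changes):
    """Take at most max_changes operations this cycle; the rest wait."""
    # Sorting keeps the order deterministic from one cycle to the next.
    ordered = (
        [("create", k) for k in sorted(to_create)]
        + [("update", k) for k in sorted(to_update)]
        + [("delete", k) for k in sorted(to_delete)]
    )
    return ordered[:max_changes], ordered[max_changes:]

batch, deferred = cap_changes(
    to_create={"a", "b"},
    to_update={"c"},
    to_delete={"d", "e"},
    max_changes=3,
)
```

The deferred operations do not need to be persisted anywhere. The next cycle recomputes the diff from scratch and finds whatever is still outstanding.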
Where AI fits
The loop’s structure (the diffing, the observe/apply functions, the safety guards) is well-trodden ground and a good task for a fast junior engineer. I’ll describe a resource type to Claude or Cursor and have it draft the observed_fn, an idempotent apply_fn, and the diff logic. It’s reliable at the mechanics.
What AI does not get is the authority to run the loop unattended against production with its own credentials. The dangerous part of a reconciliation loop is the destructive convergence, and that needs the same discipline as any automated action: scoped credentials, the empty-desired-state guard, and an approval gate above a threshold of deletes. The model proposes the loop; a human reviews the safety guards and owns the decision to let it run live. For anything that deletes, I keep the loop proposing through a gate rather than acting directly — the same proposer/approver split I apply to AI-suggested remediations. My reconciliation-loop prompts live in the prompt workspace.
Make the loop observable and interruptible
A loop that converges silently is hard to trust. Emit a metric every cycle for the size of each diff set, so a sudden spike in pending deletes is visible before the loop acts on it. I route that to the monitoring-alerts dashboard — a reconciliation loop suddenly wanting to delete forty resources is exactly the kind of signal that should page a human, not auto-execute.
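The per-cycle metrics might look something like the sketch below. The metric names, the `emit` callable, and the threshold of 10 are placeholders; in practice `emit` would be whatever your metrics client exposes.

```python
def diff_metrics(to_create, to_update, to_delete):
    """Summarize one cycle's diff as gauge-style numbers."""
    return {
        "reconcile_pending_creates": len(to_create),
        "reconcile_pending_updates": len(to_update),
        "reconcile_pending_deletes": len(to_delete),
    }

def report_cycle(to_create, to_update, to_delete, emit, delete_page_threshold):
    """Emit metrics before acting; return False when a human should look first."""
    metrics = diff_metrics(to_create, to_update, to_delete)
    for name, value in metrics.items():
        emit(name, value)
    return metrics["reconcile_pending_deletes"] < delete_page_threshold

emitted = {}
safe_to_proceed = report_cycle(
    to_create={"a"},
    to_update=set(),
    to_delete={f"rule-{i}" for i in range(40)},
    emit=emitted.__setitem__,  # stand-in for a real metrics client
    delete_page_threshold=10,
)
```

Emitting before applying matters: the spike shows up on the dashboard while there is still time to stop the cycle.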
Build in a kill switch too: a single flag the loop checks at the top of each cycle, so an on-call engineer can pause convergence instantly when something looks wrong. The back-out path for a misbehaving reconciliation loop is to stop reconciling, and that has to be one action away, not a code deploy away.
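A minimal sketch of such a kill switch, assuming the flag is a file whose presence means "paused". A feature-flag service or a config key would work the same way; the file name and the `run_cycle` helper are illustrative.

```python
import tempfile
from pathlib import Path

def run_cycle(pause_flag, reconcile_once):
    """Check the kill switch first; skip the whole cycle when it is set."""
    if Path(pause_flag).exists():
        return "paused"
    reconcile_once()
    return "reconciled"

calls = []
with tempfile.TemporaryDirectory() as tmp:
    flag = Path(tmp) / "reconcile.paused"
    before = run_cycle(flag, lambda: calls.append("ran"))
    flag.touch()   # on-call pauses the loop
    during = run_cycle(flag, lambda: calls.append("ran"))
    flag.unlink()  # on-call resumes it
    after = run_cycle(flag, lambda: calls.append("ran"))
```

Because the check happens at the top of every cycle, pausing takes effect on the next pass without restarting or redeploying anything.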
When not to use a reconciliation loop
Reconciliation loops are the right tool for managing sets of resources that should match a declared state. They’re the wrong tool for one-shot, irreversible actions — sending an email, charging a card, kicking off a deploy. Those are events, not states; wrapping them in a loop just risks doing them repeatedly. Reach for a saga or a durable workflow there instead. The loop’s superpower is convergence, which only makes sense when “the same end state” is well-defined and re-achievable.
Conclusion
Reconciliation loops turn fire-once automation into self-correcting systems that heal drift by continuously converging reality to a declared desired state. The pattern is four steps — read desired, observe actual, diff, apply — and the entire risk lives in the delete branch. Guard against empty desired state, cap deletes per cycle, scope each loop tightly, and make it observable and interruptible. Let AI draft the loop, but keep the safety guards, the credentials, and the decision to run it live with a human.
The automation category covers the related self-healing and drift patterns, and the prompt packs include reviewed reconciliation-loop templates with the safety guards built in.