Detecting and Fixing Infrastructure Config Drift

The promise of infrastructure as code is that the repo is the truth. The reality, on every team I’ve worked with, is that the repo is the truth until someone touches the console. That gap between what your code says and what’s actually running is config drift, and it’s the quiet reason “but it works in staging” turns into a 3am outage.

Here’s how I think about drift: where it comes from, how to detect it before it bites, and how to close the loop so it stops recurring.

What drift actually is

Drift is any divergence between your declared desired state and the real state of your infrastructure. A security group rule added by hand. A Kubernetes replica count bumped during an incident and never reverted. An autoscaling group whose AMI was patched out-of-band. A manually edited config file on a host Ansible “owns.”

None of these are malicious. They’re almost always someone solving a real problem under pressure. But each one makes your IaC a little more fictional, and the next apply either fails confusingly or silently reverts something important.

The three sources of drift

Knowing the source tells you how to prevent it.

Manual changes. The console, kubectl edit, SSH-and-vim. The classic. Caused by pressure, lack of access controls, or an IaC workflow that’s too slow to use during an incident.

Out-of-band automation. A separate tool — an autoscaler, a patching job, a cloud provider’s managed update — modifies a resource your IaC also manages. Two owners, one resource.

Provider-side changes. The cloud adds a default field, rotates a key or certificate, or changes a computed value. Your code didn’t change; the world did.

Detection: make drift visible on a schedule

You can’t fix what you can’t see. The core technique is a read-only reconcile: ask your IaC tool what it would change, without changing anything.

For Terraform-style tools, that’s a plan, which refreshes the real state and diffs it against your configuration; for Ansible, it’s --check --diff; for Kubernetes, it’s kubectl diff, which performs a server-side dry run. The key move is to run this on a schedule, not just before deploys:

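# CI cron: nightly drift check (Ansible example)
- name: Drift check
  command: ansible-playbook site.yml --check --diff
  register: result
  # Running the check never changes this host, so don't report it as changed.
  changed_when: false
  # Fail on a playbook error, or when the recap shows a non-zero changed= count.
  failed_when: result.rc != 0 or result.stdout is search('changed=[1-9]')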

If the check reports changes when nobody deployed, that’s drift. Alert on it. A nightly drift report that lands in Slack turns drift from an invisible time-bomb into a visible, triageable signal.
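For a Terraform-style tool, the same scheduled check can be sketched around `terraform plan -detailed-exitcode`, which exits 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. This is a minimal sketch, not a finished job: the function names and the `workdir` argument are illustrative assumptions, and it presumes `terraform init` has already been run in that directory.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {0: "in_sync", 1: "error", 2: "drift"}


def classify_plan_exit(returncode: int) -> str:
    """Map a plan exit code to a drift status; unknown codes count as errors."""
    return PLAN_OUTCOMES.get(returncode, "error")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` (hypothetical path) without applying anything."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(proc.returncode)
```

A nightly job would alert on "drift", and treat "error" as its own failure rather than as a clean result.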

Reading the diff with AI

A raw drift diff can be hundreds of lines of noise — reordered tags, normalized whitespace, computed fields — around the two changes that actually matter. This is a genuinely good use of AI.

Paste the diff and ask:

“This is a drift detection diff. Separate cosmetic/computed-field changes from substantive changes to actual configuration. For each substantive change, tell me the likely real-world cause and the blast radius if I revert it.”

The model is good at pattern-matching “this is just a provider default being populated” versus “someone widened this firewall rule.” It won’t be right 100% of the time, so treat it as triage, not a verdict — but it turns a 400-line diff into a three-item list you can actually act on. I keep these drift-triage prompts handy for exactly this.
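Some of the noise can be removed before the model ever sees it. The sketch below assumes a unified-style diff with `+` and `-` line prefixes; it drops removed/added pairs that are identical once whitespace is collapsed (which also catches lines that merely moved), then prepends the triage prompt. The helper names are hypothetical, and the filter is deliberately crude: a removed content line that itself begins with `--` would be mistaken for a file header and left in place.

```python
from collections import Counter

TRIAGE_PROMPT = (
    "This is a drift detection diff. Separate cosmetic/computed-field changes "
    "from substantive changes to actual configuration. For each substantive "
    "change, tell me the likely real-world cause and the blast radius if I "
    "revert it.\n\n"
)


def _is_change(line: str, sign: str) -> bool:
    """True for '+'/'-' content lines, excluding '+++'/'---' file headers."""
    return line.startswith(sign) and not line.startswith(sign * 3)


def _normalize(line: str) -> str:
    """Drop the diff sign and collapse runs of whitespace."""
    return " ".join(line[1:].split())


def drop_cosmetic_pairs(diff_text: str) -> str:
    """Remove removed/added line pairs that match after normalization."""
    lines = diff_text.splitlines()
    removed = Counter(_normalize(l) for l in lines if _is_change(l, "-"))
    added = Counter(_normalize(l) for l in lines if _is_change(l, "+"))
    cosmetic = removed & added  # multiset intersection: lines present on both sides
    budget = {"-": Counter(cosmetic), "+": Counter(cosmetic)}
    kept = []
    for line in lines:
        sign = line[:1]
        if sign in budget and _is_change(line, sign):
            key = _normalize(line)
            if budget[sign][key] > 0:
                budget[sign][key] -= 1  # consume one cosmetic occurrence
                continue
        kept.append(line)
    return "\n".join(kept)


def build_triage_prompt(diff_text: str) -> str:
    """Prefix the filtered diff with the triage instructions."""
    return TRIAGE_PROMPT + drop_cosmetic_pairs(diff_text)
```

Filtering first keeps the prompt short, but it is still worth skimming what was dropped the first few times to confirm nothing substantive was hidden.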

Fixing drift: two valid directions

When you find drift, you have two legitimate choices, and picking the right one matters.

Revert reality to match code. Re-run your IaC and let it overwrite the manual change. Correct when the drift was an unauthorized or accidental change. Dangerous if the manual change was load-bearing — reverting a hand-added firewall rule mid-incident makes things worse.

Update code to match reality. Import the manual change into your IaC and codify it. Correct when the change was legitimate and should persist. This is how you stop the same drift from recurring — you accept the new reality into the source of truth.

The wrong move is the third option everyone secretly takes: notice the drift, shrug, and let it accumulate. That’s how you end up afraid to run apply.
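The decision can be written down as a tiny rule so that it gets made explicitly each time. This is only a sketch of the reasoning above: the two boolean inputs are judgments a human supplies, and the `HOLD` outcome is an assumption of this example, a brief pause to confirm impact before picking one of the two real directions, not a licence to let drift sit.

```python
from enum import Enum


class Direction(Enum):
    REVERT = "revert reality to match code"
    CODIFY = "update code to match reality"
    HOLD = "confirm impact first, then revert or codify"  # assumption of this sketch


def choose_direction(legitimate: bool, load_bearing: bool) -> Direction:
    """Pick a fix direction for one substantive drift item."""
    if legitimate:
        # The change should persist, so the source of truth absorbs it.
        return Direction.CODIFY
    if load_bearing:
        # Unauthorized but something depends on it: reverting blindly is risky.
        return Direction.HOLD
    return Direction.REVERT
```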

Closing the loop so drift stops

Detection and fixing are firefighting. Prevention is the real win.

Lock down write access. If humans can’t edit production directly, manual drift mostly disappears. Read-only console access, changes only through the pipeline.

Continuous reconciliation. This is the GitOps answer: a controller (Argo CD, Flux) continuously compares live state to Git and either reverts drift automatically or flags it. Drift is detected in minutes, not at the next nightly run. I dig into this in the GitOps guides.

Resolve dual ownership. If an autoscaler and your IaC both manage replica counts, tell your IaC to ignore that field (Terraform’s lifecycle ignore_changes, or simply don’t manage it). One owner per attribute.
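At its core, a reconciler compares desired and live state field by field, and "one owner per attribute" means skipping the fields another controller owns. The sketch below models state as plain dictionaries, which is a simplification, and `find_drift` and `externally_owned` are hypothetical names rather than the API of Argo CD, Flux, or any other tool.

```python
from typing import Any


def find_drift(
    desired: dict[str, Any],
    live: dict[str, Any],
    externally_owned: frozenset[str] = frozenset(),
) -> dict[str, tuple[Any, Any]]:
    """Return {field: (desired, live)} for every managed field that differs.

    Fields present only in `live` are treated as unmanaged and ignored.
    A field missing from `live` is reported with a live value of None.
    """
    drift = {}
    for field, want in desired.items():
        if field in externally_owned:
            continue  # another controller (e.g. an autoscaler) owns this field
        have = live.get(field)
        if have != want:
            drift[field] = (want, have)
    return drift


desired = {"image": "api:1.4.2", "replicas": 3, "port": 8080}
live = {"image": "api:1.4.2", "replicas": 7, "port": 9090}

print(find_drift(desired, live))
print(find_drift(desired, live, frozenset({"replicas"})))
```

With `replicas` handed to the autoscaler, only the `port` change is reported, which is the drift a human actually needs to look at.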

Tag-then-reconcile for emergencies. Give people a fast, sanctioned break-glass path during incidents, with the rule that any emergency manual change is tagged as such when it is made and codified within 24 hours. People drift around your process when your process is too slow, so make the right path fast.
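The 24-hour rule is easy to check mechanically once emergency changes are recorded somewhere. This sketch assumes each change is captured as a small record with a timezone-aware timestamp; the `EmergencyChange` type, its fields, and where the records come from (resource tags, an incident log) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

CODIFY_DEADLINE = timedelta(hours=24)


@dataclass
class EmergencyChange:
    resource: str
    made_at: datetime        # timezone-aware time of the manual change
    codified: bool = False   # True once the change exists in the repo


def overdue_changes(
    changes: Iterable[EmergencyChange],
    now: Optional[datetime] = None,
) -> list[EmergencyChange]:
    """Return break-glass changes still uncodified after the deadline."""
    now = now or datetime.now(timezone.utc)
    return [
        c for c in changes
        if not c.codified and now - c.made_at > CODIFY_DEADLINE
    ]
```

Running this alongside the nightly drift check turns the rule from a good intention into something that pages someone.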

A practical starting point

You don’t need a platform tomorrow. Start here:

Add a scheduled drift check to CI and route the output to a channel.
When it fires, triage the diff (AI helps) into real vs. cosmetic.
For each real change, decide direction — revert or codify — and do it the same day.
Once that’s a habit, tighten write access to shrink the source.
Eventually, adopt continuous reconciliation to make detection near-instant.

Drift is never fully eliminated — the cloud is a living system. But it can be made visible and small instead of invisible and accumulating. That difference is the gap between IaC you trust and IaC you’re afraid of. Keep your triage prompts in a prompt library and make the nightly check non-negotiable.