Fixing Terraform State Drift Before It Bites You

Drift is the gap between what your Terraform code says exists and what actually exists in the cloud. Someone clicks something in the console at 3pm on a Friday, and now your code and reality disagree. In 25 years I’ve learned that drift isn’t a question of if — it’s how much, and whether you find out before or after it breaks an apply.

Here’s how I keep drift under control.

Where drift comes from

Drift has a handful of usual suspects:

Console cowboys — an engineer fixes something manually during an incident and never codifies it.
Other automation — autoscalers, operators, or backup tools mutating resources Terraform thinks it owns.
Provider-side defaults — the cloud filling in a field you didn’t specify, which Terraform then wants to “correct.”

The first two are real drift you must reconcile. The third is often noise you need to teach Terraform to ignore.

Detecting drift on a schedule

Don’t wait for an apply to discover drift. Run a read-only detection plan on a schedule:

terraform plan -detailed-exitcode -refresh-only

The -detailed-exitcode flag returns 2 when there are changes, which makes it trivial to wire into CI:

- name: Drift check
  run: |
    terraform plan -detailed-exitcode -refresh-only || \
      [ $? -eq 2 ] && echo "DRIFT DETECTED" && exit 1

I run this nightly across the estate. A drift report every morning beats a surprise during a deploy.

Reading a refresh-only plan correctly

-refresh-only is the key tool people underuse. It updates state to match reality without proposing to change reality. The output tells you precisely what changed out-of-band:

~ resource "aws_security_group" "api" {
    ~ ingress = [
        + { from_port = 22, cidr_blocks = ["0.0.0.0/0"] },
      ]
  }

That’s not Terraform wanting to add an SSH rule — that’s Terraform telling you someone already did, in the console. Now you decide: is this a legitimate change to codify, or an unauthorized one to revert?

Reconciling: codify or revert

Once you know what drifted, you have two honest choices:

Codify it. The manual change was correct — add it to your HCL so code matches reality. Then a normal apply is a no-op.
Revert it. The change was unauthorized or wrong — run a standard apply and let Terraform put reality back to what the code says.

What you must not do is ignore it. Unreconciled drift compounds. Six months of “we’ll deal with it later” turns a five-minute reconcile into a multi-day archaeology project.

To accept the current reality into state without changing infrastructure:

terraform apply -refresh-only

Taming provider-side noise

Some drift isn’t real. A cloud provider sets a default tag or normalizes a value, and Terraform wants to “fix” it every plan. Silence that with ignore_changes:

resource "aws_autoscaling_group" "workers" {
  desired_capacity = 3
  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

Here I’m telling Terraform: the autoscaler owns desired_capacity, stop trying to reset it. Use this surgically — over-applying ignore_changes blinds you to real drift.

Where AI helps with drift

A refresh-only plan over a large estate can be hundreds of lines. I paste it in and ask: “Group these drifted resources into ‘looks like a real human change worth codifying’ versus ‘provider default noise I should ignore_changes.’” It’s genuinely good at that triage, which saves me reading every diff line by line.

I’ll also ask it to draft the ignore_changes lifecycle blocks once I’ve decided what’s noise. For the recurring patterns, I keep a set of Terraform prompts handy. And I run reconciliation PRs through our Code Review tool so a second set of eyes confirms I’m codifying the right thing, not rubber-stamping an unauthorized console change.

The hard rule stands: AI reads the plan and proposes; I run the commands.

Prevent the drift in the first place

The real fix is cultural and technical:

Lock down console write access in production. If humans can’t click, they can’t drift.
Require change-by-code with break-glass as the rare exception that gets reconciled same-day.
Run scheduled drift detection so the feedback loop is hours, not months.

The takeaway

Drift is inevitable, but surprise drift is a choice. Detect it on a schedule with -refresh-only, reconcile every finding deliberately, and silence true provider noise with surgical ignore_changes. Catch it nightly and it’s a chore. Catch it during an apply and it’s an incident.