Triaging Terraform Drift Alerts With AI Without Blind Reapplies

Once you set up continuous drift detection, you discover a new problem: the alerts never stop. Someone changed a tag in the console, an autoscaling group resized itself, AWS added a default value the provider didn’t know about. Every one of these shows up as drift, and most of them are noise. The temptation is to make the cron job run terraform apply to “fix” the drift automatically. That is how you delete something a human did on purpose during an incident.

Drift triage is a classification problem, and classification is exactly what AI does well. Use it as a fast junior engineer who reads every drift report and sorts them into “ignore,” “reconcile,” and “investigate” — then a human looks at the short pile that actually matters. What the AI never does is decide to reapply. There is no auto-reconcile. A person approves every change that touches real infrastructure.

Why blind reconcile is dangerous

Drift isn’t always Terraform being out of date. Sometimes drift is intentional and correct:

An on-call engineer bumped an instance size at 3am to survive a traffic spike.
A security team revoked a security group rule during an incident.
Someone applied a manual hotfix that hasn’t been codified yet.

If your pipeline blindly runs terraform apply to erase drift, it reverts all of these: undoing a fix, restoring a vulnerable rule, or downsizing during a load event. “Drift” and “unauthorized change” are not synonyms, and only a human knows the difference. The AI’s job is to flag which drift looks accidental and which looks intentional, not to act.

Feed it structured drift, not raw output

Generate drift as a refresh-only plan and render it as JSON:

terraform plan -refresh-only -out=drift.tfplan
terraform show -json drift.tfplan > drift.json

-refresh-only shows the delta between state and reality without proposing config changes — pure drift. Extract just the drifted resources:

jq '[.resource_drift[]? | {address, actions: .change.actions}]' drift.json
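If bare actions turn out to be too thin for useful triage, the same extraction can be sketched in Python so that each record also names which attributes changed. This is a minimal sketch under assumptions: summarize_drift and changed_attributes are hypothetical helper names, and the input is assumed to follow Terraform's JSON plan representation, where each resource_drift entry carries change.before and change.after. It emits attribute names only, so values that might be sensitive stay out of the prompt.

```python
import json


def changed_attributes(before, after):
    """Return the sorted top-level attribute names whose values differ."""
    # Either side can be null in the plan JSON (for example, a deleted resource).
    before = before or {}
    after = after or {}
    return sorted(
        key for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    )


def summarize_drift(plan):
    """Reduce parsed `terraform show -json` output to one record per drifted resource."""
    records = []
    for item in plan.get("resource_drift", []):
        change = item.get("change") or {}
        records.append({
            "address": item.get("address"),
            "type": item.get("type"),
            "actions": change.get("actions", []),
            # Names only: attribute values are deliberately left out.
            "changed": changed_attributes(change.get("before"), change.get("after")),
        })
    return records


def render_for_prompt(records):
    """Serialize the records as compact, stable JSON to paste under the triage prompt."""
    return json.dumps(records, indent=2, sort_keys=True)
```

To use it, parse drift.json with json.load, pass the result to summarize_drift, and paste the output of render_for_prompt beneath the triage instructions.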

Hand that to the AI with the triage framing:

Here are the resources that drifted from Terraform state. Classify each as: (1) BENIGN — cosmetic or provider-default noise, safe to update state; (2) RECONCILE — Terraform should reassert config; (3) INVESTIGATE — looks like an intentional out-of-band change a human made. Explain each classification in one line.

Now your 30 drift alerts become “3 to investigate, 5 to reconcile, 22 noise.” That sorting is the entire value, and it’s low-risk because the output is a list, not an action.
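Because the output is only a list, it is cheap to sanity-check before anyone reads it. The sketch below assumes you also ask the model to reply as a JSON array of objects with address, label, and reason keys; that reply format is an assumption layered on the prompt, and sort_into_piles is a hypothetical helper. Anything the model skips, mislabels, or garbles falls into INVESTIGATE, so a bad reply costs a human some time rather than hiding a change.

```python
import json

LABELS = ("BENIGN", "RECONCILE", "INVESTIGATE")


def sort_into_piles(addresses, reply_text):
    """Group drifted addresses by the model's label, failing toward INVESTIGATE."""
    try:
        parsed = json.loads(reply_text)
    except json.JSONDecodeError:
        parsed = []
    if not isinstance(parsed, list):
        parsed = []

    # Keep only well-formed verdicts that use one of the three known labels.
    verdicts = {}
    for entry in parsed:
        if not isinstance(entry, dict):
            continue
        address, label = entry.get("address"), entry.get("label")
        if isinstance(address, str) and label in LABELS:
            verdicts[address] = (label, str(entry.get("reason", "")))

    piles = {label: [] for label in LABELS}
    fallback = ("INVESTIGATE", "no valid classification returned")
    # Iterate over what actually drifted, not over what the model chose to mention.
    for address in addresses:
        label, reason = verdicts.get(address, fallback)
        piles[label].append((address, reason))
    return piles
```

Driving the loop from the drifted addresses rather than from the reply means the model cannot drop an alert by omission, and it cannot add resources that never drifted.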

Teach it your specific noise

Generic triage misses your context. Tell the AI what’s expected to drift in your environment:

In this org, ASG desired_capacity is managed by autoscaling and always drifts — that’s expected, ignore it. Tags added by the aws-cost-allocation Lambda are also expected. Flag everything else.

# Drift on this is always benign — autoscaling owns desired_capacity
resource "aws_autoscaling_group" "web" {
  # ...
  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

The ignore_changes lifecycle block is often the right permanent fix for recurring benign drift, and the AI is good at suggesting where to add it. That turns a recurring alert into silence at the source — better than re-triaging the same noise weekly.
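Until an ignore_changes rule lands, the same knowledge can be applied as a pre-filter so expected drift never reaches the model. This is a rough sketch, assuming each drift record is a dict with type, actions, and changed (a list of attribute names) keys; that record shape and the EXPECTED_DRIFT table are hypothetical. Only the ASG rule from the example above is encoded, since telling Lambda-added tags from human-added ones would need attribute values, not just names.

```python
# Resource type -> attributes that are expected to drift in this org (assumed table).
EXPECTED_DRIFT = {
    "aws_autoscaling_group": {"desired_capacity"},
}


def is_expected_noise(record, expected=EXPECTED_DRIFT):
    """True only if every changed attribute is on the expected list for this type."""
    allowed = expected.get(record.get("type"), set())
    changed = set(record.get("changed", []))
    # An empty change set or a non-update action (such as delete) is never auto-dismissed.
    return bool(changed) and changed <= allowed and record.get("actions") == ["update"]


def split_noise(records, expected=EXPECTED_DRIFT):
    """Return (noise, rest); only `rest` needs to go to the model or a human."""
    noise = [r for r in records if is_expected_noise(r, expected)]
    rest = [r for r in records if not is_expected_noise(r, expected)]
    return noise, rest
```

The subset check matters: an ASG whose desired_capacity and max_size both changed is not dismissed, because one expected attribute should not hide an unexpected one.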

Pro Tip: Ask the AI which recurring drift should become an ignore_changes rule versus which should stay visible. Suppressing the wrong drift hides real changes; suppressing the right drift kills alert fatigue. Get that boundary explicit and review it as a team.
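One way to ground that team review in data is to count which (resource, attribute) pairs keep reappearing across drift runs. The sketch below is illustrative only: recurring_drift is a hypothetical helper, the threshold of three runs is arbitrary, and each run is assumed to be a list of dicts with address and changed keys. Its output is a list of candidates for discussion, not a list of rules to apply.

```python
from collections import Counter


def recurring_drift(runs, min_runs=3):
    """Return (address, attribute) pairs that drifted in at least `min_runs` runs."""
    counts = Counter()
    for run in runs:
        # Deduplicate within a run so one noisy run cannot inflate the count.
        seen = {
            (record["address"], attribute)
            for record in run
            for attribute in record.get("changed", [])
        }
        counts.update(seen)
    return sorted(pair for pair, n in counts.items() if n >= min_runs)
```

A pair that shows up here is a prompt for the question in the tip: is this recurring because it is benign, or because someone keeps changing it by hand?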

The INVESTIGATE pile is the whole point

The resources the AI flags as “intentional out-of-band change” are where you earn your salary. For these, the AI helps you ask the right question, not answer it:

This security group rule was removed outside Terraform. What’s the safest next step — codify the removal, or restore it? List what I’d need to know to decide.

The model will tell you: check who made the change and when (CloudTrail), whether there’s an incident ticket, whether the rule was a known vulnerability. It structures the investigation. You make the call, because only you know whether that 3am change was a hero move or a mistake.

Reconcile is human-approved, always

When triage says “reconcile,” the path is a normal, reviewed plan-and-apply — not a cron job:

terraform plan -out=tfplan   # human reads this
# ... a person reviews and approves ...
terraform apply tfplan        # a human runs this

The drift detector and the AI tell you something needs attention. They never resolve it. The instant you let an automated process apply unreviewed changes to reconcile drift, you’ve built a robot that occasionally reverts your incident fixes. Keep apply behind a human, every time.
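If you want a small mechanical guard on top of that rule, one option is to tie the approval to the exact plan file that was reviewed. This is a sketch under assumptions, not a replacement for review: plan_digest and is_approved are hypothetical helpers, and where the approved digest is recorded (a ticket, a PR comment) is left to your process. The reviewer records the SHA-256 of the tfplan they read, and the person applying checks it first.

```python
import hashlib


def plan_digest(path):
    """Return the SHA-256 hex digest of a saved plan file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large plan files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_approved(plan_path, approved_digest):
    """True only if the plan on disk is byte-for-byte the one that was reviewed."""
    return plan_digest(plan_path) == approved_digest.strip().lower()
```

The check does not run apply and does not decide anything. It only tells the human at the keyboard whether the file in front of them is the one that was approved.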

The boundary

The AI classifies drift, suggests ignore_changes rules, and structures investigations. It does not run apply, does not hold cloud credentials, and does not get state-write access. Its entire output is a triaged list and reviewable HCL suggestions. Drift triage is a perfect AI surface because the action is always gated behind a human — the model makes the noise manageable so a person can focus on the few alerts that represent a real decision.

When drift triage turns into a live incident, the incident response dashboard is built for it, and ongoing alert flows fit the monitoring alerts dashboard. For triage prompt templates, see the prompts library and prompt packs. More under AI for Terraform.

Conclusion

Drift detection’s real problem isn’t detecting drift — it’s the flood of alerts where benign noise hides the one intentional change you must not revert. AI sorts that flood into ignore, reconcile, and investigate, suggests ignore_changes rules to silence recurring noise at the source, and structures the investigations. But it never reconciles: a human approves every apply. The model is the fast junior who reads every alert; you’re the one who decides which drift is a fix and which is a mistake.
