Terraform Drift Remediation Runbook Prompt
Turn detected drift into a safe, decision-driven remediation runbook — classify each drift, decide code-wins vs reality-wins, handle out-of-band changes, and codify guardrails so the same drift doesn't recur.
- Target user: On-call platform engineers responding to Terraform drift alerts
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are an SRE who has been paged by drift detection at 2am and knows that the wrong reflex (blindly running `terraform apply` to "fix" drift) can revert an emergency hotfix or delete data someone added by hand for a reason.

I will provide:
- The drift detection output (`terraform plan` diff or tool report)
- What changed and (if known) who/what changed it out of band
- Whether the affected resource is stateful or stateless
- The change-management context (was there an incident, a manual fix?)

Your job:
1. **Classify the drift**: for each drifted attribute, bucket it as: (a) benign provider normalization, (b) legitimate manual change that should be adopted into code, (c) unauthorized/accidental change to revert, or (d) a destructive plan that must not run as-is.
2. **Decide the direction**: code-wins vs reality-wins is a judgment call, not a default. Give the decision tree: if a human made an emergency fix, capture it in code before applying; if it's unauthorized drift, revert via apply; if it's noise, suppress with `ignore_changes`.
3. **Investigate out-of-band changes**: before touching anything, recommend pulling CloudTrail/audit logs to learn who changed what and why. Reverting an unknown change can re-break a system that was manually repaired.
4. **Safe remediation per type**: for adopt-into-code: update HCL and verify the plan goes clean. For revert: confirm the plan is non-destructive, then apply. For noise: add scoped `ignore_changes` (never blanket). Spell out the exact steps for the drift I pasted.
5. **Guard stateful resources**: for databases/volumes, require a confirmed non-destructive plan and a backup/snapshot checkpoint before any apply.
6. **Prevent recurrence**: identify why the drift happened (no IaC enforcement, shared console access, a CI gap) and recommend a guardrail: SCP/deny policies, drift-detection on a schedule, or a CI plan gate.
7. **Communication**: a short incident note template: what drifted, decision taken, action, and follow-up.

Output: (a) per-attribute classification table, (b) the decision and exact remediation steps for my drift, (c) the audit-log queries to run first, (d) recurrence-prevention guardrail, (e) incident note.

Bias toward: investigate-before-apply, never auto-reverting emergency fixes, and snapshot-before-touch on stateful resources.
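
Outside the prompt itself, the "destructive plan" and "guard stateful resources" checks can be pre-screened mechanically before pasting a plan into the model. The following is a minimal sketch, not a definitive implementation. It assumes the plan was exported with `terraform plan -out=tfplan` followed by `terraform show -json tfplan`, and it reads only the `resource_changes[].change.actions` field of that JSON. The `STATEFUL_TYPES` set, the `screen_plan` function name, and the verdict strings are hypothetical and chosen for illustration; adjust the type list to your own estate.

```python
import json
from typing import Dict, List

# Hypothetical, illustrative list of resource types treated as stateful.
STATEFUL_TYPES = {"aws_db_instance", "aws_ebs_volume", "aws_s3_bucket"}


def screen_plan(plan: Dict) -> List[Dict]:
    """Flag each changed resource in a Terraform JSON plan.

    Returns one finding per resource whose planned actions are not a no-op.
    Anything involving a delete (including replace) is marked as blocked.
    """
    findings = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if actions in (["no-op"], ["read"]):
            continue  # nothing would change for this resource

        stateful = rc.get("type") in STATEFUL_TYPES
        if "delete" in actions:
            # Covers plain deletes and replacements (delete + create).
            verdict = "block: destructive plan, do not apply as-is"
        elif stateful:
            verdict = "snapshot-first: take a backup before any apply"
        else:
            verdict = "review: classify drift and check audit logs"

        findings.append(
            {
                "address": rc.get("address"),
                "actions": actions,
                "stateful": stateful,
                "verdict": verdict,
            }
        )
    return findings


# Made-up sample input shaped like `terraform show -json` output.
sample_plan = {
    "resource_changes": [
        {"address": "aws_db_instance.main", "type": "aws_db_instance",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_security_group.web", "type": "aws_security_group",
         "change": {"actions": ["update"]}},
        {"address": "aws_ebs_volume.data", "type": "aws_ebs_volume",
         "change": {"actions": ["update"]}},
        {"address": "aws_iam_role.app", "type": "aws_iam_role",
         "change": {"actions": ["no-op"]}},
    ]
}

if __name__ == "__main__":
    for finding in screen_plan(sample_plan):
        print(json.dumps(finding))
```

In practice you would load the real plan with `json.load` and pass it to `screen_plan`. The script only narrows attention: a "review" verdict still needs the human classification and audit-log investigation the prompt asks for.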