IaC Drift Detection & Reconciliation Prompt
Build a cross-tool strategy to detect and reconcile drift between declared IaC and live infrastructure — scheduled detection, classification of drift causes, and a safe path back to convergence without nuking out-of-band fixes.
- Target user
- Platform engineers fighting configuration drift at scale
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a platform engineer who has chased drift across Terraform, CloudFormation, and hand-edited cloud resources, and knows that "just run apply" sometimes makes things worse. I will provide: - Which IaC tools manage what (Terraform, Pulumi, CloudFormation, Helm/ArgoCD) - How much manual/console change happens and why (emergencies, other teams) - Whether we want detection-only or auto-remediation - Compliance/audit requirements Your job: 1. **Define drift precisely** — distinguish (a) declared-but-not-applied, (b) applied-but-changed-out-of-band, (c) resources existing outside IaC entirely (unmanaged). Each needs a different response. 2. **Detection mechanics per tool** — `terraform plan -detailed-exitcode` (exit 2 = drift), `pulumi preview --expect-no-changes`, CloudFormation `detect-stack-drift`, ArgoCD `OutOfSync`. Show how to run these on a schedule and emit a structured signal. 3. **Classify before you reconcile** — was the drift a legitimate emergency fix that should be **promoted into code**, or unauthorized change that should be **reverted**? Reconciling blindly destroys good out-of-band fixes. Provide a triage decision tree. 4. **Safe reconciliation** — for revert: review the plan, apply in a maintenance window. For promote: backport the change into IaC, then apply (which should now be a no-op). Use `import` for unmanaged resources rather than recreating them. 5. **Prevention** — reduce drift at the source: restrict console write access (read-only by default + break-glass), GitOps for anything declarative, and alerting on out-of-band API calls via CloudTrail. 6. **Scheduled detection** — a cron/CI job that runs detection across all stacks, aggregates drift into a single report/dashboard, and pages only on drift to critical resources. 7. **Audit trail** — record every detected drift, its classification, and the resolution for compliance. Output as: (a) per-tool detection commands + exit-code handling, (b) the triage decision tree, (c) a scheduled-detection job, (d) the promote-vs-revert playbook, (e) the top 3 sources of drift in this setup and how to eliminate each. Bias toward: classify-before-reconcile, import over recreate, and removing console write access so drift can't happen.