AI for Terraform

Terraform State Disaster Recovery & Rebuild Prompt

Recover from a lost, corrupted, or diverged Terraform state file — rebuild state via bulk import, reconcile against live infrastructure, and harden the backend so it never happens again.

Target user: On-call engineers facing a missing or broken state file
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an incident responder who has rebuilt Terraform state from scratch under pressure without destroying a single live resource.

I will provide:
- What happened (state deleted, corrupted, lock stuck, two states diverged)
- The backend (S3 + DynamoDB, GCS, Terraform Cloud, local)
- Whether I have any backups or versioning enabled

Your job — start by stabilizing, then rebuild:

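1. **Stop the bleeding**: the first instruction is do NOT run `apply`. With state missing, Terraform treats every resource in the config as new and will try to create it again, producing duplicates or name-collision failures. Freeze the pipeline and any auto-apply (Atlantis, Terraform Cloud) immediately.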

2. **Look for a backup before rebuilding**: check S3 object versions, GCS object generations, Terraform Cloud state version history, the local `terraform.tfstate.backup` file, and CI artifacts. Restoring a recent version is far safer than re-importing. Give the exact commands.

3. **Assess divergence**: if two states diverged, show how to run `terraform state pull` against each, diff the resource addresses, and find overlaps before merging.

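4. **Rebuild via import**: when no backup exists, enumerate live resources, write an `import` block (Terraform 1.5+) for each, run `terraform plan -generate-config-out=<file>` (the file must not already exist) to scaffold config, and iterate until the plan shows only imports, with nothing to add, change, or destroy. Apply that import-only plan, then confirm the next plan is a clean no-op. Import stateful resources (DBs, buckets) first.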

5. **Verify no-op**: the success criterion is a plan showing zero changes (or only benign drift you understand). Never declare recovery until the plan is clean.

6. **Stuck locks**: for DynamoDB or Terraform Cloud lock entries, explain how to confirm no other run is active and then safely run `terraform force-unlock <LOCK_ID>` with the lock ID from the error message.

7. **Harden**: post-incident, enable backend versioning, point-in-time backups, state locking, restricted delete permissions, and a periodic `terraform state pull` backup job.

Output: (a) an immediate stabilization checklist, (b) the backup-restore commands for my backend, (c) the import-rebuild runbook with generate-config steps, (d) the no-op verification gate, (e) a backend-hardening list to prevent recurrence.

Bias toward: restore-before-rebuild, never apply with incomplete state, and verify with a clean plan before reopening the pipeline.
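
Two of the mechanical steps the prompt asks for, comparing resource addresses across two pulled states (step 3) and writing `import` blocks (step 4), can be sketched offline before touching any backend. The Python below is a minimal sketch, not a definitive tool. It assumes each state was saved with `terraform state pull` and follows the version 4 JSON layout (a top-level `resources` list whose entries carry `mode`, `type`, `name`, an optional `module`, and `instances` with an optional `index_key`). The function names `resource_addresses`, `compare_states`, and `import_block` are hypothetical helpers introduced here for illustration.

```python
from __future__ import annotations

import json


def resource_addresses(state: dict) -> set[str]:
    """Collect managed-resource instance addresses from a parsed state document.

    Assumes the version 4 state layout described in the lead-in.
    """
    addresses = set()
    for res in state.get("resources", []):
        if res.get("mode") != "managed":
            continue  # skip data sources; they are re-read, not imported
        base = f'{res["type"]}.{res["name"]}'
        if res.get("module"):
            base = f'{res["module"]}.{base}'
        for inst in res.get("instances", []):
            key = inst.get("index_key")
            if key is None:
                addresses.add(base)                # single instance
            elif isinstance(key, int):
                addresses.add(f"{base}[{key}]")    # count index
            else:
                addresses.add(f'{base}["{key}"]')  # for_each key
    return addresses


def compare_states(state_a: dict, state_b: dict) -> dict[str, list[str]]:
    """Split addresses into those unique to each state and those in both."""
    a, b = resource_addresses(state_a), resource_addresses(state_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "overlap": sorted(a & b),  # review these before merging
    }


def import_block(address: str, resource_id: str) -> str:
    """Render one Terraform 1.5+ import block as HCL text.

    Assumption: the ID contains no `${` sequence, which HCL would
    treat as interpolation; json.dumps only handles quote escaping.
    """
    return (
        "import {\n"
        f"  to = {address}\n"
        f"  id = {json.dumps(resource_id)}\n"
        "}\n"
    )
```

A possible use: run `terraform state pull > a.json` in one working directory and `terraform state pull > b.json` in the other, load both files with `json.load`, and pass them to `compare_states`. Addresses under `overlap` are the ones to inspect by hand before merging. For a rebuild, calling `import_block` once per live resource and writing the results to a `.tf` file gives `terraform plan` something to evaluate. The resource IDs still have to come from the provider's own listing, since each resource type defines its own import ID format.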
