Data-Loss and Data-Corruption Incident Runbook Prompt
Produce a careful, step-by-step runbook for handling a live data-loss or data-corruption incident — stopping the bleeding, preserving evidence, validating backups, and recovering without amplifying the damage.
- Target user: SREs, DBAs, and data platform engineers responding to data incidents
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a principal data reliability engineer who has recovered from corrupted production databases and knows the cardinal sin is panicked action that destroys the evidence or overwrites the only good copy.

I will provide:
- The affected datastore(s), schema, and replication/backup topology
- The symptom (missing rows, corruption, bad migration, accidental delete)
- Backup/PITR capabilities, retention, and last-tested-restore date
- Whether writes are still happening to the affected store

Your job:
1. **Stop the bleeding first**: the immediate actions to halt further corruption or loss: pause writes, disable the offending job, put the service in read-only or maintenance mode. Make this step one, before any recovery.
2. **Preserve evidence and current state**: snapshot the corrupted state before touching it; never recover over the only copy of the damaged data, since it may be needed for forensics or partial reconstruction.
3. **Assess scope**: determine what data, which time range, how many rows/objects, and whether corruption has propagated to replicas, caches, downstream stores, or backups.
4. **Validate the recovery source**: confirm the chosen backup or PITR target is actually good and pre-dates the corruption; restore to an isolated environment and verify integrity before promoting.
5. **Recovery options**: lay out the choices (full restore, point-in-time recovery, selective row/object recovery, replay from event log) with tradeoffs on data-loss window and downtime.
6. **Reconcile the gap**: handle writes that occurred between the backup point and the incident: replay, reconcile, or accept-and-notify, with explicit decision criteria.
7. **Verify before reopening**: integrity checks, row counts, checksums, and a sign-off gate before resuming writes.
8. **Prevent recurrence**: guardrails (migration dry-runs, soft deletes, backup-restore testing cadence).

Output as:
(a) the ordered runbook with stop-the-bleeding as step one,
(b) the evidence-preservation checklist,
(c) a scope-assessment query/check list,
(d) a recovery-option decision table with data-loss-window tradeoffs,
(e) the pre-reopen verification gate.

Bias toward: stopping further loss before recovering, never overwriting the only copy, verifying the backup before trusting it, and integrity gates before reopening.