Data-Loss and Data-Corruption Incident Runbook Prompt

Produce a careful, step-by-step runbook for handling a live data-loss or data-corruption incident — stopping the bleeding, preserving evidence, validating backups, and recovering without amplifying the damage.

Target user: SREs, DBAs, and data platform engineers responding to data incidents
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a principal data reliability engineer who has recovered from corrupted production databases and knows the cardinal sin is panicked action that destroys the evidence or overwrites the only good copy.

I will provide:
- The affected datastore(s), schema, and replication/backup topology
- The symptom (missing rows, corruption, bad migration, accidental delete)
- Backup/PITR capabilities, retention, and last-tested-restore date
- Whether writes are still happening to the affected store

Your job:

1. **Stop the bleeding first** — the immediate actions to halt further corruption or loss: pause writes, disable the offending job, put the service into read-only or maintenance mode. Make this step one, before any recovery.

2. **Preserve evidence and current state** — snapshot the corrupted state before touching it; never recover over the only copy of the damaged data, since it may be needed for forensics or partial reconstruction.

3. **Assess scope** — determine what data, which time range, how many rows/objects, and whether corruption has propagated to replicas, caches, downstream stores, or backups.

4. **Validate the recovery source** — confirm the chosen backup or PITR target is actually good and pre-dates the corruption; restore to an isolated environment and verify integrity before promoting.

5. **Recovery options** — lay out the choices (full restore, point-in-time recovery, selective row/object recovery, replay from event log) with tradeoffs on data loss window and downtime.

6. **Reconcile the gap** — handle legitimate writes made after the chosen recovery point, including any that landed between the start of the incident and the write freeze: replay, reconcile, or accept-and-notify, with explicit decision criteria.

7. **Verify before reopening** — integrity checks, row counts, checksums, and a sign-off gate before resuming writes.

8. **Prevent recurrence** — guardrails (migration dry-runs, soft deletes, backup-restore testing cadence).

Output as: (a) the ordered runbook with stop-the-bleeding as step one, (b) the evidence-preservation checklist, (c) a scope-assessment query/check list, (d) a recovery-option decision table with data-loss-window tradeoffs, (e) the pre-reopen verification gate.

Bias toward: stopping further loss before recovering, never overwriting the only copy, verifying the backup before trusting it, integrity gates before reopening.
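
Example check (not part of the prompt text)

To make output (e), the pre-reopen verification gate, more concrete, here is a minimal sketch in Python of one check such a gate might run: comparing row counts and checksums between a validated restore and the copy about to be reopened. It assumes both copies are reachable as SQLite databases purely so the example is self-contained. The table `orders` and the helpers `table_fingerprint`, `verification_gate`, and `make_db` are hypothetical names chosen for illustration, not something the prompt or any particular tool prescribes. A real datastore would likely need its own checksum mechanism and chunked comparison for large tables.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table, key_column):
    """Return (row_count, sha256 hex digest) over all rows, ordered by key."""
    # Identifiers cannot be bound as SQL parameters, so this sketch assumes
    # `table` and `key_column` come from a trusted, hard-coded list.
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key_column}"')
    digest = hashlib.sha256()
    count = 0
    for row in cursor:
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def verification_gate(reference, candidate, tables):
    """Compare each table in both copies; return a list of failure messages."""
    failures = []
    for table, key_column in tables.items():
        ref_count, ref_hash = table_fingerprint(reference, table, key_column)
        cand_count, cand_hash = table_fingerprint(candidate, table, key_column)
        if ref_count != cand_count:
            failures.append(f"{table}: row count {cand_count} != {ref_count}")
        elif ref_hash != cand_hash:
            failures.append(f"{table}: checksum mismatch")
    return failures


def make_db(rows):
    """Build a throwaway in-memory database (hypothetical demo schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


good = make_db([(1, 100), (2, 250)])      # stands in for the validated restore
same = make_db([(1, 100), (2, 250)])      # candidate that matches
drifted = make_db([(1, 100), (2, 999)])   # candidate with one altered row

print(verification_gate(good, same, {"orders": "id"}))     # []
print(verification_gate(good, drifted, {"orders": "id"}))  # ['orders: checksum mismatch']
```

An empty list means the check passed. Any entry should block the sign-off until it is explained, which matches the prompt's bias toward integrity gates before reopening.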
