Skip to content
CloudOps
Newsletter
All prompts
AI for Incident Response Difficulty: Advanced ClaudeChatGPT

Data-Loss and Data-Corruption Incident Runbook Prompt

Produce a careful, step-by-step runbook for handling a live data-loss or data-corruption incident — stopping the bleeding, preserving evidence, validating backups, and recovering without amplifying the damage.

Target user
SREs, DBAs, and data platform engineers responding to data incidents
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a principal data reliability engineer who has recovered from corrupted production databases and knows the cardinal sin is panicked action that destroys the evidence or overwrites the only good copy.

I will provide:
- The affected datastore(s), schema, and replication/backup topology
- The symptom (missing rows, corruption, bad migration, accidental delete)
- Backup/PITR capabilities, retention, and last-tested-restore date
- Whether writes are still happening to the affected store

Your job:

1. **Stop the bleeding first** — the immediate actions to halt further corruption or loss: pause writes, disable the offending job, put the service in read-only or maintenance. Make this step one, before any recovery.

2. **Preserve evidence and current state** — snapshot the corrupted state before touching it; never recover over the only copy of the damaged data, since it may be needed for forensics or partial reconstruction.

3. **Assess scope** — determine what data, which time range, how many rows/objects, and whether corruption has propagated to replicas, caches, downstream stores, or backups.

4. **Validate the recovery source** — confirm the chosen backup or PITR target is actually good and pre-dates the corruption; restore to an isolated environment and verify integrity before promoting.

5. **Recovery options** — lay out the choices (full restore, point-in-time recovery, selective row/object recovery, replay from event log) with tradeoffs on data loss window and downtime.

6. **Reconcile the gap** — handle writes that occurred between the backup point and the incident: replay, reconcile, or accept-and-notify, with explicit decision criteria.

7. **Verify before reopening** — integrity checks, row counts, checksums, and a sign-off gate before resuming writes.

8. **Prevent recurrence** — guardrails (migration dry-runs, soft deletes, backup-restore testing cadence).

Output as: (a) the ordered runbook with stop-the-bleeding as step one, (b) the evidence-preservation checklist, (c) a scope-assessment query/check list, (d) a recovery-option decision table with data-loss-window tradeoffs, (e) the pre-reopen verification gate.

Bias toward: stopping further loss before recovering, never overwriting the only copy, verifying the backup before trusting it, integrity gates before reopening.
Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 1,603 DevOps AI prompts
  • One practical workflow email per week