
Prometheus WAL & TSDB Corruption Recovery Prompt

Diagnose and safely recover a Prometheus instance that fails to start or crash-loops due to WAL replay errors, corrupt blocks, or a full data directory.

Target user: SREs and platform engineers who run Prometheus and are responsible for TSDB availability
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who has recovered Prometheus TSDBs after disk-full events, OOM kills mid-compaction, and corrupt WAL segments, and you know which recovery steps lose data versus preserve it.

I will provide:
- The startup/crash log lines (WAL replay errors, block load errors, or "no space left")
- The Prometheus version and storage layout (local disk, PVC, size, retention)
- Constraints (can I afford to lose recent data? is this an HA pair?)

Your job:

1. **Classify the failure** — distinguish WAL replay corruption, head-block issues, on-disk block (chunk/index) corruption, and disk-full; cite the specific log signature for each.
2. **Snapshot first** — give the exact commands to copy/snapshot the data dir (or PVC) before any mutation, and explain why this is non-negotiable.
3. **Choose the least-destructive path** — order recovery options from safest to most lossy (clear a specific WAL segment, drop a single corrupt block, truncate the full WAL, wipe the data directory as a last resort), mapping each to the failure class.
4. **Quantify data loss** — for the chosen step, state exactly what time range / which series are lost and whether the HA peer or remote-write/long-term store can backfill it.
5. **Execute** — provide the precise commands (`promtool tsdb`, file removals under `wal/` or `chunks_head/`, block dir deletion) with the service stopped.
6. **Verify & restart** — give the post-recovery checks (`promtool tsdb analyze`, startup log confirmation, `up`/`prometheus_tsdb_head_series`).
7. **Prevent recurrence** — recommend retention/disk headroom, OOM limits, and remote-write so the next failure is non-fatal.

Output as: an ordered runbook (numbered steps with copy-pasteable commands), an explicit "data lost" statement for the chosen path, and a prevention checklist.

Default to caution: never recommend deleting or truncating anything before a snapshot exists, and if the failure class is ambiguous from the logs, recommend the safest reversible step first.
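
Companion sketch (not part of the prompt text): the snapshot-first rule can be made concrete with a short helper. This is a minimal Python sketch, not an official Prometheus tool. The function names `dir_stats` and `snapshot_data_dir` are hypothetical, and the sketch assumes Prometheus is already stopped and the destination volume has enough free space. It copies the data directory and confirms that the file count and total bytes match before any recovery step touches the original.

```python
import os
import shutil


def dir_stats(path):
    """Return (file_count, total_bytes) for every file under path."""
    count, total = 0, 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            # lstat so a symlink is measured as a link, matching symlinks=True below
            total += os.lstat(os.path.join(root, name)).st_size
            count += 1
    return count, total


def snapshot_data_dir(src, dest):
    """Copy src to dest and verify the copy; refuse to overwrite an existing snapshot."""
    if os.path.exists(dest):
        raise FileExistsError(f"snapshot target already exists: {dest}")
    shutil.copytree(src, dest, symlinks=True)
    before, after = dir_stats(src), dir_stats(dest)
    if before != after:
        raise RuntimeError(f"snapshot mismatch: source {before}, copy {after}")
    return after
```

A call such as `snapshot_data_dir("/var/lib/prometheus", "/backup/prometheus-snapshot")` (both paths are placeholders) returns the verified file count and byte total. A matching count and size is only a coarse check; a filesystem or PVC snapshot is usually preferable when the platform offers one.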

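Companion sketch (not part of the prompt text): to sanity-check the "data lost" statement the prompt asks for, the time range covered by each persisted block can be read from the `meta.json` file in its block directory, which records `minTime` and `maxTime` in milliseconds since the Unix epoch. The Python sketch below assumes that layout. The function name `block_time_ranges` is hypothetical, and the sketch does not cover samples that exist only in the WAL or head block.

```python
import json
import os
from datetime import datetime, timezone


def block_time_ranges(data_dir):
    """Return (readable, unreadable) for the block directories in data_dir.

    readable:   list of (block_dir_name, start_utc, end_utc), oldest first
    unreadable: list of block_dir_name whose meta.json could not be parsed
    """
    readable, unreadable = [], []
    for entry in sorted(os.listdir(data_dir)):
        meta_path = os.path.join(data_dir, entry, "meta.json")
        if not os.path.isfile(meta_path):
            continue  # not a block directory (e.g. wal/, chunks_head/)
        try:
            with open(meta_path) as fh:
                meta = json.load(fh)
            start = datetime.fromtimestamp(meta["minTime"] / 1000, tz=timezone.utc)
            end = datetime.fromtimestamp(meta["maxTime"] / 1000, tz=timezone.utc)
        except (OSError, ValueError, KeyError, TypeError):
            unreadable.append(entry)  # candidate for closer inspection
            continue
        readable.append((entry, start, end))
    readable.sort(key=lambda item: item[1])
    return readable, unreadable
```

Running this against the snapshot copy rather than the live directory keeps the inspection read-only. If a block must be dropped, its entry in `readable` gives the time range to state as lost and to request from the HA peer or long-term store.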