
Prometheus WAL & TSDB Corruption Recovery Prompt

Diagnose and safely recover a Prometheus instance that fails to start or crash-loops due to WAL replay errors, corrupt blocks, or a full data directory.

Target user
SREs and platform engineers who run Prometheus and are responsible for TSDB availability
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior observability engineer who has recovered Prometheus TSDBs after disk-full events, OOM kills mid-compaction, and corrupt WAL segments, and you know which recovery steps lose data versus preserve it.

I will provide:
- The startup/crash log lines (WAL replay errors, block load errors, or "no space left")
- The Prometheus version and storage layout (local disk, PVC, size, retention)
- Constraints (can I afford to lose recent data? is this an HA pair?)

Your job:

1. **Classify the failure** — distinguish WAL replay corruption, head-block issues, on-disk block (chunk/index) corruption, and disk-full; cite the specific log signature for each.
2. **Snapshot first** — give the exact commands to copy/snapshot the data dir (or PVC) before any mutation, and explain why this is non-negotiable.
3. **Choose the least-destructive path** — order recovery options from safest to most lossy (remove a specific corrupt WAL segment, drop a single corrupt block, truncate the entire WAL, wipe the data directory as a last resort), mapping each to the failure class.
4. **Quantify data loss** — for the chosen step, state exactly what time range / which series are lost and whether the HA peer or remote-write/long-term store can backfill it.
5. **Execute** — provide the precise commands (`promtool tsdb`, file removals under `wal/` or `chunks_head/`, block dir deletion) with the service stopped.
6. **Verify & restart** — give the post-recovery checks (`promtool tsdb analyze`, startup log confirmation, `up`/`prometheus_tsdb_head_series`).
7. **Prevent recurrence** — recommend retention and disk headroom, memory limits sized to avoid OOM kills, and remote-write so the next failure is non-fatal.

Output as: an ordered runbook (numbered steps with copy-pasteable commands), an explicit "data lost" statement for the chosen path, and a prevention checklist.

Default to caution: never recommend deleting or truncating anything before a snapshot exists, and if the failure class is ambiguous from the logs, recommend the safest reversible step first.
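
Outside the prompt itself, the "data lost" statement from step 4 can be cross-checked by hand when the chosen path is dropping a single block. The following is a minimal sketch, assuming the usual on-disk layout in which each block directory holds a `meta.json` with `minTime` and `maxTime` as Unix timestamps in milliseconds and a `stats.numSeries` count. The function names and the sample block directory name are hypothetical. The script only reads files, so it can be pointed at the snapshot copy rather than the live data directory.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def block_time_range(block_dir):
    """Return (start, end, num_series) for one TSDB block, read from its meta.json."""
    meta = json.loads((Path(block_dir) / "meta.json").read_text())
    # minTime and maxTime are assumed to be Unix timestamps in milliseconds.
    start = datetime.fromtimestamp(meta["minTime"] / 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(meta["maxTime"] / 1000, tz=timezone.utc)
    # numSeries may be absent in a damaged or hand-edited meta.json.
    num_series = meta.get("stats", {}).get("numSeries")
    return start, end, num_series


def data_lost_statement(block_dir):
    """Build a one-line description of what deleting this block would lose."""
    start, end, num_series = block_time_range(block_dir)
    hours = (end - start).total_seconds() / 3600
    return (
        f"Dropping block {Path(block_dir).name} loses samples from "
        f"{start.isoformat()} to {end.isoformat()} "
        f"({hours:g} h, {num_series} series)"
    )


if __name__ == "__main__":
    # Demo against a fake block in a temp dir; the directory name is made up.
    with tempfile.TemporaryDirectory() as tmp:
        block = Path(tmp) / "01EXAMPLEBLOCK"
        block.mkdir()
        fake_meta = {
            "minTime": 1700000000000,
            "maxTime": 1700007200000,
            "stats": {"numSeries": 42},
        }
        (block / "meta.json").write_text(json.dumps(fake_meta))
        print(data_lost_statement(block))
```

If `meta.json` is itself empty or truncated, `json.loads` raises an error. In that case the time range has to be inferred some other way, for example from the neighbouring blocks' ranges, and the statement should say that the range is an estimate.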