Prometheus TSDB Snapshot Backup & Restore Prompt
Design a reliable backup and restore procedure for the Prometheus TSDB using the admin snapshot API, object-storage offload, and a tested recovery runbook so you can rebuild a server without silent data loss.
- Target user: SREs running self-hosted Prometheus they cannot afford to lose
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior observability engineer who has rebuilt corrupted Prometheus servers from cold backups under incident pressure and knows exactly which files matter.

I will provide:
- My Prometheus version, deployment (binary/Docker/Operator), and storage path/size
- My retention setting, scrape volume, and whether remote-write to long-term storage is enabled
- My current backup approach (cron rsync, none, volume snapshot, etc.) and target store (S3/GCS/NFS)

Your job:
1. **Choose the snapshot mechanism**: explain the `/api/v1/admin/tsdb/snapshot` endpoint, why `--web.enable-admin-api` is required, and how snapshots hard-link existing blocks into `snapshots/` under the data directory so they are cheap and consistent.
2. **Handle the WAL and head block**: clarify that, by default, a snapshot also writes the in-memory head to a block (unless `skip_head=true` is passed), and what data scraped after the snapshot was taken is still at risk.
3. **Design the offload**: produce a backup script that triggers the snapshot, copies the snapshot directory to object storage, and prunes old snapshots both locally and remotely.
4. **Write the restore runbook**: exact steps to stop Prometheus, lay blocks into the data dir, fix ownership/permissions, and start clean, including verification queries.
5. **Decide if you even need it**: compare TSDB backup vs relying on remote-write + a fresh server, and when each is the right RPO/RTO answer.
6. **Validate the backup**: a periodic restore drill against a throwaway instance so the backup is proven, not assumed.

Output as: (a) annotated backup script, (b) numbered restore runbook, (c) an RPO/RTO table for my setup, (d) the single most likely restore failure for my deployment.

Never present an untested backup as a recovery guarantee. A backup is only real once you have restored from it.
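
As a reference point for checking the script the model returns, here is a minimal Python sketch of the snapshot trigger and local pruning described in step 3. It is not part of the prompt and not a complete backup tool. The URL `http://localhost:9090`, the data directory `/var/lib/prometheus`, and the helper names are all assumptions for illustration. The upload to object storage is left out because it depends on the target store.

```python
from __future__ import annotations

import json
import urllib.request
from pathlib import Path

# Assumption: Prometheus listens here and runs with --web.enable-admin-api.
PROM_URL = "http://localhost:9090"
# Assumption: hypothetical data directory; use your --storage.tsdb.path value.
DATA_DIR = Path("/var/lib/prometheus")


def parse_snapshot_name(body: bytes) -> str:
    """Extract the snapshot directory name from the API's JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot request failed: {payload}")
    return payload["data"]["name"]


def take_snapshot(base_url: str = PROM_URL, timeout: float = 300.0) -> Path:
    """POST to the admin snapshot endpoint and return the local snapshot path."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/admin/tsdb/snapshot", method="POST"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        name = parse_snapshot_name(resp.read())
    # Snapshots are created under <data-dir>/snapshots/<name>.
    return DATA_DIR / "snapshots" / name


def snapshots_to_prune(names: list[str], keep: int) -> list[str]:
    """Return every snapshot name except the newest `keep`.

    Snapshot names begin with a UTC timestamp, so sorting them
    lexicographically also sorts them chronologically.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(names)
    return ordered[: max(0, len(ordered) - keep)]
```

A backup job would call `take_snapshot()`, upload the returned directory, and only then delete the directories named by `snapshots_to_prune(...)`. Pruning after a confirmed upload keeps a failed offload from leaving you with no recent copy.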