AI for Prometheus & Monitoring

Prometheus TSDB Snapshot Backup & Restore Prompt

Design a reliable backup and restore procedure for the Prometheus TSDB using the admin snapshot API, object-storage offload, and a tested recovery runbook so you can rebuild a server without silent data loss.

Target user: SREs running self-hosted Prometheus they cannot afford to lose
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who has rebuilt corrupted Prometheus servers from cold backups under incident pressure and knows exactly which files matter.

I will provide:
- My Prometheus version, deployment (binary/Docker/Operator), and storage path/size
- My retention setting, scrape volume, and whether remote-write to long-term storage is enabled
- My current backup approach (cron rsync, none, volume snapshot, etc.) and target store (S3/GCS/NFS)

Your job:

1. **Choose the snapshot mechanism**: explain the `/api/v1/admin/tsdb/snapshot` endpoint (called with `POST`), why `--web.enable-admin-api` is required, and how a snapshot hard-links existing blocks into `snapshots/<name>/` under the data directory, which makes it cheap and consistent.
2. **Handle the WAL and head block**: clarify that, unless `skip_head=true` is passed, the snapshot also writes the in-memory head out as a block, and state which samples (those scraped after the snapshot was taken) remain at risk until the next backup.
3. **Design the offload**: produce a backup script that triggers the snapshot, copies the snapshot directory to object storage, and prunes old snapshots both locally and remotely.
4. **Write the restore runbook**: exact steps to stop Prometheus, copy the snapshot's block directories into the data directory, fix ownership/permissions, and start clean, including verification queries.
5. **Decide if you even need it**: compare TSDB backup vs relying on remote-write + a fresh server, and when each is the right RPO/RTO answer.
6. **Validate the backup**: a periodic restore drill against a throwaway instance so the backup is proven, not assumed.

Output as: (a) annotated backup script, (b) numbered restore runbook, (c) an RPO/RTO table for my setup, (d) the single most likely restore failure for my deployment.

Never present an untested backup as a recovery guarantee — a backup is only real once you have restored from it.
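
For orientation, here is a minimal Python sketch of the kind of helpers the generated backup script and restore drill might contain. It is not a definitive implementation. `PROM_URL`, `KEEP_DAYS`, and all function names are hypothetical. The snapshot-name format (a UTC timestamp, a hyphen, then a random suffix) is assumed from the example in the Prometheus documentation. The upload to object storage is left out because it depends on your client.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumptions for illustration only; adjust to your deployment.
PROM_URL = "http://localhost:9090"
KEEP_DAYS = 7


def parse_snapshot_name(body: bytes) -> str:
    """Return the snapshot directory name from the admin API response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot request failed: {payload}")
    return payload["data"]["name"]


def trigger_snapshot(base_url: str = PROM_URL, skip_head: bool = False) -> str:
    """POST to the snapshot endpoint (the server needs --web.enable-admin-api)."""
    query = urllib.parse.urlencode({"skip_head": str(skip_head).lower()})
    url = f"{base_url}/api/v1/admin/tsdb/snapshot?{query}"
    request = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(request, timeout=300) as response:
        return parse_snapshot_name(response.read())


def snapshot_time(name: str) -> datetime:
    """Parse the leading UTC timestamp of a snapshot name (assumed format)."""
    stamp = name.split("-", 1)[0]
    parsed = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def expired_snapshots(names, now, keep_days=KEEP_DAYS):
    """Return the snapshot names older than the retention window, sorted."""
    cutoff = now - timedelta(days=keep_days)
    return sorted(n for n in names if snapshot_time(n) < cutoff)


def query_returned_samples(body: bytes) -> bool:
    """True if an /api/v1/query response from the restored instance has samples."""
    payload = json.loads(body)
    return payload.get("status") == "success" and len(payload["data"]["result"]) > 0
```

`trigger_snapshot()` returns the directory name; the snapshot itself sits under `snapshots/<name>/` in the Prometheus data directory, ready to be copied off the host. During a restore drill, `query_returned_samples` can be fed the body of an instant query evaluated at a timestamp the backup should cover, so an empty result fails the drill rather than passing silently.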
