
Prometheus TSDB Snapshot Backup & Restore Prompt

Design a reliable backup and restore procedure for the Prometheus TSDB using the admin snapshot API, object-storage offload, and a tested recovery runbook so you can rebuild a server without silent data loss.

Target user: SREs running self-hosted Prometheus they cannot afford to lose
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who has rebuilt corrupted Prometheus servers from cold backups under incident pressure and knows exactly which files matter.

I will provide:
- My Prometheus version, deployment (binary/Docker/Operator), and storage path/size
- My retention setting, scrape volume, and whether remote-write to long-term storage is enabled
- My current backup approach (cron rsync, none, volume snapshot, etc.) and target store (S3/GCS/NFS)

Your job:

1. **Choose the snapshot mechanism** — explain the `/api/v1/admin/tsdb/snapshot` endpoint, why `--web.enable-admin-api` is required, and how snapshots hard-link blocks into `snapshots/` so they are cheap and consistent.
2. **Handle the WAL and head block** — clarify that a snapshot includes the in-memory head written out as a block unless `skip_head=true` is passed, and that samples scraped after the snapshot call returns are not in it and remain at risk until the next backup.
3. **Design the offload** — produce a backup script that triggers the snapshot, copies the snapshot directory to object storage, and prunes old snapshots both locally and remotely.
4. **Write the restore runbook** — exact steps to stop Prometheus, lay blocks into the data dir, fix ownership/permissions, and start clean, including verification queries.
5. **Decide if you even need it** — compare TSDB backup vs relying on remote-write + a fresh server, and when each is the right RPO/RTO answer.
6. **Validate the backup** — a periodic restore drill against a throwaway instance so the backup is proven, not assumed.

Output as: (a) annotated backup script, (b) numbered restore runbook, (c) an RPO/RTO table for my setup, (d) the single most likely restore failure for my deployment.

Never present an untested backup as a recovery guarantee — a backup is only real once you have restored from it.
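
For orientation, here is a minimal Python sketch of the pieces the backup script in step 3 needs: building the snapshot URL, reading the snapshot name from the API response, and choosing which old snapshots to prune. It is an illustration under stated assumptions, not a finished tool. The function names are hypothetical, the base URL is whatever your server listens on, and the copy to object storage is left out because it depends on your target store. The response shape and the timestamp-prefixed snapshot name follow the example in the Prometheus HTTP API documentation.

```python
import json
import urllib.parse
import urllib.request


def snapshot_url(base_url: str, skip_head: bool = False) -> str:
    """Build the admin snapshot URL; skip_head=true omits the head block."""
    url = base_url.rstrip("/") + "/api/v1/admin/tsdb/snapshot"
    if skip_head:
        url += "?" + urllib.parse.urlencode({"skip_head": "true"})
    return url


def parse_snapshot_name(body: str) -> str:
    """Return the snapshot directory name from the API's JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {payload}")
    return payload["data"]["name"]


def trigger_snapshot(base_url: str, timeout: float = 300.0) -> str:
    """POST to the snapshot endpoint (requires --web.enable-admin-api)."""
    request = urllib.request.Request(snapshot_url(base_url), method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_snapshot_name(response.read().decode("utf-8"))


def snapshots_to_prune(names: list[str], keep: int) -> list[str]:
    """Return the snapshot names that fall outside the newest `keep`.

    Names start with a UTC timestamp (e.g. 20171210T211224Z-...), so a
    plain lexicographic sort is also chronological.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(names)
    return ordered[: max(len(ordered) - keep, 0)]
```

A call such as `trigger_snapshot("http://localhost:9090")` returns the directory name created under `snapshots/` in the data directory. Pass the list of existing names to `snapshots_to_prune` before deleting anything, and apply the same retention count locally and remotely so the two stay aligned.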