Prometheus Storage, Retention & TSDB Prompt

Configure Prometheus TSDB — retention, block size, compaction, WAL, disk sizing, troubleshooting OOM / disk-full.

Target user
Platform engineers operating Prometheus
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are a senior platform engineer who has tuned Prometheus TSDB at scale — chosen retention, sized disks, debugged compaction issues.

I will provide:
- Prometheus version
- Current TSDB stats (`prometheus_tsdb_*` metrics)
- Symptom (disk full, OOM, slow startup)
- Hardware (disk type, RAM)

Your job:

1. **TSDB structure**:
   - **WAL** — write-ahead log; recent samples
   - **Head** — in-memory recent data
   - **Blocks** — 2-hour persistent blocks
   - **Compaction** — merges into larger blocks
2. **Retention**:
   - `--storage.tsdb.retention.time` (e.g., 30d)
   - `--storage.tsdb.retention.size` (e.g., 100GB) — disk-based cap
   - Older data dropped at retention boundary
3. **For disk sizing**:
   - Roughly: bytes/sample × samples/sec × seconds = bytes
   - Bytes/sample ~1.3-1.7 with default config
   - Series count + retention drives total
4. **For OOM**:
   - Head series count × overhead = RAM usage
   - Reduce cardinality (drop unused metrics and high-cardinality labels)
   - Retention and compaction settings have little effect on head memory
   - Add nodes (sharding)
5. **For compaction**:
   - 2h blocks compact to larger blocks over time
   - Compaction is I/O heavy
   - `--storage.tsdb.max-block-duration` caps how long a compacted block may span (defaults to 10% of the retention time)
6. **For WAL**:
   - Replays on startup → slow startup if large
   - `--storage.tsdb.wal-segment-size` sets the size of each WAL segment file; replay time depends mainly on head size and series churn
7. **For backup**:
   - Snapshot via `POST /api/v1/admin/tsdb/snapshot` (requires `--web.enable-admin-api`)
   - Block-level; consistent point-in-time
8. **For remote write**:
   - Offload to long-term storage (Thanos, Cortex)
   - Local TSDB just needs short retention

Mark DESTRUCTIVE: deleting WAL while running (data loss), reducing retention without exporting (history loss), aggressive compaction config impacting query.

---

Prometheus version: [DESCRIBE]
TSDB stats:
```
[PASTE relevant prometheus_tsdb_* metrics]
```
Symptom: [DESCRIBE]
Hardware: [DESCRIBE]

Why this prompt works

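Most Prometheus disk-full, OOM, and slow-startup incidents trace back to a few TSDB inputs: series count, scrape interval, retention, and WAL size. This prompt has the model tie the symptom to those inputs using your own `prometheus_tsdb_*` metrics, and flag destructive actions before recommending them.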

How to use it

  1. Estimate disk from series + retention.
  2. For OOM, look at head series.
  3. Plan compaction window.
  4. Backup via snapshot.

Useful commands

# TSDB stats
curl http://prometheus:9090/api/v1/status/tsdb

# Key metrics
prometheus_tsdb_head_series                    # current head series
prometheus_tsdb_head_chunks                    # head chunks
prometheus_tsdb_head_samples_appended_total
prometheus_tsdb_compactions_total
prometheus_tsdb_wal_corruptions_total
prometheus_tsdb_storage_blocks_bytes
process_resident_memory_bytes                  # process RSS, for OOM triage (prometheus_local_storage_* metrics exist only in 1.x)

# Disk usage
du -sh /prometheus

# Block-level
ls -la /prometheus
# Each subdir = a block

# Snapshot (requires Prometheus started with --web.enable-admin-api)
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# Returns: {"status":"success","data":{"name":"20260530..."}}
# Snapshot at: /prometheus/snapshots/<name>
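
The JSON returned by `/api/v1/status/tsdb` (the TSDB stats command above) can be summarized offline to see which metric names dominate the head. This is a minimal sketch, assuming the response carries the `headStats.numSeries` and `seriesCountByMetricName` fields that recent Prometheus 2.x releases return. The payload is made-up sample data and `top_metrics` is a hypothetical helper name, not part of any Prometheus client.

```python
import json

# Made-up sample payload in the shape of /api/v1/status/tsdb (values are illustrative).
SAMPLE = """
{"status": "success",
 "data": {
   "headStats": {"numSeries": 1000, "chunkCount": 2500},
   "seriesCountByMetricName": [
     {"name": "container_cpu_usage_seconds_total", "value": 250},
     {"name": "http_request_duration_seconds_bucket", "value": 600},
     {"name": "up", "value": 50}
   ]
 }}
"""


def top_metrics(payload, limit=10):
    """Return (metric_name, series_count, share_of_head) tuples, largest first."""
    data = payload["data"]
    # Guard against an empty head so the share calculation cannot divide by zero.
    total = max(data["headStats"]["numSeries"], 1)
    rows = sorted(
        data["seriesCountByMetricName"],
        key=lambda row: row["value"],
        reverse=True,
    )
    return [(row["name"], row["value"], row["value"] / total) for row in rows[:limit]]


if __name__ == "__main__":
    for name, series, share in top_metrics(json.loads(SAMPLE), limit=3):
        print(f"{name}: {series} series ({share:.0%} of head)")
```

In practice the payload would come from saving the curl output to a file and loading it with `json.load`. A metric that holds a large share of the head is usually the first candidate for cardinality reduction.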

Configuration

# Retention by time
--storage.tsdb.retention.time=15d

# Retention by size
--storage.tsdb.retention.size=100GB

# Compaction (advanced flags; the defaults are usually fine)
--storage.tsdb.max-block-duration=24h        # default is 10% of retention time, capped at 31d; longer = fewer blocks but bigger
--storage.tsdb.min-block-duration=2h          # default

# WAL
--storage.tsdb.wal-compression                # enabled by default since v2.20

# Out-of-order samples (v2.39+): set in prometheus.yml, not as a flag
# storage:
#   tsdb:
#     out_of_order_time_window: 10m

# Path
--storage.tsdb.path=/prometheus
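
When `--storage.tsdb.max-block-duration` is not set, the largest compacted block depends on the time retention. The sketch below assumes the rule described in the flag's help text (10% of the retention period) together with a 31-day cap. The function name `default_max_block_duration` is hypothetical, and the retention is assumed to be positive.

```python
from datetime import timedelta

# Assumed upper bound on the default compacted-block span.
MAX_DEFAULT = timedelta(days=31)


def default_max_block_duration(retention_time):
    """Estimate the default max block span: 10% of time retention, capped at 31 days."""
    if retention_time <= timedelta(0):
        raise ValueError("retention_time must be positive")
    return min(retention_time / 10, MAX_DEFAULT)


if __name__ == "__main__":
    for days in (15, 30, 400):
        span = default_max_block_duration(timedelta(days=days))
        print(f"retention {days}d -> max block span {span}")
```

This matters for disk planning because compaction writes the merged block before the source blocks are removed, so longer retention implies larger temporary space during compaction.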

Sizing estimates

Bytes per sample: ~1.5 (with compression)
Samples per scrape: depends on series count
Scrape interval: 30s typical

Daily storage = bytes_per_sample × series_count × (86400/scrape_interval)

Example:
- 1M series × 1.5 bytes × (86400/30) = 4.3 GB / day
- 30 day retention = ~130 GB
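
The estimate above can be wrapped in a small calculator so different series counts and intervals are easy to compare. This is a sketch of the same formula; the `headroom` parameter is an added assumption (extra space for the WAL and for compaction's temporary files), and both function names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def daily_bytes(series_count, scrape_interval_s, bytes_per_sample=1.5):
    """Bytes written per day: bytes_per_sample x series x scrapes per day."""
    scrapes_per_day = SECONDS_PER_DAY / scrape_interval_s
    return bytes_per_sample * series_count * scrapes_per_day


def disk_needed_bytes(series_count, scrape_interval_s, retention_days,
                      bytes_per_sample=1.5, headroom=0.2):
    """Total disk for the retention window plus a fractional headroom (assumed 20%)."""
    raw = daily_bytes(series_count, scrape_interval_s, bytes_per_sample) * retention_days
    return raw * (1 + headroom)


if __name__ == "__main__":
    per_day = daily_bytes(1_000_000, 30)
    total = disk_needed_bytes(1_000_000, 30, 30, headroom=0.0)
    print(f"per day: {per_day / 1e9:.2f} GB")
    print(f"30 days, no headroom: {total / 1e9:.1f} GB")
```

With `headroom=0.0` the function reproduces the worked example; a non-zero headroom is a planning margin, not a measured figure.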

Common findings this catches

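  • Disk full → retention too generous OR series count too high.
  • OOM → head series count too high; reduce cardinality.
  • WAL replay slow → large head or high series churn; reduce cardinality and confirm Prometheus shuts down gracefully.
  • Compaction lag → disk I/O saturated.
  • Remote write queue full → downstream slow; back-pressure on Prom.
  • Corruption → for the WAL, check `prometheus_tsdb_wal_corruptions_total`; for a damaged block, restore from a snapshot.
  • Retention not enforced → whichever of the time and size limits triggers first applies, and data is removed one whole block at a time; check both flags.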
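
Retention can look unenforced because data is dropped a whole block at a time rather than sample by sample. The sketch below is a simplified model of that behavior, not the TSDB's actual code: it assumes time retention is measured back from the newest block's end time, and that size retention counts the WAL plus blocks from newest to oldest. `Block` and `blocks_to_delete` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    max_time_ms: int   # end of the block's time range
    size_bytes: int


def blocks_to_delete(blocks, retention_ms=0, max_bytes=0, wal_bytes=0):
    """Return names (oldest first) of blocks this simplified retention pass would drop."""
    newest_first = sorted(blocks, key=lambda b: b.max_time_ms, reverse=True)
    doomed = set()

    # Time retention: compare each block's end time with the newest block's end time.
    if retention_ms > 0 and newest_first:
        newest_end = newest_first[0].max_time_ms
        for block in newest_first:
            if newest_end - block.max_time_ms >= retention_ms:
                doomed.add(block.name)

    # Size retention: accumulate from newest to oldest; drop everything past the cap.
    if max_bytes > 0:
        used = wal_bytes
        over_cap = False
        for block in newest_first:
            used += block.size_bytes
            if used > max_bytes:
                over_cap = True
            if over_cap:
                doomed.add(block.name)

    return [b.name for b in reversed(newest_first) if b.name in doomed]


if __name__ == "__main__":
    hour = 3_600_000
    blocks = [Block("a", 2 * hour, 100), Block("b", 4 * hour, 100),
              Block("c", 6 * hour, 100), Block("d", 8 * hour, 100)]
    print(blocks_to_delete(blocks, retention_ms=4 * hour))
    print(blocks_to_delete(blocks, max_bytes=250))
```

Under this model a large compacted block survives until its newest sample ages out, which is why disk usage can stay above the expected figure for a while after retention is lowered.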

When to escalate

  • Long-term storage planning — Thanos / Mimir.
  • Multi-Prometheus sharding — strategic.
  • Hardware sizing — platform team.
