Prometheus Storage, Retention & TSDB Prompt
Configure Prometheus TSDB — retention, block size, compaction, WAL, disk sizing, troubleshooting OOM / disk-full.
- Target user
- Platform engineers operating Prometheus
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a senior platform engineer who has tuned Prometheus TSDB at scale — chosen retention, sized disks, debugged compaction issues. I will provide: - Prometheus version - Current TSDB stats (`prometheus_tsdb_*` metrics) - Symptom (disk full, OOM, slow startup) - Hardware (disk type, RAM) Your job: 1. **TSDB structure**: - **WAL** — write-ahead log; recent samples - **Head** — in-memory recent data - **Blocks** — 2-hour persistent blocks - **Compaction** — merges into larger blocks 2. **Retention**: - `--storage.tsdb.retention.time` (e.g., 30d) - `--storage.tsdb.retention.size` (e.g., 100GB) — disk-based cap - Older data dropped at retention boundary 3. **For disk sizing**: - Roughly: bytes/sample × samples/sec × seconds = bytes - Bytes/sample ~1.3-1.7 with default config - Series count + retention drives total 4. **For OOM**: - Head series count × overhead = RAM usage - Compact more aggressively - Reduce retention - Add nodes (sharding) 5. **For compaction**: - 2h blocks compact to larger blocks over time - Compaction is I/O heavy - `tsdb.max-block-duration` controls max 6. **For WAL**: - Replays on startup → slow startup if large - `--storage.tsdb.wal-segment-size` controls 7. **For backup**: - Snapshot via `/api/v1/admin/tsdb/snapshot` - Block-level; consistent point-in-time 8. **For remote write**: - Offload to long-term storage (Thanos, Cortex) - Local TSDB just needs short retention Mark DESTRUCTIVE: deleting WAL while running (data loss), reducing retention without exporting (history loss), aggressive compaction config impacting query. --- Prometheus version: [DESCRIBE] TSDB stats: ``` [PASTE relevant prometheus_tsdb_* metrics] ``` Symptom: [DESCRIBE] Hardware: [DESCRIBE]
Why this prompt works
TSDB tuning is essential at scale. This prompt walks the basics.
How to use it
- Estimate disk from series + retention.
- For OOM, look at head series.
- Plan compaction window.
- Backup via snapshot.
Useful commands
# TSDB stats
curl http://prometheus:9090/api/v1/status/tsdb
# Key metrics
prometheus_tsdb_head_series # current head series
prometheus_tsdb_head_chunks # head chunks
prometheus_tsdb_head_samples_appended_total
prometheus_tsdb_compactions_total
prometheus_tsdb_wal_corruptions_total
prometheus_tsdb_storage_blocks_bytes
prometheus_local_storage_memory_chunks
# Disk usage
du -sh /prometheus
# Block-level
ls -la /prometheus
# Each subdir = a block
# Snapshot
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# Returns: {"status":"success","data":{"name":"20260530..."}}
# Snapshot at: /prometheus/snapshots/<name>
Configuration
# Retention by time
--storage.tsdb.retention.time=15d
# Retention by size
--storage.tsdb.retention.size=100GB
# Compaction
--storage.tsdb.max-block-duration=24h # default; longer = fewer blocks but bigger
--storage.tsdb.min-block-duration=2h # default
# WAL
--storage.tsdb.wal-compression=true # default in 2.x
# Out-of-order samples (newer)
--storage.tsdb.out-of-order-time-window=10m
# Path
--storage.tsdb.path=/prometheus
Sizing estimates
Bytes per sample: ~1.5 (with compression)
Samples per scrape: depends on series count
Scrape interval: 30s typical
Daily storage = bytes_per_sample × series_count × (86400/scrape_interval)
Example:
- 1M series × 1.5 bytes × (86400/30) = 4.3 GB / day
- 30 day retention = ~130 GB
Common findings this catches
- Disk full → retention too generous OR series count too high.
- OOM → head series huge; cardinality reduction.
- WAL replay slow → compact often; check on graceful shutdown.
- Compaction lag → I/O saturated.
- Remote write queue full → downstream slow; back-pressure on Prom.
- Block corruption → in WAL; recover from snapshot.
- Retention not enforced → time vs size; check both.
When to escalate
- Long-term storage planning — Thanos / Mimir.
- Multi-Prometheus sharding — strategic.
- Hardware sizing — platform team.
Related prompts
-
Prometheus Performance Tuning Prompt
Tune Prometheus performance — head series, memory, query timeout, max samples, ingestion rate, expensive queries.
-
Prometheus Remote Write & Long-term Storage Prompt
Configure remote write to long-term storage — Thanos Receive, Cortex/Mimir, VictoriaMetrics, troubleshoot queue/backlog/back-pressure.
-
PromQL Query Optimization Prompt
Diagnose slow PromQL queries — cardinality explosion, range vector traps, sum vs avg pitfalls, query timeout, recording rules opportunity.