You are a senior platform engineer who has tuned Prometheus TSDB at scale — chosen retention, sized disks, debugged compaction issues. I will provide: - Prometheus version - Current TSDB stats (`prometheus_tsdb_*` metrics) - Symptom (disk full, OOM, slow startup) - Hardware (disk type, RAM) Your job: 1. **TSDB structure**: - **WAL** — write-ahead log; recent samples - **Head** — in-memory recent data - **Blocks** — 2-hour persistent blocks - **Compaction** — merges into larger blocks 2. **Retention**: - `--storage.tsdb.retention.time` (e.g., 30d) - `--storage.tsdb.retention.size` (e.g., 100GB) — disk-based cap - Older data dropped at retention boundary 3. **For disk sizing**: - Roughly: bytes/sample × samples/sec × seconds = bytes - Bytes/sample ~1.3-1.7 with default config - Series count + retention drives total 4. **For OOM**: - Head series count × overhead = RAM usage - Compact more aggressively - Reduce retention - Add nodes (sharding) 5. **For compaction**: - 2h blocks compact to larger blocks over time - Compaction is I/O heavy - `tsdb.max-block-duration` controls max 6. **For WAL**: - Replays on startup → slow startup if large - `--storage.tsdb.wal-segment-size` controls 7. **For backup**: - Snapshot via `/api/v1/admin/tsdb/snapshot` - Block-level; consistent point-in-time 8. **For remote write**: - Offload to long-term storage (Thanos, Cortex) - Local TSDB just needs short retention Mark DESTRUCTIVE: deleting WAL while running (data loss), reducing retention without exporting (history loss), aggressive compaction config impacting query. --- Prometheus version: [DESCRIBE] TSDB stats: ``` [PASTE relevant prometheus_tsdb_* metrics] ``` Symptom: [DESCRIBE] Hardware: [DESCRIBE]

Why this prompt works

TSDB tuning is essential at scale. This prompt walks the basics.

How to use it

Estimate disk from series + retention.
For OOM, look at head series.
Plan compaction window.
Backup via snapshot.

Useful commands

# TSDB stats
curl http://prometheus:9090/api/v1/status/tsdb

# Key metrics
prometheus_tsdb_head_series                    # current head series
prometheus_tsdb_head_chunks                    # head chunks
prometheus_tsdb_head_samples_appended_total
prometheus_tsdb_compactions_total
prometheus_tsdb_wal_corruptions_total
prometheus_tsdb_storage_blocks_bytes
prometheus_local_storage_memory_chunks

# Disk usage
du -sh /prometheus

# Block-level
ls -la /prometheus
# Each subdir = a block

# Snapshot
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# Returns: {"status":"success","data":{"name":"20260530..."}}
# Snapshot at: /prometheus/snapshots/<name>

Configuration

# Retention by time
--storage.tsdb.retention.time=15d

# Retention by size
--storage.tsdb.retention.size=100GB

# Compaction
--storage.tsdb.max-block-duration=24h        # default; longer = fewer blocks but bigger
--storage.tsdb.min-block-duration=2h          # default

# WAL
--storage.tsdb.wal-compression=true           # default in 2.x

# Out-of-order samples (newer)
--storage.tsdb.out-of-order-time-window=10m

# Path
--storage.tsdb.path=/prometheus

Sizing estimates

Bytes per sample: ~1.5 (with compression)
Samples per scrape: depends on series count
Scrape interval: 30s typical

Daily storage = bytes_per_sample × series_count × (86400/scrape_interval)

Example:
- 1M series × 1.5 bytes × (86400/30) = 4.3 GB / day
- 30 day retention = ~130 GB

Common findings this catches

Disk full → retention too generous OR series count too high.
OOM → head series huge; cardinality reduction.
WAL replay slow → compact often; check on graceful shutdown.
Compaction lag → I/O saturated.
Remote write queue full → downstream slow; back-pressure on Prom.
Block corruption → in WAL; recover from snapshot.
Retention not enforced → time vs size; check both.

When to escalate

Long-term storage planning — Thanos / Mimir.
Multi-Prometheus sharding — strategic.
Hardware sizing — platform team.

Prometheus Storage, Retention & TSDB Prompt

Why this prompt works

How to use it

Useful commands

Configuration

Sizing estimates

Common findings this catches

When to escalate

Related prompts

Prometheus Performance Tuning Prompt

Prometheus Remote Write & Long-term Storage Prompt

PromQL Query Optimization Prompt

Why this prompt works

How to use it

Useful commands

Configuration

Sizing estimates

Common findings this catches

When to escalate

Related prompts

Prometheus Performance Tuning Prompt

Prometheus Remote Write & Long-term Storage Prompt

PromQL Query Optimization Prompt

Free: the DevOps AI Incident-Triage Cheat Sheet