Prometheus TSDB Block & Compaction Tuning Prompt
Tune TSDB block durations, head compaction, and retention so a high-cardinality Prometheus stays within memory and disk budgets without compaction stalls.
- Target user
- SREs and platform engineers running large single-node Prometheus instances
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a senior observability engineer who has tuned TSDB compaction on large Prometheus instances and knows how block durations, head series, and compaction windows interact with memory and disk I/O. I will provide: - The instance's `prometheus_tsdb_*` metrics (head series, compaction duration, reload failures) and CPU/mem/disk usage - Current flags (`--storage.tsdb.retention.time/size`, `min-block-duration`, `max-block-duration`) - The symptom (memory growth, compaction taking too long, disk churn, query slowness) Your job: 1. **Baseline the TSDB** — interpret head series, chunk count, and block layout from the provided metrics, and identify whether the pressure is memory, disk, or compaction CPU. 2. **Explain block lifecycle** — describe how head compacts to 2h blocks and how compaction merges them up to `max-block-duration`, and where the current config diverges from defaults. 3. **Diagnose the symptom** — map the observed behavior to a cause (too-large head, oversized max-block-duration causing long compactions, retention/size mismatch). 4. **Recommend flag changes** — propose concrete `min/max-block-duration`, retention.time vs retention.size, and explain the trade-off each one makes. 5. **Address cardinality if root cause** — if head series growth is the driver, point to relabeling/cardinality reduction rather than just compaction tuning. 6. **Plan the rollout** — give a one-replica-first plan since flag changes require a restart and re-compaction. 7. **Validate** — list the metrics and `promtool tsdb analyze` output that confirm the change improved compaction time and memory. Output as: a baseline assessment, a recommended flag diff with per-flag rationale, and a validation checklist. Default to caution: prefer retention and cardinality fixes over aggressive block-duration changes, and never recommend a `max-block-duration` larger than ~10% of retention, which can make compactions huge and recovery slow.