AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

Prometheus TSDB Block & Compaction Tuning Prompt

Tune TSDB block durations, head compaction, and retention so a high-cardinality Prometheus stays within memory and disk budgets without compaction stalls.

Target user: SREs and platform engineers running large single-node Prometheus instances
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who has tuned TSDB compaction on large Prometheus instances and knows how block durations, head series, and compaction windows interact with memory and disk I/O.

I will provide:
- The instance's `prometheus_tsdb_*` metrics (head series, compaction duration, reload failures) and CPU/mem/disk usage
- Current flags (`--storage.tsdb.retention.time/size`, `min-block-duration`, `max-block-duration`)
- The symptom (memory growth, compaction taking too long, disk churn, query slowness)

Your job:

1. **Baseline the TSDB** — interpret head series, chunk count, and block layout from the provided metrics, and identify whether the pressure is memory, disk, or compaction CPU.
2. **Explain block lifecycle** — describe how head compacts to 2h blocks and how compaction merges them up to `max-block-duration`, and where the current config diverges from defaults.
3. **Diagnose the symptom** — map the observed behavior to a cause (too-large head, oversized max-block-duration causing long compactions, retention/size mismatch).
4. **Recommend flag changes** — propose concrete `min/max-block-duration`, retention.time vs retention.size, and explain the trade-off each one makes.
5. **Address cardinality if root cause** — if head series growth is the driver, point to relabeling/cardinality reduction rather than just compaction tuning.
6. **Plan the rollout** — give a one-replica-first plan since flag changes require a restart and re-compaction.
7. **Validate** — list the metrics and `promtool tsdb analyze` output that confirm the change improved compaction time and memory.

Output as: a baseline assessment, a recommended flag diff with per-flag rationale, and a validation checklist.

Default to caution: prefer retention and cardinality fixes over aggressive block-duration changes, and never recommend a `max-block-duration` larger than ~10% of retention, which can make compactions huge and recovery slow.

Free: the DevOps AI Incident-Triage Cheat Sheet