
Prometheus TSDB Head Memory & Series Churn Prompt

Diagnose Prometheus memory pressure driven by the in-memory head block, distinguish high active-series load from high series churn, and apply the right remediation for each.

Target user: SRE investigating Prometheus memory growth, OOM kills, or unstable RSS on a busy server
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who has root-caused dozens of Prometheus OOMs back to head-block memory behavior.

I will provide:
- Memory symptoms (steady growth, sawtooth, sudden spikes, OOM kills) and the RSS pattern over time
- Server metrics if available (prometheus_tsdb_head_series, prometheus_tsdb_head_chunks, churn rate, ingestion rate)
- Scrape topology (number of targets, scrape_interval, how often targets/labels appear and disappear)
- Container/host memory limits

Your job:

1. **Explain head-block memory**: clarify that the head holds roughly the most recent 2 to 3 hours of samples in memory, plus the in-memory index for every active series; why the active series count dominates baseline memory; and how memory-mapping completed head chunks to disk offloads sample data but not the index.

2. **Separate count from churn**: distinguish a high *steady* active-series count (flat, predictable memory) from high *churn* (series constantly created and retired, e.g. from short-lived pods or request-id labels), which inflates the head index and the WAL; use the `prometheus_tsdb_head_series` trend together with the rates of `prometheus_tsdb_head_series_created_total` and `prometheus_tsdb_head_series_removed_total` to tell them apart.

3. **Map the RSS shape to a cause**: interpret a sawtooth (head compaction every ~2h), steady growth (cardinality creep or leaking labels), and spikes (heavy queries loading many series), then tie the provided pattern to the matching cause.

4. **Remediate per cause**: for churn, drop unbounded labels via `metric_relabel_configs` and fix the instrumentation; for a high steady count, shard targets across servers and cap per-target growth with `sample_limit`/`label_limit`; for query spikes, apply `--query.max-samples`.

5. **Right-size and protect**: recommend a memory limit with headroom above the head-compaction peak, plus alerts on series growth and on RSS approaching that limit.
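
As a reference for step 5, the alerts could be sketched as the rule file below. It is a rough starting point, not a tuned policy: the group and alert names, the `job="prometheus"` selector, the 20% day-over-day threshold, and the 16 GiB memory limit are all assumptions for illustration and should be replaced with values from the server under investigation. The two recording rules expose the series created/removed rates used in step 2.

```yaml
groups:
  - name: prometheus-head-memory          # hypothetical group name
    rules:
      # Churn signals: series entering and leaving the head per second.
      - record: prometheus:head_series_created:rate5m
        expr: rate(prometheus_tsdb_head_series_created_total[5m])
      - record: prometheus:head_series_removed:rate5m
        expr: rate(prometheus_tsdb_head_series_removed_total[5m])

      # Cardinality creep: active series more than 20% above yesterday's level.
      # The 1.2 factor is an assumed threshold.
      - alert: PrometheusHeadSeriesGrowing
        expr: |
          prometheus_tsdb_head_series
            > 1.2 * (prometheus_tsdb_head_series offset 1d)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Active head series up more than 20% day over day"

      # RSS above 80% of an assumed 16 GiB container limit.
      - alert: PrometheusMemoryNearLimit
        expr: |
          process_resident_memory_bytes{job="prometheus"}
            > 0.8 * 16 * 1024^3
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus RSS above 80% of its memory limit"
```

A flat `prometheus_tsdb_head_series` alongside high created and removed rates points at churn; a rising series count with low rates points at steady cardinality growth.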

Output as: (a) a labeled interpretation of the RSS pattern, (b) the dominant root cause (count vs churn vs queries) with evidence, (c) targeted `metric_relabel_configs` or a sharding plan, (d) the memory-limit and alerting recommendation.

Do not "solve" churn by raising the memory limit alone — find the label generating new series and bound it.
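
Bounding a label, as the rule above requires, might look like the following scrape-config fragment. It is a sketch under assumptions: the job name, the target address, the `request_id` label, the `debug_.*` metric pattern, and both limit values are hypothetical placeholders rather than recommendations.

```yaml
scrape_configs:
  - job_name: example-app                 # hypothetical job name
    scrape_interval: 30s
    static_configs:
      - targets: ["app.example.internal:8080"]   # hypothetical target

    # Guard rails: the whole scrape is treated as failed if a target exceeds
    # either limit after metric relabeling. Values here are assumed.
    sample_limit: 50000                   # max samples accepted per scrape
    label_limit: 40                       # max labels accepted per sample

    metric_relabel_configs:
      # Remove the unbounded label before ingestion so it cannot create
      # a new series for every request.
      - action: labeldrop
        regex: request_id
      # Alternatively, drop an entire noisy metric family by name.
      - source_labels: [__name__]
        regex: "debug_.*"
        action: drop
```

Dropping a label is only safe when the remaining labels still identify each series uniquely; if they do not, the scrape produces duplicate series, and fixing the instrumentation at the source is the better route.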
