Prometheus Metric Cardinality Control Prompt
Find, quantify, and kill the high-cardinality label combinations that bloat your TSDB, blow up memory, and slow queries, then put guardrails in place so cardinality never regresses.
- Target user: Platform/SRE teams whose Prometheus heap or active-series count keeps growing
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a Prometheus reliability engineer who has rescued TSDBs that ballooned past 20M active series. You treat every label like it costs money, because it does.
I will provide:
- Output of `topk(20, count by (__name__)({__name__=~".+"}))` and total active series
- The worst offending metric's label set (a sample series)
- The TSDB Status page or `/api/v1/status/tsdb` JSON (series counts by metric name, label-value counts by label name)
- Heap/RSS over time and scrape config for the noisy job
- Which exporters or app instrumentation produce the metric
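
To gather the label-value counts listed above, a minimal sketch along these lines may help. It assumes the JSON body returned by `/api/v1/status/tsdb` has already been fetched and parsed into a dict; the function names are hypothetical, and only the `headStats.numSeries` and `labelValueCountByLabelName` fields of that response are read.

```python
from typing import Any


def rank_label_cardinality(tsdb_status: dict[str, Any], top_n: int = 5) -> list[tuple[str, int]]:
    """Return the top_n (label_name, distinct_value_count) pairs, largest first."""
    data = tsdb_status.get("data", {})
    entries = data.get("labelValueCountByLabelName", [])
    pairs = [(entry["name"], int(entry["value"])) for entry in entries]
    # Sort explicitly rather than relying on the order in the response.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:top_n]


def total_head_series(tsdb_status: dict[str, Any]) -> int:
    """Return the active (head) series count, or 0 if the field is missing."""
    return int(tsdb_status.get("data", {}).get("headStats", {}).get("numSeries", 0))
```

Pasting the ranked pairs and the head series total into the prompt gives the model the same view as the TSDB Status page in a compact form.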
Your job:
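1. **Diagnose the explosion**: name the exact label(s) driving cardinality (user_id, pod hash, full URL path, request UUID, email, raw status text). Estimate each label's cardinality contribution as its distinct value count × the number of series the metric would have without that label; treat this as an upper bound, since not every combination occurs.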
2. **Classify each offender** as: unbounded (UUID/timestamp/email — never allow), high-but-bounded (path, status_code — bucket or normalize), or legitimate (instance, job).
3. **Remediation per offender**:
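- `metric_relabel_configs` with `action: labeldrop` to strip the bad label, or `action: drop` to discard matching series, at scrape time (check that series stay unique once the label is removed)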
- Path/route normalization (`/users/123` → `/users/:id`) at the app or relabel layer
- `keep`/`drop` actions on `__name__` to discard whole metrics you never query
- Aggregation via recording rules so dashboards stop needing raw series
4. **Guardrails**: set `sample_limit`, `label_limit`, and `label_value_length_limit` per scrape job and show the chosen values; note that exceeding any of them fails the whole scrape. Add a meta-alert on `scrape_samples_post_metric_relabeling` approaching `sample_limit` and on the growth rate of `prometheus_tsdb_head_series`.
5. **CI check**: a script that scrapes a target in staging and fails the PR if any metric exceeds a per-metric series budget.
6. **Capacity math**: estimate bytes per series (for example, heap in use divided by `prometheus_tsdb_head_series`), project head memory after the fix, and state the new safe ceiling for active series.
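
For reference, the CI check in step 5 could look roughly like the sketch below. It assumes the staging target's `/metrics` output has already been saved to a file in the Prometheus text exposition format; the function names and the default budget of 200 series per metric are hypothetical placeholders, not recommended values.

```python
import re
import sys
from collections import Counter

# A sample line starts with the metric name, followed by "{" or whitespace.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def count_series(exposition_text: str) -> Counter:
    """Count sample lines per metric name; each sample line is one series."""
    counts: Counter = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def over_budget(counts, default_budget, overrides=None):
    """Return {metric_name: series_count} for metrics above their budget."""
    overrides = overrides or {}
    return {
        name: n
        for name, n in counts.items()
        if n > overrides.get(name, default_budget)
    }


def main(path: str, default_budget: int = 200) -> int:
    """Print offenders and return a non-zero exit code if any exist."""
    with open(path, encoding="utf-8") as handle:
        offenders = over_budget(count_series(handle.read()), default_budget)
    for name, n in sorted(offenders.items(), key=lambda item: -item[1]):
        print(f"{name}: {n} series exceeds budget {default_budget}")
    return 1 if offenders else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

Histogram and summary samples are counted under their exposed names (for example `_bucket`, `_sum`, `_count`), so a budget for a histogram needs to account for its bucket count.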
Output: (a) ranked offender table with cardinality contribution, (b) exact relabel YAML per fix, (c) recording rules to replace raw queries, (d) the limit settings, (e) the CI guard script, (f) before/after active-series estimate.
Bias toward: dropping data over keeping it, bounded labels only, and never letting an unbounded label reach the TSDB.
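
As an illustration of keeping a path label bounded at the app layer, the sketch below replaces numeric and UUID path segments with placeholders before the value is used as a label. The function name and the `:id`/`:uuid` placeholders are assumptions for illustration; where the web framework exposes the matched route template, using that directly is usually the safer choice.

```python
import re

_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize_path(raw_path: str) -> str:
    """Collapse numeric and UUID path segments into fixed placeholders."""
    path = raw_path.split("?", 1)[0]  # drop the query string entirely
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment):
            segments.append(":id")
        elif _UUID.match(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments)
```

This only bounds the label if the remaining segments come from a fixed set; free-text segments such as slugs or usernames would still need their own rule.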