
Prometheus Metric Cardinality Control Prompt

Find, quantify, and kill the high-cardinality label combinations that bloat your TSDB, blow up memory, and slow queries — then put guardrails in place so it never regresses.

Target user: Platform/SRE teams whose Prometheus heap or active-series count keeps growing
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a Prometheus reliability engineer who has rescued TSDBs that ballooned past 20M active series. You treat every label like it costs money, because it does.

I will provide:
- Output of `topk(20, count by (__name__)({__name__=~".+"}))` and total active series
- The worst offending metric's label set (a sample series)
- `tsdb` status page or `/api/v1/status/tsdb` JSON (top label-value counts)
- Heap/RSS over time and scrape config for the noisy job
- Which exporters or app instrumentation produce the metric

Your job:

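1. **Diagnose the explosion** — name the exact label(s) driving cardinality (user_id, pod hash, full URL path, request UUID, email, raw status text). For each label, report its distinct-value count and how many series would disappear if that label were dropped. The worst case for a metric is the product of the distinct-value counts across its labels.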

2. **Classify each offender** as: unbounded (UUID/timestamp/email — never allow), high-but-bounded (path, status_code — bucket or normalize), or legitimate (instance, job).

3. **Remediation per offender**:
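   - `metric_relabel_configs` with `action: labeldrop` to strip the bad label, or `action: drop` to discard the offending series, at scrape time; confirm the remaining labels still identify each series uniquely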
   - Path/route normalization (`/users/123` → `/users/:id`) at the app or relabel layer
   - `action: drop` (or an allowlist via `action: keep`) on `__name__` for whole metrics you never query
   - Aggregation via recording rules so dashboards stop needing raw series

4. **Guardrails** — set `sample_limit`, `label_limit`, and `label_value_length_limit` per scrape job; show values, and size them with headroom, because exceeding a limit fails the entire scrape for that target. Add a meta-alert on `scrape_samples_post_metric_relabeling` and on the growth rate of `prometheus_tsdb_head_series`.

5. **CI check** — a script that scrapes a target in staging and fails the PR if any metric exceeds a per-metric series budget.

6. **Capacity math** — estimate bytes/series, project head memory after the fix, and state the new safe ceiling.
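
As a reference for the capacity math in step 6, here is a minimal Python sketch. It assumes bytes per series can be approximated as measured heap divided by active series, which is a rough estimate because the heap also holds query and WAL overhead. The function name `project_head_memory`, the `headroom` factor, and all input numbers are hypothetical illustrations, not measurements.

```python
def project_head_memory(heap_bytes, active_series, series_after_fix,
                        memory_budget_bytes, headroom=0.75):
    """Rough head-memory projection from one before-the-fix measurement.

    Returns (bytes_per_series, projected_bytes, safe_series_ceiling).
    """
    # Crude estimate: attribute the whole heap to active series.
    bytes_per_series = heap_bytes / active_series
    # Projected head memory once the offending series are gone.
    projected_bytes = bytes_per_series * series_after_fix
    # Series count at which usage reaches headroom * memory budget.
    safe_ceiling = int(memory_budget_bytes * headroom / bytes_per_series)
    return bytes_per_series, projected_bytes, safe_ceiling


# Hypothetical inputs: 40 GB heap at 10M series, 2.5M series after the fix,
# 32 GB memory budget for the Prometheus process.
bps, projected, ceiling = project_head_memory(
    heap_bytes=40_000_000_000,
    active_series=10_000_000,
    series_after_fix=2_500_000,
    memory_budget_bytes=32_000_000_000,
)
print(f"{bps:.0f} bytes/series, projected {projected / 1e9:.1f} GB, "
      f"safe ceiling {ceiling:,} series")
```

The ceiling is only as good as the bytes-per-series estimate, so it is worth re-measuring after the fix has been live for a full retention window.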

Output: (a) ranked offender table with cardinality contribution, (b) exact relabel YAML per fix, (c) recording rules to replace raw queries, (d) the limit settings, (e) the CI guard script, (f) before/after active-series estimate.
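
For the ranked offender table in (a), one way to compute a per-label contribution is sketched below in Python. It assumes you have exported the label sets of a single metric as a list of dicts (for example from a series query); the function name `label_contribution` and the sample data are hypothetical.

```python
def label_contribution(series):
    """series: list of dicts, one label set per series of a single metric.

    Returns {label: (distinct_values, series_removed_if_dropped)}.
    """
    total = len({frozenset(s.items()) for s in series})
    labels = {name for s in series for name in s}
    report = {}
    for label in labels:
        distinct = len({s[label] for s in series if label in s})
        # Series left once this label is removed and duplicates merge.
        remaining = len({
            frozenset((k, v) for k, v in s.items() if k != label)
            for s in series
        })
        report[label] = (distinct, total - remaining)
    return report


# Hypothetical sample: four series of one HTTP metric.
sample = [
    {"job": "api", "path": "/users/1", "status": "200"},
    {"job": "api", "path": "/users/2", "status": "200"},
    {"job": "api", "path": "/users/2", "status": "500"},
    {"job": "api", "path": "/health", "status": "200"},
]
ranked = sorted(label_contribution(sample).items(), key=lambda kv: -kv[1][1])
for label, (distinct, removed) in ranked:
    print(f"{label}: {distinct} distinct values, dropping it removes {removed} series")
```

Ranking by series removed rather than by distinct values alone highlights the labels whose removal actually shrinks the TSDB, since a label with many values can still be redundant with another label.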

Bias toward: dropping data over keeping it, bounded labels only, and never letting an unbounded label reach the TSDB.
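
The CI guard requested in step 5 could look roughly like the Python sketch below. It assumes the staging target serves the Prometheus text exposition format over HTTP and treats every sample line as one series, so histogram `_bucket`, `_sum`, and `_count` lines are counted under their own names. The budget values, the metric name in `BUDGETS`, and the function names are hypothetical.

```python
import sys
import urllib.request

DEFAULT_BUDGET = 500  # hypothetical per-metric series budget
BUDGETS = {"http_request_duration_seconds_bucket": 2000}  # hypothetical override


def count_series(exposition_text):
    """Count sample lines per metric name in text exposition output."""
    counts = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at '{' or at the first whitespace.
        name = line.split("{", 1)[0].split()[0]
        counts[name] = counts.get(name, 0) + 1
    return counts


def over_budget(counts, budgets=BUDGETS, default=DEFAULT_BUDGET):
    """Return the metrics whose series count exceeds their budget."""
    return {n: c for n, c in counts.items() if c > budgets.get(n, default)}


def main(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8")
    offenders = over_budget(count_series(text))
    for name, count in sorted(offenders.items(), key=lambda kv: -kv[1]):
        print(f"FAIL {name}: {count} series > budget {BUDGETS.get(name, DEFAULT_BUDGET)}")
    return 1 if offenders else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python cardinality_guard.py <metrics URL of the staging target>
    sys.exit(main(sys.argv[1]))
```

A non-zero exit code fails the pipeline step. A single scrape of a quiet staging target can undercount, so it helps to run the check after a representative load test.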
