DevOps AI ToolKit

Prometheus metric_relabel_configs Drop-List Cardinality Audit Prompt

Audit and generate metric_relabel_configs drop and keep rules that cut high-cardinality series at ingest without dropping metrics your alerts and dashboards depend on.

Target user: Engineers fighting cardinality who need a safe ingest-time drop list
Difficulty: Advanced
Tools: Claude, ChatGPT, Cursor

The prompt

You are a Prometheus operator who treats metric_relabel_configs as a cardinality firewall applied after scrape but before ingestion, and who never drops a series an alert or dashboard depends on.

I will provide:
- The top offending metrics by series count (from `topk(20, count by (__name__)({__name__=~".+"}))`): [TOP SERIES]
- The high-churn labels (e.g. pod, id, path, user_id) and example values: [LABELS]
- The list of alert rules, recording rules, and dashboard queries that touch these metrics: [DEPENDENCIES]
- My scrape job structure (one job vs many) and whether I use kube-prometheus-stack: [JOB CONTEXT]

Your job:

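1. **Separate drop-the-metric from drop-the-label**: explain when to use `action: drop` (remove an entire unused metric, such as a go_gc_* metric you never query) versus `labeldrop`/`labelkeep` (keep the metric, remove an unbounded label). After a label is removed, the remaining labels must still identify each series uniquely, so call out any rule that could produce duplicate series. Most cardinality wins come from labeldrop, not drop.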

2. **Cross-check against dependencies** — for every proposed drop, state which alert/dashboard query, if any, references that metric or label. Refuse to drop anything referenced; flag it for me to confirm instead.

3. **Write the rules** — produce a metric_relabel_configs block with explicit regex, source_labels, and comments. Prefer keep-lists for noisy exporters (only keep the 30 metrics we use) over endless drop-lists.

4. **Estimate the reduction** — for each rule, give me the query to measure series-before and series-after so I can quantify the win rather than guess.

5. **Order and safety** — explain rule ordering (rules run top to bottom; a drop short-circuits later rules for that series) and where to place this in a kube-prometheus-stack ServiceMonitor vs the Prometheus config.

Output as: (a) a table of proposed changes with the dependency-safety column, (b) the runnable metric_relabel_configs YAML, (c) the before/after measurement queries, (d) a rollout note (apply to staging, watch active series, then promote).

Never drop a metric or label referenced by an existing alert, recording rule, or dashboard query. When unsure whether something is used, flag it; do not drop it.

Why this prompt works

Cardinality control is where well-meaning engineers cause outages, because the tool that fixes the problem — metric_relabel_configs — is also a silent data-deletion mechanism. A drop rule produces no error and no warning; it just removes series before they hit the TSDB. If one of those series backed an alert, that alert quietly stops firing, and you discover it during the next incident when the page never came. This prompt makes dependency cross-checking a hard, non-skippable step: every proposed drop must be reconciled against the actual alert and dashboard queries you supply, and anything referenced gets flagged rather than removed.
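To make the failure mode concrete, here is a minimal sketch of a Prometheus rule file. It assumes a hypothetical counter `http_requests_total`, and the alert names, thresholds, and durations are invented for illustration. If a relabel rule drops that metric, the first expression evaluates to an empty result, so the alert never fires and nothing reports an error. The second rule is one possible guard and is not something the prompt itself requires: `absent()` returns a value only when no matching series exist.

```yaml
groups:
  - name: example-dependency-guard          # hypothetical group name
    rules:
      # Goes silent (no error, no firing) if http_requests_total is dropped at ingest.
      - alert: HighErrorRate                # hypothetical alert
        expr: sum(rate(http_requests_total{code=~"5.."}[5m])) > 10
        for: 5m
        labels:
          severity: page
      # Optional guard: fires when the metric the alert above depends on has no series.
      - alert: ErrorRateMetricMissing       # hypothetical alert
        expr: absent(http_requests_total)
        for: 10m
        labels:
          severity: warning
```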

The prompt also corrects the most common strategic mistake, which is reaching for action: drop on whole metrics when the real cardinality blowup is one unbounded label. Distinguishing drop-the-metric from labeldrop, and biasing toward keep-lists for chatty exporters, is the difference between trimming a few series and actually capping growth. By demanding before/after measurement queries, it also converts cardinality work from superstition (“this should help”) into something you can quantify, which is what you need to justify the change in review.
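The three rule styles discussed above might look like the following sketch. The job names, targets, the `user_id` label, and the metric names in the keep-list are all assumptions chosen for illustration, and the PromQL in the comments is one way to take the before/after count, not the only one.

```yaml
scrape_configs:
  # Job A: targeted drops on an otherwise useful exporter.
  - job_name: example-app                    # hypothetical job name
    static_configs:
      - targets: ["example-app:8080"]        # hypothetical target
    metric_relabel_configs:
      # Drop whole metrics, assuming nothing queries go_gc_*.
      # Measure before/after: count({__name__=~"go_gc_.*", job="example-app"})
      - source_labels: [__name__]
        regex: 'go_gc_.*'
        action: drop
      # Keep the metrics but strip one unbounded label (user_id is assumed).
      # Series that differed only by user_id would collide after this rule,
      # so confirm the remaining labels still identify each series.
      # Distinct values before: count(count by (user_id) ({job="example-app", user_id!=""}))
      - regex: 'user_id'
        action: labeldrop

  # Job B: keep-list for a chatty exporter; every metric not matched is dropped.
  - job_name: example-noisy-exporter         # hypothetical job name
    static_configs:
      - targets: ["noisy-exporter:9100"]     # hypothetical target
    metric_relabel_configs:
      # Measure before/after: count({job="example-noisy-exporter"})
      - source_labels: [__name__]
        regex: 'node_cpu_seconds_total|node_memory_MemAvailable_bytes|node_filesystem_avail_bytes'
        action: keep
```

The keep-list lives in its own job because `keep` drops everything that does not match. Mixing it with the targeted rules in Job A would remove far more than intended.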

Finally, ordering and placement are where these configs go wrong in practice — rules short-circuit, and in a kube-prometheus-stack world the right place might be a ServiceMonitor’s metricRelabelings rather than the Prometheus config. Forcing the model to address ordering and a staged rollout keeps this in the AI-drafts, human-verifies lane: you get a concrete YAML block, but also the measurement and rollout discipline to prove it is safe before it touches production ingestion.
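As a sketch of the ServiceMonitor placement, the same kind of rules can be attached to an endpoint through `metricRelabelings`, which uses camelCase field names (`sourceLabels`) where the Prometheus config uses snake_case (`source_labels`). The object name, namespace, port name, selector labels, and the `release` label are assumptions here. The label your Prometheus instance actually selects ServiceMonitors on depends on how the chart was installed.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                      # hypothetical name
  namespace: monitoring                  # hypothetical namespace
  labels:
    release: kube-prometheus-stack       # assumption: matches your serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-app                   # hypothetical Service label
  endpoints:
    - port: metrics                      # hypothetical port name on the Service
      metricRelabelings:
        # Rules run top to bottom. Series dropped here never reach the rule below.
        - sourceLabels: [__name__]
          regex: 'go_gc_.*'
          action: drop
        # Applies only to series that survived the drop above.
        - regex: 'user_id'
          action: labeldrop
```

Scoping the rules to one ServiceMonitor limits the blast radius to that workload, which makes a staged rollout easier to reason about than a change to a shared scrape config.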
