metric_relabel_configs as a Cardinality Firewall

Most Prometheus relabeling advice focuses on relabel_configs — the rules that run before a scrape to decide which targets to scrape and what their target labels are. Less discussed, and more dangerous, is metric_relabel_configs, which runs after the scrape but before ingestion. This is your last line of defense against cardinality: a firewall that inspects every sample coming off a target and decides whether it lives or dies. Used well, it caps the growth of a chatty exporter without touching anything you query. Used carelessly, it silently deletes the series an alert depends on, and you find out during the next incident when the page never comes.

Where it sits and what it can do

metric_relabel_configs operates on the time series after they’ve been scraped, so it has access to __name__ and every label. The actions you’ll reach for:

drop / keep — remove or retain entire series matching a regex.
labeldrop / labelkeep — remove labels (keeping the metric).
replace — rewrite a label value.

The strategic insight is that most cardinality blowups come from one unbounded label, not a useless metric. So labeldrop usually wins bigger than drop:

metric_relabel_configs:
  # Drop an entire metric you never query
  - source_labels: [__name__]
    regex: 'go_gc_duration_seconds.*'
    action: drop

  # Keep the metric, kill the unbounded label causing the blowup
  - regex: 'pod_template_hash|controller_revision_hash'
    action: labeldrop

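The labeldrop example keeps every metric intact but removes labels that explode cardinality without adding query value, a far more surgical cut than dropping whole metrics. It is only safe when the remaining labels still identify each series uniquely: if two series in the same scrape differ only in a dropped label, they collapse into duplicates, and Prometheus does not merge them for you.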
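One limitation worth sketching: labeldrop takes no source_labels, so it strips the matching label from every series the scrape returns. If a label should go only on some metrics, a replace rule that sets the label to an empty string has the same effect for the series it matches, because Prometheus treats an empty label value as an absent label. The kube_ name pattern and the endpoint label below are illustrative assumptions, not taken from a real config:

```yaml
metric_relabel_configs:
  # Remove the `endpoint` label, but only from series whose name starts with kube_.
  # Relabel regexes are fully anchored, so 'kube_.*' matches the whole metric name.
  # Setting a label to the empty string deletes it from the series.
  - source_labels: [__name__]
    regex: 'kube_.*'
    target_label: endpoint
    replacement: ''
    action: replace
```

The same uniqueness caveat applies as with labeldrop: the affected series must still be distinguishable once the label is gone.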

Keep-lists beat drop-lists for noisy exporters

Some exporters emit hundreds of metrics where you use thirty. Maintaining a drop-list for the other hundreds is endless and fragile. Flip it: keep only what you use.

metric_relabel_configs:
  - source_labels: [__name__]
    regex: '(node_cpu_seconds_total|node_memory_MemAvailable_bytes|node_filesystem_avail_bytes|node_load1)'
    action: keep

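A keep-list is self-documenting and caps metric-name growth automatically: new metrics the exporter adds in a future version don’t sneak in. It does not bound label values on the metrics you keep, so an unbounded label on a kept metric still needs its own rule. The trade-off is that you must enumerate everything you depend on, which is exactly why the next section matters.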

The non-negotiable step: cross-check against alerts

A drop rule produces no error and no warning. It just removes data. If it catches a metric an alert depends on, the alert quietly stops firing. This is the single most important discipline: before applying any drop or keep, reconcile it against every alert rule, recording rule, and dashboard query. Anything referenced must survive; anything you’re unsure about gets flagged, not dropped.

# Before dropping a metric, measure its series count...
count({__name__="suspect_metric"})
# ...and grep your alert/recording rules for any reference to it.
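Because a bad drop fails silently, one possible safety net is an alert on the absence of a metric that other alerts depend on. The sketch below is a Prometheus rule file; the group name, the alert name, the 15m window, and the choice of kube_pod_status_phase as the guarded metric are all assumptions for illustration:

```yaml
groups:
  - name: relabel-safety-net          # hypothetical group name
    rules:
      - alert: AlertInputMissing      # hypothetical alert name
        # absent() returns a single series with value 1 only when
        # no series with this metric name exists at all.
        expr: absent(kube_pod_status_phase)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "kube_pod_status_phase has no series; check recent metric_relabel_configs changes"
```

This does not replace the cross-check, since it fires only after data is already missing, but it turns a silent failure into a visible one.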

Building the firewall with AI

Enumerating what’s safe to drop across a pile of alert rules is tedious and error-prone by hand — a good fit for an assistant, as long as you force it to cross-check rather than guess:

Prompt: Here are my top 20 metrics by series count, the high-churn labels, and all my alert and recording rules. Propose a metric_relabel_configs block that cuts cardinality. For every drop, tell me which alert or rule references it, and refuse to drop anything that’s referenced.

Example response: Biggest win: labeldrop on pod_template_hash and endpoint across kube_* metrics. None of your alerts group by them, and they’re the main cardinality driver. Do NOT drop kube_pod_status_phase; your PodCrashLooping alert depends on it. For node_exporter, switch to a keep-list of the 12 metrics your dashboards reference. Measure with count({__name__="..."}) before and after on staging.

The model drafts the rules and the dependency map; you verify each claim against your repo and apply on staging first. That verification is the human half of the workflow, and it’s the difference between a clean cardinality cut and a silently disabled alert. The same review discipline runs through the Prometheus and monitoring prompts.

Order, placement, and rollout

Two operational details people miss. First, rules run top to bottom and a drop short-circuits later rules for that series — so order matters. Second, in a kube-prometheus-stack world, the right place is often a ServiceMonitor’s metricRelabelings, not the central Prometheus config, so the rule travels with the service it governs.
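Both points can be sketched in a single ServiceMonitor. The resource name, namespace-free metadata, selector labels, and port name are hypothetical, and the release label assumes a Prometheus instance that selects ServiceMonitors by that label. Note that the operator's fields are camelCase (sourceLabels, targetLabel) rather than the snake_case of the central config:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                      # hypothetical
  labels:
    release: kube-prometheus-stack       # assumption: matches your serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-app                   # hypothetical Service label
  endpoints:
    - port: metrics                      # hypothetical named port on the Service
      metricRelabelings:
        # Rules run in order. Series dropped here are gone,
        # so no later rule ever sees them.
        - sourceLabels: [__name__]
          regex: 'go_gc_duration_seconds.*'
          action: drop
        # Applies only to series that survived the drop above.
        - regex: 'pod_template_hash'
          action: labeldrop
```

Keeping the rules here means that deleting or moving the service takes its relabeling with it, instead of leaving orphaned rules in a shared config.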

Roll out like any ingest-time change that can’t be undone:

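Apply to staging and watch scrape_samples_post_metric_relabeling fall for the affected job on the next scrape. Active series (prometheus_tsdb_head_series) follows more slowly, because dropped series stay in the head block until it is next truncated.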
Confirm every alert still has its inputs (count of each metric an alert references stays > 0).
Promote to production and re-measure the series-count win.

Because metric_relabel_configs runs at ingest, dropped samples are never written to storage. Reverting a rule restores the series only from the next scrape onward; the gap in between cannot be backfilled.

The bottom line

metric_relabel_configs is the most powerful and most dangerous cardinality tool in Prometheus: it can halve your active series or silently disable an alert with equal ease. Prefer labeldrop over whole-metric drop, use keep-lists for chatty exporters, and — above everything — cross-check every drop against your alert and recording rules before it ships. For a structured audit that reconciles drops against your actual dependencies, the metric_relabel drop-list prompt and the broader relabeling rules prompt keep the firewall tight without taking out the alerts you rely on.