AI for Prometheus & Monitoring · By James Joyner IV · 10 min read

Investigating a Prometheus Cardinality Spike With AI as Your Co-Investigator

A cardinality explosion can OOM Prometheus overnight. How I use AI to find the offending label, trace its source, and design a relabel fix without guessing.

  • #prometheus
  • #cardinality
  • #promql
  • #ai
  • #troubleshooting

The page came in at 2am: Prometheus OOM-killed, restarted, OOM-killed again. Memory had doubled in six hours for no obvious reason. If you’ve run Prometheus at any scale you know the cause before you finish reading — a cardinality explosion, almost certainly a new label with unbounded values, probably a user ID or a full URL path that someone shoved into a metric. Finding which metric and which label, then tracing it back to the code or config that introduced it, is exactly the kind of methodical investigation where AI is a genuinely useful co-investigator. It holds the hypotheses, suggests the next query, and reasons about the relabel fix — while I keep it honest against the actual data.

Why cardinality is the silent killer

Every unique combination of label values is a separate time series, and Prometheus holds an index of all of them in memory. Add a label like user_id with a million values and you’ve created at least a million series from one metric, multiplied again by every other label combination it carries. The blast radius is memory, ingestion latency, and query slowness, and it escalates fast. The fix is never “add more RAM”; it’s finding the unbounded label and bounding or dropping it. AI accelerates the find, but it’s a fast junior engineer reasoning from the numbers I show it, so every conclusion gets checked against live Prometheus.
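The multiplication is worth seeing in numbers. Here is a minimal sketch, assuming made-up label counts for a single metric; max_series is an illustrative helper, and the result is an upper bound because not every combination of values necessarily occurs.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of the
    distinct-value counts of its labels."""
    return prod(label_value_counts.values())


# Hypothetical label counts for one metric, before and after a bad deploy.
before = {"method": 5, "status": 8, "instance": 20}
after = {**before, "user_id": 1_000_000}

print(max_series(before))  # 800
print(max_series(after))   # 800000000
```

One new label turns a ceiling of 800 series into a ceiling of 800 million, which is why the investigation focuses on the label rather than the hardware.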

Step one: find the worst offenders with TSDB stats

Prometheus exposes its own cardinality. I start with the TSDB status page or these queries and paste the results to the model for ranking:

# Which metric names have the most series?
topk(10, count by (__name__)({__name__=~".+"}))

# Total series count trend — when did it spike?
prometheus_tsdb_head_series

The topk by __name__ points at the guilty metric in seconds. I ask the model to interpret the shape: a metric that’s normally 5,000 series sitting at 800,000 is the smoking gun. The model is good at spotting the outlier in a list, but I confirm the count directly rather than trusting its arithmetic on a pasted table.
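Confirming the outlier myself can be scripted rather than eyeballed. The sketch below assumes the JSON returned by Prometheus' /api/v1/status/tsdb endpoint has been saved or fetched already, and that a baseline of normal per-metric series counts exists somewhere; the baseline dict, the rank_metrics helper, and the sample numbers are all hypothetical.

```python
import json


def rank_metrics(tsdb_status: dict, baseline: dict[str, int], min_ratio: float = 10.0):
    """Return (metric, current, baseline, ratio) for metrics whose series
    count is at least min_ratio times its baseline, largest ratio first."""
    rows = []
    for entry in tsdb_status["data"]["seriesCountByMetricName"]:
        name, current = entry["name"], int(entry["value"])
        base = baseline.get(name)
        if not base:
            continue  # no baseline recorded, so growth cannot be judged
        ratio = current / base
        if ratio >= min_ratio:
            rows.append((name, current, base, ratio))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Sample payload trimmed to the one field used here; the numbers are made up.
sample = json.loads("""
{"status": "success", "data": {"seriesCountByMetricName": [
  {"name": "http_requests_total", "value": 800000},
  {"name": "node_cpu_seconds_total", "value": 6400}
]}}
""")
baseline = {"http_requests_total": 5000, "node_cpu_seconds_total": 6000}

for name, current, base, ratio in rank_metrics(sample, baseline):
    print(f"{name}: {current} series, {ratio:.0f}x baseline of {base}")
```

The ratio against a baseline matters more than the raw rank: a metric that is always large will top the list every day without being the cause of tonight's spike.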

Step two: find the unbounded label within the metric

Knowing the metric, I need the label that exploded. I count distinct values per label and have the model reason about which one is unbounded:

# How many distinct values does each label of the suspect metric have?
count(count by (user_id) (http_requests_total))
count(count by (path) (http_requests_total))
count(count by (instance) (http_requests_total))

A label with 3 values is fine; a label with 400,000 is the explosion. I paste the results and the model immediately flags the offender and — usefully — reasons about why it’s unbounded: path containing raw URLs with embedded IDs, user_id that should never have been a label at all, a pod name churning on a crash loop. That causal reasoning is where AI adds value beyond the raw query.

Pro Tip: Have the model distinguish “high but bounded” from “genuinely unbounded” cardinality. A pod label with 5,000 values during a deploy churn is transient and self-heals; a user_id label grows forever. The fix differs — one needs a relabel drop, the other might just need patience — and conflating them leads to dropping a label you actually needed.
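The bounded-versus-unbounded distinction can be roughed out from a few readings of the per-label distinct count taken at intervals. This is a sketch under stated assumptions: classify_label, the 5% tolerance, and the sample readings are all illustrative, and a real decision should rest on more than four data points.

```python
def classify_label(counts: list[int], tolerance: float = 0.05) -> str:
    """Classify a label from distinct-value counts ordered oldest to newest."""
    if len(counts) < 3:
        return "not enough samples"
    peak = max(counts)
    latest = counts[-1]
    # Non-decreasing across every consecutive pair of readings.
    rising = all(later >= earlier for earlier, later in zip(counts, counts[1:]))
    if rising and latest > counts[0] * (1 + tolerance):
        return "unbounded: still growing"
    if latest < peak * (1 - tolerance):
        return "transient: past its peak"
    return "bounded: flat"


# Hypothetical readings of count(count by (<label>) (http_requests_total)).
print(classify_label([1000, 50000, 200000, 400000]))  # unbounded: still growing
print(classify_label([300, 5000, 4200, 350]))         # transient: past its peak
print(classify_label([3, 3, 3, 3]))                   # bounded: flat
```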

Step three: trace it to the source

A label doesn’t appear by magic. Either application code added it to the instrumentation, or a metric_relabel_config failed to drop it, or a scrape picked up a target that exposes it. I give the model the scrape config and ask it to reason about where the label could be entering:

The metric http_requests_total gained a user_id label with 400k values. Here’s the scrape config and relabel rules for this job. Is the label coming through unfiltered, and what’s the most likely source?

AI is sharp at reading relabel chains and noticing that there’s no metric_relabel_config dropping high-cardinality labels for this job. But the real source is often in application code, which the model can’t see — so its answer is a hypothesis I confirm by checking the instrumentation, not a verdict. This is the recurring discipline: the model narrows the search; I close it against ground truth.
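The "is anything filtering this label?" question is simple enough to cross-check mechanically. The sketch below assumes the job's metric_relabel_configs list has already been parsed from YAML into Python dicts; rules_affecting_label is a hypothetical helper that only inspects rule shape and does not evaluate the chain. Python's re module stands in for Prometheus' RE2 regex here, which should be adequate for simple patterns but is not an exact match.

```python
import re


def rules_affecting_label(rules: list[dict], label: str) -> list[dict]:
    """Return the relabel rules that could remove, rewrite, or match on `label`."""
    hits = []
    for rule in rules:
        action = rule.get("action", "replace")
        regex = rule.get("regex", "(.*)")
        if action == "labeldrop":
            # labeldrop removes every label whose name matches the regex.
            if re.fullmatch(regex, label):
                hits.append(rule)
        elif action == "labelkeep":
            # labelkeep removes every label whose name does NOT match.
            if not re.fullmatch(regex, label):
                hits.append(rule)
        elif rule.get("target_label") == label or label in rule.get("source_labels", []):
            # Other actions that write to the label or match on its value.
            hits.append(rule)
    return hits


# Hypothetical rules for the job, as they would look after YAML parsing.
job_rules = [
    {"source_labels": ["__name__"], "regex": "go_gc_.*", "action": "drop"},
    {"regex": "pod_template_hash", "action": "labeldrop"},
]

if not rules_affecting_label(job_rules, "user_id"):
    print("no rule touches user_id: it passes through unfiltered")
```

An empty result only says the scrape side is not filtering the label. It says nothing about where the label was introduced, which still means reading the instrumentation.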

Step four: design the relabel fix, then verify it

The fix is usually a metric_relabel_configs rule that removes the offending label, or the samples carrying it, before ingestion. I let the model draft it and I review it carefully, because a too-broad drop can break legitimate queries:

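metric_relabel_configs:
  # Option A: remove the unbounded user_id label from http_requests_total.
  # Setting a label to the empty string removes it from the series.
  - source_labels: [__name__]
    regex: 'http_requests_total'
    target_label: user_id
    replacement: ''
    action: replace
  # Option B (use instead of A, not after it): drop every http_requests_total
  # sample that carries a non-empty user_id, if it's never queried by user.
  # Listed after A, this rule would never match, because user_id is already gone.
  - source_labels: [__name__, user_id]
    regex: 'http_requests_total;.+'
    action: drop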

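The two approaches differ meaningfully. Blanking the label removes it, but relabeling does not aggregate: samples that differed only by user_id now arrive with identical label sets in the same scrape, and Prometheus rejects the duplicates rather than summing them. That makes blanking safe only when the remaining labels still identify each series uniquely; otherwise the totals have to be aggregated in the instrumentation itself. The drop action discards matching samples entirely. I make the model explain the difference and I confirm which queries depend on that label before choosing. Then I verify the fix in a staging scrape and watch prometheus_tsdb_head_series flatten before rolling to production. The deeper mechanics are in taming Prometheus metric cardinality.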
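One check worth running before blanking a label is whether the series would collide once it is gone. Relabeling rewrites label sets but does not sum values, so any collision means several samples competing for a single series. The sketch below strips a label from made-up label sets and counts duplicates; drop_label and collisions are illustrative helpers, not part of any Prometheus tooling.

```python
from collections import Counter


def drop_label(series: list[dict], label: str) -> list[dict]:
    """Return copies of the label sets with `label` removed."""
    return [{k: v for k, v in s.items() if k != label} for s in series]


def collisions(series: list[dict]) -> dict:
    """Map each label set that appears more than once to its occurrence count."""
    counts = Counter(tuple(sorted(s.items())) for s in series)
    return {labels: n for labels, n in counts.items() if n > 1}


# Hypothetical label sets exposed by one target in a single scrape.
scraped = [
    {"method": "GET", "status": "200", "user_id": "u1"},
    {"method": "GET", "status": "200", "user_id": "u2"},
    {"method": "GET", "status": "200", "user_id": "u3"},
    {"method": "GET", "status": "500", "user_id": "u1"},
]

print(len(collisions(scraped)))  # 0
for labels, n in collisions(drop_label(scraped, "user_id")).items():
    print(dict(labels), "appears", n, "times")
```

Here the three GET/200 series collapse onto one label set, so removing user_id at scrape time would leave one arbitrary user's counter standing in for the total. When this check reports collisions, the drop rule or a fix in the application is the safer route.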

Distinguish ingestion cardinality from query cardinality

Not every memory problem is ingestion cardinality, and AI is genuinely helpful at separating the two failure modes if you ask it to — but it’ll happily conflate them if you don’t. Ingestion cardinality is the index of all active series in the head block; that’s what prometheus_tsdb_head_series measures and what a runaway label inflates. But Prometheus can also OOM from a single expensive query that materializes millions of series transiently — a count by (__name__)({__name__=~".+"}) run by forty dashboards, or a recording rule with an unbounded group_left. The symptoms overlap (high memory, OOM) but the fixes are opposite: one needs a relabel drop, the other needs a query or recording-rule fix.

I make the model help me triage which I’m facing:

# Steady high series count → ingestion cardinality (relabel fix)
prometheus_tsdb_head_series

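# Memory spikes correlated with query load → expensive queries
# (prometheus_engine_queries is the number of queries currently running)
prometheus_engine_queries
sum(rate(prometheus_engine_query_duration_seconds_sum[5m]))
process_resident_memory_bytes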

If memory tracks the series count smoothly, it’s ingestion. If memory spikes in bursts that line up with dashboard refreshes or rule evaluations, it’s query-side, and dropping a label won’t help at all. The model reasons about this correlation well once I give it both time series, but it tends to assume “OOM equals cardinality explosion” by default — which is right often enough to be a dangerous habit. I confirm the correlation against the actual metrics before committing to a fix direction, because chasing the wrong one wastes the night.
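Confirming the correlation does not need anything fancier than a Pearson coefficient over aligned samples. This is a rough sketch, assuming the three series were exported at the same timestamps; the triage helper, the 0.8 threshold, and the sample values are illustrative assumptions, and a real window would hold far more points.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def triage(memory, head_series, query_load, threshold: float = 0.8) -> str:
    """Say which signal memory tracks more closely, if either."""
    r_series = pearson(memory, head_series)
    r_queries = pearson(memory, query_load)
    if r_series >= threshold and r_series > r_queries:
        return "ingestion"
    if r_queries >= threshold and r_queries > r_series:
        return "query"
    return "unclear"


# Hypothetical samples: memory in GiB, head series in millions, running queries.
print(triage([4, 5, 6, 7, 8], [1.0, 1.2, 1.4, 1.6, 1.8], [3, 9, 2, 8, 3]))  # ingestion
print(triage([4, 9, 4, 9, 4], [1.0, 1.0, 1.0, 1.0, 1.0], [2, 20, 2, 20, 2]))  # query
```

An "unclear" result is a useful answer too: it means neither story is supported yet and more data is needed before picking a fix direction.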

Step five: add a guard alert so it can’t recur silently

Once stable, I add an alert that fires before the next explosion OOMs anything:

- alert: "PrometheusCardinalitySpike"
  expr: 'prometheus_tsdb_head_series > 2e6'
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus head series above 2M — possible cardinality explosion"
    runbook_url: "https://runbooks.internal/cardinality"

The free Alert Rule Generator scaffolds this kind of saturation alert with the runbook link already in place, so the guard goes in with proper structure. The threshold itself is a human call based on your normal series count, not a number the model should invent.

I run these investigations in Warp’s terminal AI when the evidence is promtool tsdb and curl output, and in Claude when I’m reasoning over scrape YAML. When the cardinality spike causes a real outage, the incident response dashboard structures the timeline.

Conclusion

A cardinality explosion is a 2am classic, and the investigation — rank the metrics, find the unbounded label, trace the source, design the relabel fix — is methodical detective work that AI accelerates beautifully as a co-investigator. The constant is that every hypothesis it offers gets checked against live Prometheus, every relabel drop gets reviewed against the queries that depend on it, and the guard threshold stays a human decision. Let the model hold the hypotheses and suggest the next query; you keep it honest against the data. More in taming Prometheus metric cardinality and the monitoring category.
