Debugging 'No Data' and Silently-Broken Prometheus Alerts

The scariest alert is the one that never fires. A noisy alert annoys you into fixing it; a silent one sits there radiating false confidence until the day it should have paged and didn’t. I’ve been burned by this more than once — a relabeling change dropped a label the alert depended on, the alert quietly evaluated to “no data” for six weeks, and nobody noticed until a real incident sailed straight past it. Diagnosing this class of failure is tedious detective work across scrape configs, label sets, and PromQL semantics, which makes it a natural fit for an AI assistant. Here’s how I drive that investigation, and where the model’s guesses need a human leash.

Why “no data” is so easy to miss

A Prometheus alert in the “no data” state isn’t firing and isn’t resolved — it’s evaluating an expression that returns an empty vector. By default that’s silent. The expression could be empty for a dozen reasons: the metric stopped being scraped, a label the query matches on got renamed, the target went down, a recording rule it depends on broke, or the query was always slightly wrong and only matched series by luck. AI is good at this because it can hold all those hypotheses at once and rank them. But it’s a fast junior engineer working from the symptom you describe, so it will confidently propose causes that don’t apply to your setup. The investigation is collaborative, not delegated.

Step one: confirm the alert is actually empty

Before theorizing, I establish ground truth. I take the alert expression and strip it down to the bare selector to see if any series match:

# Full alert expr — returns nothing?
rate(http_requests_total{job="checkout", code=~"5.."}[5m]) > 1

# Strip to the selector — do these series exist at all?
http_requests_total{job="checkout"}

If the stripped selector also returns nothing, the problem is upstream (scrape/target). If the selector returns data but the full expression doesn’t, the problem is in the matchers or the threshold. This bisection is the first thing I tell the model, because it dramatically narrows what it should reason about.

Step two: feed the model the right context

A useless prompt is “my alert isn’t firing, why?” A useful one gives the model the bisection result and the relevant config:

This alert evaluates to no-data. The bare selector http_requests_total{job="checkout"} returns series, but adding code=~"5.." returns nothing. Here are the labels actually present on the series: [paste from the expression browser]. What changed?

With the real label set pasted in, a good model spots immediately that, say, the series carry status_code not code, or that code is "500" as a number-string that =~"5.." matches but status="error" doesn’t. The model is only as good as the context, and the label dump is the context that matters.

Pro Tip: When debugging a silent alert, paste the actual output of the bare selector — labels and all — into the model. Ninety percent of no-data bugs are a label name or value mismatch, and the model can’t see your label schema unless you show it. Describing the metric from memory is how you get a confidently wrong diagnosis.

Step three: catch the relabeling and recording-rule chain

The nastiest silent failures come from indirection. The alert references a recording rule; the recording rule broke; the alert went quiet. Or a metric_relabel_config dropped a label the alert depended on. I ask the model to trace the dependency chain:

This alert references the recording rule job:http_errors:rate5m. Here’s that rule’s definition and here’s my scrape config’s relabel section. Could a relabel have removed a label the recording rule’s aggregation depends on?

AI is genuinely sharp at reasoning about by and without clauses interacting with relabeling — if a sum by (instance) aggregation runs after relabeling stripped instance, the result collapses unexpectedly. I still verify against the live recording rule output rather than trusting the model’s read of my YAML, because it occasionally misreads which labels survive.

Step four: add a guard so it can’t go silent again

Once fixed, I don’t want the alert to fail silently in the future. The fix is a meta-alert using absent() or absent_over_time() that fires when the expected series vanish:

- alert: "CheckoutMetricsMissing"
  expr: 'absent(http_requests_total{job="checkout"})'
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "No checkout request metrics for 10m — alert pipeline may be blind"
    runbook_url: "https://runbooks.internal/missing-metrics"

I let the model draft the absent() guard but I’m careful about its label matchers, because absent() only fires when exactly those labels are missing — too specific and it won’t catch a partial outage. This is exactly the kind of guard the free Alert Rule Generator will scaffold with a sensible for: and runbook, which saves the fiddly part.

The staleness trap: firing-but-wrong, not just no-data

Not every silent failure is empty. A nastier variant is an alert that’s evaluating stale data and reporting a stale conclusion. When a target stops being scraped, its last sample lingers for the staleness window (five minutes by default) and then goes stale — but a recording rule built on top of it can carry an old value forward in ways that fool a naive query. I’ve seen a saturation alert stay green for an hour after the exporter died, simply because the gauge’s last-known-good value was below threshold and nothing updated it to tell the truth.

I ask the model to reason about staleness specifically:

This gauge-based alert went green right as the exporter crashed and stayed green. Could it be evaluating a stale last value rather than current data? How would I prove it?

The proof is usually timestamp(my_metric) compared against time() — if the sample is older than the scrape interval, you’re looking at a ghost. The fix is often an up == 0 or absent_over_time() companion alert so a dead exporter pages on its own rather than hiding behind a stale-but-acceptable reading. AI reasons about this well once you name “staleness” explicitly; left to its own framing it tends to assume fresh data, which is exactly the assumption that lets the bug hide. I confirm the timestamp check against the live series before trusting the diagnosis.

Step five: explain the root cause in the postmortem

The last step is human, and it’s the point of the whole exercise: write down why it went silent so the pattern doesn’t recur. The model can draft the timeline from the evidence, but I own the conclusion. If I can’t explain in plain language why the alert was blind, I haven’t actually fixed it — I’ve just made the symptom go away.

I’ve run these investigations inside Warp’s terminal AI when the evidence is all in promtool and curl output, and in Claude when I’m pasting YAML. When the broken alert turns into a real incident, the incident response dashboard helps structure the timeline.

Conclusion

Silent alerts are dangerous precisely because nothing nags you to fix them. AI accelerates the detective work — bisecting the expression, reasoning about label mismatches, tracing relabel chains — but only when you feed it real label data and verify its diagnoses against live series. Fix the root cause, add an absent() guard so it can’t go quiet again, and write down why it happened. The model finds candidates fast; you decide which one is true. More troubleshooting context in the scrape config and relabeling deep dive and the monitoring guides.

Debugging 'No Data' and Silently-Broken Prometheus Alerts With AI