Detecting Dead Targets in Prometheus with absent() and Staleness Markers
How to alert when a Prometheus metric stops existing using absent(), absent_over_time(), and up==0, plus the staleness rules that silently break no-data alerts.
- #prometheus
- #promql
- #alerting
- #staleness
- #sre
The worst page I ever got was the one I never got. A payments exporter died at 2am, the metric it published simply vanished, and every alert built on top of it evaluated cleanly against zero time series, which in PromQL means it evaluated against nothing at all. A firing alert needs at least one matching series, and a metric that has ceased to exist has none. We found out four hours later from a customer ticket. That outage taught me a lesson I now repeat to every SRE I onboard: the alerts that fire are easy, and the alerts about data that stopped arriving are the ones that silently fail.
This post is about closing that gap. We’ll walk through up == 0, absent(), absent_over_time(), and the staleness behavior that makes these rules behave in surprising ways. I lean on AI to draft this kind of fiddly PromQL — but I’ll be clear about where it helps and where it will quietly hand you a rule that never fires.
Why “no data” is structurally invisible
Standard threshold alerts assume the data exists. Consider this:
rate(http_requests_total{job="payments"}[5m]) > 1000
If payments is up and busy, this works. If the exporter dies and http_requests_total{job="payments"} disappears, the expression returns an empty vector. An empty result is not “false” — it’s “nothing to compare,” and an alerting rule only fires on the series it returns. Zero series means zero alerts. Your dashboard goes flat, your alert stays green, and everyone assumes silence is health.
This is the trap. You cannot detect the absence of a metric by querying that metric, because the absence is the empty result. You need an expression that produces a series precisely when the data is gone.
up == 0: the target is down, but the metric might not be
The first line of defense is the synthetic up metric Prometheus writes for every scrape:
groups:
  - name: target-health
    rules:
      - alert: TargetDown
        expr: up{job="payments"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Target {{ $labels.instance }} ({{ $labels.job }}) is down"
          description: "Prometheus has failed to scrape this target for 5 minutes."
up is reliable because Prometheus generates it whether or not the target responds: a failed scrape yields up == 0 rather than a missing series. That makes it the right tool when an entire target stops responding, provided Prometheus still has that target in its scrape configuration.
But up has a blind spot: it covers the scrape, not the contents. If your exporter is alive and returning HTTP 200 but a specific metric stops being exposed — a recording rule whose source dried up, a feature-flagged metric, a job that stopped emitting one series — then up stays at 1 while the metric you care about is gone. For that, you need absent().
absent(): firing when a named series vanishes
absent() takes a vector and returns a 1-element vector only when its argument is empty. When the argument has data, absent() returns nothing.
- alert: PaymentsMetricMissing
  expr: absent(http_requests_total{job="payments"})
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "http_requests_total for payments has disappeared"
    description: "No samples for this series in the last evaluation — exporter or scrape config may be broken."
The critical detail people miss: absent() only knows about the label matchers you literally type. When the series exists, absent() returns empty and there are no labels to template. When it fires, Prometheus synthesizes labels from the equality matchers in your selector. So absent(http_requests_total{job="payments"}) will carry job="payments" on the alert, but absent(http_requests_total{job=~"pay.*"}) carries nothing useful because a regex matcher can’t be reversed into a label value. Always use exact = matchers inside absent() when you want a labeled, routable alert.
Pro Tip: absent() collapses everything to a single series. If five instances of a job all disappear, you get exactly one alert, not five. That’s fine for “is this thing gone,” but it means you can’t tell which instance died from the alert alone — pair it with a per-instance up == 0 rule.
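One gap in that pairing is worth covering. up == 0 can only fire while an up series exists, so if every target for the job drops out of service discovery there is nothing left to compare. A minimal sketch of a job-level guard, assuming the job label is payments as in the earlier examples (the alert name and for duration are illustrative choices, not values from a real setup):

```yaml
# Hypothetical rule: fires when Prometheus has no targets at all for the job,
# a case up == 0 cannot catch because there is no up series left to compare.
- alert: PaymentsJobHasNoTargets
  expr: absent(up{job="payments"})
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "No scrape targets found for job {{ $labels.job }}"
    description: "Every up series for this job is gone. Check service discovery and the scrape config."
```

Because the selector uses an exact job= matcher, the alert carries job="payments" and the summary template resolves.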
absent_over_time(): tolerating gaps without false pages
Plain absent() evaluates at a single instant, so a brief scrape gap or a metric that’s naturally sparse can trip it. absent_over_time() checks a whole range and only returns a result if the series had no samples across the entire window:
- alert: BatchJobMetricStale
  expr: absent_over_time(batch_last_success_timestamp_seconds{job="nightly-etl"}[1h])
  for: 0m
  labels:
    severity: warn
  annotations:
    summary: "No nightly-ETL success metric seen in the last hour"
    description: "The batch job either didn't run or failed before publishing its success marker."
This is the right tool for sparse or intermittent series — Pushgateway metrics, batch jobs, anything that doesn’t report on every scrape. By widening the lookback to a range, you stop alerting on the noise of a single missed scrape and start alerting on genuine absence. Note I set for: 0m here: the [1h] range is the dwell window, so adding a separate for would double-count the wait.
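One caveat for pushed metrics: the Pushgateway keeps exposing the last pushed value until it is deleted, so a batch job that silently stops running may never trip an absence check. A common complement is to alert on the age of the timestamp itself. The sketch below assumes batch_last_success_timestamp_seconds holds a Unix timestamp in seconds; the alert name and the 26-hour threshold are illustrative guesses for a nightly job, not recommendations:

```yaml
# Hypothetical companion rule: the series is present, but its value is too old.
- alert: BatchJobSuccessTooOld
  expr: time() - batch_last_success_timestamp_seconds{job="nightly-etl"} > 26 * 3600
  for: 0m
  labels:
    severity: warn
  annotations:
    summary: "nightly-ETL has not succeeded for over 26 hours"
    description: "Last recorded success was {{ $value | humanizeDuration }} ago."
```

The absence rule covers "the marker never showed up," and this one covers "the marker is still there but stale in the everyday sense."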
The for-duration vs staleness interplay
Here’s where AI-drafted rules quietly go wrong, and where you have to understand Prometheus internals. Two things interact: the for clause and series staleness.
When a series stops arriving, what happens next depends on whether Prometheus writes a staleness marker. In the normal case it does: if a scrape fails, or a series is missing from an otherwise successful scrape, Prometheus marks the series stale at that scrape, instant queries stop returning it, and absent() starts returning 1 at the next rule evaluation. The slow path is when no marker is written, for example when samples carry their own explicit timestamps, as federated series and some exporters do. Then the last sample stays visible for the lookback window, 5 minutes by default, and absent() keeps returning empty until that window has passed.
Now layer for on top. With for: 10m, the absence condition must hold continuously for ten minutes, and that clock only starts once the series has actually dropped out of query results. With a staleness marker, your real time-to-page is roughly a scrape interval plus an evaluation interval plus for. Without one, it is closer to the 5-minute lookback plus for, not just for. I’ve watched people set for: 1m expecting a fast page and then wonder why nothing fired for six minutes.
- alert: ExporterSeriesGone
  expr: absent(node_filesystem_avail_bytes{job="node",mountpoint="/"})
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Root filesystem availability metric is absent"
    description: "This series has returned no data for at least 5m. Check the node-exporter on the affected host."
Pro Tip: When you test an absence rule, don’t just kill the target and stare at the expression for thirty seconds. Allow for the full delay first: the for duration, plus up to about 5 minutes of lookback if the series disappears without a staleness marker. Otherwise you’ll convince yourself a working rule is broken and rip it out.
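Waiting on a live system is not the only way to check this. promtool test rules can replay synthetic series and evaluate an expression at chosen times. The sketch below has not been run against any particular Prometheus version and makes a few assumptions: the file name is made up, the second series with mountpoint="/data" is invented for contrast, and the default 5-minute lookback is in effect. In a values string, stale writes a staleness marker, while a series that simply ends leaves none.

```yaml
# absence_test.yml (hypothetical name); run with: promtool test rules absence_test.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Samples at 0m, 1m, 2m, then an explicit staleness marker at 3m.
      - series: 'node_filesystem_avail_bytes{job="node",mountpoint="/"}'
        values: '1 1 1 stale'
      # Samples at 0m, 1m, 2m, then nothing: no staleness marker is written.
      - series: 'node_filesystem_avail_bytes{job="node",mountpoint="/data"}'
        values: '1 1 1'
    promql_expr_test:
      # Marked stale at 3m: absent() reports the series as gone right away.
      - expr: absent(node_filesystem_avail_bytes{job="node",mountpoint="/"})
        eval_time: 3m
        exp_samples:
          - labels: '{job="node",mountpoint="/"}'
            value: 1
      # No marker: at 6m the 2m sample is still inside the lookback window.
      - expr: absent(node_filesystem_avail_bytes{job="node",mountpoint="/data"})
        eval_time: 6m
        exp_samples: []
      # By 8m the last sample has aged out, so absent() returns 1.
      - expr: absent(node_filesystem_avail_bytes{job="node",mountpoint="/data"})
        eval_time: 8m
        exp_samples:
          - labels: '{job="node",mountpoint="/data"}'
            value: 1
```

The same file format also accepts alert_rule_test entries if you want to assert on the for timing of a full rule, at the cost of having to match its labels and annotations exactly.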
Where AI fits — and where I make it explain itself
I treat an LLM like a fast junior engineer. It’s excellent at remembering that absent() returns synthetic labels only for = matchers, or at converting a flaky absent() into a range-based absent_over_time() — the kind of syntax I’d otherwise look up. Tools like Claude or ChatGPT will draft a plausible rule in seconds.
What they won’t do is own production. So my rule is simple: every AI-generated alert has to be explainable before it ships. I make the model annotate why the for duration is what it is, what the staleness interaction will be, and what labels the alert will carry when it fires. If it can’t justify those, it doesn’t merge. The free Alert Rule Generator is built around exactly this loop — it drafts the PromQL and the rationale together, so review is part of the workflow rather than an afterthought.
Two failure modes I always check the AI for: regex matchers inside absent() (silently unlabeled alerts), and absent() over a metric that legitimately has zero value sometimes — remember, a metric reporting 0 is present, while absent() only cares about whether the series exists. Confusing “value is zero” with “series is missing” is the single most common bug in AI-drafted dead-target rules.
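To keep the two conditions apart, it can help to give each its own rule. A minimal sketch of the "present but zero" side, reusing http_requests_total{job="payments"} from earlier; the alert name, the 15-minute for, and the severity are illustrative assumptions:

```yaml
# "Present but zero": the series exists and the request rate is flat.
# If the series is missing, sum(rate(...)) returns nothing and this rule stays
# silent, which is why a separate absent() rule is still needed.
- alert: PaymentsNoTraffic
  expr: sum(rate(http_requests_total{job="payments"}[5m])) == 0
  for: 15m
  labels:
    severity: warn
  annotations:
    summary: "payments is up but serving no requests"
    description: "http_requests_total exists but its 5m rate has been zero for 15 minutes."
```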
If you want to go deeper on the symptoms, I’ve written up debugging no-data alerts with AI and the broader discipline of designing alert rules that don’t page you falsely. More monitoring write-ups live under Prometheus monitoring.
Conclusion
Dead-target detection inverts your normal alerting instinct: instead of watching a value cross a threshold, you watch for the series itself to stop existing. up == 0 catches dead targets, absent() catches vanished named series, and absent_over_time() tolerates the gaps so you don’t drown in false pages — but all of them are governed by staleness rules that bend the timing of for. Let AI draft these tricky expressions, then make it explain the staleness math before any of it touches production. The page you’ll be most grateful for is the one that finally fires when the data goes quiet.