Prometheus Error Guide: 'up == 0' Target DOWN Triage Hub

Exact Error Message

A target appears DOWN on the Targets page (/targets) and its metrics return no data. The synthetic up metric for that instance is 0:

State: DOWN   Endpoint: http://10.0.0.5:9100/metrics   Labels: {instance="10.0.0.5:9100", job="node"}
Last Scrape: 6.142s ago
Error: Get "http://10.0.0.5:9100/metrics": dial tcp 10.0.0.5:9100: connect: connection refused

Sometimes the Error column is empty while the target still shows DOWN, or you first notice the problem because a query for up == 0 returns the instance and its dashboards flatline:

up{job="node", instance="10.0.0.5:9100"}   0

What the Error Means

up is a metric Prometheus generates itself for every scrape: 1 if the scrape succeeded (the target returned HTTP 200 and the body parsed cleanly) and 0 if it failed for any reason. A DOWN target with up == 0 is therefore a symptom, not a root cause; the specific reason is in the last scrape error string. This guide is the triage hub: read the error, then route to the matching root-cause guide.

The four families of last-scrape error, and where each leads:

Last scrape error contains	Root cause family	Guide
`connect: connection refused`, `no route to host`, `connection reset by peer`	transport / network	connection-refused
`server returned HTTP status 401/403`	auth / RBAC	401-403-unauthorized
`x509: ...`	TLS trust / cert	tls-x509-unknown-authority
`context deadline exceeded`	scrape timeout / slow target	context-deadline-exceeded
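
The routing in the table above can be sketched as a small shell function. This is a minimal sketch, not part of Prometheus: the function name classify_scrape_error and the family labels it prints are hypothetical, and the patterns are only the substrings listed in the table.

```shell
# classify_scrape_error ERROR_STRING
# Prints the root-cause family for a last-scrape error string.
# The function name and family labels are hypothetical, for illustration only.
classify_scrape_error() {
  local err="$1"
  case "$err" in
    "")
      echo "none" ;;        # no error string: check service discovery instead
    *"context deadline exceeded"*)
      echo "timeout" ;;
    *"x509:"*)
      echo "tls" ;;
    *"HTTP status 401"*|*"HTTP status 403"*)
      echo "auth" ;;
    *"connection refused"*|*"no route to host"*|*"connection reset by peer"*)
      echo "transport" ;;
    *)
      echo "unknown" ;;     # not in the table: read the string by hand
  esac
}
```

For example, classify_scrape_error 'server returned HTTP status 403 Forbidden' prints auth. An error that matches none of the patterns prints unknown, which is a cue to read the full string rather than guess a family.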

Common Causes

A DOWN target with up == 0 almost always reduces to one of:

The scrape never connected — exporter down, wrong port, firewall, loopback bind (transport family).
The scrape connected but was rejected — missing/expired credentials or RBAC (auth family).
The TLS handshake was refused — untrusted CA, SAN mismatch, expired cert (TLS family).
The scrape connected but didn’t finish in time — slow /metrics, too many series, scrape_timeout too low (timeout family).
Service discovery returned a wrong or stale target, such as a terminated instance, an old pod IP, or a target that relabeling should have dropped but kept.

How to Reproduce the Error

Point a job at any endpoint that can’t be scraped — the simplest is a dead port:

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["10.0.0.5:9100"]

With nothing listening on 10.0.0.5:9100, the next scrape sets up to 0 and the Targets page flips the instance to DOWN with a connection refused last error. Swapping in an auth-protected or HTTPS endpoint reproduces the same DOWN state with a different error string — which is exactly the signal you triage on.

Diagnostic Commands

Step 1 — list every target that is not UP (DOWN, or not yet scraped) with its job, URL, and last error. This single command is the start of every triage:

curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.health!="up")
    | [.labels.job, .scrapeUrl, .lastError] | @tsv'

Step 2 — confirm and enumerate with the up metric. Group by job to see whether it’s one instance or a whole job:

curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up == 0' \
  | jq -r '.data.result[].metric | "\(.job)\t\(.instance)"'

count by (job) (up == 0)

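Step 3 — check how far the scrape got using scrape_duration_seconds. Prometheus records a duration for failed scrapes too, so treat the value as a hint rather than proof: a few milliseconds or less suggests an immediate failure such as connection refused (lean transport), while a value close to the job's scrape_timeout suggests the target accepted the connection but answered too slowly (lean timeout). Auth and TLS rejections also tend to return quickly, so confirm against the lastError string: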

scrape_duration_seconds{job="node"}

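Step 4 — look for slow scrapes near the timeout (the timeout family). scrape_timeout_seconds is only exposed when extra scrape metrics are enabled (for example with --enable-feature=extra-scrape-metrics); without it, compare against the scrape_timeout configured for the job: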

scrape_duration_seconds >= 0.9 * scrape_timeout_seconds
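
When scrape_timeout_seconds is not available as a metric, the same 90% check can be done by hand against the configured timeout. The helper below is a sketch under that assumption; near_timeout is a hypothetical name, and both arguments are plain numbers in seconds that you supply yourself (a scrape_duration_seconds value from a query, and the job's scrape_timeout from the config).

```shell
# near_timeout DURATION TIMEOUT [FRACTION]
# Exits 0 when DURATION >= FRACTION * TIMEOUT, otherwise exits 1.
# FRACTION defaults to 0.9, matching the PromQL expression above.
# near_timeout is a hypothetical helper name, not a Prometheus tool.
near_timeout() {
  awk -v d="$1" -v t="$2" -v f="${3:-0.9}" \
    'BEGIN { exit !(d + 0 >= f * t) }'
}

# Worked example: with scrape_timeout 10s the threshold is 0.9 * 10 = 9s,
# so a 9.2s scrape is flagged and a 3.1s scrape is not.
if near_timeout 9.2 10; then
  echo "9.2s is within 10% of a 10s timeout"
fi
```

A target that regularly lands above the threshold is a candidate for the timeout family even while it still shows UP.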

Step 5 — use the Service Discovery page (/service-discovery) to see pre-relabel labels and dropped targets. If a target you expect is in the Dropped targets list, a relabel_configs keep/drop rule removed it — that’s a config problem, not a scrape failure:

# Inspect the relabel rules that decide which targets are kept
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A10 'relabel_configs'
promtool check config /etc/prometheus/prometheus.yml

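Step 6 — read the Prometheus log and alert state for the failing job. Prometheus logs scrape failures at debug level, so the grep may return nothing unless it runs with --log.level=debug; the second command assumes you have an alert rule named TargetDown: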

journalctl -u prometheus --no-pager | grep -iE 'scrape failed|x509|401|403|deadline|refused' | tail -20
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq '.data.result'

Step-by-Step Resolution

The resolution is to read the last scrape error and route:

Run Step 1 above and read the lastError string for the DOWN target.
Match the error to a family using the table in What the Error Means.
Follow the matching guide for the per-cause fix:
- connection refused / no route to host / reset by peer → connection-refused guide. Check the exporter is running and listening (ss -ltnp), the port is right, and firewalls allow the scrape.
- 401 / 403 → 401/403 guide. Add/fix basic_auth or authorization, or grant Kubernetes RBAC.
- x509: ... → TLS x509 guide. Set tls_config.ca_file / server_name, or renew the cert.
- context deadline exceeded → timeout guide. Raise scrape_timeout, reduce series, or speed up the target.
No error string but still DOWN, or the target is missing entirely → check the Service Discovery page for dropped targets and stale SD; fix the relabel_configs keep/drop rule or the SD source.
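Validate and reload after any change, then refresh /targets. The /-/reload endpoint only works when Prometheus was started with --web.enable-lifecycle; otherwise send the process a SIGHUP: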

promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload

After the next successful scrape, the instance should return to UP and the up == 0 query should no longer return it.

Prevention and Best Practices

Run a single, well-tuned up == 0 for: 5m alert across all jobs so every DOWN target pages with its job and instance labels, regardless of root cause.
Include the lastError in your runbook output. It is not a label on up, so have the runbook fetch it from /api/v1/targets; responders can then go straight to the right cause guide based on the error string.
Watch scrape_duration_seconds trending toward scrape_timeout_seconds to catch timeout-family failures before they flip to DOWN.
Periodically review the Service Discovery page’s Dropped targets — silently dropped endpoints are a frequent “why is there no data?” surprise.
Prefer dynamic service discovery so terminated instances disappear instead of lingering as permanent DOWN targets.
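Alert on up == 1 and scrape_samples_scraped == 0 to catch targets that scrape UP but return nothing. A failed scrape also reports zero samples, so the up == 1 condition keeps this alert separate from the DOWN alert.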
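
The two alerts described in this list could look like the rule file below. Treat it as a starting sketch: the file name, group name, alert names, severity labels, and the 10m window on the second rule are all assumptions to adapt, and the up == 1 guard on the second rule is an addition so that it does not fire alongside the DOWN alert.

```shell
# Write a minimal rule file. File name, group name, alert names and
# severities are illustrative assumptions, not fixed conventions.
cat > target_down.rules.yml <<'EOF'
groups:
  - name: target-health
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is DOWN"
      - alert: TargetScrapesNoSamples
        expr: up == 1 and scrape_samples_scraped == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is UP but returns no samples"
EOF

# Validate the file when promtool is installed.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules target_down.rules.yml
fi
```

Reference the file from rule_files in prometheus.yml and reload for the rules to take effect.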

Frequently Asked Questions

What does up == 0 actually mean? up is a synthetic metric Prometheus writes for every scrape: 1 for a successful scrape, 0 for a failed one. up == 0 tells you a target is failing but not why — the reason is in the last scrape error on the Targets page. Treat up == 0 as the entry point to triage, not the diagnosis.

How do I find why a target is DOWN? Query /api/v1/targets and read the lastError field (or hover the Error column on the Targets page). Match the error string — connection refused, 401/403, x509, or context deadline exceeded — to the corresponding root-cause guide.

A target is missing from the Targets page entirely instead of showing DOWN. What happened? It was likely dropped by a relabel_configs keep/drop rule, or service discovery isn’t returning it. Open the Service Discovery page and look in Dropped targets; dropped entries show their discovered (pre-relabel) labels, which you can compare against your keep/drop rules to find the one that removed them.

Should I alert per target or have one DOWN alert? One generic up == 0 for: 5m alert keyed by job and instance covers every cause and is far easier to maintain than per-job alerts. Enrich it with the last scrape error so responders can route immediately.

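up == 0 but scrape_duration_seconds has a value. What does that tell me? Only that Prometheus attempted the scrape: it records a duration for failed scrapes too, so the value alone does not prove the connection succeeded. Use its size as a hint. A duration close to scrape_timeout points to a timeout, while a few milliseconds points to an immediate failure such as connection refused or a fast 401/403 rejection. The lastError string is the deciding evidence.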