Prometheus Error Guide: 'up == 0' Target DOWN Triage Hub
Fix any Prometheus target showing DOWN with up == 0: triage with the Targets and Service Discovery pages, read the last scrape error, and route to the right root-cause guide.
Exact Error Message
A target appears DOWN on the Targets page (/targets) and its metrics return no data. The synthetic up metric for that instance is 0:
State: DOWN
Endpoint: http://10.0.0.5:9100/metrics
Labels: {instance="10.0.0.5:9100", job="node"}
Last Scrape: 6.142s ago
Error: Get "http://10.0.0.5:9100/metrics": dial tcp 10.0.0.5:9100: connect: connection refused
Sometimes the Error column is empty while the target still shows DOWN, or a query returns up == 0 while dashboards for that instance flatline:
up{job="node", instance="10.0.0.5:9100"} 0
What the Error Means
up is a metric Prometheus generates itself for every scrape: 1 if the scrape succeeded (a 2xx response that parsed cleanly) and 0 if it failed for any reason. A DOWN target with up == 0 is therefore a symptom, not a root cause — the specific reason is in the last scrape error string. This guide is the triage hub: read the error, then route to the matching root-cause guide.
The four families of last-scrape error, and where each leads:
| Last scrape error contains | Root cause family | Guide |
|---|---|---|
| connect: connection refused, no route to host, connection reset by peer | transport / network | connection-refused |
| server returned HTTP status 401/403 | auth / RBAC | 401-403-unauthorized |
| x509: ... | TLS trust / cert | tls-x509-unknown-authority |
| context deadline exceeded | scrape timeout / slow target | context-deadline-exceeded |
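The routing in the table can be sketched as a small lookup. This is a minimal illustration, not part of Prometheus: the `classify` helper and the `FAMILIES` list are hypothetical names, and the substring patterns are an assumption about how these errors are usually worded.

```python
# Hypothetical helper: route a lastError string to one of the four families.
# Family names and guide slugs mirror the table; the substrings are an
# assumption about typical Prometheus scrape error text.
FAMILIES = [
    ("transport", "connection-refused",
     ("connection refused", "no route to host", "connection reset by peer")),
    ("auth", "401-403-unauthorized", ("status 401", "status 403")),
    ("tls", "tls-x509-unknown-authority", ("x509:",)),
    ("timeout", "context-deadline-exceeded", ("context deadline exceeded",)),
]


def classify(last_error):
    """Return (family, guide) for a last-scrape error, or ("unknown", "")."""
    text = last_error.lower()
    for family, guide, needles in FAMILIES:
        if any(needle in text for needle in needles):
            return family, guide
    # Empty or unrecognised errors fall through to manual triage.
    return "unknown", ""
```

An empty or unmatched string returns `("unknown", "")`, which corresponds to the "no error string but still DOWN" case handled through the Service Discovery page.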
Common Causes
A DOWN target with up == 0 almost always reduces to one of:
- The scrape never connected: exporter down, wrong port, firewall, loopback bind (transport family).
- The scrape connected but was rejected: missing/expired credentials or RBAC (auth family).
- The TLS handshake was refused: untrusted CA, SAN mismatch, expired cert (TLS family).
- The scrape connected but didn’t finish in time: slow /metrics, too many series, scrape_timeout too low (timeout family).
- Service discovery returned the wrong/stale target: a terminated instance, an old pod IP, or a target that should have been dropped by relabeling and shouldn’t be there at all.
How to Reproduce the Error
Point a job at any endpoint that can’t be scraped — the simplest is a dead port:
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["10.0.0.5:9100"]
With nothing listening on 10.0.0.5:9100, the next scrape sets up to 0 and the Targets page flips the instance to DOWN with a connection refused last error. Swapping in an auth-protected or HTTPS endpoint reproduces the same DOWN state with a different error string — which is exactly the signal you triage on.
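To check the transport family independently of Prometheus, a plain TCP connect from the Prometheus host is often enough. The sketch below is an illustration only: `probe` is a hypothetical helper, and the three return values are labels chosen for this example.

```python
import socket


def probe(host, port, timeout=2.0):
    """Try a plain TCP connect and report "open", "refused", or "unreachable"."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening on that port (the dead-port case above).
        return "refused"
    except OSError:
        # Covers timeouts, no route to host, DNS failures, and similar.
        return "unreachable"
```

Calling `probe("10.0.0.5", 9100)` from the Prometheus host mirrors what the scraper attempts before any HTTP request is sent; "open" means the failure is further up the stack (auth, TLS, or timeout).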
Diagnostic Commands
Step 1 — list every DOWN target with its job, URL, and last error. This single command is the start of every triage:
curl -s http://localhost:9090/api/v1/targets \
| jq -r '.data.activeTargets[] | select(.health!="up")
| [.labels.job, .scrapeUrl, .lastError] | @tsv'
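The same filter can be applied in a script once the JSON is loaded. This is a rough equivalent of the jq expression above, assuming the response has already been parsed into a dict; `down_targets` and `sample` are hypothetical names, and the sample payload is trimmed to the fields used here.

```python
def down_targets(payload):
    """Return (job, scrapeUrl, lastError) for every active target not "up"."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            rows.append((
                target.get("labels", {}).get("job", ""),
                target.get("scrapeUrl", ""),
                target.get("lastError", ""),
            ))
    return rows


# Illustrative payload, trimmed to the fields read above.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"instance": "10.0.0.5:9100", "job": "node"},
                "scrapeUrl": "http://10.0.0.5:9100/metrics",
                "lastError": 'Get "http://10.0.0.5:9100/metrics": dial tcp '
                             "10.0.0.5:9100: connect: connection refused",
                "health": "down",
            },
            {
                "labels": {"instance": "localhost:9090", "job": "prometheus"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "lastError": "",
                "health": "up",
            },
        ]
    },
}

for job, url, error in down_targets(sample):
    print(job, url, error, sep="\t")
```

Like the jq filter, this treats any health value other than "up" as a failure, so targets in the "unknown" state (not yet scraped) are listed too.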
Step 2 — confirm and enumerate with the up metric. Group by job to see whether it’s one instance or a whole job:
curl -s http://localhost:9090/api/v1/query \
--data-urlencode 'query=up == 0' \
| jq -r '.data.result[].metric | "\(.job)\t\(.instance)"'
count by (job) (up == 0)
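If the query result is post-processed in a script instead, the per-job grouping might look like the sketch below. `down_by_job` is a hypothetical helper, and the second `node` instance and the `api` job in the sample are made up for illustration.

```python
from collections import Counter


def down_by_job(query_response):
    """Count DOWN instances per job from an instant-query response for up == 0."""
    results = query_response.get("data", {}).get("result", [])
    return dict(Counter(r.get("metric", {}).get("job", "") for r in results))


# Illustrative instant-query response; only the first series comes from this guide.
response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "node", "instance": "10.0.0.5:9100"}},
            {"metric": {"__name__": "up", "job": "node", "instance": "10.0.0.6:9100"}},
            {"metric": {"__name__": "up", "job": "api", "instance": "10.0.1.7:8080"}},
        ],
    },
}

print(down_by_job(response))
```

A count equal to the job's total number of targets suggests a job-wide cause (credentials, TLS settings, network path), while a count of one points at a single instance.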
Step 3 — check whether the scrape even reached the target using scrape_duration_seconds. A non-zero duration means it connected (lean auth/TLS/timeout); a missing/zero duration means it never connected (lean transport):
scrape_duration_seconds{job="node"}
Note that Prometheus normally records scrape_duration_seconds for failed scrapes as well, so also read the magnitude: a few milliseconds usually means an immediate refusal, while a value close to scrape_timeout points at the timeout family.
Step 4 — look for slow scrapes near the timeout (the timeout family):
scrape_duration_seconds >= 0.9 * scrape_timeout_seconds
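The comparison above is a simple ratio test. As far as I know, scrape_timeout_seconds is only exported when Prometheus runs with the extra-scrape-metrics feature flag, so where it is missing the configured timeout can be supplied by hand. A minimal sketch of that check, with `near_timeout` as a hypothetical helper:

```python
def near_timeout(duration_s, timeout_s, ratio=0.9):
    """True when a scrape used at least `ratio` of its timeout budget."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    return duration_s >= ratio * timeout_s


# Example: a 9.5 s scrape against Prometheus's default 10 s scrape_timeout.
print(near_timeout(9.5, 10.0))
```

The 0.9 ratio matches the query above; lowering it gives earlier warning at the cost of more noise.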
Step 5 — use the Service Discovery page (/service-discovery) to see pre-relabel labels and dropped targets. If a target you expect is in the Dropped targets list, a relabel_configs keep/drop rule removed it — that’s a config problem, not a scrape failure:
# Inspect the relabel rules that decide which targets are kept
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A10 'relabel_configs'
promtool check config /etc/prometheus/prometheus.yml
Step 6 — read the Prometheus log and alert state for the failing job:
journalctl -u prometheus --no-pager | grep -iE 'scrape failed|x509|401|403|deadline|refused' | tail -20
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq '.data.result'
Step-by-Step Resolution
The resolution is to read the last scrape error and route:
- Run Step 1 above and read the lastError string for the DOWN target.
- Match the error to a family using the table in What the Error Means.
- Follow the matching guide for the per-cause fix:
  - connection refused / no route to host / reset by peer → connection-refused guide. Check the exporter is running and listening (ss -ltnp), the port is right, and firewalls allow the scrape.
  - 401 / 403 → 401/403 guide. Add/fix basic_auth or authorization, or grant Kubernetes RBAC.
  - x509: ... → TLS x509 guide. Set tls_config.ca_file / server_name, or renew the cert.
  - context deadline exceeded → timeout guide. Raise scrape_timeout, reduce series, or speed up the target.
- No error string but still DOWN, or the target is missing entirely → check the Service Discovery page for dropped targets and stale SD; fix the relabel_configs keep/drop rule or the SD source.
- Validate and reload after any change, then refresh /targets (the reload endpoint only works when Prometheus was started with --web.enable-lifecycle; otherwise send SIGHUP to the process):
promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload
The instance should return to UP and up == 0 should clear.
Prevention and Best Practices
- Run a single, well-tuned alert on up == 0 with for: 5m across all jobs so every DOWN target pages with its job and instance labels, regardless of root cause.
- Include the lastError in your runbook output: the same alert can link straight to the right cause guide based on the error string.
- Watch scrape_duration_seconds trending toward scrape_timeout_seconds to catch timeout-family failures before they flip to DOWN.
- Periodically review the Service Discovery page’s Dropped targets; silently dropped endpoints are a frequent “why is there no data?” surprise.
- Prefer dynamic service discovery so terminated instances disappear instead of lingering as permanent DOWN targets.
- Alert on scrape_samples_scraped == 0 to catch targets that scrape UP but return nothing.
Related Errors
- Prometheus Error: connect connection refused scrape DOWN
- Prometheus Error: 401 Unauthorized / 403 Forbidden on scrape
- Prometheus Error: x509 certificate signed by unknown authority
- Prometheus Error: context deadline exceeded scrape timeout
Frequently Asked Questions
What does up == 0 actually mean?
up is a synthetic metric Prometheus writes for every scrape: 1 for a successful scrape, 0 for a failed one. up == 0 tells you a target is failing but not why — the reason is in the last scrape error on the Targets page. Treat up == 0 as the entry point to triage, not the diagnosis.
How do I find why a target is DOWN?
Query /api/v1/targets and read the lastError field (or hover the Error column on the Targets page). Match the error string — connection refused, 401/403, x509, or context deadline exceeded — to the corresponding root-cause guide.
A target is missing from the Targets page entirely instead of showing DOWN — what happened?
It was likely dropped by a relabel_configs keep/drop rule, or service discovery isn’t returning it. Open the Service Discovery page and look in Dropped targets; dropped entries show their discovered (pre-relabel) labels, which you can compare against your keep/drop rules to see which one removed them.
Should I alert per target or have one DOWN alert?
One generic up == 0 for: 5m alert keyed by job and instance covers every cause and is far easier to maintain than per-job alerts. Enrich it with the last scrape error so responders can route immediately.
up == 0 but scrape_duration_seconds has a value — what does that tell me?
A duration well above a few milliseconds suggests the scrape connected (so rule out pure transport failures) and failed later, typically on auth (401/403), TLS (x509), or a timeout. A missing or near-zero duration points instead at a connection-level failure like connection refused.