Prometheus Error Guide: 'context deadline exceeded' Scrape Timeout

Overview

context deadline exceeded is the error Prometheus records when a scrape does not complete within scrape_timeout. Prometheus opens an HTTP request to a target’s metrics endpoint under a deadline equal to scrape_timeout. If the request/response cycle (DNS lookup, TCP connect, TLS handshake, sending the request, and reading the full response body) has not finished when the deadline expires, the scrape is aborted and the target is marked down.

You will see this in the Last Error column of the target’s row on the Status -> Targets page, or in the lastError field returned by /api/v1/targets:

Get "http://10.0.4.21:9100/metrics": context deadline exceeded

The scrape error variant with a wrapped cause looks like:

Get "https://api.svc:8443/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

It is a per-target, per-scrape condition: the same job can have some targets timing out and others healthy. Because the timeout is wall-clock, a target that scrapes in 200ms normally can start failing the moment its /metrics generation slows down or the payload grows.

Symptoms

A target shows DOWN on Status -> Targets with Last Error: context deadline exceeded.
The up series for the target is 0 while the process is clearly running.
scrape_duration_seconds for the target hovers right at the configured scrape_timeout.
Intermittent gaps in dashboards for one job while others stay solid.

up{job="node"} == 0

scrape_duration_seconds{job="node"} >= 9

scrape_duration_seconds{instance="10.0.4.21:9100", job="node"}  9.98

Common Root Causes

1. The target’s /metrics endpoint is genuinely slow

The exporter takes longer than scrape_timeout to produce the response. Time it directly:

curl -o /dev/null -s -w 'time_total=%{time_total}s size=%{size_download}B http=%{http_code}\n' \
  http://10.0.4.21:9100/metrics

time_total=12.430s size=4821007B http=200

A 12s response against a 10s timeout will always fail. This is the most common cause: an overloaded exporter, a cadvisor or node_exporter doing expensive collection, or a custom exporter querying a slow backend on each scrape.

2. scrape_timeout is too low for the job

scrape_timeout defaults to 10s and must be less than or equal to scrape_interval. A job with a heavy payload may need both raised.

grep -nE 'scrape_interval|scrape_timeout' /etc/prometheus/prometheus.yml

15:  scrape_interval: 15s
16:  scrape_timeout: 5s

A global scrape_timeout: 5s against an exporter that consistently needs 7s will time out every cycle even though the endpoint is healthy.
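The constraint that scrape_timeout must not exceed scrape_interval can be checked before a config change is rolled out. The following is a minimal sketch, not part of Prometheus: to_seconds and check_timeout are hypothetical helper names, and the parser only understands single-unit durations (ms, s, m, h), not compound values such as 1m30s.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not shipped with Prometheus.

# to_seconds DURATION: convert a single-unit duration (ms, s, m, h) to seconds.
# Compound values such as "1m30s" are deliberately not handled in this sketch.
to_seconds() {
  printf '%s\n' "$1" | awk '
    /^[0-9]+ms$/ { sub(/ms$/, ""); print ($0 / 1000); next }
    /^[0-9]+s$/  { sub(/s$/, "");  print ($0 + 0);    next }
    /^[0-9]+m$/  { sub(/m$/, "");  print ($0 * 60);   next }
    /^[0-9]+h$/  { sub(/h$/, "");  print ($0 * 3600); next }
    { exit 1 }'
}

# check_timeout INTERVAL TIMEOUT: verify that scrape_timeout <= scrape_interval.
# Returns 0 when valid, 1 when the timeout is too large, 2 on a parse error.
check_timeout() {
  interval=$(to_seconds "$1") || { echo "cannot parse interval: $1"; return 2; }
  timeout=$(to_seconds "$2")  || { echo "cannot parse timeout: $2"; return 2; }
  awk -v i="$interval" -v t="$timeout" 'BEGIN {
    if (t + 0 > i + 0) {
      print "invalid: timeout " t "s > interval " i "s"
      exit 1
    }
    print "ok: timeout " t "s <= interval " i "s"
  }'
}
```

For the values in the grep output above, check_timeout 15s 5s reports the pair as valid. A valid pair can still be too tight for a slow exporter, so this check complements timing the endpoint rather than replacing it.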

3. The metrics payload is too large

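A bloated /metrics (high-cardinality labels, hundreds of thousands of series) takes a long time to transfer and parse. Check the line count and byte size; the line count is a rough proxy for series count because it also includes # HELP and # TYPE comment lines: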

curl -s http://10.0.4.21:9100/metrics | wc -lc

  481922 6122113

Roughly half a million lines / 6MB per scrape is enough to blow a 10s timeout on a busy network or a CPU-starved Prometheus.
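Once the payload is known to be large, the next question is which metric names contribute most of the lines. A minimal sketch of that breakdown follows, assuming the standard Prometheus text exposition format on stdin; top_metrics is a hypothetical helper name. Histogram and summary sub-series (_bucket, _sum, _count) are counted under their own names.

```shell
#!/bin/sh
# top_metrics [N]: read Prometheus text exposition from stdin and print the
# N metric names with the most sample lines (default 10), largest first.
top_metrics() {
  awk '
    /^#/ || /^[ \t]*$/ { next }      # skip HELP/TYPE comments and blank lines
    {
      name = $0
      sub(/[{ ].*$/, "", name)       # keep only the text before "{" or a space
      count[name]++
    }
    END { for (n in count) print count[n], n }
  ' | sort -k1,1nr -k2,2 | head -n "${1:-10}"
}
```

A typical invocation would be curl -s http://10.0.4.21:9100/metrics | top_metrics 5. A handful of names accounting for most of the lines usually points at the labels or collectors worth trimming.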

4. DNS resolution latency

If targets are addressed by hostname, slow or failing DNS eats into the deadline before the request even starts.

time getent hosts api.svc.cluster.local

real    0m4.812s

Nearly 5s just to resolve the name leaves little of a 5s timeout for the actual scrape. A flaky kube-dns/CoreDNS or an unreachable resolver shows up here.

5. TLS handshake or mTLS overhead

HTTPS targets add a handshake; a slow or misconfigured TLS endpoint can stall. Break the request into phases:

curl -o /dev/null -s -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s\n' \
  https://api.svc:8443/metrics

dns=0.004s connect=0.009s tls=6.221s ttfb=9.880s

A 6s TLS phase points at handshake/cert-chain problems rather than slow metric generation.

6. Prometheus itself is CPU-throttled or overloaded

When the Prometheus process is saturated (or CPU-limited in a container), it cannot read and process responses fast enough, and scrapes pile up against their deadlines across many jobs at once.

rate(process_cpu_seconds_total{job="prometheus"}[5m])

{job="prometheus"}  0.98

Sustained CPU near the container limit, plus broad context deadline exceeded across unrelated jobs, indicates the collector, not the targets.

Diagnostic Workflow

Step 1: Confirm the exact error and which targets fail

curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.health=="down") | "\(.labels.job)\t\(.scrapeUrl)\t\(.lastError)"'

This lists every down target and its lastError, separating context deadline exceeded from connection refused or 404s.
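To see at a glance whether the timeouts are confined to one job or spread across many, the tab-separated output of the command above can be tallied per job. This is a sketch under the assumption that the input has exactly the three columns produced by that jq filter (job, scrapeUrl, lastError); timeouts_by_job is a hypothetical helper name.

```shell
#!/bin/sh
# timeouts_by_job: read "job<TAB>scrapeUrl<TAB>lastError" lines on stdin and
# print, per job, how many down targets failed with a deadline error.
timeouts_by_job() {
  awk -F '\t' '
    index($3, "context deadline exceeded") { hits[$1]++ }
    END { for (j in hits) print hits[j], j }
  ' | sort -k1,1nr -k2,2
}
```

Piping the Step 1 command into timeouts_by_job gives one line per affected job. A single job in the output suggests a target-side problem, while many unrelated jobs suggest looking at Prometheus itself.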

Step 2: Time the target endpoint by hand

curl -o /dev/null -s -w 'total=%{time_total}s size=%{size_download}B code=%{http_code}\n' <SCRAPE_URL>

Run this from the Prometheus host (or pod) so the network path matches the real scrape. If total exceeds the job’s scrape_timeout, the endpoint is the problem; if it is fast, suspect Prometheus load or DNS.
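The decision in this step is a single numeric comparison, which is awkward in plain sh because the values are fractional. A small sketch using awk for the arithmetic is shown here; scrape_verdict is a hypothetical helper, and both arguments are assumed to be plain seconds.

```shell
#!/bin/sh
# scrape_verdict TOTAL_SECONDS TIMEOUT_SECONDS
# Compare a hand-timed scrape against the job's scrape_timeout and say which
# side to investigate next.
scrape_verdict() {
  awk -v total="$1" -v timeout="$2" 'BEGIN {
    if (total + 0 >= timeout + 0) {
      print "endpoint: " total "s does not fit in the " timeout "s timeout"
    } else {
      print "elsewhere: " total "s fits in " timeout "s; check DNS or Prometheus load"
    }
  }'
}
```

With the figures from the first root cause, scrape_verdict 12.430 10 points at the endpoint.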

Step 3: Compare scrape_duration against the configured timeout

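scrape_duration_seconds{job="<JOB>"} / scrape_timeout_seconds{job="<JOB>"}

scrape_timeout_seconds is only exported when the extra scrape metrics are enabled (--enable-feature=extra-scrape-metrics); scrape_timeout on its own is a config setting, not a metric. Without that feature, look at the raw scrape_duration_seconds values and the scrape_timeout for that job in the config; durations pinned at the ceiling confirm a deadline hit.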

Step 4: Break down the request phases

curl -o /dev/null -s -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' <SCRAPE_URL>

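Attribute the delay to DNS, connect, TLS, or time-to-first-byte (slow generation). curl’s timers are cumulative from the start of the request, so subtract each value from the one after it to get the time spent in a single phase.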
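Because curl reports each timer as elapsed time since the start of the transfer, the per-phase cost is the difference between consecutive values. The sketch below does that subtraction; phase_breakdown is a hypothetical helper, and its five arguments are the dns, connect, tls, ttfb, and total values printed by the command above.

```shell
#!/bin/sh
# phase_breakdown DNS CONNECT TLS TTFB TOTAL
# Convert curl's cumulative timers into the time spent in each phase.
phase_breakdown() {
  awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
    # time_appconnect is 0 for plain HTTP; treat the TLS phase as empty then.
    if (tls + 0 == 0) tls = conn
    printf("dns=%.3f tcp=%.3f tls=%.3f wait=%.3f body=%.3f\n",
           dns, conn - dns, tls - conn, ttfb - tls, total - ttfb)
  }'
}
```

Here wait covers sending the request and waiting for the first byte (metric generation on the target), and body is the transfer of the payload. For the TLS example earlier, with a total of 9.900 assumed purely for illustration, phase_breakdown 0.004 0.009 6.221 9.880 9.900 attributes about 6.2s to the handshake and about 3.7s to waiting on the target.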

Step 5: Check Prometheus-side saturation

rate(process_cpu_seconds_total{job="prometheus"}[5m])
sum(scrape_samples_scraped) by (job)

Broad, simultaneous timeouts plus high CPU mean the fix is to scale Prometheus or cut cardinality, not to raise per-job timeouts.

Example Root Cause Analysis

A cadvisor job starts flapping every few minutes with context deadline exceeded, while node and kube-state-metrics stay green.

Timing the endpoint directly:

curl -o /dev/null -s -w 'total=%{time_total}s size=%{size_download}B\n' \
  http://10.0.6.14:8080/metrics

total=11.700s size=18994221B

The payload is 18MB and takes 11.7s to return, but the job inherits the global scrape_timeout: 10s. The node recently had hundreds of short-lived containers scheduled on it, exploding cadvisor’s series count and response size.

The fix has two parts: drop the expensive per-container metrics by name with a metric_relabel_configs drop rule, and give the heavy job its own interval and timeout:

- job_name: cadvisor
  scrape_interval: 30s
  scrape_timeout: 25s
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'container_(network_tcp_usage_total|tasks_state|memory_failures_total)'
      action: drop

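After reloading, the 25s timeout leaves room for the 11.7s response and the timeouts stop. metric_relabel_configs runs after the response has been downloaded, so the drop rule reduces the series Prometheus ingests but not the size or duration of the scrape itself; shrinking the 18MB payload requires disabling those metrics and labels on the cadvisor side.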

Prevention Best Practices

Keep scrape_timeout comfortably below scrape_interval (e.g., 10s timeout for a 15s interval) and give known-heavy jobs (cadvisor, large custom exporters) their own longer interval/timeout rather than raising the global.
Alert on scrape_duration_seconds approaching the timeout (e.g., > 0.8 * scrape_timeout) so you catch creeping slowness before targets flap.
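Control cardinality at the source by disabling or trimming expensive collectors and labels in the exporter; a smaller /metrics is the most durable fix for transfer-time timeouts. metric_relabel_configs drop rules reduce what Prometheus stores, but they run after the payload has already been transferred.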
Use IP-based service discovery or a local resolver/cache to keep DNS off the scrape critical path.
Watch Prometheus CPU and scrape_samples_scraped; scale up or shard before broad timeouts appear.
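The "approaching the timeout" rule from the list above can also be applied ad hoc to a dump of current scrape durations. This is a sketch that assumes input lines of the form "instance duration_seconds" (for example, extracted from a query on scrape_duration_seconds); flag_slow_targets is a hypothetical helper, and the 0.8 default ratio is the example threshold mentioned above.

```shell
#!/bin/sh
# flag_slow_targets TIMEOUT_SECONDS [RATIO]
# Read "instance duration_seconds" lines on stdin and print every target
# whose scrape duration exceeds RATIO * TIMEOUT (RATIO defaults to 0.8).
flag_slow_targets() {
  awk -v timeout="$1" -v ratio="${2:-0.8}" '
    ($2 + 0) > (ratio * timeout) {
      printf("%s %.2fs (%.1f%% of timeout)\n", $1, $2, 100 * $2 / timeout)
    }'
}
```

Targets printed by this check still scrape successfully but have little headroom left, which makes them the first candidates for cardinality cuts or a dedicated timeout.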

Quick Command Reference

# List all down targets with their last error
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.health=="down") | "\(.labels.job)\t\(.lastError)"'

# Time a target's endpoint
curl -o /dev/null -s -w 'total=%{time_total}s size=%{size_download}B code=%{http_code}\n' <SCRAPE_URL>

# Break the request into phases (DNS/connect/TLS/TTFB)
curl -o /dev/null -s -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s\n' <SCRAPE_URL>

# How big is the payload?
curl -s <SCRAPE_URL> | wc -lc

# Check timeout/interval in config
grep -nE 'scrape_interval|scrape_timeout' /etc/prometheus/prometheus.yml

# Prometheus-side saturation
# (run in the expression browser)
# rate(process_cpu_seconds_total{job="prometheus"}[5m])

# Targets timing out
# (run in the expression browser; replace 9 with a value just under the job's scrape_timeout)
# scrape_duration_seconds >= 9
# up == 0

Conclusion

context deadline exceeded means a scrape ran longer than scrape_timeout. Walk it down in order:

Time the endpoint with curl; if it exceeds the timeout, the target is slow.
Confirm scrape_timeout is realistic for the job’s normal duration.
Check the payload size; high cardinality is the usual culprit for transfer timeouts.
Break the request into DNS/connect/TLS/TTFB phases to localize the delay.
If many unrelated jobs time out together, suspect Prometheus CPU/load, not the targets.

Fix the slow side first by trimming cardinality or speeding up the exporter, and only raise scrape_timeout for jobs that are legitimately heavy, giving them their own interval as well.
