Prometheus Error Guide: 'context deadline exceeded' Scrape Timeout
Fix the Prometheus 'context deadline exceeded' scrape error: diagnose slow targets, low scrape_timeout, large /metrics payloads, DNS latency, and TLS handshake delays.
Overview
context deadline exceeded is the error Prometheus records when a scrape does not complete within scrape_timeout. Prometheus opens an HTTP request to a target’s metrics endpoint, starts a timer equal to scrape_timeout, and if the full request/response cycle (DNS, TCP, TLS, response body, and parsing) is not finished when that timer fires, the scrape is aborted and the target is marked down.
You will see this as the target’s last error on the Status -> Targets page (also exposed as lastError by the /api/v1/targets API). The up series itself only drops to 0 and does not carry the message:
Get "http://10.0.4.21:9100/metrics": context deadline exceeded
The scrape error variant with a wrapped cause looks like:
Get "https://api.svc:8443/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
It is a per-target, per-scrape condition: the same job can have some targets timing out and others healthy. Because the timeout is wall-clock, a target that scrapes in 200ms normally can start failing the moment its /metrics generation slows down or the payload grows.
Symptoms
- A target shows DOWN on Status -> Targets with Last Error: context deadline exceeded.
- The up series for the target is 0 while the process is clearly running.
- scrape_duration_seconds for the target hovers right at the configured scrape_timeout.
- Intermittent gaps in dashboards for one job while others stay solid.
up{job="node"} == 0
scrape_duration_seconds{job="node"} >= 9
scrape_duration_seconds{instance="10.0.4.21:9100", job="node"} 9.98
Common Root Causes
1. The target’s /metrics endpoint is genuinely slow
The exporter takes longer than scrape_timeout to produce the response. Time it directly:
curl -o /dev/null -s -w 'time_total=%{time_total}s size=%{size_download}B http=%{http_code}\n' \
http://10.0.4.21:9100/metrics
time_total=12.430s size=4821007B http=200
A 12s response against a 10s timeout will always fail. This is the most common cause: an overloaded exporter, a cadvisor or node_exporter doing expensive collection, or a custom exporter querying a slow backend on each scrape.
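The manual comparison above can be wrapped into a pass/fail check. This is a minimal sketch, not anything Prometheus ships: the function name scrape_verdict and the 80% warning threshold are assumptions chosen for illustration.

```shell
# scrape_verdict DURATION TIMEOUT
# Compares a measured response time (seconds, may be fractional) with the
# job's scrape_timeout (seconds) and prints one of: timeout, near-limit, ok.
# The 80% "near-limit" threshold is an arbitrary choice for this sketch.
scrape_verdict() {
    awk -v d="$1" -v t="$2" 'BEGIN {
        if (d + 0 >= t + 0)        print "timeout"
        else if (d + 0 >= 0.8 * t) print "near-limit"
        else                       print "ok"
    }'
}

# Example (needs network access, so it is left commented out):
# scrape_verdict "$(curl -o /dev/null -s -w '%{time_total}' http://10.0.4.21:9100/metrics)" 10
```

Feeding it the 12.430s measurement with a 10s timeout yields timeout. A single curl run is only one sample, so repeat it a few times before drawing conclusions.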
2. scrape_timeout is too low for the job
scrape_timeout defaults to 10s and must be less than or equal to scrape_interval. A job with a heavy payload may need both raised.
grep -nE 'scrape_interval|scrape_timeout' /etc/prometheus/prometheus.yml
15: scrape_interval: 15s
16: scrape_timeout: 5s
A global scrape_timeout: 5s against an exporter that consistently needs 7s will time out every cycle even though the endpoint is healthy.
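Prometheus rejects a configuration whose scrape_timeout exceeds scrape_interval, so it can be useful to sanity-check a pair of values before editing the file. The sketch below assumes durations written as simple unit sequences such as 500ms, 15s, or 1m30s; the helper names to_seconds and check_timeout are hypothetical.

```shell
# to_seconds DURATION
# Converts a Prometheus-style duration (500ms, 15s, 1m, 1m30s) to seconds.
# Exits non-zero without output if the string is empty or not understood.
to_seconds() {
    awk -v dur="$1" 'BEGIN {
        if (dur == "") exit 1
        mult["ms"] = 0.001; mult["s"] = 1; mult["m"] = 60
        mult["h"] = 3600;   mult["d"] = 86400
        total = 0
        # Consume one <number><unit> pair at a time from the front.
        while (match(dur, /^[0-9]+(ms|s|m|h|d)/)) {
            part = substr(dur, 1, RLENGTH)
            num = part;  sub(/[a-z]+$/, "", num)
            unit = part; sub(/^[0-9]+/, "", unit)
            total += num * mult[unit]
            dur = substr(dur, RLENGTH + 1)
        }
        if (dur != "") exit 1   # leftover text: not a duration we understand
        print total
    }'
}

# check_timeout INTERVAL TIMEOUT
# Prints "ok" when TIMEOUT <= INTERVAL, "too-long" when it is not, and
# "unparseable" when either value cannot be converted.
check_timeout() {
    interval_s=$(to_seconds "$1") || { echo "unparseable"; return 1; }
    timeout_s=$(to_seconds "$2") || { echo "unparseable"; return 1; }
    awk -v i="$interval_s" -v t="$timeout_s" 'BEGIN {
        if (t + 0 <= i + 0) print "ok"; else print "too-long"
    }'
}
```

For the values in the grep output, check_timeout 15s 5s prints ok: the pair is valid, just too tight for a 7s exporter. The authoritative validation is still promtool check config on the real file.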
3. The metrics payload is too large
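A bloated /metrics (high-cardinality labels, thousands of series) takes a long time to transfer and parse. Check the line count and byte size; the line count overstates the series count because it includes the # HELP and # TYPE comment lines: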
curl -s http://10.0.4.21:9100/metrics | wc -lc
481922 6122113
Roughly half a million lines / 6MB per scrape is enough to blow a 10s timeout on a busy network or a CPU-starved Prometheus.
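Total size says the payload is big but not which metrics are responsible. One way to find out is to count sample lines per metric name. The sketch below assumes the standard text exposition format on stdin; the function name top_metrics is made up for this example.

```shell
# top_metrics [N]
# Reads Prometheus text exposition format on stdin and prints the N metric
# names (default 10) with the most sample lines, as "<count> <name>".
top_metrics() {
    awk '
        /^#/ || /^[[:space:]]*$/ { next }   # skip HELP/TYPE comments and blank lines
        {
            name = $0
            sub(/[{ ].*$/, "", name)        # keep the text before the first "{" or space
            count[name]++
        }
        END { for (n in count) print count[n], n }
    ' | sort -k1,1nr -k2,2 | head -n "${1:-10}"
}

# Example (needs network access, so it is left commented out):
# curl -s http://10.0.4.21:9100/metrics | top_metrics 5
```

Histogram and summary families show up as separate _bucket, _sum, and _count names, which is usually what you want when hunting for the largest contributor.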
4. DNS resolution latency
If targets are addressed by hostname, slow or failing DNS eats into the deadline before the request even starts.
time getent hosts api.svc.cluster.local
real 0m4.812s
Nearly 5s just to resolve the name leaves little of a 5s timeout for the actual scrape. A flaky kube-dns/CoreDNS or an unreachable resolver shows up here.
5. TLS handshake or mTLS overhead
HTTPS targets add a handshake; a slow or misconfigured TLS endpoint can stall. Break the request into phases:
curl -o /dev/null -s -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s\n' \
https://api.svc:8443/metrics
dns=0.004s connect=0.009s tls=6.221s ttfb=9.880s
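These timers are cumulative from the start of the request: the TCP connect finished at 9ms and the TLS handshake at 6.221s, so roughly 6.2s went to TLS alone. That points at handshake or certificate-chain problems rather than slow metric generation.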
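Because curl reports each timer as time elapsed since the request began, the time spent inside a phase is the difference between neighbouring timers. A small helper can do that subtraction; phase_deltas is a hypothetical name, and the wait figure here means the time between the end of the handshake and the first response byte.

```shell
# phase_deltas NAMELOOKUP CONNECT APPCONNECT STARTTRANSFER
# Takes curl's cumulative timers (seconds) and prints the time spent in each
# phase: DNS lookup, TCP connect, TLS handshake, and waiting for the first byte.
phase_deltas() {
    awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" 'BEGIN {
        # For plain HTTP curl reports time_appconnect as 0: there is no TLS phase.
        tls_end = (tls + 0 > 0) ? tls : conn
        printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f\n",
            dns, conn - dns, tls_end - conn, ttfb - tls_end
    }'
}
```

Running phase_deltas 0.004 0.009 6.221 9.880 on the sample output prints dns=0.004 tcp=0.005 tls=6.212 wait=3.659, which makes the TLS share obvious and also shows that the server still needed about 3.7s after the handshake.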
6. Prometheus itself is CPU-throttled or overloaded
When the Prometheus process is saturated (or CPU-limited in a container), it cannot parse responses fast enough and scrapes pile up against their deadlines across many jobs at once.
rate(process_cpu_seconds_total{job="prometheus"}[5m])
{job="prometheus"} 0.98
Sustained CPU near the container limit, plus broad context deadline exceeded across unrelated jobs, indicates the collector, not the targets.
Diagnostic Workflow
Step 1: Confirm the exact error and which targets fail
curl -s http://localhost:9090/api/v1/targets \
| jq -r '.data.activeTargets[] | select(.health=="down") | "\(.labels.job)\t\(.scrapeUrl)\t\(.lastError)"'
This lists every down target and its lastError, separating context deadline exceeded from connection refused or 404s.
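When many targets are down, a per-job count of timeouts shows quickly whether the problem is confined to one job or spread across several. The sketch below post-processes the tab-separated output of the jq filter above; the function name timeouts_by_job is an assumption for this example.

```shell
# timeouts_by_job
# Reads "job<TAB>url<TAB>lastError" lines on stdin and prints "<count> <job>"
# for targets whose last error is a scrape timeout, busiest job first.
timeouts_by_job() {
    awk -F '\t' '
        index($3, "context deadline exceeded") { n[$1]++ }
        END { for (j in n) print n[j], j }
    ' | sort -k1,1nr -k2,2
}
```

Pipe the jq output straight into timeouts_by_job. One job dominating the list suggests a target-side cause; an even spread across unrelated jobs suggests the Prometheus server itself.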
Step 2: Time the target endpoint by hand
curl -o /dev/null -s -w 'total=%{time_total}s size=%{size_download}B code=%{http_code}\n' <SCRAPE_URL>
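Run this from the Prometheus host or pod so the network path and resolver match what the scraper sees. If total exceeds the job’s scrape_timeout, the endpoint is the problem; if it is fast, suspect Prometheus-side load or intermittent DNS latency.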
Step 3: Compare scrape_duration against the configured timeout
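scrape_duration_seconds{job="<JOB>"} / scrape_timeout_seconds{job="<JOB>"}
The scrape_timeout_seconds series only exists when the extra scrape metrics are enabled (--enable-feature=extra-scrape-metrics). Without it, compare the raw scrape_duration_seconds values against the scrape_timeout set for that job in the config. Durations pinned at the ceiling (a ratio close to 1) confirm a deadline hit.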
Step 4: Break down the request phases
curl -o /dev/null -s -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' <SCRAPE_URL>
Attribute the delay to DNS, connect, TLS, or time-to-first-byte (slow generation).
Step 5: Check Prometheus-side saturation
rate(process_cpu_seconds_total{job="prometheus"}[5m])
sum(scrape_samples_scraped) by (job)
Broad, simultaneous timeouts combined with high CPU mean the fix is to scale Prometheus or cut cardinality, not to raise per-job timeouts.
Example Root Cause Analysis
A cadvisor job starts flapping every few minutes with context deadline exceeded, while node and kube-state-metrics stay green.
Timing the endpoint directly:
curl -o /dev/null -s -w 'total=%{time_total}s size=%{size_download}B\n' \
http://10.0.6.14:8080/metrics
total=11.700s size=18994221B
The payload is 18MB and takes 11.7s to return, but the job inherits the global scrape_timeout: 10s. The node recently had hundreds of short-lived containers scheduled on it, exploding cadvisor’s series count and response size.
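The fix has two parts: give the heavy job its own interval and timeout, and stop ingesting the per-container metrics nobody queries with a metric_relabel_configs drop rule on the metric name. Note that metric_relabel_configs is applied after the response has been fetched and parsed, so it reduces what Prometheus stores but not what cadvisor sends; to shrink the payload itself, disable the same metrics on the cadvisor side (for example with its --disable_metrics flag).
- job_name: cadvisor
  scrape_interval: 30s
  scrape_timeout: 25s
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'container_(network_tcp_usage_total|tasks_state|memory_failures_total)'
      action: drop
After trimming the exporter’s output and reloading Prometheus, the payload drops to ~5MB, scrape_duration_seconds settles near 3s, and the timeouts stop.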
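Before committing to a drop rule, it helps to estimate how much of the scrape its regex would match. The sketch below reads a saved /metrics dump and reports the share of sample lines whose metric name fully matches the pattern, mirroring the way Prometheus anchors relabel regexes. The function name drop_rule_share is hypothetical, and awk's ERE dialect is only an approximation of Prometheus's RE2 syntax, so treat the result as an estimate.

```shell
# drop_rule_share REGEX
# Reads exposition text on stdin and prints "<matched> <total> <percent>":
# how many sample lines have a metric name that fully matches REGEX.
drop_rule_share() {
    awk -v re="^($1)\$" '
        /^#/ || /^[[:space:]]*$/ { next }   # skip comments and blank lines
        {
            name = $0
            sub(/[{ ].*$/, "", name)        # metric name only
            total++
            if (name ~ re) matched++
        }
        END {
            if (total == 0) { print "0 0 0.0"; exit }
            printf "%d %d %.1f\n", matched, total, 100 * matched / total
        }
    '
}

# Example (needs network access, so it is left commented out):
# curl -s http://10.0.6.14:8080/metrics \
#   | drop_rule_share 'container_(network_tcp_usage_total|tasks_state|memory_failures_total)'
```

The percentage approximates the reduction in ingested samples. Since relabeling happens after the body is transferred, it does not predict how much smaller the response would be unless the same metrics are also disabled in the exporter.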
Prevention Best Practices
- Keep scrape_timeout comfortably below scrape_interval (e.g., 10s timeout for a 15s interval) and give known-heavy jobs (cadvisor, large custom exporters) their own longer interval/timeout rather than raising the global.
- Alert on scrape_duration_seconds approaching the timeout (e.g., above 80% of the job’s scrape_timeout) so you catch creeping slowness before targets flap.
- Control cardinality at the source by disabling unused collectors and labels in the exporter; a smaller /metrics is the most durable fix for transfer-time timeouts. metric_relabel_configs drop rules cut ingestion and storage cost, but they run after the body has already been transferred.
- Use IP-based service discovery or a local resolver/cache to keep DNS off the scrape critical path.
- Watch Prometheus CPU and scrape_samples_scraped; scale up or shard before broad timeouts appear.
Quick Command Reference
# List all down targets with their last error
curl -s http://localhost:9090/api/v1/targets \
| jq -r '.data.activeTargets[] | select(.health=="down") | "\(.labels.job)\t\(.lastError)"'
# Time a target's endpoint
curl -o /dev/null -s -w 'total=%{time_total}s size=%{size_download}B code=%{http_code}\n' <SCRAPE_URL>
# Break the request into phases (DNS/connect/TLS/TTFB)
curl -o /dev/null -s -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s\n' <SCRAPE_URL>
# How big is the payload?
curl -s <SCRAPE_URL> | wc -lc
# Check timeout/interval in config
grep -nE 'scrape_interval|scrape_timeout' /etc/prometheus/prometheus.yml
# Prometheus-side saturation (run in the expression browser)
#   rate(process_cpu_seconds_total{job="prometheus"}[5m])
# Targets at or near the timeout (run in the expression browser;
# set the threshold just below the configured scrape_timeout)
#   scrape_duration_seconds >= 9
#   up == 0
Conclusion
context deadline exceeded means a scrape ran longer than scrape_timeout. Walk it down in order:
- Time the endpoint with curl; if it exceeds the timeout, the target is slow.
- Confirm scrape_timeout is realistic for the job’s normal duration.
- Check the payload size; high cardinality is the usual culprit for transfer timeouts.
- Break the request into DNS/connect/TLS/TTFB phases to localize the delay.
- If many unrelated jobs time out together, suspect Prometheus CPU/load, not the targets.
Fix the slow side first — trim cardinality or speed up the exporter — and only raise scrape_timeout for jobs that are legitimately heavy and given their own interval.