Grafana Error Guide: 'context deadline exceeded' on

Overview

“context deadline exceeded” is Go’s way of saying an operation ran past its allotted time and its context was cancelled. In Grafana it appears when a datasource query — Prometheus, Loki, SQL, or any HTTP backend — does not complete before Grafana’s (or the backend’s) timeout fires.

The literal errors you will see on a panel or in the log:

context deadline exceeded

Post "http://prometheus:9090/api/v1/query_range": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

level=error logger=tsdb.prometheus msg="Query error" error="context deadline exceeded"

It is closely related to a datasource-proxy 504, but you will also see it inside SQL/Loki datasource errors and alerting evaluations. The root question is always the same: what took too long, and which timeout fired first?

Symptoms

Panels fail with “context deadline exceeded”, often only on heavy queries or long ranges.
Alert rules go to Error state with the same message during evaluation.
The failure lands at a consistent boundary (30s, 60s) matching a configured timeout.
Light queries on the same datasource succeed.

Common Root Causes

1. Backend query slower than the datasource/proxy timeout

The Prometheus/Loki/SQL query genuinely takes longer than [dataproxy] timeout (default 30s) or the datasource’s own timeout setting.

2. Short per-datasource timeout

The Prometheus datasource’s “Scrape interval”/“Query timeout” or a SQL “Max connection lifetime”/timeout is set too low.

3. Alerting evaluation timeout

Unified alerting evaluates rules with its own timeout; a slow rule query trips context deadline exceeded during evaluation.

4. High-cardinality / long-range query

A rate() or histogram_quantile() over weeks of per-series data is simply expensive.

5. Network latency or an overloaded backend

A congested link or a CPU-starved Prometheus inflates round-trip time past the deadline.

Diagnostic Workflow

Step 1: Locate the error and which subsystem raised it

sudo journalctl -u grafana-server --no-pager | grep -i "deadline exceeded" | tail -20
kubectl logs deploy/grafana -n monitoring | grep -i "deadline exceeded" | tail -20
grep -i "deadline exceeded" /var/log/grafana/grafana.log | tail -20

logger=tsdb.prometheus → panel query; logger=ngalert / alerting → alert evaluation.

Step 2: Time the query against the backend directly

time curl -s -G "http://prometheus:9090/api/v1/query_range" \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))' \
  --data-urlencode "start=$(date -d '-24 hours' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60' > /dev/null

The wall-clock time tells you whether the query, or a too-short timeout, is at fault.

Step 3: Check the effective timeouts

# grafana.ini
[dataproxy]
timeout = 60

[unified_alerting]
evaluation_timeout = 30s

Also check the datasource’s own timeout in its provisioning jsonData:

# /etc/grafana/provisioning/datasources/prometheus.yaml
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      timeInterval: "30s"
      queryTimeout: "60s"
      httpMethod: POST

Step 4: Reduce cost or scale the backend

Prefer recording rules and a larger step before raising timeouts; a starved Prometheus needs CPU/memory, not a longer deadline.

Example Root Cause Analysis

An alert rule flips to Error every evaluation with:

logger=ngalert.eval msg="Failed to evaluate rule" error="context deadline exceeded"

The rule runs sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) across a very large cluster. Timing it directly shows ~35s. The unified alerting evaluation_timeout is the default 30s, so evaluation is cancelled before the query returns.

Fix: add a recording rule pod:container_cpu_usage:rate5m and point the alert at that precomputed series (sub-second), keeping evaluation_timeout at a sane 30s. The alert evaluates cleanly. Root cause: an expensive ad-hoc query in an alert rule exceeding the evaluation deadline — solved by precomputation, not by loosening the timeout.

Prevention Best Practices

Back expensive panel and alert queries with recording rules so evaluations and renders read cheap series; see more Grafana guides.
Set [dataproxy] timeout, datasource queryTimeout, and [unified_alerting] evaluation_timeout consistently and above real query times.
Keep alert-rule queries lightweight; alerts should read precomputed series, not compute heavy aggregations at evaluation time.
Right-size Prometheus/Loki so queries are not slow under normal load.
Use a sensible step/Max data points so the backend is not asked for more points than the panel can show.

Quick Command Reference

# Find the error and which subsystem raised it
sudo journalctl -u grafana-server | grep -i "deadline exceeded" | tail -20
kubectl logs deploy/grafana -n monitoring | grep -i "deadline exceeded" | tail -20

# Time the query directly against the backend
time curl -s -G "http://prometheus:9090/api/v1/query_range" \
  --data-urlencode 'query=<interpolated query>' \
  --data-urlencode "start=$(date -d '-24 hours' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60' > /dev/null

# Effective timeouts (grafana.ini)
# [dataproxy] timeout = 60
# [unified_alerting] evaluation_timeout = 30s

Conclusion

“context deadline exceeded” means a query outran its deadline — the fix depends on which deadline and why:

Read the log to see whether a panel query (tsdb.*) or an alert evaluation (ngalert) timed out.
Time the query directly against the backend to separate “slow query” from “short timeout”.
Prefer recording rules and larger steps over simply raising timeouts.
Align [dataproxy] timeout, datasource queryTimeout, and evaluation_timeout, and give the backend enough CPU.

If the query is genuinely fast and only the timeout is short, raise it; otherwise make the query cheaper — a longer deadline only delays the failure.

Grafana Error Guide: 'context deadline exceeded' on Datasource Queries — Fix Query Timeouts