Grafana Error Guide: '504 Gateway Timeout' from the Datasource Proxy — Fix Slow Queries
Fix Grafana datasource proxy 504 Gateway Timeout: diagnose slow backend queries, dataproxy timeout limits, reverse-proxy read timeouts, and heavy PromQL over long ranges.
Overview
When a panel queries a datasource, Grafana proxies the request server-side through its datasource proxy (/api/datasources/proxy/...). If the backend does not answer within the proxy’s timeout — or an upstream reverse proxy times out first — Grafana surfaces a 504.
The literal errors you will see on the panel or in the network response:
504 Gateway Timeout
{"message":"Get \"http://prometheus:9090/api/v1/query_range\": context deadline exceeded"}
level=error logger=data-proxy-log msg="Proxy request failed" error="context deadline exceeded" status=504
The key distinction: a 504 means some proxy in the chain gave up waiting for the backend's response. The request was not refused (that would be a 502) and was not rejected as unauthorized (401/403).
Symptoms
- Heavy panels (long time range, high cardinality) show “504 Gateway Timeout”.
- Lightweight panels on the same datasource load fine.
- The failure is consistent at ~30s or ~60s — a timeout boundary.
- Behind Nginx/Ingress, the 504 comes from the proxy layer, not Grafana itself.
Common Root Causes
1. Slow backend query exceeding [dataproxy] timeout
Grafana’s datasource proxy defaults to a 30-second timeout. A costly rate()/histogram_quantile() over weeks of high-cardinality data blows past it.
2. Upstream reverse proxy read timeout
Nginx (proxy_read_timeout), a Kubernetes Ingress, or a cloud load balancer times out before Grafana does, so the 504 originates upstream.
3. Backend under-resourced
Prometheus/Loki is CPU- or memory-starved and cannot service the query in time.
4. Time range / step too large
A query_range with a tiny step over a huge range forces the backend to compute an enormous number of points.
5. Network path latency
A slow or congested link between Grafana and the datasource inflates round-trip time past the deadline.
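The cost behind cause 4 can be estimated before running anything. The sketch below assumes a hypothetical helper name, points_per_series, and relies on a range query evaluating roughly range/step + 1 points for every series it returns. Multiply the result by the number of series the expression touches to get a feel for the backend work.

```shell
# points_per_series RANGE_SECONDS STEP_SECONDS
# Hypothetical helper (not part of Grafana or Prometheus): estimates how many
# evaluation points a query_range call produces per returned series.
points_per_series() {
  echo $(( $1 / $2 + 1 ))
}

# 7 days at a 60s step
points_per_series $((7 * 86400)) 60    # prints 10081
# the same range at a 5-minute step
points_per_series $((7 * 86400)) 300   # prints 2017
```

Widening the step from 60s to 300s cuts the per-series work by about a factor of five, which is often enough to bring a borderline query back under the timeout.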
Diagnostic Workflow
Step 1: Confirm where the 504 originates
Check the response Server/Via headers in devtools. Then read Grafana’s proxy log:
# systemd host
sudo journalctl -u grafana-server --no-pager | grep -iE "data-proxy|deadline|504" | tail -20
# Kubernetes deployment
kubectl logs deploy/grafana -n monitoring | grep -iE "proxy|deadline|504" | tail -20
# Plain log file
grep -iE "data-proxy|deadline exceeded" /var/log/grafana/grafana.log | tail -20
If Grafana logs context deadline exceeded, Grafana timed out. If Grafana has no error but the browser shows 504, an upstream proxy timed out.
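That decision rule can be wrapped in a small function for repeat triage. This is a rough sketch, not an official tool: classify_504 is a hypothetical name, and it only looks for the "context deadline exceeded" string in a log file you point it at, so a rotated or truncated log will produce a misleading answer.

```shell
# classify_504 LOGFILE
# Hypothetical helper: prints which layer most likely produced the 504, based
# only on whether Grafana's own log recorded a proxy deadline error.
classify_504() {
  if [ ! -r "$1" ]; then
    echo "unknown: cannot read $1"
    return 1
  fi
  if grep -qi 'context deadline exceeded' "$1"; then
    echo "grafana-dataproxy"   # Grafana itself hit its proxy timeout
  else
    echo "upstream-proxy"      # nothing logged, so suspect Nginx/Ingress/LB
  fi
}

# Usage (adjust the path for your install):
#   classify_504 /var/log/grafana/grafana.log
```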
Step 2: Time the backend query directly
time curl -s -G "http://prometheus:9090/api/v1/query_range" \
--data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))' \
--data-urlencode "start=$(date -d '-7 days' +%s)" \
--data-urlencode "end=$(date +%s)" \
--data-urlencode 'step=60' > /dev/null
If this alone takes 40s, the default 30-second proxy timeout will always cut it off. The query itself is the problem, and raising timeouts only buys headroom.
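For a number you can compare against the timeout directly, curl can report the elapsed time itself through its %{time_total} write-out variable. The sketch below assumes two hypothetical helper names, query_seconds and timeout_verdict; the Prometheus URL and query in the usage comment are placeholders.

```shell
# query_seconds BASE_URL QUERY START END STEP
# Prints the total request time in seconds using curl's %{time_total}.
query_seconds() {
  curl -s -o /dev/null -w '%{time_total}' -G "$1/api/v1/query_range" \
    --data-urlencode "query=$2" \
    --data-urlencode "start=$3" \
    --data-urlencode "end=$4" \
    --data-urlencode "step=$5"
}

# timeout_verdict ELAPSED_SECONDS TIMEOUT_SECONDS
# awk does the comparison because shell arithmetic is integer-only.
timeout_verdict() {
  awk -v t="$1" -v limit="$2" 'BEGIN {
    if (t + 0 >= limit + 0) print "query-too-slow"; else print "within-timeout"
  }'
}

# Usage (placeholder URL and query):
#   elapsed=$(query_seconds http://prometheus:9090 'up' "$(date -d '-7 days' +%s)" "$(date +%s)" 60)
#   timeout_verdict "$elapsed" 30
```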
Step 3: Raise the Grafana proxy timeout (if the query is legitimately heavy)
# grafana.ini
[dataproxy]
timeout = 120
dial_timeout = 10
keep_alive_seconds = 30
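After editing, it is worth confirming which value Grafana will actually use, because an environment variable of the form GF_<SECTION>_<KEY> (here GF_DATAPROXY_TIMEOUT) overrides the file. The function below is a minimal sketch under assumptions: effective_dataproxy_timeout is a hypothetical name, /etc/grafana/grafana.ini is the usual package path but may differ on your install, and the parser ignores inline comments.

```shell
# effective_dataproxy_timeout [INI_FILE]
# Hypothetical helper: prints the [dataproxy] timeout that would take effect.
# Order: environment override, then the ini file, then Grafana's default of 30.
effective_dataproxy_timeout() {
  ini="${1:-/etc/grafana/grafana.ini}"
  if [ -n "${GF_DATAPROXY_TIMEOUT:-}" ]; then
    echo "$GF_DATAPROXY_TIMEOUT"
    return
  fi
  if [ ! -r "$ini" ]; then
    echo 30
    return
  fi
  awk -F'=' '
    /^[ \t]*\[/ { in_section = ($0 ~ /^[ \t]*\[dataproxy\]/); next }
    in_section && $1 ~ /^[ \t]*timeout[ \t]*$/ { v = $2; gsub(/[ \t]/, "", v) }
    END { print (v == "" ? 30 : v) }
  ' "$ini"
}

# Usage:
#   effective_dataproxy_timeout /etc/grafana/grafana.ini
```

In container deployments, setting GF_DATAPROXY_TIMEOUT=120 on the Grafana container achieves the same change without touching the file.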
Step 4: Raise the upstream reverse-proxy timeout too
# Kubernetes Ingress (nginx) annotations
nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
# nginx.conf (self-managed)
proxy_read_timeout 120s;
proxy_send_timeout 120s;
Every timeout in the chain must exceed the query time, or the shortest one wins.
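One way to keep that rule visible is to list every hop's timeout and print the smallest. This is only an illustrative sketch: shortest_timeout is a hypothetical helper, and the hop names and values in the example are made up; substitute the numbers from your own Grafana, Ingress, and load balancer settings.

```shell
# shortest_timeout NAME=SECONDS [NAME=SECONDS ...]
# Hypothetical helper: prints the hop with the smallest timeout, which is the
# one that will fire first and produce the 504.
shortest_timeout() {
  printf '%s\n' "$@" | sort -t= -k2,2n | head -n 1
}

# Example values only
shortest_timeout grafana=120 ingress=60 lb=300   # prints ingress=60
```

In this example, raising only the Grafana value to 120 would change nothing, because the Ingress still gives up at 60 seconds.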
Step 5: Make the query cheaper
# Increase step and use a recording rule instead of raw histogram_quantile at query time
job:http_request_duration_seconds:p95
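A recording rule that produces that series might look like the sketch below. Only the record name and the underlying histogram come from this guide. The group name, the 1m evaluation interval, the output path, and the by (job, le) grouping are assumptions for illustration, and write_p95_rule is a hypothetical helper name.

```shell
# write_p95_rule OUTPUT_FILE
# Hypothetical helper: writes a Prometheus rule file that precomputes the p95 series.
write_p95_rule() {
  cat > "$1" <<'EOF'
groups:
  - name: http_latency            # assumed group name
    interval: 1m                  # assumed evaluation interval
    rules:
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
EOF
}

# Usage: write the file, then validate it before reloading Prometheus.
#   write_p95_rule ./latency.rules.yml
#   promtool check rules ./latency.rules.yml
```

The panel then queries the recorded series by name, so Prometheus reads one precomputed series per job instead of re-aggregating every bucket across the whole range.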
Example Root Cause Analysis
An SLO dashboard 504s only on the “Last 30 days” p95 latency panel; shorter ranges load. Grafana’s log:
logger=data-proxy-log msg="Proxy request failed" error="context deadline exceeded" status=504
Timing the interpolated query_range directly against Prometheus takes ~48 seconds — a histogram_quantile over 30 days of per-endpoint buckets at a 30s step. The default [dataproxy] timeout = 30 cuts it off.
Fix: precompute the percentile with a recording rule (job:http_request_duration_seconds:p95) so the panel reads a single cheap series, and raise [dataproxy] timeout = 120 as a safety margin plus the matching Ingress proxy-read-timeout. The panel now loads in under a second. The root cause was an expensive query, not a misconfiguration — the timeout bump alone would only mask it.
Prevention Best Practices
- Use recording rules for expensive aggregations (percentiles, high-cardinality sums) so dashboards read precomputed series.
- Align every timeout in the chain (backend → Grafana [dataproxy] timeout → reverse proxy/Ingress → LB) so none is shorter than a legitimate query.
- Choose a step/Max data points sane for the range; more points than screen pixels is wasted backend work.
- Right-size Prometheus/Loki CPU and memory; a starved backend times out under otherwise normal load.
- Set query timeouts in the datasource settings and alert on slow queries so you catch drift before users do.
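To turn the "no more points than pixels" advice into a number, divide the time range by the point budget and round up. A minimal sketch, assuming a hypothetical helper name min_step and an example budget of 1500 points, which you would replace with the panel's Max data points value:

```shell
# min_step RANGE_SECONDS MAX_POINTS
# Hypothetical helper: smallest whole-second step that splits the range into
# at most MAX_POINTS intervals (ceiling division).
min_step() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# 30 days with a 1500-point budget (example values)
min_step $((30 * 86400)) 1500   # prints 1728
```

A step finer than this result asks the backend for detail the panel has no room to draw.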
Quick Command Reference
# Where did the 504 come from?
sudo journalctl -u grafana-server | grep -iE "data-proxy|deadline|504" | tail -20
kubectl logs deploy/grafana -n monitoring | grep -iE "proxy|deadline" | tail -20
# Time the backend query directly
time curl -s -G "http://prometheus:9090/api/v1/query_range" \
--data-urlencode 'query=<interpolated query>' \
--data-urlencode "start=$(date -d '-7 days' +%s)" \
--data-urlencode "end=$(date +%s)" \
--data-urlencode 'step=60' > /dev/null
# Raise timeouts (grafana.ini + Ingress)
# [dataproxy] timeout = 120
# nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
Conclusion
A datasource-proxy 504 means something in the request chain ran out of time. Fix it methodically:
- Determine whether Grafana or an upstream proxy timed out (log vs. missing log + browser 504).
- Time the interpolated query directly against the backend — that is the ground truth.
- If the query is legitimately heavy, align all timeouts and, more importantly, make the query cheaper with recording rules and a sane step.
- Right-size the backend so it is not the bottleneck.
Bumping timeouts hides the symptom; recording rules and sane query shapes remove the cause.