Prometheus Error Guide: 'query timed out' Too Many Samples Loaded
Fix Prometheus 'query timed out' and 'query processing would load too many samples' errors: diagnose high cardinality, wide ranges, expensive PromQL, and query limits.
Overview
"query timed out" and "query processing would load too many samples into memory in query execution" are query-engine protection errors. When you run a PromQL expression, Prometheus loads the matching samples from the TSDB into memory and evaluates the expression. Two safety limits guard this: --query.timeout (default 2m) aborts queries that run too long, and --query.max-samples (default 50,000,000) aborts queries that would hold too many samples in memory at once. The HTTP API reports a timeout as a 503 and a sample-limit failure as a 422, each with the error message in the response body.
You will see this from the API or in a Grafana panel:
query timed out in expression evaluation
query processing would load too many samples into memory in query execution
The full API response looks like:
{"status":"error","errorType":"execution","error":"query processing would load too many samples into memory in query execution"}
It is a per-query condition driven by how much data the expression touches: the same query that runs fine over 1 hour can fail over 30 days, and an expression matching one series can fail when matching 100,000.
Symptoms
- Grafana panels show “query timed out” or “too many samples” instead of data, often only for long time ranges.
- The API returns errorType: execution with the "too many samples" message, or errorType: timeout with the "query timed out" message.
- prometheus_engine_queries and prometheus_engine_query_duration_seconds spike.
- Heavy queries correlate with Prometheus memory pressure or OOM.
prometheus_engine_query_duration_seconds{slice="inner_eval", quantile="0.99"}
{slice="inner_eval"} 47.2
Common Root Causes
1. High-cardinality selector matching too many series
A selector with a broad regex or a high-churn label matches tens of thousands of series, multiplying the samples loaded. Measure the match size first:
curl -s 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=count({__name__=~"http_request.*"})' | jq '.data.result[0].value[1]'
"187422"
A range query over 187k series for 30 days loads billions of samples — straight past query.max-samples.
2. The time range is too wide for the step
A range query loads every sample in the window for every matched series. Estimate the load: series x (range / scrape_interval).
# 50k series over 30 days at 15s scrape:
# 50000 * (30*86400/15) = 8.64e9 samples
echo "$((50000 * (30*86400/15)))"
8640000000
8.6 billion samples is ~172x the default 50M limit; evaluation aborts with the error as soon as the loaded sample count crosses it.
3. Subqueries and nested range vectors
A subquery like max_over_time(rate(x[5m])[1d:1m]) evaluates the inner expression at every step of the outer range, multiplying cost dramatically.
max_over_time(rate(http_requests_total[5m])[7d:30s])
This evaluates rate(...[5m]) at 30s steps across 7 days — ~20,160 inner evaluations per series, easily timing out on a busy metric.
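The step count above is simply the subquery range divided by its resolution. A minimal sketch of that arithmetic follows; `duration_to_seconds` and `subquery_steps` are hypothetical helper names, and the parser only handles single-unit durations (s, m, h, d, w), not `ms` or compound forms such as `1h30m`.

```shell
# Convert a single-unit Prometheus-style duration (e.g. 30s, 5m, 7d) to seconds.
# Hypothetical helper: "ms" and compound durations are not supported.
duration_to_seconds() {
  value=${1%?}          # numeric part, e.g. "7" from "7d"
  unit=${1#"$value"}    # trailing unit, e.g. "d"
  case "$unit" in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    d) echo $((value * 86400)) ;;
    w) echo $((value * 604800)) ;;
    *) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# Approximate inner evaluations per series for a subquery [range:resolution].
subquery_steps() {
  range_s=$(duration_to_seconds "$1") || return 1
  res_s=$(duration_to_seconds "$2") || return 1
  echo $((range_s / res_s))
}

subquery_steps 7d 30s   # prints 20160, the example above
subquery_steps 1d 1m    # prints 1440
```

Multiplying the result by the number of matched series gives a rough sense of how much work the inner expression does before the outer function sees any data.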
4. Expensive aggregations without bounding labels
sum, topk, and count over an unbounded matcher force the engine to load and group everything.
topk(10, sum without (instance) (rate(node_cpu_seconds_total[5m])))
If node_cpu_seconds_total spans thousands of nodes and modes, the inner rate and sum load the full set before topk trims it — the trim happens last.
5. query.max-samples / query.timeout set too low for the workload
The defaults may be too tight for a large environment, or someone lowered them. Check the running flags:
curl -s http://localhost:9090/api/v1/status/flags \
| jq '.data | {timeout: ."query.timeout", max_samples: ."query.max-samples", max_concurrency: ."query.max-concurrency"}'
{
"timeout": "30s",
"max_samples": "50000000",
"max_concurrency": "20"
}
A query.timeout lowered to 30s will abort legitimate long-range dashboard queries that previously completed within the 2m default.
6. Recording rules absent for hot dashboard queries
Dashboards recomputing the same heavy expression on every refresh, instead of reading a pre-aggregated recording rule, repeatedly hit the limits.
curl -s http://localhost:9090/api/v1/rules \
| jq -r '.data.groups[].rules[] | select(.type=="recording") | .name' | head
# (empty -> no recording rules; every panel recomputes from raw series)
No recording rules means every wide aggregation runs from scratch against raw data.
Diagnostic Workflow
Step 1: Capture the exact error and the failing expression
curl -s -G 'http://localhost:9090/api/v1/query_range' \
--data-urlencode 'query=<EXPR>' \
--data-urlencode 'start=...' --data-urlencode 'end=...' --data-urlencode 'step=...' | jq .error
Distinguish query timed out (slow) from too many samples (volume) — they need different fixes.
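One way to make that split mechanical is to match on the error text captured in this step. This is a sketch only: `classify_query_error` is a hypothetical helper, and it assumes the two message strings quoted in this guide.

```shell
# Hypothetical helper: map an error message to the class of fix it needs.
#   slow   -> the query ran past query.timeout
#   volume -> the query crossed query.max-samples
classify_query_error() {
  case "$1" in
    *"query timed out"*)  echo "slow" ;;
    *"too many samples"*) echo "volume" ;;
    *)                    echo "other" ;;
  esac
}

classify_query_error "query timed out in expression evaluation"
classify_query_error "query processing would load too many samples into memory in query execution"
```

The first call prints `slow` and the second prints `volume`; anything else falls through to `other` and needs a manual look.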
Step 2: Measure how many series the selector matches
curl -s 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=count(<SELECTOR>)' | jq '.data.result[0].value[1]'
A large count means narrow the matcher before anything else.
Step 3: Estimate the sample load
# series * (range_seconds / scrape_interval_seconds)
echo "$(( SERIES * RANGE_SECONDS / SCRAPE_INTERVAL ))"
Compare against query.max-samples; if it exceeds, the query cannot complete as written.
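The estimate and the comparison can be wrapped in two small functions. This is a rough sketch using this guide's rule of thumb; `estimate_samples` and `compare_to_limit` are hypothetical names, and the arithmetic assumes a shell with 64-bit integers.

```shell
# Rule-of-thumb sample load: series * (range_seconds / scrape_interval_seconds).
estimate_samples() {
  echo $(( $1 * ($2 / $3) ))
}

# Compare an estimate against a query.max-samples value.
# The ratio uses integer division, so it is rounded down.
compare_to_limit() {
  if [ "$1" -gt "$2" ]; then
    echo "over limit, roughly $(( $1 / $2 ))x"
  else
    echo "within limit"
  fi
}

# 50k series over 30 days at a 15s scrape, against the default 50M limit.
estimate=$(estimate_samples 50000 2592000 15)
echo "$estimate"                       # prints 8640000000
compare_to_limit "$estimate" 50000000  # prints "over limit, roughly 172x"
```

Treat the number as an order-of-magnitude guide rather than an exact count of what the engine holds in memory.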
Step 4: Inspect the configured limits
curl -s http://localhost:9090/api/v1/status/flags \
| jq '.data | {timeout: ."query.timeout", max_samples: ."query.max-samples"}'
Confirm whether the limit is the default or has been lowered.
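A small check against the defaults quoted in this guide (2m and 50000000) makes a lowered limit easy to spot. This is a sketch: `report_query_limits` is a hypothetical helper that takes the two flag values as plain arguments, and it compares strings, so it assumes the values are formatted the way this guide shows them.

```shell
# Defaults as stated in this guide.
DEFAULT_TIMEOUT="2m"
DEFAULT_MAX_SAMPLES="50000000"

# Hypothetical helper: report whether each value matches its default.
# Arguments: <query.timeout value> <query.max-samples value>
report_query_limits() {
  if [ "$1" = "$DEFAULT_TIMEOUT" ]; then
    echo "query.timeout: $1 (default)"
  else
    echo "query.timeout: $1 (default is $DEFAULT_TIMEOUT)"
  fi
  if [ "$2" = "$DEFAULT_MAX_SAMPLES" ]; then
    echo "query.max-samples: $2 (default)"
  else
    echo "query.max-samples: $2 (default is $DEFAULT_MAX_SAMPLES)"
  fi
}

# Values from the example flags output earlier in this guide.
report_query_limits "30s" "50000000"
```

With those example values, the first line flags the 30s timeout as non-default and the second confirms the sample limit is unchanged.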
Step 5: Check whether a recording rule should exist
curl -s http://localhost:9090/api/v1/rules | jq -r '.data.groups[].name'
If the heavy expression is reused across dashboards, pre-compute it with a recording rule.
Example Root Cause Analysis
A capacity dashboard panel showing sum by (cluster) (rate(container_cpu_usage_seconds_total[5m])) over a 30-day range returns “query processing would load too many samples” while the 6-hour view works.
Counting the inner selector:
curl -s 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=count(container_cpu_usage_seconds_total)' | jq '.data.result[0].value[1]'
"612904"
612k series over 30 days at a 30s scrape is ~52.9 billion samples — three orders of magnitude over the 50M limit. Raising the limit would just trade the error for an OOM.
The fix is a recording rule that pre-aggregates to the cluster level once per evaluation, so the dashboard reads a tiny series instead of 600k:
groups:
  - name: capacity.rules
    interval: 30s
    rules:
      - record: cluster:container_cpu_usage:rate5m
        expr: sum by (cluster) (rate(container_cpu_usage_seconds_total[5m]))
The dashboard panel switches to cluster:container_cpu_usage:rate5m, which has a handful of series; the 30-day range now loads thousands of samples instead of billions and renders instantly.
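To put rough numbers on the before and after, the same rule of thumb can be applied to the raw and recorded series. This is a sketch: `stored_samples` is a hypothetical helper, and the figure of 8 clusters is an assumption for illustration only (the scenario just says "a handful").

```shell
# Samples stored over a window: series * (range_seconds / interval_seconds).
stored_samples() {
  echo $(( $1 * ($2 / $3) ))
}

RANGE=2592000   # 30 days in seconds
INTERVAL=30     # scrape interval and rule evaluation interval from the scenario
LIMIT=50000000  # default query.max-samples

raw=$(stored_samples 612904 "$RANGE" "$INTERVAL")
recorded=$(stored_samples 8 "$RANGE" "$INTERVAL")   # 8 clusters is assumed

echo "raw series:      $raw samples"        # 52954905600
echo "recorded series: $recorded samples"   # 691200
[ "$recorded" -lt "$LIMIT" ] && echo "recorded series fits under the limit"
```

Under these assumptions the recorded series sits far below the 50M limit, while the raw selector exceeds it by roughly three orders of magnitude.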
Prevention Best Practices
- Pre-aggregate hot, wide expressions into recording rules; dashboards should read low-cardinality recorded series, not recompute sum/rate over raw data on every refresh.
- Bound selectors with specific labels (cluster, namespace, job) instead of broad __name__ regexes; the cheapest query is the one that matches fewer series.
- Avoid subqueries on dashboards where a recording rule will do; subqueries multiply evaluation cost and are a common timeout source.
- Keep --query.max-samples and --query.timeout at sensible values for your size, and treat hitting them as a signal to fix the query or add a rule, not just to raise the ceiling.
- Control cardinality at ingestion with relabel drop rules; fewer series benefits every query.
Quick Command Reference
# Capture the failing query's error
curl -s -G 'http://localhost:9090/api/v1/query_range' \
--data-urlencode 'query=<EXPR>' --data-urlencode 'start=...' \
--data-urlencode 'end=...' --data-urlencode 'step=...' | jq .error
# How many series does the selector match?
curl -s 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=count(<SELECTOR>)' | jq '.data.result[0].value[1]'
# Current query limits
curl -s http://localhost:9090/api/v1/status/flags \
| jq '.data | {timeout: ."query.timeout", max_samples: ."query.max-samples"}'
# Do recording rules already exist?
curl -s http://localhost:9090/api/v1/rules \
| jq -r '.data.groups[].rules[] | select(.type=="recording") | .name'
# Estimate sample load: series * (range / scrape_interval)
echo "$(( SERIES * RANGE_SECONDS / SCRAPE_INTERVAL ))"
# Query engine load
prometheus_engine_query_duration_seconds{slice="inner_eval", quantile="0.99"}
Conclusion
query timed out and too many samples are the query engine protecting Prometheus from a query that loads more data than it can handle. Work it down:
- Separate "timed out" (slow evaluation) from "too many samples" (volume).
- Count the series the selector matches; broad matchers are the usual root cause.
- Estimate series x (range / interval) against query.max-samples.
- Check whether query.timeout / query.max-samples were lowered.
- Replace reused heavy expressions with recording rules.
The durable fix is almost never raising the limit — it is narrowing the selector, shrinking the range, or pre-aggregating with a recording rule so the query loads orders of magnitude fewer samples.