Prometheus Error Guide: 'query timed out' Too Many Samples

Overview

“query timed out” and “query processing would load too many samples into memory in query execution” are query-engine protection errors. When you run a PromQL expression, Prometheus loads the matching samples from the TSDB into memory and evaluates the expression. Two safety limits guard this: --query.timeout (default 2m) aborts queries that run too long, and --query.max-samples (default 50,000,000) aborts queries that would hold too many samples in memory at once. Hitting either returns an HTTP error status with the message in the response body, typically 422 for the sample limit and 503 for a timeout.

You will see this from the API or in a Grafana panel:

query timed out in expression evaluation

query processing would load too many samples into memory in query execution

The full API response looks like:

{"status":"error","errorType":"execution","error":"query processing would load too many samples into memory in query execution"}

It is a per-query condition driven by how much data the expression touches: the same query that runs fine over 1 hour can fail over 30 days, and an expression matching one series can fail when matching 100,000.

Symptoms

Grafana panels show “query timed out” or “too many samples” instead of data, often only for long time ranges.
The API returns status: error with one of the two messages; errorType is execution for the sample limit and timeout for a timed-out query.
prometheus_engine_queries and prometheus_engine_query_duration_seconds spike.
Heavy queries correlate with Prometheus memory pressure or OOM.

max by (slice) (prometheus_engine_query_duration_seconds{quantile="0.99"})

{slice="inner_eval"}  47.2

Common Root Causes

1. High-cardinality selector matching too many series

A selector with a broad regex or a high-churn label matches tens of thousands of series, multiplying the samples loaded. Measure the match size first:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count({__name__=~"http_request.*"})' | jq '.data.result[0].value[1]'

"187422"

A range query over 187k series for 30 days loads billions of samples — straight past query.max-samples.

2. The time range is too wide for the step

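A range query has to read the samples in the window for every matched series, so cost grows with both the series count and the range. A rough upper bound on the load is series x (range / scrape_interval); the engine's actual accounting also depends on the step and on each range-vector window.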

# 50k series over 30 days at 15s scrape:
# 50000 * (30*86400/15) = 8.64e9 samples
echo "$((50000 * (30*86400/15)))"

8640000000

8.64 billion samples is ~173x the default 50M limit; the engine aborts the query as soon as its running sample count crosses the limit.
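The comparison against the limit can be reproduced with integer shell arithmetic. This is a minimal sketch in POSIX shell; limit_ratio is a hypothetical helper name, not part of Prometheus, and the inputs are the figures from this section.

```shell
# limit_ratio SAMPLES LIMIT
# Prints SAMPLES / LIMIT with one decimal place, using integer math only.
# Hypothetical helper for illustration; not part of Prometheus.
limit_ratio() {
  local samples limit tenths
  samples=$1
  limit=$2
  tenths=$(( samples * 10 / limit ))   # ratio scaled by 10, truncated
  echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

# 50k series over 30 days at a 15s scrape vs. the default 50M limit
limit_ratio "$(( 50000 * (30 * 86400 / 15) ))" 50000000
```

Any ratio above 1.0 means the estimate does not fit under the limit as written.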

3. Subqueries and nested range vectors

A subquery like max_over_time(rate(x[5m])[1d:1m]) evaluates the inner expression at every step of the outer range, multiplying cost dramatically.

max_over_time(rate(http_requests_total[5m])[7d:30s])

This evaluates rate(...[5m]) at 30s steps across 7 days — ~20,160 inner evaluations per series, easily timing out on a busy metric.
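The step count follows directly from range / resolution. Below is a minimal sketch in POSIX shell; subquery_steps and inner_sample_reads are hypothetical helper names, and the 15s scrape interval is an assumption rather than something stated for this metric. Both figures are approximate and ignore step alignment at the window edges.

```shell
# subquery_steps RANGE_SECONDS RESOLUTION_SECONDS
# Approximate number of inner evaluations per series for [range:resolution].
subquery_steps() {
  local range resolution
  range=$1
  resolution=$2
  echo "$(( range / resolution ))"
}

# inner_sample_reads RANGE RESOLUTION WINDOW SCRAPE_INTERVAL (all in seconds)
# Approximate raw samples read per series: each inner evaluation reads
# WINDOW / SCRAPE_INTERVAL samples for its range vector.
inner_sample_reads() {
  local steps per_step
  steps=$(subquery_steps "$1" "$2")
  per_step=$(( $3 / $4 ))
  echo "$(( steps * per_step ))"
}

subquery_steps "$(( 7 * 86400 ))" 30              # [7d:30s]
inner_sample_reads "$(( 7 * 86400 ))" 30 300 15   # rate(...[5m]), assumed 15s scrape
```

Under those assumptions each raw sample is read about ten times (5m window / 30s resolution), which is where the extra cost of a fine-grained subquery comes from.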

4. Expensive aggregations without bounding labels

sum, topk, and count over an unbounded matcher force the engine to load and group everything.

topk(10, sum without (instance) (rate(node_cpu_seconds_total[5m])))

If node_cpu_seconds_total spans thousands of nodes and modes, the inner rate and sum load the full set before topk trims it — the trim happens last.

5. query.max-samples / query.timeout set too low for the workload

The defaults may be too tight for a large environment, or someone lowered them. Check the running flags:

curl -s http://localhost:9090/api/v1/status/flags \
  | jq '{timeout: ."query.timeout", max_samples: ."query.max-samples", max_concurrency: ."query.max-concurrency"}'

{
  "timeout": "30s",
  "max_samples": "50000000",
  "max_concurrency": "20"
}

A query.timeout lowered to 30s will reject legitimate long-range dashboard queries that previously worked at 2m.

6. Recording rules absent for hot dashboard queries

Dashboards recomputing the same heavy expression on every refresh, instead of reading a pre-aggregated recording rule, repeatedly hit the limits.

curl -s http://localhost:9090/api/v1/rules \
  | jq -r '.data.groups[].rules[] | select(.type=="recording") | .name' | head

# (empty -> no recording rules; every panel recomputes from raw series)

No recording rules means every wide aggregation runs from scratch against raw data.

Diagnostic Workflow

Step 1: Capture the exact error and the failing expression

curl -s -G 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=<EXPR>' \
  --data-urlencode 'start=...' --data-urlencode 'end=...' --data-urlencode 'step=...' | jq .error

Distinguish query timed out (slow) from too many samples (volume) — they need different fixes.
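When triaging many panels, sorting the two errors can be scripted. The following is a minimal sketch in POSIX shell that matches on the message text quoted in this guide; classify_query_error and its category names are hypothetical, chosen only for illustration.

```shell
# classify_query_error MESSAGE
# Maps a Prometheus error string to a coarse category by substring match.
# Hypothetical helper; the categories are labels for triage only.
classify_query_error() {
  case "$1" in
    *"query timed out"*)  echo "timeout" ;;   # slow evaluation
    *"too many samples"*) echo "volume" ;;    # sample limit exceeded
    "")                   echo "none" ;;      # no error string
    *)                    echo "other" ;;
  esac
}

classify_query_error "query timed out in expression evaluation"
classify_query_error "query processing would load too many samples into memory in query execution"
```

To feed it from the API, the curl command above can end in jq -r '.error // ""' so that a successful response yields an empty string.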

Step 2: Measure how many series the selector matches

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(<SELECTOR>)' | jq '.data.result[0].value[1]'

A large count means you should narrow the matcher before anything else.

Step 3: Estimate the sample load

# series * (range_seconds / scrape_interval_seconds)
echo "$(( SERIES * RANGE_SECONDS / SCRAPE_INTERVAL ))"

Compare the result against query.max-samples; if the estimate exceeds the limit, expect the query to fail as written.
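The same estimate can be inverted to ask how wide a range fits under the limit. This is a minimal sketch in POSIX shell under the same rough model (series x range / scrape_interval); max_range_seconds is a hypothetical helper name.

```shell
# max_range_seconds SERIES SCRAPE_INTERVAL_SECONDS MAX_SAMPLES
# Largest range, in seconds, whose estimated load stays within MAX_SAMPLES
# under the rough model series * (range / scrape_interval).
max_range_seconds() {
  local series scrape limit
  series=$1
  scrape=$2
  limit=$3
  echo "$(( limit * scrape / series ))"
}

# 50k series at a 15s scrape against the default 50M limit
max_range_seconds 50000 15 50000000
```

For those inputs the model gives 15000 seconds, a little over four hours; treat it as a starting point for choosing a range, not as an exact threshold.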

Step 4: Inspect the configured limits

curl -s http://localhost:9090/api/v1/status/flags \
  | jq '{timeout: ."query.timeout", max_samples: ."query.max-samples"}'

Confirm whether the limit is the default or has been lowered.
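The comparison with the default can be automated. Here is a minimal sketch in POSIX shell, assuming the flags endpoint reports the timeout as a single-unit duration such as 30s or 2m (as in the sample output earlier in this guide); duration_to_seconds and timeout_status are hypothetical helpers, and compound values like 1h30m are deliberately not handled.

```shell
# duration_to_seconds DURATION
# Converts a single-unit duration ("30s", "2m", "1h", "1d") to seconds.
# Returns non-zero for anything it does not understand.
duration_to_seconds() {
  local value unit
  unit=${1#"${1%?}"}      # last character
  value=${1%?}            # everything before it
  case "$value" in ''|*[!0-9]*) return 1 ;; esac
  case "$unit" in
    s) echo "$value" ;;
    m) echo "$(( value * 60 ))" ;;
    h) echo "$(( value * 3600 ))" ;;
    d) echo "$(( value * 86400 ))" ;;
    *) return 1 ;;
  esac
}

# timeout_status DURATION
# Compares a query.timeout value with the 2m default cited in this guide.
timeout_status() {
  local secs
  secs=$(duration_to_seconds "$1") || { echo "unparsed"; return 1; }
  if [ "$secs" -lt 120 ]; then echo "lowered"
  elif [ "$secs" -gt 120 ]; then echo "raised"
  else echo "default"
  fi
}

timeout_status 30s
timeout_status 2m
```

The query.max-samples value is a plain integer, so it can be compared with 50000000 directly using [ ... -lt ... ].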

Step 5: Check whether a recording rule should exist

curl -s http://localhost:9090/api/v1/rules | jq -r '.data.groups[].name'

If the heavy expression is reused across dashboards, pre-compute it with a recording rule.

Example Root Cause Analysis

A capacity dashboard panel showing sum by (cluster) (rate(container_cpu_usage_seconds_total[5m])) over a 30-day range returns “query processing would load too many samples” while the 6-hour view works.

Counting the inner selector:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(container_cpu_usage_seconds_total)' | jq '.data.result[0].value[1]'

"612904"

612k series over 30 days at a 30s scrape is ~53 billion samples (612,904 x 86,400), roughly 1,000x the 50M limit. Raising the limit would just trade the error for an OOM.

The fix is a recording rule that pre-aggregates to the cluster level once per evaluation, so the dashboard reads a handful of series instead of 600k:

groups:
  - name: capacity.rules
    interval: 30s
    rules:
      - record: cluster:container_cpu_usage:rate5m
        expr: sum by (cluster) (rate(container_cpu_usage_seconds_total[5m]))

The dashboard panel switches to cluster:container_cpu_usage:rate5m, which has one series per cluster; at the 30s rule interval a 30-day range holds 86,400 samples per cluster instead of billions in total, so the panel renders quickly.

Prevention Best Practices

Pre-aggregate hot, wide expressions into recording rules; dashboards should read low-cardinality recorded series, not recompute sum/rate over raw data on every refresh.
Bound selectors with specific labels (cluster, namespace, job) instead of broad __name__ regexes; the cheapest query is the one that matches fewer series.
Avoid subqueries on dashboards where a recording rule will do; subqueries multiply evaluation cost and are a common timeout source.
Keep --query.max-samples and --query.timeout at sensible values for your size, and treat hitting them as a signal to fix the query or add a rule, not just to raise the ceiling.
Control cardinality at ingestion with relabel drop rules; fewer series benefits every query.

Quick Command Reference

# Capture the failing query's error
curl -s -G 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=<EXPR>' --data-urlencode 'start=...' \
  --data-urlencode 'end=...' --data-urlencode 'step=...' | jq .error

# How many series does the selector match?
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(<SELECTOR>)' | jq '.data.result[0].value[1]'

# Current query limits
curl -s http://localhost:9090/api/v1/status/flags \
  | jq '{timeout: ."query.timeout", max_samples: ."query.max-samples"}'

# Do recording rules already exist?
curl -s http://localhost:9090/api/v1/rules \
  | jq -r '.data.groups[].rules[] | select(.type=="recording") | .name'

# Estimate sample load: series * (range / scrape_interval)
echo "$(( SERIES * RANGE_SECONDS / SCRAPE_INTERVAL ))"

# Query engine load
max by (slice) (prometheus_engine_query_duration_seconds{quantile="0.99"})

Conclusion

query timed out and too many samples are the query engine protecting Prometheus from a query that loads more data than it can handle. Work it down:

Separate timed out (slow evaluation) from too many samples (volume).
Count the series the selector matches — broad matchers are the usual root cause.
Estimate series x (range / interval) against query.max-samples.
Check whether query.timeout/query.max-samples were lowered.
Replace reused heavy expressions with recording rules.

The durable fix is almost never raising the limit — it is narrowing the selector, shrinking the range, or pre-aggregating with a recording rule so the query loads orders of magnitude fewer samples.
