By James Joyner IV · 10 min read

Prometheus Error Guide: 'query timed out' Too Many Samples Loaded

Fix Prometheus 'query timed out' and 'query processing would load too many samples' errors: diagnose high cardinality, wide ranges, expensive PromQL, and query limits.

  • #prometheus-monitoring
  • #troubleshooting
  • #errors
  • #promql

Overview

query timed out and query processing would load too many samples into memory in query execution are query-engine protection errors. When you run a PromQL expression, Prometheus loads the matching samples from the TSDB into memory and evaluates the expression. Two safety limits guard this: --query.timeout (default 2m) aborts queries that run too long, and --query.max-samples (default 50,000,000) aborts queries that would hold too many samples in memory at once. A timeout is returned as HTTP 503 with errorType timeout; exceeding the sample limit is returned as HTTP 422 with errorType execution.

You will see this from the API or in a Grafana panel:

query timed out in expression evaluation
query processing would load too many samples into memory in query execution

The full API response looks like:

{"status":"error","errorType":"execution","error":"query processing would load too many samples into memory in query execution"}

It is a per-query condition driven by how much data the expression touches: the same query that runs fine over 1 hour can fail over 30 days, and an expression matching one series can fail when matching 100,000.

Symptoms

  • Grafana panels show “query timed out” or “too many samples” instead of data, often only for long time ranges.
  • The API returns errorType: timeout for a timed-out query, or errorType: execution for the sample-limit error.
  • prometheus_engine_queries and prometheus_engine_query_duration_seconds spike.
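  • Heavy queries correlate with Prometheus memory pressure or OOM.

prometheus_engine_query_duration_seconds is a summary, so read its 0.99 quantile directly (value in seconds):

prometheus_engine_query_duration_seconds{slice="inner_eval", quantile="0.99"}
{slice="inner_eval", quantile="0.99"}  47.2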

Common Root Causes

1. High-cardinality selector matching too many series

A selector with a broad regex or a high-churn label matches tens of thousands of series, multiplying the samples loaded. Measure the match size first:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count({__name__=~"http_request.*"})' | jq '.data.result[0].value[1]'
"187422"

A range query over 187k series for 30 days loads billions of samples — straight past query.max-samples.
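
Once the total is known, it often helps to see which label value contributes most of those series. The sketch below only builds a PromQL string for a per-label breakdown; the helper name cardinality_by_label and the choice of the job label are illustrative assumptions, not part of Prometheus.

```shell
#!/bin/sh
# cardinality_by_label SELECTOR LABEL
# Prints a PromQL expression that counts the matched series per value of
# LABEL, largest groups first. Hypothetical helper: it only formats a string.
cardinality_by_label() {
  selector=$1
  label=$2
  printf 'sort_desc(count by (%s) (%s))\n' "$label" "$selector"
}

# Example: which job contributes most of the http_request.* series?
cardinality_by_label '{__name__=~"http_request.*"}' job
```

The printed expression can be passed to the same curl call as --data-urlencode "query=...". Labels whose top groups hold most of the series are the first candidates for a tighter matcher.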

2. The time range is too wide for the step

A range query loads every sample in the window for every matched series. Estimate the load: series x (range / scrape_interval).

# 50k series over 30 days at 15s scrape:
# 50000 * (30*86400/15) = 8.64e9 samples
echo "$((50000 * (30*86400/15)))"
8640000000

8.64 billion samples is roughly 173x the default 50M limit; the query is aborted during execution once the engine crosses it.
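
The same estimate can be inverted to ask how wide a range fits under the limit for a given series count. This is a rough sketch under the simplified model used above (every raw sample in the window counts once); the function name max_range_seconds is a hypothetical helper.

```shell
#!/bin/sh
# max_range_seconds LIMIT SERIES SCRAPE_INTERVAL
# Inverts series * (range / scrape_interval) <= LIMIT to get the widest
# range, in seconds, that stays within the limit. Integer math, rounds down.
max_range_seconds() {
  limit=$1
  series=$2
  interval=$3
  echo "$(( limit * interval / series ))"
}

# 50k series at a 15s scrape against the default 50,000,000 limit:
max_range_seconds 50000000 50000 15   # 15000 seconds, a little over 4 hours
```

Under these assumptions, anything much wider than about four hours needs a narrower selector or a pre-aggregated series.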

3. Subqueries and nested range vectors

A subquery like max_over_time(rate(x[5m])[1d:1m]) evaluates the inner expression at every step of the outer range, multiplying cost dramatically.

max_over_time(rate(http_requests_total[5m])[7d:30s])

This evaluates rate(...[5m]) at 30s steps across 7 days — ~20,160 inner evaluations per series, easily timing out on a busy metric.
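
The inner evaluation count scales linearly with the subquery resolution, so a coarser step is the cheapest knob to try first. A small sketch of that arithmetic, with subquery_steps as a hypothetical helper and the counts treated as approximate:

```shell
#!/bin/sh
# subquery_steps RANGE_SECONDS RESOLUTION_SECONDS
# Approximate number of times the inner expression is evaluated per series.
subquery_steps() {
  echo "$(( $1 / $2 ))"
}

subquery_steps 604800 30    # [7d:30s] -> 20160
subquery_steps 604800 300   # [7d:5m]  -> 2016
```

Moving from a 30s to a 5m resolution cuts the inner evaluations by a factor of ten; whether the coarser result is still accurate enough depends on the panel.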

4. Expensive aggregations without bounding labels

sum, topk, and count over an unbounded matcher force the engine to load and group everything.

topk(10, sum without (instance) (rate(node_cpu_seconds_total[5m])))

If node_cpu_seconds_total spans thousands of nodes and modes, the inner rate and sum load the full set before topk trims it — the trim happens last.

5. query.max-samples / query.timeout set too low for the workload

The defaults may be too tight for a large environment, or someone lowered them. Check the running flags:

# The flags are nested under .data in the API response
curl -s http://localhost:9090/api/v1/status/flags \
  | jq '.data | {timeout: ."query.timeout", max_samples: ."query.max-samples", max_concurrency: ."query.max-concurrency"}'
{
  "timeout": "30s",
  "max_samples": "50000000",
  "max_concurrency": "20"
}

A query.timeout lowered to 30s will reject legitimate long-range dashboard queries that previously worked at 2m.
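
To flag a lowered timeout in a script, the reported value has to be compared as a number. The sketch below assumes the flag is reported as a single-unit duration such as 30s or 2m, as in the output above; compound forms like 1h30m are deliberately not handled. Both function names are hypothetical.

```shell
#!/bin/sh
# duration_to_seconds DURATION
# Converts a single-unit duration such as 30s, 2m, or 1h to seconds.
# Assumption: one unit only; anything else returns a non-zero status.
duration_to_seconds() {
  value=${1%[smhd]}      # numeric part, e.g. "30" from "30s"
  unit=${1#"$value"}     # trailing unit, e.g. "s"
  case $value in
    ''|*[!0-9]*) return 1 ;;
  esac
  case $unit in
    s) echo "$value" ;;
    m) echo "$(( value * 60 ))" ;;
    h) echo "$(( value * 3600 ))" ;;
    d) echo "$(( value * 86400 ))" ;;
    *) return 1 ;;
  esac
}

# timeout_below_default DURATION
# Succeeds when DURATION is shorter than the 2m (120s) default.
timeout_below_default() {
  seconds=$(duration_to_seconds "$1") || return 2
  [ "$seconds" -lt 120 ]
}

timeout_below_default 30s && echo "query.timeout is below the 2m default"
```

The input would typically come from the jq output above, for example the string stored under query.timeout.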

6. Recording rules absent for hot dashboard queries

Dashboards recomputing the same heavy expression on every refresh, instead of reading a pre-aggregated recording rule, repeatedly hit the limits.

curl -s http://localhost:9090/api/v1/rules \
  | jq -r '.data.groups[].rules[] | select(.type=="recording") | .name' | head
# (empty -> no recording rules; every panel recomputes from raw series)

No recording rules means every wide aggregation runs from scratch against raw data.

Diagnostic Workflow

Step 1: Capture the exact error and the failing expression

curl -s -G 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=<EXPR>' \
  --data-urlencode 'start=...' --data-urlencode 'end=...' --data-urlencode 'step=...' | jq .error

Distinguish query timed out (slow) from too many samples (volume) — they need different fixes.
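
When triaging many panels, the split can be automated by matching on the error string. A minimal sketch, assuming the two message prefixes quoted in this guide; classify_query_error is a hypothetical helper, not a Prometheus tool.

```shell
#!/bin/sh
# classify_query_error MESSAGE
# Maps the "error" string from the API response to the kind of fix it needs.
# Matches only the two message prefixes quoted in this guide.
classify_query_error() {
  case $1 in
    "query timed out"*)                               echo "timeout" ;;
    "query processing would load too many samples"*)  echo "samples" ;;
    *)                                                echo "other" ;;
  esac
}

classify_query_error "query timed out in expression evaluation"   # timeout
```

Feed it the string printed by the curl call above (use jq -r .error so the quotes are stripped).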

Step 2: Measure how many series the selector matches

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(<SELECTOR>)' | jq '.data.result[0].value[1]'

If the count is large, narrow the matcher before anything else.

Step 3: Estimate the sample load

# series * (range_seconds / scrape_interval_seconds)
echo "$(( SERIES * RANGE_SECONDS / SCRAPE_INTERVAL ))"

Compare the estimate against query.max-samples; if the estimate exceeds the limit, the query cannot complete as written.
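
The comparison can be wrapped in a small function so it is usable in scripts. This is a sketch of the same rough estimate, not a reproduction of how the engine counts samples; sample_budget is a hypothetical helper and the limit is passed in explicitly.

```shell
#!/bin/sh
# sample_budget SERIES RANGE_SECONDS SCRAPE_INTERVAL LIMIT
# Prints the estimated sample count and how it compares to LIMIT.
# Returns 0 when the estimate fits, 1 when it exceeds the limit.
sample_budget() {
  estimate=$(( $1 * ($2 / $3) ))
  limit=$4
  if [ "$estimate" -gt "$limit" ]; then
    echo "estimate=$estimate limit=$limit over by $(( estimate / limit ))x"
    return 1
  fi
  echo "estimate=$estimate limit=$limit within limit"
}

# 50k series, 30 days, 15s scrape, default limit
sample_budget 50000 2592000 15 50000000 || true
```

The non-zero return status makes it easy to gate a dashboard change or a CI check on the estimate.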

Step 4: Inspect the configured limits

# The flags are nested under .data in the API response
curl -s http://localhost:9090/api/v1/status/flags \
  | jq '.data | {timeout: ."query.timeout", max_samples: ."query.max-samples"}'

Confirm whether the limit is the default or has been lowered.

Step 5: Check whether a recording rule should exist

curl -s http://localhost:9090/api/v1/rules | jq -r '.data.groups[].name'

If the heavy expression is reused across dashboards, pre-compute it with a recording rule.

Example Root Cause Analysis

A capacity dashboard panel showing sum by (cluster) (rate(container_cpu_usage_seconds_total[5m])) over a 30-day range returns “query processing would load too many samples” while the 6-hour view works.

Counting the inner selector:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(container_cpu_usage_seconds_total)' | jq '.data.result[0].value[1]'
"612904"

612k series over 30 days at a 30s scrape is ~53 billion samples (612,904 x 86,400), three orders of magnitude over the 50M limit. Raising the limit would just trade the error for an OOM.

The fix is a recording rule that pre-aggregates to the cluster level once per evaluation, so the dashboard reads a tiny series instead of 600k:

groups:
  - name: capacity.rules
    interval: 30s
    rules:
      - record: cluster:container_cpu_usage:rate5m
        expr: sum by (cluster) (rate(container_cpu_usage_seconds_total[5m]))

The dashboard panel switches to cluster:container_cpu_usage:rate5m, which has a handful of series; the 30-day range now loads thousands of samples instead of billions and renders instantly.

Prevention Best Practices

  • Pre-aggregate hot, wide expressions into recording rules; dashboards should read low-cardinality recorded series, not recompute sum/rate over raw data on every refresh.
  • Bound selectors with specific labels (cluster, namespace, job) instead of broad __name__ regexes; the cheapest query is the one that matches fewer series.
  • Avoid subqueries on dashboards where a recording rule will do; subqueries multiply evaluation cost and are a common timeout source.
  • Keep --query.max-samples and --query.timeout at sensible values for your size, and treat hitting them as a signal to fix the query or add a rule, not just to raise the ceiling.
  • Control cardinality at ingestion with relabel drop rules; fewer series benefits every query.

Quick Command Reference

# Capture the failing query's error
curl -s -G 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=<EXPR>' --data-urlencode 'start=...' \
  --data-urlencode 'end=...' --data-urlencode 'step=...' | jq .error

# How many series does the selector match?
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(<SELECTOR>)' | jq '.data.result[0].value[1]'

# Current query limits
curl -s http://localhost:9090/api/v1/status/flags \
  | jq '.data | {timeout: ."query.timeout", max_samples: ."query.max-samples"}'

# Do recording rules already exist?
curl -s http://localhost:9090/api/v1/rules \
  | jq -r '.data.groups[].rules[] | select(.type=="recording") | .name'

# Estimate sample load: series * (range / scrape_interval)
echo "$(( SERIES * RANGE_SECONDS / SCRAPE_INTERVAL ))"

# Query engine load (PromQL): p99 inner evaluation time from the summary metric
prometheus_engine_query_duration_seconds{slice="inner_eval", quantile="0.99"}

Conclusion

query timed out and too many samples are the query engine protecting Prometheus from a query that loads more data than it can handle. Work it down:

  1. Separate timed out (slow evaluation) from too many samples (volume).
  2. Count the series the selector matches — broad matchers are the usual root cause.
  3. Estimate series x (range / interval) against query.max-samples.
  4. Check whether query.timeout/query.max-samples were lowered.
  5. Replace reused heavy expressions with recording rules.

The durable fix is almost never raising the limit — it is narrowing the selector, shrinking the range, or pre-aggregating with a recording rule so the query loads orders of magnitude fewer samples.
