quantile_over_time vs histogram_quantile: Which Percentile to Trust

Percentiles are where PromQL quietly lies. Two functions advertise themselves as the way to get a p95 or p99 — quantile_over_time and histogram_quantile — and they compute fundamentally different things from fundamentally different data. Use the wrong one and you don’t get an error. You get a number that’s plausible, that you put on a dashboard, that you cite in an SLO review, and that’s simply wrong. The disagreements that follow (“the dashboard says p99 is 200ms but users are timing out”) usually trace back to a function-and-data-shape mismatch nobody noticed.

The two functions, precisely

quantile_over_time(0.95, some_gauge[5m]) takes the 95th percentile of the sampled values of a gauge over the window. The crucial word is sampled. It only sees the snapshots taken at scrape time. If your scrape interval is 15 seconds, it sees one value every 15 seconds and computes the percentile of those points. Anything that happens between scrapes is invisible to it.

histogram_quantile(0.95, rate(some_metric_bucket[5m])) interpolates the 95th percentile from bucketed observation counts. Every single observation lands in a bucket as it happens, so it sees all of them. But its precision is bounded by where the bucket boundaries sit — it interpolates linearly within whichever bucket the quantile falls into.

# Gauge sampled over time — sees only scrape-time snapshots
quantile_over_time(0.95, queue_depth[5m])

# Histogram — sees every observation, limited by bucket layout
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

So the rule of thumb: histogram_quantile for things you instrumented as histograms (latency, sizes), quantile_over_time for the distribution of a gauge’s sampled values over time. Cross them and the result is wrong.
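The phrase "interpolates linearly" hides how much the bucket layout shapes the answer. The following is a minimal Python sketch of that interpolation step, not Prometheus's implementation; the function name, bucket bounds, and counts are all made up for illustration. It estimates the same p95 from the same 100 requests under a fine and a coarse layout.

```python
def interpolate_in_bucket(rank, lower, upper, count_below, count_at_upper):
    """Linearly interpolate the value at `rank` inside a single bucket.

    lower/upper are the bucket's bounds; count_below and count_at_upper are
    the cumulative observation counts at those two bounds.
    """
    fraction = (rank - count_below) / (count_at_upper - count_below)
    return lower + (upper - lower) * fraction


total = 100          # hypothetical: 100 requests in the window
rank = 0.95 * total  # the p95 is the value at the 95th observation

# Fine layout: 94 requests finished within 0.5s, 99 within 1.0s.
p95_fine = interpolate_in_bucket(rank, lower=0.5, upper=1.0,
                                 count_below=94, count_at_upper=99)

# Coarse layout: same traffic, but the only bounds are 0.1s (50 requests) and 1.0s.
p95_coarse = interpolate_in_bucket(rank, lower=0.1, upper=1.0,
                                   count_below=50, count_at_upper=99)

print(f"fine buckets:   p95 ~ {p95_fine:.3f}s")
print(f"coarse buckets: p95 ~ {p95_coarse:.3f}s")
```

Under these assumed counts the fine layout lands near 0.6s and the coarse one near 0.93s, from identical traffic. The only thing that changed is where the boundaries sit.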

The silent trap in each

quantile_over_time misses sub-scrape spikes. A service that spikes to 2 seconds of latency for 5 seconds, between two 15-second scrapes, contributes nothing — the scrapes landed on the calm moments. You get a clean p99 that hides the tail. This is dangerous precisely because it looks reassuring. And applying it to a counter is meaningless: you’d be taking the percentile of an ever-increasing number.
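A small simulation makes the blind spot visible. This is a sketch under stated assumptions, not a model of any real service: the per-second "true" latency series is invented, and the quantile helper assumes linear interpolation between sorted samples at rank q * (n - 1), which is how quantile_over_time is understood to behave.

```python
def quantile(q, values):
    """Linear-interpolated quantile of a list, using rank = q * (n - 1)."""
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


# Hypothetical ground truth: one value per second for 5 minutes.
# Latency is 0.1s throughout, except a 5-second spike to 2.0s.
true_latency = [0.1] * 300
for second in range(16, 21):   # spike sits between the scrapes at t=15 and t=30
    true_latency[second] = 2.0

scrape_interval = 15
samples = true_latency[::scrape_interval]   # what a 15s scrape would record

sampled_p99 = quantile(0.99, samples)
true_max = max(true_latency)

print(f"samples seen: {len(samples)}")
print(f"p99 of scraped samples: {sampled_p99}s")
print(f"true worst value in the window: {true_max}s")
```

Every one of the 20 scraped samples reads 0.1s, so the sampled p99 is 0.1s even though the series really hit 2.0s. Shifting the spike by a few seconds so that one scrape lands on it would change the answer dramatically, which is the other half of the problem: the result depends on scrape timing, not just on what happened.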

histogram_quantile is only as good as the buckets. If your p99 lands in the top bucket (say everything above le="1" collapses into le="+Inf"), the function has no finite upper edge to interpolate toward, so it returns the upper bound of the highest finite bucket: 1. That “p99” is a floor on the real value, not a measurement; the true tail could be 1.1 seconds or 30. Coarse buckets lower down cause a milder version of the same problem: a wide bucket produces a confident-looking percentile that is mostly interpolation.

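# Sanity-bound a histogram quantile against the bucket layout
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Share of observations above the highest finite bucket
# (assumed here to be le="1"; adjust to your layout).
# If this exceeds 0.01, the p99 sits in le="+Inf": distrust it.
1 - sum(rate(http_request_duration_seconds_bucket{le="1"}[5m]))
  / sum(rate(http_request_duration_seconds_bucket{le="+Inf"}[5m]))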
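To see what happens when a quantile falls into le="+Inf", here is a hedged Python sketch of a bucket-based quantile estimate. It follows the behavior the Prometheus documentation describes for classic histograms (when the quantile is in the highest bucket, the upper bound of the second-highest bucket is returned), but it is a simplification, and the function name and bucket counts are hypothetical.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No finite upper edge to interpolate toward:
                # fall back to the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical layout whose highest finite bound is 1.0s.
buckets = [(0.25, 60), (0.5, 85), (1.0, 95), (math.inf, 100)]

p90 = bucket_quantile(0.90, buckets)   # falls in the 0.5-1.0 bucket
p99 = bucket_quantile(0.99, buckets)   # falls in +Inf
share_above_top = 1 - buckets[-2][1] / buckets[-1][1]

print(p90, p99, share_above_top)
```

In this made-up layout 5% of observations sit above 1.0s, so any quantile above 0.95 comes back as exactly 1.0, however slow those requests really were. A p99 that equals your highest finite bucket bound is a strong hint that the layout needs more headroom.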

Diagnosing a disputed number with AI

When two people disagree about a percentile, an assistant is useful for figuring out which trap is biting — provided you give it the metric type and make it reason about sampling:

Someone says my p99 latency dashboard reads 180ms but customers report multi-second waits. The query is quantile_over_time(0.99, request_latency_gauge[5m]) and the scrape interval is 30s. Which function should I be using, what’s the trap, and how do I verify?

The trap is quantile_over_time on a gauge scraped every 30s: it sees only about 10 samples per 5m window and misses spikes that happen between scrapes, so your real tail is invisible. A gauge can’t give accurate p99 latency at that sampling rate; instrument a histogram instead and use histogram_quantile(0.99, rate(request_latency_seconds_bucket[5m])), which counts every request. To look for the gap today, compare your 5m gauge p99 against max_over_time(request_latency_gauge[1d]). max_over_time reads the same scraped samples, so it can’t reveal what no scrape ever caught, but if the day’s max is multi-second while the p99 panel almost always reads 180ms, scrapes are only occasionally landing on the slow moments and sampling is hiding the tail.

The model drafts the diagnosis and a cross-check; you run the check and confirm. Notice the most valuable move: it was willing to say the metric is the wrong shape for the question. No query fixes a sparsely-sampled gauge — the answer is to instrument a histogram. That honesty is what you want from AI-assisted PromQL, and it’s the pattern across the Prometheus and monitoring prompts.
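A quick back-of-the-envelope check shows how little data that gauge p99 rests on. This sketch assumes the quantile is read off the sorted samples at rank q * (n - 1) with linear interpolation, and that a 5m window at a 30s scrape interval holds roughly 10 samples; the variable names are illustrative.

```python
import math

window_seconds = 5 * 60
scrape_interval_seconds = 30
q = 0.99

n_samples = window_seconds // scrape_interval_seconds   # samples in the window
rank = q * (n_samples - 1)                               # fractional position in the sorted samples
lower_index = math.floor(rank)
upper_index = min(lower_index + 1, n_samples - 1)
weight_on_upper = rank - lower_index

print(f"{n_samples} samples; p99 blends sorted samples "
      f"#{lower_index + 1} and #{upper_index + 1}, "
      f"with weight {weight_on_upper:.2f} on the larger")
```

With 10 samples the "p99" is decided by the two largest scraped values alone, weighted about 0.91 toward the largest. A slow period registers only if a scrape happens to land on it.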

When each is the right call

Use histogram_quantile for latency, payload size, or anything you can instrument as a histogram and where you care about the tail. Pair it with sane bucket boundaries so the quantile doesn’t land in +Inf.
Use quantile_over_time for the distribution of a gauge over time — for example, “what was the 95th-percentile queue depth this hour” — where the value genuinely is a point-in-time measurement and sub-scrape behavior doesn’t matter.
When in doubt, bound the answer: a gauge percentile must sit between min_over_time and max_over_time for the same window, and the rate of a histogram’s _count series over a window should track your known event rate.

The bottom line

quantile_over_time percentiles the sampled values of a gauge and is blind between scrapes; histogram_quantile percentiles bucketed counts of every observation and is bounded by your bucket layout. Match the function to the data shape, know each one’s silent failure mode, and bound every percentile with an independent cross-check before you quote it. For help choosing and verifying, the quantile_over_time vs histogram_quantile prompt and the histogram bucket boundary design prompt keep your percentiles honest before they end up in an SLO.
