PromQL Histogram & Quantile Calculation Prompt
Use Prometheus histograms correctly — `histogram_quantile`, bucket bounds, p99 latency calculation, histogram vs summary, native histograms.
- Target user: SREs calculating latency percentiles in PromQL
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who has computed p99/p95 latency in PromQL across many services. You know the histogram trap (poorly chosen bucket bounds produce a misleading p99) and how histograms differ from summaries.
I will provide:
- The latency metric and its buckets (`_bucket{le="..."}` values)
- Current query
- Symptom (p99 looks wrong, NaN, suspicious value)
Your job:
1. **Histogram vs summary**:
- **Histogram** — pre-computed buckets; `histogram_quantile()` interpolates
- **Summary** — quantiles computed client-side; cannot be aggregated across instances
- For aggregation: histogram is the choice
2. **Histogram metrics**:
- `<metric>_bucket{le="<value>"}` — cumulative count of observations ≤ value
- `<metric>_count` — total observations
- `<metric>_sum` — sum of all values
3. **For correct p99**:
```promql
histogram_quantile(0.99,
sum by (le)(rate(http_request_duration_seconds_bucket[5m])))
```
- `sum by (le)` keeps the le label
- `rate()` per bucket
- `histogram_quantile` interpolates
4. **Common errors**:
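- `histogram_quantile(0.99, sum(rate(...[5m])))`: missing `by (le)` drops the `le` label, so there are no buckets to interpolate and the result is empty (NaN, by contrast, means the buckets saw no observations in the window)
- `histogram_quantile(0.99, http_request_duration_seconds_bucket)`: no `rate()`, so the quantile covers every observation since each process started instead of the recent window
- p99 beyond the highest finite bucket → returns the upper bound of the highest finite bucket (not `+Inf`), so the value looks capped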
5. **For bucket bound choice**:
- Buckets should cover the full expected latency range, including the tail
- Roughly logarithmic spacing is typical: 0.01, 0.05, 0.1, 0.5, 1, 5, 10
- Use tighter buckets in the range where p95/p99 are expected to fall
6. **For aggregation across services**:
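- Bucket counters can be summed across instances as long as `le` is kept and all instances share the same bucket bounds
- Pre-computed quantiles cannot be summed or averaged (use histograms instead of summary metrics)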
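7. **For native histograms** (Prometheus 2.40+):
- One series per histogram with sparse, exponentially spaced buckets, instead of one `_bucket` series per `le` bound
- Aggregation needs no `by (le)`; `histogram_quantile` takes the histogram sample directly
- Introduced as experimental: Prometheus 2.x needs the `--enable-feature=native-histograms` flag, and the client library must expose native histograms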
8. **For percentile latency**:
- p50, p95, p99 — combine in dashboard
- Don't confuse with average (`_sum / _count`)
Mark DESTRUCTIVE: removing buckets from histogram (breaks historical), changing bucket bounds (silently changes percentile interpretation), summary aggregation across instances (incorrect).
---
Latency metric: [DESCRIBE]
Current query:
```promql
[PASTE]
```
Symptom: [DESCRIBE]
Why this prompt works
Histograms are routinely misused. The most common `histogram_quantile` mistake is omitting `by (le)`. This prompt walks through the correct patterns.
How to use it
- Always use histograms for aggregatable percentiles.
- Always `sum by (le)` with `histogram_quantile`.
- Choose buckets to cover the expected range.
- For native histograms, verify compatibility.
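To make the first point concrete, here is a hedged sketch contrasting the two aggregation paths. The summary metric `rpc_duration_seconds` is hypothetical and stands in for any summary that exposes a `quantile` label; the histogram metric is the one used throughout this page.

```promql
# Misleading: averages per-instance p99 values taken from a summary
# (rpc_duration_seconds is a hypothetical summary metric)
avg(rpc_duration_seconds{quantile="0.99"})

# Aggregatable: sum bucket rates across instances first, then estimate p99 once
histogram_quantile(0.99,
  sum by (job, le)(rate(http_request_duration_seconds_bucket[5m])))
```

The averaged form weights a nearly idle instance the same as a busy one, so it does not represent the p99 of the combined traffic.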
Useful commands
```promql
# Correct p99 by service
histogram_quantile(0.99,
  sum by (job, le)(rate(http_request_duration_seconds_bucket[5m])))

# p99 globally
histogram_quantile(0.99,
  sum by (le)(rate(http_request_duration_seconds_bucket[5m])))

# Average latency (NOT a percentile)
rate(http_request_duration_seconds_sum[5m])
  / rate(http_request_duration_seconds_count[5m])

# Bucket coverage check
sum by (le)(http_request_duration_seconds_bucket)

# Native histograms (2.40+)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds[5m])))
```
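For native histograms, a few companion functions replace the `_bucket`, `_sum`, and `_count` suffixes. The following is a minimal sketch, assuming `http_request_duration_seconds` is scraped as a native histogram and the server has native histograms enabled; the 0.5 s threshold is only illustrative.

```promql
# Fraction of requests completing within 0.5s (estimated from the histogram)
histogram_fraction(0, 0.5, sum(rate(http_request_duration_seconds[5m])))

# Request rate, taken from the same series
histogram_count(sum(rate(http_request_duration_seconds[5m])))

# Average latency (still not a percentile)
histogram_sum(sum(rate(http_request_duration_seconds[5m])))
  / histogram_count(sum(rate(http_request_duration_seconds[5m])))
```

These functions only apply to native histogram samples; run against classic `_bucket` series they return nothing.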
Bucket bound patterns
For typical web service (ms-second latency)
```go
// Application-side (Go example)
histogramOpts := prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
}
```
For high-throughput, sub-millisecond
```go
Buckets: []float64{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5}
```
For batch jobs (seconds to minutes)
```go
Buckets: prometheus.LinearBuckets(60, 60, 10) // 60s, 120s, ..., 600s
```
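Bucket bounds also decide which threshold questions can be answered without interpolation. As a sketch, assuming the web-service bucket list above (which includes a `0.5` bound), the share of requests served within 0.5 s over the last 5 minutes can be read straight from one bucket. The `le="0.5"` matcher is an assumption about how the client library renders the label and must match the exposed string exactly.

```promql
# Share of requests completing within 0.5s (le="0.5" assumed to exist)
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))
```

Because this reads a bucket counter directly, the result is exact rather than estimated, which is a reason to place a bucket bound on any latency threshold that matters to an SLO.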
Common findings this catches
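- p99 is empty or NaN → missing `by (le)` (empty result) or no observations in the rate window (NaN).
- p99 pinned at the highest finite bucket bound → buckets don't cover the range; the long tail lies beyond the highest bucket.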
- p99 constant despite latency change → bucket bounds too coarse.
- Summary metrics aggregated → incorrect; switch to histogram.
- p99 lower than max → expected (statistical, not max).
- histogram_quantile on non-rated bucket → cumulative, wrong.
- Native histogram missing from dashboards → check the Prometheus version and feature flag, and whether the dashboard data source supports native histograms.
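To check whether the long tail exceeds the highest finite bucket, it can help to measure how many observations land above it. This is a sketch that assumes the highest finite bound is `10`, as in the web-service bucket list earlier; substitute the `le` value of your own top bucket.

```promql
# Share of observations slower than the highest finite bucket (le="10" assumed)
1 - (
    sum(rate(http_request_duration_seconds_bucket{le="10"}[5m]))
  /
    sum(rate(http_request_duration_seconds_bucket{le="+Inf"}[5m]))
)
```

If this share is above 0.01, the p99 falls in the `+Inf` bucket and cannot be estimated from the current bounds.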
When to escalate
- Bucket choice for new service — coordinate with app team.
- Migration from summary to histogram — staged.
- Native histogram adoption — Prom version coordination.
Related prompts
- Grafana Dashboard Performance Prompt. Optimize Grafana dashboards: query parallelism, refresh rates, variable design, panel count, data source pressure.
- PromQL Query Optimization Prompt. Diagnose slow PromQL queries: cardinality explosion, range vector traps, sum vs avg pitfalls, query timeout, recording rules opportunity.
- SLO Error Budget & Multi-Window Burn Rate Alerts Prompt. Design SLO-based alerts: error budgets, multi-burn-rate alerting, SLI selection, burn budget calculation.