PromQL Latency SLI from Histograms Aggregation Design Prompt
Build a correct latency SLI and alert from Prometheus histogram metrics: aggregate buckets before `histogram_quantile`, choose between a percentile and a threshold-ratio SLI, and avoid the average-of-percentiles trap.
- Target user: SREs defining latency objectives
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who builds latency SLIs from Prometheus histogram metrics and gets the aggregation order right.
I will provide:
- The histogram metric (`*_bucket`, `*_count`, `*_sum`) and its `le` bucket boundaries
- The objective (e.g. "99% of requests under 300ms over 30 days" or "p99 < 500ms")
- The dimensions to slice/aggregate by (service, route, cluster) and which to collapse
- Any current query that produces implausible percentiles or fails on low traffic
Your job:
1. **Aggregate buckets first**: write `histogram_quantile(0.99, sum by (le, ...) (rate(metric_bucket[5m])))`; explain why you must `sum` bucket rates across instances before taking the quantile, and never average per-instance percentiles.
2. **Pick the SLI shape**: decide between a percentile SLI (the p99 latency value) and a threshold-ratio SLI (the fraction of requests at or under the target `le` bucket); recommend the ratio form for error-budget alerting and show both.
3. **Build the ratio**: for the threshold form, write `sum(rate(metric_bucket{le="0.3"}[w])) / sum(rate(metric_count[w]))`, where `w` is the evaluation window, and explain why the objective's threshold must coincide with an existing `le` boundary.
4. **Handle edges**: address clamping to the highest finite bucket bound when the quantile falls in the `+Inf` bucket, interpolation error on coarse buckets, and `0/0` NaN on idle windows.
5. **Recording rule**: propose recorded series for the aggregated buckets so dashboards and alerts read the same precomputed data.
6. **Validate**: give a query that sanity-checks the percentile against the mean latency, `rate(metric_sum[w]) / rate(metric_count[w])`, and flag when buckets are too coarse to trust.
Output as: (a) the SLI query (both forms), (b) the recording rule, (c) the bucket-boundary caveat for this objective, (d) the validation query.
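
To make the aggregation-order and edge-case points concrete, here is a minimal Python sketch that approximates how `histogram_quantile` interpolates over classic cumulative buckets. It is an illustration under stated assumptions, not Prometheus's actual implementation: the helper names (`sum_buckets`, `ratio_under`), the bucket layout, and every number are invented, and the plain counts stand in for `rate(metric_bucket[5m])` values.

```python
import math

# Cumulative classic-histogram buckets as (le, value) pairs, sorted by le and
# ending with +Inf. Values stand in for rate(metric_bucket[5m]) results.
# All instances and numbers below are hypothetical.
busy_fast = [(0.1, 900), (0.3, 995), (0.5, 1000), (1.0, 1000), (math.inf, 1000)]
quiet_slow = [(0.1, 0), (0.3, 0), (0.5, 2), (1.0, 10), (math.inf, 10)]
tail_heavy = [(0.1, 0), (0.3, 0), (0.5, 0), (1.0, 5), (math.inf, 10)]
idle = [(0.1, 0), (0.3, 0), (0.5, 0), (1.0, 0), (math.inf, 0)]


def histogram_quantile(q, buckets):
    """Approximate a quantile (0 < q <= 1) by linear interpolation in a bucket.

    Assumes positive bucket bounds, so the first bucket starts at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # idle window: nothing to rank
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the +Inf bucket: the best available answer
                # is the highest finite bound, which understates the tail.
                return buckets[-2][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


def sum_buckets(*instances):
    """Equivalent in spirit to `sum by (le)`: add values bucket by bucket."""
    return [(group[0][0], sum(v for _, v in group)) for group in zip(*instances)]


def ratio_under(threshold, buckets):
    """Threshold-ratio SLI: share of requests at or under an exact le bound."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # 0/0 on an idle window
    for bound, count in buckets:
        if bound == threshold:
            return count / total
    raise ValueError("threshold must match an existing bucket boundary")


merged = sum_buckets(busy_fast, quiet_slow)
p99_aggregated = histogram_quantile(0.99, merged)
p99_averaged = (
    histogram_quantile(0.99, busy_fast) + histogram_quantile(0.99, quiet_slow)
) / 2
sli_ratio = ratio_under(0.3, merged)

print(f"p99 after summing buckets:  {p99_aggregated:.3f}s")
print(f"average of per-instance p99: {p99_averaged:.3f}s")
print(f"fraction under 0.3s:         {sli_ratio:.4f}")
print(f"p99 with tail in +Inf:       {histogram_quantile(0.99, tail_heavy)}")
print(f"p99 on idle window:          {histogram_quantile(0.99, idle)}")
```

With these made-up numbers, the aggregated p99 comes out near 0.44s while the average of the two per-instance p99s is near 0.64s, because the quiet instance's ten slow requests get the same weight as the busy instance's thousand. The `tail_heavy` case returns 1.0, the highest finite bound, even though half its requests were slower than that, which is the signal that the top bucket is too low for the objective.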
Related prompts
- Prometheus Histogram Bucket Boundary Design Prompt
  Choose histogram bucket boundaries that match your SLO thresholds and latency distribution so quantile estimates are accurate where it matters, without exploding series cardinality from too many buckets.
- PromQL Histogram & Quantile Calculation Prompt
  Use Prometheus histograms correctly: `histogram_quantile`, bucket bounds, p99 latency calculation, histogram vs summary, native histograms.