PromQL Latency SLI from Histograms Aggregation Design Prompt

Build a correct latency SLI/alert from Prometheus histogram metrics — aggregating buckets before histogram_quantile, choosing percentile vs threshold-ratio, and avoiding the average-of-percentiles trap.

Target user: SREs defining latency objectives
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior SRE who builds latency SLIs from Prometheus histogram metrics and gets the aggregation order right.

I will provide:
- The histogram metric (`*_bucket`, `*_count`, `*_sum`) and its `le` bucket boundaries
- The objective (e.g. "99% of requests under 300ms over 30 days" or "p99 < 500ms")
- The dimensions to slice/aggregate by (service, route, cluster) and which to collapse
- Any current query that produces implausible percentiles or fails on low traffic

Your job:

1. **Aggregate buckets first** — write `histogram_quantile(0.99, sum by (le, ...) (rate(metric_bucket[5m])))`; explain why you must `sum` buckets across instances before the quantile, never average per-instance percentiles.
2. **Pick the SLI shape** — decide between a percentile SLI (p99 latency value) and a threshold-ratio SLI (fraction under the target `le` bucket); recommend the ratio form for error-budget alerting and show both.
3. **Build the ratio** — for the threshold form, write `sum(rate(metric_bucket{le="0.3"}[w])) / sum(rate(metric_count[w]))` and explain bucket-boundary alignment to the objective.
4. **Handle edges** — address clamping to the highest finite bucket bound when the quantile falls in the `+Inf` bucket, linear-interpolation error on coarse buckets, and `0/0` NaN on idle windows.
5. **Recording rule** — propose recorded series for the aggregated buckets so dashboards and alerts read the same precomputed data.
6. **Validate** — give a query to sanity-check the percentile against the mean latency, `sum(rate(metric_sum[w])) / sum(rate(metric_count[w]))`, and flag when buckets are too coarse to trust.

Output as: (a) the SLI query (both forms), (b) the recording rule, (c) the bucket-boundary caveat for this objective, (d) the validation query.
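
Outside the prompt itself, the aggregation-order point is easier to see with numbers. The following is a minimal Python sketch, not Prometheus code: it approximates the linear interpolation that `histogram_quantile` performs on classic histograms, using raw cumulative counts instead of `rate()` output. The bucket bounds, the two instances (`fast`, `slow`), and their counts are all invented for illustration.

```python
import math

# Hypothetical bucket upper bounds in seconds; the last entry stands in for le="+Inf".
BOUNDS = [0.1, 0.3, 0.5, 1.0, math.inf]


def histogram_quantile(q, cumulative):
    """Estimate a quantile from cumulative bucket counts by linear interpolation.

    A simplified approximation of the classic-histogram behaviour; assumes 0 < q <= 1.
    """
    total = cumulative[-1]
    if total == 0:
        return math.nan  # idle window: no observations at all
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, count in enumerate(cumulative):
        if count >= rank:
            break
    if math.isinf(BOUNDS[i]):
        return BOUNDS[i - 1]  # clamp to the highest finite bound
    lower = BOUNDS[i - 1] if i > 0 else 0.0
    below = cumulative[i - 1] if i > 0 else 0
    return lower + (BOUNDS[i] - lower) * (rank - below) / (count - below)


def threshold_ratio(cumulative, le):
    """Fraction of observations at or under `le`; `le` must be an existing boundary."""
    total = cumulative[-1]
    if total == 0:
        return math.nan  # mirrors the 0/0 case on idle windows
    return cumulative[BOUNDS.index(le)] / total


# Two made-up instances: one fast with high traffic, one slow with low traffic.
fast = [900, 990, 1000, 1000, 1000]
slow = [0, 10, 50, 100, 100]
merged = [a + b for a, b in zip(fast, slow)]  # the "sum by (le)" step

avg_of_p99 = (histogram_quantile(0.99, fast) + histogram_quantile(0.99, slow)) / 2
merged_p99 = histogram_quantile(0.99, merged)
ratio = threshold_ratio(merged, 0.3)

print(f"average of per-instance p99: {avg_of_p99:.3f}s")
print(f"p99 of merged buckets:       {merged_p99:.3f}s")
print(f"fraction under 300ms:        {ratio:.3f}")
```

With these invented counts, averaging the per-instance p99 values gives roughly 0.645s, while the p99 of the merged buckets is roughly 0.89s. The average weights both instances equally even though the fast one serves ten times the traffic, so it does not describe any real request population. `threshold_ratio` raises `ValueError` when the objective threshold is not one of the bucket bounds, which is the bucket-boundary alignment caveat in miniature.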

Related prompts

Prometheus Histogram Bucket Boundary Design Prompt

PromQL Histogram & Quantile Calculation Prompt
