AI-Assisted PromQL for Latency Percentiles That Don't Lie

The first time I shipped a p99 latency panel, it was wrong, and it was wrong in the most dangerous way: it looked plausible. The number was stable, it tracked roughly with load, and for three weeks nobody questioned it. Then a real latency incident hit and the panel barely moved while customers were timing out. The bug was a classic histogram_quantile mistake — I’d averaged the quantile across instances instead of summing buckets first — and it had been quietly lying the whole time. These days I draft percentile queries with an AI assistant because it knows the gotchas cold, but I’ve also learned exactly which of its suggestions to distrust. Here’s the working knowledge.

Why histogram_quantile breaks brains

Prometheus histograms are cumulative buckets, each labeled with an le (“less than or equal to”) boundary. histogram_quantile(0.95, ...) estimates the value below which 95% of observations fall, by interpolating across those buckets. The rules that make it correct are non-obvious:

You must rate() the buckets first so you’re looking at recent activity, not all-time counters.
You must aggregate with by (le, ...) — the le label has to survive, or there’s nothing to interpolate across.
You compute the quantile last, after aggregation. Averaging pre-computed quantiles is mathematically meaningless.

Get any of those wrong and the query still returns a number. That’s what makes it treacherous, and it’s why I lean on AI: it has effectively memorized this footgun. But it’s a fast junior engineer, and on at least one detail per session it confidently gets the bucket math subtly wrong, so I verify everything.
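To make the interpolation less abstract, here is a minimal Python sketch of the estimate as I understand it for classic histograms. It is not Prometheus’s implementation: the function name and the sample bucket counts are made up for illustration, and it assumes the lowest bucket starts at 0 and that 0 < q < 1.

```python
import math


def histogram_quantile(q, buckets):
    """Approximate the classic-histogram quantile estimate (illustrative only).

    buckets: (upper_bound, cumulative_count) pairs sorted by bound, ending
    with math.inf for the +Inf bucket.
    """
    if not 0 < q < 1:
        raise ValueError("this sketch only handles 0 < q < 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations sit at or below the quantile
    prev_bound, prev_count = 0.0, 0.0  # assumption: lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Landed in +Inf: nothing to interpolate toward, so report
                # the highest finite boundary.
                return prev_bound
            # Linear interpolation inside the bucket containing the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for one service over one window.
buckets = [(0.1, 50), (0.25, 80), (0.5, 94), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(f"p95 estimate: {p95:.3f}s")
```

With these numbers the 95th observation falls in the (0.5, 1.0] bucket, which covers cumulative counts 94 through 99, so the estimate is 0.5 + 0.5 × 1/5 = 0.6 seconds. Drop the le boundaries and there is nothing left for that arithmetic to work with.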

The correct shape, and how I prompt for it

My prompt anchors the model in my actual metric and the SLO question I’m answering.

Write a PromQL query for the 95th-percentile latency of http_request_duration_seconds, aggregated per service across all instances, over a 5-minute window. Explain why le must survive the aggregation.

A good answer:

histogram_quantile(
  0.95,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

The explanation should call out that sum by (le, service) combines the per-instance bucket rates into a single cumulative distribution per service before interpolation. If the model omits le from the by clause, stop: without le there are no bucket boundaries left to interpolate across, and it means the model is pattern-matching badly that session.

Pro Tip: Ask the model to write the WRONG version too — the one that averages quantiles — and explain why it’s wrong. Seeing both side by side is the fastest way to internalize the trap, and it stress-tests whether the model actually understands the math or is just parroting the right shape.

The aggregation trap, in concrete terms

Here’s the query that lied to me for three weeks:

# WRONG: computes p99 per instance, then averages the percentiles
avg by (service) (
  histogram_quantile(0.99,
    rate(http_request_duration_seconds_bucket[5m])
  )
)

In this situation, averaging percentiles understates tail latency, because the slow instance’s p99 gets diluted by the fast instances. (In general an average of quantiles can be off in either direction, depending on how traffic is spread across instances, which is exactly why it can’t be trusted.) During my incident, one overloaded pod was timing out while four healthy ones smoothed the number flat. The correct query sums buckets first so the slow pod’s contribution shows up in the tail. When I asked the model to compare the two against a worked example, it produced a clear numerical illustration. That’s the kind of explainable output I’ll actually trust, because I can check the arithmetic.
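Here is the kind of worked example I mean, as a small Python sketch. Everything in it is hypothetical: five pods with equal traffic, invented bucket counts, and a simplified quantile helper that mimics linear interpolation across cumulative buckets rather than calling anything from Prometheus.

```python
import math

# Hypothetical bucket boundaries in seconds, shared by every pod.
BOUNDS = [0.1, 0.5, 1.0, 5.0, math.inf]


def quantile(q, cumulative):
    """Linear-interpolation estimate over cumulative counts (illustrative)."""
    rank = q * cumulative[-1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in zip(BOUNDS, cumulative):
        if count >= rank:
            if math.isinf(bound):
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative counts per bucket: four healthy pods and one overloaded pod.
healthy = [900, 1000, 1000, 1000, 1000]
slow = [100, 300, 500, 1000, 1000]
pods = [healthy] * 4 + [slow]

# WRONG: p99 per pod, then average the percentiles.
averaged = sum(quantile(0.99, pod) for pod in pods) / len(pods)

# RIGHT: sum the buckets across pods, then take one p99.
merged_counts = [sum(column) for column in zip(*pods)]
merged = quantile(0.99, merged_counts)

print(f"average of per-pod p99s: {averaged:.3f}s")
print(f"p99 of merged buckets:   {merged:.3f}s")
```

Each healthy pod reports a p99 of 0.46s and the slow pod reports 4.92s, so the average comes out at 1.352s. Summing the buckets first gives 4.6s, because the slow pod carries a fifth of the traffic and owns essentially the whole tail.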

Native histograms change the syntax

If you’ve moved to Prometheus native histograms, the bucket-juggling mostly disappears and the syntax shifts to histogram_quantile(0.95, sum by (service) (rate(http_request_duration_seconds[5m]))) — note no le and no _bucket suffix. AI assistants frequently mix the two conventions in a single answer because their training data straddles the transition. I always tell the model explicitly which I’m on:

I’m using classic histograms with _bucket series, not native histograms. Do not use native-histogram syntax.

That one sentence eliminates the most common category of wrong answer I get on percentile queries.

Wiring it into an SLO and an alert

Once the percentile query is correct and verified in the expression browser, it becomes the basis for a latency SLO. I record it so the alert and dashboard share one definition:

groups:
  - name: latency-slo.rules
    rules:
      - record: "service:request_latency:p95_5m"
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
      - alert: "ServiceLatencyP95High"
        expr: 'service:request_latency:p95_5m > 0.5'
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 500ms for {{ $labels.service }}"
          runbook_url: "https://runbooks.internal/latency"

I let AI draft the alert wrapper, but the for: 10m and the 500ms threshold are human decisions grounded in the actual SLO target — I never let the model invent thresholds out of thin air. The free Alert Rule Generator is handy here because it forces a for: duration and a runbook link, which is exactly the structure you want around a percentile alert.

The bucket-boundary problem nobody warns you about

There’s a subtler failure that even careful engineers miss, and AI is useful for catching it if you ask the right question. histogram_quantile interpolates linearly within a bucket, which means your result can never be more precise than your bucket boundaries allow. If your histogram’s highest explicit bucket is le="1.0" and everything above that lands in the +Inf bucket, then any p99 that actually falls above one second is reported as exactly 1.0: when the quantile lands in the +Inf bucket there is no finite upper bound to interpolate toward, so Prometheus returns the upper bound of the highest finite bucket. The panel flat-lines and stops telling you the truth about the tail. I’ve seen a p99 panel sit perfectly flat at 1.0 seconds during an incident where real latency was four seconds, simply because there were no buckets above one second to interpolate into.
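The flat line is easy to reproduce on paper. The Python sketch below buckets the same invented observations (90 fast requests, 10 at four seconds) into two hypothetical layouts and runs a simplified version of the interpolation over each. The layouts, the observations, and the helper names are all assumptions for illustration.

```python
import math


def cumulative_counts(observations, bounds):
    """Count observations at or below each boundary, like le buckets do."""
    return [sum(1 for value in observations if value <= bound) for bound in bounds]


def quantile(q, bounds, counts):
    """Linear-interpolation estimate over cumulative buckets (illustrative)."""
    rank = q * counts[-1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in zip(bounds, counts):
        if count >= rank:
            if math.isinf(bound):
                # Saturated: the best available answer is the top finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical incident: 90 requests at 200ms, 10 requests at 4s.
observations = [0.2] * 90 + [4.0] * 10

narrow = [0.1, 0.25, 0.5, 1.0, math.inf]               # nothing above 1s
wide = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, math.inf]  # tail is covered

p99_narrow = quantile(0.99, narrow, cumulative_counts(observations, narrow))
p99_wide = quantile(0.99, wide, cumulative_counts(observations, wide))

print(f"p99 with narrow buckets: {p99_narrow:.2f}s")
print(f"p99 with wide buckets:   {p99_wide:.2f}s")
```

The narrow layout reports 1.00s no matter how bad the tail gets. The wide layout reports 4.75s, which is still only as precise as the (2.5, 5.0] bucket allows, but at least it moves when the real latency does.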

I now make the model audit my bucket layout against my SLO:

My SLO target is 300ms and I alert on p95 and p99. Here are my histogram bucket boundaries: [paste]. Are my buckets granular enough around the SLO threshold and wide enough in the tail to report p99 accurately?

A good answer points out gaps — too few buckets between 200ms and 500ms to resolve the p95 precisely, or no buckets above 2 seconds so the tail saturates. The fix lives in the instrumentation, not the query, but the query is where the lie shows up, so this audit closes the loop. As always I verify the recommendation against where real observations actually land before re-instrumenting, because changing bucket boundaries is a one-way door once historical data exists.
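Before pasting boundaries into a prompt, I sometimes run a rough check of my own. The Python sketch below encodes three rules of thumb; the function name and the default thresholds (at least three boundaries within half to double the SLO, and a top finite bucket at ten times the SLO) are my own assumptions, not Prometheus guidance, so tune them to your latency profile.

```python
import math


def audit_buckets(bounds, slo_seconds, min_near_slo=3, tail_factor=10.0):
    """Return a list of findings about a bucket layout (heuristic only)."""
    finite = sorted(b for b in bounds if not math.isinf(b))
    if not finite:
        return ["no finite buckets: every quantile is unresolvable"]

    findings = []

    # A boundary exactly at the SLO threshold makes "fraction under SLO" exact.
    if not any(math.isclose(b, slo_seconds) for b in finite):
        findings.append(f"no bucket boundary at the SLO threshold ({slo_seconds}s)")

    # Resolution around the threshold: boundaries between SLO/2 and 2x SLO.
    near = [b for b in finite if slo_seconds / 2 <= b <= slo_seconds * 2]
    if len(near) < min_near_slo:
        findings.append(
            f"only {len(near)} boundaries near the SLO; want at least {min_near_slo}"
        )

    # Tail coverage: above the top finite boundary, quantiles saturate.
    if finite[-1] < slo_seconds * tail_factor:
        findings.append(
            f"top finite bucket is {finite[-1]}s; tail saturates above that"
        )

    return findings


# Hypothetical layouts checked against a 300ms SLO.
current = [0.1, 0.2, 0.5, 1.0, 2.0, math.inf]
proposed = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 2.5, 5.0, math.inf]

for name, layout in (("current", current), ("proposed", proposed)):
    print(name, audit_buckets(layout, slo_seconds=0.3) or "looks reasonable")
```

For the current layout this flags all three problems; the proposed layout passes. It only looks at the boundaries, so it complements rather than replaces checking where real observations land.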

The verification ritual

Every percentile query gets three checks before it drives anything: run the inner rate(...[5m]) and confirm it returns one series per bucket, each carrying an le label (rate() drops the metric name, so the output won’t show the _bucket suffix); run the full query and sanity-check the magnitude against known latency; and deliberately stress one instance in staging to confirm the tail actually moves. If a query can’t pass the stress test, suspect the averaging trap in disguise or a top bucket that has saturated.

I’ve drafted these in ChatGPT and Claude and inline with GitHub Copilot; all of them get histograms mostly right and occasionally wrong, which is exactly why the human verification step is non-negotiable.

Conclusion

Percentile latency queries are the highest-stakes PromQL most teams write, because they drive the SLOs everyone trusts. AI is a fast, knowledgeable drafting partner for them, but histogram_quantile is precisely where a confident wrong answer does the most damage. Keep le in the aggregation, compute the quantile last, declare your histogram type, and stress-test the tail before you ship. More on this in SLOs and error budgets with Prometheus and the wider monitoring guides.