Using AI to Untangle an Inherited PromQL Query

The query that broke me was a single line, 240 characters long, with four nested sum by aggregations and a histogram_quantile buried in the middle. It drove a panel on the most-watched dashboard in the company, the person who wrote it had left eighteen months earlier, and the panel had just started showing NaN. No comments, no recording rule, no git blame that explained the intent. I spent forty minutes squinting at it before I gave up and pasted it into an AI assistant. That session changed how I deal with inherited PromQL, and it’s worth explaining what actually helped and where the model would have led me off a cliff if I’d trusted it blindly.

Why inherited PromQL is so hard to read

PromQL reads inside-out, not left-to-right. The innermost selector runs first, then each wrapping function transforms the result, and by the time you reach the outer sum you’ve lost track of what the vector even contains. Add label matchers, without versus by, rate() windows, and offset modifiers, and a one-liner becomes a puzzle box. The query that broke me looked roughly like this:

histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket{env="prod"}[5m]))) > 0.5 and on(service) sum by (service) (rate(http_requests_total{env="prod"}[5m])) > 10

That query does two reasonable things at once, a latency check and a traffic gate, but with both jammed into one expression it is hard to reason about safely. This is exactly the kind of grunt work where an AI assistant earns its keep: it never gets bored decomposing nested expressions, and it’s fast. But it is a fast junior engineer, not an oracle, so the output is a starting hypothesis I verify, not a verdict I deploy.

Step one: ask for a plain-English decomposition

My first prompt is always the same shape. I paste the query and ask the model to explain it bottom-up, naming the vector type at each layer.

Explain this PromQL query layer by layer, from the innermost selector outward. For each layer tell me whether it produces an instant vector or a range vector, and what labels survive.

For the query above, a good model walks through it: the rate(...[5m]) produces a per-second rate over a 5-minute window for each bucket, sum by (le, service) collapses everything except the histogram boundary and service, histogram_quantile(0.95, ...) estimates the 95th-percentile latency per service, the > 0.5 filters to services slower than 500ms, and the and on(service) gate only keeps services also serving more than 10 requests per second. That last clause is the part a human reader almost always misses, and it’s the part that explains the NaN: a low-traffic service had dropped below the 10 rps gate.

Pro Tip: Make the model state the vector type (instant vs range) at every layer. Mismatches there are the single most common source of “this query worked yesterday” bugs, and naming them forces the model to be precise instead of hand-wavy.
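As an illustration of what a layer-by-layer answer can look like, here is the same query from above, reflowed with one annotation per layer. The outer parentheses are added only to make the grouping explicit (comparison operators already bind tighter than and), and the comments are a sketch of the expected vector types, to be checked against real data rather than taken as given.

```promql
(
  histogram_quantile(
    0.95,                      # instant vector out; le is consumed, service survives
    sum by (le, service) (     # instant vector; only le and service survive
      rate(                    # instant vector; per-second rate for each bucket series
        http_request_duration_seconds_bucket{env="prod"}[5m]   # range vector
      )
    )
  ) > 0.5                      # filter: keep services whose p95 exceeds 0.5 s
)
and on(service)                # set intersection, matched on the service label only
(
  sum by (service) (           # instant vector; only service survives
    rate(http_requests_total{env="prod"}[5m])
  ) > 10                       # filter: keep services above 10 req/s
)
```

Written this way, the range-to-instant transition at rate() and the two separate filters are visible at a glance, which is the precision the prompt is asking the model for.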

Step two: make it prove the explanation against real data

This is the step people skip, and it’s the one that separates a useful AI workflow from a confidently wrong one. The model’s explanation is a hypothesis. I confirm it by running the sub-expressions myself in the Prometheus expression browser or via the API. Peel one layer at a time:

# Inner: does this return per-bucket rates with an le label?
rate(http_request_duration_seconds_bucket{env="prod"}[5m])

# Next: confirm le and service survive the aggregation
sum by (le, service) (rate(http_request_duration_seconds_bucket{env="prod"}[5m]))

If the model claimed le survives and the expression browser shows it gone, the model is wrong about this dataset, and I’d rather find that out now than after a refactor. The AI gives me the map; Prometheus tells me whether the map matches the territory.
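The same peeling can continue outward to the quantile and the gate. The expressions below are a sketch that reuses only the metric names and thresholds from the query above; each one is meant to be run on its own, since the expression browser evaluates one expression at a time.

```promql
# Next: p95 per service, before any filtering. Look for services that
# return NaN here, which happens when their buckets saw no observations
# in the window.
histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket{env="prod"}[5m])))

# Right-hand side of the gate on its own: which services clear 10 req/s?
sum by (service) (rate(http_requests_total{env="prod"}[5m])) > 10

# Inverse of the gate: services at or below the threshold, i.e. the ones
# the full query drops from the panel.
sum by (service) (rate(http_requests_total{env="prod"}[5m])) <= 10
```

Comparing the last two result sets shows exactly which services the traffic gate is hiding at the moment.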

Step three: refactor into recording rules, not a prettier one-liner

Once I understand intent, I don’t want a cleaner 240-character line. I want the logic broken into named, commented recording rules so the next person doesn’t repeat my forty minutes. I ask the model to propose them, then I review every threshold by hand.

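groups:
  - name: service-latency.rules
    rules:
      # p95 request latency in seconds, per service, prod only.
      - record: "service:http_request_duration_seconds:p95"
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service) (
              rate(http_request_duration_seconds_bucket{env="prod"}[5m])
            )
          )
      # Request rate per service in req/s, prod only. The dashboard panel
      # uses this as its traffic gate (> 10).
      - record: "service:http_requests:rate5m"
        expr: |
          sum by (service) (
            rate(http_requests_total{env="prod"}[5m])
          )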

Now the dashboard panel becomes the readable service:http_request_duration_seconds:p95 > 0.5 and on(service) service:http_requests:rate5m > 10, and the intent lives in version control with comments. The traffic gate that caused the NaN is now obvious instead of buried.

Where the AI tried to mislead me

Twice in that session the model “helpfully” suggested simplifications that changed behavior. It proposed dropping the on(service) vector matching because it “looked redundant.” It wasn’t: removing it would have changed which series matched and silently broadened the alert. It also offered to swap histogram_quantile for a naive avg, which is a completely different statistic. Both suggestions were plausible-sounding and both were wrong. That’s the recurring lesson: AI output has to be explainable and reviewable before it ships. If the model can’t justify why a change preserves behavior, and I can’t verify it against real series, the change doesn’t merge.
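One way to see that the quantile and an average are different statistics is to compute both and compare them per service. The sketch below assumes the histogram also exposes the conventional http_request_duration_seconds_sum and http_request_duration_seconds_count series, which the article does not show; run each expression separately.

```promql
# p95 latency per service (the statistic the original query uses)
histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket{env="prod"}[5m])))

# Mean latency per service, from the histogram's _sum and _count series
# (assumed to exist alongside the _bucket series)
  sum by (service) (rate(http_request_duration_seconds_sum{env="prod"}[5m]))
/
  sum by (service) (rate(http_request_duration_seconds_count{env="prod"}[5m]))
```

For a service with a long latency tail, the mean can sit well under 0.5 while the p95 is above it, so the swap would change which services pass the threshold.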

When the query is wrong, not just confusing

Sometimes the untangling reveals that the original query was simply incorrect, and the dashboard has been subtly lying for months. The traffic-gate NaN in my case was a side effect of a query that was technically functioning as written but no longer matched reality after a service got renamed. AI is useful here in a different mode: instead of explaining the query, I ask it to compare the query’s intent — which I’ve now recovered — against what it actually computes, and flag any divergence.

Here’s the original query and here’s what I now believe it was meant to do. Does the query actually achieve that, or has something drifted? List concrete divergences.

In one case this surfaced an offset 1h modifier that someone had added to work around a now-fixed scrape delay, which meant the panel was showing data an hour stale. Nobody had noticed because the panel title didn’t say “1h ago.” That’s the kind of finding that justifies the whole exercise: the query wasn’t just hard to read, it was wrong, and reading it carefully with an AI co-investigator is what exposed it. I still confirm every claimed divergence against the data, because the model occasionally invents a discrepancy that isn’t real — but it points me at the right places to look far faster than I would alone.
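A claimed divergence like a stale offset can be confirmed by evaluating the expression with and without the modifier. The article does not show the query that carried the offset, so the sketch below reuses the request-rate expression purely as a hypothetical stand-in; run each expression separately.

```promql
# As inherited (hypothetical): every sample is read from one hour ago
sum by (service) (rate(http_requests_total{env="prod"}[5m] offset 1h))

# Without the modifier: current data
sum by (service) (rate(http_requests_total{env="prod"}[5m]))

# Difference between the two; close to zero only when traffic is flat
# hour over hour
  sum by (service) (rate(http_requests_total{env="prod"}[5m]))
-
  sum by (service) (rate(http_requests_total{env="prod"}[5m] offset 1h))
```

If the third expression is consistently nonzero, the panel has been showing something materially different from the present state.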

Document the recovered intent so it’s not lost again

The final, easily skipped step is writing the intent down where the next engineer will find it. A recovered understanding that lives only in my head is one resignation away from being lost again. I have the model draft a short comment block and a one-paragraph explanation for the dashboard’s description field, then I edit it for accuracy. The recording rules already carry comments; the panel now carries a plain-English note saying what it measures and why the traffic gate exists. The cost of this is two minutes; the value is that the forty-minute puzzle never has to be solved a third time. AI drafts the documentation fast, but I own the final wording because it’s an operational claim future on-call engineers will rely on.
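A comment block of that kind might look like the sketch below, written as PromQL comments above the panel expression. The wording is illustrative, not the note from the real dashboard, and it states only what the query itself shows.

```promql
# Panel: p95 request latency for prod services, in seconds.
# Shows only services whose p95 is above 0.5 s.
# Traffic gate: services at or below 10 req/s are dropped on purpose,
# so a service missing from this panel may simply be low-traffic.
service:http_request_duration_seconds:p95 > 0.5
and on(service)
service:http_requests:rate5m > 10
```

The same text, minus the comment markers, can go into the dashboard’s description field so the note is visible without opening the query editor.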

A reusable workflow

The whole thing compresses into four moves: decompose in English, verify each layer against live data, refactor into named recording rules, and reject any change the model can’t justify. I keep the decomposition prompt in a saved prompt workspace so I’m not retyping it every time a mystery query lands on me. When the untangling reveals a missing or weak alert behind the query, the free Alert Rule Generator is a quick way to draft a replacement with proper for: durations and runbook annotations.

The tooling barely matters — I’ve done this in Claude, ChatGPT, and inline in Cursor. What matters is the discipline of treating the model as a tireless junior who explains its reasoning so you can check it.

Conclusion

Inherited PromQL isn’t hard because the language is bad; it’s hard because dense expressions hide intent. AI is genuinely excellent at recovering that intent quickly, but only if you keep a human in the loop to verify against real data and reject changes that can’t be justified. Decompose, verify, refactor into named rules, and you’ll hand the next engineer a query they can actually read. More PromQL workflows live in the Prometheus & Monitoring guides, and reusable starting prompts are in the prompt library.