AI-Assisted Recording Rules: Turning Slow PromQL Into Fast Dashboards

There’s a moment every growing Prometheus setup hits where the dashboards start to lag. A panel that used to render instantly now spins for eight seconds, and the Prometheus process is pinned because forty users are all re-running the same expensive histogram_quantile over a million series every fifteen seconds. The fix is recording rules — precompute the expensive bit once, query the cheap result everywhere — but figuring out which queries to precompute and refactoring them correctly is fiddly. I lean on AI for both, with the standing caveat that a recording rule that computes the wrong thing fast is worse than a slow query that’s right.

Why recording rules matter and where they go wrong

A recording rule evaluates an expression on a schedule and stores the result as a new time series. Dashboards and alerts then read that cheap series instead of recomputing the heavy one. The performance win is enormous. The trap is that a recording rule freezes an aggregation level: once you sum by (service) in the rule, you can never break it back down by instance. Pick the wrong aggregation granularity and you’ve made a fast series that can’t answer the questions you’ll ask next week. AI is great at the mechanical refactor and genuinely useful at spotting expensive patterns, but the granularity decision is a judgment call I keep for myself.
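To make the frozen aggregation level concrete, here is a toy Python sketch. The label values and rates are made up, and sum_by is a hypothetical helper that only imitates what PromQL's sum by (...) does to label sets.

```python
from collections import defaultdict

# Made-up per-instance request rates, keyed by (service, instance).
raw_rates = {
    ("checkout", "10.0.0.1"): 40.0,
    ("checkout", "10.0.0.2"): 60.0,
    ("search", "10.0.0.3"): 25.0,
}

def sum_by(series, keep):
    """Imitate PromQL's `sum by (...)`: keep only the label positions in `keep`."""
    out = defaultdict(float)
    for labels, value in series.items():
        out[tuple(labels[i] for i in keep)] += value
    return dict(out)

# What a rule doing `sum by (service)` would store: the instance label is gone.
recorded = sum_by(raw_rates, keep=[0])

# Rolling the recorded series up further still works...
total = sum_by(recorded, keep=[])

# ...but nothing in `recorded` can tell 10.0.0.1 apart from 10.0.0.2 any more.
print(recorded)
print(total)
```

Aggregating further from the recorded series is always possible; going back down is not, which is why the level has to be chosen against next week's questions and not just today's panel.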

Step one: find the expensive queries

Before recording anything, I want to know what’s actually slow. I pull the heavy hitters from Prometheus’s own metrics and have the model help interpret them:

# Slowest rule groups by evaluation time
topk(10, prometheus_rule_group_last_duration_seconds)

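# Query engine load: seconds per second spent in each engine phase (the slice label).
# This does not identify individual queries; the query log is the place for per-query cost.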
topk(10, rate(prometheus_engine_query_duration_seconds_sum[5m]))

I paste the results and the suspect dashboard queries into the model and ask it to rank candidates for recording by cost and reuse. A query that’s both expensive and used on ten panels is the obvious win; an expensive query used once might not be worth a rule. The model reasons about this well, but I confirm reuse by actually grepping our dashboards rather than taking its word.
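The reuse check can be scripted instead of grepped by hand. This is a minimal sketch, assuming the dashboards are exported as Grafana-style JSON where each Prometheus query sits under an "expr" key. The function names and the inline sample are hypothetical stand-ins for real dashboard files.

```python
import json
import re
from collections import Counter

def iter_exprs(node):
    """Yield every string stored under an "expr" key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                yield value
            else:
                yield from iter_exprs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_exprs(item)

def normalize(expr):
    """Collapse whitespace so formatting differences don't split the counts."""
    return re.sub(r"\s+", " ", expr).strip()

def count_reuse(dashboards):
    """Count how many panel targets use each normalized expression."""
    counts = Counter()
    for dashboard in dashboards:
        counts.update(normalize(e) for e in iter_exprs(dashboard))
    return counts

# Tiny inline example standing in for exported dashboard JSON (assumed layout).
sample = json.loads("""
{"panels": [
  {"targets": [{"expr": "sum by (service) (rate(http_requests_total[5m]))"}]},
  {"panels": [
    {"targets": [{"expr": "sum by (service)  (rate(http_requests_total[5m]))"}]},
    {"targets": [{"expr": "up"}]}
  ]}
]}
""")

reuse = count_reuse([sample])
for expr, n in reuse.most_common():
    print(n, expr)
```

Whitespace normalization only catches trivially different copies of the same query. Two panels that compute the same thing with reordered labels or different variable names still need a human eye.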

Step two: let AI propose the rule, then fix the granularity

Here’s a heavy query and the model’s proposed recording rule:

# Heavy query, run on every dashboard load:
# histogram_quantile(0.95, sum by (le, service)
#   (rate(http_request_duration_seconds_bucket[5m])))

groups:
  - name: latency-recording.rules
    interval: 30s
    rules:
      - record: "service:request_latency:p95_5m"
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

The model gets the structure right, but notice it baked in 0.95. If I’ll also want p50 and p99, I shouldn’t record three separate quantiles — I should record the bucket sums and compute quantiles at query time:

- record: "service:request_duration_bucket:rate5m"
  expr: |
    sum by (le, service) (
      rate(http_request_duration_seconds_bucket[5m])
    )

Now histogram_quantile(0.95, service:request_duration_bucket:rate5m) and the p50 and p99 all read one cheap recorded series. That’s a granularity decision the model didn’t make on its own — it optimized for the one query I showed it. Showing it the full set of queries that use the metric is what lets it propose the right abstraction.
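Why bucket rates are the more general building block is easier to see with the quantile math written out. The sketch below is a simplified Python take on what histogram_quantile does for classic histograms, not Prometheus's actual implementation; it skips edge cases, and the bucket values are made up.

```python
import math

def histogram_quantile(q, buckets):
    """Simplified version of PromQL's histogram_quantile for classic histograms.

    `buckets` holds (upper_bound, cumulative_rate) pairs and must include the
    +Inf bucket. Assumes 0 < q <= 1, a non-empty histogram, and positive bounds.
    """
    buckets = sorted(buckets)
    rank = q * buckets[-1][1]  # the +Inf bucket carries the total rate
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile lies past the last finite bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Made-up values standing in for one service's recorded bucket rates.
recorded_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 99.0), (math.inf, 100.0)]

p50, p95, p99 = (histogram_quantile(q, recorded_buckets) for q in (0.5, 0.95, 0.99))
print(p50, p95, p99)
```

All three quantiles come from the same four numbers, so one recorded bucket series serves every percentile a dashboard might ask for. A recorded p95, by contrast, cannot be turned back into a p50.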

Pro Tip: When asking AI to design recording rules, give it ALL the queries that touch the metric, not just the slow one. The right recording rule is usually the most general reusable building block — recorded bucket rates instead of a recorded single quantile — and the model can only see that if it sees the full usage.

Step three: verify the recorded series equals the original

This is the non-negotiable check. After deploying the rule, I confirm the recorded series produces the same numbers as the original query for a window where both exist:

# These should match closely once the rule has warmed up (recorded samples are taken at rule evaluation times):
histogram_quantile(0.95, service:request_duration_bucket:rate5m)
# vs the original inline expression
histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

If they diverge by more than the small offset you would expect from different evaluation timestamps, the rule’s aggregation has probably dropped or added a label, and I’ve caught a silent data bug before it reached a dashboard. AI’s refactors are usually correct, but “usually” is exactly why this check exists. The model is a fast junior engineer; the equality test is the code review.
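Eyeballing two graphs works, but the comparison is easy to script once both query results are in hand. This is a minimal sketch, assuming each result has already been fetched and flattened into a dict keyed by sorted label pairs. The compare_series helper, the 1% tolerance, and the sample values are all illustrative assumptions.

```python
import math

def compare_series(original, recorded, rel_tol=0.01):
    """Diff two instant-query results keyed by a tuple of (label, value) pairs."""
    missing = sorted(original.keys() - recorded.keys())
    extra = sorted(recorded.keys() - original.keys())
    # NaN never compares close, so empty histograms show up as mismatches.
    mismatched = sorted(
        labels
        for labels in original.keys() & recorded.keys()
        if not math.isclose(original[labels], recorded[labels], rel_tol=rel_tol)
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}

# Made-up results: the recorded side leaked an instance label for "search".
original = {
    (("service", "checkout"),): 0.780,
    (("service", "search"),): 0.210,
}
recorded = {
    (("service", "checkout"),): 0.781,
    (("instance", "a"), ("service", "search")): 0.210,
}

report = compare_series(original, recorded)
print(report)
```

A label set that is missing on one side and extra on the other is the signature of an aggregation that kept or dropped the wrong label. A value mismatch on matching labels points at the expression itself.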

Step four: mind the naming and interval

Recording rule names follow a convention — level:metric:operations — and AI is inconsistent about it. I enforce service:request_latency:p95_5m style names because they make the aggregation level legible at a glance, which the metric naming standards guide gets into. I also check the interval: too frequent wastes resources, too slow makes alerts laggy. The model defaults to copying the global interval, which isn’t always right for a specifically heavy group.
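Both checks are mechanical enough to lint. The sketch below assumes the rule file has already been parsed into Python dicts (the literal here stands in for parsed YAML). RULE_NAME and lint_groups are hypothetical names, and the regex is a deliberately strict reading of level:metric:operations.

```python
import re

# Three colon-separated parts: level, metric, operations.
RULE_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*:[a-zA-Z_][a-zA-Z0-9_]*:[a-zA-Z0-9_]+$")

def lint_groups(groups):
    """Return human-readable problems for rule names and missing intervals."""
    problems = []
    for group in groups:
        if "interval" not in group:
            problems.append(
                f"{group['name']}: no explicit interval, falls back to the global evaluation_interval"
            )
        for rule in group.get("rules", []):
            name = rule.get("record")  # alerting rules have no "record" key
            if name is not None and not RULE_NAME.match(name):
                problems.append(f"{group['name']}: {name!r} is not level:metric:operations")
    return problems

# Stand-in for a parsed rule file; the second group and its rule are made up.
groups = [
    {
        "name": "latency-recording.rules",
        "interval": "30s",
        "rules": [
            {"record": "service:request_duration_bucket:rate5m"},
            {"record": "request_latency_p95"},
        ],
    },
    {
        "name": "traffic.rules",
        "rules": [{"record": "service:http_requests:rate5m"}],
    },
]

problems = lint_groups(groups)
for problem in problems:
    print(problem)
```

A missing interval is not an error, since the group simply inherits the global setting. The lint only flags it so that inheriting is a decision and not an accident.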

Watch for the staleness and lookback mismatch

One subtle correctness issue AI almost never raises on its own is the interaction between a recording rule’s evaluation interval and the range windows that read it. If your rule evaluates every 30 seconds and a dashboard reads it through a 1-minute window, that window holds only two samples, so a missed evaluation or an unlucky alignment can leave gaps or jumpy values. Worse is getting the order of operations wrong. Recording sum by (service) (http_requests_total) and applying rate() to the result downstream is a classic mistake, because the sum hides individual counter resets from rate(). Applying rate() to a series that already stores a rate is just as wrong. You rate() raw counters first and aggregate afterwards, and reversing that order produces nonsense.

I make the order of operations explicit when reviewing:

This recording rule stores sum by (service) (rate(http_requests_total[5m])). Downstream queries should treat the result as a rate, not re-apply rate(). Confirm there’s no double-rate anywhere in the consumers I’m about to rewrite.

The model is good at catching a double-rate() once you point it at the consumers, but it won’t volunteer the audit, so I drive it. I also confirm the recording rule’s interval is no coarser than the shortest lookback any consumer uses, because a 60s interval feeding a 30s-window query leaves holes. These are exactly the quiet correctness bugs that make a “fast” dashboard subtly wrong, and they’re why the equality check in step three is non-negotiable.
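A rough first pass over the consumers can be automated before the model or a reviewer looks at them. This sketch is a regex heuristic, not a PromQL parser, so it will miss subqueries and unusual formatting. The recorded name service:http_requests:rate5m and both helper functions are hypothetical.

```python
import re

UNITS = {"s": 1, "m": 60, "h": 3600}

def seconds(duration):
    """Parse simple single-unit durations such as "30s" or "5m"."""
    return int(duration[:-1]) * UNITS[duration[-1]]

def find_double_rate(expr, recorded_names):
    """Return recorded series that sit directly inside rate(), irate(), or increase()."""
    hits = []
    for name in recorded_names:
        pattern = r"\b(?:rate|irate|increase)\s*\(\s*" + re.escape(name) + r"\s*[\[{]"
        if re.search(pattern, expr):
            hits.append(name)
    return hits

def find_short_windows(expr, recorded_name, rule_interval):
    """Return range windows on the recorded series that are shorter than the rule interval."""
    pattern = re.escape(recorded_name) + r"(?:\{[^}]*\})?\[(\d+[smh])\]"
    return [w for w in re.findall(pattern, expr) if seconds(w) < seconds(rule_interval)]

recorded = "service:http_requests:rate5m"  # hypothetical recorded rate

bad = find_double_rate("sum(rate(service:http_requests:rate5m[1m]))", [recorded])
good = find_double_rate("sum(service:http_requests:rate5m)", [recorded])
holes = find_short_windows("avg_over_time(service:http_requests:rate5m[30s])", recorded, "60s")
fine = find_short_windows("avg_over_time(service:http_requests:rate5m[5m])", recorded, "60s")
print(bad, good, holes, fine)
```

Anything the script flags still goes through the equality check; a clean run only means the obvious mistakes are absent.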

Step five: re-point dashboards and alerts

The win only materializes when consumers actually use the recorded series. I have the model help rewrite the dashboard panels and alert expressions to reference the new recording rule, then I review each one — because a panel still running the heavy inline query is a panel that got none of the benefit. Keeping alerts and dashboards pointed at the same recorded series also prevents drift, which matters when the free Alert Rule Generator drafts an alert off the same recording rule.

I do this work in Claude for the YAML and reasoning, and inline with Cursor when the rules live in a repo I’m diffing. For consistency I keep the “design recording rules from these queries” prompt saved in a prompt workspace.

Conclusion

Recording rules are the highest-leverage performance fix in Prometheus, and AI makes both finding the candidates and writing the rules fast. The two things it can’t be trusted with are the granularity decision — which only makes sense given every query that touches the metric — and the correctness check, which means proving the recorded series equals the original. Keep those human, let the model do the mechanical refactor, and your dashboards get fast without quietly going wrong. More in recording rules that make queries fast and the monitoring guides.
