Prometheus Recording Rules That Make Slow Queries Fast

If your Grafana dashboards take ten seconds to load and your alerts evaluate sluggishly, the fix is usually recording rules. They precompute expensive PromQL on a schedule and store the result as a new, cheap metric. The dashboard reads the precomputed answer instead of re-deriving it every refresh. It’s one of the highest-leverage, least-used features in Prometheus. After years of speeding up slow observability stacks, here’s how I use them.

What a recording rule does

A recording rule evaluates a PromQL expression at a fixed interval and saves the result under a new metric name. That’s it. Instead of every dashboard and every alert recomputing a heavy aggregation, the rule computes it once per interval and everyone reads the cheap result.

groups:
  - name: request_rates
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

Now any dashboard or alert that wants the per-job request rate queries job:http_requests:rate5m — an instant vector lookup — instead of summing rates across thousands of series on every page load.
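As a rough illustration of a consumer, here is a minimal alerting rule that reads the recorded metric instead of the raw counter. The alert name, the threshold of 1 request per second, and the severity label are hypothetical; only job:http_requests:rate5m comes from the rule above.

```yaml
groups:
  - name: request_rate_alerts            # hypothetical group name
    rules:
      - alert: JobRequestRateLow         # hypothetical alert name
        # Reads the precomputed per-job rate; no rate() or sum() at alert time.
        expr: job:http_requests:rate5m < 1   # threshold is an assumption
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Request rate for {{ $labels.job }} is below 1 req/s"
```

The alert expression is a plain comparison on one series per job, so its evaluation cost stays flat no matter how many targets sit behind each job.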

When to reach for one

I record an expression when it’s both expensive and reused. The signals:

A query touches a huge number of series (a sum(rate(...)) over thousands of targets).
The same expression appears in multiple dashboards or alerts.
A dashboard panel is visibly slow, or Prometheus query logs show it taking hundreds of milliseconds.
An expression spans a long range ([30d] for SLOs) and gets recomputed constantly.

If a query is cheap or used exactly once, don’t record it — you’d just add evaluation cost for no benefit. Recording rules are for the hot, shared, expensive queries.
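One way to find candidates is the Prometheus query log, which is switched on in the global section of the main config. This is a minimal sketch: the log path and the interval values are assumptions, so adjust them to your setup.

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  # Write every query the engine runs, with timing statistics, to this file.
  query_log_file: /var/log/prometheus/query.log   # hypothetical path
```

Expressions that show up often in that file with long timings are the hot, shared, expensive ones worth recording.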

The naming convention that keeps you sane

Prometheus has a strong recommended naming pattern for recording rules, and following it pays off forever:

level:metric:operations

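level: the labels the output is aggregated to (what you grouped by), joined with underscores. job, instance_path, etc.
metric: the underlying metric name, with _total stripped from counters once you apply rate().
operations: what you did, newest operation first. rate5m, sum, ratio.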

- record: job:http_requests:rate5m
  expr: sum by (job) (rate(http_requests_total[5m]))

- record: job:http_errors:ratio_rate5m
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
    / sum by (job) (rate(http_requests_total[5m]))

Read job:http_errors:ratio_rate5m and you know instantly: it’s grouped by job, it’s about HTTP errors, and it’s a ratio of 5-minute rates. Colons are valid in any metric name, but by convention they are reserved for recording rules, and exporters and direct instrumentation should not use them. That makes them a handy visual signal: a metric with colons is precomputed.
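To show how the level part changes as you aggregate further, here is a small sketch with two levels. It assumes http_requests_total carries a path label, which the article does not state, and the group name is hypothetical.

```yaml
groups:
  - name: request_rates_by_path          # hypothetical group name
    rules:
      # Level "instance_path": one series per instance and path.
      - record: instance_path:http_requests:rate5m
        expr: sum by (instance, path) (rate(http_requests_total[5m]))

      # Level "path": built from the recorded metric above, with instance
      # aggregated away. Summing already-computed rates is safe.
      - record: path:http_requests:rate5m
        expr: sum without (instance) (instance_path:http_requests:rate5m)
```

The name tells you which labels to expect on the result, so a mismatch between the name and the by or without clause is easy to catch in review.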

Don’t record-then-rate

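A classic mistake: recording a raw counter and rating it later. Recording rules should record the final form you want, including the rate. The rule of thumb is rate before sum: apply rate() to each raw series first, then aggregate, because rate() can only detect counter resets on individual counters. Bake that order into the recording rule so consumers can’t get it wrong.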

# GOOD: rate and aggregate inside the rule
- record: job:http_requests:rate5m
  expr: sum by (job) (rate(http_requests_total[5m]))

# BAD: recording a sum of raw counters; rating this later breaks on resets
- record: job:http_requests:sum
  expr: sum by (job) (http_requests_total)

The recorded metric should be directly usable. If consumers have to do more rate math on it, you’ve recorded the wrong thing.

Layer rules for very heavy queries

For SLO-grade math over long windows, layer recording rules. Record the 5-minute rate frequently, then have a second rule aggregate the recorded metric over longer windows. Each layer is cheap because it reads the precomputed result below it, not the raw firehose.

- record: job:http_errors:ratio_rate5m
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
    / sum by (job) (rate(http_requests_total[5m]))

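# Note: this is an unweighted average of the 5-minute ratios, so it
# approximates the 30-minute error ratio rather than computing it exactly.
- record: job:http_errors:ratio_rate30m
  expr: avg_over_time(job:http_errors:ratio_rate5m[30m])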

Your burn-rate alerts then read job:http_errors:ratio_rate5m and :ratio_rate30m — instant, cheap, and consistent everywhere.
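Averaging a recorded ratio weights a quiet minute the same as a busy one. If that matters for your SLO, one alternative is to record the error rate on its own and divide the averaged numerator by the averaged denominator. This is a sketch under assumptions: job:http_errors:rate5m, job:http_errors:ratio_avg30m, and the group name are hypothetical, and job:http_requests:rate5m is assumed to be the rule recorded earlier in this article.

```yaml
groups:
  - name: http_error_slo                 # hypothetical group name
    interval: 30s
    rules:
      # Layer 1: the numerator as its own 5-minute rate.
      - record: job:http_errors:rate5m   # hypothetical rule name
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      # Layer 2: traffic-weighted ratio over 30 minutes, built only from
      # recorded series. job:http_requests:rate5m is assumed to exist already.
      - record: job:http_errors:ratio_avg30m   # hypothetical rule name
        expr: |
          avg_over_time(job:http_errors:rate5m[30m])
          / avg_over_time(job:http_requests:rate5m[30m])
```

Rules within a group run sequentially in the order listed, so keeping a layer and the rule it depends on in one group, dependency first, keeps them evaluated together. The result is still an approximation of the true 30-minute ratio, since it averages overlapping 5-minute windows.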

Mind the evaluation interval

Rules run on the group’s interval (defaulting to the global evaluation_interval). Don’t set it shorter than your scrape interval — you’d just recompute the same answer. A 30s scrape pairs naturally with a 30s or longer rule interval. Heavy rule groups can be given a longer interval to spread out load.

And remember: recording rules add their own evaluation cost. A thousand needless rules will themselves slow Prometheus down. Record what’s hot and shared; don’t record everything.
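A minimal sketch of the two interval behaviours in one rule file. The second group, its 2m interval, and the rate1d rule are hypothetical and only there to show the override.

```yaml
groups:
  # No interval set: this group runs at the global evaluation_interval
  # configured in prometheus.yml.
  - name: request_rates
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

  # A heavier long-window rule, evaluated less often to spread out load.
  - name: request_rates_long_window      # hypothetical group name
    interval: 2m                         # assumed value; overrides the global default
    rules:
      - record: job:http_requests:rate1d # hypothetical rule
        expr: sum by (job) (rate(http_requests_total[1d]))
```

A longer interval also means the recorded metric has coarser resolution, so pick it based on how fresh the consumers need the value to be.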

Test and validate

Like alert rules, recording rules are checkable with promtool:

promtool check rules recording_rules.yaml

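It checks that the file is well-formed and that every expression parses. It does not enforce the level:metric:operations convention or tell you whether the result is what you meant, so I also eyeball the first few evaluations after deploy to confirm the recorded metric has the labels and values I expect. A by clause that drops the wrong label produces a recorded metric that’s subtly useless.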
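Label and value checks can also be automated with promtool’s rule unit tests, run as promtool test rules followed by the test file. The sketch below assumes recording_rules.yaml contains the job:http_requests:rate5m rule from earlier; the input series, the instance values, and the test file name are made up for illustration.

```yaml
# recording_rules_test.yaml (hypothetical file name)
rule_files:
  - recording_rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Two synthetic counters for one job: +60 and +120 per minute,
    # i.e. 1 req/s and 2 req/s.
    input_series:
      - series: 'http_requests_total{job="api", instance="a", status="200"}'
        values: '0+60x10'
      - series: 'http_requests_total{job="api", instance="b", status="200"}'
        values: '0+120x10'

    promql_expr_test:
      - expr: job:http_requests:rate5m
        eval_time: 10m
        exp_samples:
          # Only the job label should survive the aggregation;
          # the per-instance rates should add up to 3 req/s.
          - labels: 'job:http_requests:rate5m{job="api"}'
            value: 3
```

A test like this fails if a by clause keeps or drops the wrong label, which is exactly the mistake that is hard to spot by eye.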

Where AI helps

Recording rules are a mechanical transformation of a slow query into a fast precomputed one, and AI does that transformation well. I paste a slow dashboard query and ask for the recording rule with proper level:metric:operations naming. It also spots opportunities — “these three alerts all recompute the same aggregation, record it once” — that are easy to miss by hand.

The naming convention in particular is something AI applies consistently, which keeps a large rule file legible. We keep monitoring prompts for rule design, and the SLO and burn-rate rules our Alert Rule Generator emits read from recorded metrics by default.

The payoff

Recording rules are the quiet upgrade that makes a heavily-loaded Prometheus feel fast again. Find your expensive, reused queries; precompute them once per interval under a well-named metric; layer them for long-window SLO math; and let dashboards and alerts read the cheap result. Your dashboards load instantly, your alerts evaluate cleanly, and your Prometheus stops sweating.

Generated recording rules are assistive, not authoritative. Always verify the recorded metric’s labels and values, and validate with promtool before deploying.