Multi-Window Burn-Rate Alerts for SLOs That Work

For years my alerting rule was crude: “page if the error rate is over 1% for five minutes.” It paged me at 3am for a 90-second blip that self-healed, and it stayed silent through a slow leak that quietly burned a week of error budget. Burn-rate alerting fixed both problems, and the multi-window version is what the Google SRE workbook recommends for good reason. Let me walk through why it works and the exact PromQL to implement it.

The problem with a single threshold

A fixed error-rate threshold forces an impossible trade-off. Set it sensitive (1% for 5m) and you page on transient noise. Set it lenient (5% for 30m) and you sleep through a genuine outage’s early minutes. There’s no single number that’s both fast for big problems and quiet for small ones.

Burn rate reframes the question entirely. Instead of “what’s the error rate,” you ask “how fast am I spending my error budget relative to the rate that would exhaust it over the SLO window?”

What burn rate means

Say your SLO is 99.9% availability over 30 days. That gives you a 0.1% error budget. A burn rate of 1 means you’re spending budget at exactly the pace that consumes all of it in 30 days. A burn rate of 14.4 means you’d exhaust the entire 30-day budget in roughly 2 days at the current pace.
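To make that arithmetic concrete, here is a minimal Python sketch. The function names are my own for illustration, and it assumes a 30-day window and an error ratio that holds steady.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the burn rate holds steady."""
    return window_days / rate


# A 1.44% error ratio against a 99.9% SLO.
rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
print(f"burn rate: {rate:.1f}")                                # burn rate: 14.4
print(f"days to exhaustion: {days_to_exhaustion(rate):.2f}")   # days to exhaustion: 2.08
```

A burn rate of 1 fed into days_to_exhaustion returns the full 30 days, which is the definition above restated.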

The insight: high burn rates are emergencies (page now), while low-but-sustained burn rates are slow leaks (open a ticket and fix it during working hours). Different burn rates deserve different alerts.

The recording rule foundation

First, express your error ratio as a recording rule over several windows so the alert PromQL stays readable:

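groups:
  - name: slo:http
    rules:
      # Each rule records the fraction of requests answered with a 5xx status over one window.
      # The sum() has no by() clause, so each ratio covers every series of http_requests_total.
      - record: job:slo_errors:ratio_rate5m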
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
      - record: job:slo_errors:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
      - record: job:slo_errors:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
      - record: job:slo_errors:ratio_rate30m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))

Why two windows per alert

A single short window fires fast but also flaps — a brief spike trips it and it un-trips seconds later. A single long window is stable but slow to fire and slow to resolve. The trick is to require both a long window and a short window to be over threshold. The long window confirms the problem is real and sustained; the short window confirms it’s still happening right now so the alert resolves promptly once you fix it.
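The effect is easier to see on a toy timeline. The sketch below is an approximation I made up for illustration: it uses one-minute buckets instead of real rate() evaluation, and the traffic numbers are hypothetical. It compares the combined condition with the 1h window on its own.

```python
def window_ratio(errors, requests, end, width):
    """Error ratio over the `width` minutes ending at minute `end` (inclusive)."""
    start = max(0, end - width + 1)
    total = sum(requests[start:end + 1])
    return sum(errors[start:end + 1]) / total if total else 0.0


def firing(errors, requests, minute, threshold, long_w=60, short_w=5):
    """True only when both the long and the short window exceed the threshold."""
    return (window_ratio(errors, requests, minute, long_w) > threshold
            and window_ratio(errors, requests, minute, short_w) > threshold)


# Hypothetical traffic: 1000 requests per minute for 3 hours,
# with 10% of requests failing during minutes 60 through 79.
requests = [1000] * 180
errors = [100 if 60 <= m < 80 else 0 for m in range(180)]
threshold = 14.4 * 0.001

fired = [m for m in range(180) if firing(errors, requests, m, threshold)]
long_only = [m for m in range(180)
             if window_ratio(errors, requests, m, 60) > threshold]

print("combined condition true:", fired[0], "to", fired[-1])         # 68 to 83
print("1h window alone true:  ", long_only[0], "to", long_only[-1])  # 68 to 130
```

Both conditions trip at minute 68, but the combined one clears a few minutes after the incident ends at minute 79, while the 1h window alone stays over threshold for most of another hour.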

The multi-burn-rate alert rules

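For a 99.9% SLO (0.1% budget), the rules below pair a fast-burn page with a slow-burn ticket. The burn rates and windows come from the SRE workbook's table; the workbook pages on both tiers, while here the 6x tier opens a ticket instead: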

groups:
  - name: slo:http:alerts
    rules:
      # Fast burn: 14.4x over 1h AND 5m. Pages. Eats 2% of budget in 1h.
      - alert: SLOErrorBudgetFastBurn
        expr: |
          job:slo_errors:ratio_rate1h > (14.4 * 0.001)
          and
          job:slo_errors:ratio_rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Burning error budget 14.4x - budget gone in ~2 days"

      # Slow burn: 6x over 6h AND 30m. Ticket, not a page. Eats 5% of budget in 6h.
      - alert: SLOErrorBudgetSlowBurn
        expr: |
          job:slo_errors:ratio_rate6h > (6 * 0.001)
          and
          job:slo_errors:ratio_rate30m > (6 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Sustained 6x burn - slow leak draining budget"

The 0.001 is your error budget (1 - 0.999). For a 99.99% SLO, change it to 0.0001 in all four comparisons and the thresholds scale with it.
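As a quick check on that scaling, here is a small Python sketch that derives the error-ratio thresholds from the SLO target. The BURN_RATES table and the thresholds helper are illustrative names, not part of Prometheus.

```python
# Burn-rate multipliers for the two tiers used in the alert rules.
BURN_RATES = {"fast": 14.4, "slow": 6.0}


def thresholds(slo_target: float) -> dict:
    """Error-ratio threshold per tier: burn rate times the error budget."""
    budget = 1.0 - slo_target
    return {tier: rate * budget for tier, rate in BURN_RATES.items()}


for target in (0.999, 0.9999):
    t = thresholds(target)
    print(f"SLO {target:.2%}: fast > {t['fast']:.5f}, slow > {t['slow']:.5f}")
# SLO 99.90%: fast > 0.01440, slow > 0.00600
# SLO 99.99%: fast > 0.00144, slow > 0.00060
```

A tenfold tighter SLO gives tenfold smaller thresholds, so the burn-rate multipliers themselves never need to change.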

Reading the thresholds

The two-tier design covers the spectrum:

14.4x / 1h+5m catches the genuine outage. At that pace you’d burn 2% of a 30-day budget in one hour — that absolutely warrants waking someone. The 5m co-condition means it clears within minutes of recovery.
6x / 6h+30m catches the slow leak. It won’t page you at 3am, but over a workday it’ll open a ticket so the gradual degradation gets attention before it quietly eats your whole budget.
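The budget figures behind each tier follow from one line of arithmetic. This sketch assumes the 30-day SLO window used throughout; the helper names are mine.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window, as in the examples here


def budget_consumed(burn_rate: float, hours: float) -> float:
    """Fraction of the total budget spent after `hours` at a steady burn rate."""
    return burn_rate * hours / SLO_WINDOW_HOURS


def burn_rate_for(budget_fraction: float, hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `hours`."""
    return budget_fraction * SLO_WINDOW_HOURS / hours


print(f"fast tier: {budget_consumed(14.4, 1):.0%} of budget in 1h")  # 2%
print(f"slow tier: {budget_consumed(6, 6):.0%} of budget in 6h")     # 5%
print(f"2% in 1h needs a burn rate of {burn_rate_for(0.02, 1):.1f}")  # 14.4
```

Running burn_rate_for with your own window and budget fraction is the way to re-derive thresholds if your SLO window is not 30 days.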

Many shops add a third, even slower tier (e.g. 1x over 3d+6h, the ticket tier in the SRE workbook's table) as a low-priority signal. Two tiers is the practical minimum.

Why this beats the old way

Compared to my old “1% for 5 minutes,” multi-window burn-rate alerting:

Stops paging on blips: the long window must agree, so a 90-second spike reaches you only if it is severe enough to push the whole hour’s error ratio over the threshold.
Catches slow leaks: the 6x tier sees sustained degradation that a fixed threshold would ignore.
Resolves cleanly: the short co-window means alerts clear when the problem clears, not 30 minutes later.
Speaks in budget, not arbitrary percentages: “we’re burning budget 14x” is a far better incident framing than “errors are at 1.4%.”
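Two of these claims can be checked with back-of-the-envelope numbers. The sketch below assumes uniform traffic and a 99.9% SLO; the 0.7% leak is a made-up example and min_spike_ratio is an illustrative helper.

```python
BUDGET = 0.001          # error budget for a 99.9% SLO
FAST, SLOW = 14.4, 6.0  # burn-rate tiers from the alert rules


def min_spike_ratio(spike_seconds: float, window_seconds: float, burn: float) -> float:
    """Error ratio a spike needs (with no errors elsewhere in the window)
    to lift the window's average ratio to burn * BUDGET."""
    return burn * BUDGET * window_seconds / spike_seconds


# How bad must a 90-second blip be before the 1h window agrees?
blip = min_spike_ratio(90, 3600, FAST)
print(f"90s blip trips the 1h window above {blip:.1%} errors")  # 57.6%

# A steady 0.7% error ratio: invisible to "1% for 5 minutes".
leak = 0.007
print("old 1% rule fires:", leak > 0.01)               # False
print("slow-burn tier fires:", leak / BUDGET > SLOW)   # True
print(f"budget gone in {30 / (leak / BUDGET):.1f} days")  # 4.3 days
```

So a short blip only pages when it is close to a real outage, and a leak well under the old threshold still surfaces as a ticket.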

Operating it

Keep the burn-rate math in recording rules so the alert expressions stay legible; debugging a flapping alert is hard enough without nested rate() calls. Route the page and ticket severities to different Alertmanager receivers so the slow tier never wakes anyone. And revisit your thresholds after a month: if the slow-burn alert never fires, or fires on a handful of failed requests, your traffic may be too low for 6h windows to be meaningful, and you’ll want longer ones.
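One rough way to judge whether traffic is too low is to count how many failed requests it takes to reach a threshold. This is a sketch under the assumption of a constant request rate and a 99.9% SLO; errors_to_trip is a hypothetical helper.

```python
def errors_to_trip(requests_per_second: float, window_hours: float,
                   burn: float, budget: float = 0.001) -> float:
    """Failed requests inside the window that put its error ratio at the threshold."""
    total_requests = requests_per_second * window_hours * 3600
    return total_requests * burn * budget


for rps in (0.01, 1, 100):
    n = errors_to_trip(rps, window_hours=6, burn=6)
    print(f"{rps} req/s: about {n:.1f} failed requests in 6h reach the 6x threshold")
# 0.01 req/s: about 1.3 failed requests in 6h reach the 6x threshold
# 1 req/s: about 129.6 failed requests in 6h reach the 6x threshold
# 100 req/s: about 12960.0 failed requests in 6h reach the 6x threshold
```

When the count is in the single digits, a couple of unlucky requests decide the alert, which is a sign the window is too short for that traffic level.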

For the SLO foundations and recording-rule patterns this builds on, see the rest of our Prometheus and monitoring guides. When you want a second set of eyes on whether these rules will flap, our monitoring alert assistant evaluates burn-rate rules for exactly that.

Thresholds assume the standard Google SRE multi-burn-rate table. Re-derive them if your SLO window or budget differs from the examples here.