Multi-Window Burn-Rate Alerts for SLOs That Work
Single-threshold error alerts either page too late or too often. Multi-window multi-burn-rate alerting catches fast disasters and slow leaks without crying wolf. Here's the PromQL.
- #prometheus
- #slo
- #alerting
- #burn-rate
- #error-budget
- #sre
For years my alerting rule was crude: “page if the error rate is over 1% for five minutes.” It paged me at 3am for a 90-second blip that self-healed, and it stayed silent through a slow leak that quietly burned a week of error budget. Burn-rate alerting fixed both problems, and the multi-window version is what the Google SRE workbook recommends for good reason. Let me walk through why it works and the exact PromQL to implement it.
The problem with a single threshold
A fixed error-rate threshold forces an impossible trade-off. Set it sensitive (1% for 5m) and you page on transient noise. Set it lenient (5% for 30m) and you sleep through a genuine outage’s early minutes. There’s no single number that’s both fast for big problems and quiet for small ones.
Burn rate reframes the question entirely. Instead of “what’s the error rate,” you ask “how fast am I spending my error budget relative to the rate that would exhaust it over the SLO window?”
What burn rate means
Say your SLO is 99.9% availability over 30 days. That gives you a 0.1% error budget. A burn rate of 1 means you’re spending budget at exactly the pace that consumes all of it in 30 days. A burn rate of 14.4 means you’d exhaust the entire 30-day budget in roughly 2 days at the current pace.
The insight: high burn rates are emergencies (page now), low-but-sustained burn rates are slow leaks (page, but with patience). Different burn rates deserve different alerts.
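The definitions above can be written out as a short worked calculation. This is only a sketch of the arithmetic for the 99.9% / 30-day example; the 1.44% observed error ratio is an illustrative value, not a measurement.

```latex
\[
\text{error budget} = 1 - \text{SLO} = 1 - 0.999 = 0.001
\]
\[
\text{burn rate} = \frac{\text{observed error ratio}}{\text{error budget}}
  \qquad \text{e.g. } \frac{0.0144}{0.001} = 14.4
\]
\[
\text{time to exhaustion} = \frac{\text{SLO window}}{\text{burn rate}}
  = \frac{30\ \text{days}}{14.4} \approx 2.08\ \text{days}
\]
```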
The recording rule foundation
First, express your error ratio as a recording rule over several windows so the alert PromQL stays readable:
```yaml
groups:
  - name: slo:http
    rules:
      - record: job:slo_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))
      - record: job:slo_errors:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          / sum(rate(http_requests_total[1h]))
      - record: job:slo_errors:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[6h]))
          / sum(rate(http_requests_total[6h]))
      - record: job:slo_errors:ratio_rate30m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[30m]))
          / sum(rate(http_requests_total[30m]))
```
Why two windows per alert
A single short window fires fast but also flaps — a brief spike trips it and it un-trips seconds later. A single long window is stable but slow to fire and slow to resolve. The trick is to require both a long window and a short window to be over threshold. The long window confirms the problem is real and sustained; the short window confirms it’s still happening right now so the alert resolves promptly once you fix it.
The multi-burn-rate alert rules
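For a 99.9% SLO (0.1% budget), the rules below use the SRE workbook's 14.4x and 6x burn rates. The workbook pages on both tiers; this setup pages on the fast burn and routes the slow burn to a ticket: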
```yaml
groups:
  - name: slo:http:alerts
    rules:
      # Fast burn: 14.4x over 1h AND 5m. Pages. Eats 2% of budget in 1h.
      - alert: SLOErrorBudgetFastBurn
        expr: |
          job:slo_errors:ratio_rate1h > (14.4 * 0.001)
          and
          job:slo_errors:ratio_rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Burning error budget 14.4x - budget gone in ~2 days"
      # Slow burn: 6x over 6h AND 30m. Ticket, not a page.
      - alert: SLOErrorBudgetSlowBurn
        expr: |
          job:slo_errors:ratio_rate6h > (6 * 0.001)
          and
          job:slo_errors:ratio_rate30m > (6 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Sustained 6x burn - slow leak draining budget"
```
The 0.001 is your error budget (1 - 0.999). For a 99.99% SLO, change it to 0.0001 in each expression; the burn-rate multipliers (14.4 and 6) stay the same, so every threshold scales with the budget.
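One way to sanity-check rules like these before deploying them is a promtool unit test. The sketch below assumes the recording and alerting groups from this article are saved together in a file called `slo_rules.yml`; that file name, the test file name, and the `job="api"` series are hypothetical, and the synthetic traffic is made up for illustration.

```yaml
# slo_rules_test.yml (hypothetical name)
# Run with: promtool test rules slo_rules_test.yml
rule_files:
  - slo_rules.yml   # assumed to hold the recording and alerting groups from this article

evaluation_interval: 1m

tests:
  # Sustained 10% errors: a burn rate of 100x, far above the 14.4x threshold.
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api", status="500"}'
        values: '0+60x120'      # counter grows by 60 errors per minute for 2h
      - series: 'http_requests_total{job="api", status="200"}'
        values: '0+540x120'     # counter grows by 540 successes per minute
    alert_rule_test:
      - eval_time: 90m
        alertname: SLOErrorBudgetFastBurn
        exp_alerts:
          - exp_labels:
              severity: page
            exp_annotations:
              summary: "Burning error budget 14.4x - budget gone in ~2 days"

  # Healthy service: 0.05% errors, a burn rate of 0.5x. Neither alert should fire.
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api", status="500"}'
        values: '0+1x120'
      - series: 'http_requests_total{job="api", status="200"}'
        values: '0+1999x120'
    alert_rule_test:
      - eval_time: 90m
        alertname: SLOErrorBudgetFastBurn
        exp_alerts: []
      - eval_time: 90m
        alertname: SLOErrorBudgetSlowBurn
        exp_alerts: []
```

Because the recording rules aggregate with a bare `sum()`, the recorded series carry no `job` label, which is why the expected alert lists only `severity`. If you aggregate by label in your own rules, the expected labels in the test need to change to match.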
Reading the thresholds
The two-tier design covers the spectrum:
- 14.4x / 1h+5m catches the genuine outage. At that pace you’d burn 2% of a 30-day budget in one hour — that absolutely warrants waking someone. The 5m co-condition means it clears within minutes of recovery.
- 6x / 6h+30m catches the slow leak. It won’t page you at 3am, but over a workday it’ll open a ticket so the gradual degradation gets attention before it quietly eats your whole budget.
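The budget figures behind each tier follow from one relation, sketched here for the 30-day (720-hour) window used in this article. The 5% figure for the slow tier is derived from the same arithmetic rather than stated in the rules above.

```latex
\[
\text{budget consumed} = \text{burn rate} \times \frac{\text{alert window}}{\text{SLO window}}
\]
\[
\text{fast tier: } 14.4 \times \frac{1\ \text{h}}{720\ \text{h}} = 0.02 = 2\%
\qquad
\text{slow tier: } 6 \times \frac{6\ \text{h}}{720\ \text{h}} = 0.05 = 5\%
\]
```

Run in reverse, the same relation is how you would re-derive a threshold for a different SLO window: pick the share of budget you are willing to lose before alerting, then solve for the burn rate.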
Many shops add a third, even slower tier (e.g. 1x over 24h+2h) as a low-priority signal; the SRE workbook's own third tier is 1x over 3d with a 6h short window. Two tiers is the practical minimum.
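A minimal sketch of what a 1x over 24h+2h tier could look like, following the same pattern as the other rules. The group name, alert name, `for` duration, and `severity: info` label are all assumptions chosen for illustration; adjust them to your own conventions.

```yaml
groups:
  - name: slo:http:third-tier        # hypothetical group name
    rules:
      # Extra recording rules for the longer windows this tier needs.
      - record: job:slo_errors:ratio_rate24h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[24h]))
          / sum(rate(http_requests_total[24h]))
      - record: job:slo_errors:ratio_rate2h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[2h]))
          / sum(rate(http_requests_total[2h]))
      # Very slow burn: 1x over 24h AND 2h. Low-priority signal only.
      - alert: SLOErrorBudgetVerySlowBurn   # hypothetical alert name
        expr: |
          job:slo_errors:ratio_rate24h > (1 * 0.001)
          and
          job:slo_errors:ratio_rate2h > (1 * 0.001)
        for: 1h                             # assumed; tune to taste
        labels:
          severity: info                    # assumed low-priority severity
        annotations:
          summary: "Sustained 1x burn - on pace to use the whole budget"
```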
Why this beats the old way
Compared to my old “1% for 5 minutes,” multi-window burn-rate alerting:
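- Stops paging on minor blips: the long window must agree, so a 90-second spike pages only if it is severe. At steady traffic it would need roughly 58% errors for those 90 seconds to push the 1h ratio past the 1.44% threshold.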
- Catches slow leaks — the 6x tier sees sustained degradation a fixed threshold would ignore.
- Resolves cleanly — the short co-window means alerts clear when the problem clears, not 30 minutes later.
- Speaks in budget, not arbitrary percentages — “we’re burning budget 14x” is a far better incident framing than “errors are at 1.4%.”
Operating it
Keep the burn-rate math in recording rules so the alert expressions stay legible — debugging a flapping alert is hard enough without nested rate() calls. Route the page and ticket severities to different Alertmanager receivers so the slow tier never wakes anyone. And revisit your thresholds after a month: if the slow-burn alert never fires, your traffic may be too low for 6h windows to be meaningful, and you’ll want longer ones.
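The severity routing described above might look something like this in Alertmanager. It is a sketch, not a drop-in config: the receiver names, the PagerDuty key placeholder, and the webhook URL are hypothetical, and the `matchers` syntax assumes a reasonably recent Alertmanager (0.22 or later).

```yaml
route:
  receiver: default                 # hypothetical catch-all receiver
  routes:
    # Fast-burn alerts wake the on-call engineer.
    - matchers:
        - severity="page"
      receiver: oncall-pager
    # Slow-burn alerts open a ticket and never page.
    - matchers:
        - severity="ticket"
      receiver: ticket-queue

receivers:
  - name: default
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"   # placeholder
  - name: ticket-queue
    webhook_configs:
      - url: "https://tickets.example.com/hooks/alertmanager"   # placeholder URL
```

Keeping the split in routing rather than in the alert rules means you can promote the slow tier to a page later by changing one label, without touching the PromQL.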
For the SLO foundations and recording-rule patterns this builds on, see the rest of our Prometheus and monitoring guides. When you want a second set of eyes on whether these rules will flap, our monitoring alert assistant evaluates burn-rate rules for exactly that.
Thresholds assume the standard Google SRE multi-burn-rate table. Re-derive them if your SLO window or budget differs from the examples here.