SLOs and Error Budgets With Prometheus, the Practical Way

“Is the service reliable enough?” is an argument until you put a number on it. SLOs are how you stop arguing. They turn reliability into a budget you can spend, defend, and reason about — and they tell you, objectively, whether to ship features or pause and fix things. After years of running this discipline, here’s the practical version, no theology required.

SLI, SLO, error budget — the three words

SLI (Service Level Indicator): a measurement of how well you’re doing. “The fraction of requests that succeed.” A number between 0 and 1.
SLO (Service Level Objective): your target for that SLI. “99.9% of requests succeed over 30 days.”
Error budget: the complement of the SLO, that is, 1 minus the target. If you target 99.9%, you’re allowed to fail 0.1% of requests. That 0.1% is a budget you can spend.

The error budget is the brilliant part. It reframes failure from “never acceptable” to “we have an allowance, and when it runs out, we stop shipping risk and fix reliability.” It aligns the people who want features and the people who want stability around one shared number.
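As a worked example of the budget arithmetic, assume a hypothetical service that handles 1,000,000 requests in the 30-day window (the traffic figure is invented purely for illustration):

```latex
\[
\text{error budget} = 1 - \text{SLO} = 1 - 0.999 = 0.001
\]
\[
\text{allowed failures} = 0.001 \times 1{,}000{,}000 = 1{,}000 \ \text{requests}
\]
```

Once 1,000 failed requests have been spent inside the window, the budget is gone, whether that happened through many small blips or one bad deploy.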

Pick SLIs that reflect user pain

A good SLI measures something a user would actually notice. The two workhorses:

Availability — fraction of requests that didn’t error:

sum(rate(http_requests_total{status!~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))

Latency — fraction of requests served fast enough. This is a threshold SLI: “served under 300ms” is good, slower is bad. Histograms make it natural:

sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/ sum(rate(http_request_duration_seconds_count[30d]))

That ratio is “fraction of requests faster than 300ms.” Notice it’s a count of fast requests over total requests — much more meaningful than an average latency, which hides the slow tail that actually hurts users.
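One practical caveat: the le="0.3" selector only matches if the histogram was instrumented with a bucket boundary at exactly 0.3 seconds. A quick sanity check, sketched here with the same metric name as above, is to list the boundaries that actually exist:

```promql
# One result per bucket boundary; confirm that le="0.3" appears in the list.
count by (le) (http_request_duration_seconds_bucket)
```

If 0.3 is missing, the numerator matches no series and the ratio comes back empty rather than zero, so either pick a threshold that matches an existing boundary or add the bucket in the instrumentation.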

Set the target from reality, not aspiration

The rookie move is picking 99.99% because more nines sound better. Each extra nine shrinks the error budget by a factor of ten and costs far more engineering than the one before it, and an SLO you can’t hit is just a permanent source of guilt.

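Look at your actual recent performance. If you’ve been running at 99.7%, 99.9% is a real but achievable stretch; just be aware that until reliability improves you will be burning budget at roughly three times the sustainable rate, so a target closer to what you already deliver is the safer starting point. Five nines (99.999%) means about 26 seconds of error budget per 30 days; almost nobody needs that, and chasing it bankrupts your roadmap. Match the target to what users need and what you can sustain.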
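To make the cost of each nine concrete, here is the budget expressed as equivalent full-outage time over a 30-day window (43,200 minutes). Treating the budget as downtime is a simplification for a request-based SLO, but it gives a feel for the scale:

```latex
\[
\begin{array}{lll}
\text{SLO} & 1 - \text{SLO} & \text{budget per 30 days} \\
\hline
99\%     & 0.01    & 0.01 \times 43{,}200 = 432 \ \text{min} \ (7.2 \ \text{h}) \\
99.9\%   & 0.001   & 0.001 \times 43{,}200 = 43.2 \ \text{min} \\
99.99\%  & 0.0001  & 0.0001 \times 43{,}200 = 4.32 \ \text{min} \\
99.999\% & 0.00001 & 0.00001 \times 43{,}200 = 0.432 \ \text{min} \approx 26 \ \text{s}
\end{array}
\]
```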

Compute the error budget

If your SLO is 99.9% over 30 days, your budget is 0.1% of requests. Expressed as “how much budget remains”:

# budget remaining as a fraction (1.0 = full budget, 0 = exhausted)
1 - (
  (1 - (sum(rate(http_requests_total{status!~"5.."}[30d]))
        / sum(rate(http_requests_total[30d]))))
  / (1 - 0.999)
)

The inner part is your actual error fraction; dividing by the allowed error fraction (1 - 0.999) gives “what share of the budget you’ve burned.” Subtract from 1 for “what’s left.” When this hits zero, you’re out of budget for the rolling 30-day window; a negative value means you’ve overspent.
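A quick numeric check of that expression, using made-up error fractions of 0.04% and 0.15% against the 99.9% SLO:

```latex
\[
\text{remaining} = 1 - \frac{\text{error fraction}}{1 - \text{SLO}}
\]
\[
1 - \frac{0.0004}{0.001} = 1 - 0.4 = 0.6
\qquad\qquad
1 - \frac{0.0015}{0.001} = 1 - 1.5 = -0.5
\]
```

The first case has burned 40% of the budget and has 60% left. The second has spent one and a half budgets, which shows up as a negative remainder.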

Alert on burn rate, not the SLO itself

Here’s the subtle, important part. Don’t page when the 30-day SLO is breached — by then it’s too late, and one bad hour might breach a monthly target you’d have recovered from. Instead, alert on burn rate: how fast you’re spending the budget.

A burn rate of 1 means you’ll exactly exhaust the budget at the end of the window. A burn rate of 14.4 over an hour means you’d blow a 30-day budget in about two days — that’s a fast-burn emergency.
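The arithmetic behind those numbers, for a 99.9% SLO over a 30-day (720-hour) window:

```latex
\[
\text{burn rate} = \frac{\text{observed error ratio}}{1 - \text{SLO}}
\]
\[
\text{time to exhaustion} = \frac{30 \ \text{days}}{14.4} \approx 2.08 \ \text{days} = 50 \ \text{hours}
\]
\[
\text{budget spent in one hour at } 14.4\times = \frac{14.4 \times 1 \ \text{h}}{720 \ \text{h}} = 0.02 = 2\%
\]
\[
\text{error ratio that corresponds to } 14.4\times = 14.4 \times 0.001 = 0.0144
\]
```

The last line is the threshold written as 14.4 * 0.001 in the alert rule that follows: a sustained 1.44% error ratio.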

# fast burn: spending 30-day budget in ~2 days. Page.
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
  for: 5m
  labels:
    severity: page

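The standard pattern uses multiple burn rates: a fast-burn alert (steep, short window, pages) and a slow-burn alert (gentle, long window, ticket). Google’s SRE workbook popularized 14.4x over 1h and 6x over 6h as paging thresholds, with 1x over 3 days as the ticket-level slow burn. It also pairs each long window with a short one (for example 1h with 5m) so the alert clears soon after the errors stop. Together the thresholds catch both the sudden outage and the slow bleed.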
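For completeness, here is a sketch of the slow-burn counterpart to the fast-burn rule. It assumes the same http_requests_total metric and 99.9% SLO, and uses the 1x over 3 days threshold from the SRE workbook’s table, confirmed by a 6-hour short window. The alert name and the severity: ticket label are illustrative choices, not a convention your Alertmanager routing necessarily knows about.

```yaml
# Slow burn: 1x burn rate sustained over 3 days, still happening in the last 6h. Ticket.
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[3d]))
      / sum(rate(http_requests_total[3d]))
    ) > (1 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      / sum(rate(http_requests_total[6h]))
    ) > (1 * 0.001)
  labels:
    severity: ticket
```

The and clause is what keeps the alert from lingering: once the 6-hour ratio drops back under the threshold, the alert resolves even though the 3-day window still remembers the incident.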

Use recording rules so the math runs once

The SLI queries above are expensive over 30 days. Don’t recompute them on every dashboard load. Precompute the rate with a recording rule and alert against the cheap result:

groups:
  - name: slo
    rules:
      - record: job:request_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

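Now dashboards read job:request_errors:rate5m instead of re-deriving it, and burn-rate alerts can do the same once you record a matching rule for each alert window (1h, 6h, and so on), since a 5-minute ratio cannot stand in for a 1-hour one. It’s faster and your SLO definition lives in exactly one place.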
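A minimal sketch of what that looks like for the fast-burn window. The rule name job:request_errors:rate1h is an assumption that simply follows the naming pattern above, and the group name is arbitrary:

```yaml
groups:
  - name: slo-burn-windows
    rules:
      # Error ratio over the 1h fast-burn window, computed once per evaluation.
      - record: job:request_errors:rate1h
        expr: sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))
      # The alert compares the precomputed series against the 14.4x threshold.
      - alert: ErrorBudgetFastBurn
        expr: job:request_errors:rate1h > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
```

Rules within a group are evaluated in order, so placing the recording rule ahead of the alert in the same group lets the alert read a freshly recorded value.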

Where AI helps

The burn-rate math is genuinely tricky — multi-window alerts, the right multipliers, getting the budget ratio correct. This is where I lean on AI to draft the rules. I state the SLO (“99.9% availability over 30 days, page on a 2-day budget burn”) and let it produce the recording rules and the fast/slow burn alerts. Then I check the multipliers against the workbook and test with promtool.

It saves real time and avoids the classic off-by-a-factor mistakes. We keep monitoring prompts tuned for SLOs, and the Alert Rule Generator will emit burn-rate alerts with the windows already wired up.
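Since promtool is the safety net here, this is roughly what a unit test for the fast-burn alert could look like. It is a sketch under assumptions: slo_rules.yml is a hypothetical file holding the ErrorBudgetFastBurn rule from earlier inside a normal groups: wrapper, and the synthetic traffic (2 errors out of every 100 requests per minute) is invented to sit above the 1.44% threshold.

```yaml
# slo_test.yml (hypothetical file name)
rule_files:
  - slo_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 2 failed requests per minute for two hours
      - series: 'http_requests_total{status="500"}'
        values: '0+2x120'
      # 98 successful requests per minute for two hours
      - series: 'http_requests_total{status="200"}'
        values: '0+98x120'
    alert_rule_test:
      # 2% error ratio exceeds 14.4 * 0.001, so the alert should be firing by now.
      - eval_time: 90m
        alertname: ErrorBudgetFastBurn
        exp_alerts:
          - exp_labels:
              severity: page
```

Run it with promtool test rules slo_test.yml. If a generated rule has the wrong multiplier or window, a test like this is where it should show up.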

The payoff

Once you have SLOs, a lot of arguments evaporate. “Should we ship the risky feature?” — check the budget. “Is this incident a big deal?” — check the burn rate. “Are we over-investing in reliability?” — if you never spend your budget, maybe you’re being too cautious.

SLOs don’t make reliability free. They make it measurable and negotiable, which is the next best thing. Start with one SLI on your most important service, set an honest target, and let the error budget do the arguing for you.

Generated SLO rules and burn-rate alerts are assistive, not authoritative. Always validate the math against your reliability targets and test with promtool before relying on them.