DevOps AI ToolKit
AI for Prometheus & Monitoring · By James Joyner IV · 11 min read

Refactoring Legacy Threshold Alerts to Burn-Rate Alerts With AI

Old 'error rate over 1% for 5m' alerts page too much and catch too little. How I use AI to migrate threshold alerts to SLO burn-rate alerting safely.

  • #prometheus
  • #alerting
  • #slo
  • #burn-rate
  • #ai
  • #migration

Most teams accumulate alert rules the way houses accumulate junk drawers. Someone added “error rate over 1% for 5 minutes” in 2021, someone else bolted on “latency over 800ms” the next quarter, and three years later you have ninety threshold alerts that nobody fully understands, half of which page too often and the other half of which would miss a real outage. The modern answer is SLO-based burn-rate alerting, but nobody migrates ninety rules by hand. This is a migration project, and migration projects — repetitive transformations following a known pattern — are where AI shines as a fast junior engineer. The catch is that you’re touching the thing that wakes people up, so the review bar is high.

Why threshold alerts age badly

A static threshold like “error rate over 1% for 5m” has no concept of budget. It pages identically whether you’ve burned 2% of your monthly error budget or 80%. It fires on a 6-minute blip that self-heals and stays quiet through a slow week-long leak that quietly exhausts your budget: against a 99.9% SLO, for example, a steady 0.5% error rate never trips the 1% threshold, yet it burns five times faster than the budget allows and empties a 30-day budget in about six days. Burn-rate alerting reframes the question from “is the error rate high right now” to “are we burning our error budget fast enough to run out before the window closes,” which is the question that actually maps to user pain. The migration is mechanical once you know the target shape, and “mechanical, following a known pattern” is the AI sweet spot.
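
For concreteness, the kind of legacy rule being migrated might look like the sketch below. The group name, the alert name, and the http_requests_total selector are assumptions chosen to line up with the SLI examples later in this article; they are not taken from a real rule repository.

```yaml
groups:
  - name: api-legacy.rules            # hypothetical group name
    rules:
      # Static threshold: pages whenever the 5xx ratio stays above 1% for 5 minutes.
      # Nothing here knows how much error budget is left.
      - alert: "ApiHighErrorRate"     # hypothetical alert name
        expr: |
          sum by (service) (rate(http_requests_total{job="api", code=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total{job="api"}[5m]))
            > 0.01
        for: 5m
        labels:
          severity: page
```

The expression already contains the numerator and denominator that the SLI will be built from, which is why the existing rule is a useful starting point rather than something to throw away.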

Step one: inventory and classify with AI

I start by pasting the whole rule directory and asking the model to classify, not transform:

Here are 90 alert rules. For each, classify it as: (a) an availability/error-rate SLO candidate, (b) a latency SLO candidate, (c) a resource/saturation alert that should stay a threshold, or (d) redundant with another rule. Don’t rewrite anything yet — just classify and flag duplicates.

This is the highest-value step because not everything should become a burn-rate alert. Disk-full and certificate-expiry are genuinely threshold conditions; forcing them into an SLO frame is wrong. AI is good at this triage, and the duplicate-flagging alone usually lets me delete a dozen rules. I review the classification — the model occasionally miscategorizes a saturation alert as an SLO candidate — but it does 90% of the sorting in one pass.
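
As an illustration of category (c), rules like the following are reasonable to leave as plain thresholds. This is a sketch that assumes node_exporter and blackbox_exporter metrics are being scraped; the group and alert names, the 10% and 14-day limits, and the fstype filter are illustrative choices, not recommendations from the original rule set.

```yaml
groups:
  - name: stay-threshold.rules        # hypothetical group name
    rules:
      # Disk space is a hard limit, not a budget that is spent over a window.
      - alert: "DiskAlmostFull"
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 15m
        labels:
          severity: ticket
      # Certificate expiry is a deadline: fire when fewer than 14 days remain.
      - alert: "CertificateExpiringSoon"
        expr: 'probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600'
        for: 1h
        labels:
          severity: ticket
```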

Step two: define the SLI before the alert

For each SLO candidate I need a recorded SLI. I have the model derive numerator and denominator from the existing threshold alert’s expression, which already encodes what the team considered “bad”:

groups:
  - name: api-slo.rules
    rules:
      - record: "slo:api_requests_total:rate5m"
        expr: 'sum by (service) (rate(http_requests_total{job="api"}[5m]))'
      - record: "slo:api_requests_good:rate5m"
        expr: 'sum by (service) (rate(http_requests_total{job="api", code!~"5.."}[5m]))'

The old alert said “5xx rate over 1%,” so “good” is non-5xx and the implied target is 99%. I make the model state that inference explicitly — “this rule implies a 99% availability target; confirm that’s the SLO you want” — because the migration is a chance to set an intentional target rather than inheriting an accidental one. That’s a human call.

Step three: generate the multi-window burn-rate alert

With the SLI recorded, the burn-rate alert follows the standard multi-window pattern. I let AI generate it and review every constant:

# Fast burn: error ratio above 14.4x the 1% budget on both the 5m and 1h windows.
- alert: "ApiErrorBudgetFastBurn"
  expr: |
    (1 - (slo:api_requests_good:rate5m / slo:api_requests_total:rate5m)) > (14.4 * 0.01)
    and
    (1 - (slo:api_requests_good:rate1h / slo:api_requests_total:rate1h)) > (14.4 * 0.01)
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "API burning error budget 14.4x too fast (5m and 1h windows)"
    runbook_url: "https://runbooks.internal/api-slo"
# Slow burn: error ratio above 6x the 1% budget on both the 30m and 6h windows.
- alert: "ApiErrorBudgetSlowBurn"
  expr: |
    (1 - (slo:api_requests_good:rate30m / slo:api_requests_total:rate30m)) > (6 * 0.01)
    and
    (1 - (slo:api_requests_good:rate6h / slo:api_requests_total:rate6h)) > (6 * 0.01)
  for: 15m
  labels:
    severity: ticket
  annotations:
    summary: "API burning error budget 6x too fast (30m and 6h windows)"
    runbook_url: "https://runbooks.internal/api-slo"

The 14.4 and 6 burn-rate multipliers and the window pairs come straight from Google's Site Reliability Workbook, which derives them for a 30-day SLO window. I make the model justify each multiplier against the budget math rather than trusting that it transcribed them correctly; transposed constants are a classic LLM slip. The workbook itself pages on both of these tiers, so if you route the 6x alert to a ticket, as here, make that a stated decision rather than an accident. Note that the alerts require matching rate1h, rate30m, and rate6h recording rules, which the model sometimes forgets to generate. Deeper detail lives in multi-window burn-rate alerts for SLOs that work.
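
The budget arithmetic behind those constants is short enough to check by hand. Assuming a 30-day (720-hour) SLO window and the 99% target implied by the old rule, a burn rate is the fraction of budget the alert tolerates divided by the fraction of the SLO window in which it is spent. The 2% and 5% figures are the budget-consumption thresholds paired with the 1h and 6h windows:

```latex
\[
\text{burn rate} = \frac{\text{budget fraction consumed}}{\text{long window} / \text{SLO window}}
\]
\[
\frac{0.02}{1\,\mathrm{h} / 720\,\mathrm{h}} = 14.4,
\qquad
\frac{0.05}{6\,\mathrm{h} / 720\,\mathrm{h}} = 6
\]
\[
\text{alert threshold} = \text{burn rate} \times (1 - \text{SLO}):
\qquad
14.4 \times 0.01 = 0.144,
\qquad
6 \times 0.01 = 0.06
\]
```

If the team picks a different target or a different SLO window, the 0.01 factor and the 720-hour denominator change with it, which is a quick way to spot a constant that was copied without being understood.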

Pro Tip: Run the new burn-rate alerts in parallel with the old threshold alerts for a couple of weeks before deleting the old ones. Compare what each fires on against real incidents. If the new alert misses something the old one caught, your SLI definition is wrong — find out in shadow mode, not during the migration cutover.

Step four: shadow-run, then cut over

The migration’s riskiest moment is deleting the old alerts. I never do it on faith. The new burn-rate alerts run alongside the old threshold ones, routed to a low-priority channel, for two to four weeks. I compare: did they fire on the same real incidents? Did the new ones suppress the 3am blips the old ones paged on? Only after the shadow period proves the new alerts are at least as protective do I delete the old rules. AI helps me build a comparison query, but the cutover decision is mine, grounded in evidence.
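
One way to wire up the shadow period is a dedicated Alertmanager route. The sketch below assumes the new burn-rate rules carry an extra label, shadow: "true", for the duration of the trial; that label and both receiver names are hypothetical, and the receivers are left without notifier settings so the fragment stays generic.

```yaml
# alertmanager.yml (fragment)
route:
  receiver: oncall-pager              # hypothetical default receiver for existing alerts
  routes:
    # Shadow-mode burn-rate alerts match here first, so a severity: page label
    # on them cannot reach the pager while they are still being evaluated.
    - matchers:
        - 'shadow="true"'
      receiver: slo-shadow-channel    # hypothetical low-priority receiver
      continue: false
receivers:
  - name: oncall-pager
  - name: slo-shadow-channel
```

Cutting over then amounts to removing the shadow label from the rules and deleting this route, which keeps the switch reviewable as a small diff.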

Don’t forget the recording rules the alert depends on

The most common way an AI-assisted burn-rate migration breaks is that the model generates alert expressions referencing rate1h, rate30m, and rate6h recording rules but only generates the rate5m one, or none at all. An expression over a series that does not exist returns an empty result, so the alert never fires and never errors. In a migration that is catastrophic if you have already deleted the working threshold alert it replaced. I make the dependency explicit and have the model enumerate every recording rule each alert needs before I deploy anything:

groups:
  - name: api-slo.rules
    rules:
      - record: "slo:api_requests_good:rate5m"
        expr: 'sum by (service) (rate(http_requests_total{job="api", code!~"5.."}[5m]))'
      - record: "slo:api_requests_good:rate1h"
        expr: 'sum by (service) (rate(http_requests_total{job="api", code!~"5.."}[1h]))'
      - record: "slo:api_requests_good:rate30m"
        expr: 'sum by (service) (rate(http_requests_total{job="api", code!~"5.."}[30m]))'
      - record: "slo:api_requests_good:rate6h"
        expr: 'sum by (service) (rate(http_requests_total{job="api", code!~"5.."}[6h]))'
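
The listing above records only the good series. For reference, a sketch of the total series at the three longer windows could look like the following. It assumes the rate5m total rule from step two already lives in the same group, so that rule is not repeated here.

```yaml
groups:
  - name: api-slo.rules
    rules:
      # ... the good-series rules and the rate5m total rule go here ...
      - record: "slo:api_requests_total:rate1h"
        expr: 'sum by (service) (rate(http_requests_total{job="api"}[1h]))'
      - record: "slo:api_requests_total:rate30m"
        expr: 'sum by (service) (rate(http_requests_total{job="api"}[30m]))'
      - record: "slo:api_requests_total:rate6h"
        expr: 'sum by (service) (rate(http_requests_total{job="api"}[6h]))'
```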

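The _total series need a rule at every window too. That’s eight recording rules to support two burn-rate alerts, and the model will quietly under-generate them unless you make the count explicit. I cross-check that every window referenced in an alert has a corresponding recording rule. Be careful about what CI actually proves here: promtool check rules validates syntax, and a reference to a recording rule that does not exist is still valid PromQL, so it passes. Catching a dangling reference takes a promtool test rules unit test that asserts the alert fires on synthetic data, or a script that compares referenced series names against recorded ones. This is exactly the kind of mechanical completeness check where a human verifying the AI’s output prevents a silent production gap.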
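
One way to prove the whole chain end to end is a promtool test rules unit test: if any recording rule the alert depends on is missing, the alert never fires on the synthetic series and the test fails. The sketch below assumes the recording rules and alerts live in two files with the hypothetical names shown, and that the fast-burn alert carries exactly the labels and annotations from step three.

```yaml
# api-slo_test.yml (hypothetical file name); run with: promtool test rules api-slo_test.yml
rule_files:
  - api-slo.rules.yml                 # hypothetical: the eight recording rules
  - api-slo.alerts.yml                # hypothetical: the two burn-rate alerts
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Two hours of traffic with a steady 20% error ratio, well above 14.4 * 0.01.
      - series: 'http_requests_total{job="api", service="api", code="200"}'
        values: '0+80x120'
      - series: 'http_requests_total{job="api", service="api", code="500"}'
        values: '0+20x120'
    alert_rule_test:
      - eval_time: 2h
        alertname: ApiErrorBudgetFastBurn
        exp_alerts:
          - exp_labels:
              severity: page
              service: api
            exp_annotations:
              summary: "API burning error budget 14.4x too fast (5m and 1h windows)"
              runbook_url: "https://runbooks.internal/api-slo"
```

The expected labels and annotations have to match what the rule emits, so any label added later (a shadow marker, a team label) needs to be added here as well.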

Step five: keep it explainable

The whole point of the migration is to end up with alerts the team understands. For every new burn-rate alert I should be able to say: here’s the SLO, here’s the budget, here’s why this multiplier pages and that one tickets. If the only justification is “the AI converted it,” the migration failed — I’ve just swapped one inscrutable pile for another. Explainability is the deliverable, not a nicety.

The free Alert Rule Generator is a good scaffold for the individual burn-rate rules, and our code review dashboard catches missing recording-rule dependencies when the migration lands as a PR. I do the bulk classification and transformation in Claude, and review the diffs inline with Cursor.

Conclusion

Migrating legacy threshold alerts to burn-rate alerting is a textbook AI-assisted project: repetitive, pattern-driven transformation across many files, with a few genuine judgment calls — what stays a threshold, what the real SLO target should be, when it’s safe to cut over. Let the model classify and transform at speed, make it justify every burn-rate constant, shadow-run before deleting anything, and insist on alerts you can explain. Do that and ninety inscrutable rules become a handful of alerts that map to actual user pain. More in SLOs and error budgets with Prometheus and the monitoring category.
