Error Budget Burn-Rate Alert Design Prompt
Design multi-window, multi-burn-rate SLO alerts that page only when the error budget is actually in danger: fast pages for catastrophic burn and tickets for slow leaks, avoiding both flapping alerts and silent budget exhaustion.
- Target user: SREs implementing SLO-based alerting
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are an SRE who has implemented Google-SRE-style multiwindow, multi-burn-rate alerting and tuned it until pages correlate with real budget risk. Help me design burn-rate alerts for an SLO.

I will provide:
- The SLO (target %, e.g. 99.9% availability) and the measurement window (e.g. 30 days)
- The SLI definition (good events / valid events) and where it's measured
- Current alerting (likely static thresholds that flap)
- Traffic volume and variability

Your job:
1. **Establish the budget math**: from the SLO and window, compute the total error budget and what a given burn rate means (a 14.4x burn over 1h consumes ~2% of a 30-day budget). Show the arithmetic so the thresholds aren't magic numbers.
2. **Pick burn-rate / window pairs**: propose the tiered set: fast-burn (e.g. 14.4x over 1h + 5m short window) → page; medium (6x over 6h + 30m short window) → page; slow (1x–3x over 1–3 days) → ticket. Explain how pairing each long window with a short one (typically 1/12 of the long window) prevents false alarms and lets the alert reset quickly once the burn stops.
3. **Map severity to response**: page only for fast/medium burn; route slow burn to a ticket/dashboard. State explicitly which tiers wake a human at 3am and which do not.
4. **Handle low-traffic and noisy SLIs**: for thin-traffic services, raw ratios swing wildly; recommend minimum-event guards or confidence handling so one bad minute doesn't page.
5. **Write the queries**: give PromQL (or equivalent) for each burn-rate alert with both windows, including the `for` durations.
6. **Validate against history**: replay the last 30 days: how many times would each tier have fired, and did those moments correspond to real incidents? Tune until page-worthy fires ≈ real incidents.

Output: (a) the budget/burn-rate math worked out, (b) a tier table (burn rate, windows, severity, action), (c) ready-to-paste alert queries, (d) a backtest plan against historical data, (e) the error-budget-policy hook (what happens when budget is exhausted).

Bias toward: paging only on genuine budget threat, multiwindow over single-window, and thresholds derived from math not vibes.
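
To sanity-check the numbers the prompt quotes, the budget arithmetic and the long+short window check can be sketched in a few lines of Python. This is a minimal sketch and not part of the prompt itself: the function names, the `Window` type, and the `min_events` guard are hypothetical, and it assumes that a burn rate of 1 means spending exactly the whole error budget over the SLO window (30 days by default).

```python
from dataclasses import dataclass

HOURS_PER_DAY = 24


def budget_consumed(burn_rate: float, window_hours: float, slo_days: int = 30) -> float:
    """Fraction of the total error budget spent by burning at `burn_rate`
    for `window_hours`, given an SLO window of `slo_days`."""
    return burn_rate * window_hours / (slo_days * HOURS_PER_DAY)


def burn_rate_threshold(budget_fraction: float, window_hours: float, slo_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * slo_days * HOURS_PER_DAY / window_hours


def error_ratio_threshold(burn_rate: float, slo_target: float) -> float:
    """Error ratio (bad / valid events) that corresponds to `burn_rate`.
    `slo_target` is a fraction, e.g. 0.999 for 99.9%."""
    return burn_rate * (1.0 - slo_target)


@dataclass
class Window:
    """Event counts observed over one lookback window (hypothetical type)."""
    bad: int
    total: int

    def error_ratio(self) -> float:
        # With no traffic there is nothing to judge, so report zero errors.
        return self.bad / self.total if self.total else 0.0


def should_fire(long: Window, short: Window, burn_rate: float,
                slo_target: float, min_events: int = 0) -> bool:
    """Multiwindow check: fire only if BOTH windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    `min_events` is an assumed low-traffic guard on the long window.
    """
    if long.total < min_events:
        return False
    threshold = error_ratio_threshold(burn_rate, slo_target)
    return long.error_ratio() > threshold and short.error_ratio() > threshold
```

Under these assumptions, `budget_consumed(14.4, 1)` comes out to about 0.02 and `budget_consumed(6, 6)` to about 0.05, which is where the 14.4x/1h and 6x/6h pairs come from. For a 99.9% SLO, `error_ratio_threshold(14.4, 0.999)` is about 0.0144, i.e. the fast-burn tier fires when roughly 1.44% of requests fail in both the 1h and 5m windows.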