
Error Budget Burn-Rate Alert Design Prompt

Design multiwindow, multi-burn-rate SLO alerts that page only when the error budget is actually in danger (fast pages for catastrophic burn, tickets for slow leaks), eliminating both flapping and silent budget exhaustion.

Target user: SREs implementing SLO-based alerting
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an SRE who has implemented Google-SRE-style multiwindow, multi-burn-rate alerting and tuned it until pages correlate with real budget risk. Help me design burn-rate alerts for an SLO.

I will provide:
- The SLO (target %, e.g. 99.9% availability) and the measurement window (e.g. 30 days)
- The SLI definition (good events / valid events) and where it's measured
- Current alerting (likely static thresholds that flap)
- Traffic volume and variability

Your job:

1. **Establish the budget math** — from the SLO and window, compute the total error budget and define burn rate relative to it: a 1x burn spends exactly the whole budget over the SLO window, so a 14.4x burn sustained for 1h consumes 2% of a 30-day budget (14.4 × 1h / 720h = 0.02). Show the arithmetic so the thresholds aren't magic numbers.

2. **Pick burn-rate / window pairs** — propose the tiered set: fast-burn (e.g. 14.4x over 1h + 5m short window) → page; medium (6x over 6h) → page; slow (1x–3x over 1–3 days) → ticket. Explain the long+short window pairing: the long window keeps a brief spike from paging, and the short window confirms the burn is still happening, so the alert clears soon after recovery instead of firing until the long window drains.

3. **Map severity to response** — page only for fast/medium burn; route slow burn to a ticket/dashboard. State explicitly which tiers wake a human at 3am and which do not.

4. **Handle low-traffic and noisy SLIs** — for thin-traffic services, raw ratios swing wildly; recommend minimum-event guards (e.g. require a minimum number of valid events in the window before the ratio counts) or confidence handling so one bad minute doesn't page.

5. **Write the queries** — give PromQL (or equivalent) for each burn-rate alert with both windows, including the `for` durations.

6. **Validate against history** — replay the last 30 days: how many times would each tier have fired, and did those moments correspond to real incidents? Tune until page-worthy fires ≈ real incidents.

Output: (a) the budget/burn-rate math worked out, (b) a tier table (burn rate, windows, severity, action), (c) ready-to-paste alert queries, (d) a backtest plan against historical data, (e) the error-budget-policy hook (what happens when budget is exhausted).

Bias toward: paging only on genuine budget threat, multiwindow over single-window, and thresholds derived from math not vibes.
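
Reference sketch (not part of the prompt): to sanity-check the numbers the model returns for steps 1, 2 and 5, the burn-rate arithmetic can be reproduced in a few lines of Python. This is a minimal sketch under stated assumptions: a 99.9% SLO over 30 days, tiers that spend 2%, 5% and 10% of the budget, short windows set to 1/12 of the long window as the Google SRE Workbook suggests, and a hypothetical recording-rule naming scheme `slo:error_ratio:rate<window>` that you would replace with your own series.

```python
from dataclasses import dataclass

SLO_TARGET = 0.999          # assumed example: 99.9% availability
SLO_PERIOD_HOURS = 30 * 24  # assumed example: 30-day SLO window


@dataclass(frozen=True)
class Tier:
    name: str
    budget_fraction: float    # share of the budget spent if the burn lasts the whole long window
    long_window: str          # duration label used in the series name
    long_window_hours: float
    short_window: str         # assumed to be 1/12 of the long window
    action: str


TIERS = [
    Tier("fast", 0.02, "1h", 1, "5m", "page"),
    Tier("medium", 0.05, "6h", 6, "30m", "page"),
    Tier("slow", 0.10, "3d", 72, "6h", "ticket"),
]


def burn_rate(tier, period_hours=SLO_PERIOD_HOURS):
    """Burn rate that spends tier.budget_fraction of the budget within the long window."""
    return tier.budget_fraction * period_hours / tier.long_window_hours


def error_ratio_threshold(tier, slo_target=SLO_TARGET):
    """Error ratio at which the tier's burn rate is reached."""
    return burn_rate(tier) * (1 - slo_target)


def alert_expr(tier, series="slo:error_ratio:rate"):
    """Render a condition requiring BOTH windows to exceed the threshold.

    `series` is a hypothetical recording-rule prefix, not a standard name.
    """
    threshold = f"{error_ratio_threshold(tier):.6g}"
    return (
        f"{series}{tier.long_window} > {threshold} "
        f"and {series}{tier.short_window} > {threshold}"
    )


if __name__ == "__main__":
    for tier in TIERS:
        print(f"{tier.name:<6} {burn_rate(tier):>5.1f}x  {tier.action:<6} {alert_expr(tier)}")
```

With these assumed inputs the tiers come out at 14.4x, 6x and 1x, matching the figures quoted in the prompt. If the model's thresholds differ for the same SLO and window, ask it to show which budget fraction or window it used.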
