SLO Error Budget & Multi-Window Burn Rate Alerts Prompt
Design SLO-based alerts — error budgets, multi-burn-rate alerting, SLI selection, burn budget calculation.
- Target user: SREs adopting modern SLO-based alerting
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who has implemented SLO-based alerting as described in Google's Site Reliability Workbook ("Alerting on SLOs"): multi-window burn rates, error budgets, and fewer but more meaningful pages.
I will provide:
- The service and its SLO target (e.g., 99.9% over 30d)
- Current alerting model
- Goal: design SLO alerts / migrate from threshold-based
Your job:
1. **SLO basics**:
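- **SLI**: Service Level Indicator, the measured signal (latency, availability)
- **SLO**: the target for that signal (e.g., 99.9% availability over 30 days)
- **Error budget**: the allowed unreliability; 0.1% over 30d = ~43 min of full downtime
- **Burn rate**: how fast the budget is being consumed (1× means it runs out exactly at the end of the window)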
2. **Multi-window burn alert** (recommended pattern):
- Two windows: short (recent) + long (sustained)
- Both must be burning for alert to fire
- Fast burn: catches sudden spikes (paging)
- Slow burn: catches sustained issues (ticket)
3. **Burn rate calculation**:
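- Burn rate = fraction of budget consumed × (SLO period / alert window); for a 30-day (720h) period:
- 14.4× burn rate over 1h consumes 2% of the monthly budget
- 6× burn rate over 6h consumes 5% of the monthly budget
- 3× burn rate over 24h consumes 10% of the monthly budget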
4. **For Apdex / latency SLO**:
- "X% of requests served in Y seconds"
- Track with a histogram: the SLI is the count in the `le="Y"` bucket divided by the total request count
5. **For availability SLO**:
- "X% successful requests"
- sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
6. **For recording rules**:
- Pre-compute SLI per window
- Burn rate as ratio
- Multi-window alert combines the short-window and long-window ratios with `and`
7. **For dashboard**:
- SLO compliance over time
- Error budget remaining
- Burn rate gauge
8. **For SLO tooling**:
- sloth, openslo, pyrra for SLO-as-code
Mark DESTRUCTIVE: setting SLO too high (constant alerting), removing multi-window check (flaky alerts), changing SLO during incident.
---
Service + SLO: [DESCRIBE]
Current alerting: [DESCRIBE]
Goal: [DESCRIBE]
Why this prompt works
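SLO-based alerts are the modern SRE standard. This prompt walks through the design step by step: SLI selection, error budget, burn-rate thresholds, recording rules, and the alerts themselves.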
How to use it
- Define SLI precisely.
- Set SLO based on user need.
- Implement multi-window burn.
- Track error budget.
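As a rough illustration of the "track error budget" step, the sketch below converts an availability target into allowed full-outage minutes. The function name `error_budget_minutes` is hypothetical, and the calculation assumes the budget is spent only by total outages; real budgets are usually counted in failed requests.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the minutes of full outage an availability SLO allows.

    slo_target is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # Compare a few common targets over a 30-day window.
    for target in (0.99, 0.999, 0.9999):
        minutes = error_budget_minutes(target)
        print(f"{target:.2%} over 30d allows about {minutes:.1f} min of downtime")
```

For 99.9% over 30 days this gives roughly 43 minutes, matching the figure quoted in the prompt.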
Useful commands
# SLI: availability (req success ratio)
sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Error budget over month (99.9% target)
1 - 0.999
# Burn rate
(1 - sli_value) / (1 - slo_target)
# > 1 means burning faster than allowed
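The burn-rate formula above can be checked with a few lines of Python. This is a minimal sketch: `burn_rate` and `hours_to_exhaustion` are illustrative names, and the sample SLI value is made up for the example.

```python
import math


def burn_rate(sli: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return (1.0 - sli) / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone if this burn rate is sustained."""
    if rate <= 0:
        return math.inf  # nothing is being burned
    return window_days * 24 / rate


if __name__ == "__main__":
    # Hypothetical measurement: 98.56% of requests succeeded, target 99.9%.
    rate = burn_rate(sli=0.9856, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget exhausted in about {hours_to_exhaustion(rate):.0f} hours")
```

A burn rate of 1 exhausts the budget exactly at the end of the 30-day window; anything higher exhausts it proportionally sooner.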
Multi-window burn alert pattern
groups:
  - name: slo-alerts
    rules:
      # Recording rules: error ratio per window
      - record: slo:http_errors:rate5m
        expr: sum(rate(http_requests_total{code=~"5.."}[5m]))
      - record: slo:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m]))
      - record: slo:http_error_ratio:rate5m
        expr: slo:http_errors:rate5m / slo:http_requests:rate5m
      # Short window for the medium-burn alert
      - record: slo:http_error_ratio:rate30m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[30m]))
          / sum(rate(http_requests_total[30m]))
      - record: slo:http_error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
          / sum(rate(http_requests_total[1h]))
      # Short window for the slow-burn alert
      - record: slo:http_error_ratio:rate2h
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[2h]))
          / sum(rate(http_requests_total[2h]))
      - record: slo:http_error_ratio:rate6h
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[6h]))
          / sum(rate(http_requests_total[6h]))
      - record: slo:http_error_ratio:rate24h
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[24h]))
          / sum(rate(http_requests_total[24h]))

      # Fast burn (2% of budget in 1h): page
      # Thresholds are burn rate * (1 - SLO target), here 1 - 0.999 = 0.001
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            slo:http_error_ratio:rate5m > (14.4 * 0.001)
            and
            slo:http_error_ratio:rate1h > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          slo: "99.9"
        annotations:
          summary: "Error budget burning fast (2% in 1h)"
          runbook: "https://runbooks.example.com/error-budget-fast"

      # Medium burn (5% of budget in 6h): page
      - alert: ErrorBudgetBurnMedium
        expr: |
          (
            slo:http_error_ratio:rate30m > (6 * 0.001)
            and
            slo:http_error_ratio:rate6h > (6 * 0.001)
          )
        for: 15m
        labels:
          severity: critical
          slo: "99.9"
        annotations:
          summary: "Error budget burning at medium rate (5% in 6h)"

      # Slow burn (10% of budget in 24h): ticket
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            slo:http_error_ratio:rate2h > (3 * 0.001)
            and
            slo:http_error_ratio:rate24h > (3 * 0.001)
          )
        for: 1h
        labels:
          severity: warning
          slo: "99.9"
        annotations:
          summary: "Error budget burning slowly (10% in 24h)"
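To see where the 14.4, 6, and 3 multipliers come from and why both windows must agree, the following sketch derives each burn rate from its budget fraction and evaluates the two-window condition. The `BurnWindow` class and the sample error ratios are assumptions made for illustration; the production logic lives in the Prometheus rules, not in Python.

```python
from dataclasses import dataclass

SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO window


@dataclass(frozen=True)
class BurnWindow:
    name: str
    budget_fraction: float  # share of the budget consumed over long_hours
    long_hours: float       # sustained window
    short_hours: float      # recent window, listed for reference only

    @property
    def burn_rate(self) -> float:
        # Burn rate that consumes budget_fraction within long_hours.
        return self.budget_fraction * SLO_PERIOD_HOURS / self.long_hours

    def threshold(self, slo_target: float) -> float:
        # Error ratio that corresponds to this burn rate.
        return self.burn_rate * (1.0 - slo_target)

    def fires(self, short_ratio: float, long_ratio: float, slo_target: float) -> bool:
        # Both windows must exceed the threshold, mirroring the `and` in the rule.
        limit = self.threshold(slo_target)
        return short_ratio > limit and long_ratio > limit


WINDOWS = [
    BurnWindow("fast", 0.02, long_hours=1, short_hours=5 / 60),
    BurnWindow("medium", 0.05, long_hours=6, short_hours=0.5),
    BurnWindow("slow", 0.10, long_hours=24, short_hours=2),
]

if __name__ == "__main__":
    for w in WINDOWS:
        print(f"{w.name}: burn rate {w.burn_rate:.1f}x, "
              f"error-ratio threshold {w.threshold(0.999):.4f}")
    fast = WINDOWS[0]
    # Brief spike: the 5m ratio is high, but the 1h ratio is still under the limit.
    print("brief spike fires:", fast.fires(0.05, 0.005, 0.999))
    # Sustained burn: both windows are above the limit.
    print("sustained burn fires:", fast.fires(0.05, 0.02, 0.999))
```

The short window makes the alert reset quickly once errors stop, while the long window keeps a momentary spike from paging anyone.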
Common findings this catches
- Constant pages → SLO target is unrealistic for the service's current reliability.
- No long-window check → flaky single-window alerts.
- SLI doesn’t reflect user experience → revise the SLI.
- Error budget never replenishes → SLO too tight or service genuinely failing.
- Multiple conflicting SLOs → consolidate.
- Burn-rate spike during deploys → expected if brief; tune the alert windows.
- Negative budget remaining → budget overspent; halt feature releases.
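The last finding, a negative remaining budget, is easy to reproduce from request counts. This is a minimal sketch with a hypothetical helper `budget_remaining` and made-up traffic numbers; it assumes a request-based availability SLO.

```python
def budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures <= 0:
        raise ValueError("no requests or a 100% target leaves no budget to measure")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 10M requests against a 99.9% target allows ~10,000 failures.
    for failed in (4_000, 12_500):
        remaining = budget_remaining(10_000_000, failed, 0.999)
        state = "overspent" if remaining < 0 else "within budget"
        print(f"{failed} failures: {remaining:.0%} of budget remaining ({state})")
```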
When to escalate
- Agreeing on the SLO target with the business: a strategic decision.
- Validating the SLI definition with users: needs user research.
- Adopting SLO-as-code: a tooling decision.
Related prompts
- Alert Fatigue Reduction Strategy Prompt: Reduce alert fatigue with SLO-based vs. symptom-based alerts, severity tiers, runbook integration, and deprecation of noisy alerts.
- Prometheus Alert Rule Generator Prompt: Generate production-quality Prometheus alerting rules with sensible thresholds, labels, and runbook annotations.
- PromQL Histogram & Quantile Calculation Prompt: Use Prometheus histograms correctly, covering `histogram_quantile`, bucket bounds, p99 latency calculation, histogram vs. summary, and native histograms.