
SLO Error Budget & Multi-Window Burn Rate Alerts Prompt

Design SLO-based alerts: error budgets, multi-window burn-rate alerting, SLI selection, and burn-rate calculation.

Target user
SREs adopting modern SLO-based alerting
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior SRE who has implemented SLO-based alerting per Google's SRE book — multi-window burn rate, error budgets, fewer but more meaningful pages.

I will provide:
- The service and its SLO target (e.g., 99.9% over 30d)
- Current alerting model
- Goal: design SLO alerts / migrate from threshold-based

Your job:

1. **SLO basics**:
   - **SLI** — Service Level Indicator (latency, availability)
   - **SLO** — target (e.g., 99.9% availability over 30 days)
   - **Error budget** — 0.1% over 30d = ~43 min downtime allowed
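   - **Burn rate**: how fast the error budget is being consumed relative to the rate the SLO allows (1× spends exactly the whole budget over the SLO window)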
2. **Multi-window burn alert** (recommended pattern):
   - Two windows: short (recent) + long (sustained)
   - Both must be burning for alert to fire
   - Fast burn: catches sudden spikes (paging)
   - Slow burn: catches sustained issues (ticket)
3. **Burn rate calculation**:
   - 14.4× burn rate over 1h = consumes 2% of the 30-day budget in 1 hour
   - 6× burn rate over 6h = consumes 5% in 6 hours
   - 3× burn rate over 24h = consumes 10% in a day
4. **For Apdex / latency SLO**:
   - "X% of requests served in Y seconds"
   - Use histogram for tracking
5. **For availability SLO**:
   - "X% successful requests"
   - sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
6. **For recording rules**:
   - Pre-compute SLI per window
   - Burn rate as ratio
   - Multi-window alert combines
7. **For dashboard**:
   - SLO compliance over time
   - Error budget remaining
   - Burn rate gauge
8. **For SLO tooling**:
   - sloth, openslo, pyrra for SLO-as-code

Mark DESTRUCTIVE: setting SLO too high (constant alerting), removing multi-window check (flaky alerts), changing SLO during incident.

---

Service + SLO: [DESCRIBE]
Current alerting: [DESCRIBE]
Goal: [DESCRIBE]

Why this prompt works

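SLO-based alerts are the modern SRE standard. This prompt walks the model through the full design: choosing the SLI, setting the target, computing burn rates, and writing multi-window alert rules.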

How to use it

  1. Define SLI precisely.
  2. Set SLO based on user need.
  3. Implement multi-window burn.
  4. Track error budget.

Useful commands

# SLI: availability (req success ratio)
sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Error budget over month (99.9% target)
1 - 0.999

# Burn rate
(1 - sli_value) / (1 - slo_target)
# > 1 means burning faster than allowed
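
The arithmetic behind these expressions and the 14.4×/6×/3× thresholds can be checked with a short script. This is a minimal sketch that assumes a 30-day SLO window; the function names are illustrative and do not come from any SLO tool.

```python
# Sketch: error-budget and burn-rate arithmetic for a 30-day SLO window.
WINDOW_HOURS = 30 * 24  # assumed 30-day SLO period, as in the 99.9%/30d example


def error_budget(slo_target: float) -> float:
    """Allowed error ratio, e.g. about 0.001 for a 99.9% target."""
    return 1.0 - slo_target


def budget_minutes(slo_target: float, window_hours: float = WINDOW_HOURS) -> float:
    """Minutes of full outage the budget allows over the SLO window."""
    return error_budget(slo_target) * window_hours * 60


def burn_rate(sli_value: float, slo_target: float) -> float:
    """(1 - sli_value) / (1 - slo_target); above 1 means burning too fast."""
    return (1.0 - sli_value) / error_budget(slo_target)


def budget_consumed(rate: float, alert_hours: float,
                    window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the window's budget spent at `rate` sustained for `alert_hours`."""
    return rate * alert_hours / window_hours


if __name__ == "__main__":
    print(f"budget: {budget_minutes(0.999):.1f} min")
    for rate, hours in [(14.4, 1), (6, 6), (3, 24)]:
        print(f"{rate}x for {hours}h -> {budget_consumed(rate, hours):.0%} of budget")
```

Each threshold is the budget fraction divided by the share of the window the alert covers: 2% of the budget in 1 hour out of 720 gives 0.02 × 720 / 1 = 14.4.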

Multi-window burn alert pattern

groups:
- name: slo-alerts
  rules:
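  # Recording rules: one error ratio per window referenced by the alerts below
  - record: slo:http_errors:rate5m
    expr: sum(rate(http_requests_total{code=~"5.."}[5m]))
  - record: slo:http_requests:rate5m
    expr: sum(rate(http_requests_total[5m]))
  - record: slo:http_error_ratio:rate5m
    expr: slo:http_errors:rate5m / slo:http_requests:rate5m

  # Short window for the medium-burn alert
  - record: slo:http_error_ratio:rate30m
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[30m]))
        / sum(rate(http_requests_total[30m]))

  - record: slo:http_error_ratio:rate1h
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))

  # Short window for the slow-burn alert
  - record: slo:http_error_ratio:rate2h
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[2h]))
        / sum(rate(http_requests_total[2h]))

  - record: slo:http_error_ratio:rate6h
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[6h]))
        / sum(rate(http_requests_total[6h]))

  - record: slo:http_error_ratio:rate24h
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[24h]))
        / sum(rate(http_requests_total[24h]))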

  # Fast burn (2% budget in 1h) — paging
  - alert: ErrorBudgetBurnFast
    expr: |
      (
        slo:http_error_ratio:rate5m > (14.4 * 0.001)
        and
        slo:http_error_ratio:rate1h > (14.4 * 0.001)
      )
    for: 2m
    labels:
      severity: critical
      slo: 99.9
    annotations:
      summary: "Error budget burning fast (2% in 1h)"
      runbook: "https://runbooks.example.com/error-budget-fast"

  # Medium burn (5% in 6h) — paging
  - alert: ErrorBudgetBurnMedium
    expr: |
      (
        slo:http_error_ratio:rate30m > (6 * 0.001)
        and
        slo:http_error_ratio:rate6h > (6 * 0.001)
      )
    for: 15m
    labels:
      severity: critical
      slo: 99.9
    annotations:
      summary: "Error budget burning at medium rate (5% in 6h)"

  # Slow burn (10% in 24h) — ticket
  - alert: ErrorBudgetBurnSlow
    expr: |
      (
        slo:http_error_ratio:rate2h > (3 * 0.001)
        and
        slo:http_error_ratio:rate24h > (3 * 0.001)
      )
    for: 1h
    labels:
      severity: warning
      slo: 99.9
    annotations:
      summary: "Error budget burning slowly (10% in 24h)"
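
The firing logic of these rules can be sketched outside Prometheus to see why a recovered spike does not page. The example below is a rough model: it assumes the 99.9% target, takes a hypothetical mapping from window name to error ratio as input, and ignores the `for:` hold durations.

```python
from dataclasses import dataclass

ERROR_BUDGET = 0.001  # 1 - 0.999, matching the 99.9% target in the rules


@dataclass(frozen=True)
class BurnAlert:
    name: str
    factor: float      # burn-rate multiple of the error budget
    short_window: str  # key into the error-ratio mapping (hypothetical labels)
    long_window: str


# Mirrors the three alert rules: (burn factor, short window, long window).
ALERTS = [
    BurnAlert("ErrorBudgetBurnFast", 14.4, "5m", "1h"),
    BurnAlert("ErrorBudgetBurnMedium", 6.0, "30m", "6h"),
    BurnAlert("ErrorBudgetBurnSlow", 3.0, "2h", "24h"),
]


def firing(error_ratios: dict[str, float], budget: float = ERROR_BUDGET) -> list[str]:
    """Return alerts whose short AND long windows both exceed the threshold."""
    names = []
    for alert in ALERTS:
        threshold = alert.factor * budget
        if (error_ratios[alert.short_window] > threshold
                and error_ratios[alert.long_window] > threshold):
            names.append(alert.name)
    return names
```

For instance, if the 1h ratio is still elevated after a spike but the 5m ratio has dropped back to zero, the fast-burn condition is false and nothing fires. The short window is what lets an alert reset quickly once the problem stops.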

Common findings this catches

  • Constant pages → SLO unrealistic.
  • No long-window check → flaky single-window alerts.
  • SLI doesn’t reflect user experience → revise.
  • Error budget never replenished → SLO too tight or service genuinely failing.
  • Multiple SLOs conflicting → consolidate.
  • Burn rate spike during deploy → expected; tune.
  • Budget remaining negative → budget overspent; pause feature releases until it recovers.
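
A negative remaining budget falls straight out of the request counts. The following is a minimal sketch that assumes good and total request counts over the SLO window are available (for example, exported from Prometheus); the function name is illustrative.

```python
def budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget left over the window; negative means overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, so none of the budget has been spent
    allowed_bad = (1.0 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good                 # failures actually observed
    return 1.0 - actual_bad / allowed_bad
```

With a 99.9% target and one million requests, 1,000 failures are allowed: 500 failures leave about half the budget, while 2,000 failures give roughly -1.0, meaning the budget has been spent twice over.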

When to escalate

  • Agreeing the SLO target with the business: a strategic decision.
  • Validating the SLI definition with users: needs user research.
  • Adopting SLO-as-code: a tooling decision.
