You are a senior SRE who has implemented SLO-based alerting per Google's SRE Workbook: multi-window burn rate, error budgets, fewer but more meaningful pages.

I will provide:
- The service and its SLO target (e.g., 99.9% over 30d)
- Current alerting model
- Goal: design SLO alerts / migrate from threshold-based

Your job:

1. **SLO basics**:
   - **SLI**: Service Level Indicator (latency, availability)
   - **SLO**: target (e.g., 99.9% availability over 30 days)
   - **Error budget**: 0.1% over 30d = ~43 min downtime allowed
   - **Burn rate**: how fast the budget is being consumed (1× spends exactly the whole budget over the SLO window)

2. **Multi-window burn alert** (recommended pattern):
   - Two windows: short (recent) + long (sustained)
   - Both must be burning for the alert to fire
   - Fast burn: catches sudden spikes (paging)
   - Slow burn: catches sustained issues (ticket)

3. **Burn rate calculation** (30-day budget):
   - 14.4× burn rate over 1h = consumes 2% of the monthly budget
   - 6× burn rate over 6h = consumes 5% in 6 hours
   - 3× burn rate over 24h = consumes 10% in a day

4. **For Apdex / latency SLO**:
   - "X% of requests served in Y seconds"
   - Use a histogram for tracking

5. **For availability SLO**:
   - "X% successful requests"
   - sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

6. **For recording rules**:
   - Pre-compute the SLI per window
   - Express burn rate as a ratio
   - Combine the windows in the multi-window alert

7. **For dashboard**:
   - SLO compliance over time
   - Error budget remaining
   - Burn rate gauge

8. **For SLO tooling**:
   - Sloth, OpenSLO, Pyrra for SLO-as-code

Mark DESTRUCTIVE: setting SLO too high (constant alerting), removing multi-window check (flaky alerts), changing SLO during incident.

---

Service + SLO: [DESCRIBE]
Current alerting: [DESCRIBE]
Goal: [DESCRIBE]

Why this prompt works

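SLO-based alerts are the modern SRE standard. This prompt walks through the design step by step: SLI definition, SLO target, error budget, and multi-window burn-rate alerts.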

How to use it

Define SLI precisely.
Set SLO based on user need.
Implement multi-window burn.
Track error budget.
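
To make the "track error budget" step concrete, here is a minimal Python sketch of how an SLO target translates into allowed downtime. The function name `error_budget_minutes` is illustrative, and the calculation assumes the budget is spent as full outage, which is a simplification.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full outage the SLO tolerates within the window.

    Assumes every minute of budget is spent as total downtime; partial
    degradation spends the budget more slowly.
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # 99.9% over 30 days: the "~43 min" figure quoted in the prompt.
    print(f"99.9%  / 30d: {error_budget_minutes(0.999, 30):.1f} min")
    # One more nine cuts the budget by a factor of ten.
    print(f"99.99% / 30d: {error_budget_minutes(0.9999, 30):.2f} min")
```

Each extra nine in the target divides the budget by ten, which is why the target should follow user need rather than ambition.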

Useful commands

# SLI: availability (req success ratio)
sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Error budget over month (99.9% target)
1 - 0.999

# Burn rate
(1 - sli_value) / (1 - slo_target)
# > 1 means burning faster than allowed
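
The burn-rate multipliers used in the alerts (14.4, 6, 3) follow from simple arithmetic: the burn rate that spends a given fraction of the budget within a window is that fraction times the SLO period divided by the window. The sketch below works this out in Python; the helper names are illustrative and a 30-day SLO period is assumed.

```python
HOURS_IN_SLO_PERIOD = 30 * 24  # assumption: 30-day SLO period = 720 hours


def burn_rate_for(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * HOURS_IN_SLO_PERIOD / window_hours


def error_ratio_threshold(burn_rate: float, slo_target: float) -> float:
    """Error ratio that corresponds to a burn rate: burn_rate * (1 - SLO)."""
    return burn_rate * (1.0 - slo_target)


def hours_to_exhaustion(burn_rate: float) -> float:
    """Hours until the whole budget is gone if the burn rate stays constant."""
    return HOURS_IN_SLO_PERIOD / burn_rate


if __name__ == "__main__":
    # (budget fraction, long window in hours) for fast, medium, slow burn
    for fraction, window in [(0.02, 1), (0.05, 6), (0.10, 24)]:
        rate = burn_rate_for(fraction, window)
        print(
            f"{fraction:.0%} in {window}h -> burn rate {rate:.1f}x, "
            f"error ratio > {error_ratio_threshold(rate, 0.999):.4f}, "
            f"budget gone in {hours_to_exhaustion(rate):.0f}h"
        )
```

For a 99.9% target this yields error-ratio thresholds of 0.0144, 0.006, and 0.003, which is what `14.4 * 0.001`, `6 * 0.001`, and `3 * 0.001` express in the alert rules.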

Multi-window burn alert pattern

groups:
- name: slo-alerts
  rules:
  # Recording rules
  - record: slo:http_errors:rate5m
    expr: sum(rate(http_requests_total{code=~"5.."}[5m]))
  - record: slo:http_requests:rate5m
    expr: sum(rate(http_requests_total[5m]))
  - record: slo:http_error_ratio:rate5m
    expr: slo:http_errors:rate5m / slo:http_requests:rate5m

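  # Short windows (30m, 2h) are referenced by the medium and slow burn alerts below
  - record: slo:http_error_ratio:rate30m
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[30m]))
        / sum(rate(http_requests_total[30m]))

  - record: slo:http_error_ratio:rate1h
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))

  - record: slo:http_error_ratio:rate2h
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[2h]))
        / sum(rate(http_requests_total[2h]))

  - record: slo:http_error_ratio:rate6h
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[6h]))
        / sum(rate(http_requests_total[6h]))

  - record: slo:http_error_ratio:rate24h
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[24h]))
        / sum(rate(http_requests_total[24h]))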

  # Fast burn (2% budget in 1h) — paging
  - alert: ErrorBudgetBurnFast
    expr: |
      (
        slo:http_error_ratio:rate5m > (14.4 * 0.001)
        and
        slo:http_error_ratio:rate1h > (14.4 * 0.001)
      )
    for: 2m
    labels:
      severity: critical
      slo: 99.9
    annotations:
      summary: "Error budget burning fast (2% in 1h)"
      runbook: "https://runbooks.example.com/error-budget-fast"

  # Medium burn (5% in 6h) — paging
  - alert: ErrorBudgetBurnMedium
    expr: |
      (
        slo:http_error_ratio:rate30m > (6 * 0.001)
        and
        slo:http_error_ratio:rate6h > (6 * 0.001)
      )
    for: 15m
    labels:
      severity: critical
      slo: 99.9
    annotations:
      summary: "Error budget burning at medium rate (5% in 6h)"

  # Slow burn (10% in 24h) — ticket
  - alert: ErrorBudgetBurnSlow
    expr: |
      (
        slo:http_error_ratio:rate2h > (3 * 0.001)
        and
        slo:http_error_ratio:rate24h > (3 * 0.001)
      )
    for: 1h
    labels:
      severity: warning
      slo: 99.9
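
The `and` in each alert expression is what makes the pattern multi-window: the long window shows that the burn is significant, and the short window shows that it is still happening. A small Python sketch of that logic, with a hypothetical `burn_alert_fires` helper and made-up error ratios, may help when reasoning about edge cases:

```python
def burn_alert_fires(
    short_ratio: float,
    long_ratio: float,
    burn_rate: float,
    slo_target: float = 0.999,
) -> bool:
    """Mirror of the rule logic: both windows must exceed the threshold."""
    threshold = burn_rate * (1.0 - slo_target)
    return short_ratio > threshold and long_ratio > threshold


if __name__ == "__main__":
    # Illustrative values only; the fast-burn threshold is 14.4 * 0.001 = 0.0144.
    scenarios = {
        "ongoing incident (both windows hot)": (0.05, 0.02),
        "brief blip (long window still low)": (0.05, 0.005),
        "recovered (short window clean again)": (0.0, 0.02),
    }
    for name, (short_ratio, long_ratio) in scenarios.items():
        fires = burn_alert_fires(short_ratio, long_ratio, burn_rate=14.4)
        print(f"{name}: fires={fires}")
```

Only the first scenario fires. The blip is filtered by the long window, and the recovered case stops alerting quickly because the short window drops below the threshold well before the long one does.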

Common findings this catches

Constant pages → SLO unrealistic.
No long-window check → flaky single-window alerts.
SLI doesn’t reflect user experience → revise.
Error budget never replenished → SLO too tight or service genuinely failing.
Multiple SLOs conflicting → consolidate.
Burn rate spike during deploy → often expected; tune the alert windows or thresholds.
Budget remaining negative → budget overspent; halt feature releases.
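
Several of these findings come down to one number: how much of the error budget is left. As a rough sketch, it can be derived from good and total request counts over the SLO window. The function name and the request counts below are invented for illustration.

```python
def budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget left; a negative value means overspent."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical 30-day counts for a service with a 99.9% target.
    healthy = budget_remaining(good=9_996_000, total=10_000_000, slo_target=0.999)
    overspent = budget_remaining(good=9_985_000, total=10_000_000, slo_target=0.999)
    print(f"healthy service:   {healthy:+.0%} of budget left")
    print(f"overspent service: {overspent:+.0%} of budget left")
    if overspent < 0:
        print("budget overspent: halt feature releases until it recovers")
```

With 10 million requests and a 99.9% target, 10,000 failures are allowed; 4,000 failures leave about 60% of the budget, while 15,000 failures put it at about -50%.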

When to escalate

SLO target: agree it with the business (strategic decision).
SLI definition: validate it with users (needs research).
SLO-as-code adoption: a tooling decision.

SLO Error Budget & Multi-Window Burn Rate Alerts Prompt

Related prompts

Alert Fatigue Reduction Strategy Prompt

Prometheus Alert Rule Generator Prompt

PromQL Histogram & Quantile Calculation Prompt
