
Error Budget Policy and SLO Response Prompt

Design an error-budget policy and a tiered SLO-breach response after a service suffers repeated incidents — define burn-rate triggers, freeze rules, and the escalation path that converts budget burn into action.

Target user: SRE leads and service owners formalizing reliability policy
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an SRE leader who has used error budgets to stop feature teams from burning reliability into the ground — and to give them freedom when the budget is healthy.

Context I will provide:
- The service, its current SLIs/SLOs (or none yet), and the measurement window
- The recent incident history (frequency, severity, budget impact)
- The org's tolerance for release freezes and who owns the service

Produce a complete error-budget policy:

1. **Set or sanity-check the SLOs** — for each user-facing journey, define an SLI (availability, latency, correctness), a target, and a rolling window (e.g., 28-day). Justify each target against actual user need and recent incident data, not a vanity target such as 99.99%.

2. **Compute the error budget** — translate each SLO into a concrete budget (allowed bad minutes/requests per window). Show the arithmetic.

3. **Burn-rate alerting** — define multi-window, multi-burn-rate alert thresholds (e.g., fast burn: 14.4x over 1h; slow burn: 6x over 6h). Map each threshold to a page or a ticket.

4. **Tiered response policy** — a table mapping each budget state to its required action: budget healthy (ship freely), less than 50% of budget remaining (extra review, prioritize reliability work), budget exhausted (feature freeze, all hands on reliability until recovered). Name who can grant exceptions and how.

5. **Repeat-incident clause** — because incidents on this service keep recurring, add an explicit rule: after N SEV-x incidents in a window, trigger a reliability review and a temporary release gate regardless of remaining budget.

6. **Governance** — who reviews the budget weekly, where it is reported, and how SLOs get revised (and the rule that you do not loosen an SLO just to dodge a freeze).

7. **Adoption plan** — how to roll this out without a mutiny: socialize, run a trial window in report-only mode, then enforce.

Output the policy as a shareable document plus the alerting rules (Prometheus-style) and the response table. Be opinionated about defaults and explicit about every escape hatch.
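
To sanity-check the arithmetic the model returns for steps 2 to 4, a small script can help. The sketch below is illustrative only: the function names, the 99.9% example target, and the state labels are assumptions made for this example, and it covers only a time-based availability SLO.

```python
HOURS_PER_DAY = 24
MINUTES_PER_HOUR = 60


def error_budget_minutes(slo_target, window_days):
    """Allowed bad minutes per window for a time-based availability SLO."""
    window_minutes = window_days * HOURS_PER_DAY * MINUTES_PER_HOUR
    return (1.0 - slo_target) * window_minutes


def burn_rate(budget_fraction, alert_window_hours, window_days):
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * window_days * HOURS_PER_DAY / alert_window_hours


def alert_error_ratio(slo_target, rate):
    """Observed error ratio at which an alert with this burn rate fires."""
    return rate * (1.0 - slo_target)


def budget_state(remaining_fraction):
    """Map remaining budget (1.0 = untouched) to the three tiers in the prompt."""
    if remaining_fraction <= 0:
        return "exhausted"        # feature freeze
    if remaining_fraction < 0.5:
        return "under_half"       # extra review, prioritize reliability work
    return "healthy"              # ship freely


if __name__ == "__main__":
    slo = 0.999  # assumed example target, not a recommendation
    print(f"28-day budget: {error_budget_minutes(slo, 28):.2f} bad minutes")
    print(f"2% of budget in 1h, 30-day window: {burn_rate(0.02, 1, 30):.2f}x")
    print(f"2% of budget in 1h, 28-day window: {burn_rate(0.02, 1, 28):.2f}x")
    print(f"5% of budget in 6h, 30-day window: {burn_rate(0.05, 6, 30):.2f}x")
    print(f"Fast-burn error ratio: {alert_error_ratio(slo, 14.4):.4f}")
    print(f"State at 35% remaining: {budget_state(0.35)}")
```

The burn-rate multipliers depend on the SLO window. The 14.4x fast-burn figure corresponds to spending 2% of a 30-day budget in one hour. On a 28-day window the same 2% works out to 13.44x, so check that the thresholds in the generated policy match the window you actually chose.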
