Error Budget Policy and SLO Response Prompt
Design an error-budget policy and a tiered SLO-breach response after a service suffers repeated incidents — define burn-rate triggers, freeze rules, and the escalation path that converts budget burn into action.
- Target user: SRE leads and service owners formalizing reliability policy
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are an SRE leader who has used error budgets to stop feature teams from burning reliability into the ground, and to give them freedom when the budget is healthy.

Context I will provide:
- The service, its current SLIs/SLOs (or none yet), and the measurement window
- The recent incident history (frequency, severity, budget impact)
- The org's tolerance for release freezes and who owns the service

Produce a complete error-budget policy:
1. **Set or sanity-check the SLOs.** For each user-facing journey, define an SLI (availability, latency, correctness), a target, and a rolling window (e.g., 28-day). Justify each target against actual user need and recent incident data, not a vanity 99.99%.
2. **Compute the error budget.** Translate each SLO into a concrete budget (allowed bad minutes or requests per window). Show the arithmetic.
3. **Burn-rate alerting.** Define multi-window, multi-burn-rate alert thresholds (e.g., fast burn: 14.4x over 1h; slow burn: 3x over 6h). Map each threshold to a page or a ticket.
4. **Tiered response policy.** Provide a clear table mapping budget state to required action: budget healthy (ship freely), less than 50% of budget remaining (extra review, prioritize reliability work), budget exhausted (feature freeze, all hands on reliability until recovered). Name who can grant exceptions and how.
5. **Repeat-incident clause.** Because this service has had recurring incidents, add an explicit rule: after N SEV-x incidents in a window, trigger a reliability review and a temporary release gate regardless of remaining budget.
6. **Governance.** State who reviews the budget weekly, where it is reported, and how SLOs get revised (including the rule that you do not loosen an SLO just to dodge a freeze).
7. **Adoption plan.** Explain how to roll this out without a mutiny: socialize the policy, run a trial window in report-only mode, then enforce.

Output the policy as a shareable document plus the alerting rules (Prometheus-style) and the response table. Be opinionated about defaults and explicit about every escape hatch.
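
Worked example (a reader's aid, not part of the prompt text): steps 2 to 4 ask for arithmetic that is easy to check by hand before trusting a model's output. The Python sketch below shows one way to do that check. It assumes a 99.9% target over a 28-day window purely for illustration. The names `Slo`, `burn_rate`, `budget_consumed`, and `budget_state`, and the tier labels, are hypothetical and come from no particular library.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60
HOURS_PER_DAY = 24


@dataclass(frozen=True)
class Slo:
    """An SLO target over a rolling window (illustrative names)."""
    target: float      # e.g. 0.999 means 99.9% of minutes/requests must be good
    window_days: int   # rolling window length, e.g. 28

    @property
    def budget_fraction(self) -> float:
        """Share of minutes/requests allowed to be bad: 1 - target."""
        return 1.0 - self.target

    def budget_minutes(self) -> float:
        """Allowed bad minutes per window for a time-based SLI."""
        return self.budget_fraction * self.window_days * MINUTES_PER_DAY

    def budget_requests(self, total_requests: int) -> float:
        """Allowed bad requests per window for a request-based SLI."""
        return self.budget_fraction * total_requests


def burn_rate(observed_error_ratio: float, slo: Slo) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return observed_error_ratio / slo.budget_fraction


def budget_consumed(rate: float, alert_window_hours: float, slo: Slo) -> float:
    """Fraction of the window's budget spent if `rate` holds for the alert window."""
    return rate * alert_window_hours / (slo.window_days * HOURS_PER_DAY)


def budget_state(remaining_fraction: float) -> str:
    """Map remaining budget to the three tiers in step 4 (labels are assumptions)."""
    if remaining_fraction <= 0.0:
        return "exhausted"   # feature freeze
    if remaining_fraction < 0.5:
        return "at_risk"     # extra review, prioritize reliability work
    return "healthy"         # ship freely


if __name__ == "__main__":
    slo = Slo(target=0.999, window_days=28)  # assumed example values
    print(f"budget: {slo.budget_minutes():.2f} bad minutes per 28 days")
    for label, rate, hours in [("fast burn", 14.4, 1), ("slow burn", 3.0, 6)]:
        spent = budget_consumed(rate, hours, slo)
        print(f"{label}: {rate}x over {hours}h spends {spent:.1%} of the budget")
    print(budget_state(remaining_fraction=0.35))
```

Under these assumptions the budget comes to about 40.32 bad minutes per window. The prompt's example thresholds spend roughly 2.1% (14.4x over 1h) and 2.7% (3x over 6h) of that budget before the alert fires. Comparing these figures with the model's answer is a quick way to confirm that it actually showed its arithmetic.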