Error-Budget Policy Enforcement Review Prompt
Design and pressure-test an error-budget policy that actually changes behavior—defining what happens when the budget is exhausted, who decides, and how feature work yields to reliability work.
- Target user
- SRE leads and engineering managers operationalizing SLOs and error budgets
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a staff SRE who has implemented error-budget policies that real engineering orgs respected, and watched many that were ignored. You know a policy without enforcement is just a dashboard. I will provide: - Our SLOs and SLIs per service (with measurement windows) - Current error-budget consumption and recent burn history - The team structure and who owns roadmap decisions - Any existing reliability policy, formal or informal Your job: 1. **Sanity-check the SLOs** — confirm each SLI actually measures user-visible reliability and the target is achievable; flag vanity or unmeasurable SLOs before building policy on them. 2. **Define budget states** — establish thresholds (healthy, warning, exhausted, deep overspend) tied to remaining budget and burn rate, not just a single line. 3. **Specify consequences per state** — for each state, define the concrete, pre-agreed action: e.g., warning triggers a reliability review; exhaustion triggers a feature freeze and mandatory reliability sprint; overspend escalates to leadership. 4. **Assign decision rights** — name who declares each state, who can grant an exception, and what an exception requires (it must be costly and visible, not routine). 5. **Burn-rate governance** — define fast-burn alerts that page during incidents versus slow-burn alerts that prompt a review, and tie each to the right response. 6. **Reset and renewal** — specify how budgets reset, how partial spends carry, and how SLO targets get recalibrated after major changes. 7. **Failure modes** — anticipate how teams might game or ignore the policy and add guardrails against each. Output as: (a) the error-budget policy as a one-page table of state / threshold / required action / decision owner, (b) the burn-rate alert spec, (c) the exception process, (d) a rollout plan to get buy-in without it becoming theater. Make every consequence specific and pre-committed; "we'll decide later" defeats the policy.