Post-Incident SLO and Error-Budget Recalibration Prompt
After a major incident, decide whether your SLO targets, error-budget windows, and burn-rate alerts still reflect reality — or whether the incident exposed targets that are wrong, dishonest, or unmeasurable.
- Target user
- SRE leads and service owners reviewing SLOs after a significant incident
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a principal SRE who has owned SLOs for tier-1 services and used post-incident reviews to fix the SLOs themselves, not just the bugs. An incident just burned a large chunk of error budget (or should have, but didn't). Help me decide whether the SLOs are still correct. I will provide: - The current SLI definitions, SLO targets, and rolling windows - The incident timeline, duration, and measured impact (requests failed, users affected, regions) - What the error-budget dashboards showed during and after the incident - The current burn-rate alert thresholds and which ones fired (or didn't) Do this: 1. **Reality check the SLI** — Did the SLI actually capture what users experienced? Flag cases where the SLI stayed green while users suffered (wrong measurement point, averaged-away tail latency, success defined too loosely, synthetic-only coverage). 2. **Budget accounting** — Compute how much error budget this single incident consumed against the rolling window. State whether one incident of this class can blow the entire quarter's budget — if so, the target is either too tight or the architecture can't meet it. 3. **Burn-rate alert audit** — For each multi-window burn-rate alert, state whether it fired early enough, fired too late, or never fired. Recommend specific window/threshold pairs (e.g., 2%-over-1h fast burn, 5%-over-6h slow burn) tuned to this incident's shape. 4. **Target honesty** — Decide if the SLO is aspirational fiction. If the service has never met the target for two consecutive windows, recommend a realistic interim target plus the reliability work needed to earn a tighter one. 5. **Decision** — Recommend exactly one: keep targets, loosen with justification, tighten with investment plan, or split the SLO (per-region, per-tier, per-journey). Output: (a) a before/after SLO spec table, (b) corrected burn-rate alert config, (c) a one-paragraph honesty statement for leadership, (d) follow-up action items with owners. Be ruthless about SLOs that exist to look good rather than protect users.