
Post-Incident SLO and Error-Budget Recalibration Prompt

After a major incident, decide whether your SLO targets, error-budget windows, and burn-rate alerts still reflect reality — or whether the incident exposed targets that are wrong, dishonest, or unmeasurable.

Target user: SRE leads and service owners reviewing SLOs after a significant incident
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a principal SRE who has owned SLOs for tier-1 services and used post-incident reviews to fix the SLOs themselves, not just the bugs. An incident just burned a large chunk of error budget (or should have, but didn't). Help me decide whether the SLOs are still correct.

I will provide:
- The current SLI definitions, SLO targets, and rolling windows
- The incident timeline, duration, and measured impact (requests failed, users affected, regions)
- What the error-budget dashboards showed during and after the incident
- The current burn-rate alert thresholds and which ones fired (or didn't)

Do this:

1. **Reality check the SLI** — Did the SLI actually capture what users experienced? Flag cases where the SLI stayed green while users suffered (wrong measurement point, averaged-away tail latency, success defined too loosely, synthetic-only coverage).

2. **Budget accounting**: Compute the share of the rolling window's error budget that this single incident consumed. State whether one incident of this class can exhaust the entire window's budget. If it can, either the target is too tight or the architecture cannot meet it.

3. **Burn-rate alert audit**: For each multi-window burn-rate alert, state whether it fired early enough, fired too late, or never fired. Recommend specific window/threshold pairs (e.g., fast burn: 2% of the budget consumed in 1h; slow burn: 5% of the budget consumed in 6h) tuned to this incident's shape.

4. **Target honesty** — Decide if the SLO is aspirational fiction. If the service has never met the target for two consecutive windows, recommend a realistic interim target plus the reliability work needed to earn a tighter one.

5. **Decision** — Recommend exactly one: keep targets, loosen with justification, tighten with investment plan, or split the SLO (per-region, per-tier, per-journey).

Output: (a) a before/after SLO spec table, (b) corrected burn-rate alert config, (c) a one-paragraph honesty statement for leadership, (d) follow-up action items with owners.

Be ruthless about SLOs that exist to look good rather than protect users.
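
To sanity-check the arithmetic the model returns for steps 2 and 3, the two calculations can be sketched in a few lines of Python. This is a minimal sketch and not part of the prompt: the `Slo` class, the function names, and every number (a 99.9% target, a 30-day window, 100 million requests, 45,000 failures) are hypothetical placeholders, and it assumes a request-based SLI with roughly uniform traffic across the window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO description; field names are illustrative."""
    target: float      # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int   # rolling window length in days


def budget_consumed(slo: Slo, failed: int, total_in_window: int) -> float:
    """Fraction of the window's error budget used by `failed` bad requests."""
    allowed_failures = (1.0 - slo.target) * total_in_window
    return failed / allowed_failures


def burn_rate_threshold(slo: Slo, budget_fraction: float, alert_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `alert_hours`.

    A burn rate of 1.0 spends exactly the whole budget over the full window.
    """
    window_hours = slo.window_days * 24
    return budget_fraction * window_hours / alert_hours


def error_rate_threshold(slo: Slo, burn_rate: float) -> float:
    """Error ratio an alert should compare against for a given burn rate."""
    return burn_rate * (1.0 - slo.target)


# Placeholder inputs, not taken from any real incident.
slo = Slo(target=0.999, window_days=30)
used = budget_consumed(slo, failed=45_000, total_in_window=100_000_000)
fast = burn_rate_threshold(slo, budget_fraction=0.02, alert_hours=1)
slow = burn_rate_threshold(slo, budget_fraction=0.05, alert_hours=6)

print(f"budget consumed by incident: {used:.0%}")
print(f"fast-burn rate (2% in 1h):   {fast:.1f}")
print(f"slow-burn rate (5% in 6h):   {slow:.1f}")
print(f"fast-burn error ratio:       {error_rate_threshold(slo, fast):.4f}")
```

With these placeholder inputs the incident alone uses about 45% of the window's budget, and the 2%-in-1h and 5%-in-6h pairs translate to burn rates of 14.4 and 6.0. Substitute your own target, window, and request counts before comparing the result against what the model proposes.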
