Tracking SLO Breaches and Error Budgets During Incidents

Twenty minutes into a degradation, someone on the bridge asked the question that should drive every incident decision but almost never gets answered in real time: “how much of our error budget are we burning right now?” Nobody knew. We had SLOs defined in a doc somewhere, but translating “checkout is at 94 percent success for the last twelve minutes” into “we have just spent a third of our monthly budget” required math nobody could do under pressure. That gap between having SLOs and using them during incidents is where AI has become genuinely useful for my team.

SLOs that only exist on paper

Plenty of teams define SLOs and then never look at them during the moments that matter. The objectives sit in a wiki, reviewed quarterly, completely disconnected from the live incident where they should be informing urgency. The reason is friction: computing real-time error-budget burn from current metrics is fiddly, and under incident stress nobody has the bandwidth to do it.

The consequence is that incident urgency gets decided by vibes instead of budget. A loud incident with little budget impact gets all the attention while a quiet one silently exhausts the month’s allowance. AI helps close this gap by doing the synthesis and arithmetic that humans cannot do reliably mid-incident.

Real-time burn synthesis

During an incident, I give a tool like Claude the SLO definition, the budget that was left when the incident started, and the current and recent success rates, and ask it to estimate how fast we are burning budget and how long until exhaustion at the current rate. Instead of guessing, the bridge now hears “at the current 6 percent error rate, we exhaust the monthly budget in roughly forty minutes.”

That number changes the conversation immediately. It converts “this seems bad” into “we have forty minutes before we breach, so the rollback decision needs to happen now.” The model is doing arithmetic and synthesis, not making the call — but the synthesis is what makes the call possible.
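The arithmetic behind a sentence like that is simple enough to check by hand or in a script, which is worth doing before the number goes on the bridge. Here is a minimal sketch, assuming a request-based SLO, roughly constant traffic, and an error rate that holds steady. The function and field names are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class BurnEstimate:
    burn_rate: float              # multiple of the sustainable error rate
    minutes_to_exhaustion: float  # if the current error rate holds


def estimate_burn(
    slo_target: float,          # e.g. 0.999 for a 99.9 percent SLO
    window_minutes: float,      # length of the SLO window, e.g. 30 days
    budget_remaining: float,    # fraction of the budget still unspent
    current_error_rate: float,  # e.g. 0.06 for 6 percent failures
) -> BurnEstimate:
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # Error rate that would spend exactly the whole budget over the window.
    allowed_error_rate = 1.0 - slo_target
    burn_rate = current_error_rate / allowed_error_rate
    if burn_rate <= 0:
        return BurnEstimate(burn_rate=0.0, minutes_to_exhaustion=float("inf"))
    # Assumes constant traffic: a burn rate of N spends the budget N times
    # faster than the window allows.
    minutes_left = budget_remaining * window_minutes / burn_rate
    return BurnEstimate(burn_rate=burn_rate, minutes_to_exhaustion=minutes_left)


# Illustrative numbers: 99.9 percent SLO, 30-day window, 5 percent of budget left.
est = estimate_burn(0.999, 30 * 24 * 60, 0.05, 0.06)
print(f"burn rate {est.burn_rate:.0f}x, breach in {est.minutes_to_exhaustion:.0f} min")
```

With those illustrative inputs the estimate comes out at a burn rate of about 60 and roughly 36 minutes to exhaustion. If traffic is far from constant, the estimate should be weighted by request volume instead.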

Pro Tip: Have the model express burn in time-to-exhaustion, not just percentage consumed. “We have spent 40 percent of the budget” is abstract under stress; “we breach in 38 minutes at this rate” creates the urgency and the deadline that drive a clear decision.

Distinguishing the budget that matters

Not all SLO breaches are equal, and AI helps the bridge keep that straight. A breach on an internal batch-processing SLO is very different from a breach on the customer-facing checkout SLO. When several services are degraded at once, I ask the model to rank the budget impact by business criticality so the team focuses its limited attention where the burn actually hurts.

This ranking, grounded in the SLO definitions I provide, keeps the incident response proportionate. It stops the team from pouring effort into a noisy but low-stakes breach while the one that triggers customer SLA credits burns quietly. The human commander prioritizes; the model supplies the structured comparison.
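The structured comparison can be as plain as a sort. The sketch below assumes each SLO carries a criticality tier that the team assigned in advance. The three-level scale and the service names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloStatus:
    name: str
    criticality: int              # hypothetical scale: 3 = customer-facing, 1 = internal
    minutes_to_exhaustion: float  # from the current burn estimate


def rank_by_impact(statuses: list[SloStatus]) -> list[SloStatus]:
    # Most critical tier first; within a tier, the soonest breach first.
    return sorted(statuses, key=lambda s: (-s.criticality, s.minutes_to_exhaustion))


degraded = [
    SloStatus("batch-processing", criticality=1, minutes_to_exhaustion=5),
    SloStatus("checkout", criticality=3, minutes_to_exhaustion=40),
    SloStatus("search", criticality=2, minutes_to_exhaustion=15),
]
for status in rank_by_impact(degraded):
    print(status.name, status.criticality, status.minutes_to_exhaustion)
```

Here checkout ranks first even though the batch SLO breaches sooner, which is the proportionality the ranking is meant to protect. The ordering is an input to the commander, not a decision.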

Connecting budget to the rollback call

Error-budget burn is the cleanest input to the roll-back-or-hold decision, and pairing it with AI synthesis makes that decision faster. When the model reports that we will breach in ten minutes without intervention, the bridge knows it cannot wait for a perfect diagnosis — it acts on the best available hypothesis. When budget is healthy and burn is slow, there is room to investigate more carefully before acting.

I tie this to our live monitoring alerts so the burn estimate reflects current signals rather than a stale snapshot. The combination of real telemetry and AI synthesis gives the commander a grounded basis for the urgency of every decision.

AI computes, humans decide and act

The boundary is firm. AI computes and synthesizes budget impact; humans decide what to do about it. The model can tell you that you will breach in twelve minutes. It does not get to declare a budget freeze, halt deployments, page leadership, or initiate a rollback. Those are decisions and actions with real organizational consequences, and they belong to the people who own the SLOs.

I am especially wary of automation that would let a budget-tracking system take actions like blocking deploys on its own judgment. A deterministic policy (“freeze deploys when budget is below 10 percent”) is a fine engineering control. But an LLM estimating burn and then acting on its own estimate combines two sources of uncertainty into one autonomous decision, which is exactly what you do not want touching production. The model informs; humans decide. The free AI Incident Response Assistant keeps strictly to the informing side.
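For contrast, here is what the deterministic kind of control can look like: a rule that people agreed on ahead of time, applied to a measured value with no model in the loop. This is a minimal sketch. The function name is hypothetical, and the 10 percent threshold is just the example figure from the policy quoted above.

```python
FREEZE_THRESHOLD = 0.10  # example policy: freeze below 10 percent budget remaining


def should_freeze_deploys(budget_remaining: float,
                          threshold: float = FREEZE_THRESHOLD) -> bool:
    """Return True when the measured budget fraction is below the threshold.

    budget_remaining is the fraction of budget left, taken from monitoring.
    It may be negative if the budget is already overspent.
    """
    return budget_remaining < threshold


print(should_freeze_deploys(0.08))  # below the threshold
print(should_freeze_deploys(0.50))  # healthy budget
```

The same input always gives the same answer, and the threshold is a number humans chose and can audit. That is the property an LLM acting on its own estimate lacks.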

Communicating budget impact afterward

Once the incident is over, error-budget impact is a key part of the story leadership and customers need. I ask the model to draft a clear summary of how much budget the incident consumed and what that means for the rest of the period — material that goes into the internal review and informs whether to slow the release cadence. As always, a human reviews and owns this before it goes anywhere.
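The figures in that summary are worth computing from request counts rather than taking from the draft, so the reviewer has something to check it against. Here is a minimal sketch, assuming a request-based SLO and that the failed-request count for the incident window is available from monitoring. All names and numbers are illustrative.

```python
def incident_budget_impact(
    slo_target: float,               # e.g. 0.999
    window_requests: float,          # expected requests over the whole SLO window
    failed_requests: float,          # failures attributed to the incident
    budget_remaining_before: float,  # fraction of budget left before the incident
) -> tuple[float, float]:
    """Return (fraction of budget consumed, fraction remaining afterward)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # Total failures the SLO tolerates over the window.
    total_budget = (1.0 - slo_target) * window_requests
    consumed = failed_requests / total_budget
    return consumed, budget_remaining_before - consumed


consumed, remaining = incident_budget_impact(0.999, 43_200_000, 10_800, 0.60)
print(f"incident consumed {consumed:.0%} of the budget, {remaining:.0%} remains")
```

With these made-up inputs the incident accounts for about a quarter of the window's budget. A negative remainder means the SLO is already breached for the period.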

I keep these budget-analysis prompts standardized in my prompt workspace so the framing is consistent across incidents, which makes month-over-month budget trends actually comparable. The prompts library has SRE-flavored templates worth adapting.

Conclusion

SLOs are only useful if they inform decisions in the moment, and the math that connects live metrics to error-budget burn is exactly what humans cannot do reliably under incident pressure. Use AI to compute real-time burn, express it as time-to-exhaustion, rank impact by business criticality, and draft the post-incident budget story. Then keep every decision and action — freezes, rollbacks, escalations — firmly in human hands. The model does the arithmetic; people own the SLOs and the calls. More SRE-grounded incident practices live in the incident-response category, with reusable templates in our prompt packs.
