AI for Incident Response · By James Joyner IV · 9 min read

Handling SLO and SLA Breaches: From Error Budgets to Customer Credits

An SLO breach is an engineering signal; an SLA breach is a contractual one. Here's how to handle both without panic, and how AI helps assess and communicate them.

  • #incident-response
  • #slo
  • #sla
  • #error-budget
  • #sre
  • #reliability

There’s a moment in a long incident when someone asks, “Are we going to breach the SLA?” and the room goes quiet, because nobody actually knows. They don’t know what the SLA says, whether it’s the same thing as the SLO, or what happens if you blow through it. That confusion costs you — both in the incident and in the awkward customer conversation afterward.

SLOs and SLAs are related but they are not the same, and conflating them is how teams either panic over a number that doesn’t matter or sleepwalk past one that does. Let me untangle them.

SLI, SLO, SLA — the three that get confused

  • SLI (Service Level Indicator) — the actual measurement. “Percentage of HTTP requests that succeed.”
  • SLO (Service Level Objective) — your internal target for that indicator. “99.9% of requests succeed over 30 days.” This is an engineering goal you set for yourselves.
  • SLA (Service Level Agreement) — a contractual promise to a customer, with consequences. “99.5% uptime monthly, or you get a credit.”

The critical insight: your SLO should be stricter than your SLA. The gap between them is your safety margin. If your SLA promises 99.5% and your internal SLO is 99.9%, you start responding to reliability problems long before you’re anywhere near a contractual breach. Set them equal and every SLO miss is also a contract violation — you’ve left yourself no room.

The error budget is the tool, not the SLO

The SLO implies an error budget: the amount of failure you’re allowed before you miss the target. A 99.9% monthly SLO gives you about 43 minutes of downtime a month, which is 0.1% of the 43,200 minutes in a 30-day window. That budget is a resource you spend deliberately.
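The arithmetic behind the 43-minute figure, as a minimal Python sketch. The function name `error_budget_minutes` and the 30-day default window are illustrative assumptions, and the sketch treats the target as time-based availability rather than a request-based rate.

```python
def error_budget_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime implied by an availability target."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


slo_budget = error_budget_minutes(99.9)  # internal SLO
sla_budget = error_budget_minutes(99.5)  # contractual SLA

print(round(slo_budget, 1))               # 43.2
print(round(sla_budget, 1))               # 216.0
print(round(sla_budget - slo_budget, 1))  # 172.8 minutes of safety margin
```

The last line is the gap between the two targets expressed in minutes: the downtime you can absorb after missing the SLO before the SLA is at risk.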

During an incident, the error budget answers “how hard do we push?” If you’ve barely touched the budget this month, a brief degradation is fine — don’t take risky emergency actions over it. If you’re nearly out of budget, every additional minute matters and the calculus changes.

Track it as a simple burn:

Window    Budget    Consumed    Remaining    Burn rate
30 days   43 min    31 min      12 min       2.1x

A burn rate of 1x spends the budget exactly as the window closes; anything above 1x means you’re on pace to exhaust it early. That’s your early-warning signal: fast burn-rate alerts (e.g. “consuming 2% of the 30-day budget in an hour,” a 14.4x burn rate) page you for problems that symptom thresholds alone might miss.
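Burn rate can be computed as the fraction of budget used divided by the fraction of the window elapsed. The sketch below applies that to the row in the table. The table does not say how far into the window the snapshot was taken, so the 10.3 days used here is an assumed value, chosen only because it reproduces the 2.1x figure; the function names are hypothetical.

```python
def burn_rate(consumed_min: float, budget_min: float,
              elapsed_days: float, window_days: float = 30) -> float:
    """Budget fraction used divided by window fraction elapsed."""
    budget_fraction_used = consumed_min / budget_min
    window_fraction_elapsed = elapsed_days / window_days
    return budget_fraction_used / window_fraction_elapsed


def days_until_exhausted(consumed_min: float, budget_min: float,
                         elapsed_days: float) -> float:
    """Days left before the budget hits zero at the average pace so far."""
    minutes_per_day = consumed_min / elapsed_days
    return (budget_min - consumed_min) / minutes_per_day


# Assumption: the snapshot is taken 10.3 days into the 30-day window.
print(round(burn_rate(31, 43, elapsed_days=10.3), 1))  # 2.1
print(days_until_exhausted(31, 43, elapsed_days=10.3))
```

This is the average burn over the whole window so far. Fast burn-rate alerts usually evaluate the same ratio over a short lookback (an hour, say) so that a sudden spike is not diluted by a quiet start to the month.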

When you breach the SLO

An SLO breach is an internal event with no customer obligation attached. The right response is policy, not panic. The most useful policy is an error-budget policy agreed in advance:

  • Budget healthy → ship features at normal pace.
  • Budget exhausted → freeze risky changes, redirect engineering to reliability work until you’re back in budget.

This takes the “should we keep shipping?” argument out of the heat of the moment and makes it a pre-agreed rule. It also gives reliability work the political cover it usually lacks — you’re not asking for permission, you’re following the policy everyone signed off on.
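One way to make the rule genuinely pre-agreed is to write it down as code that a deploy pipeline or dashboard could consult. This is a minimal sketch of the two-state policy above; the `Action` enum and `policy_action` function are hypothetical names, and a real policy would likely add intermediate states and exceptions.

```python
from enum import Enum


class Action(Enum):
    SHIP_NORMALLY = "ship features at normal pace"
    FREEZE_RISKY_CHANGES = "freeze risky changes, redirect to reliability work"


def policy_action(budget_min: float, consumed_min: float) -> Action:
    """Map the remaining error budget to the pre-agreed action."""
    remaining_min = budget_min - consumed_min
    if remaining_min > 0:
        return Action.SHIP_NORMALLY
    # Budget fully spent (or overspent): the freeze applies.
    return Action.FREEZE_RISKY_CHANGES


print(policy_action(43, 31).value)  # budget healthy
print(policy_action(43, 45).value)  # budget exhausted
```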

When you breach the SLA

An SLA breach is a different animal because there’s a contract and money involved. Before it happens, you should already know:

  1. What the SLA measures — uptime? p99 latency? success rate? Measured how, over what window, excluding what (maintenance windows, customer-caused errors)?
  2. What the remedy is — service credits are the usual one. A 99.5% SLA might give 10% credit for 99.0–99.5%, scaling up as availability drops.
  3. Who declares the breach and who approves credits — usually engineering confirms the numbers, account management owns the customer conversation.
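The remedy in item 2 is typically a tiered lookup. In the sketch below, only the 10% credit for 99.0–99.5% comes from the example above; the 25% and 50% tiers are invented placeholders, and the real schedule, including which side of each boundary is inclusive, has to come from the contract.

```python
# (availability floor in percent, credit in percent), highest floor first.
CREDIT_TIERS = [
    (99.5, 0),   # SLA met: no credit
    (99.0, 10),  # tier taken from the example in the text
    (95.0, 25),  # hypothetical tier, not from any real contract
    (0.0, 50),   # hypothetical tier, not from any real contract
]


def credit_percent(availability: float) -> int:
    """Return the service credit owed for a measured monthly availability."""
    if not 0 <= availability <= 100:
        raise ValueError("availability must be between 0 and 100")
    for floor, credit in CREDIT_TIERS:
        # Assumption: each floor is inclusive, so exactly 99.5% meets the SLA.
        if availability >= floor:
            return credit
    raise AssertionError("unreachable: the last tier has a floor of 0")


print(credit_percent(99.7))  # 0
print(credit_percent(99.2))  # 10
```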

During the incident, someone needs to be watching the SLA clock, not just the SLO. The two have different windows and different stakes, and the SLA one comes with a bill.

The post-breach customer conversation

When you’ve breached a contractual SLA, proactive beats reactive. Don’t wait for the customer to file a claim. Reach out with:

  • An acknowledgment of the breach and the measured impact.
  • The credit you’re applying (ideally before they ask).
  • A brief, honest explanation and what you’re doing about it.

A customer who gets a proactive credit and a straight explanation often comes away more confident in you. A customer who has to fight for a credit they’re contractually owed is a customer who’s already shopping for alternatives.

Where AI helps

This is analysis and communication over numbers and text — ideal AI territory, no production access needed.

Assessing the breach. Paste the incident timeline and your SLO/SLA definitions:

“Our SLA promises 99.5% monthly availability measured as successful-request rate, excluding scheduled maintenance. Here’s the incident timeline and the affected request counts. Did we breach the SLO (99.9%) and/or the SLA (99.5%) for this customer this month? Show the math, and tell me the remaining error budget.”

The model is good at the bookkeeping you don’t want to fumble under stress — converting durations and request counts into availability percentages and budget consumed. Give it your real definitions so it uses your exclusions correctly. We keep SLO and error-budget prompts for this.
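It helps to have an independent cross-check for whatever math the model shows. The sketch below computes request-based availability with a maintenance exclusion and compares it against both targets. The `RequestCounts` fields, the `assess` function, and every number in the example are made up for illustration; your own definitions may exclude different things.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    total: int                # all requests in the month
    failed: int               # all failed requests in the month
    excluded_total: int = 0   # requests during scheduled maintenance
    excluded_failed: int = 0  # failures during scheduled maintenance


def assess(c: RequestCounts, slo: float = 99.9, sla: float = 99.5) -> dict:
    """Availability after exclusions, checked against the SLO and the SLA."""
    eligible = c.total - c.excluded_total
    eligible_failed = c.failed - c.excluded_failed
    if eligible <= 0:
        raise ValueError("no eligible requests after exclusions")
    availability = 100 * (eligible - eligible_failed) / eligible
    allowed_failures = eligible * (100 - slo) / 100  # SLO budget in requests
    return {
        "availability_percent": availability,
        "slo_breached": availability < slo,
        "sla_breached": availability < sla,
        "budget_remaining_requests": allowed_failures - eligible_failed,
    }


# Illustrative numbers only.
result = assess(RequestCounts(total=10_000_000, failed=32_000,
                              excluded_total=200_000, excluded_failed=2_000))
print(result)
```

With these example counts the availability works out to roughly 99.69%, which misses the 99.9% SLO but stays above the 99.5% SLA: the situation the safety margin exists for.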

Drafting the policy and the customer note. Have it draft an error-budget policy you can adapt, and the proactive credit communication once you’ve confirmed the numbers.

One guardrail: the model can do the arithmetic, but a human confirms the contractual interpretation before any credit is promised. SLA language has edge cases — measurement windows, exclusions, definitions of “downtime” — that you don’t want a model deciding unilaterally.

The mindset

SLOs and SLAs aren’t bureaucracy — they’re the language that connects engineering reliability to business commitments. Get the SLO stricter than the SLA, treat the error budget as a real resource, agree the policy before the incident, and handle breaches as a process rather than a fire drill. Do that and the question “are we going to breach?” gets a calm, numerate answer instead of a quiet room.

If you want the structured version — paste your definitions and timeline, get the breach math and a draft customer note — that’s part of what the AI Incident Response Assistant is built to do.

Generated breach assessments and communications are assistive, not authoritative. Always confirm contractual interpretation and credit amounts with a human before committing to a customer.
