AI for Incident Response · By James Joyner IV · 8 min read

Designing Incident Escalation Policies That Actually Reach Someone

An escalation policy fails the moment a page goes unanswered. A veteran SRE's guide to tiers, timeouts, fallbacks, and using AI to route the right severity faster.

  • #incident-response
  • #escalation
  • #on-call
  • #sre
  • #paging
  • #alerting

There’s a specific kind of nightmare in this job: a SEV1 firing into the void because the on-call person’s phone was on silent, the secondary was never configured, and the page just… sat there. Twenty minutes of customer impact accrued before anyone human saw it. The system worked exactly as designed — the design just had a hole in it.

An escalation policy exists for one job: guarantee that an incident reaches a human who can act, no matter who’s unavailable. After 25 years of relying on these policies, here’s how to build one without holes.

The core principle: never depend on one human

Any escalation path that ends with a single person is one silenced phone away from failure. Every tier needs a fallback, and the whole chain needs a final backstop that someone is contractually, unambiguously responsible for answering.

Build it in tiers with timeouts

A working escalation policy is a chain of “if no response in N minutes, try the next thing”:

  1. Primary on-call. Page immediately. If acknowledged, done.
  2. Secondary on-call. If the primary doesn’t ack within, say, 5 minutes, page the secondary automatically.
  3. Team lead / manager. If the secondary doesn’t ack within its own timeout either, escalate up automatically.
  4. Backstop. A final, always-staffed tier — an incident manager, a follow-the-sun ops team — that will answer.

The exact timeouts depend on your severity. A SEV1 might escalate every 3–5 minutes; a SEV3 can be far more relaxed. The point is that unanswered pages move automatically. Manual escalation (“let me text my manager”) wastes the exact minutes you can’t afford.
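To make the chain concrete, here is a minimal sketch of the tiers as data, with a small simulation of who ends up holding the page. The tier names, the 5-minute timeouts, and the `simulate` helper are illustrative assumptions, not the behavior of any particular paging product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    ack_timeout_min: int  # minutes to wait for an ack before paging the next tier


# Hypothetical SEV1 chain; names and timeouts are illustrative only.
SEV1_CHAIN = [
    Tier("primary", 5),
    Tier("secondary", 5),
    Tier("team-lead", 5),
    Tier("backstop", 0),  # always-staffed final tier, nothing after it
]


def simulate(chain, unreachable):
    """Return (tier_name, minute_paged) for the first reachable tier.

    Assumes a reachable tier acks as soon as it is paged. Returns
    (None, minutes_elapsed) if the page falls off the end of the chain.
    """
    elapsed = 0
    for tier in chain:
        if tier.name not in unreachable:
            return tier.name, elapsed
        elapsed += tier.ack_timeout_min
    return None, elapsed


print(simulate(SEV1_CHAIN, unreachable=set()))                     # ('primary', 0)
print(simulate(SEV1_CHAIN, unreachable={"primary", "secondary"}))  # ('team-lead', 10)
```

Running the simulation with every tier marked unreachable returns `None` as the tier name, which is exactly the hole the backstop tier is there to close.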

Escalate by severity, not just by time

Time-based escalation handles the unanswered page. Severity-based escalation handles the scope. A SEV1 should pull in more people, faster, and notify leadership early — not because they’ll fix it, but because they need to know and may need to make business calls (customer comms, regulatory notification). Bake severity into the policy:

  • SEV1: page the primary and notify the secondary immediately; bring in the incident commander (IC) and leadership early.
  • SEV2: page the primary, escalate to the secondary on timeout, and keep the owning team’s lead informed.
  • SEV3/4: normal on-call, business-hours handling, no leadership noise.
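One way to bake severity into the policy is a plain lookup table that the paging automation reads. The sketch below mirrors the three bullets above; the role names, the `SeverityRule` fields, and the `first_actions` helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityRule:
    page_now: tuple = ()        # roles paged immediately and expected to ack
    notify_now: tuple = ()      # roles informed immediately, not expected to ack
    on_timeout: tuple = ()      # roles paged if nobody acks in time
    business_hours_only: bool = False


# Hypothetical role names; "incident-commander" is the IC from the SEV1 bullet.
RULES = {
    "SEV1": SeverityRule(
        page_now=("primary",),
        notify_now=("secondary", "incident-commander", "leadership"),
    ),
    "SEV2": SeverityRule(
        page_now=("primary",),
        notify_now=("team-lead",),
        on_timeout=("secondary",),
    ),
    "SEV3": SeverityRule(page_now=("primary",), business_hours_only=True),
    "SEV4": SeverityRule(page_now=("primary",), business_hours_only=True),
}


def first_actions(severity):
    """List the (action, role) pairs that fire the moment an incident opens."""
    rule = RULES[severity]
    pages = [("page", role) for role in rule.page_now]
    notices = [("notify", role) for role in rule.notify_now]
    return pages + notices


for sev in ("SEV1", "SEV3"):
    print(sev, first_actions(sev))
```

Keeping the table as data rather than branching logic means a severity change is a one-line edit that can be reviewed like any other config change.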

Functional escalation: reaching the right expertise

Two kinds of escalation exist, and people conflate them:

  • Hierarchical escalation goes up — to get authority and resources.
  • Functional escalation goes sideways — to get the right expertise.

When the on-call generalist hits the limit of what they know, they need a clean path to the database expert or the network owner. Pre-define these subject-matter escalation contacts per system, or your responder spends precious minutes hunting for who understands the storage layer.
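A pre-defined expert map can be as simple as a dictionary keyed by system. This is a rough sketch with made-up system and contact names; the default of falling back to the incident commander when a system has no entry is an assumption, not a rule from this guide.

```python
# Hypothetical system-to-expert map; every name here is illustrative.
EXPERTS = {
    "postgres": ["db-oncall", "db-lead"],
    "storage": ["storage-oncall"],
    "network": ["network-oncall", "network-lead"],
}


def functional_contacts(system, default=("incident-commander",)):
    """Return the ordered list of experts to try for a system."""
    return list(EXPERTS.get(system, default))


def systems_without_fallback():
    """Systems whose functional path ends with a single person."""
    return sorted(name for name, contacts in EXPERTS.items() if len(contacts) < 2)


print(functional_contacts("postgres"))   # ['db-oncall', 'db-lead']
print(functional_contacts("dns"))        # ['incident-commander']
print(systems_without_fallback())        # ['storage']
```

The second helper applies the core principle to the sideways path too: an expert list with one name on it is still a single point of failure.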

Test it, because it will rot

Escalation policies fail silently. The phone number changed, the person left, the integration broke — and you find out during the SEV1. Defend against it:

  • Run periodic test pages through the full chain, including the fallback tiers. If a tier doesn’t fire, you found a hole on a calm day.
  • Audit on every personnel change. Someone leaving the team should trigger an escalation-policy review.
  • Check the acknowledgment path actually works on each responder’s device. A page that doesn’t break through Do Not Disturb is not a page.
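Part of this audit can be scripted. The sketch below assumes the policy is available as a list of dictionaries and that the date of the last successful test page is recorded per tier; the field names, the sample people, and the 30-day threshold are all hypothetical.

```python
from datetime import date, timedelta


def audit(policy, active_staff, today, max_test_age_days=30):
    """Return human-readable findings for tiers that look stale or broken."""
    findings = []
    for tier in policy:
        name = tier["name"]
        contacts = tier["contacts"]
        if not contacts:
            findings.append(f"{name}: no contacts configured")
        for person in contacts:
            if person not in active_staff:
                findings.append(f"{name}: {person} is no longer on the roster")
        last = tier.get("last_test_page")
        if last is None or today - last > timedelta(days=max_test_age_days):
            findings.append(
                f"{name}: no successful test page in the last {max_test_age_days} days"
            )
    return findings


# Illustrative data only.
policy = [
    {"name": "primary", "contacts": ["ana"], "last_test_page": date(2024, 5, 20)},
    {"name": "secondary", "contacts": ["raj"], "last_test_page": date(2024, 1, 10)},
    {"name": "backstop", "contacts": [], "last_test_page": None},
]

for finding in audit(policy, active_staff={"ana", "li"}, today=date(2024, 6, 1)):
    print(finding)
```

A script like this only catches what is visible in the data. It does not replace sending a real test page, because it cannot tell whether a phone actually rings through Do Not Disturb.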

Where AI helps

Escalation is mostly automation plumbing, but AI sharpens the human edges of it.

Routing the right severity faster. A SEV1-grade escalation is wasted on a SEV3, and a SEV3-grade escalation leaves a SEV1 under-staffed. Faster, more accurate severity assessment means the right escalation path fires. Paste the symptoms into a model and get a structured severity read in seconds, so the responder isn’t guessing whether to wake leadership.
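If the severity read comes back as JSON, it is worth validating before anything acts on it. The sketch below assumes a reply shaped like `{"severity": "...", "rationale": "..."}`; that shape is an assumption for illustration, not the output format of any specific tool. Anything malformed is rejected so the decision falls back to the responder.

```python
import json

ALLOWED_SEVERITIES = {"SEV1", "SEV2", "SEV3", "SEV4"}


def parse_severity_read(raw):
    """Return a validated suggestion dict, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    severity = data.get("severity")
    rationale = data.get("rationale")
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        return None
    if not isinstance(rationale, str) or not rationale.strip():
        return None
    # Marked advisory: a human still decides who gets paged.
    return {"severity": severity, "rationale": rationale.strip(), "advisory": True}


print(parse_severity_read('{"severity": "SEV2", "rationale": "Checkout errors at 4%"}'))
print(parse_severity_read("not json"))  # None
```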

Pointing at the right expert. Describe the failing component and ask a model to reason about which subsystem is involved, and therefore which kind of expertise the incident likely needs. That gives the on-call generalist a hint for functional escalation when they are stuck.

Drafting the escalation message. When you do page up the chain, the recipient needs context fast. A model can turn the incident state into a tight “here’s what’s happening, here’s why I’m escalating, here’s what I need from you” summary so the woken-up senior person ramps in seconds, not minutes.
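The three-part summary can be enforced before any model is involved. This sketch builds the text you would hand to a model, or send as-is if no model is available. The function name, field labels, and instruction wording are illustrative assumptions, and no model API is called here.

```python
def build_escalation_brief(severity, happening, why_escalating, need_from_you):
    """Assemble the three-part context plus a drafting instruction."""
    fields = {
        "Severity": severity,
        "What's happening": happening,
        "Why I'm escalating": why_escalating,
        "What I need from you": need_from_you,
    }
    # Refuse to build a brief with holes in it; the recipient needs all parts.
    missing = [label for label, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    facts = "\n".join(f"{label}: {value.strip()}" for label, value in fields.items())
    instruction = (
        "Rewrite the facts below as an escalation page of at most four sentences. "
        "Keep every fact, add none, and end with the specific ask."
    )
    return f"{instruction}\n\n{facts}"


print(build_escalation_brief(
    severity="SEV1",
    happening="Checkout API is returning errors for a share of requests.",
    why_escalating="Primary and secondary have not acknowledged.",
    need_from_you="Join the incident channel and take over as incident lead.",
))
```

Requiring all four fields up front keeps the responder from paging a senior person with a bare "something is wrong" message.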

We keep incident-response prompts for severity assessment and escalation summaries, and the Incident Response tool produces the structured assessment that tells you whether this is an escalate-to-leadership event or a handle-it-yourself one.

The standard

Your escalation policy passes if a real page reaches a capable human within minutes even when the first, second, and third responders are all unreachable. If you can’t say that with confidence, you have a hole, and it will surface during an incident, which is the worst possible time to learn about it. Test the chain on a calm day.

AI severity and escalation suggestions are advisory. A human owns the decision of who to wake and when.
