Cutting Escalation Time With AI: Page the Right Expert

The on-call had been stuck on the same wall for twenty-five minutes — same dead ends, same two hypotheses, no new evidence — before anyone said the word “escalate.” Then we paged the wrong team, because the service ownership lived in someone’s memory and that someone was asleep. Another ten minutes gone before the actual owner picked up, scrolled the channel cold, and asked the question we’d answered an hour ago. Escalation is supposed to accelerate resolution, but the dead time between “we’re stuck” and “the right expert is contributing” is one of the most overlooked stretches of MTTR.

Two decisions drive that dead time: when to escalate and to whom. Both are judgment calls a model can sharpen — reading whether the team has genuinely plateaued, and matching the narrowed scope to the most likely owner from your ownership map — while the human keeps control of who actually gets paged.

Escalating late and escalating wrong both cost minutes

Teams lose time at both ends. They escalate too late because nobody wants to admit they’re stuck, so an incident that needed an expert at minute ten doesn’t get one until minute forty. And they escalate wrong because the ownership map isn’t at hand, so the page lands on a team that only hands the incident off again. Closing both gaps is the same friction removal the MTTR category keeps coming back to: get the right person in fast, with context, so they contribute immediately.

A model handed your investigation history and ownership map can read the plateau and route the page, which is the part that’s slow when it depends on tribal knowledge at 3 a.m.
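If you want a crude signal to sit next to the model’s read, “no new evidence in N minutes” is easy to compute from a timeline. The sketch below is a minimal one under stated assumptions: the pipe-separated timeline format and both function names are hypothetical, and the minute values in the usage example are illustrative. It is a tripwire for the IC, not a substitute for the judgment call.

```shell
# Hypothetical timeline format, one line per investigation step:
#   <minutes since incident start>|<new|repeat>|<note>
# "new" means the step produced evidence the team did not have before.

# Print minutes elapsed since the last step that produced new evidence.
# $1 = current minute of the incident; the timeline is read from stdin.
minutes_since_new_evidence() {
  awk -F'|' -v now="$1" '
    BEGIN { last = 0 }
    $2 == "new" && $1 > last { last = $1 }
    END { print now - last }
  '
}

# Exit 0 (plateau suspected) when the gap is at least the threshold.
# $1 = current minute, $2 = threshold in minutes; timeline on stdin.
plateau_suspected() {
  [ "$(minutes_since_new_evidence "$1")" -ge "$2" ]
}
```

Usage might look like `plateau_suspected 26 20 < timeline.txt && echo "raise the escalation question"`, where `timeline.txt` is whatever running log the scribe keeps.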

Ask for an escalate-or-not call plus a routed page

The framing keeps the human on the trigger.

You are advising on escalation, not paging anyone. Here’s what we’ve tried and the results, the narrowed scope, our ownership map, and time elapsed. Tell me: are we still making progress or have we plateaued (cite the evidence)? If we should escalate, who’s the most likely owner for this scope, with a backup? Draft the minimal context packet they need to contribute in their first minute, and a ready-to-send page with severity, the one question for them, and a channel link. Give a fallback if the first owner can’t be reached in 10 minutes.
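At 3 a.m. nobody wants to hand-assemble that prompt, so it helps to have it templated. A minimal sketch, assuming the investigation history and ownership map live in plain-text files; the function name and argument order are hypothetical, and the wording simply mirrors the prompt above.

```shell
# Assemble the escalation-advice prompt from four inputs.
# $1 = file with what was tried and the results (investigation history)
# $2 = narrowed scope (one line of text)
# $3 = file with the ownership map
# $4 = minutes elapsed since the incident started
build_escalation_prompt() {
  cat <<EOF
You are advising on escalation, not paging anyone.

Tried so far, with results:
$(cat "$1")

Narrowed scope: $2

Ownership map:
$(cat "$3")

Time elapsed: $4 minutes

Tell me: are we still making progress or have we plateaued (cite the evidence)?
If we should escalate, who is the most likely owner for this scope, with a backup?
Draft the minimal context packet and a ready-to-send page with severity,
the one question for them, and a channel link.
Give a fallback if the first owner can't be reached in 10 minutes.
EOF
}
```

The output is plain text on stdout, so it can be piped to the clipboard or to whatever assistant you use; nothing here sends a page.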

The output is a routed, ready escalation:

Escalate? Yes. Two repeated dead ends, no new evidence in 20 min, and the scope (Postgres replication lag) is beyond the app on-call’s depth.
Primary: data-platform on-call (owns the Postgres clusters).
Backup: Priya (replication SME).
Context packet: SEV2, checkout latency from replica lag on orders-db; confirmed lag at 8s, ruled out app-side query change. Question for you: is this a replication-stream problem or a disk-IO problem on the replica?
Page: “@data-platform SEV2: checkout degraded by 8s orders-db replica lag, app side ruled out. Need: replica-stream vs disk-IO read. Channel: #inc-4412.”
Fallback: no ack in 10 min → page Priya directly.

The expert arrives knowing the scope and the exact question, contributing in their first minute instead of their tenth.
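The fallback only works if someone is actually watching the clock. A minimal sketch of a timer for the ack window: `wait_for_ack` is a hypothetical helper, and the command it runs to check for an acknowledgement is an assumption that depends on how your paging tool exposes ack status. It only reports that the window expired; whether to page the backup stays with the IC.

```shell
# Wait for an acknowledgement, then report whether the window expired.
# $1 = timeout in seconds (600 for a 10-minute window)
# $2 = seconds between checks
# remaining args = a command that exits 0 once the page has been acknowledged
#                  (hypothetical: depends on your paging tool)
# Returns 0 if acknowledged in time, 1 if the window expired.
wait_for_ack() {
  timeout_s=$1
  interval_s=$2
  shift 2
  waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  return 1
}
```

Something like `wait_for_ack 600 30 check_ack || echo "No ack in 10 min: decide on the backup"` keeps the reminder automatic and the decision human, where `check_ack` stands in for your own status check.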

Verify the owner is actually on-call

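# Confirm who's actually on-call for the routed team before paging.
# The [] in schedule_ids[] is URL-encoded as %5B%5D because curl otherwise
# treats square brackets in a URL as a glob range and rejects the request.
curl -s \
  -H "Authorization: Token token=$PD_TOKEN" \
  -H "Accept: application/vnd.pagerduty+json;version=2" \
  "https://api.pagerduty.com/oncalls?schedule_ids%5B%5D=$DATA_PLATFORM_SCHED" \
  | jq -r '.oncalls[] | "\(.escalation_level): \(.user.summary)"'

# Quick scope confirmation to put in the context packet (read-only).
# Note: this reading also grows when the primary is idle, so pair it with
# a known-recent write before quoting it as replication lag.
kubectl exec -n orders deploy/orders-db-replica -- \
  psql -tc "select now() - pg_last_xact_replay_timestamp() as lag;"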

The ownership-map match is only as good as the map, and maps go stale — rotations change, teams reorg, services get reassigned. So the suggested owner is a candidate to confirm against the live on-call schedule, not a name to page on the model’s word. That one check prevents the classic miss of paging someone who rotated off last week.
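That confirmation can be a one-line gate rather than an eyeball check. A minimal sketch, assuming the on-call list arrives as the "level: name" lines the jq filter above prints; `owner_is_oncall` is a hypothetical helper and the match is a plain exact-name comparison, so it will miss nicknames or differently formatted names.

```shell
# Read "level: name" lines on stdin and exit 0 only if the suggested
# owner appears at some escalation level on the live schedule.
# $1 = the owner name suggested by the model
owner_is_oncall() {
  awk -v want="$1" '
    { name = $0; sub(/^[0-9]+: /, "", name) }
    name == want { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}
```

Piping the on-call query into `owner_is_oncall "Priya" || echo "Not on the live schedule: re-check before paging"` turns a stale map into a visible warning instead of a wasted page.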

Recommend the escalation, don’t reflexively trigger it

The line: the model reads the plateau and drafts the page; the human decides whether and whom to escalate. Escalation has a real cost (pulling someone off other work, waking them), so “yes, escalate” must never become a reflex. The model weighs genuine plateau against mid-progress, but the IC owns that judgment.

Rules I hold to:

Confirm the owner is on-call before paging. A stale map points you confidently at the wrong person and wastes the minutes escalation was meant to save.
Always send the context packet with the page. An expert who has to re-derive the situation contributes late; the ruled-out list is what lets them skip the rediscovery.
Don’t default to escalating. If the team is mid-progress, the right move may be to keep going — the model advises, the IC decides.
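The second rule is the easiest to enforce mechanically. A minimal sketch of a page drafter that refuses to produce anything when part of the context packet is missing: `draft_page` and its six-field layout are hypothetical, loosely modeled on the example page earlier, and it only prints text for a human to review and send.

```shell
# Print a one-line page, or fail if any part of the context packet is missing.
# $1 = team handle, $2 = severity, $3 = summary,
# $4 = what has been ruled out, $5 = the one question, $6 = channel
draft_page() {
  if [ "$#" -ne 6 ]; then
    echo "refusing to draft: expected 6 fields, got $#" >&2
    return 1
  fi
  for field in "$@"; do
    if [ -z "$field" ]; then
      echo "refusing to draft: context packet has an empty field" >&2
      return 1
    fi
  done
  printf '@%s %s: %s, %s. Need: %s. Channel: %s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}
```

Leaving the ruled-out list blank makes the function exit non-zero, which is the point: a page without it sends the expert back through rediscovery.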

You can practice this on the free incident assistant — paste your investigation history and ownership map and ask for the escalate-or-not call plus the routed page, then notice how the context packet changes how fast the next person engages. The prompt library has a hardened escalation-accelerator prompt with the verify-the-owner guardrail built in.

Escalation is meant to speed resolution, but late and misrouted escalations quietly stretch MTTR. AI sharpens the two decisions that matter — when the team has plateaued and who owns the narrowed scope — and ships a context packet so the expert engages immediately. As long as the human confirms the owner is on-call and owns the trigger, you cut the dead time between stuck and solved without paging on autopilot.
