AI for Incident Response · By James Joyner IV · 8 min read

Onboarding New Engineers to On-Call Without Throwing Them to the Wolves

Putting a new engineer on the pager cold is how you create panic and turnover. Here's a structured on-call onboarding path that builds real confidence.

  • #incident-response
  • #on-call
  • #onboarding
  • #sre
  • #team-health
  • #mentorship

The worst on-call onboarding I ever saw was a single Slack message: “You’re on the rotation starting Monday, good luck.” The engineer’s first page was a SEV1 at 4am for a system they’d never seen. They survived it, barely, and then started interviewing elsewhere. We didn’t lose them to the incident — we lost them to being abandoned during it.

On-call is a learned skill, not a status you assign. Throwing someone onto the pager cold is both cruel to them and risky for your systems. Here’s how to build a path that turns a nervous newcomer into a calm responder.

The principle: graduated exposure

You don’t teach someone to swim by pushing them off the high dive. On-call onboarding works the same way — controlled, increasing exposure with a safety net that’s removed deliberately, not by accident. The path has four stages.

Stage 1: Shadow

The new engineer joins the rotation as a shadow — they get every page the primary gets, but they have zero responsibility to act. They watch how the primary triages, what dashboards they open, how they communicate, when they escalate.

The key here is the debrief. After each incident the shadow watched, the primary spends ten minutes walking through their reasoning: “Here’s why I looked at the database first, here’s why I didn’t restart.” Shadowing without debriefs is just watching someone type. The debrief is where the learning lives.

Stage 2: Reverse-shadow (primary with a net)

Now flip it: the new engineer is primary, but an experienced engineer shadows them as backup. The newcomer drives — acknowledges the page, triages, decides — but knows someone seasoned is right there to catch a mistake or take over if it gets scary.

This is the most important stage and the one teams skip. It’s where confidence is actually built, because the engineer makes real decisions on real incidents with a net underneath. Stay here longer than feels necessary.

Stage 3: Primary, daytime-weighted

Put them on the live rotation but stack the deck in their favor early: weight their first solo shifts toward business hours when the whole team is awake and reachable, and make sure a strong secondary is always paired with them. Their first 3am page should not be their first solo page.
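
If your schedule is generated by a script, the Stage 3 rule can be written down as a small eligibility check. This is only a sketch: the `Shift` shape, the 09:00 to 17:00 definition of business hours, the `daytime_only_until` threshold, and the names in the example are all assumptions for illustration, not part of any scheduling tool.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: "business hours" means shifts starting 09:00-16:59 local time.
BUSINESS_HOURS = range(9, 17)


@dataclass(frozen=True)
class Shift:
    label: str
    start_hour: int            # 0-23, local time
    secondary: Optional[str]   # who is paired as secondary, if anyone


def newcomer_can_take(shift, solo_shifts_done, strong_secondaries,
                      daytime_only_until=2):
    """Return True if a Stage 3 engineer may be primary on this shift."""
    # A strong secondary is required on every Stage 3 shift.
    if shift.secondary not in strong_secondaries:
        return False
    # Until a few solo shifts are done, only daytime shifts qualify.
    if solo_shifts_done < daytime_only_until:
        return shift.start_hour in BUSINESS_HOURS
    return True


# Hypothetical example data.
shifts = [
    Shift("Tue day", 9, "priya"),
    Shift("Tue night", 22, "priya"),
    Shift("Wed day", 10, None),
]
strong = {"priya"}

print([s.label for s in shifts if newcomer_can_take(s, 0, strong)])
```

With no solo shifts completed, only the daytime shift with a strong secondary qualifies. Once `solo_shifts_done` reaches the threshold, the night shift opens up too, so the first 3am page is never the first solo page.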

Stage 4: Full rotation

They’re a full member, taking nights and weekends like everyone else, and now they’re equipped to mentor the next newcomer through the same stages. The cycle continues.

What they need before stage 1

Graduated exposure only works if the fundamentals are in place first. Before a new engineer shadows their first shift, they should have:

  • Access, verified. Every system, dashboard, and tool — tested before the first shift, not discovered to be broken at 3am. A locked-out responder is useless.
  • The paging tool configured. Phone app installed, notifications set to override do-not-disturb, escalation path understood.
  • A map of the systems. A walkthrough of the architecture, the dependency and blast-radius map, and where the critical paths are.
  • The runbooks. Where they live, how to use them, and ideally a dry run of one or two on a non-incident day.
  • The escalation path. Who to call when they’re stuck, and explicit permission to call them. “Escalating early is correct behavior” needs to be said out loud.
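
One way to make "access, verified" more than a promise is a tiny pre-flight runner that the newcomer executes before their first shadow shift. The sketch below assumes each check is a function returning `True` or `False`. The check names and the lambdas are hypothetical stand-ins, since the real probes depend entirely on your paging tool, dashboards, and VPN setup.

```python
from typing import Callable, Dict, List


def run_preflight(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises (timeout, permission denied) is a failure.
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Hypothetical stand-ins: replace each lambda with a real probe.
checks = {
    "test page received": lambda: True,
    "dashboards load": lambda: True,
    "production read access": lambda: False,
    "VPN / bastion login": lambda: True,
}

failed = run_preflight(checks)
if failed:
    print(f"Fix before first shift: {failed}")
else:
    print("Ready to shadow")
```

The point of returning the full list of failures, rather than stopping at the first one, is that the newcomer can hand the whole list to whoever grants access in a single request.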

The onboarding checklist

Make it concrete. A checklist beats good intentions:

On-Call Onboarding — [Name]

Access & tooling
[ ] Paging tool installed, test page received
[ ] All dashboards accessible
[ ] Production read access verified
[ ] VPN / bastion access tested

Knowledge
[ ] Architecture walkthrough completed
[ ] Dependency / blast-radius map reviewed
[ ] Runbook location & format understood
[ ] Two runbooks dry-run on a calm day
[ ] Severity definitions reviewed
[ ] Escalation path memorized

Practice
[ ] Shadowed >= 2 weeks (with debriefs)
[ ] Reverse-shadowed >= 2 incidents
[ ] Daytime-weighted solo shift completed

Ready for full rotation: [ ]  Signed off by: ______
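
If you would rather track this in a repo than in a doc, the same checklist can live as data. The following is a minimal sketch under stated assumptions: the `OnboardingChecklist` class and its field names are invented here for illustration, and only a few of the items above are shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OnboardingChecklist:
    name: str
    items: Dict[str, bool] = field(default_factory=dict)
    signed_off_by: str = ""   # empty until a mentor explicitly signs off

    def outstanding(self) -> List[str]:
        """Items that are still unchecked."""
        return [item for item, done in self.items.items() if not done]

    def ready_for_full_rotation(self) -> bool:
        # Every box ticked AND an explicit human sign-off.
        return not self.outstanding() and bool(self.signed_off_by)


# Hypothetical engineer and a subset of the checklist items.
checklist = OnboardingChecklist(
    name="Sam",
    items={
        "Paging tool installed, test page received": True,
        "Two runbooks dry-run on a calm day": True,
        "Reverse-shadowed >= 2 incidents": False,
    },
)

print(checklist.outstanding())
print(checklist.ready_for_full_rotation())
```

Note that ticking every box is deliberately not enough on its own: `ready_for_full_rotation()` stays false until `signed_off_by` is filled in by a person.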

Practice incidents beat real ones

You don’t have to wait for a real outage to train. Run practice incidents — pick a past incident, replay the alerts and symptoms, and let the new engineer work it with a mentor watching. They get the reps without the stakes.

This is a great place to use AI as a low-stakes drill partner. The newcomer can paste symptoms and ask for a triage plan, then compare it to what the mentor would do:

“I’m practicing incident triage. Symptom: checkout p99 latency jumped to 4s at 02:10, error rate normal. Here are the recent deploys and the dependency map. Give me a triage plan: top 3 hypotheses ordered by likelihood, and the single read-only command to confirm or rule out each.”

It teaches the structure of triage — hypothesize, confirm with read-only commands, work safest-first — without anyone touching production. We keep a set of triage-practice prompts aimed at exactly this kind of drill.
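
To keep drills consistent from one newcomer to the next, the prompt above can be assembled from a past incident's details. This is a sketch only: `build_triage_prompt` is a hypothetical helper, and the deploy and dependency entries in the example are made up.

```python
from typing import List


def build_triage_prompt(symptom: str, recent_deploys: List[str],
                        dependencies: List[str]) -> str:
    """Assemble a practice-triage prompt from a replayed incident."""
    deploys = "\n".join(f"- {d}" for d in recent_deploys) or "- (none listed)"
    deps = "\n".join(f"- {d}" for d in dependencies) or "- (none listed)"
    return (
        "I'm practicing incident triage.\n"
        f"Symptom: {symptom}\n"
        f"Recent deploys:\n{deploys}\n"
        f"Dependency map:\n{deps}\n"
        "Give me a triage plan: top 3 hypotheses ordered by likelihood, "
        "and the single read-only command to confirm or rule out each."
    )


# Made-up drill data based on the symptom in the example prompt.
prompt = build_triage_prompt(
    symptom="checkout p99 latency jumped to 4s at 02:10, error rate normal",
    recent_deploys=["checkout-api deployed at 01:55"],
    dependencies=["checkout-api -> payments-db", "checkout-api -> cache"],
)
print(prompt)
```

The newcomer pastes the result into whatever assistant the team uses, then compares the plan against what the mentor would have done.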

Sign-off should be a real decision

The transition from “onboarding” to “full rotation” should be an explicit sign-off by a mentor, not a calendar event. The question to answer: “Would I be comfortable with this person as the only responder at 3am?” If the honest answer is no, they stay in reverse-shadow longer. There’s no shame in that. The shame is in handing someone a pager they’re not ready for and calling it their problem when it goes wrong.

Why this matters beyond kindness

A rushed on-call onboarding doesn’t just hurt the new engineer — it hurts your reliability. An unprepared responder makes slower, riskier decisions, escalates too late, and is more likely to turn a small incident into a big one. Structured onboarding is a reliability investment disguised as a people investment.

If you want a low-stakes drill partner for practice incidents and a structured triage method to teach, that’s part of what the AI Incident Response Assistant is built around.

Generated triage plans are assistive and meant for practice and review. Always verify recommendations against real systems before acting on them.
