AI for Incident Response · By James Joyner IV · 9 min read

Designing a Healthy On-Call Rotation That Doesn't Burn People Out

On-call burnout is a design problem, not a willpower problem. A veteran SRE's guide to rotation structure, fair load, health metrics, and using AI to reduce noise.

  • #incident-response
  • #on-call
  • #sre
  • #burnout
  • #rotation
  • #alerting

I’ve been on-call for systems that paged me twice a year and systems that paged me twice a night. The difference between the two wasn’t the technology — it was whether anyone treated the rotation as something to be designed. On-call burnout is almost never a willpower problem. It’s a design problem, and design problems have fixes.

Here’s how I structure rotations so people can sustain them for years instead of quitting after one bad quarter.

Start by measuring the load you actually have

You can’t design a humane rotation if you don’t know how heavy it is. Track, per rotation:

  • Pages per shift, split by day and night.
  • Off-hours pages specifically — these cost the most.
  • Actionable rate — what fraction of pages required a human to actually do something. If half your pages are noise, you have an alerting problem masquerading as a staffing problem.
  • Time to acknowledge and time to resolve.

If you don’t have these numbers, get them before you change anything. Most “we need more people” problems are actually “we need fewer false pages” problems.
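As a rough sketch of what "getting the numbers" can look like, the snippet below computes those four measurements from a list of page records. The `Page` record, the `load_summary` helper, and the 08:00 to 20:00 weekday definition of working hours are all assumptions made for illustration; adapt them to whatever your paging tool actually exports.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Page:
    # Hypothetical record shape; map your paging tool's export onto it.
    fired_at: datetime
    acked_at: datetime
    resolved_at: datetime
    actionable: bool  # did a human actually have to do something?


def is_off_hours(ts: datetime, day_start: int = 8, day_end: int = 20) -> bool:
    # Assumption: weekends and anything outside 08:00-20:00 local time are off-hours.
    return ts.weekday() >= 5 or not (day_start <= ts.hour < day_end)


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def load_summary(pages: list[Page]) -> dict:
    """Summarize one shift's pages: volume, off-hours volume, actionable rate, ack/resolve times."""
    if not pages:
        return {
            "pages": 0,
            "off_hours_pages": 0,
            "actionable_rate": None,
            "median_minutes_to_ack": None,
            "median_minutes_to_resolve": None,
        }
    return {
        "pages": len(pages),
        "off_hours_pages": sum(is_off_hours(p.fired_at) for p in pages),
        "actionable_rate": sum(p.actionable for p in pages) / len(pages),
        "median_minutes_to_ack": median(minutes_between(p.fired_at, p.acked_at) for p in pages),
        "median_minutes_to_resolve": median(minutes_between(p.fired_at, p.resolved_at) for p in pages),
    }
```

Medians are used rather than means so that one marathon incident doesn't hide what a typical page looks like; that is a judgment call, not a rule.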

Rotation structure basics

Length. Weekly rotations are the common default and a good one — long enough to amortize handoff overhead, short enough that a bad week ends. Avoid month-long primary shifts; sustained vigilance is corrosive.

Primary plus secondary. Add a secondary who gets paged only when the primary doesn’t acknowledge within a few minutes. This is your safety net and your “I’m in a tunnel / asleep / lost signal” backstop.
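The escalation rule itself is tiny. Here is a minimal sketch, assuming a five-minute escalation window (the text only says "a few minutes") and a hypothetical `page_targets` helper; real paging tools express this as an escalation policy rather than code.

```python
def page_targets(minutes_unacked: float, escalate_after_minutes: float = 5.0) -> list[str]:
    """Return which roles should be paged for an alert nobody has acknowledged yet."""
    targets = ["primary"]
    # Assumption: the secondary is added once the primary has been silent
    # for the full escalation window.
    if minutes_unacked >= escalate_after_minutes:
        targets.append("secondary")
    return targets
```

The point of keeping the window short is that the secondary is a backstop for an unreachable primary, not a second pair of eyes on every page.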

Follow-the-sun, if you can. With teams in multiple time zones, route night-time pages to people who are awake. Nothing else you do will help sleep as much. Most teams can’t do this, which makes everything below more important.

Minimum bench size. Fewer than four or five people in a rotation and everyone is on-call too often to recover. If you’re below that, the real fix is hiring or merging rotations — not pushing harder.
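The arithmetic behind that floor is worth seeing. The sketch below assumes an even weekly round-robin in which each person holds at most one role per week; `weeks_on_duty_share` is a hypothetical helper, not a standard formula from any tool.

```python
from fractions import Fraction


def weeks_on_duty_share(team_size: int, roles_per_week: int = 2) -> Fraction:
    """Fraction of weeks each person spends holding some on-call role.

    roles_per_week=2 models primary plus secondary; use 1 for primary only.
    """
    if team_size < roles_per_week:
        raise ValueError("not enough people to fill every role each week")
    return Fraction(roles_per_week, team_size)
```

With four people and both roles filled, `weeks_on_duty_share(4)` is 1/2: everyone is carrying a pager every other week. At eight people it drops to 1/4.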

Make it fair, and make the load visible

Resentment kills rotations faster than pages do. Two things keep it fair:

Compensate or comp the time. Whether it’s pay, time off, or reduced project load during the on-call week, on-call is work and should be acknowledged as work. Pretending it’s free is how you lose your best people.

Publish the load. When everyone can see that the burden is shared evenly — and that a particularly brutal week earns recovery time — people tolerate the hard weeks. Hidden, uneven load breeds quiet resentment.
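Publishing the load can start as a small table generated from shift records. This is a sketch under assumptions: the `(person, week, off_hours_pages)` tuple shape, the `load_table` helper, and the threshold of five off-hours pages for a "brutal" week are all made up for illustration.

```python
from collections import defaultdict


def load_table(shifts, recovery_threshold: int = 5):
    """Total off-hours pages per person, plus the shifts heavy enough to earn recovery time.

    shifts: iterable of (person, week_label, off_hours_pages) tuples.
    """
    totals = defaultdict(int)
    recovery_due = []
    for person, week, off_hours_pages in shifts:
        totals[person] += off_hours_pages
        # Assumption: a shift at or above the threshold counts as a brutal week.
        if off_hours_pages >= recovery_threshold:
            recovery_due.append((person, week))
    return dict(totals), recovery_due
```

Whatever threshold you pick, agree on it in advance so recovery time is a rule rather than a favor.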

Health metrics to watch on the rotation itself

Treat the rotation like a system with its own SLOs:

  • Nights of interrupted sleep per shift. Your most important humane metric. More than one or two and the rotation is unsustainable.
  • Page acknowledgment time trending up. A sign of fatigue or alert blindness.
  • Handoff quality. Are ongoing issues actually passed on, or does the new primary walk into a surprise?
  • Voluntary turnover on the team. The slow, expensive signal you never want to be the one tracking.

Review these every month. A rotation that’s quietly degrading shows up in the numbers long before someone hands in their notice.
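Interrupted nights are easy to count once you decide what a night is. The sketch below assumes a night runs from 22:00 to 06:00 local time and attributes early-morning pages to the night that started the previous evening; both the window and the `interrupted_nights` helper are assumptions for illustration.

```python
from datetime import datetime, timedelta


def interrupted_nights(page_times, night_start: int = 22, night_end: int = 6) -> int:
    """Count distinct nights in which at least one page fired."""
    nights = set()
    for ts in page_times:
        if ts.hour >= night_start:
            # Late-evening page: the night is labelled by the date it started on.
            nights.add(ts.date())
        elif ts.hour < night_end:
            # Early-morning page: same night as the previous evening.
            nights.add((ts - timedelta(days=1)).date())
    return len(nights)
```

Three pages in one night count as a single interrupted night here. That is deliberate: the metric tracks lost sleep, not page volume.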

The biggest lever: cut the noise

The fastest way to a healthier rotation is fewer pages, and most pages don’t deserve to be pages. Audit your alerts ruthlessly:

  • Every alert must be actionable — if there’s nothing a human can do, it shouldn’t page; make it a ticket or a dashboard.
  • Every alert should be tied to customer impact or imminent impact. “Disk 80% full” with days of runway is not a 3 AM page.
  • Kill the chronically-firing alerts. An alert that fires every night and gets ignored has trained your team to ignore alerts.

This is unglamorous work, and it’s the single highest-return thing you can do for on-call health.
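A first pass at the audit can be mechanical. The following sketch ranks alerts that fire often but rarely need action; the `(alert_name, was_actionable)` input shape, the `noisy_alerts` helper, and both thresholds are assumptions you should tune against your own data.

```python
from collections import Counter


def noisy_alerts(pages, min_fires: int = 5, max_actionable_rate: float = 0.2):
    """Return (alert_name, fire_count, actionable_rate) for alerts worth demoting or deleting.

    pages: iterable of (alert_name, was_actionable) tuples.
    """
    fires = Counter()
    actionable = Counter()
    for name, was_actionable in pages:
        fires[name] += 1
        if was_actionable:
            actionable[name] += 1
    flagged = [
        (name, count, actionable[name] / count)
        for name, count in fires.items()
        if count >= min_fires and actionable[name] / count <= max_actionable_rate
    ]
    # Loudest offenders first; ties broken by name for a stable report.
    return sorted(flagged, key=lambda row: (-row[1], row[0]))
```

Everything this returns is a candidate for a ticket or a dashboard instead of a page. The list is a starting point for a human review, not a deletion script.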

Where AI helps

AI won’t fix your staffing, but it directly attacks the two things that make shifts brutal: noise and 3 AM cognitive load.

Triage assistance. The responder spends less time decoding the firehose: paste in the alerts and logs, get back a structured summary and a ranked set of read-only diagnostics, and a 3 AM page becomes a ten-minute event instead of an hour-long one.

Alert-quality review. Feed a month of page data to a model and ask it to flag alerts with low actionable rates, frequent flapping, or clustering that suggests one root alert spawning ten. It’s a fast way to find the noise worth killing.

“Here is one month of paging data: alert name, time, whether it was actionable. Identify the top sources of non-actionable pages and the alerts most likely firing at night without customer impact.”
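If the paging data lives in a spreadsheet or export, assembling that prompt takes a few lines. This sketch only builds the text; it deliberately does not call any model API. The column names and the `build_review_prompt` helper are assumptions for illustration.

```python
import csv
import io

REVIEW_INSTRUCTION = (
    "Here is one month of paging data: alert name, time, whether it was actionable. "
    "Identify the top sources of non-actionable pages and the alerts most likely "
    "firing at night without customer impact."
)


def build_review_prompt(rows) -> str:
    """Append paging rows as CSV to the review instruction.

    rows: iterable of (alert_name, fired_at_iso, was_actionable) tuples.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["alert_name", "fired_at", "actionable"])
    for name, fired_at, was_actionable in rows:
        writer.writerow([name, fired_at, "yes" if was_actionable else "no"])
    return f"{REVIEW_INSTRUCTION}\n\n{buffer.getvalue()}"
```

Strip hostnames, customer names, and anything else sensitive from the rows before pasting the result into a tool you don't control.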

We keep incident-response prompts for fast triage that shortens every page, and the Incident Response tool turns symptoms into a safest-first plan so the on-call brain has less to carry.

The standard to hold

A healthy rotation is one a person can run for years without dreading their week. If your team flinches when the on-call calendar comes around, the design is wrong — not the people. Measure the load, cut the noise, share it fairly, and protect sleep above almost everything else.

AI triage assistance reduces on-call load but does not replace human judgment. Always verify recommendations before acting in production.
