Identifying and Eliminating Toil with AI: An SRE Playbook

Toil is the work that keeps a system running but produces no lasting value — manual, repetitive, automatable, and scaling linearly with the service. The SRE book named it; every ops team drowns in it. The hard part isn’t believing toil is bad. It’s seeing it clearly enough to prioritize, and then actually building the automation. AI helps with both, more than you’d expect. Here’s the playbook I run.

First, define toil so you can count it

If you can’t measure it, you’ll automate the loud annoyance instead of the expensive one. Google’s definition gives you a checklist. Work is toil if it’s:

Manual — a human runs it by hand.
Repetitive — done over and over, the same way.
Automatable — a machine could do it.
Tactical, not strategic — reactive interrupt work, not design.
Without enduring value — the service is the same afterward.
Scales with the service — more traffic/hosts means more of it.

Score tasks against that list. A task hitting five of six is prime automation material. A task hitting two is probably just engineering work that feels annoying.
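The scoring can be sketched in a few lines. This is a minimal illustration, not a standard tool: the class and function names are made up for this example, and the "judgment call" band for scores of three or four is an assumption, since the checklist only speaks to the high and low ends.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToilChecklist:
    """One flag per criterion in the six-item checklist."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_with_service: bool


def toil_score(task: ToilChecklist) -> int:
    """Count how many of the six criteria the task meets."""
    return sum(1 for f in fields(task) if getattr(task, f.name))


def verdict(score: int) -> str:
    """Map a score to a rough recommendation."""
    if score >= 5:
        return "prime automation material"
    if score <= 2:
        return "probably just engineering work"
    # Assumption: the middle band is not covered by the checklist itself.
    return "judgment call"


# Hypothetical task: restarting a stuck worker by hand.
restart_stuck_worker = ToilChecklist(
    manual=True,
    repetitive=True,
    automatable=True,
    tactical=True,
    no_enduring_value=True,
    scales_with_service=False,
)
score = toil_score(restart_stuck_worker)
print(score, verdict(score))
```

Keeping the criteria as named booleans makes the score auditable: anyone can see which box a task failed to tick, not just the total.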

Find the toil hiding in plain sight

Toil is invisible precisely because it’s habitual — nobody logs “spent 20 minutes manually restarting the stuck worker again.” So you have to go looking. Three sources reveal it:

Shell history and runbooks. The commands you run by hand, repeatedly, are toil with a neon sign. AI is genuinely useful for mining this. Feed it (scrubbed) shell history, or the ticket queue those commands were run for, and ask it to cluster. For tickets, the prompt looks like this:

“Here is a week of (anonymized) on-call ticket titles and resolution notes. Cluster them into recurring task types. For each cluster, estimate frequency, whether resolution looks identical across tickets, and how automatable it appears. Flag the top 5 candidates for automation.”

The model is good at spotting the pattern you’ve gone blind to — the same three-step fix applied 14 times under 14 different ticket titles.

A toil log. For two weeks, have the team tag interrupt work with a one-word category. Crude, but it surfaces the distribution. You’ll usually find a long tail and two or three fat clusters that eat most of the hours.
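Tallying the log is simple enough to script. The sketch below assumes each entry is a (category, minutes) pair; recording minutes alongside the one-word tag is an assumption added here so the distribution can be expressed in time rather than counts. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def summarize_toil_log(entries):
    """Return [(category, total_minutes, share_of_total)], largest first.

    entries: iterable of (category, minutes) pairs from the toil log.
    """
    totals = defaultdict(int)
    for category, minutes in entries:
        # Normalize tags so "Restart" and "restart" land in the same bucket.
        totals[category.strip().lower()] += minutes

    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (category, minutes, minutes / grand_total if grand_total else 0.0)
        for category, minutes in ranked
    ]


# Hypothetical sample of tagged interrupt work.
sample_log = [
    ("restart", 20),
    ("Restart", 25),
    ("certs", 10),
    ("disk", 40),
]

for category, minutes, share in summarize_toil_log(sample_log):
    print(f"{category:10s} {minutes:4d} min  {share:.0%}")
```

Sorting by total minutes rather than by count is deliberate: a cluster that fires rarely but eats an hour each time can outrank a frequent two-minute task.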

Alert noise. Alerts that fire and get the same human response every time are toil pretending to be incidents. Those are your fastest wins — see event-driven automation for how to act on them.
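One rough way to find those alerts, assuming you can export a list of (alert name, response taken) pairs from your paging or ticketing tool. The export shape, the function name, and the min_fires threshold are all assumptions for illustration.

```python
from collections import defaultdict


def same_response_alerts(events, min_fires=3):
    """Find alerts that fired at least min_fires times with one identical response.

    events: iterable of (alert_name, response) pairs.
    Returns {alert_name: (response, fire_count)}.
    """
    responses = defaultdict(list)
    for alert_name, response in events:
        # Normalize whitespace and case so trivial differences do not hide a match.
        responses[alert_name].append(" ".join(response.lower().split()))

    candidates = {}
    for alert_name, seen in responses.items():
        if len(seen) >= min_fires and len(set(seen)) == 1:
            candidates[alert_name] = (seen[0], len(seen))
    return candidates


# Hypothetical export from an on-call tool.
sample_events = [
    ("batch-worker-down", "Restart batch-worker"),
    ("batch-worker-down", "restart batch-worker"),
    ("batch-worker-down", "restart  batch-worker"),
    ("disk-full", "purge old logs"),
    ("disk-full", "resize volume"),
    ("disk-full", "purge old logs"),
]
print(same_response_alerts(sample_events))
```

Exact matching after normalization is crude; free-text responses that mean the same thing but are worded differently will be missed, which is where handing the same export to a model for clustering can help.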

Prioritize by ROI, not by annoyance

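Not all toil is worth automating. The test is honest math, with every term in the same unit (hours work well) and the one-time build cost set against the number of periods you expect the automation to stay in use:

automation_value = periods * (time_per_run * runs_per_period * people_affected
                              - maintenance_cost_per_period)
                   - build_cost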

The classic trap is the XKCD one: spending three weeks automating a task that takes four minutes a month. Rank candidates by hours saved per quarter against build-plus-maintenance cost. Automate the boring frequent thing, not the interesting rare thing.
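Here is the arithmetic as a small function, worked for the trap above. It assumes every cost is expressed in hours, that time per run is given in minutes, and that you evaluate over a chosen number of periods so the one-time build cost and the per-period terms can be compared. The 120-hour figure assumes three 40-hour weeks, and the second example uses made-up numbers.

```python
def automation_value(
    time_per_run_min,
    runs_per_period,
    people_affected,
    build_cost_hours,
    maintenance_hours_per_period,
    periods=1,
):
    """Net hours saved over `periods`; negative means the automation costs more than it saves."""
    hours_saved = (time_per_run_min / 60) * runs_per_period * people_affected * periods
    hours_spent = build_cost_hours + maintenance_hours_per_period * periods
    return hours_saved - hours_spent


# The trap: a 4-minute monthly task, three weeks (assumed 120 hours) to automate,
# evaluated over 12 monthly periods with no maintenance at all.
trap = automation_value(4, 1, 1, 120, 0, periods=12)

# A hypothetical better candidate: a 20-minute fix run 14 times a month,
# one day to build, half an hour of upkeep a month, evaluated over 3 months.
candidate = automation_value(20, 14, 1, 8, 0.5, periods=3)

print(f"trap: {trap:.1f} h, candidate: {candidate:.1f} h")
```

Even with a full year of payback and zero maintenance, the trap case comes out deeply negative, while the boring frequent task pays for itself inside a quarter.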

A quick prioritization grid:

Frequency   Effort to automate   Verdict
High        Low                  Automate now
High        High                 Automate, schedule it
Low         Low                  Automate if quick
Low         High                 Leave it; document it
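If you want to apply the grid to a list of candidates, it reduces to a lookup. The cutoffs for what counts as "high" frequency or "high" effort are not defined by the grid, so the defaults below (12 runs per quarter, 16 hours of build effort) are placeholder assumptions to replace with your own.

```python
GRID = {
    ("high", "low"): "Automate now",
    ("high", "high"): "Automate, schedule it",
    ("low", "low"): "Automate if quick",
    ("low", "high"): "Leave it; document it",
}


def grid_verdict(
    runs_per_quarter,
    effort_hours,
    high_frequency_cutoff=12,  # assumption: roughly weekly or more counts as "high"
    high_effort_cutoff=16,     # assumption: more than two working days counts as "high"
):
    """Classify a candidate into one of the four grid cells."""
    frequency = "high" if runs_per_quarter >= high_frequency_cutoff else "low"
    effort = "high" if effort_hours > high_effort_cutoff else "low"
    return GRID[(frequency, effort)]


print(grid_verdict(runs_per_quarter=40, effort_hours=4))
```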

Use AI to draft the automation — then review hard

Once you’ve picked a target, AI compresses the build. It’s good at turning a described manual procedure into a first-draft script:

“I currently fix this by hand: SSH to the worker host, check systemctl status batch-worker, if it’s failed I clear /var/run/batch.lock and restart it, then tail the log for 30 seconds to confirm. Write an idempotent bash script that does this safely, with set -euo pipefail, a dry-run flag, input validation, and clear logging. Do not include any step I didn’t describe.”

That last sentence matters: left open, the model may add a “helpful” rm -rf cleanup you never asked for. Treat the output as a draft from a fast junior engineer: read every line, confirm the dry-run flag and the idempotency actually work, and test it in non-prod before it touches anything real. AI writing the automation does not remove the review; it relocates the work from typing to reviewing, which is a good trade.
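For a sense of what a reviewed draft might look like, here is a sketch of the same procedure in Python rather than bash. It makes several assumptions: it runs on the worker host itself (no SSH step), the function and parameter names are invented for illustration, and because the log path is not given, the confirmation step waits 30 seconds and checks systemctl is-active instead of tailing a log. The command runner is injectable so the logic can be exercised without touching a real service.

```python
import os
import subprocess
import time

UNIT = "batch-worker"
LOCK_PATH = "/var/run/batch.lock"


def run(cmd):
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(cmd, check=False).returncode


def fix_worker(dry_run=True, run_cmd=run, lock_path=LOCK_PATH,
               confirm_seconds=30, sleep=time.sleep, log=print):
    """Clear the stale lock and restart the unit, only if the unit is failed."""
    # `systemctl is-failed` exits 0 only when the unit is in the failed state.
    if run_cmd(["systemctl", "is-failed", "--quiet", UNIT]) != 0:
        log(f"{UNIT} is not failed; nothing to do")
        return "noop"  # idempotent: a second run after a fix lands here

    if dry_run:
        log(f"dry-run: would remove {lock_path} and restart {UNIT}")
        return "dry-run"

    if os.path.exists(lock_path):
        log(f"removing {lock_path}")
        os.remove(lock_path)

    log(f"restarting {UNIT}")
    if run_cmd(["systemctl", "restart", UNIT]) != 0:
        return "restart-failed"

    # Assumption: stand-in for "tail the log for 30 seconds to confirm".
    sleep(confirm_seconds)
    if run_cmd(["systemctl", "is-active", "--quiet", UNIT]) == 0:
        log(f"{UNIT} is active after {confirm_seconds}s")
        return "recovered"
    return "still-unhealthy"
```

Calling fix_worker() with no arguments defaults to dry-run, so the destructive path has to be requested explicitly with dry_run=False. That default is a design choice worth keeping in whatever draft the model hands you.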

Make the automation safe and durable

Toil-killing scripts have a way of becoming load-bearing infrastructure nobody owns. Avoid that:

Idempotent and dry-run-capable. Safe to run twice; able to show what it would do first.
In version control, with review. A toil script edited live on a box is new toil.
Owned and monitored. If it fails silently, the toil comes back plus a debugging session.
Documented in the runbook it replaces, so the next person knows the automation exists.

Close the loop: measure the toil you removed

The whole point is to drive toil down, so track it. Re-run your toil log a month after automating a cluster and confirm those hours actually disappeared rather than morphing into “maintaining the automation.” If maintenance cost exceeds the toil you removed, you automated the wrong thing — retire it and move on. Honest measurement is what keeps automation from becoming its own toil.
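The before-and-after check is a subtraction, but writing it down keeps the comparison honest. A minimal sketch, assuming you have hours per month for the cluster from both toil logs plus the hours spent maintaining the automation over the same month; the function name and the numbers are hypothetical.

```python
def toil_delta(before_hours, after_hours, maintenance_hours):
    """Compare toil removed with the upkeep the automation now costs.

    Returns (hours_removed, net_hours, decision).
    """
    hours_removed = before_hours - after_hours
    net_hours = hours_removed - maintenance_hours
    # Retire only when upkeep exceeds the toil actually removed.
    decision = "retire" if maintenance_hours > hours_removed else "keep"
    return hours_removed, net_hours, decision


# Hypothetical cluster: 12 h/month before, 2 h/month after, 3 h/month of upkeep.
print(toil_delta(12, 2, 3))

# Hypothetical bad trade: barely any toil removed, more upkeep than savings.
print(toil_delta(12, 10, 5))
```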

Where to start tomorrow

Pull last week’s ticket titles, have AI cluster them, pick the highest-frequency low-effort cluster, and automate exactly that — idempotent, dry-run, in git, owned. One cluster a sprint compounds fast; six months in, the team’s interrupt load is visibly lighter and on-call is doing engineering instead of button-pushing.

For the interrupt work that isn’t automatable yet — the genuine incidents — give on-call a fast path with our AI Incident Response Assistant, and find more elimination patterns under AI for Automation.

AI-drafted automation is a starting point, not a finished script. Add dry-run, make it idempotent, review every line, and test against your own systems first.