Building an AI-Assisted OpenStack On-Call Workflow

It’s 3am, the pager is screaming, and your phone shows forty alerts that all fired in the same ninety seconds. I’ve been that person more times than I’d like to admit, squinting at a wall of notifications, trying to figure out whether one thing broke or forty things did. The honest truth about on-call is that the hardest part isn’t fixing the problem; it’s figuring out what the problem even is before your brain is fully online. This is where AI has quietly become my most useful teammate, as long as I keep it on a very short leash.

The First Ninety Seconds: Don’t Read, Triage

When you’re paged into an alert storm, the instinct is to read every alert. Don’t. Most storms are one root cause echoing through dozens of dependent checks: a compute host dies and suddenly every instance on it, every volume attached to those instances, and every network port they owned all alarm at once. Your job in the first ninety seconds is to collapse forty symptoms into one or two candidate causes.

This is the best use I’ve found for AI in on-call. I paste the sanitized alert list into a model with a tight prompt: “These alerts all fired within two minutes. Group them by likely shared root cause, rank causes by how many alerts each explains, and list the single command I’d run to confirm the top candidate.” What used to be five minutes of frantic reading becomes a ranked shortlist. I wire the same summarization into the monitoring alerts dashboard so the grouping happens before I even pick up the laptop. The broader OpenStack category has the deeper playbooks behind each of these candidate causes.

Pro Tip: Give the AI the alert timestamps, not just the alert text. “Fired within two minutes of each other” is the single strongest signal that you’re looking at one cause, and a model will exploit it if you hand it the data.
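
To make the timestamp signal concrete, here’s a minimal sketch of pre-grouping alerts into bursts before they go into the prompt. It assumes a hypothetical export format of one alert per line, epoch seconds first and then the sanitized alert text, already sorted by time. The function name group_bursts and that input format are my own placeholders, not part of any monitoring tool.

```shell
# Hypothetical input on stdin: "<epoch_seconds> <alert text>", sorted by time.
# Prefixes each alert with a burst number. A new burst starts whenever the
# gap from the previous alert exceeds the window (seconds, default 120).
group_bursts() {
  awk -v window="${1:-120}" '
    NR == 1 || $1 - prev > window { burst++ }
    { prev = $1; print "burst " burst ": " $0 }
  '
}
```

Piping the numbered output into the prompt gives the model an explicit hint about which alerts fired together. The gap rule chains, so a long trickle of alerts that are each under two minutes apart lands in one burst, which is usually what I want during a storm.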

The Commands I Run Before I’m Fully Awake

No matter what the AI says, I run the same small set of wide-net commands first, because they orient me faster than any summary. These are the muscle-memory queries:

openstack server list --all-projects --status ERROR
openstack network agent list
openstack volume list --status error
openstack compute service list

The server list query with --all-projects --status ERROR tells me instantly whether this is a localized blip or a fleet event. The network agent list shows me dead L3 or DHCP agents, the usual culprits behind “the whole tenant is down.” The volume list query with --status error catches Cinder going sideways, and the compute service list shows me which nova-compute services are down, which is often the actual root cause the alert storm was echoing.

I feed the output of these into the AI, but here’s the discipline: I run them, I read them, then I optionally hand them over for a second opinion. The model never runs them. There’s no path where the assistant has a shell on the cloud, because a fast junior engineer with admin CLI access at 3am is how you turn one outage into two.
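
A minimal sketch of that run, read, then share discipline: a function that runs the four read-only queries above and writes the output to local files, so I read it first and decide later what, if anything, gets pasted elsewhere. The name triage_snapshot and the file names are my own placeholders, and it assumes the openstack CLI is already authenticated in the current shell.

```shell
# Read-only triage snapshot: runs the four list commands and keeps the output
# on local disk. Nothing here changes state and nothing is sent to a model.
triage_snapshot() {
  dir="${1:-./triage-$(date +%Y%m%dT%H%M%S)}"
  mkdir -p "$dir" || return 1
  openstack server list --all-projects --status ERROR > "$dir/servers-error.txt" 2>&1
  openstack network agent list > "$dir/network-agents.txt" 2>&1
  openstack volume list --status error > "$dir/volumes-error.txt" 2>&1
  openstack compute service list > "$dir/compute-services.txt" 2>&1
  # Print the directory so the caller knows where to look.
  echo "$dir"
}
```

The human still runs it from their own terminal; the point is only to get the same four views every time without thinking about flags.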

Feeding Sanitized Logs Without Leaking the Cloud

Once I’ve narrowed to a candidate cause, I want the model’s help reading detail: a stack trace from nova-compute.log, a chunk of neutron-server output, a Cinder driver error. This is where on-call engineers get sloppy and paste raw logs into a chat box, tokens and all. I sanitize first, every time:

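# devstack@n-cpu is DevStack's nova-compute unit; substitute your deployment's unit name.
# \S and the case-insensitive flag are GNU sed extensions.
journalctl -u devstack@n-cpu --since "10 min ago" \
  | sed -E 's/(token|password|secret|auth)["= :]+\S+/\1=REDACTED/gi'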

The model gets the shape of the error, never the credentials. It does a genuinely good job pattern-matching a traceback to a known failure class, and I’ll cross-check its read against my own. I bounce between Claude for the careful reasoning and a terminal-native assistant like Warp when I want suggestions inline, but the redaction step is the same regardless of tool. Production clouds.yaml and admin tokens never leave my machine, full stop.
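
Here’s a minimal sketch of the same redaction wrapped as a reusable filter, so it works on any log source and not just journalctl. The first expression is the sed pattern from above. The second is my own addition and rests on an assumption: that bare Keystone Fernet tokens show up as long URL-safe base64 strings beginning with gAAAAA. It requires GNU sed, and the function name sanitize is a placeholder.

```shell
# Redact key/value style credentials, then bare Fernet-looking tokens.
# Assumption: Fernet tokens appear as "gAAAAA" followed by 20+ base64url chars.
sanitize() {
  sed -E \
    -e 's/(token|password|secret|auth)["= :]+\S+/\1=REDACTED/gI' \
    -e 's/gAAAAA[A-Za-z0-9_=-]{20,}/REDACTED_TOKEN/g'
}
```

A pattern list like this is a floor, not a guarantee. It won’t catch hostnames, IPs, project names, or credentials in formats it has never seen, so I still skim the output before it goes anywhere.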

Deciding What’s a Human Job and What’s an AI Job

The mental model that keeps me out of trouble is dividing every on-call task into “reversible and read-only” versus “irreversible or state-changing,” and only ever letting AI near the first bucket.

AI is great at: summarizing alert storms, grouping by root cause, reading sanitized tracebacks, recalling the right openstack flag I’ve forgotten at 3am, and drafting the timeline as I go. Those are all things where a wrong answer costs me a few seconds of verification, not an outage.

Humans own: anything that mutates state. Evacuating instances off a dead host, force-detaching a volume, disabling a compute service, restarting an agent. The AI can suggest openstack compute service set --disable <host> nova-compute, and often that suggestion is exactly right, but I’m the one who types it, because I’m the one who understands what’s running on that host and what evacuating it will do. The assistant proposes; the on-call engineer disposes. For the suggestion-drafting side, I keep proven prompts in the prompt workspace.
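
The two-bucket split can be sketched as a default-deny label on any suggested command. This is an illustration under my own assumptions, not a complete policy: it treats only the list and show verbs as read-only, it looks for those words anywhere in the command line, and it does not parse pipes or compound commands. The function name classify_cmd is a placeholder.

```shell
# Label a suggested command by its action verb. Only "list" and "show"
# count as read-only; anything else is assumed to change state.
classify_cmd() {
  case " $* " in
    *" list "*|*" show "*) echo "read-only" ;;
    *) echo "state-changing" ;;
  esac
}
```

I’d only use a label like this to decide how loudly a suggestion gets flagged. Nothing gets executed on the strength of it, and anything it marks state-changing is something I type myself or not at all.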

Where the Incident-Response Dashboard Fits

The connective tissue that makes this workflow actually flow is the incident-response dashboard. It’s where the sanitized alert summary, the candidate causes, and my running command log all land in one place, so when a second responder joins they’re caught up in seconds instead of asking me to re-explain at 3:15am. The AI drafts the running summary; I correct it as facts firm up.

Critically, that dashboard is read-and-suggest only. It surfaces information and proposes next commands, but it has no credentials to execute anything against the control plane. That boundary is the whole design. The day you let your incident tooling auto-remediate based on an LLM’s read of an alert storm is the day a hallucinated root cause evacuates a healthy host during a network partition. Keep the AI advising and the humans acting, and you get the speed without betting the cloud on it.
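
The running command log mentioned above doesn’t need to be fancy. A minimal sketch, assuming a plain local text file is enough and using a hypothetical INCIDENT_LOG variable and note function of my own naming:

```shell
# Append a UTC-timestamped line to the incident log.
# INCIDENT_LOG is a hypothetical variable; defaults to ./incident.log.
note() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "${INCIDENT_LOG:-./incident.log}"
}
```

Typing something like note "cmp-07 nova-compute down, confirmed via compute service list" as I go costs a few seconds, and the timestamps are what later make the timeline trustworthy.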

Post-Incident: The Writeup AI Was Born For

After the fire’s out, the writeup is the chore everyone dreads, and it’s the single most natural fit for AI in the whole workflow. I take my command log, the timeline, and the sanitized logs, and I prompt: “Draft a blameless post-incident writeup. Sections: summary, timeline, root cause, contributing factors, what went well, action items. Use only the facts I provide; mark anything you’re inferring as an assumption.”

That “mark anything you’re inferring” clause is essential, because a model will absolutely invent a plausible root cause to fill a gap, and a writeup with a confident-but-wrong root cause is worse than no writeup. So I treat the draft as a first pass that a human edits for accuracy before it’s ever published. It saves me the worst part, staring at a blank page at 4am, while keeping the facts honestly mine. The action items, especially, get human review, because those become tomorrow’s work.
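
A minimal sketch of assembling that prompt from local files, so the instructions and the facts always travel together. The function name build_writeup_prompt, the two file arguments, and the section markers are my own placeholders. It only prints text; what happens to that text afterwards is still a human decision.

```shell
# Assemble the writeup prompt from local files; nothing is sent anywhere.
# $1 = timeline / command log, $2 = sanitized log excerpts (hypothetical paths).
build_writeup_prompt() {
  cat <<'EOF'
Draft a blameless post-incident writeup.
Sections: summary, timeline, root cause, contributing factors, what went well, action items.
Use only the facts I provide; mark anything you're inferring as an assumption.
EOF
  printf '%s\n' '' '--- TIMELINE ---'
  cat "$1"
  printf '%s\n' '' '--- SANITIZED LOGS ---'
  cat "$2"
}
```

Keeping the instruction block fixed means the “mark your assumptions” clause can’t get dropped at 4am when I’m tempted to type a shorter prompt.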

Conclusion

A good AI-assisted on-call workflow isn’t about handing the cloud to a robot; it’s about putting a fast, tireless junior engineer next to you, one who can collapse an alert storm into a shortlist, read a traceback while you rub the sleep out of your eyes, and draft the writeup so you don’t have to. The guardrails are simple and absolute: sanitize before you share, keep production credentials away from every model, let AI read and suggest but never execute against the control plane, and verify every inference. Hold that line and the next 3am page is a little less brutal. The prompts library and ready-made prompt packs are where I keep the templates that make it repeatable.