Writing OpenStack Diagnostic Runbooks with AI Prompt Engineering

I have a folder of OpenStack runbooks that took the better part of a decade to write, and most of them were born the same painful way: something broke at a bad hour, I stumbled through the diagnosis, and I swore I’d write it down so the next person wouldn’t suffer. The problem is that “I’ll write it down later” almost never happened. What changed for me wasn’t discipline; it was learning to prompt an LLM to draft the first version while the incident was still fresh. I treat the AI as a fast junior engineer who’ll happily produce a tidy document; my job is to make sure that document is correct, safe, and not full of plausible nonsense.

Why Runbooks Are the Right Job for AI

A diagnostic runbook is a structured, repeatable artifact, and structure is exactly what LLMs are good at producing. You give it a failure mode and a target shape, and it fills in the scaffolding. Where it struggles is judgment: knowing that a particular command is destructive, that a log line means something subtle, or that your cloud has a quirk no public documentation describes. So I split the labor. The model drafts structure and the obvious check sequence; I supply the judgment and the verification. If you want to see how this fits a broader workflow, the openstack category collects the operational pieces.

The runbook structure I insist on is always the same four-part spine:

Symptom — the observable signal that triggers this runbook.
Checks — an ordered, non-destructive sequence to localize the fault.
Remediation — the smallest safe action, with rollback noted.
Escalation — when to stop and page a human, and which one.

When I prompt, I hand the model that spine explicitly. A vague “write me a runbook for network issues” gets you generic mush. A precise prompt gets you something usable.

Prompting for a Single Failure Mode

Take “instance stuck in BUILD.” Here’s the kind of prompt that works:

“You are drafting an OpenStack triage runbook for an SRE team. Failure mode: an instance has been in BUILD state for over ten minutes. Use this structure: Symptom, Checks (ordered, non-destructive only), Remediation, Escalation. For Checks, produce the exact openstack CLI commands a responder runs, in order, with one line explaining what each output tells us. Flag any command that mutates state and exclude it from Checks.”

That last sentence is load-bearing. Without it, models cheerfully slip openstack server delete into a diagnostic section. With it, the Checks section stays read-only, which is the whole point of triage. The draft it produces for BUILD typically lands on a sequence like:

openstack server show <uuid> -c status -c fault -c "OS-EXT-STS:task_state"
openstack server event list <uuid>
openstack compute service list --service nova-compute
openstack network agent list

The fault field and task_state are where the real story is; a model that knows OpenStack will surface them, and one that’s bluffing will give you only openstack server list. That contrast is itself a quick way to gauge whether the AI actually understands the platform. I draft these in the prompt workspace so I can iterate on the wording without losing good versions.
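Because the prompt stays fixed and only the failure mode changes, it lends itself to a small template. What follows is a minimal sketch, not something from my actual tooling: the function name build_runbook_prompt is made up for illustration, and the prompt text is the BUILD prompt above with the failure mode lifted out as a parameter.

```shell
# build_runbook_prompt: hypothetical helper that wraps a failure-mode
# description in the fixed four-part runbook prompt.
build_runbook_prompt() {
  local failure_mode="$1"
  if [ -z "$failure_mode" ]; then
    echo "usage: build_runbook_prompt \"<failure mode>\"" >&2
    return 2
  fi
  # The heredoc body is the prompt; only the failure mode varies.
  cat <<EOF
You are drafting an OpenStack triage runbook for an SRE team.
Failure mode: ${failure_mode}
Use this structure: Symptom, Checks (ordered, non-destructive only), Remediation, Escalation.
For Checks, produce the exact openstack CLI commands a responder runs, in order, with one line explaining what each output tells us.
Flag any command that mutates state and exclude it from Checks.
EOF
}
```

Calling build_runbook_prompt "a volume has been in detaching state for over ten minutes" prints a prompt ready to paste, and keeping the function in git means the load-bearing last sentence can’t quietly go missing.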

Covering the Classics: Network Down and Volume Detach Hangs

I keep a runbook per recurring failure, and the prompt pattern repeats with the symptom swapped. For “tenant network unreachable,” the AI-drafted check sequence should walk from agents to ports to routers:

openstack network agent list --agent-type dhcp
openstack port list --server <uuid>
openstack router show <router> -c status -c external_gateway_info

For a volume stuck detaching, the honest check sequence reaches into both Cinder and Nova because the hang almost always lives in the gap between them:

openstack volume show <volume-uuid> -c status -c attachments
openstack volume list --status detaching
openstack server show <server-uuid> -c "os-extended-volumes:volumes_attached"

Here’s where I lean on the model hardest and trust it least: it will often suggest openstack volume set --state available as remediation. That command force-resets Cinder’s idea of state without touching the actual hypervisor attachment, and used carelessly it can corrupt data when the volume is genuinely still attached. So in my runbook, that line goes into Remediation with a giant “destructive, verify the libvirt attachment is gone first” warning that I write, not the AI.
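One way to make that warning harder to skim past is to put it in front of the command itself. This is only a sketch under my own assumptions: reset_volume_state is a hypothetical wrapper name, and the retype-the-UUID confirmation is one possible gate, not a substitute for actually checking the hypervisor.

```shell
# reset_volume_state: hypothetical wrapper around the Cinder state reset.
# It refuses to run unless the operator retypes the volume UUID after
# reading the warning. It does NOT verify the libvirt attachment itself.
reset_volume_state() {
  local volume_id="$1" typed
  printf 'DESTRUCTIVE: this resets Cinder state only.\n' >&2
  printf 'Verify the libvirt attachment is gone first.\n' >&2
  printf 'Retype the volume UUID to continue: ' >&2
  IFS= read -r typed || typed=""
  if [ "$typed" != "$volume_id" ]; then
    echo "Confirmation did not match; nothing was changed." >&2
    return 1
  fi
  openstack volume set --state available "$volume_id"
}
```

The point of the friction is that a responder has to stop and read at the exact moment the risk is highest.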

Pro Tip: Prompt the model to mark every state-changing command with a DESTRUCTIVE: prefix, then grep for that prefix in review. It turns “did I miss a dangerous step?” from a careful read into a one-line search.
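That search can go one step further and confirm that nothing marked destructive landed in the Checks section. A rough sketch, assuming the runbook is a plain-text file whose section headings are lines beginning with Symptom, Checks, Remediation, or Escalation; both function names are invented for this example.

```shell
# list_destructive: print every line carrying the DESTRUCTIVE: marker,
# with its line number.
list_destructive() {
  grep -n 'DESTRUCTIVE:' "$1"
}

# checks_are_read_only: fail if a DESTRUCTIVE: line sits inside Checks.
# Assumption: a line starting with one of the four section names is a heading.
checks_are_read_only() {
  awk '
    /^(Symptom|Checks|Remediation|Escalation)/ { section = $1 }
    section ~ /^Checks/ && /DESTRUCTIVE:/ {
      print FILENAME ":" NR ": " $0
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}
```

Running checks_are_read_only runbook.txt exits non-zero and prints the offending lines when a marked command has slipped into triage, which makes it easy to hook into review.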

Redact Before You Share a Single Log

The fastest way to leak a credential is to paste a raw OpenStack log into a chat window. Tokens, project IDs, sometimes whole clouds.yaml fragments end up in debug output. Before any log goes to an LLM, it gets sanitized:

openstack server show <uuid> -f json \
  | sed -E 's/(token|password|secret)["= :]+[^",}]+/\1=REDACTED/gi'

I treat that as a hard gate, not a nicety. The model does not need your real tenant IDs to help you structure a runbook, and it absolutely must never see admin tokens or production clouds.yaml. The rule in my team is simple: the AI gets the shape of the problem, never the keys to the cloud. Whichever assistant you use, ChatGPT, Claude, or a self-hosted model, the redaction step is identical and non-negotiable.
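The one-liner above only covers secrets, so here is a sketch of a reusable filter that also masks ID-shaped strings, plus a gate that refuses to pass a file that still contains them. The patterns are my assumptions, not a complete list of what a log can leak: secret-bearing JSON keys are lowercase and contain token, password, or secret, and IDs look like either dashed UUIDs or 32-character hex strings. The names redact and assert_redacted are illustrative.

```shell
# redact: stdin -> stdout filter. Patterns are illustrative assumptions.
redact() {
  sed -E \
    -e 's/("[^"]*(token|password|secret)[^"]*"[[:space:]]*:[[:space:]]*)"[^"]*"/\1"REDACTED"/g' \
    -e 's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/REDACTED-UUID/g' \
    -e 's/(^|[^0-9a-fA-F])[0-9a-fA-F]{32}([^0-9a-fA-F]|$)/\1REDACTED-ID\2/g'
}

# assert_redacted: hard gate. Fails if ID-shaped strings remain in the file.
# It only checks for IDs; it cannot prove that no secret is present.
assert_redacted() {
  if grep -E -n '[0-9a-fA-F]{32}|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-' "$1" >&2; then
    echo "refusing to share $1: unredacted identifiers remain" >&2
    return 1
  fi
}
```

A typical flow would be openstack server show <uuid> -f json | redact > safe.json followed by assert_redacted safe.json, and a human still reads safe.json before it goes anywhere.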

Version Control and Code Review for AI Output

A runbook that lives in someone’s notes is worse than no runbook, because people trust it. So every runbook goes into git, in the same repo as the rest of our ops tooling, and AI-drafted runbooks go through pull-request review exactly like code. The reviewer’s job is specifically to catch the things AI gets wrong: a command that doesn’t exist on our OpenStack release, a check that assumes a service we don’t run, a remediation that’s destructive without saying so.

I route these through the same review discipline we use for code; the code review dashboard flags the destructive-command patterns automatically. Treating an AI-drafted runbook as untrusted input until a human signs off is the single habit that’s saved me the most grief. The model’s draft is a starting point, and the git history makes it honest about who actually approved it. For reusable runbook prompt templates, I keep a set in the prompt packs.
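I can’t reproduce the dashboard here, but the idea behind a destructive-pattern check fits in a few lines. This sketch assumes a deliberately incomplete list of mutating subcommands and the DESTRUCTIVE: marker convention from earlier; the variable and function names are hypothetical, and a real check would need a list tuned to your release.

```shell
# Assumed, deliberately incomplete pattern for state-changing commands.
MUTATING_PATTERN='openstack[[:space:]]+(server|volume|port|router|network)[[:space:]]+(delete|set|unset|reboot|rebuild|migrate|stop)'

# flag_unmarked_destructive: print mutating commands that lack the
# DESTRUCTIVE: marker and return non-zero if any are found.
flag_unmarked_destructive() {
  if grep -E -n "$MUTATING_PATTERN" "$1" | grep -v 'DESTRUCTIVE:'; then
    return 1
  fi
}
```

Run against each changed runbook in a pull request, it gives the reviewer a short list of lines to look at first. It catches patterns, not intent, so it supplements the human read and does not replace it.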

Turning Runbooks Into Automation, Carefully

The natural next step is to wire a reviewed runbook into automation: a script that runs the read-only Checks and posts a summary when an alert fires. This is great, and it is also exactly where teams hurt themselves by going one step too far and letting the automation execute the Remediation steps unattended.

My rule: automate the Checks, never the Remediation, until a human has watched the automation make the right call dozens of times. A script that gathers openstack server show, openstack server event list, and an agent listing and pastes them into the incident channel is a force multiplier. A script that auto-runs openstack volume set --state available because an LLM suggested it is a future outage with a timer on it. The AI can propose the automation; a human decides what it’s actually allowed to touch.
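Here is roughly what “automate the Checks, never the Remediation” can look like in a script. It is a sketch under assumptions: ro_openstack and gather_checks are names I made up, the allowlist treats show and list as the only read-only verbs the Checks need, and posting the output to an incident channel is left out because that part depends entirely on your chat tooling.

```shell
# ro_openstack: run an openstack command only if its verb is read-only.
# Assumption: "show" and "list" are the only verbs the Checks require.
ro_openstack() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      show|list) openstack "$@"; return ;;
      -*) break ;;   # stop scanning once options start
    esac
  done
  echo "refused (not read-only): openstack $*" >&2
  return 1
}

# gather_checks: run the BUILD-state Checks for one server and print them.
gather_checks() {
  local server_id="$1"
  echo "== server status =="
  ro_openstack server show "$server_id" -c status -c fault -c "OS-EXT-STS:task_state"
  echo "== server events =="
  ro_openstack server event list "$server_id"
  echo "== compute services =="
  ro_openstack compute service list --service nova-compute
  echo "== network agents =="
  ro_openstack network agent list
}
```

Routing every call through ro_openstack means that even if someone later pastes a remediation command into the script, it is refused instead of executed.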

Conclusion

Prompt engineering didn’t replace my decade of runbooks; it just removed the excuse for not writing the next one. Give the model the four-part spine, force it to mark destructive steps, redact before you share, and put every draft through git and human review. Do that and the AI behaves like the fast, eager junior it is: tremendously useful for getting a first draft on the page, never trusted with the production keys or the destructive button. Browse the prompts library for templates to start from, and you’ll have a runbook written before the incident’s even closed.
