
Nova Host Evacuation & Maintenance Runbook Prompt

Build a safe runbook to drain, patch, and return a Nova compute host to service — choosing between live-migration, cold migration, and evacuate for dead hosts without losing instances.

Target user: Compute operators performing hardware/host maintenance
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior Nova operator who has taken hundreds of compute hosts in and out of service — for kernel patching, firmware, and dead-hardware recovery — without orphaning an instance.

I will provide:
- Reason for maintenance (planned patch vs failed/unreachable host)
- `openstack hypervisor list` / `openstack compute service list` output for the target
- Instances on the host: flavors, pinning/SR-IOV/PCI, boot-from-volume vs local disk
- Shared vs local storage (Ceph/NFS shared root vs local ephemeral)
- Maintenance window length and whether the host will return

Your job:

1. **Decide the path** — give a decision tree: host alive + planned → disable + live-migrate; host alive but instances unmovable (pinned/SR-IOV/local disk) → schedule downtime or cold migrate; host dead/unreachable → `openstack server evacuate` (legacy client: `nova evacuate`), which is only safe with shared storage or boot-from-volume; otherwise the local disk contents are lost.

2. **Pre-drain** — `openstack compute service set --disable --disable-reason "<reason>" <host> nova-compute` so the scheduler stops placing new instances on the host; confirm no in-flight builds; snapshot the instance inventory for verification later.

3. **Live-migration drain** — order instances, cap concurrency, watch for stuck `migrating` states, and handle the ones that legitimately cannot live-migrate.

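4. **Evacuate (dead host)** — the critical warning: evacuate rebuilds the instance on another host. With shared storage or boot-from-volume the disk survives; with local-only ephemeral disk the instance is rebuilt from its image and the disk contents are lost. Verify the storage assumption BEFORE evacuating, and fence the dead host (power it off out-of-band and confirm its compute service reports `down`) so it cannot resurrect and double-run an instance.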

5. **Maintenance** — patch/firmware steps, followed by checks that nova-compute, libvirt, and the message bus all reconnect cleanly afterward, with a short explanation of why each check matters.

6. **Return to service** — re-enable the service, confirm it reports `up`, optionally rebalance, and validate instance count/health matches the pre-drain snapshot.

7. **Verification & rollback** — explicit checks for ERROR-state or missing instances and how to recover them.
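
To illustrate the kind of decision tree step 1 asks for, here is a minimal Python sketch. The names `Instance`, `Path`, and `choose_path` are hypothetical, and the rules simply encode the three branches described above (including treating local-disk instances as unmovable); a real runbook would also need to account for things like server groups and migration policy.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    LIVE_MIGRATE = "disable host, then live-migrate"
    COLD_MIGRATE_OR_DOWNTIME = "schedule downtime or cold migrate"
    EVACUATE = "fence host, then evacuate"
    STOP_DATA_LOSS_RISK = "do not evacuate: local disk would be lost"


@dataclass
class Instance:
    # Hypothetical, simplified view of one instance on the target host.
    name: str
    boot_from_volume: bool
    pinned_or_passthrough: bool = False  # CPU pinning, SR-IOV, or PCI passthrough


def choose_path(inst: Instance, host_alive: bool, shared_storage: bool) -> Path:
    """Pick a maintenance path for one instance, following the prompt's decision tree."""
    disk_survives_host = shared_storage or inst.boot_from_volume

    if not host_alive:
        # Dead host: evacuate only if the disk lives off the host.
        return Path.EVACUATE if disk_survives_host else Path.STOP_DATA_LOSS_RISK

    # Live host: pinned/passthrough or local-disk instances are treated as unmovable.
    if inst.pinned_or_passthrough or not disk_survives_host:
        return Path.COLD_MIGRATE_OR_DOWNTIME
    return Path.LIVE_MIGRATE
```

Calling `choose_path` once per instance makes the data-loss guardrail explicit: a dead host with local-only disk never yields an evacuate decision.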

Output as: (a) the decision tree, (b) drain runbook, (c) evacuate runbook with the data-loss guardrail, (d) return-to-service checklist, (e) a verification script outline.

Bias toward: never evacuating without confirming shared storage, fencing dead hosts, and validating against a pre-drain inventory.
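
As a rough idea of what the verification script outline might contain, the sketch below compares a pre-drain inventory against a post-maintenance one. It assumes each snapshot is the JSON output of something like `openstack server list --all-projects -f json` with `ID` and `Status` keys; the function names `load_snapshot` and `compare_inventory` are hypothetical.

```python
import json


def load_snapshot(json_text: str) -> dict:
    """Map instance ID -> status from a JSON list of rows (assumed keys: ID, Status)."""
    return {row["ID"]: row["Status"] for row in json.loads(json_text)}


def compare_inventory(before: dict, after: dict) -> dict:
    """Report instances that went missing, entered ERROR, or changed status."""
    common = set(before) & set(after)
    return {
        # Present before the drain, absent afterward: needs investigation first.
        "missing": sorted(set(before) - set(after)),
        # Still present but now in ERROR state.
        "errored": sorted(i for i in common if after[i] == "ERROR"),
        # Status differs from the snapshot (for example ACTIVE -> SHUTOFF).
        "changed": sorted(
            i for i in common if after[i] != before[i] and after[i] != "ERROR"
        ),
    }


def is_clean(report: dict) -> bool:
    """True only when every category in the report is empty."""
    return not any(report.values())
```

Keeping the comparison keyed on instance ID rather than name avoids false matches when names are reused, and an empty report is the signal that the host can be considered fully returned to service.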
