Nova Host Evacuation & Maintenance Runbook Prompt
Build a safe runbook to drain, patch, and return a Nova compute host to service, choosing between live migration, cold migration, and (for dead hosts) evacuate, without losing instances.
- Target user: Compute operators performing hardware/host maintenance
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Nova operator who has taken hundreds of compute hosts in and out of service (for kernel patching, firmware, and dead-hardware recovery) without orphaning an instance.

I will provide:
- Reason for maintenance (planned patch vs failed/unreachable host)
- `openstack hypervisor list` / `openstack compute service list` output for the target
- Instances on the host: flavors, pinning/SR-IOV/PCI, boot-from-volume vs local disk
- Shared vs local storage (Ceph/NFS shared root vs local ephemeral)
- Maintenance window length and whether the host will return

Your job:
1. **Decide the path**: give a decision tree: host alive + planned → disable + live-migrate; host alive but instances unmovable (pinned/SR-IOV/local disk) → schedule downtime or cold migrate; host dead/unreachable → `nova evacuate` (only safe with shared storage or boot-from-volume, otherwise data loss).
2. **Pre-drain**: `openstack compute service set --disable --disable-reason` so the scheduler stops placing new instances; confirm no in-flight builds; snapshot the instance inventory for verification later.
3. **Live-migration drain**: order instances, cap concurrency, watch for stuck `migrating` states, and handle the ones that legitimately cannot live-migrate.
4. **Evacuate (dead host)**: the critical warning: evacuate rebuilds the instance on another host from shared storage; with local-only ephemeral disk it loses the disk. Verify the storage assumption BEFORE evacuating, and fence the dead host so it cannot resurrect and double-run an instance.
5. **Maintenance**: patch/firmware steps, and why to verify nova-compute, libvirt, and the message bus all reconnect cleanly afterward.
6. **Return to service**: re-enable the service, confirm it reports `up`, optionally rebalance, and validate instance count/health matches the pre-drain snapshot.
7. **Verification & rollback**: explicit checks for ERROR-state or missing instances and how to recover them.
Output as: (a) the decision tree, (b) drain runbook, (c) evacuate runbook with the data-loss guardrail, (d) return-to-service checklist, (e) a verification script outline. Bias toward: never evacuating without confirming shared storage, fencing dead hosts, and validating against a pre-drain inventory.
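As an illustration of the kind of decision tree the prompt asks for, here is a minimal Python sketch of the step 1 rules. It is not part of the prompt and not a Nova API: the `Instance` fields, the `Path` names, and `choose_path` are hypothetical names invented for this example, and the rules simply mirror the ones stated above (including treating local-disk instances as unmovable on a live host).

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    """Hypothetical labels for the outcomes named in the prompt's step 1."""
    LIVE_MIGRATE = "disable + live-migrate"
    COLD_MIGRATE_OR_DOWNTIME = "schedule downtime or cold migrate"
    EVACUATE = "evacuate"
    STOP_DATA_LOSS_RISK = "do not evacuate: local-only disk would be lost"


@dataclass
class Instance:
    """Illustrative inventory record; field names are assumptions."""
    name: str
    pinned_or_passthrough: bool  # CPU pinning, SR-IOV, or PCI passthrough
    boot_from_volume: bool
    shared_root_storage: bool    # e.g. Ceph- or NFS-backed root disk


def choose_path(host_alive: bool, inst: Instance) -> Path:
    """Apply the prompt's decision rules to a single instance."""
    # The disk outlives the host only if it does not sit on local ephemeral storage.
    disk_survives_host = inst.boot_from_volume or inst.shared_root_storage

    if not host_alive:
        # Data-loss guardrail: evacuate only when the disk is not host-local.
        return Path.EVACUATE if disk_survives_host else Path.STOP_DATA_LOSS_RISK

    if inst.pinned_or_passthrough or not disk_survives_host:
        # The prompt treats pinned/SR-IOV/local-disk instances as unmovable.
        return Path.COLD_MIGRATE_OR_DOWNTIME

    return Path.LIVE_MIGRATE


if __name__ == "__main__":
    demo = Instance("web-01", pinned_or_passthrough=False,
                    boot_from_volume=True, shared_root_storage=False)
    print(choose_path(host_alive=True, inst=demo).value)
```

Evaluating the path per instance rather than per host reflects the prompt's point that a single host can hold a mix of movable and unmovable instances; a real runbook would still need the fencing and inventory checks described in steps 4 and 6.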