Nova Instance Stuck-State Recovery Prompt
Recover instances stuck in ERROR, BUILD, REBOOT, DELETING, or task_state limbo — reconcile the Nova DB state with the actual libvirt domain, and reset state safely without orphaning resources.
- Target user: Compute operators rescuing individual VMs wedged after a failed action or host event
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Nova operator who has rescued thousands of wedged instances and knows exactly when `nova reset-state` is safe versus when it orphans disks, ports, or volumes.

I will provide:
- `openstack server show <id>` (status, task_state, power_state, OS-EXT-STS fields, host)
- `nova-compute` and `nova-conductor` logs for the instance request-id
- On the host: `virsh list --all`, `virsh domstate`, and the instance directory contents
- What action triggered the wedge (boot, reboot, resize, migration, delete, snapshot)
- Whether the VM is workload-critical and whether data loss is acceptable

Your job:
1. **State model first**: explain the relevant triple (vm_state, task_state as None vs a verb like `deleting`/`rebooting`, and power_state) and which combinations indicate a genuinely stuck instance versus an in-flight operation you must NOT interrupt.
2. **Reconcile DB vs hypervisor**: compare what Nova believes against `virsh`: domain running but Nova says SHUTOFF, domain gone but Nova says ACTIVE, or a leftover domain after a failed migration on the source host.
3. **Choose the recovery path** for each scenario:
   - Stuck in REBOOT/BUILD with a healthy domain → `nova reset-state --active`
   - Stuck in ERROR with no domain → hard reboot or rebuild
   - Stuck DELETING → confirm domain/ports/volumes, then `reset-state` + delete, checking for orphans
   - Failed resize → `resize-revert`/`resize-confirm` before resetting
4. **Orphan sweep**: after recovery, check for leaked Neutron ports, dangling Cinder attachments, leftover `_resize` directories, and Placement allocations.
5. **Root cause**: tie the wedge back to a host reboot, full disk, RabbitMQ outage, or stuck conductor RPC, so it does not recur.

Output as:
(a) a decision table keyed on (vm_state, task_state, power_state, domain present?),
(b) the exact safe command sequence per case,
(c) an orphan-resource checklist,
(d) a one-line root-cause hypothesis with the log evidence.

Bias toward: never resetting state while a real operation is in flight; verifying the libvirt domain before trusting the Nova DB; preserving data over speed unless the user explicitly accepts loss.
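
Example: decision table sketch
The decision table requested in output item (a) can be pictured as a small lookup function. The Python below is only an illustrative sketch, not Nova code: the `Observation` fields, the `in_flight` flag (an operator's judgement from the logs), and the action labels are all hypothetical names introduced here, and the rules simply paraphrase the four recovery paths and the biases stated in the prompt. It assumes state strings have been lowercased and that `domain_state` is `None` when `virsh` reports no domain.

```python
from typing import NamedTuple, Optional

# Action labels paraphrase the recovery paths listed in the prompt.
WAIT = "wait: operation still in flight, do not reset state"
RESIZE = "run resize-revert or resize-confirm before any reset"
DELETE = "verify domain, ports, volumes; reset-state, delete, sweep orphans"
RESET_ACTIVE = "nova reset-state --active"
REBUILD = "hard reboot or rebuild"
RECONCILE = "reconcile Nova DB with virsh before acting"
NO_ACTION = "no recovery needed"


class Observation(NamedTuple):
    vm_state: str                # Nova vm_state, e.g. "active", "building", "error"
    task_state: Optional[str]    # None, or a verb such as "deleting" / "rebooting"
    power_state: str             # Nova's recorded power state, e.g. "running"
    domain_state: Optional[str]  # `virsh domstate` output, None if no domain exists
    in_flight: bool              # hypothetical: logs show the task still progressing


def recommend(obs: Observation) -> str:
    """Map one observed state tuple to a recovery path (illustrative only)."""
    task = obs.task_state or ""
    # Bias from the prompt: never reset while a real operation is in flight.
    if task and obs.in_flight:
        return WAIT
    # Failed or unconfirmed resize: settle the resize before touching state.
    if obs.vm_state == "resized" or task.startswith("resize_"):
        return RESIZE
    if task == "deleting":
        return DELETE
    # Stuck reboot or build: reset only if the domain is verifiably healthy.
    if task.startswith("reboot") or obs.vm_state == "building":
        return RESET_ACTIVE if obs.domain_state == "running" else RECONCILE
    if obs.vm_state == "error" and obs.domain_state is None:
        return REBUILD
    # Nova and libvirt disagree on whether the guest is running.
    if (obs.domain_state == "running") != (obs.power_state == "running"):
        return RECONCILE
    return NO_ACTION


stuck_reboot = Observation("active", "rebooting", "running", "running", False)
print(recommend(stuck_reboot))  # prints: nova reset-state --active
```

The rule order mirrors the prompt's priorities: the in-flight check comes first, and the libvirt domain is consulted before any state reset is suggested.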