
Nova-compute Host Health Recovery Prompt

Triage an unhealthy nova-compute host reporting as down in the service list — distinguishing a dead nova-compute service, a hung libvirt/qemu, an AMQP heartbeat problem, or a wedged hypervisor — and recover it without endangering running instances.

Target user: OpenStack compute operators and on-call SREs
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack compute SRE recovering a hypervisor whose nova-compute service shows as down, while VMs may or may not still be running on it. Operate read-only and advisory: the workloads on the node are live, so the priority is to recover the agent WITHOUT rebooting the host or evacuating prematurely.

I will provide:
- `openstack compute service list --host <h>` (state/status/updated_at) and `openstack hypervisor show <h>`.
- On-node state: `systemctl status nova-compute`, the tail of nova-compute.log, `virsh list --all`, and `systemctl status libvirtd`.
- AMQP reachability from the node and any `MessagingTimeout`/heartbeat lines.
- Resource pressure: free memory, load, disk space on `/var/lib/nova` and the instances path, and dmesg for OOM/IO errors.
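One way to gather these inputs without making things worse is to run each read-only command under a timeout, since a hung `virsh` is itself a finding. The sketch below is illustrative only: the unit name `nova-compute`, the log path `/var/log/nova/nova-compute.log`, and the helper names `build_commands` and `run` are assumptions that vary by distro and deployment tool. The two `openstack` commands need API credentials and are usually run from a control host; the rest run on the node.

```python
import subprocess

# Assumed defaults; adjust for your distro and deployment tool.
NOVA_UNIT = "nova-compute"
NOVA_LOG = "/var/log/nova/nova-compute.log"


def build_commands(host: str) -> dict[str, list[str]]:
    """Return the read-only commands from the input list, keyed by a label."""
    return {
        # Control-plane view (needs OpenStack credentials).
        "service_list": ["openstack", "compute", "service", "list", "--host", host],
        "hypervisor": ["openstack", "hypervisor", "show", host],
        # On-node state.
        "nova_unit": ["systemctl", "status", NOVA_UNIT, "--no-pager"],
        "nova_log": ["tail", "-n", "200", NOVA_LOG],
        "virsh": ["virsh", "list", "--all"],
        "libvirt_unit": ["systemctl", "status", "libvirtd", "--no-pager"],
        # Resource pressure.
        "memory": ["free", "-m"],
        "load": ["uptime"],
        "disk": ["df", "-h", "/var/lib/nova"],
        "dmesg": ["dmesg"],
    }


def run(argv: list[str], timeout: float = 15.0) -> str:
    """Run one command; report a timeout or missing binary instead of raising."""
    try:
        # No check=True: `systemctl status` exits non-zero for a stopped unit,
        # and that output is exactly what we want to capture.
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"TIMEOUT after {timeout}s"
    except FileNotFoundError:
        return "NOT FOUND: " + argv[0]
    return (proc.stdout + proc.stderr).strip()
```

Looping over `build_commands("<h>").items()` and printing `run(argv)` under each label yields a bundle that can be pasted alongside the prompt. A `TIMEOUT` result for the `virsh` entry is worth calling out explicitly.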

Your tasks:

1. **Is it the agent or the hypervisor?** Determine whether the VMs are still healthy (shown as running in `virsh list` and reachable) even though nova-compute reports down. A down service does NOT mean down VMs.
2. **Find the stall point:** classify it as one of: nova-compute process dead or crashlooping, libvirt hung (virsh hangs), AMQP heartbeat lost (agent alive but not reporting), or host resource exhaustion (OOM, full disk, IO stall).
3. **Recover least-destructively:** for AMQP or agent issues, restart only nova-compute. For libvirt hangs, assess whether libvirtd can be restarted without killing qemu. A plain daemon restart normally leaves running qemu processes alone and libvirtd reconnects to them on startup, but in containerized deployments first confirm that restarting the libvirt container does not take the qemu processes with it.
4. **Decide on evacuation:** only if the host is truly unrecoverable, outline the `nova evacuate` / `nova host-evacuate` decision (or `openstack server evacuate` with the unified client) and its hard precondition: the host must be fenced or powered off first, so the same instance can never run in two places (split-brain).
5. **Confirm recovery:** the service reports up, instances are unaffected, and there are no duplicate domains.
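The classification in task 2 can be pictured as a small decision function. This is a minimal sketch, not a definitive rule set: the `Evidence` fields, the `StallPoint` labels, and especially the order of the checks (host pressure first, because it can produce every other symptom) are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class StallPoint(Enum):
    RESOURCE_EXHAUSTION = "host resource exhaustion"
    LIBVIRT_HUNG = "libvirt hung"
    AGENT_DEAD = "nova-compute dead or crashlooping"
    AMQP_HEARTBEAT = "AMQP heartbeat lost"
    UNKNOWN = "unclassified"


@dataclass
class Evidence:
    nova_compute_active: bool   # systemctl reports the unit as active
    virsh_responds: bool        # `virsh list --all` returned before a timeout
    messaging_timeouts: bool    # MessagingTimeout/heartbeat lines in the log
    oom_or_io_errors: bool      # dmesg shows OOM kills or IO errors
    disk_full: bool             # /var/lib/nova or the instances path is full


def classify(e: Evidence) -> StallPoint:
    """Map collected evidence to one stall point (assumed check order)."""
    # Host-level pressure first: it can cause all of the symptoms below.
    if e.oom_or_io_errors or e.disk_full:
        return StallPoint.RESOURCE_EXHAUSTION
    # A hanging virsh points at libvirt regardless of the agent's state.
    if not e.virsh_responds:
        return StallPoint.LIBVIRT_HUNG
    if not e.nova_compute_active:
        return StallPoint.AGENT_DEAD
    # Agent process is alive but not reporting in.
    if e.messaging_timeouts:
        return StallPoint.AMQP_HEARTBEAT
    return StallPoint.UNKNOWN
```

For example, an active agent with a responsive `virsh`, no resource errors, and `MessagingTimeout` lines in the log classifies as `StallPoint.AMQP_HEARTBEAT`, which is the case where restarting only nova-compute is the natural first step.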

Output: (a) agent-vs-hypervisor-vs-host verdict, (b) the stall point with evidence, (c) ordered recovery (restart agent → restart libvirtd → escalate), (d) the fencing precondition before any evacuate.
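Items (c) and (d) amount to an escalation ladder plus a gate on evacuation. The sketch below encodes just that ordering and the fencing precondition. The function names `next_step` and `may_evacuate` are hypothetical, and the code only returns advice; it does not restart or evacuate anything.

```python
# Least-destructive-first ladder, in the order given in the output spec.
LADDER = ("restart nova-compute", "restart libvirtd", "escalate")


def next_step(steps_already_tried: int) -> str:
    """Return the next rung of the ladder; stay on 'escalate' once reached."""
    index = min(max(steps_already_tried, 0), len(LADDER) - 1)
    return LADDER[index]


def may_evacuate(host_unrecoverable: bool, host_fenced: bool) -> bool:
    """Evacuation is advisable only for an unrecoverable AND fenced host.

    Evacuating while the old host might still be running the instances
    risks the same instance running in two places (split-brain).
    """
    return host_unrecoverable and host_fenced
```

With nothing tried yet, `next_step(0)` returns `"restart nova-compute"`, and `may_evacuate(True, False)` is `False` because the host has not been fenced.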

Related prompts

Nova Live Migration Failure Debug Prompt

Debug failed or stuck Nova live migrations — pre-check rejections, instances stuck in MIGRATING, libvirt 'migration job' errors, and post-migration cleanup left on the source host — across shared and block (non-shared) storage scenarios.
