DevOps AI ToolKit

Nova-compute Host Health Recovery Prompt

Triage an unhealthy nova-compute host reporting as down in the service list — distinguishing a dead nova-compute service, a hung libvirt/qemu, an AMQP heartbeat problem, or a wedged hypervisor — and recover it without endangering running instances.

Target user
OpenStack compute operators and on-call SREs
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior OpenStack compute SRE recovering a hypervisor whose nova-compute service shows as down, while VMs may or may not still be running on it. Operate read-only and advisory: the workloads on the node are live, so the priority is to recover the agent WITHOUT rebooting the host or evacuating prematurely.

I will provide:
- `openstack compute service list --host <h>` (state/status/updated_at) and `openstack hypervisor show <h>`.
- On-node state: `systemctl status nova-compute`, the tail of nova-compute.log, `virsh list --all`, and `systemctl status libvirtd`.
- AMQP reachability from the node and any `MessagingTimeout`/heartbeat lines.
- Resource pressure: free memory, load, disk space on `/var/lib/nova` and the instances path, and dmesg for OOM/IO errors.
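
To make the `updated_at` field concrete: Nova reports a service as down once its last heartbeat is older than `service_down_time`. The Python sketch below assumes the default of 60 seconds and a naive UTC timestamp in ISO format; the helper names `heartbeat_age` and `is_stale` and the sample timestamps are hypothetical, so check the value in your own `nova.conf` before relying on it.

```python
from datetime import datetime, timedelta

# Assumption: Nova's default service_down_time (60 s); your deployment may override it.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def heartbeat_age(updated_at: str, now: datetime) -> timedelta:
    """Time since the last heartbeat; both values are assumed to be naive UTC."""
    return now - datetime.fromisoformat(updated_at)


def is_stale(updated_at: str, now: datetime,
             threshold: timedelta = SERVICE_DOWN_TIME) -> bool:
    """True when the heartbeat is older than the down threshold."""
    return heartbeat_age(updated_at, now) > threshold


if __name__ == "__main__":
    # Hypothetical sample values, not real output.
    now = datetime(2024, 5, 1, 12, 2, 0)
    print(is_stale("2024-05-01T12:00:00.000000", now))
```

A heartbeat that is only slightly past the threshold points more toward an AMQP or agent stall than toward a host that has been dead for hours.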

Your tasks:

1. **Is it the agent or the hypervisor?** Determine whether VMs are still healthy (shown as running in `virsh list` and reachable over the network) even though nova-compute reports down; a down service does NOT mean down VMs.
2. **Find the stall point.** Classify as: nova-compute process dead/crashlooping, libvirt hung (virsh hangs), AMQP heartbeat lost (agent alive but not reporting), or host resource exhaustion (OOM, full disk, IO stall).
3. **Recover least-destructively.** For AMQP/agent issues, restart only nova-compute. For libvirt hangs, assess whether libvirtd can be restarted without killing qemu: guests run as separate qemu processes and libvirtd normally reattaches to them on startup, but confirm how libvirtd is deployed on this host (for example, in a container) before restarting it.
4. **Decide on evacuation.** Only if the host is truly unrecoverable, outline the `nova evacuate` / `nova host-evacuate` decision (`openstack server evacuate` on current clients) and its hard precondition: the source host must be fenced or powered off so the same instance can never run on two hosts at once (split-brain).
5. **Confirm recovery.** The service state returns to `up`, instances are unaffected, and no duplicate domains exist.
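
As an illustration of the stall-point classification in task 2, here is a minimal Python sketch. The `Evidence` fields and the `classify` function are hypothetical names, and the order of the checks (resource exhaustion first, because it can cause the other symptoms) is an assumption rather than a rule stated by this prompt.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical summary of the on-node evidence listed above."""
    nova_compute_active: bool   # systemctl status nova-compute shows active
    virsh_responds: bool        # virsh list --all returns promptly
    amqp_reachable: bool        # the node can reach the message broker
    messaging_timeouts: bool    # MessagingTimeout/heartbeat lines in the log
    resource_exhaustion: bool   # OOM, full disk, or IO stall seen


def classify(e: Evidence) -> str:
    """Map the evidence to one of the four stall points (assumed ordering)."""
    if e.resource_exhaustion:
        return "host resource exhaustion"
    if not e.virsh_responds:
        return "libvirt hung"
    if not e.nova_compute_active:
        return "nova-compute dead or crashlooping"
    if not e.amqp_reachable or e.messaging_timeouts:
        return "AMQP heartbeat lost"
    return "unclassified"


if __name__ == "__main__":
    print(classify(Evidence(True, True, True, True, False)))
```

An `unclassified` result means the evidence is inconclusive and more data should be gathered before anything is restarted.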

Output: (a) agent-vs-hypervisor-vs-host verdict, (b) the stall point with evidence, (c) ordered recovery (restart agent → restart libvirtd → escalate), (d) the fencing precondition before any evacuate, (e) the post-recovery checks confirming the service is up and instances are unaffected.
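
The ordered recovery and the evacuation gate can be sketched as follows. This is a hedged Python illustration: `LADDER`, `next_step`, and `may_evacuate` are hypothetical names, and the code only encodes the decision order described in this prompt. It runs no commands against the host.

```python
# Escalation order taken from the prompt: agent first, then libvirtd, then a human.
LADDER = ("restart nova-compute", "restart libvirtd", "escalate")


def next_step(attempted) -> str:
    """Return the least destructive step that has not been tried yet."""
    for step in LADDER:
        if step not in attempted:
            return step
    return "escalate"


def may_evacuate(host_unrecoverable: bool, host_fenced: bool) -> bool:
    """Evacuation is allowed only for an unrecoverable host that is already fenced or off."""
    return host_unrecoverable and host_fenced


if __name__ == "__main__":
    print(next_step(["restart nova-compute"]))
    print(may_evacuate(host_unrecoverable=True, host_fenced=False))
```

Keeping the fencing check as a separate boolean makes it hard to skip: an unrecoverable host that is still powered on never passes the gate.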
