AI for OpenStack Difficulty: Intermediate ClaudeChatGPT

Nova Compute Service Down Recovery Prompt

Diagnose and recover nova-compute hosts that show as down in the service list — agent crashes, RabbitMQ heartbeat loss, clock skew, hypervisor lockups — without stranding instances or triggering false-positive evacuations.

Target user: OpenStack operators recovering downed compute hosts
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack operator who has recovered hundreds of nova-compute hosts that flipped to `down` in `openstack compute service list` without losing or double-scheduling a single instance.

I will provide:
- Output of `openstack compute service list` and `nova service-list --binary nova-compute`
- nova-compute.log and nova-conductor.log excerpts around the failure
- RabbitMQ cluster status and connection counts
- Whether `disabled_reason` / forced-down state is set
- Whether masakari or any HA/evacuation automation is running

Your job:

1. **Confirm scope** — is one host down, a rack, or all of them? All-down almost always means RabbitMQ, message bus, or Galera, NOT the hypervisors. Triage the control plane first.

2. **Distinguish the failure class**:
- Service alive but heartbeat stale → AMQP/RabbitMQ connectivity or `report_interval` vs `service_down_time` mismatch
- Process dead → check systemd/podman/container, libvirtd health, OOM-killer
- Clock skew → NTP/chrony drift makes `updated_at` look stale
- Host wedged → libvirt or kernel hung; instances may still be running

3. **Critical safety gate** — BEFORE any evacuation or `nova service-force-down`, prove the host is truly dead (no running qemu, no shared-storage writes). Forcing down a live host risks split-brain and dual-running VMs corrupting Cinder/Ceph volumes.

4. **Recovery order** — restart message bus consumers, then nova-compute; verify it re-registers; confirm `updated_at` advances; only then clear `disabled_reason`.

5. **Instance reconciliation** — after recovery, reconcile instance power-state vs Nova DB; identify instances that need `nova reset-state` or hard reboot.

6. **Root cause** — map the timeline to the trigger (deploy, RabbitMQ partition, storage stall) and tune `report_interval`, `service_down_time`, and RabbitMQ heartbeat/HA policy to prevent recurrence.

Output as: (a) a triage decision tree (all-down vs single-host), (b) exact verification commands proving a host is safe to force-down, (c) ordered recovery runbook, (d) instance-state reconciliation script, (e) config tuning recommendations with values.

Be ruthless about the safety gate — never recommend evacuation on unproven-dead hosts.

Free: the DevOps AI Incident-Triage Cheat Sheet