OpenStack Error Guide: nova-compute Service State 'down' / Hypervisor Down
Fix the nova-compute 'down' hypervisor state in OpenStack: diagnose dead services, RabbitMQ drops, clock skew, stale placement, and force-down for evacuation.
Overview
A hypervisor shows down when nova-conductor has not received a heartbeat from that host’s nova-compute service within service_down_time seconds (default 60). The compute node may still be running instances perfectly, but the control plane believes it is gone: the scheduler stops placing new instances there, and operators may try to evacuate.
You will see this in openstack hypervisor list or openstack compute service list:
+----+------------------+------------+----------+---------+-------+----------------------------+
| ID | Binary           | Host       | Zone     | Status  | State | Updated At                 |
+----+------------------+------------+----------+---------+-------+----------------------------+
|  7 | nova-compute     | compute-04 | nova     | enabled | down  | 2026-06-24T13:41:02.000000 |
+----+------------------+------------+----------+---------+-------+----------------------------+
And when scheduling lands on it, or an API call references it:
Compute service of compute-04 is down. (HTTP 400)
The state is purely a function of the last report time. nova-compute writes its “I’m alive” timestamp into the services table (via conductor RPC) every report_interval seconds (default 10). If that loop stops, or the message never lands, the host flips to down once service_down_time has elapsed, even if libvirtd keeps the VMs up.
Symptoms
- openstack hypervisor list or compute service list shows State = down.
- New instances skip the host; No valid host was found if it was the only candidate.
- API errors like Compute service of <host> is down.
- Updated At for the service is stale (older than service_down_time).
openstack compute service list --service nova-compute -c Host -c State -c "Updated At"
+------------+-------+----------------------------+
| Host       | State | Updated At                 |
+------------+-------+----------------------------+
| compute-03 | up    | 2026-06-24T14:02:55.000000 |
| compute-04 | down  | 2026-06-24T13:41:02.000000 |
+------------+-------+----------------------------+
The Updated At on compute-04 is roughly 22 minutes behind compute-03, and that is the tell: it stopped reporting at 13:41.
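The staleness arithmetic can be reproduced by hand. The sketch below is one way to do it, assuming GNU date and UTC timestamps; to_epoch and heartbeat_age are hypothetical helper names, not Nova or OpenStack CLI commands.

```shell
# Hypothetical helpers, not part of Nova or the OpenStack CLI.
# Assumes GNU date and that "Updated At" values are UTC.

# Convert a value like "2026-06-24T13:41:02.000000" to epoch seconds.
to_epoch() {
    # Drop fractional seconds, then swap the ISO "T" separator for a space.
    ts=$(printf '%s' "${1%.*}" | tr 'T' ' ')
    date -u -d "$ts" +%s
}

# Print the seconds elapsed between Updated At ($1) and a reference time
# ($2, optional; defaults to the current time).
heartbeat_age() {
    t_updated=$(to_epoch "$1") || return 1
    if [ -n "${2:-}" ]; then
        t_ref=$(to_epoch "$2") || return 1
    else
        t_ref=$(date -u +%s)
    fi
    echo $(( t_ref - t_updated ))
}
```

Called as heartbeat_age 2026-06-24T13:41:02.000000 2026-06-24T14:02:55.000000, it prints 1313, far beyond the 60-second service_down_time default.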
Common Root Causes
1. The nova-compute process is actually dead
The simplest cause: the service crashed, OOMed, or was stopped and never restarted. No process means no heartbeat.
# Kolla-Ansible
docker ps --filter name=nova_compute --format '{{.Names}}\t{{.Status}}'
# Traditional packages
sudo systemctl status nova-compute --no-pager
nova_compute Exited (1) 22 minutes ago
An Exited container or inactive (dead) unit that stopped right at the stale Updated At time confirms it.
2. RabbitMQ connectivity is broken
nova-compute reports up by RPC over the message bus. If the host cannot reach RabbitMQ — network partition, dead rabbit node, expired credentials, AMQP heartbeat timeout — the service runs but never reaches conductor, so it goes down.
# Kolla-Ansible
docker logs nova_compute 2>&1 | grep -iE 'AMQP|rabbit|reconnect' | tail -10
# Traditional
sudo journalctl -u nova-compute | grep -iE 'AMQP|rabbit|reconnect' | tail -10
ERROR oslo.messaging._drivers.impl_rabbit [-] [...] AMQP server on 10.0.0.11:5672 is unreachable: timed out. Trying again in 32 seconds.
The process is alive but isolated. Verify the port from the compute host:
nc -zv 10.0.0.11 5672
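When the log is noisy, it can help to summarize which RabbitMQ endpoints are being reported unreachable. A minimal sketch, assuming the oslo.messaging wording shown above ("AMQP server on HOST:PORT is unreachable"); amqp_unreachable is a hypothetical helper name.

```shell
# amqp_unreachable: hypothetical helper, not an OpenStack tool.
# Reads nova-compute log lines on stdin and prints each RabbitMQ endpoint
# reported unreachable, followed by the number of times it appeared.
amqp_unreachable() {
    grep -oE 'AMQP server on [^ ]+ is unreachable' \
        | awk '{ count[$4]++ } END { for (ep in count) print ep, count[ep] }' \
        | sort
}
```

Typical use would be docker logs nova_compute 2>&1 | amqp_unreachable on Kolla-Ansible, or sudo journalctl -u nova-compute | amqp_unreachable on a package install. A single endpoint dominating the output points at one dead broker; every endpoint failing points at the host's own network path.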
3. Clock skew versus service_down_time
The down calculation compares the last report timestamp to “now” on the controller. If the clock on a compute node or a controller drifts, a fresh report can look stale (or a stale one fresh) until the clocks are brought back into sync.
# On controller and compute
date -u
timedatectl status | grep -E 'synchronized|NTP'
System clock synchronized: no
NTP service: inactive
A skew larger than service_down_time (60s default) reliably produces phantom down states.
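The comparison against the threshold can be sketched as below. skew_exceeds is a hypothetical helper name; it takes two epoch-second readings and an optional limit that defaults to 60, the service_down_time default quoted above.

```shell
# skew_exceeds: hypothetical helper, not an OpenStack tool.
# Prints the absolute difference between two epoch-second readings and
# exits 0 when that difference is greater than the limit ($3, default 60).
skew_exceeds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo "$diff"
    [ "$diff" -gt "${3:-60}" ]
}
```

One way to feed it, assuming SSH access to the compute host, is skew_exceeds "$(date -u +%s)" "$(ssh compute-04 date -u +%s)". The SSH round trip adds a second or so of error, so treat small differences as noise.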
4. Placement resource provider gone stale
While a host is down, nova-compute stops updating its placement resource provider. The scheduler then sees stale or zero inventory, so No valid host was found can persist after the service recovers, until the next update_available_resource loop runs.
openstack resource provider list --name compute-04
openstack resource provider inventory list <RP_UUID>
+----------------+-------+----------+----------+----------+
| resource_class | total | reserved | min_unit | max_unit |
+----------------+-------+----------+----------+----------+
| VCPU           |     0 |        0 |        1 |        0 |
+----------------+-------+----------+----------+----------+
Zeroed inventory means placement still has the provider but no usable capacity reported.
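Spotting zeroed classes by eye gets tedious across many providers. A small filter like the following can flag them; zeroed_classes is a hypothetical helper name, and it assumes the inventory command accepts the standard -f value and -c column options.

```shell
# zeroed_classes: hypothetical helper, not an OpenStack tool.
# Reads "RESOURCE_CLASS TOTAL" pairs on stdin and prints every resource
# class whose total is 0. Lines with fewer than two fields are ignored.
zeroed_classes() {
    awk 'NF >= 2 && $2 == 0 { print $1 }'
}
```

For example: openstack resource provider inventory list <RP_UUID> -f value -c resource_class -c total | zeroed_classes. Empty output means no class is reporting a zero total.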
5. libvirtd is down (compute can’t update its resources)
nova-compute depends on libvirtd for the resource-tracker update. If libvirt is dead or hung, the periodic update_available_resource task can error out repeatedly and the service heartbeat may stall behind it.
# Kolla-Ansible
docker ps --filter name=nova_libvirt --format '{{.Names}}\t{{.Status}}'
# Traditional
sudo systemctl status libvirtd --no-pager
docker logs nova_compute 2>&1 | grep -i libvirt | tail -10
ERROR nova.compute.manager [-] Error updating resources for node compute-04: libvirt.libvirtError: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
6. Agent-to-conductor RPC / cell database problems
nova-compute talks to conductor, which writes the service record to the cell database. A broken cell mapping, an unreachable cell DB, or a conductor outage means the heartbeat never persists.
# Kolla-Ansible (on controller)
docker exec nova_api nova-manage cell_v2 list_cells
docker logs nova_conductor 2>&1 | grep -iE 'error|cell|DBConnectionError' | tail -10
ERROR oslo_db.sqlalchemy.engine.Connection ... DBConnectionError: (pymysql) Can't connect to MySQL server on 'cell1-db'
Diagnostic Workflow
Step 1: Confirm the state and the last report time
openstack compute service list --service nova-compute -c Host -c State -c "Updated At" -c Status
openstack hypervisor show compute-04 -c state -c status -f value 2>/dev/null
Note the exact Updated At — that timestamp is roughly when the host stopped reporting and anchors every other check.
Step 2: Is the process even running?
# Kolla-Ansible
docker ps -a --filter name=nova_compute
# Traditional
sudo systemctl status nova-compute --no-pager
If dead, restart it and watch for clean startup:
docker restart nova_compute # Kolla-Ansible
sudo systemctl restart nova-compute # Traditional
Step 3: Check the message bus and clock from the compute host
nc -zv <RABBIT_VIP> 5672
sudo journalctl -u nova-compute | grep -iE 'AMQP|rabbit' | tail -10
timedatectl status | grep -E 'synchronized|NTP'
A reachable RabbitMQ plus a synchronized clock rules out the two most common silent causes.
Step 4: Verify libvirt and the resource-tracker
sudo systemctl status libvirtd --no-pager # or: docker ps --filter name=nova_libvirt
docker logs nova_compute 2>&1 | grep -iE 'update_available_resource|libvirt' | tail -20
Repeated resource-update errors mean fix libvirt first; the heartbeat recovers once the periodic task completes.
Step 5: Recover, and force-down if you must evacuate
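Once the underlying fault is fixed, the host flips back to up on its next report. If the host is genuinely lost and you need to evacuate instances safely, first confirm it is powered off or fenced (an instance still running there would end up with two live copies), then mark it down so Nova permits evacuation: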
# Stop the scheduler from using it
openstack compute service set --disable --disable-reason "hw failure" compute-04 nova-compute
# Tell Nova the host is really down (enables evacuate)
nova service-force-down compute-04 nova-compute # or: openstack compute service set --down compute-04 nova-compute
openstack server evacuate <SERVER>
# After repair, clear it
openstack compute service set --up --enable compute-04 nova-compute
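Evacuating a whole host means repeating the evacuate call for each server on it. The loop below is a sketch of that, not a vetted tool: evacuate_host and the DRY_RUN variable are hypothetical names, it assumes admin credentials are loaded, and by default it only prints the commands it would run.

```shell
# evacuate_host: hypothetical wrapper, shown as a sketch only.
# Lists the servers on a host and evacuates each one. By default it only
# prints the commands; set DRY_RUN=0 to actually run them.
evacuate_host() {
    host="$1"
    openstack server list --all-projects --host "$host" -f value -c ID |
    while read -r server; do
        [ -n "$server" ] || continue
        if [ "${DRY_RUN:-1}" = "0" ]; then
            # </dev/null keeps the command from consuming the server list.
            openstack server evacuate "$server" </dev/null
        else
            echo "openstack server evacuate $server"
        fi
    done
}
```

Run evacuate_host compute-04 first to review the list, and only rerun with DRY_RUN=0 once the host is confirmed off and has been forced down.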
Example Root Cause Analysis
compute-04 shows down with Updated At frozen at 13:41, yet its instances respond to ping. Because the VMs are alive, the process or its uplink — not the hardware — is suspect.
The container is up, so it is not a dead process:
nova_compute Up 4 hours
The nova-compute log explains the silence:
ERROR oslo.messaging._drivers.impl_rabbit [-] [c2f1...] AMQP server on 10.0.0.11:5672 is unreachable: [Errno 113] No route to host. Trying again in 32 seconds.
An nc -zv 10.0.0.11 5672 from compute-04 times out, while compute-03 connects fine. A switch port flap had dropped compute-04 off the management VLAN at 13:41, exactly the stale timestamp. The service was healthy but could not deliver its heartbeat.
Fix: restore the management link, then confirm reconnection:
docker logs nova_compute 2>&1 | grep -i 'Connected to AMQP' | tail -1
openstack compute service list --service nova-compute --host compute-04 -c State
INFO oslo.messaging._drivers.impl_rabbit [-] [c2f1...] Reconnected to AMQP server on 10.0.0.11:5672
Within one report interval the host returns to up and the scheduler resumes placing on it.
Prevention Best Practices
- Alert on State = down and on stale Updated At: poll openstack compute service list and page the moment any host exceeds service_down_time. A down hypervisor silently shrinks scheduling capacity.
- Run NTP/chrony on every node and alert on synchronized: no. Clock skew produces phantom down states that waste hours of debugging.
- Monitor RabbitMQ reachability and AMQP reconnect log lines from each compute host, since the bus is the heartbeat path.
- Watch placement inventory for zeroed VCPU/MEMORY_MB providers, which signal a resource-tracker that has stopped updating.
- Tie nova-compute startup to libvirtd health so a dead libvirt is caught before it stalls the heartbeat.
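The first alerting rule can be prototyped with a filter over the service list. This is a minimal sketch; down_hosts is a hypothetical helper name, and it assumes the list is requested with -f value so each line reads "HOST STATE UPDATED_AT".

```shell
# down_hosts: hypothetical helper for a monitoring check.
# Reads "HOST STATE UPDATED_AT" lines on stdin, prints the host and last
# report time for every service in state "down", and exits 1 if any were
# found (0 otherwise) so a cron job or monitoring agent can alert on it.
down_hosts() {
    awk '$2 == "down" { print $1, $3; found = 1 } END { exit found ? 1 : 0 }'
}
```

For example: openstack compute service list --service nova-compute -f value -c Host -c State -c "Updated At" | down_hosts. A non-zero exit status is the signal to page.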
Quick Command Reference
# State and last report time
openstack compute service list --service nova-compute -c Host -c State -c "Updated At" -c Status
# Is the process alive?
docker ps -a --filter name=nova_compute
sudo systemctl status nova-compute --no-pager
# Message bus + clock from the compute host
nc -zv <RABBIT_VIP> 5672
docker logs nova_compute 2>&1 | grep -iE 'AMQP|rabbit|reconnect' | tail -10
timedatectl status | grep -E 'synchronized|NTP'
# libvirt + resource tracker
sudo systemctl status libvirtd --no-pager
docker logs nova_compute 2>&1 | grep -iE 'update_available_resource|libvirt' | tail -20
# Placement inventory for the host
openstack resource provider list --name <HOST>
openstack resource provider inventory list <RP_UUID>
# Restart the service
docker restart nova_compute
sudo systemctl restart nova-compute
# Disable + force-down for safe evacuation, then restore
openstack compute service set --disable --disable-reason "hw failure" <HOST> nova-compute
nova service-force-down <HOST> nova-compute
openstack server evacuate <SERVER>
openstack compute service set --up --enable <HOST> nova-compute
Conclusion
A hypervisor in down state means conductor has not heard from nova-compute within service_down_time. The instances may be fine; the heartbeat is what failed. The usual root causes:
- The nova-compute process is dead, crashed, or OOM-killed.
- RabbitMQ is unreachable, so the alive-report never lands.
- Clock skew makes report timestamps look stale.
- The placement resource provider went stale with zeroed inventory.
- libvirtd is down, stalling the resource-tracker behind it.
- A broken cell DB or conductor RPC path drops the service record.
Anchor on the stale Updated At, confirm whether the process is alive, then check the bus and clock — and only force-down a host you have genuinely confirmed lost before evacuating.