By James Joyner IV · 10 min read

OpenStack Error Guide: nova-compute Service State 'down' / Hypervisor Down

Fix the nova-compute 'down' hypervisor state in OpenStack: diagnose dead services, RabbitMQ drops, clock skew, stale placement, and force-down for evacuation.

  • #openstack
  • #troubleshooting
  • #errors
  • #nova

Overview

A hypervisor shows down when Nova has not recorded a heartbeat from that host’s nova-compute service within service_down_time seconds (default 60). nova-compute normally reports every report_interval seconds (default 10), so several reports in a row must go missing before the state flips. The compute node may still be running instances perfectly, but the control plane believes it is gone: the scheduler stops placing new instances there, and operators may be tempted to evacuate.

You will see this in openstack hypervisor list or openstack compute service list:

+----+------------------+------------+----------+---------+-------+----------------------------+
| ID | Binary           | Host       | Zone     | Status  | State | Updated At                 |
+----+------------------+------------+----------+---------+-------+----------------------------+
| 7  | nova-compute     | compute-04 | nova     | enabled | down  | 2026-06-24T13:41:02.000000 |
+----+------------------+------------+----------+---------+-------+----------------------------+

And when scheduling lands on it, or an API call references it:

Compute service of compute-04 is down. (HTTP 400)

The state is purely a function of the last report time. nova-compute writes its “I’m alive” timestamp into the services table (via conductor RPC) on every periodic loop. If that loop stops, or the message never lands, the host flips to down even if libvirtd keeps the VMs up.
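The rule above is plain timestamp arithmetic, and it can be sketched in a few lines of shell. This is a minimal illustration, not Nova's code: to_epoch and service_state are hypothetical helper names, GNU date is assumed, and treating a report that is too far in the future as down is an assumption about how the database servicegroup driver behaves.

```shell
# Hypothetical helpers (not part of Nova). Assumes GNU date.

# Convert "2026-06-24T13:41:02.000000" (UTC) to epoch seconds.
to_epoch() {
    clean=$(printf '%s' "$1" | sed -e 's/T/ /' -e 's/\..*$//')
    date -u -d "$clean" +%s
}

# service_state LAST_REPORT NOW [SERVICE_DOWN_TIME]
# Prints "down" when the report is further than SERVICE_DOWN_TIME seconds
# from NOW in either direction (assumed behaviour), otherwise "up".
service_state() {
    age=$(( $(to_epoch "$2") - $(to_epoch "$1") ))
    if [ "$age" -lt 0 ]; then age=$(( -age )); fi
    if [ "$age" -gt "${3:-60}" ]; then echo down; else echo up; fi
}

service_state '2026-06-24T13:41:02.000000' '2026-06-24T14:02:55.000000' 60   # prints: down
service_state '2026-06-24T14:02:55.000000' '2026-06-24T14:03:00.000000' 60   # prints: up
```

The first call uses the compute-04 row from the table above: 1313 seconds without a report is far beyond the 60-second limit.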

Symptoms

  • openstack hypervisor list or openstack compute service list shows State = down.
  • New instances skip the host; if it was the only candidate, the request fails with No valid host was found.
  • API errors like Compute service of <host> is down.
  • Updated At for the service is stale (older than service_down_time).
openstack compute service list --service nova-compute -c Host -c State -c "Updated At"
+------------+-------+----------------------------+
| Host       | State | Updated At                 |
+------------+-------+----------------------------+
| compute-03 | up    | 2026-06-24T14:02:55.000000 |
| compute-04 | down  | 2026-06-24T13:41:02.000000 |
+------------+-------+----------------------------+

The Updated At on compute-04 is about 22 minutes stale, and that is the tell: it stopped reporting at 13:41.
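For scripting, the same listing is easier to filter in the CLI's -f value output than in table form. The sketch below assumes one "host state updated_at" line per service; down_hosts is a hypothetical helper name.

```shell
# Hypothetical helper: read "host state updated_at" lines on stdin and
# print only the hosts whose state is "down".
down_hosts() {
    awk '$2 == "down" { print $1 }'
}

# Sample input mirroring the table above.
printf '%s\n' \
    'compute-03 up 2026-06-24T14:02:55.000000' \
    'compute-04 down 2026-06-24T13:41:02.000000' | down_hosts
# prints: compute-04
```

Against a live cloud the input would come from openstack compute service list --service nova-compute -f value -c Host -c State -c "Updated At".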

Common Root Causes

1. The nova-compute process is actually dead

The simplest cause: the service crashed, was OOM-killed, or was stopped and never restarted. No process means no heartbeat.

# Kolla-Ansible
docker ps --filter name=nova_compute --format '{{.Names}}\t{{.Status}}'
# Traditional packages
sudo systemctl status nova-compute --no-pager
nova_compute   Exited (1) 22 minutes ago

An Exited container or inactive (dead) unit that stopped right at the stale Updated At time confirms it.
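The exit code in the container status often hints at why the process died. A rough sketch, assuming the docker ps status format shown above; exit_hint is a hypothetical helper, and the hints are starting points rather than a diagnosis.

```shell
# Hypothetical helper: turn a "docker ps" status string into a triage hint.
# Exit codes above 128 mean the process was ended by signal (code - 128).
exit_hint() {
    code=$(printf '%s\n' "$1" | sed -n 's/.*Exited (\([0-9][0-9]*\)).*/\1/p')
    case "$code" in
        '')  echo "not exited" ;;
        0)   echo "clean stop" ;;
        137) echo "SIGKILL (128+9), check for the OOM killer" ;;
        143) echo "SIGTERM (128+15), stopped on request" ;;
        *)   echo "error exit $code, read the service log" ;;
    esac
}

exit_hint 'nova_compute   Exited (1) 22 minutes ago'
# prints: error exit 1, read the service log
```

An exit code of 137 is worth cross-checking against the kernel log for OOM-killer entries before simply restarting the service.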

2. RabbitMQ connectivity is broken

nova-compute reports up by RPC over the message bus. If the host cannot reach RabbitMQ — network partition, dead rabbit node, expired credentials, AMQP heartbeat timeout — the service runs but never reaches conductor, so it goes down.

# Kolla-Ansible
docker logs nova_compute 2>&1 | grep -iE 'AMQP|rabbit|reconnect' | tail -10
# Traditional
sudo journalctl -u nova-compute | grep -iE 'AMQP|rabbit|reconnect' | tail -10
ERROR oslo.messaging._drivers.impl_rabbit [-] [...] AMQP server on 10.0.0.11:5672 is unreachable: timed out. Trying again in 32 seconds.

The process is alive but isolated. Verify the port from the compute host:

nc -zv 10.0.0.11 5672
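In a clustered RabbitMQ deployment there may be several endpoints to test. One way to avoid guessing is to pull the unreachable addresses straight out of the log and probe each one. This is a sketch under the assumption that the log lines look like the sample above; amqp_unreachable and probe_endpoints are hypothetical helper names.

```shell
# Hypothetical helper: extract unique "host:port" pairs from
# "AMQP server on HOST:PORT is unreachable" log lines on stdin.
amqp_unreachable() {
    sed -n 's/.*AMQP server on \([^ ]*\) is unreachable.*/\1/p' | sort -u
}

# Hypothetical helper: probe each "host:port" line on stdin with nc,
# using a 3-second timeout per endpoint.
probe_endpoints() {
    while IFS=: read -r host port; do
        if nc -z -w 3 "$host" "$port" </dev/null; then
            echo "$host:$port reachable"
        else
            echo "$host:$port UNREACHABLE"
        fi
    done
}

# Sample log line from above.
printf '%s\n' 'ERROR oslo.messaging._drivers.impl_rabbit [-] [...] AMQP server on 10.0.0.11:5672 is unreachable: timed out. Trying again in 32 seconds.' \
    | amqp_unreachable
# prints: 10.0.0.11:5672
```

On a compute host the two can be chained, for example docker logs nova_compute 2>&1 | amqp_unreachable | probe_endpoints.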

3. Clock skew versus service_down_time

The down calculation compares the report timestamp to “now” on the controller. If the clocks involved disagree, for example because a node has drifted or NTP has stopped, a fresh report can look stale (or a stale one fresh), and the state can flip again when the clock is corrected.

# On controller and compute
date -u
timedatectl status | grep -E 'synchronized|NTP'
System clock synchronized: no
NTP service: inactive

A skew larger than service_down_time (60s default) reliably produces phantom down states.
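Comparing two date -u readings by eye is error-prone, so it can help to compare epoch seconds instead. A minimal sketch: clock_skew and skew_verdict are hypothetical helpers, and the 60-second default simply mirrors service_down_time.

```shell
# Hypothetical helper: absolute difference between two epoch timestamps.
clock_skew() {
    skew=$(( $1 - $2 ))
    if [ "$skew" -lt 0 ]; then skew=$(( -skew )); fi
    echo "$skew"
}

# Hypothetical helper: compare the skew with a threshold (default 60 s).
skew_verdict() {
    if [ "$(clock_skew "$1" "$2")" -gt "${3:-60}" ]; then
        echo "skew exceeds service_down_time"
    else
        echo "skew within service_down_time"
    fi
}

clock_skew 1000 1075        # prints: 75
skew_verdict 1000 1075 60   # prints: skew exceeds service_down_time
```

In practice the two arguments would be something like "$(ssh compute-04 date -u +%s)" and "$(date -u +%s)" run from the controller; the SSH round trip adds a little error, so treat differences of a second or two as noise.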

4. Placement resource provider gone stale

While a host is down, nova-compute stops updating its placement resource provider. The scheduler then sees stale or zero inventory, so No valid host was found can persist after the service recovers, until the next update_available_resource run refreshes placement.

openstack resource provider list --name compute-04
openstack resource provider inventory list <RP_UUID>
+----------------+-------+----------+----------+----------+
| resource_class | total | reserved | min_unit | max_unit |
+----------------+-------+----------+----------+----------+
| VCPU           | 0     | 0        | 1        | 0        |
+----------------+-------+----------+----------+----------+

Zeroed inventory means placement still has the provider but no usable capacity reported.
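The inventory check can be scripted over -f value output. The sketch below assumes one "resource_class total" line per inventory record and also flags the case where no inventory is returned at all; inventory_problems is a hypothetical helper name.

```shell
# Hypothetical helper: read "resource_class total" lines on stdin and
# report classes with a zero total, or the absence of any inventory.
inventory_problems() {
    awk '
        { rows++ }
        $2 == 0 { print $1 " has total 0" }
        END { if (rows == 0) print "no inventory reported" }
    '
}

# Sample input for illustration.
printf '%s\n' 'VCPU 0' 'MEMORY_MB 0' 'DISK_GB 100' | inventory_problems
# prints:
# VCPU has total 0
# MEMORY_MB has total 0
```

A live run would feed it from openstack resource provider inventory list <RP_UUID> -f value -c resource_class -c total.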

5. libvirtd is down (compute can’t update its resources)

nova-compute depends on libvirtd for the resource-tracker update. If libvirt is dead or hung, the periodic update_available_resource task can error out repeatedly and the service heartbeat may stall behind it.

# Kolla-Ansible
docker ps --filter name=nova_libvirt --format '{{.Names}}\t{{.Status}}'
# Traditional
sudo systemctl status libvirtd --no-pager
docker logs nova_compute 2>&1 | grep -i libvirt | tail -10
ERROR nova.compute.manager [-] Error updating resources for node compute-04: libvirt.libvirtError: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory

6. Compute-to-conductor RPC / cell database problems

nova-compute talks to conductor, which writes the service record to the cell database. A broken cell mapping, an unreachable cell DB, or a conductor outage means the heartbeat never persists.

# Kolla-Ansible (on controller)
docker exec nova_api nova-manage cell_v2 list_cells
docker logs nova_conductor 2>&1 | grep -iE 'error|cell|DBConnectionError' | tail -10
ERROR oslo_db.sqlalchemy.engine.Connection ... DBConnectionError: (pymysql) Can't connect to MySQL server on 'cell1-db'

Diagnostic Workflow

Step 1: Confirm the state and the last report time

openstack compute service list --service nova-compute -c Host -c State -c "Updated At" -c Status
openstack hypervisor show compute-04 -c state -c status -f value 2>/dev/null

Note the exact Updated At — that timestamp is roughly when the host stopped reporting and anchors every other check.

Step 2: Is the process even running?

# Kolla-Ansible
docker ps -a --filter name=nova_compute
# Traditional
sudo systemctl status nova-compute --no-pager

If dead, restart it and watch for clean startup:

docker restart nova_compute          # Kolla-Ansible
sudo systemctl restart nova-compute  # Traditional

Step 3: Check the message bus and clock from the compute host

nc -zv <RABBIT_VIP> 5672
sudo journalctl -u nova-compute | grep -iE 'AMQP|rabbit' | tail -10
timedatectl status | grep -E 'synchronized|NTP'

A reachable RabbitMQ plus a synchronized clock rules out the two most common silent causes.

Step 4: Verify libvirt and the resource-tracker

sudo systemctl status libvirtd --no-pager           # or: docker ps --filter name=nova_libvirt
docker logs nova_compute 2>&1 | grep -iE 'update_available_resource|libvirt' | tail -20

Repeated resource-update errors mean fix libvirt first; the heartbeat recovers once the periodic task completes.

Step 5: Recover, and force-down if you must evacuate

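Once the underlying fault is fixed, the host flips back to up on its next report. If the host is genuinely lost and you need to evacuate its instances, first make sure it is powered off or fenced, because evacuating while the original VMs are still running can corrupt shared disks. Then mark it down so Nova permits evacuation: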

# Stop the scheduler from using it
openstack compute service set --disable --disable-reason "hw failure" compute-04 nova-compute
# Tell Nova the host is really down (enables evacuate)
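# --down and --up need compute API microversion 2.11 or later
openstack --os-compute-api-version 2.11 compute service set --down compute-04 nova-compute   # legacy client: nova service-force-down compute-04 nova-compute
openstack server evacuate <SERVER>
# After repair, clear it
openstack --os-compute-api-version 2.11 compute service set --up --enable compute-04 nova-compute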
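When a host carries many instances, it can be safer to print the full command sequence and review it before running anything. The sketch below is a dry run only: evacuate_plan is a hypothetical helper, it assumes server IDs arrive one per line on stdin, and it pins microversion 2.11 on the assumption that the client does not negotiate it automatically.

```shell
# Hypothetical helper: print (do not run) the disable, force-down and
# evacuate commands for HOST, one evacuate per server ID read from stdin.
evacuate_plan() {
    host=$1
    printf 'openstack compute service set --disable --disable-reason "hw failure" %s nova-compute\n' "$host"
    printf 'openstack --os-compute-api-version 2.11 compute service set --down %s nova-compute\n' "$host"
    while read -r server; do
        if [ -n "$server" ]; then
            printf 'openstack server evacuate %s\n' "$server"
        fi
    done
}

# Placeholder IDs for illustration.
printf '%s\n' '<SERVER_1>' '<SERVER_2>' | evacuate_plan compute-04
```

The ID list could come from openstack server list --all-projects --host compute-04 -f value -c ID. Printing first keeps the fencing decision with a human: nothing here confirms the host is actually powered off.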

Example Root Cause Analysis

compute-04 shows down with Updated At frozen at 13:41, yet its instances respond to ping. Because the VMs are alive, the process or its uplink — not the hardware — is suspect.

The container is up, so it is not a dead process:

nova_compute   Up 4 hours

The nova-compute log explains the silence:

ERROR oslo.messaging._drivers.impl_rabbit [-] [c2f1...] AMQP server on 10.0.0.11:5672 is unreachable: [Errno 113] No route to host. Trying again in 32 seconds.

A nc -zv 10.0.0.11 5672 from compute-04 times out, while compute-03 connects fine. A switch port flap had dropped compute-04 off the management VLAN at 13:41 — exactly the stale timestamp. The service was healthy but could not deliver its heartbeat.

Fix: restore the management link, then confirm reconnection:

docker logs nova_compute 2>&1 | grep -i 'Connected to AMQP' | tail -1
openstack compute service list --service nova-compute --host compute-04 -c State
INFO oslo.messaging._drivers.impl_rabbit [-] [c2f1...] Reconnected to AMQP server on 10.0.0.11:5672

Within one report interval the host returns to up and the scheduler resumes placing on it.

Prevention Best Practices

  • Alert on State = down and on stale Updated At: poll openstack compute service list and page the moment any host exceeds service_down_time. A down hypervisor silently shrinks scheduling capacity.
  • Run NTP/chrony on every node and alert on synchronized: no. Clock skew produces phantom down states that waste hours of debugging.
  • Monitor RabbitMQ reachability and AMQP reconnect log lines from each compute host — the bus is the heartbeat path.
  • Watch placement inventory for zeroed VCPU/MEMORY_MB providers, which signal a resource-tracker that has stopped updating.
  • Tie nova-compute startup to libvirtd health so a dead libvirt is caught before it stalls the heartbeat.
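The polling idea in the first bullet can be sketched as a Nagios-style check that most monitoring systems can wrap. This is an illustration under assumptions: check_compute_services is a hypothetical name, input is "host state" lines on stdin, and exit code 2 follows the common plugin convention for CRITICAL.

```shell
# Hypothetical check: read "host state" lines on stdin, print a one-line
# summary, and return 2 (CRITICAL) if any service is down, else 0 (OK).
check_compute_services() {
    down=$(awk '$2 == "down" { printf "%s%s", sep, $1; sep = "," }')
    if [ -n "$down" ]; then
        echo "CRITICAL: nova-compute down on $down"
        return 2
    fi
    echo "OK: all nova-compute services up"
    return 0
}
```

Fed from openstack compute service list --service nova-compute -f value -c Host -c State, the sample cloud in this guide would produce CRITICAL: nova-compute down on compute-04. Running it at an interval shorter than service_down_time keeps detection delay close to Nova's own.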

Quick Command Reference

# State and last report time
openstack compute service list --service nova-compute -c Host -c State -c "Updated At" -c Status

# Is the process alive?
docker ps -a --filter name=nova_compute
sudo systemctl status nova-compute --no-pager

# Message bus + clock from the compute host
nc -zv <RABBIT_VIP> 5672
docker logs nova_compute 2>&1 | grep -iE 'AMQP|rabbit|reconnect' | tail -10
timedatectl status | grep -E 'synchronized|NTP'

# libvirt + resource tracker
sudo systemctl status libvirtd --no-pager
docker logs nova_compute 2>&1 | grep -iE 'update_available_resource|libvirt' | tail -20

# Placement inventory for the host
openstack resource provider list --name <HOST>
openstack resource provider inventory list <RP_UUID>

# Restart the service
docker restart nova_compute
sudo systemctl restart nova-compute

# Disable + force-down for safe evacuation, then restore
openstack compute service set --disable --disable-reason "hw failure" <HOST> nova-compute
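openstack --os-compute-api-version 2.11 compute service set --down <HOST> nova-compute   # legacy client: nova service-force-down <HOST> nova-compute
openstack server evacuate <SERVER>
openstack --os-compute-api-version 2.11 compute service set --up --enable <HOST> nova-compute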

Conclusion

A hypervisor in down state means conductor has not heard from nova-compute within service_down_time. The instances may be fine; the heartbeat is what failed. The usual root causes:

  1. The nova-compute process is dead, crashed, or OOM-killed.
  2. RabbitMQ is unreachable, so the alive-report never lands.
  3. Clock skew makes report timestamps look stale.
  4. The placement resource provider went stale with zeroed inventory.
  5. libvirtd is down, stalling the resource-tracker behind it.
  6. A broken cell DB or conductor RPC path drops the service record.

Anchor on the stale Updated At, confirm whether the process is alive, then check the bus and clock — and only force-down a host you have genuinely confirmed lost before evacuating.
