Troubleshooting Nova Compute Failures in OpenStack

A user files a ticket: “My instance is stuck in ERROR.” You run openstack server show and see a vague fault message about scheduling. After 25 years of running infrastructure — and more than a few of those years babysitting OpenStack clouds — I can tell you the error you first see is almost never where the problem lives. Nova spans an API, a conductor, a scheduler, and a compute agent talking to libvirt. The failure could be in any of them.

This is the order in which I trace Nova compute failures, top to bottom, so I stop guessing and start eliminating.

Start with the instance fault, but don’t trust it

The first command is always:

openstack server show <instance-uuid> -f value -c fault

The fault field gives you a class of problem — NoValidHost, libvirtError, Build of instance aborted — but it’s a summary written at the moment of failure. Treat it as a pointer, not a diagnosis.

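The single most common fault is No valid host was found. That is usually not a compute failure at all. It’s the scheduler telling you that nothing in the cloud satisfied the request. The exception is a fault that mentions exhausted retries: in that case hosts were picked and the build failed on each of them, so the compute logs still matter.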
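The triage described above can be sketched as a small lookup. This is a minimal Python sketch, not anything Nova ships: the function name `next_layer` and the returned labels are hypothetical, and the matched substrings are just the fault strings mentioned in this article.

```python
import re


def next_layer(fault_message: str) -> str:
    """Map a fault message to the layer worth checking next (illustrative only)."""
    text = fault_message.lower()
    # Scheduling faults: nothing satisfied the request, so start with the scheduler.
    if "novalidhost" in text or "no valid host" in text:
        return "scheduler/placement"
    # Network allocation failures usually point at Neutron rather than Nova.
    if re.search(r"failed to allocate (the )?network", text):
        return "neutron"
    # libvirt errors mean a host was picked and the hypervisor rejected something.
    if "libvirt" in text:
        return "compute/libvirt"
    # Anything else: follow the instance UUID through the logs.
    return "trace logs by uuid"
```

The return value is only a pointer to the next place to look, which matches how the fault field itself should be treated.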

Step 1: Is it scheduling, or is it the hypervisor?

This fork decides which logs you read next.

If the fault mentions NoValidHost or scheduling, go to the scheduler. Check capacity and filters:

openstack hypervisor list --long
openstack hypervisor stats show   # deprecated; may be unavailable on newer compute API microversions

Look for exhausted vCPU, RAM, or disk. Remember that Nova applies allocation ratios, so a host can be “full” by Nova’s accounting while htop shows idle CPU. The placement service is the source of truth now (the commands below come from the osc-placement plugin):

openstack resource provider list
openstack resource provider inventory list <provider-uuid>

If placement shows no capacity for the requested flavor, you’ve found it: capacity, an aggregate/affinity rule, or a flavor extra-spec that no host matches.
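To make the allocation-ratio arithmetic concrete, here is a minimal sketch of how placement derives capacity from an inventory record: capacity is (total - reserved) * allocation_ratio, and a request fits only if current usage plus the request stays within it. The function names are hypothetical, and the sketch ignores the min_unit, max_unit, and step_size constraints that placement also enforces.

```python
def placement_capacity(total: int, reserved: int, allocation_ratio: float) -> int:
    """Capacity placement will hand out for one resource class on one provider."""
    return int((total - reserved) * allocation_ratio)


def fits(total: int, reserved: int, allocation_ratio: float,
         used: int, requested: int) -> bool:
    """True if `requested` more units fit on top of what is already `used`."""
    return used + requested <= placement_capacity(total, reserved, allocation_ratio)
```

For example, a host with 32 physical cores, nothing reserved, and a ratio of 4.0 has a VCPU capacity of 128. With 126 already allocated, a 4-vCPU flavor does not fit, no matter how idle the CPUs look in htop.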

Step 2: Read the conductor and scheduler logs

If scheduling looks healthy, follow the request through the control plane. On the controller:

journalctl -u devstack@n-cond -f      # or nova-conductor on packaged installs
grep <instance-uuid> /var/log/nova/nova-scheduler.log
grep <instance-uuid> /var/log/nova/nova-conductor.log

Grep by instance UUID across every Nova log. The UUID is your thread through the maze — it appears in API, scheduler, conductor, and compute logs, and stitching them in time order tells the whole story.
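Stitching the grepped lines into time order is easy to script. The sketch below assumes each line starts with a timestamp in the form `YYYY-MM-DD HH:MM:SS.mmm`, which is common for Nova log files but depends on your logging configuration; the function names and the source labels are hypothetical.

```python
from datetime import datetime

# Assumed timestamp layout at the start of each log line (23 characters).
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def build_timeline(logs: dict, uuid: str) -> list:
    """Merge lines mentioning `uuid` from several logs into one time-ordered list.

    `logs` maps a source label (e.g. "scheduler") to a list of raw log lines.
    Returns (timestamp, source, line) tuples sorted by timestamp.
    """
    events = []
    for source, lines in logs.items():
        for line in lines:
            if uuid not in line:
                continue
            try:
                stamp = datetime.strptime(line[:23], TS_FORMAT)
            except ValueError:
                continue  # e.g. traceback continuation lines without a timestamp
            events.append((stamp, source, line.rstrip()))
    events.sort(key=lambda event: event[0])
    return events


def first_error(timeline: list):
    """Return the earliest (timestamp, source, line) at ERROR level, or None."""
    for event in timeline:
        if " ERROR " in event[2]:
            return event
    return None
```

The earliest ERROR line is often in a different service from the one named in the fault, so `first_error` is a reasonable place to start reading. It does not replace reading the lines around it.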

Step 3: Get onto the compute node

Once the request reaches a host, the action moves to nova-compute and libvirt. SSH to the host the scheduler picked and read:

grep <instance-uuid> /var/log/nova/nova-compute.log

The errors that show up here are the real compute failures:

libvirtError: internal error — almost always something libvirt/QEMU rejected. Read on.
Image-related errors — Glance download failed, or the backing file is corrupt.
Failed to allocate network — Neutron didn’t return a port in time. This is a networking problem masquerading as compute.

Step 4: Drop down to libvirt and QEMU

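When nova-compute.log points at libvirt, go straight to the source. Here <instance-name> is the libvirt domain name (instance-0000002a style by default), not the display name; admins can read it from the OS-EXT-SRV-ATTR:instance_name field of openstack server show.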

virsh list --all
virsh dumpxml <instance-name>
tail -n 100 /var/log/libvirt/qemu/<instance-name>.log

The QEMU per-instance log is gold. It captures the exact qemu-kvm invocation and the error from the kernel or hardware layer — missing virtualization extensions, a CPU model the host can’t provide, an unavailable hugepages mount, or a PCI passthrough device already in use.

A quick sanity check I run on any new compute node:

grep -Ec '(vmx|svm)' /proc/cpuinfo   # 0 means no hardware virt
virt-host-validate

virt-host-validate catches a surprising number of “instance won’t start” problems before they ever reach a ticket.
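If you collect this check from many hosts, the same test is easy to run over captured /proc/cpuinfo text. The sketch below is a slightly stricter variant of the grep: it looks only at each CPU's `flags` line and matches `vmx` or `svm` as whole flag names. The function name is hypothetical, and it assumes the x86 cpuinfo layout.

```python
def hardware_virt_cpus(cpuinfo_text: str) -> int:
    """Count logical CPUs whose `flags` line lists vmx (Intel) or svm (AMD)."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Only the per-CPU "flags" line is checked; other lines are ignored.
        if key.strip() == "flags" and {"vmx", "svm"} & set(value.split()):
            count += 1
    return count
```

On a live host you could call it as `hardware_virt_cpus(open("/proc/cpuinfo").read())`. As with the grep, a result of 0 means the kernel exposes no hardware virtualization flags.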

Using AI to correlate the log trail

Tracing one UUID across four log files at 2am is exactly the kind of tedious correlation an LLM does well — as long as you keep it read-only. I paste the grepped lines from scheduler, conductor, compute, and the QEMU log and ask:

“Here are timestamped log lines for one instance UUID across Nova scheduler, conductor, compute, and the QEMU per-instance log. Build a timeline, tell me the first line where things go wrong, and explain the mechanism. Suggest only read-only verification commands.”

The model is good at noticing that the network-allocation error at T+8s is downstream of a Neutron timeout at T+2s — the kind of ordering insight that saves you from “fixing” the wrong service. I keep a few of these tracing prompts in my prompt library so I’m not writing them during an outage.
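Assembling that paste by hand is error-prone during an outage, so a small helper can do it. This is a minimal sketch: the function name, the section labels, and the separator layout are all hypothetical, and it only builds text. It does not call any model or run any command.

```python
PROMPT_HEADER = (
    "Here are timestamped log lines for one instance UUID across Nova scheduler, "
    "conductor, compute, and the QEMU per-instance log. Build a timeline, tell me "
    "the first line where things go wrong, and explain the mechanism. "
    "Suggest only read-only verification commands."
)


def build_prompt(uuid: str, sections: dict) -> str:
    """Combine the fixed instructions with labelled log excerpts.

    `sections` maps a label (e.g. "scheduler") to the grepped lines from that log.
    """
    parts = [PROMPT_HEADER, "", f"Instance UUID: {uuid}"]
    for name, lines in sections.items():
        parts.append("")
        parts.append(f"--- {name} ---")
        parts.extend(line.rstrip() for line in lines)
    return "\n".join(parts)
```

Labelling each excerpt with its source helps the model attribute each line to the right service. Review the assembled text for secrets or tenant data before pasting it anywhere.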

The failures I see most often

After enough incidents, the long tail collapses into a handful of usual suspects:

Allocation-ratio surprises. The host “looks empty” but placement says full. Adjust cpu_allocation_ratio deliberately, not reactively.
Stale allocations in placement. A failed build leaves a phantom allocation consuming capacity. nova-manage placement audit finds them.
Time skew. A compute node with a drifting clock causes token and messaging failures that surface as random build aborts. Check chronyc tracking everywhere.
Neutron timeouts blamed on Nova. “Failed to allocate network” is a Neutron/RabbitMQ problem 90% of the time.
Image format mismatch. A raw image on a qcow2-expecting backend, or vice versa.
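For the image format case, a quick way to spot a mismatch is to compare the image's declared format with what the file actually contains. A QCOW file starts with the magic bytes `QFI\xfb` followed by a big-endian version number, while a raw image has no signature at all. The sketch below reads only those first bytes; the function name and return labels are hypothetical, and `qemu-img info` remains the authoritative check.

```python
import struct

QCOW_MAGIC = b"QFI\xfb"  # first four bytes of every QCOW image


def detect_disk_format(header: bytes) -> str:
    """Guess the format from the first 8 bytes of an image file."""
    if len(header) >= 8 and header[:4] == QCOW_MAGIC:
        (version,) = struct.unpack(">I", header[4:8])
        # Version 1 is the legacy qcow format; 2 and 3 are qcow2.
        return "qcow2" if version >= 2 else "qcow"
    # Raw images carry no magic, so absence of the signature is only a hint.
    return "unknown (possibly raw)"
```

To use it on a file, read the first eight bytes in binary mode, for example `detect_disk_format(open(path, "rb").read(8))`, and compare the result with the image's disk_format in Glance.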

A repeatable runbook beats heroics

The reason I trace in this fixed order — fault, scheduler/placement, conductor, compute, libvirt/QEMU — is that it converts a scary “instance is broken” into a deterministic process of elimination. You’re never staring at the whole stack at once; you’re answering one yes/no question at a time and moving down a layer.

If you want a head start on the prompts and checklists for this, we keep a growing set of OpenStack troubleshooting prompts aimed at exactly these control-plane-to-hypervisor traces. Steal them, adapt them to your log paths, and keep the human reading every command before it runs.

AI-generated command suggestions are assistive, not authoritative. Verify against your own cloud before acting in production.