Troubleshooting Nova Compute Failures in OpenStack
When an OpenStack instance won't boot, the error is rarely where you first look. Here's a field-tested order for tracing Nova compute failures from API to hypervisor.
- #openstack
- #nova
- #compute
- #troubleshooting
- #kvm
- #libvirt
A user files a ticket: “My instance is stuck in ERROR.” You run openstack server show and see a vague fault message about scheduling. After 25 years of running infrastructure — and more than a few of those years babysitting OpenStack clouds — I can tell you the error you first see is almost never where the problem lives. Nova spans an API, a conductor, a scheduler, and a compute agent talking to libvirt. The failure could be in any of them.
This is the order in which I trace Nova compute failures, top to bottom, so I stop guessing and start eliminating.
Start with the instance fault, but don’t trust it
The first command is always:
openstack server show <instance-uuid> -f value -c fault
The fault field gives you a class of problem — NoValidHost, libvirtError, Build of instance aborted — but it’s a summary written at the moment of failure. Treat it as a pointer, not a diagnosis.
The single most common fault is No valid host was found. That is not a compute failure at all. It’s the scheduler telling you that nothing in the cloud satisfied the request.
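The triage above can be sketched as a small shell function that maps a fault string to the layer worth checking first. This is only an illustration: the function name classify_fault and the patterns are my own assumptions, not anything Nova ships, and the pattern list is far from exhaustive.

```shell
# Map a Nova fault message to the layer worth checking first.
# classify_fault is a hypothetical helper; the patterns are illustrative only.
classify_fault() {
    case "$1" in
        *NoValidHost*|*"No valid host"*) echo "scheduler" ;;
        *"Failed to allocate"*network*)  echo "neutron" ;;
        *libvirtError*)                  echo "hypervisor" ;;
        *)                               echo "unknown" ;;
    esac
}

# Usage: classify_fault "$(openstack server show <instance-uuid> -f value -c fault)"
```

The answer is still only a pointer: it tells you which log to open first, not what went wrong.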
Step 1: Is it scheduling, or is it the hypervisor?
This fork decides which logs you read next.
If the fault mentions NoValidHost or scheduling, go to the scheduler. Check capacity and filters:
openstack hypervisor list --long
openstack hypervisor stats show # unavailable from compute API microversion 2.88 onward; use placement there
Look for exhausted vCPU, RAM, or disk. Remember Nova applies allocation ratios — a host can be “full” by Nova’s accounting while htop shows idle CPU. The placement service is the source of truth now:
openstack resource provider list
openstack resource provider inventory list <provider-uuid>
If placement shows no capacity for the requested flavor, you’ve found it: capacity, an aggregate/affinity rule, or a flavor extra-spec that no host matches.
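To see why a host can be "full" while it looks idle, it helps to redo placement's arithmetic by hand. Placement treats capacity for a resource class as (total - reserved) * allocation_ratio, truncated to an integer, and what is left is that capacity minus current usage. The helper below is a sketch; placement_free is a name I made up, and you would feed it the total, reserved, and allocation_ratio values from the inventory listing plus the usage figure from openstack resource provider usage show.

```shell
# placement_free TOTAL RESERVED ALLOCATION_RATIO USED
# Prints the remaining schedulable units for one resource class.
# Hypothetical helper: it only mirrors placement's capacity arithmetic.
placement_free() {
    awk -v total="$1" -v reserved="$2" -v ratio="$3" -v used="$4" \
        'BEGIN { printf "%d\n", int((total - reserved) * ratio) - used }'
}

# 32 physical cores, none reserved, ratio 4.0, 128 vCPUs already allocated:
placement_free 32 0 4.0 128   # prints 0, so the next instance gets NoValidHost
```

A result of zero or less means the scheduler has nothing to offer for that resource class, whatever htop says.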
Step 2: Read the conductor and scheduler logs
If scheduling looks healthy, follow the request through the control plane. On the controller:
journalctl -u devstack@n-cond -f # or nova-conductor on packaged installs
grep <instance-uuid> /var/log/nova/nova-scheduler.log
grep <instance-uuid> /var/log/nova/nova-conductor.log
Grep by instance UUID across every Nova log. The UUID is your thread through the maze — it appears in API, scheduler, conductor, and compute logs, and stitching them in time order tells the whole story.
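Stitching the logs in time order is easy to script. The sketch below assumes every log line starts with a sortable "YYYY-MM-DD HH:MM:SS.mmm" timestamp, which is the usual layout in /var/log/nova/*.log on packaged installs; journald output would need a different key. The function name nova_timeline is my own.

```shell
# nova_timeline UUID FILE...
# Prints every line mentioning UUID, tagged with its source file, oldest first.
# Assumes each line begins with a "YYYY-MM-DD HH:MM:SS.mmm" timestamp.
nova_timeline() {
    uuid="$1"
    shift
    awk -v id="$uuid" 'index($0, id) {
        ts = $1 " " $2                        # date and time fields
        rest = $0
        sub(/^[^ ]+ +[^ ]+ +/, "", rest)      # drop the timestamp from the body
        n = split(FILENAME, parts, "/")       # keep only the file basename
        print ts, "[" parts[n] "]", rest
    }' "$@" | LC_ALL=C sort -s -k1,2
}

# Usage: nova_timeline <instance-uuid> /var/log/nova/nova-*.log
```

Continuation lines of a traceback that do not repeat the UUID are dropped, so once the timeline shows where things break, go back to the original file for the full stack trace.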
Step 3: Get onto the compute node
Once the request reaches a host, the action moves to nova-compute and libvirt. SSH to the host the scheduler picked and read:
grep <instance-uuid> /var/log/nova/nova-compute.log
The errors that show up here are the real compute failures:
- libvirtError: internal error. Almost always something libvirt/QEMU rejected. Read on.
- Image-related errors. Glance download failed, or the backing file is corrupt.
- Failed to allocate network. Neutron didn’t return a port in time. This is a networking problem masquerading as compute.
Step 4: Drop down to libvirt and QEMU
When nova-compute.log points at libvirt, go straight to the source:
virsh list --all
virsh dumpxml <instance-name>
tail -100 /var/log/libvirt/qemu/<instance-name>.log
The QEMU per-instance log is gold. It captures the exact qemu-kvm invocation and the error from the kernel or hardware layer — missing virtualization extensions, a CPU model the host can’t provide, an unavailable hugepages mount, or a PCI passthrough device already in use.
A quick sanity check I run on any new compute node:
grep -Ec '(vmx|svm)' /proc/cpuinfo # 0 means no hardware virt
virt-host-validate
virt-host-validate catches a surprising number of “instance won’t start” problems before they ever reach a ticket.
Using AI to correlate the log trail
Tracing one UUID across four log files at 2am is exactly the kind of tedious correlation an LLM does well — as long as you keep it read-only. I paste the grepped lines from scheduler, conductor, compute, and the QEMU log and ask:
“Here are timestamped log lines for one instance UUID across Nova scheduler, conductor, compute, and the QEMU per-instance log. Build a timeline, tell me the first line where things go wrong, and explain the mechanism. Suggest only read-only verification commands.”
The model is good at noticing that the network-allocation error at T+8s is downstream of a Neutron timeout at T+2s — the kind of ordering insight that saves you from “fixing” the wrong service. I keep a few of these tracing prompts in my prompt library so I’m not writing them during an outage.
The failures I see most often
After enough incidents, the long tail collapses into a handful of usual suspects:
- Allocation-ratio surprises. The host “looks empty” but placement says full. Adjust cpu_allocation_ratio deliberately, not reactively.
- Stale allocations in placement. A failed build leaves a phantom allocation consuming capacity. nova-manage placement audit finds them.
- Time skew. A compute node with a drifting clock causes token and messaging failures that surface as random build aborts. Check chronyc tracking everywhere.
- Neutron timeouts blamed on Nova. “Failed to allocate network” is a Neutron/RabbitMQ problem 90% of the time.
- Image format mismatch. A raw image on a qcow2-expecting backend, or vice versa.
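The time-skew check in that list is the easiest one to automate. The sketch below reads chronyc tracking output on stdin and succeeds only if the "System time" offset is within a limit you choose. The name clock_offset_ok and the 0.5 second threshold in the usage line are my assumptions; pick a limit that fits your own token and messaging tolerances.

```shell
# clock_offset_ok MAX_SECONDS
# Reads `chronyc tracking` output on stdin.
# Exit status 0 if the reported offset is <= MAX_SECONDS, 1 otherwise
# (including when no "System time" line is present).
clock_offset_ok() {
    awk -v max="$1" '
        /^System time/ { offset = $4; found = 1 }   # e.g. "System time : 0.000006523 seconds slow of NTP time"
        END { exit !(found && offset + 0 <= max + 0) }
    '
}

# Usage: chronyc tracking | clock_offset_ok 0.5 || echo "clock skew on $(hostname)"
```

Run across every compute and controller node, it turns "check chronyc tracking everywhere" into a one-line loop over your inventory.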
A repeatable runbook beats heroics
The reason I trace in this fixed order — fault, scheduler/placement, conductor, compute, libvirt/QEMU — is that it converts a scary “instance is broken” into a deterministic process of elimination. You’re never staring at the whole stack at once; you’re answering one yes/no question at a time and moving down a layer.
If you want a head start on the prompts and checklists for this, we keep a growing set of OpenStack troubleshooting prompts aimed at exactly these control-plane-to-hypervisor traces. Steal them, adapt them to your log paths, and keep the human reading every command before it runs.
AI-generated command suggestions are assistive, not authoritative. Verify against your own cloud before acting in production.