OpenStack Error Guide: Nova 'Live Migration failure' / Migration Failed
Fix Nova live migration failures in OpenStack: resolve CPU model mismatches, missing shared storage, firewall-blocked libvirt ports, timeouts, and NUMA pinning.
- #openstack
- #troubleshooting
- #errors
- #nova
Overview
A live migration moves a running instance from one compute host to another with little or no perceptible downtime. When it fails, Nova rolls the instance back to ACTIVE on the source host and records the migration with status error. The guest keeps running where it started, but your maintenance, rebalancing, or host-evacuation plan is blocked.
You will see the rollback in the instance’s migration list and the nova-compute log:
ERROR nova.compute.manager [instance: 8d2c...] Live Migration failure: internal error: qemu unexpectedly closed the monitor
+----+----------------+--------------+--------+----------------+
| ID | Source Compute | Dest Compute | Status | Type           |
+----+----------------+--------------+--------+----------------+
| 41 | compute-01     | compute-02   | error  | live-migration |
+----+----------------+--------------+--------+----------------+
Live migration is sensitive because the destination host must be able to resume the exact CPU and memory state of a running guest. Any mismatch in CPU features, storage visibility, libvirt/qemu version, or network reachability between the two hosts can abort it. The good news: because Nova reverts cleanly, a failed live migration rarely harms the workload — you can diagnose and retry safely.
Symptoms
- Instance returns to ACTIVE on the source after appearing to migrate.
- openstack server migration list shows Status = error.
- nova-compute on source or destination logs Live Migration failure.
- The migration can hang for minutes before failing with a timeout.
openstack server migration list --server web-01 -c "Source Compute" -c "Dest Compute" -c Status
+----------------+--------------+--------+
| Source Compute | Dest Compute | Status |
+----------------+--------------+--------+
| compute-01     | compute-02   | error  |
+----------------+--------------+--------+
openstack server show web-01 -c "OS-EXT-SRV-ATTR:host" -c status -f value
compute-01
ACTIVE
Common Root Causes
1. CPU model / feature flag incompatibility
The destination CPU must expose every feature the guest was started with. If the source ran a newer CPU (or cpu_mode = host-passthrough) and the destination lacks a flag, qemu refuses to resume.
docker logs nova_libvirt 2>&1 | grep -i 'unsupported configuration' | tail -5 # Kolla-Ansible
sudo journalctl -u libvirtd | grep -i 'unsupported configuration' | tail -5 # Traditional
error: unsupported configuration: guest and host CPU are not compatible: Host CPU does not provide required features: avx512f
Standardize on a named model (e.g. cpu_mode = custom, cpu_models = Haswell-noTSX) so all hosts present an identical baseline.
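Before settling on a baseline it can help to see exactly which flags the destination lacks. The following is a minimal sketch, not part of Nova or libvirt: missing_flags is a hypothetical helper name, and it assumes you pass it two whitespace-separated flag lists, such as the flags line of /proc/cpuinfo from each host. Kernel flag names do not always match libvirt feature names, so treat the output as a hint rather than a verdict.

```sh
# Print every flag found in the first list but absent from the second.
# $1 = source host flags, $2 = destination host flags (whitespace-separated).
# The body runs in a subshell so its variables do not leak into the caller.
missing_flags() (
  dst=" $(printf '%s' "$2" | tr -s '[:space:]' ' ') "
  for flag in $1; do
    case "$dst" in
      *" $flag "*) ;;                  # destination has it, nothing to report
      *) printf '%s\n' "$flag" ;;      # destination lacks it
    esac
  done
)

# Hypothetical usage (host names are placeholders from this guide):
#   src=$(ssh compute-01 "grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2")
#   dst=$(ssh compute-02 "grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2")
#   missing_flags "$src" "$dst"
```

An empty result suggests the destination exposes at least the source's flag set; any printed flag is a candidate for the "does not provide required features" error.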
2. Block migration vs. shared storage mismatch
If the instance disk lives on shared storage (Ceph, NFS) Nova does a normal live migration. With no shared storage you must request block migration, which copies the disk over the wire. Choosing wrong fails immediately.
openstack server migrate --live-migration --block-migration web-01 # no shared storage
openstack server migrate --live-migration --shared-migration web-01 # shared backend
ERROR nova.virt.libvirt.driver [instance: 8d2c...] Migration operation has aborted: Unsafe migration: Migration without shared storage is unsafe
3. libvirt / qemu version mismatch between hosts
A guest started under a newer qemu cannot always be resumed by an older one on the destination. Mixed-version fleets (mid-upgrade) are the classic trigger.
# Compare on both hosts
docker exec nova_libvirt libvirtd --version
docker exec nova_libvirt qemu-system-x86_64 --version | head -1
# Traditional
libvirtd --version && /usr/bin/qemu-system-x86_64 --version | head -1
# compute-01
libvirtd (libvirt) 9.0.0
# compute-02
libvirtd (libvirt) 8.0.0 <-- older, cannot resume newer guest state
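The comparison above can be automated so that a mixed-version pair is flagged before a migration is attempted. This is a small sketch under assumptions: version_older is a hypothetical helper, and it relies on sort -V from GNU coreutils being available on the host where it runs.

```sh
# Succeed (exit 0) when version $1 is strictly older than version $2.
# Uses GNU `sort -V`, which orders dotted version strings numerically.
version_older() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical usage: warn when the destination libvirt is older than the source.
#   src=$(ssh compute-01 "libvirtd --version" | awk '{print $NF}')
#   dst=$(ssh compute-02 "libvirtd --version" | awk '{print $NF}')
#   version_older "$dst" "$src" && echo "destination libvirt is older: $dst < $src"
```

Plain string comparison would rank 8.10.0 below 8.9.0, which is why the sketch leans on version sort instead.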
4. Missing libvirt live-migration ports / firewall blocks
Live migration control traffic uses the libvirt daemon port (16509 for plain TCP, 16514 for TLS), and guest memory and disk data flow over a range of QEMU migration ports. If a firewall (host iptables, a security group on the management network, or a hardware firewall) blocks them, the migration stalls and then fails.
grep -E '^(live_migration_uri|live_migration_inbound_addr)' \
/etc/nova/nova.conf
sudo grep -E 'migration_port_(min|max)' /etc/libvirt/qemu.conf # port range is a libvirt (qemu.conf) setting, not a nova.conf one
sudo nc -zv compute-02 16509 # libvirt TCP (use 16514 for TLS)
migration_port_min = 49152
migration_port_max = 49215
nc: connect to compute-02 port 16509 (tcp) failed: Connection refused
Open 16509 (libvirtd over TCP, or 16514 if libvirt uses TLS) plus the 49152:49215 QEMU range between compute hosts on the migration network.
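How the probe fails tells you where to look. A refused connection means the packet reached the host and was actively rejected, so either nothing is listening on that port or a REJECT rule answered; a timeout usually means packets are being silently dropped. The sketch below maps probe output to those cases. classify_probe is a hypothetical helper, and the matched phrases are assumptions about common nc and ncat wording, which differs between implementations.

```sh
# Map the message printed by a port probe (e.g. `nc -zv`) to a likely cause.
# Output keywords:
#   refused      host reachable; service not listening, or a REJECT rule
#   filtered     no answer; packets probably dropped by a firewall
#   unreachable  routing problem or an ICMP reject rule
#   open         connection succeeded
classify_probe() (
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *refused*)               echo refused ;;
    *"timed out"*|*timeout*) echo filtered ;;
    *"no route to host"*)    echo unreachable ;;
    *succeeded*|*connected*) echo open ;;
    *)                       echo unknown ;;
  esac
)

# Hypothetical usage:
#   classify_probe "$(nc -zv -w 3 compute-02 16509 2>&1)"
```

Note that QEMU only listens on a migration port while an incoming migration is in progress, so a refused probe against the 49152:49215 range outside a migration does not by itself indicate a firewall problem.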
5. Insufficient bandwidth / completion timeout
A busy guest dirties memory faster than it can be copied. If it never converges within live_migration_completion_timeout, Nova aborts.
grep -E 'live_migration_completion_timeout|live_migration_bandwidth|live_migration_permit_post_copy|live_migration_permit_auto_converge' \
/etc/nova/nova.conf
ERROR nova.compute.manager [instance: 8d2c...] Migration operation was cancelled by client: operation aborted: migration job: unexpectedly failed (took too long)
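Enable live_migration_permit_auto_converge = true (throttles the guest's vCPUs so memory is dirtied more slowly) or live_migration_permit_post_copy = true (lets the guest start running on the destination while the remaining pages are fetched from the source) so that write-heavy guests can converge.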
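A rough estimate shows whether a guest can converge at all. Under a simplified model with a constant dirty rate D and constant bandwidth B, each pre-copy pass resends what was dirtied during the previous pass, so the total time for RAM size M is M / (B - D), and the copy never finishes when D is at least B. The sketch below applies that formula; precopy_seconds is a hypothetical helper, the inputs are your own measurements, and real migrations deviate from the model because dirty rates fluctuate.

```sh
# Estimate pre-copy duration in whole seconds (rounded up).
# $1 = guest RAM in MiB, $2 = usable bandwidth in MiB/s, $3 = dirty rate in MiB/s.
# Prints "never" when memory is dirtied at least as fast as it can be copied.
precopy_seconds() (
  ram=$1 bw=$2 dirty=$3
  if [ "$dirty" -ge "$bw" ]; then
    echo never
  else
    net=$((bw - dirty))                  # effective progress per second
    echo $(( (ram + net - 1) / net ))    # ceiling division
  fi
)

# Example: 8 GiB guest, 1000 MiB/s link, 200 MiB/s dirty rate
#   precopy_seconds 8192 1000 200
```

If the estimate exceeds your live_migration_completion_timeout, or comes back as never, raising bandwidth or enabling auto-converge or post-copy is more likely to help than simply retrying.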
6. SELinux / AppArmor or NUMA/hugepages/CPU-pinning blocks
Two related families. A confinement policy (AppArmor libvirt-qemu, SELinux svirt) can deny the destination access to the migrated disk or socket. Separately, instances with NUMA topology, hugepages, or dedicated CPU pinning require the destination to have the exact free topology, and Nova will refuse if it cannot reproduce it.
sudo dmesg | grep -iE 'apparmor|avc:.*denied' | tail -5
openstack server show web-01 -c flavor -f value | grep -iE 'hw:numa|hw:cpu_policy|hw:mem_page_size' # extra specs live on the flavor
type=AVC msg=audit(...): apparmor="DENIED" operation="open" profile="libvirt-..." name="/var/lib/nova/instances/8d2c.../disk"
For pinned/NUMA guests, confirm the destination has matching free pCPUs and hugepages before retrying.
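The hugepage half of that check can be scripted. The sketch below is a minimal example, assuming the standard HugePages_Free field of /proc/meminfo; hugepages_available is a hypothetical helper name. It looks only at the host-wide total, so it does not prove that a single NUMA node has enough pages, and it says nothing about free pCPUs.

```sh
# Read /proc/meminfo-style text on stdin and succeed when at least $1
# huge pages are free host-wide.
hugepages_available() (
  needed=$1
  free=$(awk '/^HugePages_Free:/ {print $2}')
  [ "${free:-0}" -ge "$needed" ]
)

# Hypothetical usage on the destination host, for a guest needing 2048 pages:
#   ssh compute-02 cat /proc/meminfo | hugepages_available 2048 \
#     && echo "enough free huge pages" || echo "not enough free huge pages"
```

For a per-node view, the counters under /sys/devices/system/node/node*/hugepages/ expose the same free_hugepages figure for each NUMA node.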
Diagnostic Workflow
Step 1: Confirm the failure and read the migration record
openstack server migration list --server <SERVER>
openstack server show <SERVER> -c "OS-EXT-SRV-ATTR:host" -c status -f value
Status = error plus the instance back on its source host confirms a rollback. Note source and destination — every later check compares the two.
Step 2: Pull the precise libvirt/qemu error
# Destination host first (it rejects the resume)
docker logs nova_compute 2>&1 | grep -iE 'Live Migration|migration' | tail -20
docker logs nova_libvirt 2>&1 | grep -iE 'error|unsupported|denied' | tail -20
# Traditional
sudo journalctl -u nova-compute | grep -iE 'Live Migration|migration' | tail -20
sudo journalctl -u libvirtd | tail -40
The libvirt line names the real cause: CPU feature, version, storage, or socket denial.
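When the logs are noisy, a small classifier can point at the right section of this guide. The sketch below matches only the sample messages quoted in this guide, so the patterns are assumptions about wording and will miss variants; classify_migration_error is a hypothetical helper name.

```sh
# Map one log line to the root-cause family it most likely belongs to.
classify_migration_error() (
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    *"cpu are not compatible"*|*"required features"*) echo cpu-features ;;
    *"unsafe migration"*|*"without shared storage"*)  echo storage-mode ;;
    *apparmor*denied*|*avc:*denied*)                  echo confinement ;;
    *"took too long"*)                                echo convergence-timeout ;;
    *"connection refused"*|*"no route to host"*)      echo network ;;
    *)                                                echo unknown ;;
  esac
)

# Hypothetical usage: tally causes across recent libvirt log lines.
#   docker logs nova_libvirt 2>&1 | tail -200 |
#     while IFS= read -r l; do classify_migration_error "$l"; done |
#     sort | uniq -c
```

Lines reported as unknown still deserve a manual read; the classifier narrows the search rather than replacing it.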
Step 3: Compare CPU and software versions across hosts
for h in compute-01 compute-02; do
  echo "== $h =="
  ssh "$h" "docker exec nova_libvirt virsh capabilities | grep -A2 '<model'"
  ssh "$h" "docker exec nova_libvirt libvirtd --version"
done
grep -E 'cpu_mode|cpu_models' /etc/nova/nova-compute.conf /etc/nova/nova.conf 2>/dev/null
Mismatched CPU models or libvirt versions surface here.
Step 4: Test storage visibility and migration network reachability
# Is the disk on shared storage both hosts see?
openstack server show <SERVER> -c "os-extended-volumes:volumes_attached" -f value
# Firewall / libvirt port from source to destination
sudo nc -zv <DEST_HOST> 16509
grep -E 'live_migration_inbound_addr' /etc/nova/nova.conf
sudo grep -E 'migration_port_(min|max)' /etc/libvirt/qemu.conf # port range is set in libvirt's qemu.conf
Refused ports or non-shared disks dictate whether you need --block-migration or a firewall fix.
Step 5: Adjust convergence settings and retry
# In nova.conf on compute hosts:
# live_migration_permit_auto_converge = true
# live_migration_permit_post_copy = true
docker restart nova_compute # Kolla-Ansible
sudo systemctl restart nova-compute # Traditional
# Retry, forcing the right mode
openstack server migrate --live-migration --block-migration --host compute-02 <SERVER>
openstack server migration list --server <SERVER> -c Status
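Rather than rerunning the status command by hand after a retry, a polling loop can wait for a terminal state. This is a sketch under assumptions: wait_for_migration is a hypothetical helper, it expects you to supply a command that prints the current migration status as a single word, and it assumes terminal statuses such as completed, error, failed, and cancelled.

```sh
# Poll a status command until it reports a terminal state or attempts run out.
# $1 = max attempts, $2 = seconds between attempts, remaining args = command
# that prints the current migration status on stdout.
# Exit codes: 0 completed, 1 error/failed/cancelled, 2 gave up waiting.
# The body runs in a subshell, so `exit` only leaves this function.
wait_for_migration() (
  attempts=$1 interval=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$("$@")
    case "$status" in
      completed)              echo completed; exit 0 ;;
      error|failed|cancelled) echo "$status"; exit 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo timeout
  exit 2
)

# Hypothetical usage: migration_status is a helper you would write to print
# the status of the newest migration for one server.
#   wait_for_migration 60 10 migration_status web-01
```

Passing the status command as arguments keeps the loop independent of how you query Nova, which also makes it easy to exercise with a stub.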
Example Root Cause Analysis
web-01 will not live-migrate from compute-01 to the freshly added compute-02; it reverts to ACTIVE on compute-01 every time. Because the rollback is clean and instant, a host-capability mismatch — not data corruption — is the likely culprit.
The destination libvirt log is explicit:
error: unsupported configuration: guest and host CPU are not compatible: Host CPU does not provide required features: avx512f, avx512dq
Comparing CPU models confirms it:
ssh compute-01 "docker exec nova_libvirt virsh capabilities | grep '<model'"
ssh compute-02 "docker exec nova_libvirt virsh capabilities | grep '<model'"
# compute-01: <model>Skylake-Server-IBRS</model>
# compute-02: <model>Haswell-noTSX</model>
compute-01 runs cpu_mode = host-passthrough, so web-01 was started with AVX-512 — which the older Haswell compute-02 lacks. The guest state simply cannot resume there.
Fix: pin the cluster to a common baseline so every guest is portable:
# nova.conf on all compute hosts
# [libvirt]
# cpu_mode = custom
# cpu_models = Haswell-noTSX
docker restart nova_compute
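New instances then advertise only the shared baseline. The already-running web-01 keeps its AVX-512 CPU definition until it is restarted, so move it with a cold migration (openstack server migrate web-01): the guest is shut down, relocated, and booted fresh on the destination, so no live CPU state has to be resumed. Expect downtime for the duration of the move.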
Prevention Best Practices
- Standardize cpu_mode = custom with an explicit cpu_models baseline that the oldest host supports, so every guest is live-migratable fleet-wide.
- Keep libvirt and qemu versions uniform; during rolling upgrades, drain hosts with cold migration rather than live migration.
- Pre-open the libvirt port (16509) and the QEMU migration range (49152:49215) on the migration network and smoke-test with nc before relying on live migration.
- Decide shared vs. block migration once per cluster and bake the right flag into your automation; mixing them is a common silent failure.
- Enable live_migration_permit_auto_converge (and post-copy if acceptable) for write-heavy guests so they converge instead of timing out.
- Tag NUMA/hugepage/pinned instances and verify destination topology headroom before migrating them.
Quick Command Reference
# Confirm failure and current host
openstack server migration list --server <SERVER>
openstack server show <SERVER> -c "OS-EXT-SRV-ATTR:host" -c status -f value
# The real reason (destination host)
docker logs nova_compute 2>&1 | grep -iE 'Live Migration|migration' | tail -20
docker logs nova_libvirt 2>&1 | grep -iE 'error|unsupported|denied' | tail -20
sudo journalctl -u nova-compute | grep -iE 'Live Migration' | tail -20
# Compare CPU model and libvirt/qemu versions
ssh <HOST> "docker exec nova_libvirt virsh capabilities | grep '<model'"
docker exec nova_libvirt libvirtd --version
# Storage visibility + migration ports/firewall
openstack server show <SERVER> -c "os-extended-volumes:volumes_attached" -f value
sudo nc -zv <DEST_HOST> 16509
grep -E 'live_migration_inbound_addr' /etc/nova/nova.conf
sudo grep -E 'migration_port_(min|max)' /etc/libvirt/qemu.conf
# Convergence tuning (nova.conf), then restart and retry
grep -E 'live_migration_completion_timeout|permit_auto_converge|permit_post_copy' /etc/nova/nova.conf
docker restart nova_compute
openstack server migrate --live-migration --block-migration --host <DEST> <SERVER>
Conclusion
A Nova live migration failure reverts the guest cleanly to its source, so the workload is safe — but the move is blocked until the host mismatch is resolved. The usual root causes:
- CPU model/feature incompatibility between source and destination.
- Block-vs-shared storage mismatch (wrong migration mode for the backend).
- A libvirt/qemu version gap that prevents resuming guest state.
- Firewall-blocked libvirt (16509) or QEMU migration ports.
- A write-heavy guest that never converges before the completion timeout.
- SELinux/AppArmor denials or NUMA/hugepage/CPU-pinning topology the destination cannot reproduce.
Read the destination libvirt error first — it names the cause — then align CPU baselines, versions, storage, and firewall before retrying. When live migration cannot work, a cold migration almost always will.