Troubleshooting Live Migration in OpenStack

Live migration is one of OpenStack’s best operational features: move a running instance off a host you need to patch, with no reboot and barely a hiccup for the workload. It’s also one of the most finicky, because it requires the source and destination hypervisors, the storage, and the network to all cooperate perfectly. When it stalls or fails, you’re left with an instance in MIGRATING and a maintenance window evaporating.

After years of draining compute nodes for patching, here’s how I troubleshoot live migration.

First, know which kind of migration you’re doing

OpenStack has three related operations, and they fail differently:

Live migration (shared storage) — instance disk lives on Ceph/NFS, only memory and CPU state move. The fast, common case.
Live block migration — disk and memory copy over the network. Needed when storage isn’t shared. Much heavier, more failure modes.
Cold migration / resize — instance is stopped, moved, restarted. Not “live” at all.

Confirm your storage model first, because “live migration is slow” on block migration is just physics — you’re copying the whole disk.
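To put a number on the “just physics” point, here is a back-of-the-envelope sketch in Python. The function name and the efficiency factor are my own assumptions, and it ignores blocks the guest rewrites during the copy, so treat the result as a lower bound rather than a prediction.

```python
def estimate_copy_seconds(disk_gib: float, link_gbit_s: float,
                          efficiency: float = 0.8) -> float:
    """Rough lower bound on the time to push a disk over the migration link.

    disk_gib     -- amount of disk data to copy, in GiB
    link_gbit_s  -- nominal link speed in Gbit/s
    efficiency   -- assumed fraction of the link actually usable (0 < e <= 1)
    """
    if link_gbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    disk_bits = disk_gib * 1024**3 * 8               # GiB -> bits
    usable_bits_per_s = link_gbit_s * 1e9 * efficiency
    return disk_bits / usable_bits_per_s


if __name__ == "__main__":
    for gbit in (1, 10, 25):
        minutes = estimate_copy_seconds(100, gbit) / 60
        print(f"100 GiB over {gbit} Gbit/s: about {minutes:.1f} min")
```

With these assumptions, a 100 GiB disk over a 1 Gbit/s link at 80% efficiency comes out at roughly 18 minutes before any memory has moved.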

Step 1: Pre-flight the obvious incompatibilities

Many live-migration failures can be caught before they start by checking compatibility, beginning with the destination’s CPU:

openstack hypervisor show <dest-host> -f value -c cpu_info

The destination must be able to provide the CPU the guest currently sees. If your hosts have mixed CPU generations and you used cpu_mode = host-passthrough, an instance booted on a newer host cannot live-migrate to an older one — the guest would lose CPU instructions mid-flight. This is the single most common “why won’t it migrate” cause in heterogeneous clusters.
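A quick way to eyeball the gap between two hosts is to diff their feature lists. The sketch below assumes the cpu_info value is JSON with a "features" list, which is how I have seen it reported; some client versions print it differently, so adjust the parsing if yours does. The function name is mine, and libvirt’s own compatibility check stays the authority.

```python
import json


def missing_cpu_features(source_cpu_info: str, dest_cpu_info: str) -> list[str]:
    """Return CPU features the source host has that the destination lacks.

    Both arguments are assumed to be JSON documents with a "features" list.
    A non-empty result means a guest that sees the full source CPU
    (for example under host-passthrough) may not be able to land there.
    """
    source = set(json.loads(source_cpu_info).get("features", []))
    dest = set(json.loads(dest_cpu_info).get("features", []))
    return sorted(source - dest)


if __name__ == "__main__":
    # Illustrative values only, not output captured from real hosts.
    newer = '{"model": "Skylake-Server", "features": ["avx2", "avx512f", "sse4.2"]}'
    older = '{"model": "Haswell-noTSX", "features": ["avx2", "sse4.2"]}'
    print(missing_cpu_features(newer, older))
```

The check is deliberately one-directional: moving from the older host to the newer one reports nothing missing, which matches the asymmetry described above.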

The fix is a fleet-wide policy decision: use a custom CPU model that names the lowest common denominator, set in nova.conf:

[libvirt]
cpu_mode = custom
cpu_models = Haswell-noTSX
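“Lowest common denominator” is, at bottom, a set intersection across the fleet. This small Python sketch (the function names and the host labels are hypothetical) shows the idea; it tells you which features any candidate model must stay within, but mapping that to a named model is a job for libvirt tooling such as virsh cpu-baseline, not for this snippet.

```python
def common_features(fleet: dict[str, set[str]]) -> set[str]:
    """Features present on every host in the fleet (empty set for no hosts)."""
    sets = [set(features) for features in fleet.values()]
    return set.intersection(*sets) if sets else set()


def hosts_missing(fleet: dict[str, set[str]], feature: str) -> list[str]:
    """Hosts that would block a CPU model requiring `feature`."""
    return sorted(host for host, features in fleet.items()
                  if feature not in features)


if __name__ == "__main__":
    # Illustrative feature sets; real ones come from each hypervisor's cpu_info.
    fleet = {
        "compute-1": {"avx2", "avx512f", "sse4.2"},
        "compute-2": {"avx2", "sse4.2"},
        "compute-3": {"avx2", "rtm", "sse4.2"},
    }
    print(sorted(common_features(fleet)))
    print(hosts_missing(fleet, "avx512f"))
```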

Step 2: Verify connectivity between hypervisors

Live migration copies memory directly between compute nodes over their migration network. If that path is blocked or slow, migration stalls:

# from source compute, to dest compute:
ping <dest-migration-ip>
nc -zv <dest-migration-ip> 49152-49215   # libvirt/QEMU migration ports

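A firewall blocking the QEMU migration port range is a classic. Note that QEMU listens on one of those ports only while an incoming migration is in progress, so “connection refused” on an idle destination is normal; a timeout is what points at a firewall. The libvirt control connection has to get through as well (by default TCP 16509 for qemu+tcp, 16514 for qemu+tls, or 22 for qemu+ssh). Another classic is migration traffic accidentally riding the slow management network instead of a dedicated fast one; set live_migration_inbound_addr in the [libvirt] section on each compute node to pin it to the right NIC.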
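When nc is not installed, or I want to tell “nothing listening” apart from “packets dropped”, a few lines of Python do the same probe. This is a sketch with names of my own choosing; the three-way classification is a heuristic, since a firewall configured to reject rather than drop will also show up as refused.

```python
import socket


def probe_port(host: str, port: int, timeout: float = 1.0) -> str:
    """Classify one TCP port as 'open', 'refused', 'filtered' or 'error'.

    'refused'  -- host reachable, nothing listening (normal when idle)
    'filtered' -- no answer before the timeout (often a dropping firewall)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "filtered"
    except OSError:
        return "error"  # e.g. no route to host, name resolution failure


def scan(host: str, ports, timeout: float = 1.0) -> dict[int, str]:
    """Probe several ports and return {port: classification}."""
    return {port: probe_port(host, port, timeout) for port in ports}


if __name__ == "__main__":
    import sys
    # Usage: python probe.py <dest-migration-ip>
    results = scan(sys.argv[1], range(49152, 49216), timeout=0.5)
    for state in ("open", "refused", "filtered", "error"):
        count = sum(1 for s in results.values() if s == state)
        print(f"{state}: {count} ports")
```

Run from the source compute node, a range that comes back mostly “filtered” is the result worth chasing.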

Step 3: Watch a stalled migration

When a migration won’t converge, the issue is usually that the guest dirties memory faster than the link can copy it. Watch progress live:

openstack server migration list --server <instance-uuid>
# on the source host:
virsh domjobinfo <instance-name>

domjobinfo reports the data remaining at the moment you run it, so run it several times (or under watch) to see whether that number is shrinking. If “remaining” plateaus, the guest is too write-heavy to converge at the current bandwidth. Two levers:

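Enable auto-converge or post-copy in the [libvirt] section of nova.conf (live_migration_permit_auto_converge, live_migration_permit_post_copy). Auto-converge throttles the guest’s vCPUs so it dirties memory more slowly. Post-copy forces completion by switching the guest to run on the destination while pulling remaining pages on demand; the trade-off is that a network failure during the post-copy phase can lose the guest, because its memory is then split across two hosts.
Raise the migration bandwidth cap (live_migration_bandwidth, in MiB/s) if you’ve throttled it.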
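Spotting a plateau by eye gets tedious, so here is a hedged Python sketch that does it from several captured domjobinfo outputs. It assumes a “Data remaining:” line with a number and a binary unit suffix, which is what I see from virsh; the function names and the 5% tolerance are my own choices. Because the remaining figure rises and falls with each pre-copy pass, it compares the best value in the older half of the samples with the best value in the newer half instead of just first against last.

```python
import re
from typing import Optional

_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
_REMAINING = re.compile(r"^\s*Data remaining:\s+([\d.]+)\s+(\w+)", re.MULTILINE)


def data_remaining_bytes(domjobinfo_output: str) -> Optional[float]:
    """Extract 'Data remaining' in bytes, or None if the line is absent."""
    match = _REMAINING.search(domjobinfo_output)
    if not match or match.group(2) not in _UNITS:
        return None
    return float(match.group(1)) * _UNITS[match.group(2)]


def is_plateaued(samples: list[float], tolerance: float = 0.05) -> bool:
    """True if the newer half of the samples is not meaningfully lower.

    samples   -- remaining bytes in chronological order (at least 4 needed)
    tolerance -- minimum fractional improvement that counts as progress
    """
    if len(samples) < 4:
        return False  # not enough data to judge
    half = len(samples) // 2
    older_best = min(samples[:half])
    newer_best = min(samples[half:])
    return newer_best >= older_best * (1 - tolerance)
```

Feed it one sample every few seconds; if is_plateaued returns True over a minute or so of samples, that is the cue to reach for one of the two levers.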

Step 4: Read the logs on both sides

Migration spans two hosts, so read both:

# source:
grep <instance-uuid> /var/log/nova/nova-compute.log
# destination:
grep <instance-uuid> /var/log/nova/nova-compute.log
tail -f /var/log/libvirt/qemu/<instance-name>.log

The destination QEMU log catches “incoming migration failed” errors — a missing image backing file, an unavailable PCI passthrough device, or a hugepages mismatch. If the instance uses SR-IOV or PCI passthrough, live migration is often simply unsupported for that device; the log will say so.
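Reading two nova-compute logs side by side is easier as a single timeline. The sketch below merges lines from both hosts that mention the instance, ordered by timestamp. It assumes the default nova log prefix of the form YYYY-MM-DD HH:MM:SS.mmm and clocks that are in sync between hosts; it does not handle the QEMU log, which uses a different timestamp format. The function names are mine.

```python
from datetime import datetime
from typing import Optional

_TS_LEN = len("2024-01-01 12:00:00.000")


def parse_ts(line: str) -> Optional[datetime]:
    """Parse the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:_TS_LEN], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None


def merged_timeline(logs: dict[str, list[str]],
                    needle: str) -> list[tuple[datetime, str, str]]:
    """Merge log lines containing `needle` from several hosts.

    logs   -- {host label: list of log lines}
    needle -- text to filter on, typically the instance UUID
    Returns (timestamp, host, line) tuples in chronological order;
    lines without a parsable timestamp are skipped.
    """
    events = []
    for host, lines in logs.items():
        for line in lines:
            if needle not in line:
                continue
            ts = parse_ts(line)
            if ts is not None:
                events.append((ts, host, line.rstrip("\n")))
    events.sort(key=lambda event: event[0])
    return events
```

Load each host’s nova-compute.log with readlines(), pass them in as {"source": [...], "dest": [...]}, and print the result to see which side reported trouble first.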

Recovering a wedged instance

An instance stuck in MIGRATING after the operation truly failed needs careful recovery. Confirm where it’s actually running:

virsh list   # on both hosts — find which one has the live domain

Only after you know the real location do you reset state:

nova reset-state --active <instance-uuid>

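reset-state only rewrites the status recorded in Nova’s database; it does not move or clean up anything on the hypervisors. Resetting state while you’re unsure which host owns the live domain risks two definitions of the same instance. Verify with virsh first, always.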
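The “verify first” rule can be written down as a guard. This Python sketch parses the table that virsh list --all prints on each host and only calls a reset safe when exactly one host reports the domain as running. The column layout (Id, Name, State) is assumed from what virsh prints for me, and the function names are hypothetical. Passing this check is necessary, not sufficient: you still want Nova’s recorded host to match the one that actually has the domain.

```python
def domain_states(virsh_list_output: str) -> dict[str, str]:
    """Map domain name -> state from `virsh list --all` table output."""
    states = {}
    for line in virsh_list_output.splitlines():
        parts = line.split()
        # Skip the header, the dashed separator and blank lines.
        if len(parts) < 3 or parts[0] == "Id":
            continue
        states[parts[1]] = " ".join(parts[2:])  # state may be two words
    return states


def hosts_running(domain: str, outputs: dict[str, str]) -> list[str]:
    """Hosts whose virsh output shows `domain` in the 'running' state."""
    return sorted(host for host, out in outputs.items()
                  if domain_states(out).get(domain) == "running")


def safe_to_reset(domain: str, outputs: dict[str, str]) -> bool:
    """True only when exactly one host has the domain running."""
    return len(hosts_running(domain, outputs)) == 1
```

If hosts_running comes back empty or with two hosts, stop and investigate instead of resetting anything.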

Using AI to triage migration failures

Live-migration debugging spans CPU flags, two hosts’ logs, and libvirt internals — a lot to hold at once. I paste the CPU info from both hosts, the domjobinfo output, and the QEMU error, and ask:

“Here is the source and destination CPU info, the migration job progress, and the destination QEMU error. Tell me whether this failure is CPU-incompatibility, non-convergence, or device-passthrough, and the read-only command to confirm. Do not suggest reset-state until I’ve verified which host runs the instance.”

It’s good at classifying the failure mode quickly so you apply the right fix instead of cycling through all three. I keep these migration-triage prompts with my other OpenStack prompts.

Make migration boring before you need it

The time to discover your hosts can’t migrate to each other is not during a security-patch window. Test live migration between every pair of host generations when you build the cloud, standardize the CPU model, dedicate a fast migration network, and enable post-copy so write-heavy guests can still complete. Do that and draining a node for maintenance becomes a non-event.
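Part of that checklist can be audited mechanically. Here is a minimal Python sketch that reads a compute node’s nova.conf and flags the settings discussed in this post. The function name and the wording of the findings are mine, and configparser is only an approximation of how Nova itself parses the file, so treat the output as a prompt for a closer look.

```python
import configparser

_TRUE = {"true", "1", "yes", "on"}


def migration_readiness(nova_conf_text: str) -> list[str]:
    """Return human-readable findings; an empty list means nothing flagged."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(nova_conf_text)

    def get(option: str) -> str:
        # Missing section or option both fall back to an empty string.
        return parser.get("libvirt", option, fallback="").strip()

    findings = []
    if get("cpu_mode") != "custom" or not get("cpu_models"):
        findings.append("cpu_mode is not 'custom' with an explicit cpu_models")
    if not get("live_migration_inbound_addr"):
        findings.append("live_migration_inbound_addr is unset")
    if get("live_migration_permit_post_copy").lower() not in _TRUE:
        findings.append("live_migration_permit_post_copy is not enabled")
    return findings


if __name__ == "__main__":
    with open("/etc/nova/nova.conf") as handle:
        for finding in migration_readiness(handle.read()):
            print("WARN:", finding)
```

Running it across the fleet also surfaces drift, such as one rack still on host-passthrough, but it is no substitute for actually test-migrating between host generations.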

AI failure classification is assistive, not authoritative. Verify the instance’s true host before any state reset.