Instance High Availability with OpenStack Masakari

A compute node failing in OpenStack is not, by default, a self-healing event. Nova will happily report the compute service as down and leave every VM on that host unavailable until a human runs nova evacuate for each one. On a small cloud at 3am that’s a manual slog. On a large cloud it’s an outage. Masakari is the service that turns “host died, page the on-call” into “host died, instances came back on other hosts, here’s the notification.”
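
For contrast, the manual path described above looks roughly like the sketch below. The helper name evacuate_host and the host name node-01 are hypothetical, and the sketch assumes admin credentials are loaded and that both the openstack and legacy nova clients are installed.

```shell
#!/usr/bin/env bash
# Sketch of the manual recovery path. Run it only after the failed host
# is confirmed powered off, otherwise two copies of a VM may run at once.

# evacuate_host <compute-host>: evacuate every instance Nova places on that host.
evacuate_host() {
    local failed_host="$1" server_id rc=0
    # --all-projects and --host need admin credentials.
    for server_id in $(openstack server list --all-projects --host "$failed_host" -f value -c ID); do
        echo "evacuating $server_id"
        # No target host is given, so the Nova scheduler chooses one.
        nova evacuate "$server_id" || rc=1
    done
    # Non-zero if any single evacuation request was rejected.
    return "$rc"
}

# Example (hypothetical host name):
# evacuate_host node-01
```

This loop is the work Masakari takes over, along with the failure detection that has to happen before it.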

I’ve run Masakari guarding production compute for years. It’s one of the higher-leverage OpenStack services to deploy, but it has sharp edges around fencing that will bite you if you ignore them. Here’s the real-world version.

What Masakari recovers from

Masakari watches three failure domains, each with its own monitor:

Host failure (masakari-hostmonitor) — a compute node goes down. Masakari evacuates its instances to healthy hosts.
Process failure (masakari-processmonitor) — a critical process like nova-compute dies on an otherwise-up host. Masakari restarts it.
Instance failure (masakari-instancemonitor) — a VM itself crashes (kernel panic, qemu died). Masakari resets/reboots it.

The host-failure case is the one people deploy Masakari for, and it’s the one with real risk, because evacuation means starting a VM elsewhere while its disk may still be attached to a host you think is dead.

The fencing problem you cannot skip

Here’s the scenario that ruins your week: the management network to a compute node drops, but the node is still alive and still writing to shared storage. Masakari sees it as “down,” evacuates the VMs to new hosts, and now two copies of the same instance are writing to the same Cinder volume. That’s instant, unrecoverable data corruption.

The fix is fencing: the host must be forcibly powered off or isolated before its instances are evacuated. In practice that’s IPMI/BMC power control, usually Pacemaker STONITH combined with the host monitor’s own IPMI power check (disable_ipmi_check = False). If you cannot guarantee the dead host is truly dead, do not enable automatic host recovery. I run host recovery only on clouds where I have reliable BMC-based fencing wired up. Without it, I leave automatic host recovery off and evacuate by hand once I’ve confirmed the host is really down. (reserved_host is a recovery method, not a manual-confirmation mode.)
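
A fencing drill can be sketched with plain ipmitool, independent of Masakari. This is not how the host monitor fences internally; it is a minimal check that your BMC path works. The function names, the BMC address and the user are hypothetical, and the sketch assumes the BMC password is exported as IPMI_PASSWORD.

```shell
#!/usr/bin/env bash
# Fencing drill sketch. Assumes ipmitool is installed and the BMC password
# is exported as IPMI_PASSWORD (read by ipmitool's -E flag).

# bmc_power_is_off <bmc-address> <bmc-user>: succeed only if the BMC reports power off.
bmc_power_is_off() {
    local bmc_addr="$1" bmc_user="$2" status
    status="$(ipmitool -I lanplus -H "$bmc_addr" -U "$bmc_user" -E chassis power status)" || return 2
    [ "$status" = "Chassis Power is off" ]
}

# fence_host <bmc-address> <bmc-user>: power the host off, then poll for confirmation.
fence_host() {
    local bmc_addr="$1" bmc_user="$2" attempt
    ipmitool -I lanplus -H "$bmc_addr" -U "$bmc_user" -E chassis power off || return 2
    # Poll up to six times, five seconds apart.
    for attempt in 1 2 3 4 5 6; do
        if bmc_power_is_off "$bmc_addr" "$bmc_user"; then
            return 0
        fi
        sleep 5
    done
    return 1   # never confirmed off: do not evacuate
}

# Example (hypothetical BMC address and user):
# fence_host 192.0.2.10 admin && echo "host is fenced"
```

Run the drill against a drained node first. A return code of 1 or 2 means the fencing path cannot be trusted yet.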

Setting up segments and hosts

Masakari groups compute hosts into failover segments. A segment is a pool of hosts that can absorb each other’s evacuations:

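# Create a failover segment: <name> <recovery_method> <service_type>
# reserved_host is used here so the reserved spare added to this segment is actually used
openstack segment create compute-prod reserved_host COMPUTE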

# Add hosts to the segment, marking some as reserved spares
# Positional arguments: <name> <type> <control_attributes> <segment>
openstack segment host create node-01 COMPUTE SSH compute-prod
openstack segment host create node-02 COMPUTE SSH compute-prod
openstack segment host create --reserved True \
  spare-01 COMPUTE SSH compute-prod

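The recovery_method on the segment (auto, reserved_host, auto_priority, rh_priority) decides where evacuated instances land. With auto, Nova’s scheduler picks any available host; reserved_host sends instances to the segment’s reserved hosts; auto_priority tries auto first and falls back to reserved_host, and rh_priority does the reverse. I default to reserved_host: keep one or two empty spare nodes that absorb evacuations, so you don’t pile a dead host’s load onto already-busy hosts and cause a cascade.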
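
A quick way to confirm a segment really has a spare is to count its reserved hosts. The sketch below assumes the host listing exposes name and reserved columns and prints booleans as True/False; check that against your client version. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# count_reserved_hosts: read "name reserved" lines on stdin, print how many are reserved.
count_reserved_hosts() {
    awk '$2 == "True" { n++ } END { print n + 0 }'
}

# check_segment_spares <segment>: fail if the segment has no reserved host.
check_segment_spares() {
    local segment="$1" spares
    # Column names are an assumption about the client's list output.
    spares="$(openstack segment host list "$segment" -f value -c name -c reserved | count_reserved_hosts)"
    if [ "$spares" -lt 1 ]; then
        echo "segment $segment has no reserved spare" >&2
        return 1
    fi
    echo "segment $segment has $spares reserved spare(s)"
}

# Example:
# check_segment_spares compute-prod
```

Running this from cron or CI catches the case where someone quietly put workloads on the spare or removed it from the segment.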

Watching it work

Masakari records every recovery as a notification. After a host event:

# List recent recovery notifications
openstack notification list

# Inspect one
openstack notification show <notification-id>

A healthy recovery shows type: COMPUTE_HOST, status finished, and the evacuated instances now ACTIVE on other hosts. A failed notification usually means the target hosts had no capacity or the instance had a host-specific dependency (PCI passthrough, local storage, a pinned NUMA topology that no spare could satisfy). Instances on local ephemeral disk come back rebuilt from their image, so whatever was on that disk is gone, and instances with device passthrough only recover if a target host has a matching free device. Masakari isn’t magic: it relies on Nova evacuate, which relies on shared storage or rebuild-from-image.
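
To pull out only the recoveries that need attention, the listing can be filtered by status. This sketch assumes the notification list exposes notification_uuid, status and type columns in that order; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# failed_notifications: read "uuid status type" lines on stdin and print the
# uuid and type of every notification that ended in "failed" or "error".
failed_notifications() {
    awk '$2 == "failed" || $2 == "error" { print $1, $3 }'
}

# Example, assuming those column names exist in your client version:
# openstack notification list -f value -c notification_uuid -c status -c type | failed_notifications
```

Each uuid it prints can then be passed to openstack notification show for the details.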

The config that matters

Tune the monitors so transient blips don’t trigger evacuations:

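[host]
# Seconds between host status checks
monitoring_interval = 60
# Seconds to wait for STONITH to finish before the host is reported down
# (replaces monitoring_timeout, which I could not find among the host monitor options; verify for your release)
stonith_wait = 30
# Keep the IPMI power check on so a host is confirmed off before it is reported
disable_ipmi_check = False

[process]
# Restart attempts for a failed process before Masakari is notified
restart_retries = 3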

The most common false-positive incident I see is a too-aggressive monitoring_interval evacuating a host during a brief network hiccup or a control-plane RabbitMQ stall. Give the host a real chance to respond before you declare it dead and start moving disks around. I pair this with an AI prompt that reads a failed notification plus openstack hypervisor list and tells me whether the failure was capacity, passthrough, or a genuine fencing gap — it sorts the three causes faster than scrolling logs. A few of those live in our prompt library.

My production checklist

Before I trust Masakari with auto host recovery:

Fencing is real and tested — I’ve manually confirmed a BMC power-off actually kills the node.
Reserved spares exist — at least one empty host per segment so evacuations don’t cascade.
Shared storage — instances I want recovered are on Cinder/Ceph, not local disk.
Monitors are tuned — intervals long enough to ignore blips, short enough to recover quickly.
Notifications go somewhere — every recovery alerts a human even though it self-healed, so I know it happened.

Where to go next

Masakari is the difference between a dead compute node being an outage and being a notification. But it’s only safe with fencing — without reliable host isolation, automatic evacuation is a data-corruption machine. Wire up BMC fencing, keep reserved spares, and tune your monitors before you enable auto recovery. For more on the Nova and live-migration mechanics Masakari builds on, see the OpenStack category.

High-availability automation can cause data loss without proper fencing. Validate your fencing path before enabling automatic host recovery in production.