Scaling and Debugging Octavia Load Balancers in OpenStack

Octavia is one of those services that works beautifully in a demo and then teaches you humility the first time a production load balancer wedges in PENDING_CREATE at 3am. After running OpenStack LBaaS through several releases, I’ve learned that almost every Octavia problem traces back to one of three things: the amphora image, the lb-mgmt-net, or the control plane’s ability to talk to the amphora it just booted. Get those three right and Octavia is boring. Get any one wrong and you’ll be chasing ghosts.

How Octavia actually works (and why it breaks)

The default Octavia driver — the amphora driver — boots a small Nova instance (the amphora) running HAProxy for every load balancer you create. The Octavia control plane (octavia-worker, octavia-housekeeping, octavia-health-manager) manages those amphorae over a dedicated management network called lb-mgmt-net. The amphora runs a REST agent; the control plane pushes config to it and listens for heartbeats.

That architecture means a load balancer can fail at any of these seams: Nova can’t boot the amphora, the amphora boots but can’t reach the control plane, or the control plane can’t reach the amphora’s agent. When you understand the seams, debugging gets methodical instead of frantic.

Step 1: Read the provisioning status correctly

Start with the status fields, not the logs:

openstack loadbalancer list --column id --column name \
  --column provisioning_status --column operating_status
openstack loadbalancer show <lb-id>

provisioning_status is the control plane’s lifecycle state (ACTIVE, PENDING_CREATE, PENDING_UPDATE, PENDING_DELETE, ERROR). operating_status is data-plane health (ONLINE, OFFLINE, DEGRADED, ERROR). A load balancer stuck in PENDING_CREATE for more than a couple of minutes means the worker never finished its flow; that’s a control-plane problem. A provisioning_status of ACTIVE combined with an operating_status of ERROR means provisioning succeeded but health checks are failing; that’s a data-plane problem.
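
The status logic above can be sketched as a small shell function. This is only a rough first-pass triage under my own assumptions: the function name classify_lb and the labels it prints are hypothetical, and anything other than ACTIVE/ONLINE is treated as worth a look.

```shell
# classify_lb PROVISIONING_STATUS OPERATING_STATUS
# Hypothetical helper: prints which side of Octavia to look at first.
classify_lb() {
  prov=$1
  oper=$2
  case "$prov" in
    PENDING_*|ERROR)
      # The worker flow is still running, stuck, or failed.
      echo "control-plane" ;;
    ACTIVE)
      case "$oper" in
        ONLINE) echo "healthy" ;;
        # Provisioning finished, but the data plane is not fully healthy.
        *)      echo "data-plane" ;;
      esac ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage against a real load balancer (run it yourself, with admin credentials):
#   classify_lb \
#     "$(openstack loadbalancer show <lb-id> -f value -c provisioning_status)" \
#     "$(openstack loadbalancer show <lb-id> -f value -c operating_status)"
```

The function only reads the two status strings, so it is safe to run during an incident; it does not change anything.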

Step 2: When the amphora won’t boot

If you’re stuck in PENDING_CREATE, find the amphora and check Nova first:

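openstack loadbalancer amphora list --loadbalancer <lb-id>
# compute_id (the Nova server ID) is shown by "amphora show":
openstack loadbalancer amphora show <amphora-id>
openstack server show <compute_id from amphora>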
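
When a load balancer has more than one amphora, walking from amphora to Nova server by hand gets tedious. A minimal sketch of that walk, assuming admin credentials are loaded and that the amphora record exposes the Nova server ID as compute_id; the function name amphora_boot_status is hypothetical:

```shell
# amphora_boot_status LB_ID
# Hypothetical helper: prints "<amphora-id> <compute-id> <nova-status>"
# for every amphora that belongs to the load balancer.
amphora_boot_status() {
  lb_id=$1
  openstack loadbalancer amphora list --loadbalancer "$lb_id" -f value -c id |
  while read -r amp_id; do
    # </dev/null keeps the CLI from consuming the list we are iterating over.
    compute_id=$(openstack loadbalancer amphora show "$amp_id" \
                   -f value -c compute_id </dev/null)
    case "$compute_id" in
      ""|None)
        # No Nova server was ever attached to this amphora record.
        echo "$amp_id - no-compute-instance"
        continue ;;
    esac
    nova_status=$(openstack server show "$compute_id" \
                    -f value -c status </dev/null)
    echo "$amp_id $compute_id $nova_status"
  done
}

# Usage:
#   amphora_boot_status <lb-id>
```

A Nova status of ERROR or a long-lived BUILD points at the image, flavor, or capacity causes discussed in this step.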

The usual culprits: the amphora image isn’t tagged correctly (amp_image_tag in octavia.conf must match the Glance image tag), the flavor is too small, or there’s no capacity. Check the worker log:

journalctl -u octavia-worker -f
# or in Kolla:
docker logs octavia_worker 2>&1 | tail -100

Look for ComputeBuildException or timeouts waiting for the amphora to go ACTIVE. If the amphora boots but provisioning still times out, you’ve moved to the next seam.
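
To get a quick sense of whether boot failures are a one-off or a pattern, you can count them in a log stream. A small sketch; the helper name count_boot_failures is hypothetical, and it only matches the ComputeBuildException string mentioned above:

```shell
# count_boot_failures
# Hypothetical helper: reads log text on stdin and prints how many lines
# mention ComputeBuildException.
count_boot_failures() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps that quiet.
  grep -c 'ComputeBuildException' || true
}

# Usage:
#   journalctl -u octavia-worker --since "1 hour ago" --no-pager | count_boot_failures
#   docker logs octavia_worker 2>&1 | count_boot_failures
```

A count that climbs while you watch usually means every new amphora is hitting the same image, flavor, or capacity problem.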

Step 3: The lb-mgmt-net trap

This is where most people lose a day. Traffic on lb-mgmt-net has to flow in both directions: the amphora must reach the health-manager, which binds to an IP on lb-mgmt-net, on UDP 5555 (heartbeats), and the control plane must reach the amphora on TCP 9443 (the agent). Verify the security group and the actual reachability:

# From a control node, on the lb-mgmt-net namespace/interface:
nc -zv <amphora-mgmt-ip> 9443

If that fails, your lb-mgmt-net routing or the o-hm0 interface is misconfigured. In a Kolla deploy, confirm that octavia_network_interface and the health-manager’s bind_ip/bind_port match what the amphora was told (controller_ip_port_list). Certificate mismatches show up here too: the agent uses mutual TLS, so an expired or wrong CA in [haproxy_amphora] can look exactly like a network failure, even though a plain TCP probe of 9443 succeeds.
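
Checking one amphora with nc is fine; checking twenty is easier with a loop. A minimal sketch, assuming an nc that supports -z and -w and that you run it from a host with a leg on lb-mgmt-net. The function name check_agents is hypothetical; the IPs would come from the lb_network_ip column of the amphora list.

```shell
# check_agents IP...
# Hypothetical helper: probes the amphora agent port (TCP 9443) on each
# management IP. Returns non-zero if any probe fails.
check_agents() {
  failed=0
  for ip in "$@"; do
    # -z: connect only, send nothing; -w 3: give up after 3 seconds.
    if nc -z -w 3 "$ip" 9443; then
      echo "$ip reachable"
    else
      echo "$ip unreachable"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   check_agents $(openstack loadbalancer amphora list -f value -c lb_network_ip)
```

This only proves the TCP path. It says nothing about the UDP 5555 heartbeat path or about whether the TLS handshake would succeed.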

Step 4: Failover storms and how to avoid them

When the health-manager misses heartbeats, it triggers a failover — it spins up a replacement amphora. If lb-mgmt-net flaps or the control plane is overloaded, you can get a cascade: every amphora “fails,” every failover spawns a new boot, Nova saturates, and the whole thing snowballs. Tune these in octavia.conf:

[health_manager]
heartbeat_timeout = 60
health_check_interval = 3
failover_threads = 10

[house_keeping]
load_balancer_expiry_age = 604800
amphora_expiry_age = 604800

Raising heartbeat_timeout gives a congested management network slack before the health-manager declares an amphora dead. For real HA, set loadbalancer_topology = ACTIVE_STANDBY in [controller_worker] so each LB has two amphorae with VRRP and a single failover doesn’t drop traffic.
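
It helps to think of heartbeat_timeout as a number of missed heartbeats rather than as seconds. The arithmetic is simple enough to sketch. Note that heartbeat_interval is not part of the config shown above: my understanding is that it also lives in [health_manager] and is commonly 10 seconds, but treat that as an assumption and check your own octavia.conf. The function name is hypothetical.

```shell
# heartbeats_before_failover HEARTBEAT_TIMEOUT HEARTBEAT_INTERVAL
# Hypothetical helper: how many consecutive heartbeats can be lost before
# the timeout expires (integer division, both values in seconds).
heartbeats_before_failover() {
  echo $(( $1 / $2 ))
}

heartbeats_before_failover 60 10   # prints 6
```

If that number is small and lb-mgmt-net is known to drop packets under load, a brief blip is enough to start the cascade described in this step.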

Step 5: Recovering a stuck load balancer

Don’t delete and recreate blindly — that orphans amphorae and ports. Use the built-in failover, which rebuilds the data plane while preserving the LB object:

openstack loadbalancer failover <lb-id>
openstack loadbalancer amphora list --loadbalancer <lb-id>

If the LB is wedged in PENDING_* and won’t move, it usually means a worker died mid-flow. Check for orphaned amphorae in BOOTING and let housekeeping clean them, or as a last resort set the status in the DB — but only after you’ve confirmed no worker is still acting on it.
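
After triggering a failover, it is tempting to keep re-running the show command by hand. A minimal polling sketch, assuming the default of 30 tries at 10-second intervals is reasonable for your cloud; the function name wait_for_active and those defaults are my own choices, not Octavia settings.

```shell
# wait_for_active LB_ID [TRIES] [DELAY_SECONDS]
# Hypothetical helper: polls provisioning_status until it settles.
# Returns 0 on ACTIVE, 1 on ERROR, 2 if it never leaves a PENDING_* state.
wait_for_active() {
  lb_id=$1
  tries=${2:-30}
  delay=${3:-10}
  status=unknown
  while [ "$tries" -gt 0 ]; do
    status=$(openstack loadbalancer show "$lb_id" \
               -f value -c provisioning_status)
    case "$status" in
      ACTIVE) echo "ACTIVE"; return 0 ;;
      ERROR)  echo "ERROR";  return 1 ;;
    esac
    tries=$((tries - 1))
    sleep "$delay"
  done
  echo "TIMEOUT (last status: $status)"
  return 2
}

# Usage:
#   openstack loadbalancer failover <lb-id>
#   wait_for_active <lb-id>
```

A timeout here is the signal to go looking for a dead worker or an orphaned amphora before touching the database.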

Where AI speeds this up

Octavia failures produce a lot of correlated noise across octavia-worker, octavia-health-manager, Nova, and Neutron logs. I’ll paste the amphora list, the worker log tail, and the LB status into a model and ask it to build a timeline and tell me which seam — boot, mgmt-net, or agent — failed first. It’s good at spotting that a cert expired five minutes before the failover storm started. Keep a saved Octavia triage prompt so you’re not authoring it during an outage, and browse the rest of our OpenStack guides for the services Octavia depends on.

The model reads and reasons; you run every openstack command yourself after reading it. Octavia is forgiving once you respect its three seams — and unforgiving the moment you don’t.

Generated commands and configs are assistive, not authoritative. Always verify against your own deployment before running anything in production.