Migrating Neutron to OVN Networking in OpenStack

For years the OpenStack networking story was a sprawl of agents — the L3 agent, DHCP agent, metadata agent, and an OVS agent on every node, all coordinating over RabbitMQ. It worked, but it was a lot of moving parts to keep alive, and RabbitMQ became a bottleneck at scale. OVN (Open Virtual Network) changes the model: it pushes the logic into a distributed database and OpenFlow flows programmed directly onto each hypervisor’s Open vSwitch, killing most of those agents. After running both, I’m convinced OVN is where you want to be — but the migration and the debugging are genuinely different, and the old neutron agent-list muscle memory will mislead you. Here’s the real picture.

What changes when you move to OVN

With OVN, the ML2/OVS mechanism driver (openvswitch) is replaced by ML2/OVN (ovn). Instead of per-node L3/DHCP/metadata agents talking over RabbitMQ, you get:

A northbound DB (NB) that holds the logical network model Neutron writes to (logical switches, routers, ports, ACLs).
A southbound DB (SB) that holds the physical realization: which chassis (hypervisor) hosts what, plus the logical flows that ovn-northd compiles from the NB model.
ovn-controller on every compute/network node, which reads the SB and programs OpenFlow rules into local OVS.

Routing becomes distributed (each hypervisor routes its own east-west traffic, and floating-IP NAT can be distributed too if you enable it). DHCP is served natively by OVN, metadata comes from a lightweight OVN metadata agent on each compute node, and Neutron no longer drives networking agents over RabbitMQ. Fewer agents, less message-bus load, faster failover. The cost: a new mental model and new debugging tools.
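To make the three layers concrete, here is a minimal read-only sketch that walks one Neutron network from the NB model, to the SB logical flows, to the OpenFlow rules on the local br-int. It assumes ML2/OVN's convention of naming a network's logical switch neutron-<network-uuid> and a node that can reach both databases; the function names are hypothetical.

```shell
# Hypothetical helper: ML2/OVN names a network's logical switch
# "neutron-<network-uuid>", so build that name from a Neutron network ID.
ovn_ls_name() {
  printf 'neutron-%s\n' "$1"
}

# Hypothetical helper: walk one network down the OVN stack (read-only).
ovn_layers() {
  ls_name=$(ovn_ls_name "$1")

  echo "== NB: logical model Neutron wrote for ${ls_name} =="
  ovn-nbctl show "$ls_name"

  echo "== SB: first logical flows ovn-northd compiled for ${ls_name} =="
  ovn-sbctl lflow-list "$ls_name" | head -n 20

  # Line count of the dump: roughly one line per rule plus a reply header.
  echo "== Local OVS: OpenFlow rules ovn-controller installed on br-int =="
  ovs-ofctl dump-flows br-int | wc -l
}

# Usage: ovn_layers <neutron-network-uuid>
```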

Step 1: How to migrate without a flag day

You don’t rip out OVS and bolt on OVN in one shot on a live cloud. The supported path is the OVN migration tooling that ships with Neutron (deployment tools such as TripleO wrap it in playbooks, so check what yours provides), and the safe sequence is:

Stand up the OVN NB/SB databases and ovn-northd on the controllers.
Switch neutron-server to ML2/OVN and sync existing Neutron resources into the NB DB (neutron-ovn-db-sync-util) while existing ports keep working.
Roll ovn-controller onto hypervisors and migrate data-plane state node by node.
Decommission the old L3/DHCP/metadata/OVS agents once each node is converted.
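On the Neutron side, the configuration switch comes down to a few ML2 settings. This sketch writes a minimal example to a scratch file for review rather than touching the live ml2_conf.ini. 192.0.2.10 is a documentation placeholder, and 6641/6642 are the conventional NB/SB ports. The router service plugin in neutron.conf also changes from router to ovn-router. Your deployment tool most likely templates all of this, so treat the fragment as a reference, not a drop-in file.

```shell
# Sketch only: write an example ML2/OVN config to a scratch file for review.
# 192.0.2.10 is a placeholder for your NB/SB DB endpoint (or cluster list).
example_conf="${TMPDIR:-/tmp}/ml2_conf.ovn-example.ini"

cat > "$example_conf" <<'EOF'
[ml2]
# The "openvswitch" mechanism driver is replaced by "ovn".
mechanism_drivers = ovn
type_drivers = local,flat,vlan,geneve
tenant_network_types = geneve

[ml2_type_geneve]
vni_ranges = 1:65536

[ovn]
ovn_nb_connection = tcp:192.0.2.10:6641
ovn_sb_connection = tcp:192.0.2.10:6642
ovn_metadata_enabled = True
# False (the default) keeps floating-IP NAT on the gateway chassis.
enable_distributed_floating_ip = False
EOF

echo "wrote $example_conf"
```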

The single biggest piece of advice: do this in a maintenance window with a tested rollback, validate on a staging cloud first, and migrate hypervisors in small batches confirming connectivity after each. Network migrations are the one place “move fast” gets you a region-wide outage.
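A small helper keeps the batching honest. This is a sketch with hypothetical names. It only groups hostnames. The per-batch migration and the connectivity checks are whatever your tooling provides.

```shell
# Hypothetical helper: print hypervisor names in batches of N, one batch per
# line, so each batch can be migrated and verified before the next starts.
batch_hosts() {
  size="$1"; shift
  count=0
  line=""
  for host in "$@"; do
    line="${line:+$line }$host"
    count=$((count + 1))
    if [ "$count" -eq "$size" ]; then
      printf '%s\n' "$line"
      line=""
      count=0
    fi
  done
  # Print the final, possibly short, batch.
  [ -n "$line" ] && printf '%s\n' "$line"
  return 0
}

# Usage: walk the batches in order.
batch_hosts 2 compute-01 compute-02 compute-03 |
while read -r batch; do
  echo "next batch: $batch"
  # Hypothetical: run your migration for this batch, then check
  # connectivity to test VMs and break out on any failure.
done
```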

Step 2: Verify OVN is healthy

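After migration, don’t lean on neutron agent-list (openstack network agent list in the unified client) for the data plane. Under OVN its entries are synthesized from the SB Chassis records, so query OVN directly. Check the databases and the chassis: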

# Northbound: logical model Neutron wrote
ovn-nbctl show
# Southbound: physical realization + which chassis are up
ovn-sbctl show
ovn-sbctl list chassis

Every compute node should appear as a chassis in ovn-sbctl show. A missing node means its ovn-controller isn’t registered, so its ports can’t be bound and its VMs will lose networking. Confirm ovn-controller on each hypervisor and ovn-northd on the controllers:

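# On each hypervisor (unit names vary by distro; containerized deployments run these in containers)
systemctl status ovn-controller
ovs-vsctl get open_vswitch . external_ids:ovn-remote
# On the controllers
systemctl status ovn-northd
ovn-nbctl get-connection
ovn-sbctl get-connection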

ovn-remote on each node must point at the SB DB (typically tcp:<ip>:6642 or ssl:<ip>:6642, or a comma-separated list for a clustered SB). A wrong or stale value is the classic “this one node has no connectivity after migration” bug.
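Both checks are easy to script. The sketch below assumes you keep an inventory file with one hypervisor hostname per line and that those names match Chassis.hostname in the SB DB. The function names are hypothetical, and the ovn-remote check only catches obviously wrong formats, not unreachable addresses.

```shell
# Hypothetical helper: print each name from expected_file (one hostname per
# line) that does not appear as a whole line in registered_text.
hosts_not_in() {
  expected_file="$1"
  registered_text="$2"
  while read -r host; do
    [ -z "$host" ] && continue
    printf '%s\n' "$registered_text" | grep -qxF -- "$host" || printf '%s\n' "$host"
  done < "$expected_file"
}

# Hypothetical wrapper: compare your inventory with the SB Chassis table.
missing_chassis() {
  hosts_not_in "$1" "$(ovn-sbctl --bare --columns=hostname list Chassis)"
}

# Hypothetical helper: sanity-check an ovn-remote value, e.g. the output of
#   ovs-vsctl get open_vswitch . external_ids:ovn-remote
# Entries should be tcp:, ssl: or unix:. Port 6641 is the conventional NB
# port, so seeing it here usually means a copy-paste mistake.
check_ovn_remote() {
  remote=$(printf '%s' "$1" | tr -d '"')
  if [ -z "$remote" ]; then
    echo "ovn-remote is empty"
    return 1
  fi
  rc=0
  for entry in $(printf '%s' "$remote" | tr ',' ' '); do
    case "$entry" in
      tcp:*:6641|ssl:*:6641) echo "looks like the NB port: $entry"; rc=1 ;;
      tcp:*|ssl:*|unix:*) ;;
      *) echo "unexpected format: $entry"; rc=1 ;;
    esac
  done
  return "$rc"
}
```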

Step 3: Debug a port that has no connectivity

When a specific VM can’t talk, trace it through the logical model. Find its logical port and confirm it’s bound to a chassis:

ovn-nbctl show | grep -A2 <port-or-network>
ovn-sbctl show | grep -A3 <chassis>
# Is the port actually bound to a hypervisor?
ovn-sbctl find Port_Binding logical_port=<neutron-port-id>

A Port_Binding with an empty chassis means OVN knows about the port logically but hasn’t bound it to a hypervisor — usually the VM’s host ovn-controller is down or the port was created while the node was offline. The killer feature here is ovn-trace, which simulates a packet through the logical pipeline without sending anything:

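# <logical-switch> is neutron-<network-uuid> under ML2/OVN. Include eth.src and ip.ttl:
# port security checks the source MAC, and routers won't forward packets with TTL 0 or 1.
ovn-trace <logical-switch> 'inport=="<port>" && eth.src==<src-mac> && eth.dst==<dst-mac> && ip4.src==<src-ip> && ip4.dst==<dst-ip> && ip.ttl==64'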

That shows you exactly which logical flow drops or forwards the packet — security group ACL, router, or NAT — without tcpdump archaeology.
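The per-port workflow scripts nicely too. In this sketch, with hypothetical function names, one helper reports whether the Port_Binding has a chassis and another builds a complete ovn-trace microflow. It assumes ML2/OVN's naming, where logical_port is the Neutron port UUID and the logical switch is neutron-<network-uuid>.

```shell
# Hypothetical helper: raw chassis column for a Neutron port's Port_Binding.
port_chassis() {
  ovn-sbctl --bare --columns=chassis find Port_Binding logical_port="$1"
}

# Prints "UNBOUND" when the chassis column is empty, else the chassis UUID.
binding_state() {
  chassis=$(printf '%s' "$1" | tr -d '[:space:]')
  if [ -z "$chassis" ]; then echo "UNBOUND"; else echo "$chassis"; fi
}

# Hypothetical helper: build a full microflow. Fields left out of an
# ovn-trace microflow default to zero, which can cause misleading drops.
trace_flow() {
  inport="$1" src_mac="$2" dst_mac="$3" src_ip="$4" dst_ip="$5"
  printf 'inport == "%s" && eth.src == %s && eth.dst == %s && ip4.src == %s && ip4.dst == %s && ip.ttl == 64\n' \
    "$inport" "$src_mac" "$dst_mac" "$src_ip" "$dst_ip"
}

# Usage (for routed traffic, eth.dst is the router port's MAC):
#   binding_state "$(port_chassis <neutron-port-id>)"
#   ovn-trace --summary neutron-<network-uuid> \
#     "$(trace_flow <neutron-port-id> <vm-mac> <router-mac> <vm-ip> <dst-ip>)"
```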

Step 4: Floating IPs and NAT under OVN

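Floating IPs become dnat_and_snat entries on the OVN logical router instead of rules inside an L3 agent namespace. By default ML2/OVN still applies that NAT on the gateway chassis. Setting enable_distributed_floating_ip = True in the [ovn] section of the ML2 config makes each compute node do it locally. If a floating IP doesn’t work after migration, check the NAT rules in the northbound DB: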

ovn-nbctl lr-nat-list <logical-router>

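A missing or wrong NAT entry points at a Neutron-to-OVN sync issue. For north-south traffic, confirm which chassis hosts the router’s gateway port (ovn-sbctl show lists a cr-lrp- port binding under that chassis). Even with distributed floating IPs, SNAT for instances without a floating IP still goes through the gateway chassis. Gateway ports are scheduled with failover priorities across the chassis enabled as gateways, so if only one chassis is eligible and it goes down, north-south traffic for those networks stops.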
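To check one floating IP quickly, test the lr-nat-list output for a dnat_and_snat row carrying that address. This is a sketch with a hypothetical function name. Because the lr-nat-list column layout has changed between OVN releases, it matches on the type column and on any field equal to the IP rather than on a fixed column. It also assumes ML2/OVN's neutron-<router-uuid> router naming.

```shell
# Hypothetical helper: succeed if lr-nat-list output has a dnat_and_snat
# (floating IP) row containing the given IP in any column.
has_fip_nat() {
  nat_output="$1" fip="$2"
  printf '%s\n' "$nat_output" |
    awk -v ip="$fip" '
      $1 == "dnat_and_snat" { for (i = 2; i <= NF; i++) if ($i == ip) found = 1 }
      END { exit (found ? 0 : 1) }'
}

# Usage:
#   has_fip_nat "$(ovn-nbctl lr-nat-list neutron-<router-uuid>)" <floating-ip> \
#     || echo "no FIP NAT entry: check Neutron-to-OVN sync"
```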

Step 5: Watch the databases at scale

The OVN SB DB is now critical infrastructure: if it’s unavailable, each ovn-controller keeps the flows it already installed but can’t apply any changes. Run the NB/SB DBs as Raft clusters across your controllers, monitor their disk usage and connection counts, and watch ovn-northd processing latency. A bloated SB DB or a northd that’s falling behind shows up as networking changes taking minutes to apply, the OVN equivalent of the old RabbitMQ backlog.
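For the Raft clusters themselves, each database server reports its role through the cluster/status command on its control socket. In a healthy cluster every member reports Status: cluster member and exactly one reports Role: leader. This is a sketch with a hypothetical helper that extracts those two lines. The socket path in the usage comment is an assumption that differs across distros and containerized deployments.

```shell
# Hypothetical helper: pull the Role and Status lines out of a
# cluster/status dump and print them on one line.
raft_summary() {
  printf '%s\n' "$1" | awk -F': ' '
    $1 == "Role"   { role = $2 }
    $1 == "Status" { status = $2 }
    END { printf "role=%s status=%s\n", role, status }'
}

# Usage (socket path is an assumption; check your deployment):
#   raft_summary "$(ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound)"
```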

Where AI helps

OVN’s strength — pushing logic into databases and flows — is also what makes it opaque when it breaks. The ovn-nbctl show, ovn-sbctl show, and ovn-trace output is dense. I’ll paste those into a model and ask it to confirm the port is bound, find the chassis, and read the ovn-trace output to tell me which logical flow dropped the packet. It’s genuinely good at parsing a trace and pointing at “this ACL dropped it” faster than I’d find it by hand.
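To make the paste step repeatable, a small collector can gather the read-only views for one port into a single file. This is a sketch with hypothetical names. The output contains IPs, MACs, and hostnames, so review and redact it before sharing it with any external model.

```shell
# Hypothetical helper: capture read-only OVN views for one Neutron port into
# a text file and print the file's path.
ovn_triage_bundle() {
  port_id="$1"
  out="${2:-ovn-triage-$port_id.txt}"
  : > "$out"

  # Append a header, then the command's output (or a note if it's missing).
  _triage_run() {
    printf '===== %s =====\n' "$*" >> "$out"
    if command -v "$1" >/dev/null 2>&1; then
      "$@" >> "$out" 2>&1
    else
      printf '(%s not found on this host)\n' "$1" >> "$out"
    fi
  }

  _triage_run ovn-nbctl show
  _triage_run ovn-sbctl show
  _triage_run ovn-sbctl find Port_Binding logical_port="$port_id"
  echo "$out"
}

# Usage: ovn_triage_bundle <neutron-port-id>
```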

Keep a saved OVN triage prompt and pair it with our broader OpenStack guides, especially the Neutron deep-dive: OVN is Neutron’s future, but the concepts build on each other. The model reads the output you paste; you run every ovn-*ctl command, and you migrate hypervisors in small, reversible batches.

Generated commands are assistive, not authoritative. Always verify against your own deployment and test the migration on staging before touching production networking.