AI for OpenStack · By James Joyner IV · 9 min read

Debugging Neutron Networking in OpenStack

Neutron failures hide behind layers of namespaces, OVS bridges, and security groups. Here's a methodical packet-path approach to debugging OpenStack networking.

  • #openstack
  • #neutron
  • #networking
  • #ovs
  • #troubleshooting
  • #sdn

Neutron is where good OpenStack engineers go to question their career choices. The symptom is simple — “my instance has no network” — but the path a packet takes from a VM’s virtual NIC to the physical wire crosses tap devices, Linux bridges, OVS integration and tunnel bridges, network namespaces, and a router or DHCP agent that lives in its own namespace on a different node.

After years of running these clouds, the only thing that reliably works is to stop guessing and walk the packet path in order. Here’s how I do it.

First, separate the three failure classes

Before touching any bridge, classify the symptom:

  1. No IP at all — DHCP or port-binding problem.
  2. Has IP, can’t reach the gateway — L2/security-group problem inside the tenant network.
  3. Reaches gateway, can’t reach the internet — L3 router, NAT, or floating-IP problem.

Each class lives in a different place. Confirm which one you have before you go deeper:

openstack port list --device-id <instance-uuid>
openstack port show <port-uuid> -f value -c status -c binding_vif_type

A port status of DOWN or a binding_vif_type of binding_failed means the L2 agent never wired it up — stop here and check the agent, not the router.
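The triage rule above can be sketched as a small shell function. This is only a sketch: triage_port is a hypothetical helper name, and it queries nothing itself. You pass in the status and binding_vif_type values you read from the port show output.

```shell
# Hypothetical helper: decide where to look next from a port's status and
# binding_vif_type. Both values are passed in as arguments.
triage_port() {
  status=$1
  vif_type=$2
  if [ "$vif_type" = "binding_failed" ]; then
    echo "binding failed: check the L2 agent on the port's host"
  elif [ "$status" = "DOWN" ]; then
    echo "port is DOWN: check the L2 agent before the router"
  else
    echo "port is wired: continue to the next layer"
  fi
}
```

For example, triage_port DOWN ovs reports that the port is DOWN and points you at the L2 agent rather than the router.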

Step 1: Confirm the agents are alive

Half of all Neutron “outages” are a dead agent:

openstack network agent list

Any agent whose Alive column shows XXX instead of :-), whether it is an OVS agent, DHCP agent, or L3 agent, is your culprit. On the affected node:

systemctl status neutron-openvswitch-agent
journalctl -u neutron-openvswitch-agent --since "30 min ago"

A common trap: the agent process is “running” but wedged after a RabbitMQ blip and no longer processing RPC. Restart it and watch the port move to ACTIVE.
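The dead-agent check at the top of this step can be scripted when the agent list is long. The sketch below is a hypothetical filter, dead_agents, that reads the default table output on stdin and prints the host and type of every agent whose Alive column is not :-). It assumes the header names are "Agent Type", "Host", and "Alive"; confirm that against your own client's output.

```shell
# Hypothetical filter: read the table printed by the agent list command on
# stdin and print "host: agent type" for every agent that is not alive.
# Assumes header cells named "Agent Type", "Host" and "Alive".
dead_agents() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    /^\+/ || NF < 3 { next }            # skip borders and blank lines
    !acol {                             # first data-like row is the header
      for (i = 1; i <= NF; i++) {
        name = trim($i)
        if (name == "Alive") acol = i
        if (name == "Host") hcol = i
        if (name == "Agent Type") tcol = i
      }
      next
    }
    trim($acol) != ":-)" { print trim($hcol) ": " trim($tcol) }
  '
}
```

Used as openstack network agent list | dead_agents, an empty result means every agent reported in as alive. It says nothing about an agent that is alive but wedged, so the restart advice above still applies.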

Step 2: Walk the L2 path with namespaces and OVS

This is the part people fear. It’s just a sequence of three checks.

On the node that hosts the DHCP agent, find the DHCP namespace and test from inside it:

ip netns list
ip netns exec qdhcp-<network-uuid> ip a
ip netns exec qdhcp-<network-uuid> ping <instance-fixed-ip>

If you can ping the instance from inside the DHCP namespace of its network, L2 is healthy and your problem is L3. If you can’t, inspect OVS on the compute node:

ovs-vsctl show
ovs-ofctl dump-flows br-int | grep <port-vlan-tag>

The integration bridge br-int tags each port with a local VLAN. A port with tag 4095 (the “dead” tag) means the OVS agent has parked it on the dead VLAN, where its traffic is dropped, usually after a binding failure or a wiring race. That single tag value has saved me hours.
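Spotting the dead tag by eye in ovs-vsctl show gets tedious on a busy compute node. Here is a minimal sketch of a filter, under the hypothetical name dead_vlan_ports, that reads name,tag pairs in CSV form and prints the ports sitting on tag 4095. It assumes you can produce that CSV with ovs-vsctl -f csv --no-headings -d bare --columns=name,tag list Port; check those options against your OVS version.

```shell
# Hypothetical filter: read "name,tag" CSV lines on stdin and print the
# names of ports whose tag is 4095 (the dead VLAN).
dead_vlan_ports() {
  # Strip any quoting around names, then match on the tag column.
  awk -F',' '{ gsub(/"/, "") } $2 == 4095 { print $1 }'
}
```

On a live node that would be ovs-vsctl -f csv --no-headings -d bare --columns=name,tag list Port | dead_vlan_ports. Any tap device it prints is a port to chase back through the agent log.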

Step 3: Check security groups before blaming the fabric

Before you tear apart the overlay, rule out the firewall. Security groups silently drop traffic and look exactly like a broken network:

openstack security group rule list <sg-id>

The classic mistake is a security group with no egress allowed, or ICMP missing so ping fails while SSH would have worked. Add a temporary allow-all rule to test — then remove it. If connectivity returns, the fabric was fine all along.
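The two classic mistakes named above can be checked mechanically. The sketch below is a hypothetical filter, sg_gaps, that reads the rule list table on stdin and reports a missing egress rule or a missing ingress rule that would admit ICMP. It assumes the table has columns named "IP Protocol" and "Direction" and that an any-protocol rule shows as any or None. Both vary by client version, so treat the output as a hint rather than a verdict.

```shell
# Hypothetical filter: read a security group rule table on stdin and print
# the two classic gaps. Assumes "IP Protocol" and "Direction" header cells.
# If the Direction column is absent, both gaps are reported.
sg_gaps() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    /^\+/ || NF < 3 { next }            # skip borders and blank lines
    !dcol {                             # header row: locate the columns
      for (i = 1; i <= NF; i++) {
        name = trim($i)
        if (name == "Direction") dcol = i
        if (name == "IP Protocol") pcol = i
      }
      next
    }
    {
      dir = trim($dcol); proto = trim($pcol)
      if (dir == "egress") egress = 1
      if (dir == "ingress" && (proto == "icmp" || proto == "any" || proto == "None")) icmp = 1
    }
    END {
      if (!egress) print "no egress rule"
      if (!icmp) print "no ingress rule that admits ICMP"
    }
  '
}
```

Piping openstack security group rule list <sg-id> into sg_gaps and getting no output means neither gap is present, and the temporary allow-all test is the next thing to try. The sketch ignores ethertype and remote restrictions, so a rule can exist and still not match your traffic.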

Step 4: Trace L3, NAT, and floating IPs

If L2 is clean, move to the router namespace:

ip netns exec qrouter-<router-uuid> ip a
ip netns exec qrouter-<router-uuid> ip route
ip netns exec qrouter-<router-uuid> iptables -t nat -L -n -v

Floating-IP problems live in those NAT rules. A missing DNAT entry for the floating IP means the L3 agent didn’t finish wiring it — check the agent log and re-associate the floating IP. For external reachability, confirm the router’s gateway port is ACTIVE and the external network’s physical bridge mapping is correct.
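The missing-DNAT check is easy to script against the rule-spec form of the NAT table. This sketch defines a hypothetical helper, has_dnat, that reads iptables -t nat -S output on stdin and succeeds only if some rule matches the floating IP as destination and jumps to DNAT. It assumes the L3 agent writes the rule as -d <floating-ip>/32 ... -j DNAT, which is worth confirming on your own router namespace.

```shell
# Hypothetical helper: exit 0 if stdin (iptables -S style rules) contains a
# DNAT rule whose destination is the given floating IP, else exit 1.
has_dnat() {
  awk -v fip="$1" '
    {
      d = 0; j = 0
      for (i = 1; i < NF; i++) {
        if ($i == "-d" && ($(i+1) == fip || $(i+1) == (fip "/32"))) d = 1
        if ($i == "-j" && $(i+1) == "DNAT") j = 1
      }
      if (d && j) found = 1
    }
    END { exit !found }
  '
}
```

A typical use would be ip netns exec qrouter-<router-uuid> iptables -t nat -S | has_dnat <floating-ip> || echo "no DNAT rule". The comparison is on whole fields, so one floating IP cannot match as a substring of another.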

Step 5: Tunnel and MTU gremlins

If two instances on the same network can’t talk across compute nodes but can talk on the same node, you’re in overlay territory. Check that the tunnel endpoints see each other:

ovs-vsctl show | grep -A2 tun
ping <other-compute-tunnel-ip>
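To avoid copying tunnel IPs by hand, the peer addresses can be pulled out of the same ovs-vsctl show output. The sketch below uses a hypothetical helper name, tunnel_peers, and assumes each tunnel port prints its endpoint as remote_ip="…" in the options line.

```shell
# Hypothetical filter: read `ovs-vsctl show` output on stdin and print each
# distinct tunnel remote_ip, one per line.
tunnel_peers() {
  grep -o 'remote_ip="[^"]*"' | cut -d'"' -f2 | sort -u
}
```

Something like ovs-vsctl show | tunnel_peers | while read -r ip; do ping -c 2 "$ip"; done then tests every peer in one pass. An empty list means this node has built no tunnels at all, which is a finding in itself.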

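And never forget MTU. VXLAN and GRE encapsulation add header overhead, so if the instance MTU is not reduced to fit inside the physical network's MTU you get the maddening “SSH connects then hangs” symptom: small packets pass, large ones get dropped. Test from inside the instance with ping -M do -s <payload> <peer-ip>, where the payload is the MTU you expect minus 28 bytes of IP and ICMP headers, so -s 1422 for a 1450-byte MTU.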
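The arithmetic behind that ping test is worth writing down, because the -s flag sets the ICMP payload, not the packet size. Here is a small sketch with two hypothetical helpers. The 50-byte default is the commonly used overhead for VXLAN over an IPv4 underlay and 42 the usual figure for GRE; treat both as assumptions to confirm for your deployment.

```shell
# Overlay MTU left after encapsulation. The second argument is the
# encapsulation overhead in bytes (assumed default: 50, VXLAN over IPv4).
overlay_mtu() {
  underlay=$1
  overhead=${2:-50}
  echo $((underlay - overhead))
}

# Largest ICMP payload that fits in one unfragmented IPv4 packet of the
# given MTU: subtract 20 bytes of IPv4 header and 8 bytes of ICMP header.
ping_payload() {
  echo $(($1 - 28))
}
```

With a 1500-byte underlay, overlay_mtu 1500 gives 1450 and ping_payload 1450 gives 1422. A ping with -M do at that payload should pass between instances, and one byte more should fail. If the smaller one fails too, something in the path has a lower MTU than you think.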

Where AI actually helps with Neutron

The OVS flow tables and iptables NAT chains are dense, and reading them under pressure is error-prone. I paste dump-flows output or the router’s iptables -t nat dump into an LLM and ask:

“Here is the OVS flow table for br-int and the NAT rules from a qrouter namespace. Explain in plain English what happens to a packet from fixed IP 10.0.0.5 destined for the internet, and tell me which rule, if any, would drop or fail to translate it. Read-only analysis only.”

It won’t fix anything, but turning a wall of flow rules into a packet-path narrative catches the missing DNAT or the wrong VLAN tag fast. I keep these decoding prompts alongside my other OpenStack prompts so the walkthrough is consistent every time.

The mindset that makes Neutron tractable

Neutron stops being scary when you commit to the packet path: port binding, L2 in the namespace, security groups, L3/NAT, then the overlay. You answer one question per layer and never debug the whole SDN at once.
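The one-question-per-layer habit maps naturally onto a tiny runner that stops at the first failing check. The sketch below is illustrative only: walk_path and the five check_ functions are hypothetical names, and the check bodies are placeholders you would replace with the real test for each layer.

```shell
# Run the named check functions in order and report the first that fails.
walk_path() {
  for check in "$@"; do
    if ! "$check"; then
      echo "first failing layer: $check"
      return 1
    fi
  done
  echo "all layers passed"
}

# Placeholder checks: replace each body with the real test for that layer.
check_port_binding()    { true; }
check_l2()              { true; }
check_security_groups() { false; }   # pretend this layer is broken
check_l3_nat()          { true; }
check_overlay()         { true; }
```

Calling walk_path check_port_binding check_l2 check_security_groups check_l3_nat check_overlay with these placeholders stops at check_security_groups and never runs the L3 or overlay checks. The point is to stop you debugging layers above the first break.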

Keep a node map handy — which agents run where, what the bridge mappings are, what your overlay MTU is — because half of debugging is just knowing the expected topology. Pair that map with a saved set of decode-this-flow-table prompts from our prompt library, and a 45-minute mystery becomes a 10-minute walk down the stack.

AI analysis of network state is assistive, not authoritative. Verify every change against your own topology before applying it.
