AI-Assisted Neutron Security Group and Port Binding Troubleshooting

There is a particular flavor of OpenStack ticket I have come to dread more than almost any other: “the VM booted but it has no network.” It booted, so Nova is happy. It has a port, so Neutron thinks it is happy. And yet packets go nowhere. Nine times out of ten the answer is one of two things — the port never bound to an agent, or a security group is silently dropping traffic. After years of chasing these, I have a workflow, and lately an AI assistant rides shotgun. Useful for speed. Never trusted with my cloud.

Let me walk you through how I untangle it.

Start at the port, not the VM

The single most informative command in this whole investigation is openstack port show. Everything else flows from what it tells you.

openstack port show <port-id>

The two fields I go to immediately are binding_vif_type and binding_host_id. A healthy port shows binding_vif_type as ovs, bridge, or whatever your driver uses. A broken one shows the word I have come to hate:

binding_vif_type   | binding_failed

binding_failed means Neutron asked the ML2 mechanism drivers to bind the port to the compute host and every driver declined. The VM has a logical port and no datapath. This is not a security group problem yet; it is an agent problem. I paste the full port show output into ChatGPT and ask it to flag anomalies, because it spots a mismatched binding_host_id or an empty binding_profile faster than I can scan the output. But it only reads what I give it; it has no access to my environment, and that is deliberate.

Pro Tip: If binding_host_id is blank on a port attached to a running VM, the port was created but never properly bound at attach time. That points at the Nova/Neutron handoff, not the agent itself — check nova-compute.log for the port binding call.
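The first-pass triage on those two fields is mechanical enough to sketch in a few lines of Python. This is a minimal illustration under stated assumptions, not a finished tool: it assumes the port was fetched with openstack port show <port-id> -f json and parsed into a dict, and the function name classify_port and its three labels are hypothetical. Checking for a blank host before checking for binding_failed is an ordering chosen for this sketch.

```python
def classify_port(port):
    """Coarse triage label for a port dict parsed from `openstack port show -f json`.

    Labels (hypothetical, for illustration only):
      "handoff" - no binding_host_id, so look at the Nova/Neutron handoff
      "agent"   - binding_failed, so look at the agent on binding_host_id
      "bound"   - binding looks healthy, move on to security groups
    """
    vif_type = port.get("binding_vif_type")
    host = port.get("binding_host_id")

    # Without a host there is no agent to blame yet.
    if not host:
        return "handoff"
    if vif_type == "binding_failed":
        return "agent"
    return "bound"


# Hypothetical sample, trimmed to the two fields that matter here.
sample = {"binding_vif_type": "binding_failed", "binding_host_id": "compute-07"}
print(classify_port(sample))
```

Only the two binding fields are inspected; everything else in the port show output still deserves a human read.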

binding_failed means: find the missing agent

When a port shows binding_failed, the mechanism driver on that host could not claim it. Almost always the agent for that driver is down, missing, or running a config that does not match the network type.

openstack network agent list
openstack network agent list --host compute-07
openstack network agent show <agent-id>

Look at the Alive and State columns. An XXX in Alive on the OVS or OVN agent for the host where the VM landed is your smoking gun. Maybe neutron-openvswitch-agent crashed, or the ovn-controller on that host lost its connection to the southbound DB. Restart it, re-trigger the bind, and binding_vif_type flips to a real value:

# after fixing the agent, nudge the port to re-bind:
openstack port set --host compute-07 <port-id>

This is where an AI is genuinely a fast junior engineer: I describe “OVS agent dead on compute-07, port binding_failed,” and it instantly drafts the agent-restart and re-bind sequence, with the systemd unit names for my distro. That saves me a minute of recall. Then I read every command before it runs, because the model occasionally suggests a legacy neutron CLI command that no longer exists on a newer release.
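Finding the dead agent on one host can also be scripted. The sketch below assumes the agent list was exported with openstack network agent list -f json and that each row carries the Host, Alive, and Binary columns. How Alive is encoded (a boolean versus the ":-)" / "XXX" strings) may vary by client version, so the code accepts both; treat that handling as an assumption. The helper name dead_agents is hypothetical.

```python
def dead_agents(agents, host):
    """Return the binaries of agents on `host` that are not reported alive.

    `agents` is a list of dicts parsed from `openstack network agent list -f json`.
    Assumption: "Alive" is either a boolean or the string ":-)" / "XXX".
    """
    dead = []
    for agent in agents:
        if agent.get("Host") != host:
            continue
        alive = agent.get("Alive")
        if alive is True or alive == ":-)":
            continue
        dead.append(agent.get("Binary", "unknown"))
    return dead


# Hypothetical rows, trimmed to the columns used above.
rows = [
    {"Host": "compute-07", "Alive": False, "Binary": "neutron-openvswitch-agent"},
    {"Host": "compute-07", "Alive": True, "Binary": "neutron-metadata-agent"},
    {"Host": "compute-08", "Alive": True, "Binary": "neutron-openvswitch-agent"},
]
print(dead_agents(rows, "compute-07"))
```

An empty result does not prove the host is healthy. On an OVN cloud the interesting process is ovn-controller, and how it shows up in this list depends on the deployment.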

OVS vs. OVN vs. DVR changes everything

You cannot debug binding without knowing your backend. The agent topology and the logs you read differ completely:

ML2/OVS centralized: neutron-openvswitch-agent on every compute, L3 routing on dedicated network nodes. Read /var/log/neutron/openvswitch-agent.log.
ML2/OVS + DVR: east-west and floating-IP routing distributed onto computes. A binding or FIP issue might live on the compute, not a central node.
ML2/OVN: no per-host Neutron agents in the classic sense; ovn-controller and the northbound/southbound databases do the work. Read ovn-controller logs and check ovs-vsctl show.

I make a point of telling the AI which backend I run up front. The number of times an assistant has confidently told me to check neutron-l3-agent logs on an OVN cloud — where that agent does not exist — is exactly why context matters. Give it the wrong frame and it generates beautifully wrong advice. I keep backend-specific scoping notes in a prompt pack so I paste the right context every time instead of relying on the model to guess.
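One way to keep that context consistent is to generate the preamble instead of retyping it. The following is a minimal sketch of the idea, not the contents of any real prompt pack: the dictionary keys, the function name scoping_preamble, and the wording of each note are hypothetical, with the facts in the notes taken from the backend summary above.

```python
# Hypothetical backend keys mapped to one-line scoping notes.
BACKEND_NOTES = {
    "ovs": (
        "ML2/OVS centralized: neutron-openvswitch-agent on every compute, "
        "L3 routing on dedicated network nodes; agent log is "
        "/var/log/neutron/openvswitch-agent.log."
    ),
    "ovs-dvr": (
        "ML2/OVS with DVR: east-west and floating-IP routing is distributed "
        "onto computes, so routing issues may live on the compute host."
    ),
    "ovn": (
        "ML2/OVN: no classic per-host Neutron agents, so neutron-l3-agent "
        "does not exist here; ovn-controller and the northbound/southbound "
        "databases do the work."
    ),
}


def scoping_preamble(backend):
    """Build the context paragraph to paste ahead of any pasted CLI output."""
    if backend not in BACKEND_NOTES:
        # Fail loudly rather than let the model guess the backend.
        raise ValueError("unknown backend: %r" % (backend,))
    return (
        "Context for this session. "
        + BACKEND_NOTES[backend]
        + " Do not suggest agents, logs, or commands from other backends."
    )


print(scoping_preamble("ovn"))
```

Raising on an unknown key is deliberate: a missing frame is better caught before the prompt is sent than after the advice comes back.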

Port binds fine but traffic still dies: security groups

Now the harder case. binding_vif_type is healthy, the agent is alive, and packets still vanish. This is the security group layer, and it fails silently — no error, no log line, just dropped traffic. That silence is what makes it brutal.

openstack port show <port-id> -c security_group_ids -c port_security_enabled
openstack security group rule list <sg-id> --long

Check three things in order:

Is port_security_enabled true? If it is false, security groups and anti-spoofing are off entirely. Sometimes that is intentional (an SR-IOV or service port); sometimes someone disabled it and forgot.
Do the rules actually permit the traffic? Neutron default-denies. No matching ingress rule means the packet dies quietly.
Which security groups are attached to this specific port? Not the VM, the port. A multi-NIC instance can have different groups per interface.

openstack security group rule create --ingress --protocol tcp --dst-port 443 \
  --remote-ip 10.0.0.0/24 <sg-id>

I lean on AI heavily here to read rule sets. A security group with forty rules, half of them remote-group references, is miserable to audit by eye. I paste the rule list output and ask the model to tell me whether TCP 443 from a given CIDR would be permitted. It is fast and usually right — and I still trace the matching rule myself before I believe it, because a missed remote_group_id resolution can make the AI declare traffic allowed when it is not. A handful of well-built audit prompts makes this repeatable.
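The same trace can be done deterministically as a cross-check on the model. The sketch below is an illustration under assumptions: rules are dicts using Neutron API field names (direction, ethertype, protocol, port_range_min, port_range_max, remote_ip_prefix, remote_group_id), which differ from the column headers that rule list prints, so some mapping is needed first. It only handles TCP/UDP-style port ranges and assumes protocols are stored as names rather than numbers. The function name evaluate_ingress and its three verdict strings are hypothetical. Rules that reference a remote group are reported as "needs-review" instead of being resolved.

```python
import ipaddress


def evaluate_ingress(rules, protocol, port, source_ip):
    """Return (verdict, rule) where verdict is "allowed", "needs-review", or "denied"."""
    src = ipaddress.ip_address(source_ip)
    ethertype = "IPv6" if src.version == 6 else "IPv4"
    review = None

    for rule in rules:
        if rule.get("direction") != "ingress":
            continue
        if rule.get("ethertype", "IPv4") != ethertype:
            continue
        # A null protocol matches any protocol.
        if rule.get("protocol") not in (None, protocol):
            continue
        low = rule.get("port_range_min")
        high = rule.get("port_range_max")
        if low is not None:
            if not low <= port <= (high if high is not None else low):
                continue
        if rule.get("remote_group_id"):
            # Group membership cannot be decided from the rule alone.
            review = review or rule
            continue
        prefix = rule.get("remote_ip_prefix")
        if prefix is None or src in ipaddress.ip_network(prefix, strict=False):
            return "allowed", rule

    if review is not None:
        return "needs-review", review
    return "denied", None


# Hypothetical rule set: one CIDR rule and one remote-group rule.
rules = [
    {"direction": "ingress", "ethertype": "IPv4", "protocol": "tcp",
     "port_range_min": 443, "port_range_max": 443,
     "remote_ip_prefix": "10.0.0.0/24", "remote_group_id": None},
    {"direction": "ingress", "ethertype": "IPv4", "protocol": "tcp",
     "port_range_min": 22, "port_range_max": 22,
     "remote_ip_prefix": None, "remote_group_id": "sg-bastion"},
]
print(evaluate_ingress(rules, "tcp", 443, "10.0.0.17")[0])
```

The "needs-review" verdict marks exactly the spot where a model can go wrong: whether the source port belongs to the referenced group has to be checked separately.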

Allowed-address-pairs: the silent VIP killer

One subclass deserves its own callout because it burns people running keepalived, VRRP, or any floating VIP between instances. Anti-spoofing drops traffic whose source IP/MAC is not the port’s own — unless you have declared the extra addresses.

openstack port show <port-id> -c allowed_address_pairs
openstack port set --allowed-address \
  ip-address=10.0.0.50,mac-address=fa:16:3e:aa:bb:cc <port-id>

If a VIP fails over and the new active node cannot send traffic from the shared address, missing allowed_address_pairs is almost always why. The AI will surface this if you describe “VIP works on node A but not after failover to node B” — that pattern-match is exactly the fast-junior-engineer help that earns its keep. But it does not know your VIP address; you provide it, you set it, you verify it. For ongoing visibility into when these failovers actually break, I wire alerts through the monitoring workflow rather than waiting for a ticket.
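A quick coverage check across the failover peers catches the missing entry before the VIP moves. This is a minimal sketch, assuming each port was fetched with openstack port show <port-id> -f json and that fixed_ips and allowed_address_pairs are lists of dicts with an ip_address key (pairs may hold a CIDR). The function name vip_covered is hypothetical, and the sketch checks only the IP. It ignores the MAC side of each pair, which matters if your VRRP setup uses a virtual MAC.

```python
import ipaddress


def vip_covered(port, vip):
    """True if `vip` is one of the port's fixed IPs or inside an allowed address pair.

    Only the IP is checked; mac_address in the pair is ignored in this sketch.
    """
    addr = ipaddress.ip_address(vip)

    for fixed in port.get("fixed_ips") or []:
        if fixed.get("ip_address") == vip:
            return True

    for pair in port.get("allowed_address_pairs") or []:
        # ip_address may be a single address or a CIDR.
        network = ipaddress.ip_network(pair["ip_address"], strict=False)
        if addr in network:
            return True
    return False


# Hypothetical ports for two VRRP peers; node B is missing the pair.
node_a = {
    "fixed_ips": [{"ip_address": "10.0.0.11"}],
    "allowed_address_pairs": [
        {"ip_address": "10.0.0.50", "mac_address": "fa:16:3e:aa:bb:cc"}
    ],
}
node_b = {"fixed_ips": [{"ip_address": "10.0.0.12"}], "allowed_address_pairs": []}

for name, port in (("node A", node_a), ("node B", node_b)):
    print(name, vip_covered(port, "10.0.0.50"))
```

Run it against every port that can hold the VIP, not just the currently active one; the port that fails is usually the one nobody tested.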

The boundary I keep

Tools like GitHub Copilot in my editor and a chat model in another window genuinely accelerate this work: drafting commands, reading dense rule tables, recalling backend specifics. I would not give them up. But the assistant operates strictly on text I hand it. It never gets clouds.yaml, never gets an admin token, and never runs a command against my cloud. Every openstack invocation passes through me first. AI is the fast junior who suggests; the senior who is accountable still presses Enter.

Conclusion

Neutron failures split cleanly once you know where to look: binding_failed is an agent problem, silent drops are a security group problem, and failover weirdness is usually allowed-address-pairs. Walk port show, then network agent list, then security group rule list, in that order, and the cause surfaces. Let AI compress the reading and recall; keep the credentials and the final judgment human. The rest of this series lives in the OpenStack category if you want the full set.
