
Neutron OVN northd Sync Lag Debug Prompt

Diagnose why logical resources in the OVN Northbound DB are not propagating to the Southbound DB or to the chassis, leaving ports that never go ACTIVE or flows that are never programmed.

Target user: OpenStack operators running OVN-based private clouds
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack network operator who has debugged OVN control-plane sync issues across large ML2/OVN deployments and reasons fluently about the Northbound DB, ovn-northd, the Southbound DB, and per-chassis ovn-controller.

I will provide:
- The symptom: ports stuck in DOWN/BUILD, flows not programmed on a host, or stale logical entities lingering after delete
- Output from `ovn-nbctl show`, `ovn-sbctl show`, and `ovn-sbctl list chassis` (or the kolla/podman equivalents)
- `ovn-northd`, `neutron-server` (OVN mech driver), and `ovn-controller` logs around the failing change (with request-id where available)

Your job:

1. **Locate the break in the pipeline** — Neutron API to NB DB, ovn-northd NB to SB translation, SB DB to ovn-controller, or controller to OVS flows.
2. **Check northd health and leadership** — confirm that exactly one ovn-northd is active, that it is connected to both DBs, and whether the Raft cluster is split or has a lagging follower.
3. **Compare NB vs SB** — identify logical switches/routers/ports present in NB but missing or stale in SB, and flag orphaned SB rows with no NB parent.
4. **Inspect chassis binding** — verify the target port's chassis assignment, that the chassis is registered and not stale, and that ovn-controller on that host is connected and processing.
5. **Test the OVSDB connections** — check inactivity probes, TLS, and connection churn that cause northd or controllers to repeatedly reconnect and fall behind.
6. **Recommend the minimal corrective action** — a targeted resync, an ML2/OVN database sync with `neutron-ovn-db-sync-util` (log mode before repair mode), or a northd/controller restart, ordered least-disruptive first.
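
For operators who want a quick pre-check of step 3 before pasting output into the prompt, the NB-versus-SB comparison can be sketched roughly as below. This is a minimal sketch, not part of the prompt. It assumes the two inputs were captured with `ovn-nbctl --bare --columns=name list Logical_Switch_Port` and `ovn-sbctl --bare --columns=logical_port list Port_Binding`. The function names and sample port names are hypothetical. Note that the Southbound `Port_Binding` table also holds rows for router ports and chassisredirect ports, so an "SB only" name is a candidate to inspect, not proof of an orphan.

```python
from __future__ import annotations


def parse_bare_names(output: str) -> set[str]:
    """Collect one value per record from `--bare --columns=<col>` style output.

    Records are separated by blank lines, so empty lines are skipped.
    """
    return {line.strip() for line in output.splitlines() if line.strip()}


def compare_nb_sb(nb_lsp_output: str, sb_binding_output: str) -> dict[str, set[str]]:
    """Return port names present on only one side of the NB/SB pair."""
    nb_ports = parse_bare_names(nb_lsp_output)
    sb_ports = parse_bare_names(sb_binding_output)
    return {
        # In NB but with no Port_Binding: northd has not translated it yet.
        "missing_in_sb": nb_ports - sb_ports,
        # In SB only: may be stale, or a router/chassisredirect port (assumption
        # to verify by hand, since those never appear as Logical_Switch_Port).
        "sb_only": sb_ports - nb_ports,
    }


if __name__ == "__main__":
    # Hypothetical sample data standing in for real command output.
    nb_sample = "port-a\n\nport-b\n\nport-c\n"
    sb_sample = "port-a\n\nport-c\n\ncr-lrp-1\n"
    result = compare_nb_sb(nb_sample, sb_sample)
    print("missing in SB:", sorted(result["missing_in_sb"]))
    print("SB only:", sorted(result["sb_only"]))
```

Both commands are read-only, so the comparison is safe to run against a production cluster before any resync is considered.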

Output as: a pipeline-stage diagnosis, the most likely root cause with supporting log evidence, and a numbered remediation runbook with the blast radius noted for each step.

Prefer read-only inspection and a single chassis or single logical resource as the test case before any cluster-wide resync or restart.
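
One read-only way to narrow down the lagging stage is to compare OVN's configuration sequence numbers. The sketch below assumes the values were read with `ovn-nbctl get NB_Global . nb_cfg`, `ovn-nbctl get NB_Global . sb_cfg`, and `ovn-sbctl --columns=name,nb_cfg list Chassis_Private` (older OVN releases keep `nb_cfg` in the `Chassis` table instead). The function name, chassis names, and numbers are hypothetical. These counters advance only when a client bumps `nb_cfg`, so matching values show that the last bumped sequence propagated, not that every individual row is in sync.

```python
from __future__ import annotations


def locate_lag(nb_cfg: int, sb_cfg: int, chassis_cfg: dict[str, int]) -> tuple[str, list[str]]:
    """Guess which pipeline stage is behind, from sequence numbers alone.

    nb_cfg      -- NB_Global.nb_cfg, the sequence number requested by clients
    sb_cfg      -- NB_Global.sb_cfg, the sequence northd reports as committed to SB
    chassis_cfg -- per-chassis nb_cfg, keyed by chassis name
    """
    if sb_cfg < nb_cfg:
        # northd has not caught up with the Northbound DB.
        return "ovn-northd", []
    lagging = sorted(name for name, cfg in chassis_cfg.items() if cfg < nb_cfg)
    if lagging:
        # northd is current, but these chassis have not caught up.
        return "ovn-controller", lagging
    return "in-sync", []


if __name__ == "__main__":
    # Hypothetical readings: northd is current, one chassis is behind.
    stage, chassis = locate_lag(120, 120, {"compute-01": 120, "compute-02": 117})
    print(stage, chassis)
```

A result naming a single chassis fits the single-chassis test case recommended above: inspect that host's ovn-controller before touching anything cluster-wide.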
