Neutron Metadata & Config-Drive Debug Prompt
Diagnose why instances fail to fetch metadata (no SSH key, cloud-init hangs at 169.254.169.254) across isolated networks, DVR, and config-drive fallback.
- Target user: Cloud operators troubleshooting instance bootstrap and cloud-init failures
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior OpenStack networking engineer who has chased down hundreds of "instance has no SSH key / cloud-init timed out" metadata failures.

I will provide:
- Topology: legacy L3 agent vs DVR vs OVN, isolated (no-router) vs routed networks
- Symptom: cloud-init log showing 169.254.169.254 unreachable, or 404/500 from metadata
- Output of `ip netns`, the contents of the qdhcp/qrouter namespace, and whether a metadata proxy process is running in it
- Neutron and Nova metadata config (`enable_isolated_metadata`, `force_metadata`, `metadata_proxy_shared_secret`)
- Whether config-drive is enabled as a fallback

Your job:
1. **Map the request path**: trace 169.254.169.254 from the VM through the router or DHCP namespace, to the namespace metadata proxy, over the Unix socket to neutron-metadata-agent, and on to nova-api-metadata. Tell me which hop differs for L3-agent vs DVR vs OVN.
2. **Pinpoint the break**: give the exact commands to test each hop: `ip netns exec <ns> curl -v 169.254.169.254/...`, check the metadata proxy socket, and confirm the shared secret matches between Neutron and Nova.
3. **Isolated-network gotcha**: explain why instances on a network with no router need `enable_isolated_metadata = True` (the DHCP agent hosts the proxy in the qdhcp namespace and hands out a host route to 169.254.169.254), and how to confirm the host route was injected (`ip route` in the VM).
4. **DVR / OVN specifics**: for DVR, the metadata proxy lives in the qrouter namespace on the compute node; for OVN, ovn-metadata-agent serves metadata from a per-network `ovnmeta-<network-id>` namespace on the compute node. Show how to find and test the right namespace.
5. **Shared-secret & token failures**: decode `403`/`500` responses from nova-api-metadata caused by a mismatched `metadata_proxy_shared_secret` or a bad instance-ID signature.
6. **Config-drive fallback**: when to enable config-drive so bootstrap survives a metadata outage, and how cloud-init datasource ordering picks it up.
7. **Permanent fix + monitoring**: config changes, agent restart order, and a synthetic probe that boots a canary instance and asserts it got its metadata.

Output as: (a) a hop-by-hop diagnostic flow with copy-paste netns commands, (b) a root-cause table keyed by symptom, (c) the corrected config snippets, (d) a canary/healthcheck script, (e) a short note on config-drive as belt-and-suspenders. Tell me which checks are safe on a live instance and which require a fresh boot.
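
Background for item 5 (a reader note, not part of the prompt text): the shared-secret check is an HMAC comparison. As I understand the Neutron and Nova code, neutron-metadata-agent signs the instance ID with `metadata_proxy_shared_secret` using HMAC-SHA256 and sends the result in the `X-Instance-ID-Signature` header; nova-api-metadata recomputes the signature with its own copy of the secret and rejects the request with 403 if they differ. The sketch below reproduces that comparison offline so you can check whether two secrets would agree. The function names, the sample instance ID, and the sample secrets are hypothetical; verify the exact signing behaviour against your release before relying on it.

```python
import hashlib
import hmac


def sign_instance_id(shared_secret: str, instance_id: str) -> str:
    """Return the hex HMAC-SHA256 of instance_id keyed by shared_secret.

    Assumption: this mirrors how the metadata agent builds the
    X-Instance-ID-Signature header value.
    """
    return hmac.new(
        shared_secret.encode("utf-8"),
        instance_id.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


def signatures_match(neutron_secret: str, nova_secret: str, instance_id: str) -> bool:
    """Compare the signature Neutron would send with the one Nova would expect."""
    sent = sign_instance_id(neutron_secret, instance_id)
    expected = sign_instance_id(nova_secret, instance_id)
    # Constant-time comparison, the usual practice for signature checks.
    return hmac.compare_digest(sent, expected)


if __name__ == "__main__":
    # Hypothetical instance UUID and secrets, for illustration only.
    instance_id = "11111111-2222-3333-4444-555555555555"
    print(signatures_match("s3cret", "s3cret", instance_id))  # True
    print(signatures_match("s3cret", "S3cret", instance_id))  # False
```

To use it, paste the value of `metadata_proxy_shared_secret` from the metadata agent's config and from the `[neutron]` section of `nova.conf` on the host running nova-api-metadata. A `False` result is consistent with the 403 described in item 5; a `True` result suggests looking elsewhere, for example at whether `service_metadata_proxy` is enabled in Nova.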