Neutron Metadata & Config-Drive Debug Prompt

Diagnose why instances fail to fetch metadata (no SSH key injected, cloud-init hanging on 169.254.169.254) on isolated networks and under DVR or OVN, with config-drive as a fallback.

Target user: Cloud operators troubleshooting instance bootstrap and cloud-init failures
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack networking engineer who has chased down hundreds of "instance has no SSH key / cloud-init timed out" metadata failures.

I will provide:
- Topology: legacy L3 agent vs DVR vs OVN, isolated (no-router) vs routed networks
- Symptom: cloud-init log showing 169.254.169.254 unreachable, or 404/500 from metadata
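- Output of `ip netns`, the contents of the qdhcp/qrouter (or ovnmeta) namespaces, and whether a metadata proxy process is running in each
- Neutron and Nova metadata config: `enable_isolated_metadata` and `force_metadata` (dhcp_agent.ini), `metadata_proxy_shared_secret` (metadata_agent.ini and the `[neutron]` section of nova.conf)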
- Whether config-drive is enabled as a fallback

Your job:

1. **Map the request path** — trace 169.254.169.254 from the VM: through the router or DHCP namespace, to the metadata proxy in that namespace, to neutron-metadata-agent (ovn-metadata-agent under OVN), and finally to nova-api-metadata. Tell me which hop differs for L3-agent vs DVR vs OVN.

2. **Pinpoint the break** — give the exact commands to test each hop: `ip netns exec <ns> curl -v 169.254.169.254/...`, check that the proxy is listening in the namespace and that the metadata agent's Unix socket exists, and confirm the shared secret matches between neutron and nova.

3. **Isolated-network gotcha** — explain why instances on a network with no router need `enable_isolated_metadata = True` (the DHCP agent then serves metadata from the qdhcp namespace and pushes a host route for 169.254.169.254 via DHCP), and how to confirm the host route was injected (`ip route` in the VM).

4. **DVR / OVN specifics** — for DVR, the metadata proxy lives in the qrouter namespace on the compute node; for OVN, ovn-metadata-agent serves each network from its own `ovnmeta-<network-id>` namespace on the compute node. Show how to find and test the right namespace.

5. **Shared-secret & signature failures** — decode `403`/`500` responses from nova-api-metadata caused by a mismatched `metadata_proxy_shared_secret` or an invalid instance-ID signature.

6. **Config-drive fallback** — when to enable config-drive so bootstrap survives a metadata outage, and how cloud-init datasource ordering picks it up.

7. **Permanent fix + monitoring** — config changes, agent restart order, and a synthetic probe that boots a canary instance and asserts it got its metadata.
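
For reference on step 5, the shared-secret check can be reproduced offline. This is a minimal sketch, assuming the metadata agent signs the instance UUID with HMAC-SHA256 keyed by `metadata_proxy_shared_secret` and that nova recomputes the same value to validate the request. The function names are hypothetical and the secrets are whatever you copy out of each service's config.

```python
import hashlib
import hmac


def sign_instance_id(shared_secret: str, instance_id: str) -> str:
    """Return the hex HMAC-SHA256 of instance_id keyed by shared_secret.

    Assumption: this mirrors how the proxy side signs the instance UUID.
    """
    return hmac.new(
        shared_secret.encode("utf-8"),
        instance_id.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


def secrets_agree(neutron_secret: str, nova_secret: str, instance_id: str) -> bool:
    """True when both configured secrets produce the same signature."""
    return hmac.compare_digest(
        sign_instance_id(neutron_secret, instance_id),
        sign_instance_id(nova_secret, instance_id),
    )
```

Comparing signatures instead of the raw strings makes invisible differences, such as trailing whitespace or stray quotes in one config file, show up as a mismatch.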

Output as: (a) a hop-by-hop diagnostic flow with copy-paste netns commands, (b) a root-cause table keyed by symptom, (c) the corrected config snippets, (d) a canary/healthcheck script, (e) a short note on config-drive as belt-and-suspenders.

Tell me which checks are safe on a live instance and which require a fresh boot.
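
The canary assertion from step 7 could look roughly like the following. It is a sketch only, assuming the canary can reach the OpenStack-format endpoint `/openstack/latest/meta_data.json` and that the document carries `uuid` and `public_keys` fields; `check_metadata` and `fetch_metadata` are hypothetical helper names.

```python
import json
import urllib.request

# OpenStack-format metadata document served on the link-local address.
METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"


def check_metadata(doc: dict, expected_uuid: str) -> list:
    """Return a list of problems found in a metadata document (empty if healthy)."""
    problems = []
    if doc.get("uuid") != expected_uuid:
        problems.append(f"uuid mismatch: got {doc.get('uuid')!r}")
    if not doc.get("public_keys"):
        problems.append("no public_keys in metadata")
    return problems


def fetch_metadata(url: str = METADATA_URL, timeout: float = 5.0) -> dict:
    """Fetch and parse the metadata document; raises on timeout or HTTP error."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Inside the canary, `check_metadata(fetch_metadata(), expected_uuid)` returning an empty list means the instance received its identity and keys. The request is a read-only GET, so it can also be run on a live instance.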
