The neutron-l3-agent is the process that turns Neutron's logical routers into real forwarding
on a network or compute node. For each virtual router it creates a Linux network namespace
(qrouter-<id>), wires the internal qr- and external qg- interfaces,
and programs iptables for SNAT and the DNAT rules behind every floating IP. It also periodically
reports its liveness to neutron-server with an oslo.messaging report_state RPC call over
RabbitMQ.
When openstack network agent list shows the agent as XXX (or alive=False),
it means neutron-server has stopped receiving those heartbeat reports, not that the agent binary is
necessarily gone. Because the reporting path runs over RabbitMQ, a dead L3 agent
verdict is far more often a messaging or RPC problem than a crashed process. This guide helps you tell a stale
heartbeat apart from a genuine router-down event, and gives you copy-paste commands to
localize and fix each one.
Symptoms
You are probably here because you are seeing one or more of these:
- openstack network agent list shows the L3 agent with Alive = XXX and often State = UP: the classic neutron l3 agent XXX state.
- Instances can reach each other and their gateway internally, but cannot reach the outside world.
- Existing floating IPs stop responding to ping/SSH from outside — openstack floating ip not working.
- Recently-created floating IPs never work at all, even though older ones may still forward.
- Routers on that agent appear down from a tenant's perspective; north-south traffic drops while east-west is fine.
- neutron-server logs show the agent's report_state going stale, or AMQP/MessagingTimeout errors around the same time.
Likely causes
In production Neutron, a dead L3 agent almost always traces back to one of these, roughly in order of frequency:
- Missed RabbitMQ heartbeats / lost AMQP connection. The agent's report_state never reaches neutron-server because its broker connection dropped or is flapping: a neutron-l3-agent heartbeat timeout in all but name.
- neutron-server RPC overloaded or down. The server side can't keep up with (or isn't consuming) state reports, so every agent it serves drifts to XXX together.
- Agent process crashed or was OOM-killed. The container is restarting, or the Python process is wedged. This is the one case where the binary really is the problem.
- Clock skew. If the agent node's clock is ahead of or behind neutron-server, a fresh report_state can look older than the agent_down_time window and be judged dead.
- HA (VRRP/keepalived) split. For L3 HA routers, a keepalived split-brain or lost VRRP peering can leave two masters or none, which reads as broken routing even when the agent reports fine.
- Missing namespace. The qrouter-<id> namespace was never created or got torn down, so the router genuinely has no data plane on that node.
Immediate checks
Ninety seconds of triage tells you whether this is a stale heartbeat or a real outage. Start with the agent list and the container:
# Which L3 agents are alive, and how long since each last reported?
openstack network agent list --agent-type l3 \
-c ID -c Host -c Alive -c State -c "Binary"
# Detail for the specific agent (shows heartbeat_timestamp / configurations)
openstack network agent show $agent_id
# Is the container actually up, or crash-looping? (Kolla-Ansible)
# -a includes stopped containers, so a recently exited agent is still listed
docker ps -a --filter name=neutron_l3_agent --format '{{.Names}} {{.Status}}'
Alive=XXX with the container Up and no restarts points at the heartbeat/RPC path, not the binary. A container that is Restarting or recently exited points at a crash/OOM; jump to the neutron-server and log checks.
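To make the Up / Restarting / exited distinction above scriptable, here is a minimal sketch. The function name classify_container_status and its output labels are hypothetical; it only pattern-matches the Status text that docker ps prints and does not talk to Docker itself.

```shell
# Hypothetical helper: map a `docker ps` Status string to a triage hint.
# Argument: the Status text, e.g. "Up 3 days" or "Exited (137) 2 minutes ago".
classify_container_status() {
    case "$1" in
        Restarting*) echo "crash-loop" ;;  # container keeps dying: look at crash/OOM
        Exited*)     echo "exited" ;;      # container stopped: look at crash/OOM
        Up*)         echo "running" ;;     # binary is up: suspect heartbeat/RPC path
        "")          echo "missing" ;;     # no such container on this node
        *)           echo "unknown" ;;
    esac
}
```

One way to feed it: classify_container_status "$(docker ps -a --filter name=neutron_l3_agent --format '{{.Status}}')". Note that a container showing Up for only a few seconds may still have just restarted, so read the uptime as well as the label.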
If the container is up, immediately check whether the data plane is actually broken before you touch anything — a stale heartbeat with working traffic is not an emergency:
# List the router namespaces this node hosts — presence means a data plane exists
docker exec neutron_l3_agent ip netns list | grep qrouter
# When did the agent last report in? (compare against agent_down_time, default 75s)
docker exec neutron_l3_agent grep -Ei "report_state|heartbeat|AMQP" \
/var/log/kolla/neutron/neutron-l3-agent.log | tail -n 20
If the qrouter namespaces are present and instances still reach the internet, you likely have a heartbeat problem, not an outage. Diagnose calmly rather than restarting into a real interruption.
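The comparison against agent_down_time can be sketched as a small function. The name heartbeat_verdict and its labels are hypothetical, and the strictly-greater-than comparison is an assumption about how the window is applied; the 75-second default is the one quoted above.

```shell
# Hypothetical helper: classify a heartbeat by its age in seconds.
# Arguments: <age_seconds> [agent_down_time, default 75]
heartbeat_verdict() {
    age=$1
    down_time=${2:-75}
    if [ "$age" -lt 0 ]; then
        # A heartbeat "from the future" suggests clock skew between nodes
        echo "skew"
    elif [ "$age" -gt "$down_time" ]; then
        # Older than the window: neutron-server would treat the agent as dead
        echo "stale"
    else
        echo "alive"
    fi
}
```

Assuming GNU date and a UTC timestamp in $ts taken from openstack network agent show, the age could be computed as $(( $(date -u +%s) - $(date -u -d "$ts" +%s) )) and passed in as the first argument. Run it from the neutron-server host so that the clock doing the comparison is the one that matters.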
Diagnostic commands
RabbitMQ heartbeat checks
docker exec rabbitmq rabbitmqctl cluster_status
docker exec rabbitmq rabbitmqctl list_connections name state timeout \
| grep -Ei "neutron|blocked" | head -n 20
docker logs --tail=200 rabbitmq 2>&1 | grep -Ei "missed heartbeats|partition|closing"
Missed heartbeats from a neutron client, blocked connections, or a cluster partition all break report_state delivery: the agent is fine but its reports never arrive. This is the number-one cause of a dead L3 agent.
If you find missed heartbeats or a partition here, the root cause is the broker, not Neutron — work the RabbitMQ missed heartbeats and RabbitMQ RPC timeout playbooks, and the agents recover on their own once messaging is healthy.
neutron-server RPC checks
# This call exercises the neutron-server REST API and its database, not agent RPC;
# a hang here implicates neutron-server itself rather than one agent
openstack network agent list >/dev/null && echo "neutron-server API OK"
docker logs --tail=200 neutron_server 2>&1 \
| grep -Ei "AMQP|MessagingTimeout|report_state|Timed out waiting|Agent.*dead"
MessagingTimeout or AMQP errors in neutron_server mean the server can't talk to RabbitMQ, so every agent it tracks goes XXX together. If only one agent is dead, the fault is that node's connection, not the server.
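The "every agent versus only one" distinction can be checked mechanically. The sketch below assumes input lines of the form "host alive"; the function name dead_agent_scope and its labels are hypothetical. It accepts either XXX or False as dead, since the table view prints XXX and machine-readable formats may print a boolean instead (an assumption worth verifying on your client version).

```shell
# Hypothetical helper: read "host alive" pairs on stdin and report how
# widespread the problem is. XXX or False (any case) counts as dead.
dead_agent_scope() {
    awk '
        NF >= 2 { total++ }
        NF >= 2 && ($2 == "XXX" || tolower($2) == "false") { dead++ }
        END {
            if (total == 0)          print "no-agents"
            else if (dead == 0)      print "none-dead"
            else if (dead == total)  print "all-dead"    # suspect server or broker
            else                     print "some-dead"   # suspect those nodes
        }'
}
```

A possible invocation is openstack network agent list --agent-type l3 -f value -c Host -c Alive | dead_agent_scope. An all-dead result points toward neutron-server or RabbitMQ; some-dead points toward the affected nodes' own connections.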
L3 namespace & router checks
# All router namespaces on this node
docker exec neutron_l3_agent ip netns list
# Interfaces inside a specific router ns — expect qr-* (internal) and qg-* (external/SNAT)
docker exec neutron_l3_agent ip netns exec qrouter-$router_id ip -brief a
# Default route inside the namespace should point out the qg- interface
docker exec neutron_l3_agent ip netns exec qrouter-$router_id ip route
A missing qrouter-<id> namespace, or a namespace with no qg- interface, is a real data-plane fault (see the router namespace missing guide). If qr-/qg- are present with correct addresses, routing is intact and this is a reporting issue.
A missing namespace is its own failure mode — our router namespace missing walkthrough covers forcing the agent to recreate it, and the broader debugging Neutron networking guide maps the full qr-/qg-/namespace model.
HA router troubleshooting
# For L3 HA routers, each agent runs keepalived inside the qrouter namespace
docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
ip -brief a | grep -i ha-
# Exactly one agent hosting this router should own the VIP (be MASTER)
docker exec neutron_l3_agent grep -Ei "entering (master|backup) state|transition" \
/var/log/kolla/neutron/neutron-keepalived-state-change.log | tail -n 20
Two MASTERs (split-brain) or zero MASTERs means VRRP peering over the HA network is broken, so check that the ha- interfaces can reach each other. This causes a neutron router down symptom even when the agent reports alive.
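The "exactly one MASTER" rule can be expressed as a tiny check. This is a sketch under assumptions: the function name ha_election_verdict is hypothetical, it expects one state word per line (one line per agent hosting the router), and it treats both master and active as the elected state because different tools label it differently.

```shell
# Hypothetical helper: read one HA state per line on stdin (one line per
# hosting agent) and judge the election result for a single router.
ha_election_verdict() {
    awk '
        tolower($1) == "master" || tolower($1) == "active" { masters++ }
        END {
            if (masters == 1)      print "ok"
            else if (masters > 1)  print "split-brain"
            else                   print "no-master"
        }'
}
```

How you gather the per-agent states is up to you, for example the most recent transition in each node's state-change log. For instance, printf 'master\nbackup\n' | ha_election_verdict reports ok, while two master lines report split-brain.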
Floating IP validation
# The DNAT/SNAT rules that make floating IPs work live in the router namespace
docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
iptables -t nat -S | grep -Ei "float|DNAT|SNAT"
# The agent must answer ARP for the floating IP on the external qg- interface
docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
arping -c 3 -I qg-$qg_suffix $floating_ip
# From inside the ns, can the router itself reach the external gateway?
docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
ping -c 3 $external_gateway_ip
A floating IP with no matching DNAT rule was never converged by the agent, which is typical when it was created after the agent lost RPC. Missing ARP replies on qg- mean the outside world can't find the floating address at all.
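To check a specific floating IP for a matching DNAT rule without eyeballing the whole rule dump, a minimal sketch follows. The function name has_dnat_for_fip is hypothetical, and it assumes the rule is written with the floating IP as a /32 destination (-d <ip>/32 ... -j DNAT), which is how iptables -S normally prints a single-host match.

```shell
# Hypothetical helper: succeed if stdin (output of `iptables -t nat -S`)
# contains a DNAT rule whose destination is exactly the given floating IP.
# Argument: <floating_ip>
has_dnat_for_fip() {
    fip=$1
    awk -v fip="$fip" '
        # Match the full "-d <ip>/32 " token so 203.0.113.1 does not
        # accidentally match a rule for 203.0.113.10
        index($0, " -d " fip "/32 ") && index($0, "-j DNAT") { found = 1 }
        END { exit(found ? 0 : 1) }'
}
```

It can be chained onto the rule listing above by piping iptables -t nat -S into has_dnat_for_fip "$floating_ip" and testing the exit status. A non-zero status for a floating IP that Neutron shows as associated suggests the agent never programmed it.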
Fix & remediation steps
Map the cause you found to the smallest safe remediation:
- RabbitMQ missed heartbeats / partition → fix the broker first; restart the consuming agent so it re-subscribes only after messaging is healthy. The agent flips back to :-) on its own once reports land.
- neutron-server RPC stuck → restart neutron_server so it re-establishes its AMQP consumers; every agent it tracks should recover together.
- Agent crashed / OOM-killed → restart the container and check memory limits; this is the one case where restarting the agent is the fix.
- Clock skew → resync NTP on the agent node; no Neutron restart needed once the clock is correct.
- Missing namespace → restarting the agent forces it to recreate the router namespaces on startup.
# Agent-side fix: re-subscribe this node's L3 agent (interrupts non-HA routers)
docker restart neutron_l3_agent
# Server-side fix: only if neutron_server RPC is the problem
docker restart neutron_server
# If the root cause was RabbitMQ, fix the broker FIRST, then restart the
# consuming agent so it re-establishes its AMQP connection:
# 1) resolve RabbitMQ (see the RabbitMQ RPC timeout guide)
# 2) docker restart neutron_l3_agent
Restart the narrowest component that explains the symptom, then immediately re-run openstack network agent list --agent-type l3 to confirm the agent is Alive=:-) again.
Validation steps
Don't declare victory on the agent list alone — confirm the data plane really works:
- Re-run openstack network agent list --agent-type l3: the agent shows Alive = :-) and a fresh heartbeat timestamp.
- Confirm the qrouter-<id> namespace is present with its qr- and qg- interfaces on the hosting node.
- Ping and SSH a floating IP from outside the cloud to confirm north-south traffic is restored.
- For HA routers, confirm exactly one agent is elected MASTER for the router and owns the VIP.
- Watch the agent for 5–15 minutes. A heartbeat that goes stale again means the underlying RabbitMQ/RPC issue isn't fully resolved.
openstack network agent list --agent-type l3 -c Host -c Alive -c State
docker exec neutron_l3_agent ip netns list | grep qrouter-$router_id
docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
ping -c 3 $external_gateway_ip && echo "north-south OK"
Prevention
- Alert on agent heartbeat age, not just up/down. Track each L3 agent's report_state/heartbeat_timestamp against agent_down_time so a stale heartbeat pages you before a floating IP does.
- Monitor RabbitMQ as a leading indicator: connection churn, blocked connections, and missed heartbeats predict L3 agents flipping to XXX. Our messaging timeout and RabbitMQ queue buildup notes cover the signals.
- Keep NTP/clock sync tight across control and network nodes so a fresh report_state is never misread as stale.
- Understand floating IP and NAT flow so you can tell an agent problem from a data-plane one (see debugging Neutron floating IPs & NAT).
- Consider migrating to OVN. OVN eliminates the per-node L3 agents and the heartbeat model entirely; our Neutron-to-OVN migration guide covers whether it's worth it for your cloud.
- Turn recurring incidents into repeatable playbooks, and use the free AI Incident Response assistant to draft triage steps fast.