neutron-l3-agent Dead (XXX State): Fix Guide

A neutron-l3-agent showing XXX or dead in the agent list almost never means the agent binary crashed — it means neutron-server stopped receiving the agent's report_state heartbeat, usually because RabbitMQ or RPC is unhealthy. This runbook separates a stale heartbeat from a real data-plane outage, then walks the RabbitMQ, L3 namespace, HA router, and floating IP checks to fix it fast.

Updated July 3, 2026 · 11 min read · Runbook-style guide · copy/paste commands

Free runbook · PDF

Download the free RabbitMQ RPC Timeout Runbook Pack

A copy/paste runbook for oslo.messaging timeouts and missed heartbeats — cluster health, queue depth, and a service restart decision tree.

  • OpenStack RPC timeout checklist
  • RabbitMQ cluster + queue depth commands
  • Consumer / publisher + heartbeat checks
  • oslo.messaging config review
  • Nova/Cinder/Neutron/Heat symptom matrix
  • Service restart decision tree + notes template

No account needed · single opt-in · we never share your email.

The neutron-l3-agent is the process that turns Neutron's logical routers into real forwarding on a network or compute node. For each virtual router it creates a Linux network namespace (qrouter-<id>), wires the internal qr- and external qg- interfaces, and programs iptables for SNAT and the DNAT rules behind every floating IP. It also periodically reports its liveness to neutron-server with an oslo.messaging report_state RPC call over RabbitMQ.

When openstack network agent list shows that agent as XXX (or alive=False), neutron-server has stopped receiving those heartbeat reports; the agent binary is not necessarily gone. Because the reporting path runs over RabbitMQ, a dead L3 agent verdict is far more often a messaging or RPC problem than a crashed process. This guide helps you tell a stale heartbeat apart from a genuine router outage, and gives you copy-paste commands to localize and fix each one.

Symptoms

You are probably here because you are seeing one or more of these:

  • openstack network agent list shows the L3 agent with Alive = XXX, often while State is still UP.
  • Instances can reach each other and their gateway internally, but cannot reach the outside world.
  • Existing floating IPs stop responding to ping/SSH from outside.
  • Recently-created floating IPs never work at all, even though older ones may still forward.
  • Routers on that agent appear down from a tenant's perspective; north-south traffic drops while east-west is fine.
  • neutron-server logs show the agent's report_state going stale, or AMQP / MessagingTimeout errors around the same time.

Likely causes

In production Neutron, a dead L3 agent almost always traces back to one of these, roughly in order of frequency:

  • Missed RabbitMQ heartbeats / lost AMQP connection. The agent's report_state never reaches neutron-server because its broker connection dropped or is flapping — a neutron-l3-agent heartbeat timeout in all but name.
  • neutron-server RPC overloaded or down. The server side can't keep up with (or isn't consuming) state reports, so every agent it serves drifts to XXX together.
  • Agent process crashed or was OOM-killed. The container is restarting, or the Python process is wedged — the one case where the binary really is the problem.
  • Clock skew. If the agent node's clock is ahead/behind neutron-server, a fresh report_state can look older than the agent_down_time window and be judged dead.
  • HA (VRRP/keepalived) split. For L3 HA routers, a keepalived split-brain or lost VRRP peering can leave two masters or none, which reads as broken routing even when the agent reports fine.
  • Missing namespace. The qrouter-<id> namespace was never created or got torn down, so the router genuinely has no data plane on that node.

Immediate checks

Ninety seconds of triage tells you whether this is a stale heartbeat or a real outage. Start with the agent list and the container:

Confirm the agent state and container health
# Which L3 agents does neutron-server currently consider alive?
openstack network agent list --agent-type l3 \
  -c ID -c Host -c Alive -c State -c Binary

# Detail for one agent (last heartbeat time and the configuration it reported)
openstack network agent show "$agent_id"

# Is the container actually up, or crash-looping? (Kolla-Ansible)
docker ps --filter name=neutron_l3_agent --format '{{.Names}}  {{.Status}}'

Alive=XXX with the container Up and no restarts points at the heartbeat/RPC path, not the binary. A container that is Restarting or recently exited points at a crash/OOM — jump to the neutron-server + logs checks.

If the container is up, immediately check whether the data plane is actually broken before you touch anything — a stale heartbeat with working traffic is not an emergency:

Is routing actually down, or just the report?
# List the router namespaces this node hosts — presence means a data plane exists
docker exec neutron_l3_agent ip netns list | grep qrouter

# Has the agent logged report_state failures or AMQP errors?
# (neutron-server marks it dead once the last report is older than agent_down_time, default 75s)
docker exec neutron_l3_agent grep -Ei "report_state|reporting state|heartbeat|AMQP" \
  /var/log/kolla/neutron/neutron-l3-agent.log | tail -n 20

If the qrouter namespaces are present and instances still reach the internet, you likely have a heartbeat problem, not an outage — diagnose calmly rather than restarting into a real interruption.
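The comparison against agent_down_time can be made explicit. The following is a minimal sketch, assuming GNU date and a UTC timestamp copied by hand from the agent detail output. The helper names heartbeat_age and heartbeat_is_stale are illustrative and not part of any OpenStack tool.

```shell
# Sketch only: heartbeat_age and heartbeat_is_stale are hypothetical helpers.
# Requires GNU date (for -d). Timestamps are treated as UTC.

AGENT_DOWN_TIME=${AGENT_DOWN_TIME:-75}   # neutron-server default, in seconds

# heartbeat_age "<YYYY-MM-DD HH:MM:SS>" [now_epoch]
# Prints how many seconds ago the heartbeat was recorded.
heartbeat_age() {
  local last_epoch now_epoch
  last_epoch=$(date -u -d "$1" +%s) || return 2
  now_epoch=${2:-$(date -u +%s)}
  echo $(( now_epoch - last_epoch ))
}

# heartbeat_is_stale <age_seconds>
# Succeeds (exit 0) when the age exceeds agent_down_time.
heartbeat_is_stale() {
  [ "$1" -gt "$AGENT_DOWN_TIME" ]
}

# Example (timestamp pasted from the agent detail output):
#   age=$(heartbeat_age "2026-07-03 10:00:00")
#   heartbeat_is_stale "$age" && echo "stale by $(( age - AGENT_DOWN_TIME ))s"
```

The optional second argument exists so the calculation can be replayed against a fixed "now" when reviewing an incident timeline.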

Diagnostic commands

RabbitMQ heartbeat checks

Is the agent's AMQP connection healthy?
docker exec rabbitmq rabbitmqctl cluster_status
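# "name" is normally just the peer address pair; client_properties carries the
# connection_name that oslo.messaging sets, which is where "neutron" shows up
docker exec rabbitmq rabbitmqctl list_connections name state timeout client_properties \
  | grep -Ei "neutron|blocked" | head -n 20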
docker logs --tail=200 rabbitmq 2>&1 | grep -Ei "missed heartbeats|partition|closing"

missed heartbeats from a neutron client, blocked connections, or a cluster partition all break report_state delivery — the agent is fine but its reports never arrive. This is the number-one cause of a dead L3 agent.

If you find missed heartbeats or a partition here, the root cause is the broker, not Neutron — work the RabbitMQ missed heartbeats and RabbitMQ RPC timeout playbooks, and the agents recover on their own once messaging is healthy.
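To pick out only the connections that need attention, the list_connections output can be filtered by state. This is a rough sketch that assumes tab-separated rows of name, state, and timeout, as requested in the command above. It treats running and flow as healthy, which is an assumption you may want to tighten, and unhealthy_connections is a made-up helper name.

```shell
# Sketch only: unhealthy_connections is a hypothetical helper.
# Reads "name<TAB>state<TAB>timeout" rows on stdin and prints the name and
# state of every connection that is neither "running" nor "flow".
unhealthy_connections() {
  awk -F '\t' '
    NF < 2 || $2 == "state" { next }            # banner or column header
    $2 != "running" && $2 != "flow" { print $1 "\t" $2 }
  '
}

# Example:
#   docker exec rabbitmq rabbitmqctl list_connections name state timeout \
#     | unhealthy_connections
```

An empty result while agents are still XXX suggests looking at the broker log for missed heartbeats rather than at connection state.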

neutron-server RPC checks

Is the server side receiving and processing reports?
# This is a REST call answered from the database: it proves the API is up,
# not that RPC works. The log grep below is the real RPC signal.
openstack network agent list >/dev/null && echo "neutron-server API OK"

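# Kolla services normally log to files, so if docker logs is quiet, grep
# /var/log/kolla/neutron/neutron-server.log instead
docker logs --tail=200 neutron_server 2>&1 \
  | grep -Ei "AMQP|MessagingTimeout|report_state|Timed out waiting|Agent.*dead"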

MessagingTimeout or AMQP errors in neutron_server mean the server can't talk to RabbitMQ, so every agent it tracks goes XXX together. If only one agent is dead, the fault is that node's connection, not the server.
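The "one agent or all of them" distinction is easy to automate. The sketch below assumes rows of host followed by an alive marker, and accepts either the table markers (:-) and XXX) or True/False, since the exact rendering depends on the output format you request. classify_scope is a hypothetical name.

```shell
# Sketch only: classify_scope is a hypothetical helper.
# Reads "<host> <alive>" rows on stdin, where <alive> is ":-)" / "XXX"
# or "True" / "False", and prints how widespread the problem is.
classify_scope() {
  awk '
    NF >= 2 { total++ }
    NF >= 2 && ($NF == "XXX" || $NF == "False") { dead++ }
    END {
      if (total + 0 == 0)          print "no-agents"
      else if (dead + 0 == 0)      print "all-alive"
      else if (dead == total)      print "all-dead"
      else                         print "some-dead"
    }'
}

# Example:
#   openstack network agent list --agent-type l3 -f value -c Host -c Alive \
#     | classify_scope
```

Read all-dead as a hint to look at neutron-server or the broker, and some-dead as a hint to look at the affected nodes. With a single L3 agent the two cases cannot be told apart this way.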

L3 namespace & router checks

Does the router namespace exist and have its interfaces?
# All router namespaces on this node
docker exec neutron_l3_agent ip netns list

# Interfaces inside a specific router ns — expect qr-* (internal) and qg-* (external/SNAT)
docker exec neutron_l3_agent ip netns exec qrouter-$router_id ip -brief a

# Default route inside the namespace should point out the qg- interface
docker exec neutron_l3_agent ip netns exec qrouter-$router_id ip route

A missing qrouter-<id> namespace, or a namespace with no qg- interface, is a real data-plane fault — see the router namespace missing guide. If qr-/qg- are present with correct addresses, routing is intact and this is a reporting issue.

A missing namespace is its own failure mode — our router namespace missing walkthrough covers forcing the agent to recreate it, and the broader debugging Neutron networking guide maps the full qr-/qg-/namespace model.
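The interface check can be reduced to a single verdict per router. This is a minimal sketch that only looks at interface name prefixes in ip -brief output; it does not validate addresses or routes, and check_router_ifaces is an illustrative name.

```shell
# Sketch only: check_router_ifaces is a hypothetical helper.
# Reads `ip -brief a` output on stdin and reports whether both an internal
# (qr-) and an external (qg-) interface are present.
check_router_ifaces() {
  awk '
    $1 ~ /^qr-/ { qr++ }
    $1 ~ /^qg-/ { qg++ }
    END {
      if (qr && qg)  print "ok"
      else if (qr)   print "missing-qg"
      else if (qg)   print "missing-qr"
      else           print "missing-both"
    }'
}

# Example:
#   docker exec neutron_l3_agent ip netns exec qrouter-$router_id ip -brief a \
#     | check_router_ifaces
```

A router with no external gateway set legitimately has no qg- interface, so missing-qg is only a fault when the router is supposed to have one.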

HA router troubleshooting

VRRP / keepalived state inside the HA namespace
# For L3 HA routers, each agent runs keepalived inside the qrouter namespace
docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
  ip -brief a | grep -i ha-

# Exactly one agent hosting this router should own the VIP (be MASTER)
docker exec neutron_l3_agent grep -Ei "entering (master|backup) state|transition" \
  /var/log/kolla/neutron/neutron-keepalived-state-change.log | tail -n 20

Two MASTERs (split-brain) or zero MASTERs means VRRP peering over the HA network is broken; check that the ha- interfaces can reach each other. Tenants see the router as down even when the agent reports alive.
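Once you have collected the HA state from every node that hosts the router, the "exactly one MASTER" rule is a simple count. The sketch below assumes one state word per line, gathered however suits your deployment. It accepts master, primary, or active as the elected role, because the wording differs between releases and tools. ha_verdict is a hypothetical helper.

```shell
# Sketch only: ha_verdict is a hypothetical helper.
# Reads one HA state per line (one line per hosting agent) and reports
# whether exactly one of them holds the elected role.
ha_verdict() {
  awk '
    { s = tolower($1) }
    s == "master" || s == "primary" || s == "active" { n++ }
    END {
      if (n == 1)      print "ok"
      else if (n > 1)  print "split-brain"
      else             print "no-master"
    }'
}

# Example, with states collected by hand from three nodes:
#   printf '%s\n' master backup backup | ha_verdict
```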

Floating IP validation

Are the NAT and ARP entries actually programmed?
# The DNAT/SNAT rules that make floating IPs work live in the router namespace
docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
  iptables -t nat -S | grep -Ei "float|DNAT|SNAT"

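# The floating IP should be configured on the external qg- interface, which is
# what lets the namespace answer ARP for it (to test ARP itself, run arping
# from another host on the external network; the namespace will not answer its own probe)
docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
  ip -brief addr show dev qg-$qg_suffix | grep -F "$floating_ip"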

# From inside the ns, can the router itself reach the external gateway?
docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
  ping -c 3 $external_gateway_ip

A floating IP with no matching DNAT rule was never converged by the agent, which is typical when it was created after the agent lost RPC. If the floating address is missing from qg-, or nothing answers ARP for it when probed from another host on the external network, the outside world can't find it at all.
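Checking one floating IP against the NAT rules by eye is error-prone because addresses share prefixes (203.0.113.1 is a substring of 203.0.113.10). The sketch below matches whole fields instead. It assumes rules in iptables -S format with a /32 destination on the DNAT rule and a --to-source on the SNAT rule. fip_nat_status is an illustrative name.

```shell
# Sketch only: fip_nat_status is a hypothetical helper.
# Usage: iptables -t nat -S | fip_nat_status <floating_ip>
# Reports whether DNAT (inbound) and SNAT (outbound) rules exist for the IP.
fip_nat_status() {
  awk -v fip="$1" '
    {
      dst = ""; to_src = ""; target = ""
      for (i = 1; i < NF; i++) {
        if ($i == "-d")          dst    = $(i + 1)
        if ($i == "--to-source") to_src = $(i + 1)
        if ($i == "-j")          target = $(i + 1)
      }
      if (target == "DNAT" && dst == fip "/32") dnat = 1
      if (target == "SNAT" && to_src == fip)    snat = 1
    }
    END {
      if (dnat && snat) print "dnat+snat"
      else if (dnat)    print "dnat-only"
      else if (snat)    print "snat-only"
      else              print "missing"
    }'
}

# Example:
#   docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
#     iptables -t nat -S | fip_nat_status "$floating_ip"
```

A result of missing for a recently created floating IP fits the "created after the agent lost RPC" case described above.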

Fix & remediation steps

Map the cause you found to the smallest safe remediation:

  • RabbitMQ missed heartbeats / partition → fix the broker first. Restart the agent only after messaging is healthy, and only if it has not reconnected by itself; it flips back to :-) once reports land.
  • neutron-server RPC stuck → restart neutron_server so it re-establishes its AMQP consumers; every agent it tracks should recover together.
  • Agent crashed / OOM-killed → restart the container and check memory limits; this is the one case where restarting the agent is the fix.
  • Clock skew → resync NTP on the agent node; no Neutron restart needed once the clock is correct.
  • Missing namespace → restarting the agent forces it to recreate the router namespaces on startup.

Least-blast-radius restart (Kolla-Ansible)
# Agent-side fix: re-subscribe this node's L3 agent (interrupts non-HA routers)
docker restart neutron_l3_agent

# Server-side fix: only if neutron_server RPC is the problem
docker restart neutron_server

# If the root cause was RabbitMQ, fix the broker FIRST, then restart the
# consuming agent so it re-establishes its AMQP connection:
#   1) resolve RabbitMQ (see the RabbitMQ RPC timeout guide)
#   2) docker restart neutron_l3_agent

Restart the narrowest component that explains the symptom, then immediately re-run openstack network agent list --agent-type l3 to confirm the agent is Alive=:-) again.
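The cause-to-remediation mapping above can be kept as a small lookup so that on-call engineers get the same answer every time. This is only a sketch: the cause labels are invented for illustration, and the function prints a suggestion rather than restarting anything.

```shell
# Sketch only: remediation_for and its cause labels are hypothetical.
# Prints the narrowest remediation for a diagnosed cause; never executes it.
remediation_for() {
  case "$1" in
    rabbitmq)     echo "fix the broker first, then: docker restart neutron_l3_agent" ;;
    server-rpc)   echo "docker restart neutron_server" ;;
    agent-crash)  echo "docker restart neutron_l3_agent (and review memory limits)" ;;
    clock-skew)   echo "resync NTP on the agent node; no Neutron restart needed" ;;
    missing-ns)   echo "docker restart neutron_l3_agent (recreates router namespaces)" ;;
    *)            echo "unknown cause: keep diagnosing before restarting anything"
                  return 1 ;;
  esac
}

# Example:
#   remediation_for clock-skew
```

Printing instead of executing is deliberate: a restart of the L3 agent interrupts non-HA routers, so a human should confirm it.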

Validation steps

Don't declare victory on the agent list alone — confirm the data plane really works:

  • Re-run openstack network agent list --agent-type l3 — the agent shows Alive = :-) and a fresh heartbeat timestamp.
  • Confirm the qrouter-<id> namespace is present with its qr- and qg- interfaces on the hosting node.
  • Ping and SSH a floating IP from outside the cloud — north-south traffic is restored.
  • For HA routers, confirm exactly one agent is elected MASTER for the router and owns the VIP.
  • Watch the agent for 5–15 minutes — a heartbeat that goes stale again means the underlying RabbitMQ/RPC issue isn't fully resolved.

Post-fix validation
openstack network agent list --agent-type l3 -c Host -c Alive -c State
docker exec neutron_l3_agent ip netns list | grep qrouter-$router_id
docker exec neutron_l3_agent ip netns exec qrouter-$router_id \
  ping -c 3 $external_gateway_ip && echo "north-south OK"
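Because a single passing check does not prove the heartbeat will stay fresh, it helps to repeat a check over the watch window. The sketch below runs any command a fixed number of times and stops at the first failure. stays_healthy is a hypothetical helper, and the choice of check, count, and interval is yours.

```shell
# Sketch only: stays_healthy is a hypothetical helper.
# Usage: stays_healthy <checks> <interval_seconds> <command> [args...]
# Runs the command <checks> times, sleeping between runs, and fails fast.
stays_healthy() {
  local checks=$1 interval=$2 i=1
  shift 2
  while [ "$i" -le "$checks" ]; do
    if ! "$@"; then
      echo "check $i of $checks failed" >&2
      return 1
    fi
    [ "$i" -lt "$checks" ] && sleep "$interval"
    i=$(( i + 1 ))
  done
  return 0
}

# Example: one ping per minute for 15 minutes from the router namespace
#   stays_healthy 15 60 docker exec neutron_l3_agent \
#     ip netns exec "qrouter-$router_id" ping -c 1 "$external_gateway_ip"
```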

Prevention

  • Alert on agent heartbeat age, not just up/down. Track each L3 agent's report_state / heartbeat_timestamp against agent_down_time so a stale heartbeat pages you before a floating IP does.
  • Monitor RabbitMQ as a leading indicator — connection churn, blocked connections, and missed heartbeats predict L3 agents flipping to XXX. Our messaging timeout and RabbitMQ queue buildup notes cover the signals.
  • Keep NTP/clock sync tight across control and network nodes so a fresh report_state is never misread as stale.
  • Understand floating IP and NAT flow so you can tell an agent problem from a data-plane one — see debugging Neutron floating IPs & NAT.
  • Consider migrating to OVN. OVN eliminates the per-node L3 agents and the heartbeat model entirely; our Neutron-to-OVN migration guide covers whether it's worth it for your cloud.
  • Turn recurring incidents into repeatable playbooks, and use the free AI Incident Response assistant to draft triage steps fast.
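Alerting on heartbeat age rather than on the Alive flag could look like the following. It is a minimal sketch with Nagios-style exit codes, assuming the default report_interval of 30s and agent_down_time of 75s. The warning threshold of two report intervals is an arbitrary choice for illustration, and heartbeat_alert is a made-up name.

```shell
# Sketch only: heartbeat_alert and WARN_AFTER are hypothetical.
REPORT_INTERVAL=${REPORT_INTERVAL:-30}   # agent-side default, seconds
AGENT_DOWN_TIME=${AGENT_DOWN_TIME:-75}   # server-side default, seconds
WARN_AFTER=${WARN_AFTER:-$(( REPORT_INTERVAL * 2 ))}   # arbitrary: two missed reports

# heartbeat_alert <age_seconds>
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL
heartbeat_alert() {
  if [ "$1" -gt "$AGENT_DOWN_TIME" ]; then
    echo "CRITICAL: heartbeat age ${1}s exceeds agent_down_time ${AGENT_DOWN_TIME}s"
    return 2
  elif [ "$1" -gt "$WARN_AFTER" ]; then
    echo "WARNING: heartbeat age ${1}s exceeds ${WARN_AFTER}s"
    return 1
  fi
  echo "OK: heartbeat age ${1}s"
  return 0
}
```

The warning band is what buys lead time: it fires while neutron-server still considers the agent alive.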

Frequently asked questions

What does XXX mean in the neutron agent list?
The XXX in the Alive column of openstack network agent list means neutron-server has not received a recent report_state heartbeat from that agent, so it marks it alive=False. A healthy agent shows :-). XXX is a reporting verdict, not proof the agent binary crashed — the process is often still running but its heartbeat is not reaching the server, usually over RabbitMQ.
Is neutron-l3-agent dead the same as the router being down?
Not necessarily. The agent hosts routers inside Linux network namespaces, and those namespaces keep forwarding traffic even while the agent process is stopped. A dead agent means neutron-server lost the heartbeat — routing may still work. The data plane only breaks when the namespace itself is torn down, an HA router fails over badly, or the node reboots. Always test real traffic before assuming an outage.
How do I restart the L3 agent safely?
Confirm the cause first, then restart the narrowest component. In Kolla-Ansible: docker restart neutron_l3_agent. HA routers fail over to a standby agent, so impact is brief; legacy (non-HA) routers on that agent lose their data plane until it returns. Restart during a maintenance window where you can, and re-run openstack network agent list to confirm :-) before moving on.
Why do floating IPs stop working when the L3 agent is dead?
Floating IPs are DNAT/SNAT rules the L3 agent programs with iptables inside the router namespace, and it also answers ARP for the floating address on the external qg- interface. If the agent never converged the router — for example a floating IP created after the agent lost RPC — those NAT and ARP entries are missing, so external reachability breaks even though internal traffic works.
Can RabbitMQ cause the L3 agent to show dead?
Yes — it is the most common cause. The agent reports state to neutron-server over oslo.messaging (RabbitMQ). If the AMQP connection drops, hits missed heartbeats, or the broker is overloaded, the report_state call never lands and neutron-server flips the agent to XXX even though the L3 process is fine. See the RabbitMQ RPC timeout guide.
Should I switch to OVN to avoid this?
OVN removes the per-node L3 agents and the RPC heartbeat model entirely — routing and NAT are handled by OVN controllers and ovs-vswitchd, so the "agent shows XXX" failure mode largely disappears. It is a real architectural fix, but a migration is a project, not a quick remedy for an active incident. See our Neutron-to-OVN migration guide to plan it deliberately.