OpenStack 504 Gateway Timeout: Diagnose & Fix

A 504 on Horizon or the OpenStack APIs means HAProxy gave up waiting for a backend. This runbook walks the request path top-to-bottom — HAProxy, Horizon, Keystone, the Nova/Cinder/Neutron APIs, RabbitMQ, and MariaDB — so you can localize the slow hop and fix it, not just paper over the timeout.

Updated July 3, 2026 · 11 min read · Runbook-style guide · copy/paste commands

Free runbook · PDF

Download the free OpenStack 504 Gateway Runbook Pack

A print-ready incident runbook for chasing 504s across HAProxy, Horizon, Keystone, Nova/Cinder/Neutron, RabbitMQ, and MariaDB.

504 triage checklist (top-to-bottom)
HAProxy, Horizon, and Keystone checks
Nova / Cinder / Neutron API checks
RabbitMQ + MariaDB latency checks
Kolla-Ansible container restart commands
Escalation workflow + incident notes template

No account needed · single opt-in · we never share your email.

A 504 Gateway Timeout in OpenStack is a latency problem wearing an HTTP status code. The load balancer in front of your control plane — in a Kolla-Ansible deployment that is HAProxy bound to the internal and external VIPs — proxies each API request to a backend and waits up to timeout server for the first response byte. When a backend is slow or wedged, HAProxy stops waiting and hands the client a 504. The proxy is the messenger; the real fault is downstream.

The fastest way to resolve a 504 is to stop guessing and walk the request path: confirm the 504 is coming from HAProxy (not an edge proxy or CDN), time each backend API, then follow the slow one down through Keystone, RabbitMQ, and MariaDB until you find the hop that is actually stalling. This guide is the runbook I use for exactly that. Prefer it in your hand during an incident? Grab the free runbook pack above.

Symptoms

You are probably here because you are seeing one or more of these:

Horizon shows 504 Gateway Time-out (the HAProxy/nginx error page) on login or when opening a panel.
CLI or API calls hang for ~30–60s and then fail: Gateway Timeout (HTTP 504) or Unable to establish connection.
Some requests succeed and others 504 — often the slower, heavier calls (image lists, volume creates, live network topology).
HAProxy logs show termination state sH (server timeout waiting for response headers).
Intermittent 504s under load that disappear when traffic drops — a classic sign of a saturated backend, DB, or queue.
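
To put a number on the sH symptom, a minimal sketch like the one below counts log lines whose termination state begins with sH. It assumes HAProxy's standard HTTP log format, where the four-character termination state appears as its own space-delimited field; the function name is hypothetical and the log is read from stdin so no particular log path is assumed.

```shell
# Count HAProxy HTTP log lines whose termination state starts with "sH"
# (server-side timeout while waiting for response headers).
# Assumption: httplog format, state is a 4-character space-delimited field.
count_server_timeouts() {
  awk '/ sH[-A-Z][-A-Z] / { n++ } END { print n + 0 }'
}
```

Pipe whatever HAProxy log you have into it, for example `docker logs haproxy 2>&1 | count_server_timeouts` if your deployment logs to the container's stdout (that depends on your logging setup).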

Likely causes

In production OpenStack, 504s almost always trace back to one of these, roughly in order of frequency:

Keystone latency. Slow token issuance/validation (Fernet key caching, DB latency) slows every other API, because they all call Keystone.
RabbitMQ / RPC stalls. An API accepts your request but blocks waiting on an oslo.messaging reply that never arrives — a backed-up queue, a dead consumer, or missed heartbeats. See the RPC timeout guide.
MariaDB / Galera pressure. Connection exhaustion (max_connections), slow queries, or Galera flow control pausing writes cluster-wide starves API workers.
A wedged or OOM-killed API worker. The container is "up" but its WSGI/uWSGI workers are hung or the process was OOM-killed and is restarting.
HAProxy backend marked DOWN or all sessions queued because health checks are failing.
Horizon-specific: exhausted Apache/mod_wsgi workers or an unreachable memcached, which stalls session and token caching.

Immediate checks

Ninety seconds of triage narrows the search dramatically. First, confirm the layer and reproduce with timing:

Confirm the 504 layer and time each API

# Bypass any edge proxy/CDN — hit the OpenStack VIP directly
VIP=your-internal-vip

# Time each control-plane API independently; the slow one is your lead
for p in 5000 8774 8776 9696; do
  curl -s -o /dev/null -w "port $p: %{http_code}  %{time_total}s\n" https://$VIP:$p/
done
# 5000=Keystone  8774=Nova  8776=Cinder  9696=Neutron

A single port that is 3–10× slower than the others is the hop to chase. If Keystone (5000) is slow, fix it first — everything depends on it.
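
If you would rather not eyeball the numbers, the sketch below applies the 3× rule of thumb to the loop's output. It assumes the exact line format printed by the loop above ("port N: code time"), compares every port against the fastest one, and uses a hypothetical function name; the factor is an argument so you can tighten or loosen it.

```shell
# Read "port <n>: <code>  <seconds>s" lines on stdin and print the ports
# that are at least FACTOR times slower than the fastest port (default 3).
flag_slow_ports() {
  awk -v factor="${1:-3}" '
    $1 == "port" {
      n++
      port[n] = $2
      sub(/:$/, "", port[n])      # "8774:" -> "8774"
      t[n] = $NF + 0              # "2.400000s" -> 2.4
      if (n == 1 || t[n] < min) min = t[n]
    }
    END {
      for (i = 1; i <= n; i++)
        if (min > 0 && t[i] >= factor * min) print port[i], t[i] "s"
    }'
}
```

Append `| flag_slow_ports` to the `done` line of the timing loop. If every API is equally slow, nothing is flagged, which is itself a hint that the problem sits in something shared such as Keystone or the database.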

Next, confirm the backends HAProxy sees are actually up. A backend flapping DOWN will 504 every request routed to it while health checks fail:

HAProxy backend status (Kolla-Ansible socket)

docker exec haproxy sh -c 'echo "show stat" | socat stdio /var/lib/kolla/haproxy/haproxy.sock' \
  | awk -F, 'NR==1 || $18 ~ /^DOWN/ {print $1"/"$2" -> "$18}'

Any backend printed as DOWN is your answer — go fix that service rather than HAProxy.

Diagnostic commands

Keystone (check this early)

Baseline auth latency

time openstack token issue -f value -c id
curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' https://$VIP:5000/v3
docker logs --tail=100 keystone 2>&1 | grep -Ei "error|timeout|deadlock|too many connections"

A token issue that takes longer than about 1s points at Keystone or the MariaDB behind it. Our walkthrough on debugging Keystone identity & auth and Fernet key handling covers the usual culprits (key rotation, caching, DB latency).
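
A single timing can be lucky. As a rough sketch, the helper below runs any command several times and reports the slowest run in milliseconds, which is easier to compare against the ~1s threshold. The function name is hypothetical, and it assumes GNU date (for the %N nanosecond format).

```shell
# slowest_run_ms RUNS COMMAND [ARGS...]
# Runs COMMAND RUNS times and prints the slowest wall-clock time in ms.
# Assumption: GNU date, which supports %N (nanoseconds).
slowest_run_ms() {
  runs=$1; shift
  worst=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    ms=$(( (end - start) / 1000000 ))
    if [ "$ms" -gt "$worst" ]; then worst=$ms; fi
    i=$((i + 1))
  done
  echo "$worst"
}
```

For example, `slowest_run_ms 5 openstack token issue -f value -c id` prints the worst of five token issues; keep in mind the figure includes CLI start-up time, so treat it as an upper bound on Keystone's own latency.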

Nova / Cinder / Neutron APIs

Are the service agents alive, and which API is slow?

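# These listings are read from each service's database: a hang here implicates the API or MariaDB,
# while services or agents reported down point at the agent itself or the path it reports through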
openstack compute service list
openstack volume service list
openstack network agent list

docker logs --tail=80 nova_api 2>&1     | grep -Ei "timeout|MessagingTimeout|error"
docker logs --tail=80 cinder_api 2>&1   | grep -Ei "timeout|MessagingTimeout|error"
docker logs --tail=80 neutron_server 2>&1 | grep -Ei "timeout|AMQP|error"

A MessagingTimeout in an API log means the API is healthy but its RPC peer or RabbitMQ is not — pivot to the RabbitMQ checks.
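
The pivot decision can be sketched as a small log classifier. This is only an illustration: it looks for MessagingTimeout (RabbitMQ/RPC) and for the database strings already used in the Keystone grep above ("too many connections", "deadlock"), and the function name is hypothetical.

```shell
# Read API log lines on stdin and say which downstream hop they implicate.
classify_api_log() {
  awk '
    /MessagingTimeout/ { rpc++ }
    tolower($0) ~ /too many connections|deadlock/ { db++ }
    END {
      if (rpc) print "rabbitmq: " rpc " MessagingTimeout line(s)"
      if (db)  print "mariadb: " db " connection/deadlock line(s)"
      if (!rpc && !db) print "no downstream signature found"
    }'
}
```

Use it as `docker logs --tail=500 nova_api 2>&1 | classify_api_log`. A clean result does not prove the hop is healthy; it only means these particular signatures were not in the lines you sampled.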

RabbitMQ

Queue backlog, dead consumers, blocked connections

docker exec rabbitmq rabbitmqctl cluster_status
docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
  | awk '$2 ~ /^[0-9]+$/ && ($2>100 || $3==0) {print}'
docker exec rabbitmq rabbitmqctl list_connections state | grep -c blocked
docker logs --tail=100 rabbitmq 2>&1 | grep -Ei "missed heartbeats|partition|closing"

A messages count that keeps climbing while consumers=0 means no service is attached to drain that queue. Connections in the blocked state mean the broker has hit its memory or disk watermark and is applying backpressure to publishers.

MariaDB / Galera

Connection saturation and Galera flow control

docker exec mariadb mysqladmin status
docker exec mariadb mysql -N -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';
  SHOW GLOBAL VARIABLES LIKE 'max_connections';"
docker exec mariadb mysql -N -e "SHOW STATUS LIKE 'wsrep_ready';
  SHOW STATUS LIKE 'wsrep_flow_control_paused';"

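Threads_connected near max_connections starves API workers, because new connections are refused or queued. wsrep_flow_control_paused is the fraction of time (0.0 to 1.0) that replication has been paused by flow control since the last FLUSH STATUS; anything meaningfully above 0 means a lagging Galera node is throttling writes for the whole cluster. See our Galera flow-control tuning guide.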
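
To turn "near max_connections" into a number, a sketch like this computes connection usage as a percentage. It assumes the two-column name/value output that the mysql -N commands above produce; the function name is hypothetical and any alert threshold you pick is your own call.

```shell
# Read "Threads_connected <n>" and "max_connections <n>" lines on stdin
# and print connection usage as a whole-number percentage.
conn_usage_pct() {
  awk '
    $1 == "Threads_connected" { used = $2 }
    $1 == "max_connections"   { max = $2 }
    END {
      if (max > 0) printf "%d\n", used * 100 / max
      else { print "max_connections not found" > "/dev/stderr"; exit 1 }
    }'
}
```

Pipe the output of the Threads_connected / max_connections query into `conn_usage_pct`. A sustained reading in the high range during the 504 window is a strong lead; a low reading lets you rule out connection exhaustion and move on to slow queries or flow control.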

Fix & remediation steps

Map the slow hop you found to the smallest safe remediation:

Wedged API worker → restart just that container.
Keystone slow → fix the root cause (DB latency, key caching); restart keystone only if workers are hung.
RabbitMQ consumer dead (consumers=0) → restart the consuming service so it re-subscribes; treat the broker as a last resort.
MariaDB saturation → find and kill the offending queries / raise pool sizing deliberately; for Galera, restore the lagging node rather than restarting the cluster.
HAProxy config drift (e.g. after a cert or endpoint change) → reconfigure HAProxy from Kolla.

Least-blast-radius restart / reconfigure (Kolla-Ansible)

# Restart a single stateless API service
docker restart nova_api            # or cinder_api / neutron_server / keystone

# Reconfigure one service from Kolla after config/cert drift
kolla-ansible -i <inventory> reconfigure --tags haproxy

# Stateful services: targeted and single-node only — never blind-restart a partitioned cluster
docker restart rabbitmq

Restart, then immediately re-run the timing check from the Immediate checks section to confirm recovery.
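
A restarted container usually needs a few seconds before it answers again, so a single immediate check can give a false alarm. The sketch below retries a check once per second up to a limit; the function name and the 60-second figure in the usage note are illustrative assumptions.

```shell
# wait_until LIMIT_SECONDS COMMAND [ARGS...]
# Retries COMMAND once per second until it succeeds (returns 0)
# or LIMIT_SECONDS of waiting have elapsed (returns 1).
wait_until() {
  limit=$1; shift
  waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$limit" ]; then return 1; fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

For example: `docker restart nova_api && wait_until 60 curl -sf -m 5 https://$VIP:8774/ && echo "nova_api answering"`. If the wait expires, the restart did not fix the problem and the cause is further down the path.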

Validation steps

Don’t declare victory on a single successful request. Confirm the fix held:

Re-run the per-port timing loop from Immediate checks — every API should answer well under 1s.
Load Horizon and open the panels that were failing (Instances, Volumes, Network Topology).
Confirm HAProxy shows all backends UP with no queued sessions.
Run a real workflow end to end: openstack server list, create and delete a tiny volume, list networks.
Watch for 5–15 minutes under normal load — intermittent 504s under pressure mean the underlying saturation isn’t fully resolved.

Post-fix validation

for p in 5000 8774 8776 9696; do
  curl -s -o /dev/null -w "port $p: %{http_code}  %{time_total}s\n" https://$VIP:$p/
done
openstack server list >/dev/null && echo "nova OK"
openstack volume service list -f value -c Status | sort | uniq -c
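
The HAProxy item on the checklist can be scripted too. This sketch reads the CSV that "show stat" returns and lists every backend or server row that is not UP or that has queued sessions. It relies on the standard column order of that CSV (qcur is field 3, status is field 18); the function name is hypothetical.

```shell
# Read HAProxy "show stat" CSV on stdin; print rows that are not healthy.
# Exit status is 1 if anything was flagged, 0 otherwise.
haproxy_unhealthy() {
  awk -F, '
    /^#/ || NF < 18  { next }            # skip the header and short lines
    $2 == "FRONTEND" { next }            # frontends report OPEN, not UP
    ($3 + 0) > 0 || $18 !~ /^(UP|no check)/ {
      print $1 "/" $2 " status=" $18 " queued=" ($3 + 0)
      bad = 1
    }
    END { exit bad }'
}
```

Feed it the same socat command used in Immediate checks: `docker exec haproxy sh -c 'echo "show stat" | socat stdio /var/lib/kolla/haproxy/haproxy.sock' | haproxy_unhealthy && echo "all backends UP, nothing queued"`.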

Prevention

Alert on backend latency, not just up/down. Track per-API p95 latency and Keystone token-issue time; a 504 should never be your first signal. Our OpenStack + Prometheus guide covers the exporters.
Watch RabbitMQ queue depth and MariaDB connections as leading indicators — see diagnosing RabbitMQ queue buildup and tuning Galera flow control.
Right-size API workers and DB connection pools for real concurrency; default worker counts rarely match production load.
Keep HAProxy timeouts honest — tuned to a realistic SLO, not inflated to mask slow APIs.
Capacity-plan the control plane the way you would the data plane; see capacity planning for OpenStack.
Turn recurring incidents into reusable diagnostic runbooks, and use the free AI Incident Response assistant to draft triage steps fast.
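
If you do not have a metrics stack yet, p95 can be approximated by hand from a batch of timings. The sketch below uses the nearest-rank method on one numeric sample per line; the function name is hypothetical, and it is a stopgap for spot checks rather than a replacement for real monitoring.

```shell
# Read one numeric latency sample per line on stdin and print the
# 95th percentile using the nearest-rank method.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) in integer math
      print v[idx]
    }'
}
```

For example, collect samples with `for i in $(seq 1 40); do curl -s -o /dev/null -w '%{time_total}\n' https://$VIP:5000/v3; done | p95` and compare the result against the latency target you have set for that API.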

Want the always-current prompts and tools behind this workflow? Browse the AI prompt library, the free in-browser DevOps tools, and — when a production incident needs senior hands — work with me directly.

Frequently asked questions

What does a 504 Gateway Timeout mean in OpenStack?

It means the proxy in front of OpenStack — almost always HAProxy on the internal or external VIP — waited for a backend API to respond and gave up. HAProxy returns the 504; the actual fault is a slow or unresponsive backend such as Keystone, a service API (Nova, Cinder, Neutron), RabbitMQ, or MariaDB. The 504 tells you where the request died, not why.

Why does Horizon return 504 but the CLI works?

Horizon makes many API calls to render a single page, so it amplifies latency: an API that is merely slow (say 3–5s) can push Horizon past HAProxy’s timeout server while a single CLI call still squeaks under it. Horizon also depends on its own Apache/mod_wsgi workers and memcached. Time the underlying API directly with curl -w before blaming Horizon.
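
The amplification is simple arithmetic, sketched below under loud assumptions: the calls for one page run back to back, and the call count, per-call latency, and timeout are whole-second figures you supply. The numbers in the usage note are hypothetical, not measurements from a real deployment.

```shell
# page_time_s CALLS PER_CALL_SECONDS
# Total seconds for one page if its API calls run one after another.
page_time_s() {
  echo $(( $1 * $2 ))
}

# exceeds_timeout CALLS PER_CALL_SECONDS TIMEOUT_SECONDS
# Succeeds (exit 0) when the page would reach or pass the proxy timeout.
exceeds_timeout() {
  [ "$(page_time_s "$1" "$2")" -ge "$3" ]
}
```

With a hypothetical 15 calls at 4s each, `page_time_s 15 4` gives 60, so `exceeds_timeout 15 4 60` succeeds, while one CLI call at the same latency (`exceeds_timeout 1 4 60`) does not.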

Should I just increase the HAProxy timeout to fix a 504?

Only as a deliberate, temporary stopgap. Raising timeout server hides the symptom, but a control-plane API that takes 60 seconds to answer is a bug, usually a slow Keystone token validation, a stuck RabbitMQ reply queue, or MariaDB connection exhaustion. Fix the slow hop; don’t normalize the latency.

How do I know if RabbitMQ or MariaDB is causing the 504?

Look for MessagingTimeout in the API logs (points at RabbitMQ/RPC) or Threads_connected near max_connections and Galera wsrep_flow_control_paused > 0 (points at MariaDB). A healthy API that logs an RPC timeout is telling you the fault is downstream — see our RabbitMQ RPC timeout guide.

Is it safe to restart OpenStack containers to clear a 504?

Restarting a single stateless API container (e.g. nova_api) is low-risk and often clears a wedged worker. Restarting stateful, clustered services — RabbitMQ or MariaDB/Galera — is not: a blind restart of a partitioned cluster can cause data loss or a longer outage. Confirm the cause first and restart the narrowest component.

Which OpenStack service should I check first for a 504?

Keystone. Every other API validates tokens against Keystone on nearly every request, so if Keystone is slow, everything downstream times out and 504s. Baseline auth latency with time openstack token issue before chasing individual services.

Related troubleshooting guides

neutron-l3-agent Dead / XXX State

Cinder Scheduler Timeout

Kolla-Ansible Certificate Update

RabbitMQ RPC Timeout in OpenStack

OpenStack AI Prompt Library