A 504 Gateway Timeout in OpenStack is a latency problem wearing an HTTP status code. The load balancer in front of your control plane (in a Kolla-Ansible deployment, that is HAProxy bound to the internal and external VIPs) proxies each API request to a backend and waits up to its timeout server setting for the first response byte. When a backend is slow or wedged, HAProxy stops waiting and hands the client a 504. The proxy is the messenger; the real fault is downstream.
The fastest way to resolve a 504 is to stop guessing and walk the request path: confirm the 504 is coming from HAProxy (not an edge proxy or CDN), time each backend API, then follow the slow one down through Keystone, RabbitMQ, and MariaDB until you find the hop that is actually stalling. This guide is the runbook I use for exactly that.
Symptoms
You are probably here because you are seeing one or more of these:
- Horizon shows 504 Gateway Time-out (the HAProxy/nginx error page) on login or when opening a panel.
- CLI or API calls hang for ~30–60s and then fail: Gateway Timeout (HTTP 504) or Unable to establish connection.
- Some requests succeed and others 504, often the slower, heavier calls (image lists, volume creates, live network topology).
- HAProxy logs show termination state sH (server timeout waiting for response headers).
- Intermittent 504s under load that disappear when traffic drops: a classic sign of a saturated backend, DB, or queue.
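The sH signature can be counted directly from the proxy log. The following is a minimal sketch, assuming HAProxy's standard HTTP log format, where the termination state is a space-delimited four-character field such as sH--. The function name and the HAPROXY_LOG variable are illustrative placeholders, not something Kolla-Ansible defines.

```shell
# Count HAProxy log lines whose termination state starts with "sH"
# (server-side timeout while waiting for response headers).
# Assumes the standard HTTP log format, where the state is a
# space-delimited 4-character field such as "sH--".
count_server_header_timeouts() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function usable in scripts that run with "set -e".
  grep -cE ' sH[^ ]{2} ' || true
}

# Usage (HAPROXY_LOG is a placeholder for wherever your deployment writes the log):
#   count_server_header_timeouts < "$HAPROXY_LOG"
```

A non-zero count confirms HAProxy itself timed out on a backend, rather than an edge proxy in front of it.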
Likely causes
In production OpenStack, 504s almost always trace back to one of these, roughly in order of frequency:
- Keystone latency. Slow token issuance/validation (Fernet key caching, DB latency) slows every other API, because they all call Keystone.
- RabbitMQ / RPC stalls. An API accepts your request but blocks waiting on an oslo.messaging reply that never arrives: a backed-up queue, a dead consumer, or missed heartbeats. See the RPC timeout guide.
- MariaDB / Galera pressure. Connection exhaustion (max_connections), slow queries, or Galera flow control pausing writes cluster-wide starves API workers.
- A wedged or OOM-killed API worker. The container is "up" but its WSGI/uWSGI workers are hung or the process was OOM-killed and is restarting.
- HAProxy backend marked DOWN or all sessions queued because health checks are failing.
- Horizon-specific: exhausted Apache/mod_wsgi workers or an unreachable memcached, which stalls session and token caching.
Immediate checks
Ninety seconds of triage narrows the search dramatically. First, confirm the layer and reproduce with timing:
# Bypass any edge proxy/CDN — hit the OpenStack VIP directly
VIP=your-internal-vip
# Time each control-plane API independently; the slow one is your lead
for p in 5000 8774 8776 9696; do
curl -s -o /dev/null -w "port $p: %{http_code} %{time_total}s\n" https://$VIP:$p/
done
# 5000=Keystone 8774=Nova 8776=Cinder 9696=Neutron
A single port that is 3–10× slower than the others is the hop to chase. If Keystone (5000) is slow, fix it first, because everything depends on it.
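Picking the outlier by eye is error-prone mid-incident. As a rough sketch, the helper below reads the "port N: CODE TIMEs" lines that the timing loop prints and reports the slowest port along with its ratio to the fastest one. The function name is hypothetical, and comparing against the fastest port (rather than a median) is a simplifying assumption.

```shell
# Reads lines like "port 5000: 200 0.084s" on stdin and prints the slowest
# port plus how many times slower it is than the fastest one.
slowest_port() {
  awk '
    {
      port = $2; sub(/:$/, "", port)      # "5000:" -> "5000"
      t = $4;    sub(/s$/, "", t); t += 0  # "0.084s" -> 0.084
      if (NR == 1 || t > max) { max = t; slow = port }
      if (NR == 1 || t < min) { min = t }
    }
    END {
      if (NR) printf "slowest=%s time=%.3fs ratio=%.1fx\n", slow, max, (min > 0 ? max / min : 0)
    }
  '
}

# Usage: pipe the timing loop into it, e.g.
#   for p in 5000 8774 8776 9696; do curl ...; done | slowest_port
```

A ratio of roughly 3x or more matches the "3–10× slower" rule of thumb above.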
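Next, confirm the backends HAProxy sees are actually up. A backend that is flapping will 504 requests routed to it during the windows when it is still marked UP but failing; once every server in a backend is DOWN, HAProxy answers 503 instead: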
docker exec haproxy sh -c 'echo "show stat" | socat stdio /var/lib/kolla/haproxy/haproxy.sock' \
| awk -F, 'NR==1 || $18=="DOWN" {print $1"/"$2" -> "$18}'
Any backend printed as DOWN is your answer: go fix that service rather than HAProxy.
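Queued sessions matter as much as DOWN markers. Here is a small sketch that filters the same "show stat" CSV for both, assuming the standard column order in which field 3 is qcur (currently queued requests) and field 18 is status. The function name is made up for illustration.

```shell
# Reads HAProxy "show stat" CSV on stdin and prints rows that look unhealthy:
# status (field 18) beginning with DOWN, NOLB or MAINT, or a non-zero
# current queue (field 3, qcur).
haproxy_trouble() {
  awk -F, '
    /^#/ || NF < 18 { next }   # skip the header row and blank lines
    $18 ~ /^(DOWN|NOLB|MAINT)/ || $3 + 0 > 0 {
      printf "%s/%s status=%s queued=%d\n", $1, $2, $18, $3
    }
  '
}

# Usage, reusing the socket path from the command above:
#   docker exec haproxy sh -c 'echo "show stat" | socat stdio /var/lib/kolla/haproxy/haproxy.sock' | haproxy_trouble
```

Matching on the prefix also catches transitional states such as "DOWN 1/2", which an exact comparison against DOWN would miss.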
Diagnostic commands
Keystone (check this early)
time openstack token issue -f value -c id
curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' https://$VIP:5000/v3
docker logs --tail=100 keystone 2>&1 | grep -Ei "error|timeout|deadlock|too many connections"
A token issue that takes over ~1s points at Keystone → MariaDB. If Keystone is the slow hop, our walkthrough on debugging Keystone identity & auth and Fernet key handling covers the usual culprits (key rotation, caching, DB latency).
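A single timing can be misleading when latency is intermittent. The sketch below repeats a command and prints each wall-clock duration so spikes stand out. It assumes GNU date (for the %N nanosecond field), and the function name is illustrative.

```shell
# Run a command N times and print each wall-clock duration in seconds.
# Relies on GNU date for nanosecond resolution (%N).
# The body runs in a subshell so its variables do not leak into the caller.
sample_latency() (
  n=$1; shift
  i=0
  while [ "$i" -lt "$n" ]; do
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    ms=$(( (end - start) / 1000000 ))
    printf '%d.%03d\n' $((ms / 1000)) $((ms % 1000))
    i=$((i + 1))
  done
)

# Usage: sample_latency 5 openstack token issue -f value -c id
```

Several samples above the ~1s mark carry more weight than one slow run, which may just be a cold cache.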
Nova / Cinder / Neutron APIs
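# These list calls read service state from the database; that state is fed by
# service heartbeats, so a hang implicates the API or DB, and "down" entries implicate the service or RabbitMQ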
openstack compute service list
openstack volume service list
openstack network agent list
docker logs --tail=80 nova_api 2>&1 | grep -Ei "timeout|MessagingTimeout|error"
docker logs --tail=80 cinder_api 2>&1 | grep -Ei "timeout|MessagingTimeout|error"
docker logs --tail=80 neutron_server 2>&1 | grep -Ei "timeout|AMQP|error"
A MessagingTimeout in an API log means the API is healthy but its RPC peer or RabbitMQ is not, so pivot to the RabbitMQ checks.
RabbitMQ
docker exec rabbitmq rabbitmqctl cluster_status
docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
| awk 'NR>1 && ($2>100 || $3==0) {print}'
docker exec rabbitmq rabbitmqctl list_connections state | grep -c blocked
docker logs --tail=100 rabbitmq 2>&1 | grep -Ei "missed heartbeats|partition|closing"
A messages count that keeps climbing with consumers=0 means the consuming service can't drain the queue. Blocked connections mean the broker has hit a memory/disk watermark (backpressure).
MariaDB / Galera
docker exec mariadb mysqladmin status
docker exec mariadb mysql -N -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL VARIABLES LIKE 'max_connections';"
docker exec mariadb mysql -N -e "SHOW STATUS LIKE 'wsrep_ready';
SHOW STATUS LIKE 'wsrep_flow_control_paused';"
Threads_connected near max_connections starves API workers. wsrep_flow_control_paused > 0 means a lagging Galera node is throttling the whole cluster; see our Galera flow-control tuning guide.
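"Near max_connections" is easier to judge as a percentage. The following sketch turns the two numbers from the query above into a utilisation figure. The 80% warning threshold is an assumption chosen for illustration, not a MariaDB or OpenStack default, and the function name is hypothetical.

```shell
# Print connection utilisation and flag it when it crosses a warning threshold.
# Arguments: threads_connected max_connections [warn_percent, default 80]
conn_headroom() {
  awk -v used="$1" -v max="$2" -v warn="${3:-80}" 'BEGIN {
    if (max <= 0) { print "invalid max_connections"; exit 2 }
    pct = 100 * used / max
    printf "%d/%d connections (%.0f%%) %s\n", used, max, pct, (pct >= warn ? "WARN" : "ok")
  }'
}

# Usage: pass the two values reported by the SHOW GLOBAL queries above, e.g.
#   conn_headroom 180 200
```

Lower the threshold if your API worker pools can burst quickly, since headroom disappears fast once retries pile up.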
Fix & remediation steps
Map the slow hop you found to the smallest safe remediation:
- Wedged API worker → restart just that container.
- Keystone slow → fix the root cause (DB latency, key caching); restart keystone only if workers are hung.
- RabbitMQ consumer dead (consumers=0) → restart the consuming service so it re-subscribes; treat the broker as a last resort.
- MariaDB saturation → find and kill the offending queries / raise pool sizing deliberately; for Galera, restore the lagging node rather than restarting the cluster.
- HAProxy config drift (e.g. after a cert or endpoint change) → reconfigure HAProxy from Kolla.
# Restart a single stateless API service
docker restart nova_api # or cinder_api / neutron_server / keystone
# Reconfigure one service from Kolla after config/cert drift
kolla-ansible -i <inventory> reconfigure --tags haproxy
# Stateful services: targeted and single-node only — never blind-restart a partitioned cluster
docker restart rabbitmq
Restart, then immediately re-run the timing check from the Immediate checks section to confirm recovery.
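A restarted service usually needs a few seconds before it answers again, so a one-shot check right after the restart can fail spuriously. Below is a minimal retry sketch; the function name, attempt count, and delay are all illustrative choices.

```shell
# Retry a check until it succeeds or the attempts run out.
# Arguments: attempts delay_seconds command [args...]
# The body runs in a subshell, so "exit" only ends the function.
wait_until() (
  attempts=$1; delay=$2; shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      echo "ok after $n attempt(s)"
      exit 0
    fi
    # Sleep between attempts, but not after the final one
    [ "$n" -lt "$attempts" ] && sleep "$delay"
    n=$((n + 1))
  done
  echo "still failing after $attempts attempt(s)" >&2
  exit 1
)

# Usage: poll Keystone for up to ~30s, failing on HTTP errors or a >2s response
#   wait_until 10 3 curl -sf -o /dev/null --max-time 2 "https://$VIP:5000/v3"
```

If the check never passes, the restart did not address the stalled hop and the diagnosis needs another look.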
Validation steps
Don’t declare victory on a single successful request. Confirm the fix held:
- Re-run the per-port timing loop from Immediate checks — every API should answer well under 1s.
- Load Horizon and open the panels that were failing (Instances, Volumes, Network Topology).
- Confirm HAProxy shows all backends UP with no queued sessions.
- Run a real workflow end to end: openstack server list, create and delete a tiny volume, list networks.
- Watch for 5–15 minutes under normal load; intermittent 504s under pressure mean the underlying saturation isn’t fully resolved.
for p in 5000 8774 8776 9696; do
curl -s -o /dev/null -w "port $p: %{http_code} %{time_total}s\n" https://$VIP:$p/
done
openstack server list >/dev/null && echo "nova OK"
openstack volume service list -f value -c Status -c State | sort | uniq -c
Prevention
- Alert on backend latency, not just up/down. Track per-API p95 latency and Keystone token-issue time; a 504 should never be your first signal. Our OpenStack + Prometheus guide covers the exporters.
- Watch RabbitMQ queue depth and MariaDB connections as leading indicators — see diagnosing RabbitMQ queue buildup and tuning Galera flow control.
- Right-size API workers and DB connection pools for real concurrency; default worker counts rarely match production load.
- Keep HAProxy timeouts honest — tuned to a realistic SLO, not inflated to mask slow APIs.
- Capacity-plan the control plane the way you would the data plane; see capacity planning for OpenStack.
- Turn recurring incidents into reusable diagnostic runbooks, and use the free AI Incident Response assistant to draft triage steps fast.
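For an ad hoc look at p95 latency before proper monitoring is in place, a nearest-rank percentile can be computed from a list of samples. This is only a sketch under that assumption; a metrics stack would normally compute percentiles from histograms instead, and the function name is hypothetical.

```shell
# Nearest-rank p95 over latency samples (one number per line on stdin).
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # integer ceiling of 0.95 * NR
      print v[rank]
    }
  '
}

# Usage: feed it the time_total values from repeated curl runs, one per line
#   p95 < samples.txt
```

With only a handful of samples the result is effectively the maximum, so collect at least a few dozen before reading much into it.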