Debugging NGINX 502 Bad Gateway and 504 Gateway Timeout With AI

The page goes white, the curl returns 502 Bad Gateway, and the on-call channel lights up. NGINX is up, the box is up, and yet the request dies somewhere between the proxy and whatever sits behind it. I’ve lost more evenings than I’d like to this exact shape of problem: NGINX itself is rarely the thing that’s broken — it’s the messenger telling you the upstream refused, hung, or vanished. The trick is reading that message correctly, fast, at 2 a.m., when you’re tired and the temptation is to start bumping timeouts at random and reloading until the red goes away.

That’s the part where AI actually earns its seat. Not to fix NGINX for you — it doesn’t have your error.log, your upstream topology, or your blast radius — but to decode a cryptic log line, draft a candidate upstream block, and explain why a directive behaves the way it does. You stay in control: you paste the evidence, you read the reasoning, and you validate every change with nginx -t before it ever touches a live worker. Here’s how I work through 502s and 504s with that loop.

First, separate the two errors

502 and 504 look similar in a browser but mean different things, and conflating them sends you down the wrong path.

A 502 Bad Gateway means NGINX got an invalid or no response from the upstream: the connection was refused, reset, or the upstream returned garbage. The upstream is reachable enough to fail loudly.

A 504 Gateway Timeout means NGINX waited and never heard back. The upstream is alive but slow — or wedged — and NGINX gave up after proxy_read_timeout (or proxy_connect_timeout) expired.

So the very first move is to stop guessing and read the log. NGINX is unusually honest in error.log; it tells you exactly which failure mode it hit.

# Watch the error log live, filtered to upstream problems
sudo tail -f /var/log/nginx/error.log | grep -E "upstream|recv|connect"

Read the error.log line, don’t skim it

The error.log line carries the whole story if you parse it. Three lines I see constantly:

2026/06/20 02:14:07 [error] 2391#2391: *1043 connect() failed (111: Connection refused)
  while connecting to upstream, client: 10.0.3.9, server: app.example.com,
  request: "GET /api/orders HTTP/1.1", upstream: "http://127.0.0.1:8080/api/orders"

2026/06/20 02:31:55 [error] 2391#2391: *1190 upstream timed out (110: Connection timed out)
  while reading response header from upstream, client: 10.0.3.9, server: app.example.com,
  request: "POST /export HTTP/1.1", upstream: "http://10.0.5.21:8080/export"

2026/06/20 02:48:02 [error] 2391#2391: *1320 no live upstreams while connecting to upstream,
  client: 10.0.3.9, server: app.example.com, request: "GET / HTTP/1.1",
  upstream: "http://backend/"

Each one points at a different root cause:

connect() failed (111: Connection refused) → nothing is listening on that host:port. The app crashed, hasn’t started, bound to the wrong interface, or you’re pointing at the wrong port. This is a 502.
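upstream timed out (110: Connection timed out) → the upstream didn’t answer in time. The trailing phrase tells you which timer fired: while connecting to upstream means the connection was never established within proxy_connect_timeout (host down, a firewall silently dropping packets, a full accept queue), and while reading response header from upstream means the connection succeeded but no response arrived within proxy_read_timeout (slow query, deadlock, GC pause, or a genuinely heavy request). This is usually a 504.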
no live upstreams → NGINX already marked every server in the upstream block as failed (via max_fails/fail_timeout) and has nothing left to route to. This is a 502, and it means the problem started earlier — go back in the log to find the first failures.
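The mapping above can be sketched as a small shell function for triaging a log by hand. This is a minimal sketch, not a complete parser: the function name classify_upstream_error and its output labels are my own invention, and it only knows the message shapes quoted in this article.

```shell
# Hypothetical helper: label one error.log line with the likely status and failure type.
classify_upstream_error() {
    case "$1" in
        *"no live upstreams"*)
            echo "502 no-live-upstreams" ;;
        *"connect() failed (111: Connection refused)"*)
            echo "502 connect-refused" ;;
        *"connect() "*"failed"*"while connecting to upstream"*)
            echo "502 connect-failed" ;;      # other errnos, e.g. a missing Unix socket
        *"upstream timed out"*"while connecting to upstream"*)
            echo "504 connect-timeout" ;;     # proxy_connect_timeout fired
        *"upstream timed out"*"while reading response header"*)
            echo "504 read-timeout" ;;        # proxy_read_timeout fired
        *)
            echo "unknown" ;;
    esac
}

# Example, commented out so the sketch has no side effects:
# sudo tail -n 500 /var/log/nginx/error.log |
#     while IFS= read -r line; do classify_upstream_error "$line"; done |
#     sort | uniq -c
```

The order of the patterns matters: a no live upstreams line also contains while connecting to upstream, so it has to be matched first.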

This is a good place to lean on AI. Paste the raw line and ask for a decode rather than trying to remember what errno 111 vs 110 means at 2 a.m.

Prompt: “I’m debugging NGINX. Here is a line from error.log. Tell me: (1) is this a 502 or 504, (2) what exactly failed and at which phase (connect vs read), (3) the 3 most likely root causes ranked, and (4) which NGINX directive controls this behavior. Do not suggest config changes yet — just diagnose. Line: *1190 upstream timed out (110: Connection timed out) while reading response header from upstream ... upstream: "http://10.0.5.21:8080/export"”

AI output: “(1) This presents as a 504. (2) The failure was during reading the response header — NGINX connected successfully but the upstream did not return headers before proxy_read_timeout elapsed. (3) Ranked causes: a slow /export operation exceeding the timeout; an upstream process blocked on a downstream dependency (DB/queue); a worker pool exhaustion on the upstream so the request queued. (4) The governing directive is proxy_read_timeout (default 60s), not proxy_connect_timeout, because the connect phase succeeded. Confirm by checking whether /export consistently takes longer than your current proxy_read_timeout.”

That’s the right division of labor. The model decodes and ranks; you confirm against your actual system before changing anything.

Confirm the upstream from the outside

Before touching nginx.conf, prove what the log claims. If it says connection refused on 127.0.0.1:8080, go check that directly from the same box NGINX runs on.

# Is anything actually listening on the upstream port?
sudo ss -ltnp | grep ':8080'

# Try the upstream directly, bypassing NGINX entirely
curl -v --max-time 5 http://127.0.0.1:8080/healthz

# For a PHP-FPM socket upstream, check the socket exists and is owned correctly
ls -l /run/php/php8.3-fpm.sock

If ss shows nothing on 8080, NGINX is right and the bug is in your app or its service unit — fix that, not the proxy. If the upstream responds fine to a direct curl but NGINX still 502s, now you’ve narrowed it to the NGINX-to-upstream hop: wrong port in proxy_pass, a firewall between NGINX and a remote upstream, or SELinux blocking the connection (setsebool -P httpd_can_network_connect 1 is the classic miss on RHEL boxes).
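If you want the listening check to be scriptable rather than eyeballed, something like the following works as a rough sketch. The function name port_is_listening is hypothetical, and it assumes the usual column layout of ss -ltn (state in column 1, local address:port in column 4), which is worth confirming on your own box.

```shell
# Hypothetical helper: reads `ss -ltn` output on stdin and succeeds only if
# some listener is bound to exactly the given port.
port_is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            # Column 4 is local address:port; the port is the last ":" field,
            # which also handles IPv6 forms such as [::]:8080.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example, commented out so the sketch has no side effects:
# if sudo ss -ltn | port_is_listening 8080; then
#     echo "something is listening on 8080"
# else
#     echo "nothing on 8080: look at the app, not the proxy"
# fi
```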

Tuning timeouts for 504s — deliberately, not reflexively

If the log says upstream timed out while reading response header and you’ve confirmed the operation is legitimately slow (not wedged), then raising proxy_read_timeout is a valid fix — for that specific location only. Don’t raise it globally; a 5-minute read timeout on your whole server turns one slow endpoint into a way to exhaust worker connections.

upstream app_backend {
    server 10.0.5.21:8080 max_fails=3 fail_timeout=30s;
    server 10.0.5.22:8080 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 443 ssl;
    server_name app.example.com;
    # ssl_certificate and ssl_certificate_key omitted for brevity

    # Required for upstream keepalive to actually work. Set at server level so
    # both locations inherit it (neither defines its own proxy_set_header).
    proxy_http_version 1.1;
    proxy_set_header Connection "";

    location /export {
        proxy_pass http://app_backend;

        # This endpoint runs a long report: give it room, but scope it here only
        proxy_connect_timeout 5s;     # fail fast if the upstream won't accept
        proxy_read_timeout    180s;   # allow the slow response to complete
        proxy_send_timeout    30s;
    }

    location / {
        proxy_pass http://app_backend;
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;       # default-ish for everything else
    }
}

A few opinions baked in there. Keep proxy_connect_timeout short (a few seconds) — if a backend can’t even accept a connection in 5 seconds, it’s down, and you want to fail to the next upstream quickly rather than hang the client. Reserve the long proxy_read_timeout for the one location that needs it. And if you run a keepalive pool to your upstreams, you must set proxy_http_version 1.1 and clear the Connection header, or NGINX opens a fresh connection per request and you get phantom latency that looks like an upstream problem but isn’t.
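Before raising proxy_read_timeout for an endpoint, I like to measure it directly against the upstream. The sketch below is one way to do that, under a few assumptions: exceeds_timeout is a hypothetical helper name, the URL is the upstream from the log line earlier, and curl’s time_starttransfer (time to first byte) is only an approximation of what NGINX waits for, since proxy_read_timeout applies between successive reads rather than to the whole response.

```shell
# Hypothetical helper: succeeds when the measured seconds exceed the limit.
# Both arguments are plain decimal numbers, e.g. 75.2 and 60.
exceeds_timeout() {
    awk -v t="$1" -v limit="$2" 'BEGIN { exit ((t + 0 > limit + 0) ? 0 : 1) }'
}

# Example, commented out so the sketch has no side effects:
# ttfb=$(curl -s -o /dev/null --max-time 300 -X POST \
#     -w '%{time_starttransfer}' http://10.0.5.21:8080/export)
# if exceeds_timeout "$ttfb" 60; then
#     echo "first byte after ${ttfb}s: slower than a 60s proxy_read_timeout"
# fi
```

Run it a few times. A consistent 90 seconds suggests a legitimately slow endpoint; wildly varying numbers point more toward contention or a wedged dependency, which a longer timeout won’t fix.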

no live upstreams and the health-check trap

no live upstreams is the one that fools people. By the time you see it, NGINX has already ejected every backend based on passive health tracking: max_fails=3 fail_timeout=30s means three failed attempts within a 30-second window mark a server unavailable for the next 30 seconds. One bad deploy that refuses or times out a few connections on each backend can take the whole pool offline for up to 30 seconds, even after the app recovers.

The fix is rarely “add more timeout.” It’s understanding that the earlier errors are the real bug. Scroll back, find the first connect() failed or upstream timed out, and fix that. The no live upstreams lines are just the aftershock. AI is handy here for the grunt work of correlating a noisy log.

Prompt: “Here are 200 lines of NGINX error.log around an incident. Group the upstream errors by failure type and by upstream address, give me a timeline, and tell me which upstream failed first and what its error was. I’m trying to find the trigger, not the cascade.”

You read the timeline it produces, then you go verify the first failing upstream by hand. The model organizes; you decide.
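The "which upstream failed first" question can also be answered mechanically, which gives you something to check the model’s timeline against. This is a minimal sketch: first_upstream_failures is a hypothetical name, it assumes each log entry is on a single line (as NGINX writes them), and it groups by scheme, host, and port only, so it is aimed at http upstreams rather than Unix sockets.

```shell
# Hypothetical helper: reads error.log lines on stdin and prints
# "<date> <time> <upstream>" for the first error seen per upstream, in log order.
first_upstream_failures() {
    awk '
        match($0, "upstream: \"[a-z]+://[^/\"]+") {
            # Skip the 11-character prefix (upstream: plus space and quote)
            # and keep scheme://host:port without the request path.
            up = substr($0, RSTART + 11, RLENGTH - 11)
            if (!(up in seen)) {
                seen[up] = 1
                print $1, $2, up
            }
        }
    '
}

# Example, commented out so the sketch has no side effects:
# sudo grep -F '[error]' /var/log/nginx/error.log | first_upstream_failures
```

The top line of the output is the trigger candidate. Entries for the upstream group name itself (for example http://backend) are the no live upstreams aftershock, not a real backend.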

FastCGI / PHP-FPM: a 502 with a different shape

PHP sites usually proxy over a FastCGI socket, and the failure looks slightly different in the log: connect() to unix:/run/php/php8.3-fpm.sock failed (2: No such file or directory) or (13: Permission denied). Errno 2 means the socket path is wrong or FPM isn’t running; errno 13 means NGINX’s worker user isn’t allowed to open the socket, usually a listen.owner/listen.group/listen.mode mismatch in the FPM pool config.

location ~ \.php$ {
    fastcgi_pass unix:/run/php/php8.3-fpm.sock;
    fastcgi_index index.php;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;

    # FastCGI equivalents of the proxy timeouts above
    fastcgi_connect_timeout 5s;
    fastcgi_read_timeout 120s;
}
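To tell the errno 2 case from the errno 13 case without waiting for the next failed request, a quick check of the socket itself helps. The sketch below is a rough first pass: check_fpm_socket is a hypothetical name, and the result only reflects NGINX’s view if you run it as the NGINX worker user (the account named by the user directive in nginx.conf). A parent directory the user cannot traverse will also show up as missing here.

```shell
# Hypothetical helper: report the state of a PHP-FPM Unix socket as seen by
# the current user.
check_fpm_socket() {
    sock="$1"
    if [ ! -e "$sock" ]; then
        echo "missing"          # lines up with errno 2: wrong path, or FPM not running
    elif [ ! -S "$sock" ]; then
        echo "not-a-socket"     # something exists at that path, but it is not a socket
    elif [ ! -r "$sock" ] || [ ! -w "$sock" ]; then
        echo "no-permission"    # lines up with errno 13: owner/group/mode mismatch
    else
        echo "ok"
    fi
}

# Example, commented out so the sketch has no side effects:
# check_fpm_socket /run/php/php8.3-fpm.sock
```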

If you get a 504 against PHP-FPM, also check pm.max_children in the FPM pool — an exhausted worker pool queues requests until fastcgi_read_timeout fires, and bumping the NGINX timeout just hides a capacity problem you should actually solve.
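PHP-FPM logs a warning containing "server reached pm.max_children setting" when the pool runs out of workers, so counting those lines is a cheap way to confirm a capacity problem before touching any timeout. A minimal sketch follows; count_max_children_warnings is a hypothetical name, and the log path in the example is an assumption that varies by distro and FPM config.

```shell
# Hypothetical helper: reads a PHP-FPM log on stdin and prints how many times
# the pool hit pm.max_children. Prints 0 (and still succeeds) when there are none.
count_max_children_warnings() {
    grep -cF 'server reached pm.max_children setting' || true
}

# Example, commented out so the sketch has no side effects
# (the log path is an assumption; check error_log in your FPM config):
# sudo cat /var/log/php8.3-fpm.log | count_max_children_warnings
```

A non-zero count around the time of the 504s suggests the fix belongs in the pool sizing or in the slow PHP code, not in fastcgi_read_timeout.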

Validate before you reload — every time

This is the non-negotiable part of the loop. AI drafts the directive, you read it, but nothing reaches a worker until NGINX itself signs off on the syntax. Then reload gracefully so in-flight requests finish on the old workers.

# Syntax + semantic check — NEVER reload without this passing
sudo nginx -t

# Graceful reload: spins up new workers, drains the old ones
sudo nginx -s reload

If nginx -t fails, you caught a bad edit before it took down the site instead of after. That single check is what makes it safe to take AI-drafted config seriously — the model can be confidently wrong about a directive name or context, and nginx -t is the deterministic gate that doesn’t care how confident anyone was.
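To make the gate hard to skip at 2 a.m., the two commands can be wrapped so the reload only happens when the test passes. This is a minimal sketch: safe_reload is a hypothetical name, and it calls nginx directly, so run it as root or adapt it to use sudo as in the commands above.

```shell
# Hypothetical helper: reload only if the config test passes; otherwise leave
# the running workers untouched and report failure.
safe_reload() {
    if nginx -t; then
        nginx -s reload
    else
        echo "nginx -t failed; not reloading" >&2
        return 1
    fi
}

# Example, commented out so the sketch has no side effects:
# safe_reload && echo "reloaded with the new config"
```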

That’s the whole method: read the error.log line precisely, confirm the upstream from outside NGINX, let AI decode and draft while you verify against your real system, and gate every change behind nginx -t and a graceful reload. If you want to extend the same reviewer mindset to your config’s safety posture, see reviewing NGINX security configuration with AI. More NGINX-focused walkthroughs live on the NGINX category hub, and the diagnostic prompts I reach for are collected in the prompt library so you’re not retyping them mid-incident.
