AI for NGINX · By James Joyner IV · 9 min read

NGINX Error Guide: 'no live upstreams while connecting to upstream'

Fix the NGINX no live upstreams error when every upstream block member is ejected by max_fails and fail_timeout passive health checks, causing 502s.

  • #nginx
  • #troubleshooting
  • #errors
  • #upstream

Exact Error Message

When NGINX is configured as a reverse proxy or load balancer and all backends in an upstream{} block have been marked unavailable, you will see this in your error log:

2026/06/27 09:31:44 [error] 2841#2841: *10590 no live upstreams while connecting to upstream, client: 203.0.113.7, server: app.example.com, request: "GET / HTTP/1.1", upstream: "http://backend/", host: "app.example.com"

The client receives an HTTP 502 Bad Gateway. The phrase no live upstreams is the key signal. NGINX is not reporting a refused connection or a timeout for a single backend; it is reporting that zero eligible backends remain in the group, so it did not even attempt a connection.
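
To see which upstream group is running dry, the log line can be summarised per upstream name. The following is a minimal sketch that assumes the error-log format shown above; the function name count_no_live_upstreams and the log path in the usage comment are illustrative and not part of NGINX.

```shell
# Summarise "no live upstreams" errors per upstream name.
# Reads error-log lines on stdin; the function name is hypothetical.
count_no_live_upstreams() {
    grep -F 'no live upstreams while connecting to upstream' \
        | sed -n 's/.*upstream: "\([^"]*\)".*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage (the log path is an assumption; adjust for your install):
#   sudo cat /var/log/nginx/error.log | count_no_live_upstreams
```

Each output line is a count followed by the upstream value from the log entry, so a single group that dominates the list is the one to investigate first.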

What the Error Means

NGINX upstream groups use passive health checks in the open-source build. Every time NGINX proxies a request to a backend server and that attempt fails (connection refused, timeout, or a response code listed in proxy_next_upstream), NGINX records a failure against that specific server. Two directives control the bookkeeping:

  • max_fails (default 1): the number of failed attempts within the fail_timeout window that will cause NGINX to mark the server as unavailable.
  • fail_timeout (default 10s): both the window in which failures are counted and the length of time the server stays ejected once it crosses max_fails.

When every server in the group is in the ejected state at the same moment, NGINX has nowhere to send the request. Rather than wait, it immediately returns 502 and logs no live upstreams. This is different from a single-backend failure: with one healthy peer left, NGINX would simply retry against it.

One important nuance: an ejected server is not gone forever. After fail_timeout elapses, NGINX will route a single request to that server to retest it. If that probe succeeds the server rejoins the rotation; if it fails, the server is ejected again for another fail_timeout.
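
The bookkeeping described above can be sketched as a small shell function. This is a simplified model of the behaviour described in this section, not NGINX's internal implementation; is_ejected and its arguments (timestamps in whole seconds) are hypothetical names chosen for illustration.

```shell
# is_ejected NOW MAX_FAILS FAIL_TIMEOUT FAILURE_TS...
# Simplified model: a peer is ejected at NOW if some past failure was the
# one that brought the count inside the fail_timeout window up to
# MAX_FAILS, and fewer than FAIL_TIMEOUT seconds have passed since then.
is_ejected() {
    local now=$1 max_fails=$2 fail_timeout=$3
    shift 3
    # max_fails=0 disables failure accounting for the peer.
    [ "$max_fails" -eq 0 ] && return 1
    local trip other in_window
    for trip in "$@"; do
        # Skip failures in the future or whose ejection period has expired.
        [ "$trip" -le "$now" ] || continue
        [ $(( now - trip )) -le "$fail_timeout" ] || continue
        # Count failures inside the window that ends at this failure.
        in_window=0
        for other in "$@"; do
            if [ "$other" -le "$trip" ] && [ $(( trip - other )) -le "$fail_timeout" ]; then
                in_window=$(( in_window + 1 ))
            fi
        done
        [ "$in_window" -ge "$max_fails" ] && return 0
    done
    return 1
}

# Two failures at t=90 and t=95 with max_fails=2 fail_timeout=15:
#   is_ejected 106 2 15 90 95 && echo ejected     # prints "ejected"
#   is_ejected 111 2 15 90 95 || echo available   # prints "available"
```

In the model, the peer becomes eligible again once fail_timeout has passed since the failure that tripped the threshold, which mirrors the retest behaviour described above.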

Common Causes

  • All backends genuinely down. A bad deploy, an OOM-killed app, or a crashed container means every peer refuses connections. This is the most common and most honest cause.
  • Aggressive passive health-check tuning. A low max_fails combined with a short fail_timeout can eject every server during a brief latency spike, even though the apps are healthy.
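  • proxy_next_upstream counting normal responses as failures. If you add status conditions such as http_500, http_502, http_503, or http_504 and your app returns those codes under load, NGINX treats each one as a failed attempt and can eject the peers. Note that non_idempotent does not define a failure by itself; it only allows retries of non-idempotent requests such as POST.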
  • DNS resolution of upstream hostnames. If you use hostnames in the upstream block and the names resolve to addresses that are stale, unreachable, or empty, every connection attempt fails and the group empties out.
  • Flapping. Backends that go up and down repeatedly (slow startup, failing liveness, GC pauses) cross max_fails faster than they recover, so the group is empty more often than not.

How to Reproduce the Error

Create a minimal upstream group pointing at two backends and stop both of them. With tight thresholds the error appears almost instantly:

upstream backend {
    server 10.0.0.11:8080 max_fails=2 fail_timeout=15s;
    server 10.0.0.12:8080 max_fails=2 fail_timeout=15s;
}

server {
    listen 80;
    server_name app.example.com;
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_502 http_503;
    }
}

Stop both app processes, then send a few requests. Each failed attempt counts against the peer that was tried, and with max_fails=2 a peer is ejected after its second failure. Once both peers are ejected, the next request logs no live upstreams and returns 502.
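
A short loop makes the transition easier to observe. The sketch below assumes NGINX is listening on 127.0.0.1 port 80 with the example server_name; the function names send_requests and tally_status_codes are illustrative.

```shell
# Tally HTTP status codes read from stdin, one code per line.
tally_status_codes() {
    sort | uniq -c | sort -rn
}

# Send N requests through NGINX and print one status code per request.
# The default URL and the Host header are assumptions taken from the
# example config; curl typically reports 000 when it got no HTTP response.
send_requests() {
    local n=${1:-10} url=${2:-http://127.0.0.1/} i
    for i in $(seq 1 "$n"); do
        curl -s -o /dev/null --max-time 3 \
             -H 'Host: app.example.com' -w '%{http_code}\n' "$url" || true
    done
}

# Usage: send_requests 10 | tally_status_codes
```

Run it while tailing the error log: the 502s look identical to the client, but the log should switch from per-backend connection errors to no live upstreams once both peers are ejected.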

Diagnostic Commands

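Start by confirming the config is valid and dumping the upstream block as NGINX loaded it. Note that nginx -T prints the configuration files as written, so defaults that are not spelled out (max_fails=1, fail_timeout=10s) will not appear in the output: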

sudo nginx -t
sudo nginx -T | grep -A20 'upstream backend'
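
Because nginx -T prints the configuration as written, implied defaults are easy to overlook. The following sketch lists each upstream server line with its effective thresholds, filling in the documented defaults (max_fails=1, fail_timeout=10s) when they are not set explicitly. The function name show_peer_thresholds is hypothetical, and the parsing assumes one server directive per line.

```shell
# List upstream peers with their effective max_fails and fail_timeout.
# Reads NGINX configuration text on stdin.
show_peer_thresholds() {
    awk '
        # Match "server ADDRESS ...;" but not "server {" or "server_name".
        $1 == "server" && $2 != "{" && /;[ \t]*$/ {
            addr = $2; sub(/;$/, "", addr)
            max_fails = "1 (default)"; fail_timeout = "10s (default)"
            for (i = 3; i <= NF; i++) {
                tok = $i; sub(/;$/, "", tok)
                if (tok ~ /^max_fails=/)    { max_fails = substr(tok, 11) }
                if (tok ~ /^fail_timeout=/) { fail_timeout = substr(tok, 14) }
            }
            print addr, "max_fails=" max_fails, "fail_timeout=" fail_timeout
        }
    '
}

# Usage: sudo nginx -T 2>/dev/null | show_peer_thresholds
```

Any peer reported with the default values is ejected after a single failed attempt, which is worth knowing before you start tuning.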

Check that NGINX itself is running and listening, and review recent service logs:

journalctl -u nginx --since "15 min ago" --no-pager
sudo ss -ltnp | grep nginx

Now test each backend directly, bypassing NGINX entirely. This tells you whether the peers are actually up:

curl -I --max-time 3 http://10.0.0.11:8080/
curl -I --max-time 3 http://10.0.0.12:8080/

If your application exposes a health endpoint, hit it on each backend so you see the app’s own view of health, not just whether the socket accepts:

curl -sS --max-time 3 http://10.0.0.11:8080/healthz
curl -sS --max-time 3 http://10.0.0.12:8080/healthz

If the upstream uses hostnames rather than IPs, resolve them to confirm DNS returns the addresses you expect:

getent hosts backend-1.svc.internal
dig +short backend-2.svc.internal

A curl -I that returns 200 while NGINX still logs no live upstreams is the classic signature of over-aggressive ejection or proxy_next_upstream misconfiguration: the backends are fine, but NGINX has marked them dead.
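
The direct checks above can be wrapped in a loop so every peer is probed the same way. This is a rough sketch: the function names and verdict labels are hypothetical, and it requests the root path with GET, which your application may or may not treat as a meaningful health signal.

```shell
# Map a curl HTTP status code to a rough verdict.
# curl reports 000 when it could not get any HTTP response.
classify_http_code() {
    case "$1" in
        ""|000)  echo "unreachable" ;;
        2??|3??) echo "responding" ;;
        5??)     echo "responding-with-5xx" ;;
        *)       echo "responding-other" ;;
    esac
}

# Probe each backend directly, bypassing NGINX.
probe_backends() {
    local peer code
    for peer in "$@"; do
        code=$(curl -s -o /dev/null --max-time 3 -w '%{http_code}' "http://$peer/" || true)
        printf '%s %s %s\n' "$peer" "$code" "$(classify_http_code "$code")"
    done
}

# Usage: probe_backends 10.0.0.11:8080 10.0.0.12:8080
```

If every peer prints responding while NGINX still logs no live upstreams, move on to the threshold and proxy_next_upstream checks in the resolution steps.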

Step-by-Step Resolution

1. Confirm whether the backends are truly down. Use the direct curl -I commands above. If they fail, this is an application or infrastructure outage, not an NGINX tuning problem. Restart or roll back the backends and the upstream group recovers automatically after fail_timeout.

2. If backends respond directly, loosen the passive health check. The default max_fails=1 is brittle: one failed attempt ejects the peer. Give each peer more tolerance so a single blip does not eject it:

upstream backend {
    server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;
}

Here a server must fail three times in a 30-second window before it is ejected, and it is retested after 30 seconds.

3. Stop counting application errors as upstream failures. Review proxy_next_upstream. Including http_500 or http_503 means a legitimate app error ejects the peer. Narrow it to genuine connection problems:

proxy_next_upstream error timeout;
proxy_next_upstream_tries 2;
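
To find where status codes are already being counted as failures, you can scan the loaded configuration. The sketch below flags proxy_next_upstream lines that include http_500, http_502, http_503, http_504, or http_429; the function name find_status_retries is hypothetical, and the line numbers it prints refer to the text it was given, not to a specific file.

```shell
# Print proxy_next_upstream lines that treat HTTP status codes as failures.
# Reads NGINX configuration text on stdin; prints nothing if none match.
find_status_retries() {
    grep -nE '^[[:space:]]*proxy_next_upstream[[:space:]][^;]*http_(500|502|503|504|429)' || true
}

# Usage: sudo nginx -T 2>/dev/null | find_status_retries
```

Each hit is a candidate for narrowing to error timeout, unless you have a specific reason to retry on that status.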

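4. Fix DNS for hostname-based upstreams. By default, open-source NGINX resolves hostnames in an upstream block only at startup and reload. If your backend IPs change (containers, autoscaling), you have a few options. You can add a resolver and put the hostname in a variable in proxy_pass so NGINX re-resolves at request time; be aware that this bypasses the upstream block, so max_fails, fail_timeout, and backup no longer apply to that location. Recent releases (1.27.3 and later, according to the official module documentation) also accept the resolve parameter on the server directive when a resolver and a shared-memory zone are configured; older releases need NGINX Plus for that. At minimum, reload NGINX after the backing IPs change so it picks up fresh records.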

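5. Keep a fallback for when all primaries are ejected. Mark a known-stable host as a backup; NGINX sends traffic to it only when the primary servers are unavailable. A backup peer is subject to the same passive checks, so this reduces the chance of an empty group rather than eliminating it: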

upstream backend {
    server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.99:8080 backup;
}

6. Validate and reload. Always test before applying:

sudo nginx -t && sudo systemctl reload nginx

A reload is graceful, so in-flight requests are not dropped.

Prevention and Best Practices

  • Tune max_fails and fail_timeout for your real failure profile. A web app with occasional slow requests needs more tolerance than a static cache. Avoid max_fails=1 in production.
  • Be conservative with proxy_next_upstream. Only retry on error timeout unless you have a specific reason. Never blindly add http_500.
  • Keep upstream targets stable. Use service discovery or a fixed VIP rather than ephemeral container IPs in the static upstream block.
  • Add active health checks if you can run NGINX Plus, or front your pool with a load balancer that does active probing. Passive checks only react after real user requests fail.
  • Alert on the log line. Wire an alert on no live upstreams so you hear about an empty pool before users do. If you want help building that detection and a runbook, the incident response dashboard can turn this error into actionable steps.
  • Match fail_timeout to your backend recovery time. If apps take 20 seconds to restart, a 10-second fail_timeout will retest too early and re-eject them, producing flapping.

Related Errors

  • 502 Bad Gateway with connect() failed (111: Connection refused): a single backend refused a connection; the group is not yet empty.
  • upstream timed out (110: Connection timed out): connecting to the backend, or reading its response, took too long; repeated occurrences feed max_fails.
  • no resolver defined to resolve <host>: you used a variable in proxy_pass without a resolver directive.
  • 504 Gateway Time-out: distinct from this error; the backend was reachable but exceeded proxy_read_timeout.
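
The "Alert on the log line" advice under Prevention and Best Practices can be prototyped with a few lines of shell before you wire it into a real monitoring system. This is a sketch under stated assumptions: the function names are hypothetical, the usage comment relies on GNU date, and the log path may differ on your system.

```shell
# Count "no live upstreams" errors logged at or after a given timestamp.
# SINCE uses the error-log format "YYYY/MM/DD HH:MM:SS", which sorts
# lexicographically, so a plain string comparison is enough.
count_since() {
    local since=$1
    awk -v since="$since" '
        /no live upstreams/ && ($1 " " $2) >= since { n++ }
        END { print n + 0 }
    '
}

# Succeed while COUNT is below THRESHOLD; fail once it is reached.
check_threshold() {
    local count=$1 threshold=$2
    [ "$count" -lt "$threshold" ]
}

# Usage (GNU date and the log path are assumptions):
#   since=$(date -d '5 minutes ago' '+%Y/%m/%d %H:%M:%S')
#   n=$(sudo cat /var/log/nginx/error.log | count_since "$since")
#   check_threshold "$n" 5 || echo "ALERT: $n empty-pool errors since $since"
```

Run from cron or a systemd timer, this gives a crude but workable signal; a log-based alert in your existing monitoring stack is the more durable option.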

For more NGINX troubleshooting, browse the NGINX category.

Frequently Asked Questions

Why do I get 502 instantly instead of NGINX waiting for a backend?

Because the group has no eligible peers. NGINX does not block waiting for an ejected server to recover within a request. When every peer is marked unavailable it short-circuits to 502 and logs no live upstreams. The ejected servers are only retested on a new request after fail_timeout elapses.

My backends are clearly up, so why does NGINX think they are down?

Passive health checks marked them ejected based on earlier failures. A low max_fails, a short fail_timeout, or counting http_500/http_503 in proxy_next_upstream can all eject healthy peers. Confirm with a direct curl -I to each backend, then loosen the thresholds.

How long until NGINX uses an ejected backend again?

Once fail_timeout has elapsed, NGINX routes one live client request to the ejected server as a probe. If it succeeds, the server rejoins the rotation; if it fails, the server is ejected for another fail_timeout. There is no separate recovery timer in the open-source build.

Does reloading NGINX clear the ejected state?

Yes. sudo systemctl reload nginx rebuilds the upstream groups, resetting all peers to the available state and re-resolving any hostnames. It is a useful fast recovery step once the underlying backends are healthy again.
