NGINX Error Guide: 'recv() failed (104: Connection reset by peer)' from Upstream
Fix NGINX recv() failed (104: Connection reset by peer) while reading response header from upstream, caused by backend crashes, OOM kills, and stale keepalive.
- #nginx
- #troubleshooting
- #errors
- #upstream
Exact Error Message
When NGINX proxies a request and the backend abruptly drops the TCP connection, it logs an entry like this in your error.log and returns a 502 Bad Gateway to the client:
2026/06/27 15:09:33 [error] 2841#2841: *10544 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 203.0.113.7, server: app.example.com, request: "POST /api/jobs HTTP/1.1", upstream: "http://127.0.0.1:9000/api/jobs", host: "app.example.com"
The key fragments are recv() failed (104: Connection reset by peer) and while reading response header from upstream. Together they tell you exactly what happened: NGINX called recv() to read the backend’s response, and the kernel returned errno 104 (ECONNRESET).
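To see which upstream the resets cluster on, the log entry above can be broken down with standard text tools. The following is a minimal sketch, assuming the default error.log line layout shown above; the function name count_resets_by_upstream is hypothetical, not an NGINX tool.

```shell
# Count "104: Connection reset by peer" entries per upstream URL.
# Reads error.log lines on stdin, prints "<count> <upstream>" (highest first).
# Assumes the upstream: "..." field appears as in the sample entry above.
count_resets_by_upstream() {
  grep -F 'recv() failed (104: Connection reset by peer)' \
    | sed -n 's/.*upstream: "\([^"]*\)".*/\1/p' \
    | sort | uniq -c | sort -rn \
    | sed 's/^ *//'
}
```

A typical invocation would be count_resets_by_upstream < /var/log/nginx/error.log, adjusting the path to wherever your error_log directive points.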
What the Error Means
NGINX had successfully connected to the upstream and sent the request. It was waiting to read the response headers when the backend’s TCP stack sent a RST (reset) packet, tearing down the connection immediately. 104 is the Linux errno value for ECONNRESET, the POSIX error that means the peer forcibly closed the socket.
This is distinct from a graceful shutdown. There are two ways a connection ends:
- Clean close (FIN): The backend finishes (or chooses to close) and sends a TCP FIN. NGINX logs upstream prematurely closed connection while reading response header from upstream. This usually means the backend closed an idle keepalive connection or exited normally without writing a full response.
- Reset (RST): The backend’s socket is torn down hard, with no clean handshake. NGINX logs recv() failed (104: Connection reset by peer). This typically means the backend process died (crash, OOM kill, segfault), was killed by its own timeout/limit logic, or a stale keepalive socket was reused after the kernel had already discarded the connection.
Both produce a 502, but the cause and fix differ. A RST points at something violent happening to the connection, not a polite goodbye.
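Because the two log messages point at different causes, it can help to count them separately before digging further. This is a rough sketch that only matches the two message strings quoted above; the function name classify_upstream_closes is an assumption made for illustration.

```shell
# Count hard resets (RST) versus clean closes (FIN) in an NGINX error log.
# $1 = path to the error log. Prints "rst=<n> fin=<n>".
classify_upstream_closes() (
  log_file=$1
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  rst=$(grep -cF 'recv() failed (104: Connection reset by peer)' "$log_file" || true)
  fin=$(grep -cF 'upstream prematurely closed connection' "$log_file" || true)
  printf 'rst=%s fin=%s\n' "$rst" "$fin"
)
```

A result dominated by rst suggests looking at crashes, OOM kills, and stale sockets first; a result dominated by fin suggests idle timeouts or backends exiting without a full response.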
Common Causes
- Backend worker crashed, was OOM-killed, or segfaulted. PHP-FPM, Gunicorn, a Node process, or a Java app died mid-response. The kernel resets any open sockets the dead process owned. OOM kills are the most common silent culprit.
- Stale keepalive connection reused. NGINX keeps upstream connections alive and reuses them. If the backend closed its side (idle timeout, max_requests, restart) but NGINX still has the socket in its pool, the next write triggers a RST. This is the classic case when upstream keepalive is misconfigured, for example when proxy_http_version 1.1 and proxy_set_header Connection "" are missing or NGINX holds idle sockets longer than the backend does.
- Backend request or timeout limit killed the connection. Gunicorn’s --timeout, PHP-FPM’s request_terminate_timeout, or a Java container’s request limit can kill the worker handling a slow request, resetting the socket before a response is sent.
- Request body too large for the upstream. The backend (or an app-level limit like client_max_body_size on a second proxy, or a framework body cap) rejects an oversized upload by closing the connection hard instead of returning a 413.
- MTU / network resets. Path MTU issues, a firewall or load balancer with an aggressive idle timeout, or a NAT table eviction can inject a RST on a connection NGINX believed was open. This is common across container/overlay networks.
How to Reproduce the Error
The cleanest reproduction is killing the backend mid-request. With a Gunicorn or PHP-FPM app behind NGINX, send a request that the backend starts handling, then kill the worker:
# Terminal 1: send a slow request through NGINX
curl -s -X POST http://app.example.com/api/jobs -d '{"sleep":10}'
# Terminal 2: hard-kill the backend worker while the request is in flight
pkill -9 -f 'gunicorn: worker'
SIGKILL (-9) prevents a clean shutdown, so the kernel resets the open socket and NGINX logs recv() failed (104: Connection reset by peer). To reproduce the stale-keepalive variant, set a very short keepalive on the backend and a long one in NGINX, then send two requests spaced just beyond the backend’s idle timeout: the second reuses a dead socket and resets.
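The stale-keepalive variant can be probed with two spaced requests. The sketch below is illustrative only: the function name probe_keepalive_gap is hypothetical, and whether the second request actually lands on the same pooled upstream socket depends on your worker count and traffic, so treat a clean result as inconclusive.

```shell
# Send two requests separated by a pause and print both HTTP status codes.
# $1 = URL served through NGINX
# $2 = pause in seconds (pick a value just above the backend's idle timeout)
probe_keepalive_gap() (
  url=$1
  gap=$2
  first=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  sleep "$gap"
  second=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  printf 'first=%s second=%s\n' "$first" "$second"
)
```

For example, probe_keepalive_gap http://app.example.com/api/jobs 6 against a backend with a 5-second idle timeout; a 502 on the second request, paired with a new 104 entry in error.log, points at stale socket reuse.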
Diagnostic Commands
Start by confirming the NGINX config is valid and inspecting the upstream/keepalive settings (all read-only):
# Validate config syntax
sudo nginx -t
# Dump the full effective config and check keepalive / version / Connection headers
sudo nginx -T 2>/dev/null | grep -nE 'proxy_http_version|keepalive|Connection|upstream|proxy_pass'
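The grep above leaves you to eyeball the matches. As a hedged alternative, the small filter below reports whether each of the three keepalive-related directives appears anywhere in the dump. The function name check_upstream_keepalive is hypothetical, and the check is global: it does not verify that the directives sit in the same location or upstream block that serves the failing request, and it only recognizes the double-quoted form of the empty Connection value.

```shell
# Read an NGINX config dump on stdin and report which keepalive-related
# directives are present somewhere in it.
check_upstream_keepalive() (
  conf=$(cat)
  # report <label> <extended-regex>: prints "<label>: ok" or "<label>: missing"
  report() {
    if printf '%s\n' "$conf" | grep -E "$2" >/dev/null; then
      printf '%s: ok\n' "$1"
    else
      printf '%s: missing\n' "$1"
    fi
  }
  report 'proxy_http_version 1.1' \
    '^[[:space:]]*proxy_http_version[[:space:]]+1\.1[[:space:]]*;'
  report 'proxy_set_header Connection ""' \
    '^[[:space:]]*proxy_set_header[[:space:]]+Connection[[:space:]]+""[[:space:]]*;'
  report 'upstream keepalive pool' \
    '^[[:space:]]*keepalive[[:space:]]+[0-9]+[[:space:]]*;'
)
```

It is meant to be fed from the same dump: sudo nginx -T 2>/dev/null | check_upstream_keepalive.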
Hit the backend directly, bypassing NGINX, to see whether the backend itself is healthy:
# Talk to the upstream directly (adjust host:port to your proxy_pass target)
curl -sv http://127.0.0.1:9000/api/jobs -X POST -d '{"sleep":1}'
Confirm the backend is actually listening and look at connection states:
# Is the upstream port listening, and which process owns it?
ss -ltnp | grep ':9000'
# Look for connections to the backend stuck in CLOSE-WAIT or piling up in TIME-WAIT
ss -tan | grep ':9000'
Check logs on both sides, and crucially the kernel OOM killer:
# NGINX service + the backend service logs
journalctl -u nginx --since '15 min ago' --no-pager
journalctl -u gunicorn --since '15 min ago' --no-pager # or php-fpm, your-app.service
# Backend application/error logs (read)
tail -n 100 /var/log/php-fpm/error.log
tail -n 100 /var/log/gunicorn/error.log
# The smoking gun for OOM kills (-T prints human-readable timestamps; sudo may be needed)
sudo dmesg -T | grep -iE 'oom|killed process'
If dmesg shows Out of memory: Killed process ... (gunicorn) lines that line up with the 502 timestamps, you have your answer: the backend is being OOM-killed, and the RST is a symptom.
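To see at a glance which process names the OOM killer has been targeting, the kernel lines can be summarized. This is a sketch that assumes the "Killed process <pid> (<name>)" wording quoted above; the function name oom_victims is hypothetical.

```shell
# Summarize OOM-killer victims by process name.
# Reads kernel log lines on stdin, prints "<count> <process name>".
oom_victims() {
  grep -F 'Killed process' \
    | sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' \
    | sort | uniq -c | sort -rn \
    | sed 's/^ *//'
}
```

Either sudo dmesg -T | oom_victims or journalctl -k --no-pager | oom_victims should work as input, depending on how long your kernel ring buffer retains messages.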
Step-by-Step Resolution
- Correlate timestamps. Match the error.log 502 times against dmesg OOM lines and the backend’s own logs. If the backend logged a crash or traceback, or was OOM-killed at that instant, fix the backend, not NGINX.
- If it is a crash or OOM kill: Raise the memory limit (or the container memory cgroup limit), reduce per-worker memory (fewer threads, smaller caches), or add workers/instances. For PHP-FPM check pm.max_children and memory_limit; for Gunicorn watch worker memory growth and consider --max-requests with --max-requests-jitter to recycle leaky workers gracefully.
- If it is stale keepalive reuse: This is the most common and most fixable case. In the location (or server) block that proxies to the upstream, ensure HTTP/1.1 and an empty Connection header so connections are reused correctly and not poisoned:
proxy_http_version 1.1;
proxy_set_header Connection "";
And in the upstream block, set a keepalive pool size (e.g. keepalive 32;). Without proxy_http_version 1.1 plus Connection "", NGINX sends Connection: close semantics or reuses sockets the backend already closed, producing resets. Make the NGINX upstream keepalive timeout (keepalive_timeout inside the upstream block) shorter than the backend’s idle timeout so NGINX retires sockets first.
- If a request/timeout limit is killing workers: Align timeouts. Raise the backend’s per-request timeout (Gunicorn --timeout, PHP-FPM request_terminate_timeout) for genuinely slow endpoints, or offload long work to a queue. Make sure NGINX proxy_read_timeout is consistent with the backend.
- If oversized request bodies are the trigger: Set a sane client_max_body_size in NGINX so it returns a clean 413 before forwarding, and align the backend’s body limit so it does not close the socket hard.
- Apply and reload. After editing the config, validate and reload without dropping connections:
sudo nginx -t && sudo systemctl reload nginx
- Verify. Re-run the direct curl and a request through NGINX, then watch error.log to confirm the resets are gone.
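For the first step, correlating timestamps, it helps to reduce the error log to one line per minute so it can be compared against dmesg and backend logs. The sketch below assumes the default error.log timestamp format (YYYY/MM/DD HH:MM:SS at the start of each line); the function name resets_per_minute is hypothetical.

```shell
# Bucket "104: Connection reset by peer" entries by minute.
# Reads error.log lines on stdin, prints "<count> <YYYY/MM/DD HH:MM>"
# in chronological order.
resets_per_minute() {
  grep -F 'recv() failed (104: Connection reset by peer)' \
    | cut -c1-16 \
    | sort | uniq -c \
    | sed 's/^ *//'
}
```

Running it before and after a change (resets_per_minute < /var/log/nginx/error.log, path adjusted to your setup) also gives a simple way to confirm in the final step that the resets have stopped.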
Prevention and Best Practices
- Always pair upstream keepalive with proxy_http_version 1.1; and proxy_set_header Connection "";. This single fix eliminates the most frequent reset cause.
- Keep NGINX keepalive timeout below the backend’s idle timeout so NGINX never reuses a socket the backend has already closed.
- Right-size memory and recycle workers. Use --max-requests (Gunicorn) or sensible pm.max_children (PHP-FPM) to bound memory and avoid OOM kills.
- Monitor for OOM events. Alert on dmesg OOM lines and on 502 rate spikes so resets surface before users complain. See the /dashboard/incident-response/ tooling for triaging upstream 502 floods.
- Log timestamps consistently across NGINX and the backend so correlation takes seconds, not minutes.
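As a starting point for alerting on 502 rate spikes, the share of 502 responses can be computed from the access log. This is a minimal sketch, assuming NGINX's stock "combined" log format, where the status code is the ninth whitespace-separated field; a custom log_format would need a different field number. The function name bad_gateway_rate is hypothetical.

```shell
# Report how many requests in an access log ended in 502 Bad Gateway.
# Reads access-log lines (combined format) on stdin.
bad_gateway_rate() {
  awk '
    { total++ }
    $9 == 502 { bad++ }
    END {
      if (total == 0) { print "no requests"; exit }
      printf "502s: %d of %d (%.1f%%)\n", bad, total, 100 * bad / total
    }'
}
```

Feeding it a recent slice, such as tail -n 1000 /var/log/nginx/access.log | bad_gateway_rate, gives a number a cron job or monitoring agent could compare against a threshold of your choosing.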
Related Errors
- upstream prematurely closed connection while reading response header from upstream: the clean-FIN cousin of this error; the backend closed gracefully rather than resetting. Usually keepalive or a backend exit without a full response.
- connect() failed (111: Connection refused) while connecting to upstream: the backend is not listening at all (down or wrong port), as opposed to dying mid-response.
- upstream timed out (110: Connection timed out) while reading response header from upstream: the backend is alive but too slow; tune proxy_read_timeout and the backend’s processing.
- For more NGINX guides, see /categories/nginx/.
Frequently Asked Questions
Is recv() failed (104) the same as upstream prematurely closed connection?
No. recv() failed (104: Connection reset by peer) means the backend sent a TCP RST — a hard, abnormal teardown, typically from a crash, OOM kill, or a stale socket. upstream prematurely closed connection means a clean TCP FIN — a graceful close. The reset version almost always points at the backend dying or a poisoned keepalive socket.
Why does adding proxy_http_version 1.1; and Connection ""; fix the resets?
Upstream keepalive only works correctly over HTTP/1.1 with no explicit Connection: close. Without these directives, NGINX either closes connections it should reuse or reuses ones the backend already abandoned, so the next write lands on a dead socket and the kernel replies with a RST. The two directives make NGINX manage the keepalive pool the way the backend expects.
How do I confirm the OOM killer is the cause?
Run dmesg | grep -i oom and look for Out of memory: Killed process lines, then match their timestamps to the 502s in error.log. If they line up with your backend process name, the kernel is killing workers under memory pressure and the reset is just the visible symptom.
Can a network device cause this without the backend crashing?
Yes. A firewall, NAT gateway, load balancer, or container overlay with an aggressive idle timeout can evict a connection and inject a RST while both NGINX and the backend believe it is still open. ss -tan shows only current socket states, not past resets, so capture RSTs on the wire (for example, sudo tcpdump -ni any 'tcp port 9000 and tcp[tcpflags] & tcp-rst != 0') and align idle timeouts end to end, including any intermediate proxies.