Prometheus Error Guide: 'too many open files' File Descriptor Limit
Fix the Prometheus 'too many open files' error: diagnose low ulimit, leaked connections, high target counts, and TSDB block fan-out. Raise nofile and verify limits.
Overview
too many open files is the OS error (EMFILE) Prometheus surfaces when it tries to open a file descriptor (FD) but has already hit its per-process limit (RLIMIT_NOFILE). Prometheus consumes FDs for every scrape connection, every remote-write connection, every open TSDB block file (chunks, index), the WAL, and the HTTP server’s sockets. As target counts, retention, and query load grow, the FD count rises; when it crosses the limit, scrapes fail, queries error, and compaction can stall.
You will see it in the log against many different operations:
ts=2026-06-23T14:20:11.903Z caller=scrape.go:1382 level=error scrape_pool=node msg="Scrape commit failed" err="open /prometheus/wal/00012890: too many open files"
Or affecting HTTP and scrapes:
err="Get \"http://10.0.4.9:9100/metrics\": dial tcp 10.0.4.9:9100: socket: too many open files"
It is a resource-exhaustion condition, not a corruption: nothing is broken on disk, but Prometheus can’t open new files or sockets until FDs are freed or the limit is raised. Symptoms are broad and intermittent because any FD-requiring operation can be the one that fails.
Symptoms
- Scrapes across many jobs fail simultaneously with socket: too many open files.
- TSDB operations (WAL append, compaction) error with open ...: too many open files.
- The HTTP API returns errors or refuses connections under load.
- process_open_fds sits near process_max_fds.
process_open_fds{job="prometheus"} / process_max_fds{job="prometheus"} > 0.9
{instance="localhost:9090"} 0.98
Common Root Causes
1. The nofile soft limit is too low
The classic cause: Prometheus inherited a default nofile (often 1024) that is far below what a real workload needs. Check the live process limit:
cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'
Max open files 1024 4096 files
A soft limit of 1024 is exhausted quickly by even a few hundred targets plus TSDB files.
2. High target count driving many concurrent connections
Each scrape opens a connection; thousands of targets at a short interval keep many FDs in flight at once.
curl -s http://localhost:9090/api/v1/targets \
| jq '.data.activeTargets | length'
8421
8,400 active targets, each scraped on a short interval, can hold thousands of concurrent sockets — easily past a low limit.
3. Many TSDB blocks / large head keeping files open
Long retention and many blocks mean many chunk/index files open for querying and compaction.
ls -1 /prometheus | grep -E '^[0-9A-Z]{26}$' | wc -l
312
Hundreds of blocks, each contributing index and chunk FDs during queries and compaction, add up alongside scrape sockets.
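To put a number on the block fan-out, the sketch below counts block directories and the files inside them. The function name count_block_files is illustrative, and /prometheus is assumed to be the data directory as in the command above. Files on disk are only a rough proxy for descriptors: not every file is necessarily open at once.

```shell
# Count TSDB block directories (26-character ULID names) under a data
# directory, plus the regular files they contain (index, chunks, meta.json...).
# count_block_files is a hypothetical helper name, not a Prometheus tool.
count_block_files() {
  data_dir=$1
  blocks=0
  files=0
  for d in "$data_dir"/*/; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    # Skip names containing anything other than digits and uppercase letters.
    case $name in
      *[!0-9A-Z]*) continue ;;
    esac
    # Block directory names are 26 characters long; wal, chunks_head etc. are not.
    [ "${#name}" -eq 26 ] || continue
    blocks=$((blocks + 1))
    n=$(find "$d" -type f | wc -l)
    files=$((files + n))
  done
  echo "blocks=$blocks files=$files"
}

# Usage: count_block_files /prometheus
```

A file count far above the block count usually means many chunk segment files per block, which is where long retention shows up.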
4. Leaked or lingering connections (CLOSE_WAIT)
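A socket in CLOSE_WAIT (printed by ss as CLOSE-WAIT) means the remote end, typically a misbehaving target or proxy, has closed the connection but the local side has not yet released it. Each such socket keeps holding an FD until it is closed.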
ls -1 /proc/$(pgrep -x prometheus)/fd | wc -l
ss -tanp 2>/dev/null | grep "pid=$(pgrep -x prometheus)," | awk '{print $1}' | sort | uniq -c
3980
2104 CLOSE-WAIT
14 ESTAB
A large CLOSE_WAIT count means FDs are leaking and will exhaust the limit even if the workload is modest.
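What counts as "large" is easier to judge as a share of all sockets. The following is a minimal sketch that reads ss -tan style output on stdin (state in the first column) and reports the CLOSE_WAIT share. The function name close_wait_share is an assumption for illustration. Both spellings of the state are accepted, since ss uses a hyphen and netstat an underscore.

```shell
# Summarise how many sockets are stuck in CLOSE_WAIT.
# Input: lines whose first column is the TCP state, e.g. from `ss -tan`.
# close_wait_share is a hypothetical helper name.
close_wait_share() {
  awk '
    NR == 1 && $1 == "State" { next }   # skip the ss header line
    NF {
      total++
      if ($1 == "CLOSE-WAIT" || $1 == "CLOSE_WAIT") cw++
    }
    END {
      share = (total > 0) ? cw / total : 0
      printf "close_wait=%d total=%d share=%.2f\n", cw, total, share
    }
  '
}

# Usage (root is usually needed to see another user'"'"'s sockets with -p):
#   ss -tanp | grep "pid=$(pgrep -x prometheus)," | close_wait_share
```

A share that keeps growing between runs at constant load suggests a leak rather than a limit that is simply too low.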
5. systemd LimitNOFILE not set (or overriding the file)
For a systemd-managed Prometheus, LimitNOFILE in the unit governs the limit regardless of /etc/security/limits.conf.
systemctl show prometheus -p LimitNOFILE -p LimitNOFILESoft
LimitNOFILE=1024
LimitNOFILESoft=1024
A unit pinned at 1024 ignores limits.conf entirely; the fix must go in the unit (or a drop-in).
6. Heavy concurrent query load opening block files
A burst of long-range queries opens many block files at once; combined with scrape sockets, this can spike FD usage transiently.
process_open_fds{job="prometheus"}
{instance="localhost:9090"} 4002
FD usage spiking with query concurrency (rather than steady-state) points at query-time block fan-out as the tipping factor.
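A single reading cannot separate a spike from steady-state usage, so it can help to sample the count for a short while during a query burst. This is a rough sketch that assumes Linux and read access to /proc/<pid>/fd. The function name sample_fds and its arguments are illustrative.

```shell
# Sample a process's open-FD count several times and print the range seen.
# Arguments: PID, number of samples, seconds between samples.
# sample_fds is a hypothetical helper name.
sample_fds() {
  pid=$1
  samples=$2
  interval=$3
  min=""
  max=""
  i=0
  while [ "$i" -lt "$samples" ]; do
    # Each entry under /proc/<pid>/fd is one open descriptor (Linux only).
    n=$(ls -1 "/proc/$pid/fd" 2>/dev/null | wc -l)
    n=$((n + 0))   # normalise any padding from wc
    if [ -z "$min" ] || [ "$n" -lt "$min" ]; then min=$n; fi
    if [ -z "$max" ] || [ "$n" -gt "$max" ]; then max=$n; fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "min=$min max=$max"
}

# Usage: sample_fds "$(pgrep -x prometheus)" 60 1
```

A wide gap between min and max that lines up with dashboard refreshes or ad hoc queries points at query-time usage. A narrow range near the limit points at steady-state demand.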
Diagnostic Workflow
Step 1: Confirm FD usage against the limit
process_open_fds{job="prometheus"}
process_max_fds{job="prometheus"}
If open_fds is at or near max_fds, this is genuinely a limit problem, not an unrelated error.
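If the expression browser is itself unreachable, the same two gauges can be read from Prometheus's own /metrics endpoint. The sketch below parses that exposition text from stdin and prints the ratio. The function name fd_usage_from_metrics is an assumption, and localhost:9090 is taken from the examples in this guide.

```shell
# Compute open/max FD usage from Prometheus exposition text on stdin.
# fd_usage_from_metrics is a hypothetical helper name.
fd_usage_from_metrics() {
  awk '
    $1 == "process_open_fds" { o = $2 }
    $1 == "process_max_fds"  { m = $2 }
    END {
      if (m > 0) {
        printf "open=%d max=%d ratio=%.2f\n", o, m, o / m
      } else {
        exit 1   # gauges not found in the input
      }
    }
  '
}

# Usage: curl -s http://localhost:9090/metrics | fd_usage_from_metrics
```

Note that the curl call needs a socket of its own on the Prometheus side, so it may also fail once the limit is fully exhausted.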
Step 2: Read the live per-process limit
cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'
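The soft (effective) limit is what matters; a low value here is the smoking gun. Binaries built with Go 1.19 or later raise their own soft limit to the hard limit at startup, so on recent Prometheus builds the two columns usually match and the hard limit is the real ceiling.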
Step 3: Break down what the FDs are
ls -1 /proc/$(pgrep -x prometheus)/fd | wc -l
ss -tanp 2>/dev/null | grep "pid=$(pgrep -x prometheus)," | awk '{print $1}' | sort | uniq -c
lsof -p $(pgrep -x prometheus) 2>/dev/null | awk '{print $5}' | sort | uniq -c | sort -rn | head
Distinguish sockets (scrapes/remote-write), CLOSE_WAIT leaks, and regular files (TSDB).
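Where lsof is not installed, the symlink targets under /proc/<pid>/fd carry the same information. This is a minimal sketch that buckets those targets read from stdin. The function name classify_fds is illustrative, and the category names are a simplification chosen for this example.

```shell
# Bucket FD symlink targets (one per line on stdin) into rough categories.
# Targets look like "socket:[12345]", "pipe:[99]", "anon_inode:[eventpoll]",
# or an absolute path for regular files.
# classify_fds is a hypothetical helper name.
classify_fds() {
  awk '
    /^socket:/     { s++; next }
    /^pipe:/       { p++; next }
    /^anon_inode:/ { a++; next }
    /^\//          { f++; next }
    NF             { o++ }
    END {
      printf "sockets=%d files=%d pipes=%d anon=%d other=%d\n", s, f, p, a, o
    }
  '
}

# Usage (Linux; run as root or as the user that owns the process):
#   for fd in /proc/"$(pgrep -x prometheus)"/fd/*; do readlink "$fd"; done | classify_fds
```

Sockets dominating suggests scrape or remote-write connections (or a leak), while files dominating suggests TSDB blocks and the WAL.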
Step 4: Set the limit in the right place
For systemd, edit a drop-in (not limits.conf):
systemctl edit prometheus
# In the editor that opens, add:
#   [Service]
#   LimitNOFILE=1048576
systemctl daemon-reload && systemctl restart prometheus
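For containers, raise the nofile ulimit where the runtime exposes it: docker run --ulimit nofile=1048576:1048576, or the ulimits key in a Compose file. Kubernetes has no pod-spec field for ulimits; containers inherit the limit from the node’s container runtime configuration, so the change has to be made there.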
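On hosts managed by scripts or configuration management, the interactive editor is awkward. A non-interactive equivalent is to write the systemd drop-in file directly, as sketched below. The function name write_nofile_dropin is hypothetical, and the unit name prometheus.service and the file name override.conf are assumptions that may differ on your system.

```shell
# Write a systemd drop-in that sets LimitNOFILE for a unit.
# Arguments: drop-in directory, limit value.
# write_nofile_dropin is a hypothetical helper name.
write_nofile_dropin() {
  dropin_dir=$1
  limit=$2
  mkdir -p "$dropin_dir" || return 1
  cat > "$dropin_dir/override.conf" <<EOF
[Service]
LimitNOFILE=$limit
EOF
}

# Usage (as root), assuming the unit is prometheus.service:
#   write_nofile_dropin /etc/systemd/system/prometheus.service.d 1048576
#   systemctl daemon-reload && systemctl restart prometheus
```

Unlike systemctl edit, writing the file by hand does not reload systemd, so the daemon-reload step is required here.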
Step 5: Verify the new limit took effect
cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'
Max open files 1048576 1048576 files
Confirm the soft limit reflects the change, then re-check process_open_fds / process_max_fds.
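The check can be scripted so a deploy fails loudly if the limit did not stick. The sketch below reads /proc/<pid>/limits style text on stdin and succeeds only when the soft limit meets a threshold. The function name soft_limit_at_least and the 65536 threshold in the usage line are illustrative choices, not recommendations from Prometheus.

```shell
# Exit 0 if the soft "Max open files" limit on stdin is >= the wanted value.
# Input format is that of /proc/<pid>/limits, where the soft limit is the
# fourth whitespace-separated field of the "Max open files" row.
# soft_limit_at_least is a hypothetical helper name.
soft_limit_at_least() {
  want=$1
  awk -v want="$want" '
    /^Max open files/ {
      found = 1
      soft = $4
      ok = (soft == "unlimited" || soft + 0 >= want + 0)
    }
    END {
      if (found && ok) exit 0
      exit 1
    }
  '
}

# Usage:
#   soft_limit_at_least 65536 < /proc/"$(pgrep -x prometheus)"/limits \
#     || echo "nofile limit is still too low" >&2
```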
Example Root Cause Analysis
After onboarding a new cluster, Prometheus starts failing scrapes across every job with socket: too many open files, and dashboards go patchy.
Checking FD usage and the limit:
process_open_fds{job="prometheus"} / process_max_fds{job="prometheus"}
{instance="localhost:9090"} 0.99
cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'
Max open files 1024 1024 files
The active target count jumped to ~8,400 after the new cluster’s service discovery kicked in, but the systemd unit still ships the distro default of 1024 FDs. The workload simply needs far more descriptors than the limit allows.
The fix raises LimitNOFILE via a systemd drop-in (the unit overrides limits.conf):
systemctl edit prometheus
# In the editor that opens, add:
#   [Service]
#   LimitNOFILE=1048576
systemctl daemon-reload && systemctl restart prometheus
cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'
Max open files 1048576 1048576 files
With the limit at ~1M, process_open_fds settles around 12k and scrapes recover across all jobs. (Had the breakdown shown thousands of CLOSE_WAIT instead, the fix would have been the leaking target/proxy, not the limit.)
Prevention Best Practices
- Set LimitNOFILE high (e.g., 1048576) in the systemd unit or container ulimits from the start; a default of 1024 is rarely enough for a production Prometheus.
- Alert on process_open_fds / process_max_fds > 0.8 so you raise the limit before scrapes start failing, not after.
- Watch for FD leaks: a steadily climbing process_open_fds at constant load, or a large CLOSE_WAIT count, points at a misbehaving target/proxy rather than a too-low limit.
- Keep target counts and retention proportionate to the instance; shard Prometheus before a single instance’s FD (and memory) needs balloon.
- Verify the limit at the live process (/proc/<pid>/limits), since limits.conf is ignored under systemd and easy to “fix” in the wrong place.
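The ratio alert from the list above could be written as a rule file along these lines. This is a sketch under assumptions: the group name, alert name, 10m hold period, severity label, and output path are all hypothetical. Only the expression and the 0.8 threshold come from this guide.

```shell
# Write a Prometheus alerting rule for FD usage to the given path.
# write_fd_alert_rule and the names inside the rule are hypothetical.
write_fd_alert_rule() {
  rule_path=$1
  # Quoted heredoc delimiter: nothing inside is expanded by the shell.
  cat > "$rule_path" <<'EOF'
groups:
  - name: prometheus-fd
    rules:
      - alert: PrometheusFDUsageHigh
        expr: process_open_fds{job="prometheus"} / process_max_fds{job="prometheus"} > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is using more than 80% of its file descriptor limit"
EOF
}

# Usage (path is an assumption; validate before loading):
#   write_fd_alert_rule /etc/prometheus/rules/fd.yml
#   promtool check rules /etc/prometheus/rules/fd.yml
```

The hold period keeps short query-time spikes from paging anyone; shorten it if usage tends to climb quickly in your environment.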
Quick Command Reference
# FD usage vs limit (expression browser)
# process_open_fds / process_max_fds
# Live per-process limit (the soft limit is what counts)
cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'
# What are the FDs? sockets vs files vs leaks
ls -1 /proc/$(pgrep -x prometheus)/fd | wc -l
ss -tanp 2>/dev/null | grep "pid=$(pgrep -x prometheus)," | awk '{print $1}' | sort | uniq -c
lsof -p $(pgrep -x prometheus) 2>/dev/null | awk '{print $5}' | sort | uniq -c | sort -rn | head
# How many active targets?
curl -s http://localhost:9090/api/v1/targets \
| jq '.data.activeTargets | length'
# Raise the limit via systemd drop-in
systemctl edit prometheus
# In the editor that opens, add:
#   [Service]
#   LimitNOFILE=1048576
systemctl daemon-reload && systemctl restart prometheus
# Re-check the ratio after the restart (expression browser)
# process_open_fds{job="prometheus"} / process_max_fds{job="prometheus"}
Conclusion
too many open files means Prometheus hit its FD limit while opening a socket or file. Work it down:
- Confirm process_open_fds is at process_max_fds to verify it’s really a limit issue.
- Read the live limit at /proc/<pid>/limits; a soft limit of 1024 is the usual culprit.
- Break down the FDs into sockets, CLOSE_WAIT leaks, and TSDB files.
- Raise LimitNOFILE in the systemd unit (or container ulimits), not limits.conf.
- Verify the new limit took effect at the live process.
Most cases are simply a default limit that never scaled with the target count. Raise it generously, alert on the ratio, and watch for CLOSE_WAIT leaks that no limit increase will cure.