AI for Prometheus & Monitoring · By James Joyner IV · 9 min read

Prometheus Error Guide: 'too many open files' File Descriptor Limit

Fix the Prometheus 'too many open files' error: diagnose low ulimit, leaked connections, high target counts, and TSDB block fan-out. Raise nofile and verify limits.

  • #prometheus-monitoring
  • #troubleshooting
  • #errors
  • #limits

Overview

too many open files is the OS error (EMFILE) Prometheus surfaces when it tries to open a file descriptor (FD) but has already hit its per-process limit (RLIMIT_NOFILE). Prometheus consumes FDs for every scrape connection, every remote-write connection, every open TSDB block file (chunks, index), the WAL, and the HTTP server’s sockets. As target counts, retention, and query load grow, the FD count rises; when it crosses the limit, scrapes fail, queries error, and compaction can stall.

You will see it in the log against many different operations:

ts=2026-06-23T14:20:11.903Z caller=scrape.go:1382 level=error scrape_pool=node msg="Scrape commit failed" err="open /prometheus/wal/00012890: too many open files"

Or affecting HTTP and scrapes:

err="Get \"http://10.0.4.9:9100/metrics\": dial tcp 10.0.4.9:9100: socket: too many open files"

It is a resource-exhaustion condition, not a corruption: nothing is broken on disk, but Prometheus can’t open new files or sockets until FDs are freed or the limit is raised. Symptoms are broad and intermittent because any FD-requiring operation can be the one that fails.

Symptoms

  • Scrapes across many jobs fail simultaneously with socket: too many open files.
  • TSDB operations (WAL append, compaction) error with open ...: too many open files.
  • The HTTP API returns errors or refuses connections under load.
  • process_open_fds sits near process_max_fds.
process_open_fds{job="prometheus"} / process_max_fds{job="prometheus"} > 0.9
{instance="localhost:9090"}  0.98

Common Root Causes

1. The nofile soft limit is too low

The classic cause: Prometheus inherited a default nofile (often 1024) that is far below what a real workload needs. Check the live process limit:

cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'
Max open files            1024                 4096                 files

A soft limit of 1024 is exhausted quickly by even a few hundred targets plus TSDB files.

2. High target count driving many concurrent connections

Every scrape target needs a socket, and Prometheus typically keeps scrape connections alive between scrapes, so thousands of targets on a short interval hold thousands of FDs at once.

curl -s http://localhost:9090/api/v1/targets \
  | jq '[.data.activeTargets[] | select(.health=="up")] | length'
8421

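That is roughly 8,400 healthy targets (the filter counts only health == "up"; failing targets still use a socket for each attempt). Scraped on a short interval, they can hold thousands of concurrent sockets, easily past a low limit.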

3. Many TSDB blocks / large head keeping files open

Long retention and many blocks mean many chunk/index files open for querying and compaction.

ls -1 /prometheus | grep -E '^[0-9A-Z]{26}$' | wc -l
312

Hundreds of blocks, each contributing index and chunk FDs during queries and compaction, add up alongside scrape sockets.

4. Leaked or lingering connections (CLOSE_WAIT)

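CLOSE_WAIT means the peer has closed the connection but the local process has not yet closed its own end. When a misbehaving target or proxy triggers this repeatedly, those sockets pile up and keep holding FDs that are never reclaimed.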

ls -1 /proc/$(pgrep -x prometheus)/fd | wc -l
ss -tanp 2>/dev/null | grep "pid=$(pgrep -x prometheus)" | awk '{print $1}' | sort | uniq -c
3980
   2104 CLOSE-WAIT
     14 ESTAB

A large CLOSE_WAIT count (ss prints the state as CLOSE-WAIT) means FDs are leaking and will exhaust the limit even if the workload is modest.
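
To find which target or proxy is behind the leak, it can help to group the CLOSE_WAIT sockets by peer address. The following is a minimal sketch, not part of the original guide: the function name close_wait_by_peer is hypothetical, and it assumes the default `ss -tan` column layout, where the peer address is the fifth field.

```shell
# close_wait_by_peer: reads `ss -tan`-style lines on stdin and prints
# "<count> <peer_address>" for sockets in CLOSE-WAIT, busiest peer first.
close_wait_by_peer() {
  awk '
    # ss prints CLOSE-WAIT; netstat prints CLOSE_WAIT. Accept both.
    $1 == "CLOSE-WAIT" || $1 == "CLOSE_WAIT" {
      peer = $5
      sub(/:[0-9]+$/, "", peer)   # drop the port, keep the address
      count[peer]++
    }
    END {
      for (p in count) print count[p], p
    }
  ' | sort -k1,1nr -k2,2
}
```

One way to use it: `ss -tanp 2>/dev/null | grep "pid=$(pgrep -x prometheus)," | close_wait_by_peer | head`. A single peer dominating the list points at that target or proxy rather than at the limit.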

5. systemd LimitNOFILE not set (or overriding the file)

For a systemd-managed Prometheus, LimitNOFILE in the unit governs the limit regardless of /etc/security/limits.conf.

systemctl show prometheus -p LimitNOFILE -p LimitNOFILESoft
LimitNOFILE=1024
LimitNOFILESoft=1024

A unit pinned at 1024 ignores limits.conf entirely; the fix must go in the unit (or a drop-in).
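
If it is unclear whether the running Prometheus is managed by systemd at all, the process's cgroup path usually names the unit. This is a heuristic sketch under that assumption: systemd_unit_of is a hypothetical helper, and a result such as user@1000.service would suggest the process was started from a user session rather than from a system unit.

```shell
# systemd_unit_of CGROUP_FILE
# Prints the first .service unit named in a /proc/<pid>/cgroup file,
# or returns 1 when no service unit appears in it.
systemd_unit_of() {
  # Lines look like "0::/system.slice/prometheus.service" (cgroup v2) or
  # "1:name=systemd:/system.slice/prometheus.service" (cgroup v1).
  local unit
  unit=$(grep -o '[^/]*\.service' "$1" | head -n 1)
  [ -n "$unit" ] || return 1
  echo "$unit"
}
```

For example, `systemd_unit_of /proc/$(pgrep -x prometheus)/cgroup` printing prometheus.service suggests the limit belongs in that unit; no output suggests a container runtime or a manual start, where the limit comes from elsewhere.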

6. Heavy concurrent query load opening block files

A burst of long-range queries opens many block files at once; combined with scrape sockets, this can spike FD usage transiently.

process_open_fds{job="prometheus"}
{instance="localhost:9090"}  4002

FD usage spiking with query concurrency (rather than steady-state) points at query-time block fan-out as the tipping factor.
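
A transient spike is easy to miss between scrapes of Prometheus's own metrics, so sampling the FD count directly at a short interval can help. A minimal sketch, assuming a Linux /proc layout; sample_fds is a hypothetical name, and the optional PROC_ROOT argument exists only so the function can be tried against a fixture directory.

```shell
# sample_fds PID SAMPLES INTERVAL_SECONDS [PROC_ROOT]
# Prints one "<unix_time> <open_fds>" line per sample.
sample_fds() {
  local pid=$1 samples=$2 interval=$3 proc_root=${4:-/proc} i=0
  while [ "$i" -lt "$samples" ]; do
    # Stop if the process has exited (or never existed).
    [ -d "$proc_root/$pid/fd" ] || return 1
    printf '%s %s\n' "$(date +%s)" \
      "$(ls -1 "$proc_root/$pid/fd" | wc -l | tr -d ' ')"
    i=$((i + 1))
    if [ "$i" -lt "$samples" ]; then
      sleep "$interval"
    fi
  done
  return 0
}
```

Running `sample_fds "$(pgrep -x prometheus)" 60 1` while a heavy dashboard loads would show whether the count jumps with query activity or stays flat. Reading another user's /proc/<pid>/fd generally requires root or the same user.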

Diagnostic Workflow

Step 1: Confirm FD usage against the limit

process_open_fds{job="prometheus"}
process_max_fds{job="prometheus"}

If process_open_fds is at or near process_max_fds, this is genuinely a limit problem, not an unrelated error.
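
When the HTTP API is itself failing for lack of FDs, the same ratio can be read straight from /proc instead of from the metrics. This is a sketch assuming a Linux /proc layout; fd_usage is a hypothetical helper, and PROC_ROOT is a parameter only so it can be run against a fixture directory.

```shell
# fd_usage PID [PROC_ROOT]
# Prints "<open_fds> <soft_limit> <percent_used>" for one process.
fd_usage() {
  local pid=$1 proc_root=${2:-/proc} open soft

  # Each entry under fd/ is one open descriptor.
  open=$(ls -1 "$proc_root/$pid/fd" | wc -l | tr -d ' ')

  # "Max open files" row: three-word name, then soft limit, hard limit, units.
  soft=$(awk '/^Max open files/ {print $4}' "$proc_root/$pid/limits")
  [ -n "$soft" ] || return 1

  if [ "$soft" = "unlimited" ]; then
    echo "$open unlimited 0"
    return 0
  fi

  # An integer percentage is enough for a go/no-go check.
  echo "$open $soft $(( open * 100 / soft ))"
}
```

For example, `fd_usage "$(pgrep -x prometheus)"` reporting a percentage in the high 90s corresponds to the ratio query above returning a value near 1.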

Step 2: Read the live per-process limit

cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'

The soft (effective) limit is what matters; a low value here is the smoking gun.

Step 3: Break down what the FDs are

ls -1 /proc/$(pgrep -x prometheus)/fd | wc -l
ss -tanp 2>/dev/null | grep "pid=$(pgrep -x prometheus)" | awk '{print $1}' | sort | uniq -c
lsof -p $(pgrep -x prometheus) 2>/dev/null | awk '{print $5}' | sort | uniq -c | sort -rn | head

Distinguish sockets (scrapes/remote-write), CLOSE_WAIT leaks, and regular files (TSDB).
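
Where lsof is not installed, the same split can be approximated from the /proc/<pid>/fd symlinks, whose targets start with socket:, pipe:, or anon_inode: for non-file descriptors. A minimal sketch assuming Linux; fd_kinds is a hypothetical name, and PROC_ROOT is there only to allow a fixture directory.

```shell
# fd_kinds PID [PROC_ROOT]
# Prints "<count> <kind>" lines, where kind is socket, pipe, anon_inode,
# or file (anything else, including TSDB and WAL files).
fd_kinds() {
  local pid=$1 proc_root=${2:-/proc} fd target
  for fd in "$proc_root/$pid/fd"/*; do
    # readlink fails if the descriptor closed between listing and reading.
    target=$(readlink "$fd") || continue
    case $target in
      socket:*)     echo socket ;;
      pipe:*)       echo pipe ;;
      anon_inode:*) echo anon_inode ;;
      *)            echo file ;;
    esac
  done | sort | uniq -c | awk '{print $1, $2}' | sort -k1,1nr -k2,2
}
```

A result dominated by socket points at scrape or remote-write connections (check the socket states next); one dominated by file points at TSDB blocks and the WAL.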

Step 4: Set the limit in the right place

For systemd, edit a drop-in (not limits.conf). systemctl edit opens an editor; the commented lines below are what to enter in it:

systemctl edit prometheus
# [Service]
# LimitNOFILE=1048576
systemctl daemon-reload && systemctl restart prometheus
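
For automation, where an interactive editor is not an option, the drop-in can be written as a file instead. This is a sketch, not the guide's own method: write_nofile_dropin and the file name 10-nofile.conf are assumptions, chosen so the file does not overwrite an override.conf created by systemctl edit. SYSTEMD_DIR defaults to /etc/systemd/system and is a parameter so the function can be tried in a scratch directory.

```shell
# write_nofile_dropin UNIT LIMIT [SYSTEMD_DIR]
# Writes <SYSTEMD_DIR>/<UNIT>.service.d/10-nofile.conf and prints its path.
write_nofile_dropin() {
  local unit=$1 limit=$2 systemd_dir=${3:-/etc/systemd/system}
  local dropin_dir="$systemd_dir/$unit.service.d"
  mkdir -p "$dropin_dir" || return 1
  cat > "$dropin_dir/10-nofile.conf" <<EOF
[Service]
LimitNOFILE=$limit
EOF
  echo "$dropin_dir/10-nofile.conf"
}
```

Run as root, `write_nofile_dropin prometheus 1048576` would be followed by the same daemon-reload and restart as above; the new limit only applies to a freshly started process.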

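For containers, raise the nofile ulimit where the runtime accepts it: docker run --ulimit nofile=<soft>:<hard>, or the ulimits key in a Compose file. Kubernetes has no per-pod ulimit field; pods inherit the limit from the node's container runtime, so it has to be raised in the runtime's configuration on the node.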

Step 5: Verify the new limit took effect

cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'
Max open files            1048576              1048576              files

Confirm the soft limit reflects the change, then re-check process_open_fds / process_max_fds.

Example Root Cause Analysis

After onboarding a new cluster, Prometheus starts failing scrapes across every job with socket: too many open files, and dashboards go patchy.

Checking FD usage and the limit:

process_open_fds{job="prometheus"} / process_max_fds{job="prometheus"}
{instance="localhost:9090"}  0.99
cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'
Max open files            1024                 1024                 files

The active target count jumped to ~8,400 after the new cluster’s service discovery kicked in, but the systemd unit still ships the distro default of 1024 FDs. The workload simply needs far more descriptors than the limit allows.

The fix raises LimitNOFILE via a systemd drop-in (the unit overrides limits.conf):

systemctl edit prometheus
# [Service]
# LimitNOFILE=1048576
systemctl daemon-reload && systemctl restart prometheus
cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'
Max open files            1048576              1048576              files

With the limit at ~1M, process_open_fds settles around 12k and scrapes recover across all jobs. (Had the breakdown shown thousands of CLOSE_WAIT instead, the fix would have been the leaking target/proxy, not the limit.)

Prevention Best Practices

  • Set LimitNOFILE high (e.g., 1048576) in the systemd unit or container ulimits from the start; a default soft limit of 1024 is rarely enough for a production Prometheus.
  • Alert on process_open_fds / process_max_fds > 0.8 so you raise the limit before scrapes start failing, not after.
  • Watch for FD leaks: a steadily climbing process_open_fds at constant load, or a large CLOSE_WAIT count, points at a misbehaving target/proxy rather than a too-low limit.
  • Keep target counts and retention proportionate to the instance; shard Prometheus before a single instance’s FD (and memory) needs balloon.
  • Verify the limit at the live process (/proc/<pid>/limits), since limits.conf is ignored under systemd and easy to “fix” in the wrong place.
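
The alerting practice above could be captured as a rule file. The sketch below writes one from the shell and validates it with promtool when that binary is on the PATH. The function name, the alert name PrometheusFDsNearLimit, the 10m hold time, and the severity label are all illustrative assumptions; only the expression and the 0.8 threshold come from this guide.

```shell
# write_fd_alert_rule FILE [THRESHOLD]
# Writes a Prometheus alerting rule on the FD usage ratio to FILE.
write_fd_alert_rule() {
  local file=$1 threshold=${2:-0.8}
  cat > "$file" <<EOF
groups:
  - name: prometheus-fd-usage
    rules:
      - alert: PrometheusFDsNearLimit
        expr: process_open_fds{job="prometheus"} / process_max_fds{job="prometheus"} > $threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is close to its file descriptor limit"
EOF
  # Validate when promtool is available; otherwise leave the file unchecked.
  if command -v promtool >/dev/null 2>&1; then
    promtool check rules "$file"
  fi
}
```

The resulting file would still need to be listed under rule_files in the Prometheus configuration before the alert is evaluated.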

Quick Command Reference

# FD usage vs limit (expression browser)
# process_open_fds / process_max_fds

# Live per-process limit (the soft limit is what counts)
cat /proc/$(pgrep -x prometheus)/limits | grep -i 'open files'

# What are the FDs? sockets vs files vs leaks
ls -1 /proc/$(pgrep -x prometheus)/fd | wc -l
ss -tanp 2>/dev/null | grep "pid=$(pgrep -x prometheus)" | awk '{print $1}' | sort | uniq -c
lsof -p $(pgrep -x prometheus) 2>/dev/null | awk '{print $5}' | sort | uniq -c | sort -rn | head

# How many active targets?
curl -s http://localhost:9090/api/v1/targets \
  | jq '[.data.activeTargets[] | select(.health=="up")] | length'

# Raise the limit via systemd drop-in (enter these lines in the editor):
#   [Service]
#   LimitNOFILE=1048576
systemctl edit prometheus
systemctl daemon-reload && systemctl restart prometheus

# Re-check the ratio afterwards (expression browser)
# process_open_fds{job="prometheus"} / process_max_fds{job="prometheus"}

Conclusion

too many open files means Prometheus hit its FD limit while opening a socket or file. Work it down:

  1. Confirm process_open_fds is at process_max_fds — verify it’s really a limit issue.
  2. Read the live limit at /proc/<pid>/limits; a soft limit of 1024 is the usual culprit.
  3. Break down the FDs into sockets, CLOSE_WAIT leaks, and TSDB files.
  4. Raise LimitNOFILE in the systemd unit (or container ulimits) — not limits.conf.
  5. Verify the new limit took effect at the live process.

Most cases are simply a default limit that never scaled with the target count. Raise it generously, alert on the ratio, and watch for CLOSE_WAIT leaks that no limit increase will cure.
