Grafana Error Guide: 'too many open files'

Overview

Every open network socket, log file, database handle, and plugin file counts against the Grafana process’s file-descriptor (FD) limit. Once the process reaches its nofile limit (RLIMIT_NOFILE), the kernel refuses to allocate new descriptors, and any operation that needs one (accepting HTTP connections, opening data-source connections, writing logs) fails with too many open files (EMFILE). Under load this looks like Grafana intermittently refusing requests or losing its data sources.

The literal errors you will see:

logger=context error="accept tcp [::]:3000: accept4: too many open files"

dial tcp 10.0.0.20:9090: socket: too many open files

sqlite: unable to open database file: too many open files

It occurs under connection load or after a slow FD leak: many concurrent users, many data-source queries, or per-request connections that aren’t being closed.

Symptoms

Grafana intermittently refuses or drops requests (a reverse proxy in front of it typically reports these as 502s); the log shows accept4: too many open files.
Data-source queries fail with socket: too many open files.
The process’s open-FD count sits at or near its limit.
Restarting Grafana fixes it temporarily, then it degrades again (leak).

PID=$(pgrep -x grafana-server); ls /proc/$PID/fd | wc -l
grep -i "open files" /proc/$PID/limits

1021
Max open files            1024                 1024                 files

Common Root Causes

1. Default ulimit too low for the workload

A distro default of 1024 open files is easily exhausted by a busy Grafana with many users and data sources.

grep 'open files' /proc/$(pgrep -x grafana-server)/limits

2. systemd unit not raising LimitNOFILE

The service must set LimitNOFILE; a shell-level ulimit doesn’t apply to a systemd-managed process.

systemctl show grafana-server -p LimitNOFILE

LimitNOFILE=1024

3. Container runtime FD limit

In Kubernetes/Docker the container’s nofile limit applies, independent of the host.

kubectl -n monitoring exec deploy/grafana -- sh -c 'cat /proc/1/limits | grep "open files"'

4. A genuine FD leak

A misbehaving plugin, a data source that opens per-query connections without closing, or the image renderer can leak descriptors so the count climbs steadily until exhaustion.

5. Excessive concurrent connections / keep-alive

Very high concurrency (dashboards with many live panels, alerting fan-out) opens many sockets at once.

Diagnostic Workflow

Step 1: Measure current FD usage vs. the limit

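# If pgrep finds nothing (newer packages may run the process as "grafana"),
# use instead: PID=$(systemctl show grafana-server -p MainPID --value)
PID=$(pgrep -x grafana-server)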
echo "open: $(ls /proc/$PID/fd | wc -l)"
grep 'open files' /proc/$PID/limits

Step 2: See what the FDs are

sudo ls -l /proc/$PID/fd | awk 'NR > 1 {print $NF}' | sed 's/\[[0-9]*\]$//' | sort | uniq -c | sort -rn | head
sudo lsof -p $PID 2>/dev/null | awk '{print $5}' | sort | uniq -c | sort -rn | head

A large, growing count of socket: entries (first command) or IPv4/IPv6 entries (the lsof TYPE column in the second) points at connection leakage.
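To see which remote endpoints hold those sockets, the lsof output can be grouped by peer address. The following is a minimal sketch, not a Grafana tool: count_peers is a hypothetical helper name, and it assumes lsof's default column layout, where field 8 (NODE) is TCP and field 9 (NAME) has the form local->remote for connected sockets.

```shell
# count_peers: read lsof output on stdin and count connected TCP sockets
# per remote endpoint. Hypothetical helper; assumes lsof's default columns
# (field 8 = NODE, field 9 = NAME such as 10.0.0.5:40000->10.0.0.20:9090).
count_peers() {
  awk '$8 == "TCP" && $9 ~ /->/ { sub(/.*->/, "", $9); print $9 }' |
    sort | uniq -c | sort -rn
}

# Usage against a live process (-a ANDs the -p and -i filters,
# -nP keeps addresses and ports numeric):
#   sudo lsof -a -nP -p "$PID" -iTCP | count_peers
```

A single endpoint that dominates the list (for example one data source) narrows the search to that data source or the plugin that talks to it.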

Step 3: Check whether it’s a leak (trend over time)

for i in 1 2 3; do ls /proc/$PID/fd | wc -l; sleep 30; done

A monotonically climbing count under steady load indicates a leak, not just a low limit.
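Three samples are easy to misread, so it can help to collect a longer series and test it mechanically. The sketch below is one way to do that under stated assumptions: is_climbing and sample_fds are hypothetical helper names, and "every sample larger than the previous one" is a deliberately strict heuristic that noisy traffic can defeat in either direction.

```shell
# is_climbing N1 N2 ...: succeed (exit 0) only if every sample is larger
# than the one before it. Hypothetical helper for illustration.
is_climbing() {
  [ "$#" -ge 2 ] || return 1
  prev=$1
  shift
  for n in "$@"; do
    [ "$n" -gt "$prev" ] || return 1
    prev=$n
  done
  return 0
}

# sample_fds PID COUNT INTERVAL: print COUNT open-FD counts for PID,
# INTERVAL seconds apart (needs read access to /proc/PID/fd).
sample_fds() {
  i=0
  while [ "$i" -lt "$2" ]; do
    ls "/proc/$1/fd" | wc -l
    i=$((i + 1))
    if [ "$i" -lt "$2" ]; then sleep "$3"; fi
  done
}

# Usage: ten samples, 30 seconds apart.
#   if is_climbing $(sample_fds "$PID" 10 30); then echo "possible leak"; fi
```

Treat a positive result as a prompt to look at what the new descriptors are, not as proof of a leak.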

Step 4: Raise the systemd limit

# /etc/systemd/system/grafana-server.service.d/override.conf
[Service]
LimitNOFILE=65536

systemctl daemon-reload
systemctl restart grafana-server
systemctl show grafana-server -p LimitNOFILE
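systemctl show reports the value configured on the unit. To confirm that the restarted process actually inherited it, the process's own limits file under /proc can be read. A minimal sketch, assuming the Linux /proc/PID/limits layout shown earlier; soft_nofile is a hypothetical helper name.

```shell
# soft_nofile LIMITS_FILE: print the soft "Max open files" value from a
# /proc/PID/limits file. Hypothetical helper; field 4 of that row is the
# soft limit and field 5 the hard limit.
soft_nofile() {
  awk '/^Max open files/ { print $4 }' "$1"
}

# Usage after the restart, taking the PID from systemd rather than pgrep:
#   soft_nofile "/proc/$(systemctl show grafana-server -p MainPID --value)/limits"
```

If this still prints the old value, the drop-in was probably not picked up (wrong path or unit name, or daemon-reload was skipped).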

Step 5: Raise the container limit (Kubernetes/Docker)

# Docker Compose
services:
  grafana:
    ulimits:
      nofile:
        soft: 65536
        hard: 65536

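For Kubernetes, note that the pod spec has no field for ulimits and nofile is not a sysctl: containers inherit the limit from the node's container runtime (for example, the LimitNOFILE of the containerd or CRI-O service). Raise it there per your platform, then verify inside the pod.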

Example Root Cause Analysis

An on-call sees Grafana returning intermittent 502s during business hours. The log:

logger=context error="accept tcp [::]:3000: accept4: too many open files"

FD usage sits at the ceiling:

PID=$(pgrep -x grafana-server); ls /proc/$PID/fd | wc -l; grep 'open files' /proc/$PID/limits

1024
Max open files            1024                 1024                 files

lsof shows ~900 sockets to a Prometheus data source — the org grew and many users now keep heavy dashboards open. It’s not a leak (count is stable at the limit under load), just an outgrown default. systemctl show confirms LimitNOFILE=1024.

Fix: raise the systemd limit:

# /etc/systemd/system/grafana-server.service.d/override.conf
[Service]
LimitNOFILE=65536

systemctl daemon-reload && systemctl restart grafana-server

FD usage now peaks around 2–3k with plenty of headroom and the 502s stop. Root cause: the default 1024 ulimit was too low for the grown connection load — a limit increase, not a leak fix.

Prevention Best Practices

Set LimitNOFILE explicitly (e.g. 65536) in the systemd unit / container ulimits; don’t rely on distro defaults.
Monitor open FDs vs. the limit and alert well before exhaustion (e.g. at 80%).
Distinguish leak from load: a steady climb under constant traffic is a leak — capture lsof and check plugin/renderer/data-source versions.
Keep plugins and the image renderer up to date; leaks are often fixed upstream.
Use connection pooling settings on SQL data sources to bound concurrent connections.
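The 80% alerting rule above can be sketched as a small check suitable for cron or a monitoring agent. This is an illustration only: fd_status is a hypothetical helper name, the 80 default simply mirrors the example threshold, and the output format is made up for this sketch.

```shell
# fd_status USED LIMIT [WARN_PCT]: print a one-line status and return
# non-zero once usage reaches WARN_PCT percent of the limit (default 80).
# Hypothetical helper; integer arithmetic rounds the percentage down.
fd_status() {
  used=$1
  limit=$2
  warn=${3:-80}
  pct=$((used * 100 / limit))
  if [ "$pct" -ge "$warn" ]; then
    echo "WARN open_fds=$used limit=$limit pct=$pct"
    return 1
  fi
  echo "OK open_fds=$used limit=$limit pct=$pct"
}

# Usage with live values read from /proc:
#   fd_status "$(ls /proc/$PID/fd | wc -l)" \
#             "$(awk '/^Max open files/ {print $4}' /proc/$PID/limits)"
```

With the numbers from the Symptoms section, fd_status 1021 1024 reports 99% and returns non-zero.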

Quick Command Reference

# Current usage vs limit
PID=$(pgrep -x grafana-server); ls /proc/$PID/fd | wc -l
grep 'open files' /proc/$PID/limits
systemctl show grafana-server -p LimitNOFILE

# What are the FDs?
sudo lsof -p $PID | awk '{print $5}' | sort | uniq -c | sort -rn | head

# Leak check (trend)
for i in 1 2 3; do ls /proc/$PID/fd | wc -l; sleep 30; done

# Raise systemd limit
#  /etc/systemd/system/grafana-server.service.d/override.conf
#  [Service]
#  LimitNOFILE=65536
systemctl daemon-reload && systemctl restart grafana-server

# In-container check
kubectl -n monitoring exec deploy/grafana -- sh -c 'cat /proc/1/limits | grep "open files"'

Conclusion

too many open files means Grafana hit its file-descriptor ceiling and the kernel is refusing new sockets/handles. Typical root causes:

A default nofile ulimit (often 1024) too low for the workload.
The systemd unit not setting LimitNOFILE (shell ulimit doesn’t apply).
A container runtime nofile limit in Kubernetes/Docker.
A genuine FD leak from a plugin, data source, or the renderer.
Very high concurrent connection load.

Measure open FDs vs. the limit and check the trend first — flat-at-limit under load means raise LimitNOFILE; a steady climb means hunt the leak.
