AI for Prometheus & Monitoring · By James Joyner IV · 9 min read

Prometheus Error Guide: 'lock DB directory: resource temporarily unavailable' Startup Failure

Fix Prometheus 'lock DB directory: resource temporarily unavailable' at startup: find and stop the second process holding the TSDB lock file before restarting.

  • #prometheus-monitoring
  • #troubleshooting
  • #errors
  • #tsdb

Exact Error Message

lock DB directory: resource temporarily unavailable is a startup-time failure. Prometheus logs it while opening its TSDB and then exits immediately — it never reaches the point of serving on port 9090:

ts=2026-06-27T09:12:44.501Z caller=main.go:1186 level=info msg="Starting TSDB ..."
ts=2026-06-27T09:12:44.512Z caller=main.go:1213 level=error msg="Error opening storage" err="opening storage failed: lock DB directory: resource temporarily unavailable"
ts=2026-06-27T09:12:44.513Z caller=main.go:1043 level=error err="opening storage failed: lock DB directory: resource temporarily unavailable"

The underlying string is the EAGAIN/EWOULDBLOCK errno text — “resource temporarily unavailable” — returned from a non-blocking flock() on the TSDB lock file. That lock file is literally named lock and lives at the root of the storage path:

/var/lib/prometheus/data/lock

If you see this, the process did not start. There is no partial degradation; the binary fails closed and systemd typically logs it as code=exited, status=1/FAILURE.

What the Error Means

On startup Prometheus opens its data directory and takes an exclusive advisory lock on the lock file using flock(LOCK_EX | LOCK_NB) — a non-blocking exclusive lock. The lock guarantees that exactly one process owns the TSDB at a time, because two writers appending to the same head block and WAL would corrupt the database.

When the lock is already held by another process, the non-blocking call returns EAGAIN, which surfaces as “resource temporarily unavailable.” Prometheus does not wait or retry; it reports opening storage failed and exits.

The critical thing to understand: flock locks belong to the open file description a process holds, not to the file on disk. When the holder exits, whether cleanly or via a crash, a kill -9, or an OOM kill, the kernel closes its descriptors and releases the lock automatically (the one exception is a forked child that inherited the same descriptor and is still running). So a lock file left sitting on disk after a crash is not what blocks startup. A truly orphaned lock file is harmless; Prometheus will re-acquire it on the next boot. If you are hitting this error, it is almost always because another live process is still holding the lock right now, most commonly a second or not-yet-dead Prometheus.
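The release-on-exit behavior can be checked without Prometheus at all. The following is a minimal sketch, assuming a Linux host with the flock(1) utility (util-linux or BusyBox). The temp directory, the helper names try_lock and demo_lock_lifetime, and the use of sleep as a stand-in for the lock holder are all illustrative and not part of Prometheus.

```shell
#!/usr/bin/env bash
# Sketch: an flock lock disappears with its holder, even on SIGKILL,
# while the lock file itself stays on disk the whole time.

try_lock() {
  # Non-blocking exclusive attempt on the file given as $1.
  if flock -x -n "$1" true 2>/dev/null; then
    echo acquired
  else
    echo busy
  fi
}

demo_lock_lifetime() {
  local dir lockfile holder
  dir=$(mktemp -d)
  lockfile="$dir/lock"

  # Holder: open the file on fd 9, lock it, then stay alive as `sleep`.
  ( exec 9>"$lockfile"; flock -x -n 9 || exit 1; exec sleep 60 ) &
  holder=$!
  sleep 1   # give the holder time to take the lock

  echo "while holder is alive: $(try_lock "$lockfile")"

  # Simulate a hard crash; the kernel closes fd 9 and drops the lock.
  kill -KILL "$holder"
  wait "$holder" 2>/dev/null || true

  echo "after SIGKILL, file still on disk: $(try_lock "$lockfile")"
  rm -rf "$dir"
}

demo_lock_lifetime
```

On a typical Linux system this should report the lock as busy while the holder is alive and as acquired after the kill, even though the file is never deleted in between.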

This is distinct from WAL corruption, which also fails under “opening storage failed” but is about damaged write-ahead-log segments rather than lock contention.

Common Causes

  • Two Prometheus processes pointing at the same --storage.tsdb.path. The classic case: a duplicate unit, a manual ./prometheus launched alongside the service, or two containers mounting the same host directory.
  • An old process that did not fully stop during a restart. systemd thinks the service stopped, but the old PID is still draining (slow shutdown, blocked on fsync) and still holds the lock when the new process starts.
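  • A wedged process that is still alive. A Prometheus that is hung (for example stuck in uninterruptible I/O, state D) never closes the descriptor, so it keeps the lock until it finally exits. Rare, but worth ruling out. A true zombie (defunct) entry has already released its descriptors and is not the holder.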
  • Running Prometheus plus a manual promtool/tsdb operation on the same dir. promtool tsdb commands and snapshot/backfill tooling can open the same data directory; doing so while the server runs collides on the lock.
  • systemd restart racing the old PID. A Restart=always policy or a fast systemctl restart can launch the new process before the kernel has torn down the old one’s file locks.
  • NFS or shared-volume lock semantics. On NFS (especially older setups without working lockd/flock support) advisory locks behave inconsistently, so a lock may appear held when it isn’t, or vice versa.
  • Containers sharing a host path. Two pods/containers bind-mounting the same /var/lib/prometheus/data will fight over the single lock file.
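Several of the causes above come down to two processes started with the same --storage.tsdb.path. A rough way to spot that from a process listing is sketched below. The function name find_shared_tsdb_paths is hypothetical, it only understands an explicit --storage.tsdb.path flag (not the default relative data directory), and it does no path normalization, so treat its output as a hint rather than proof.

```shell
#!/usr/bin/env bash
# Sketch: read "PID ARGS..." lines on stdin and print every explicit
# --storage.tsdb.path value that is used by more than one process.

find_shared_tsdb_paths() {
  awk '
    {
      pid = $1; path = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^--storage\.tsdb\.path=/) {
          # --storage.tsdb.path=/some/dir form
          path = $i
          sub(/^--storage\.tsdb\.path=/, "", path)
        } else if ($i == "--storage.tsdb.path" && i < NF) {
          # --storage.tsdb.path /some/dir form
          path = $(i + 1)
        }
      }
      if (path != "") { count[path]++; pids[path] = pids[path] " " pid }
    }
    END {
      for (p in count) if (count[p] > 1) print p ":" pids[p]
    }'
}

# Usage on a live host (read-only):
#   ps -eo pid=,args= | grep '[p]rometheus' | find_shared_tsdb_paths
```

Any line it prints names a data directory and the PIDs that claim it. Containers that bind-mount the same host directory under different in-container paths will not be caught this way.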

How to Reproduce the Error

Start a second Prometheus pointed at a data directory that is already in use:

# First instance is already running against /var/lib/prometheus/data
prometheus --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data \
  --web.listen-address=:9091
level=error caller=main.go:1213 msg="Error opening storage" err="opening storage failed: lock DB directory: resource temporarily unavailable"

Note that the second instance used a different listen port (:9091) and still failed — the conflict is over the data directory lock, not the network port. This is what trips people up: they change the port, the error stays, because the contended resource is the filesystem, not the socket.

Diagnostic Commands

Every command here is a read-only inspection. None of them modify the lock file or the database.

List every Prometheus process — you are looking for more than one, or for an old PID that should be gone:

ps aux | grep '[p]rometheus'

Find exactly which process holds the lock file (this is the definitive answer):

# run as root (or as the prometheus user) so other users' processes are visible
lsof /var/lib/prometheus/data/lock
# or, if lsof is unavailable:
fuser -v /var/lib/prometheus/data/lock
# example lsof output:
COMMAND     PID       USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
prometheus 4821 prometheus    8uW  REG  259,1        0 131074 /var/lib/prometheus/data/lock

The W in the FD column is lsof's marker for a write lock on the whole file, here held by PID 4821.

Check what is bound to the metrics port:

ss -ltnp | grep 9090

Inspect the service state and recent logs:

systemctl status prometheus
journalctl -u prometheus -n 50 --no-pager

Confirm the lock file exists and who owns it (its presence alone is not the problem):

ls -la /var/lib/prometheus/data/lock
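If neither lsof nor fuser is installed, the kernel's own lock table in /proc/locks is another read-only source on Linux. The sketch below assumes the usual /proc/locks column layout (lock class, type, mode, PID, then major:minor:inode) and matches on the inode number only, ignoring the device, so a coincidental inode match on another filesystem is possible. The function name flock_holders_for_inode is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the PIDs that hold an flock on a given inode, reading
# /proc/locks-formatted text on stdin. Read-only; nothing is modified.

flock_holders_for_inode() {
  awk -v ino="$1" '
    $2 == "FLOCK" {
      # Field 6 looks like major:minor:inode; the inode is the last part.
      n = split($6, dev, ":")
      if (dev[n] == ino) print $5
    }'
}

# Usage on a live Linux host (the path is the example used in this guide):
#   ino=$(stat -c %i /var/lib/prometheus/data/lock)
#   flock_holders_for_inode "$ino" < /proc/locks
```

Inside a container the PID column reflects the reader's PID namespace, so a holder in another namespace may not show up with a useful PID.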

Step-by-Step Resolution

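1. Identify the holder before touching anything. Run lsof /var/lib/prometheus/data/lock (or fuser) as root, since an unprivileged user cannot see descriptors owned by other users' processes. The PID it returns is the real cause. If it returns nothing even as root, no local process holds the lock and the file on disk is not what is blocking you. Check that you inspected the same path the failing instance uses, whether the holder lives on another host sharing the volume (NFS), and whether the failure is really something else, such as WAL corruption.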

2. Stop the duplicate or stale process cleanly. If ps/lsof shows a second Prometheus (or an old PID from a botched restart), stop it gracefully so it can flush and release the lock:

sudo systemctl stop prometheus
# then confirm nothing is left:
ps aux | grep '[p]rometheus'

If a manually-launched instance is the culprit, send it SIGTERM and let it shut down:

sudo kill -TERM 4821      # graceful; lets it fsync and release the flock

Avoid kill -9 unless the process is genuinely wedged — SIGTERM lets Prometheus close the head block cleanly. Either way, the kernel releases the lock the instant the process dies.

3. Eliminate the duplicate launcher. If two unit files or a stray container both target the same path, remove or reconfigure one. Two Prometheus servers that must coexist need separate --storage.tsdb.path directories — never a shared one.

4. Fix restart races. If a fast systemctl restart keeps racing the old PID, set TimeoutStopSec long enough for Prometheus to finish shutting down, make sure any ExecStop command waits for the old process to exit, and consider a small RestartSec delay so the new process starts only after the old descriptor is gone. Do not paper over this with blind retries.

5. Start Prometheus and confirm. Once the holder is gone, start the service; it will re-acquire the lock with no manual cleanup:

sudo systemctl start prometheus
journalctl -u prometheus -n 20 --no-pager | grep -i 'server is ready'

You should not need to rm the lock file. Deleting it while a process holds the lock does nothing useful (the lock lives on the open descriptor, not the path), and deleting it when no process holds it is unnecessary.

The --storage.tsdb.no-lockfile tradeoff. This flag disables the lock entirely. It exists for edge cases like NFS where flock is unreliable, but it is dangerous: with no lock, nothing stops two processes from opening the same TSDB and corrupting it. Only use it if you have an external guarantee of single-writer (e.g. a Kubernetes StatefulSet with one replica and ReadWriteOnce), and never as a quick fix to silence this error. The right fix is to stop the second process.

Prevention and Best Practices

  • One data directory per Prometheus, always. Treat --storage.tsdb.path as exclusive. If you run multiple instances on a host, give each its own directory.
  • Never run promtool tsdb against a live server’s directory. Snapshot first (POST to /api/v1/admin/tsdb/snapshot, which requires --web.enable-admin-api) and operate on the copy.
  • Set sane systemd stop timeouts so a restart fully tears down the old process before relaunching, avoiding PID races.
  • Avoid NFS for the TSDB. If you must, validate flock works on that mount, or run a single replica with a block-device PVC instead.
  • In Kubernetes, use a single-replica StatefulSet with ReadWriteOnce storage so two pods can never mount the same volume.
  • Leave the lock enabled. It is a corruption guard, not a nuisance. Treat the error as a signal that you have two writers, which is the real problem to fix.

Related Errors

  • opening storage failed: ... WAL: also fails under “opening storage failed,” but the cause is damaged write-ahead-log segments after an unclean shutdown, not lock contention. Different fix (repair/truncate the WAL).
  • too many open files — a separate startup/runtime failure where the TSDB cannot open enough file descriptors; raise the LimitNOFILE/ulimit rather than touching the lock.
  • resource temporarily unavailable in other contexts — this errno (EAGAIN) also appears for socket and fork limits; here it specifically means the flock on lock was already held.
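When triaging logs in bulk, the distinctions above can be turned into a first-pass filter. This is a minimal sketch: the function name classify_startup_error and its category labels are made up for illustration, and the WAL branch only checks that the text "wal" appears somewhere after "opening storage failed", because the exact WAL message is not reproduced in this guide.

```shell
#!/usr/bin/env bash
# Sketch: map a Prometheus log line (passed as $1) to a rough failure category.

classify_startup_error() {
  case "$1" in
    *"lock DB directory: resource temporarily unavailable"*)
      echo lock-contention ;;
    *"too many open files"*)
      echo fd-limit ;;
    *"opening storage failed"*[Ww][Aa][Ll]*)
      echo possible-wal-damage ;;
    *"opening storage failed"*)
      echo other-storage-failure ;;
    *)
      echo unknown ;;
  esac
}

classify_startup_error 'err="opening storage failed: lock DB directory: resource temporarily unavailable"'
# prints: lock-contention
```

The order of the branches matters: the lock message is checked first so that it is never swallowed by the more general "opening storage failed" patterns.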

Frequently Asked Questions

Should I just delete the lock file? No. The lock is held on an open file descriptor, not on the filename. Deleting it while a process holds the lock does not free anything, and if no process holds it the file is harmless. Find and stop the real holder with lsof instead.

The old process crashed — why is the lock still stuck? It almost certainly isn’t. flock locks are released by the kernel when a process exits, including on a crash or kill -9. If startup still fails, run lsof /var/lib/prometheus/data/lock; you will usually find a live process you didn’t know about (a slow-draining old instance, a duplicate unit, or a promtool job).

I changed the listen port and still get the error. Why? Because the conflict is over the data directory’s lock file, not the network port. Two Prometheus instances on different ports still collide if they share --storage.tsdb.path. Give each its own data directory.

Is --storage.tsdb.no-lockfile safe? Only when something else guarantees a single writer (e.g. a one-replica StatefulSet on ReadWriteOnce storage). Without that guarantee it invites two processes to corrupt the same TSDB. Do not use it to silence this error — stop the duplicate process.
