Bash Single-Instance Lock with flock Prompt
Guarantee a script runs as a single instance using flock, with stale-lock detection, PID tracking, and clean release on every exit path — so overlapping cron runs never collide.
- Target user: SREs and platform engineers hardening cron and timer jobs against overlap
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior systems engineer who has debugged dozens of "the cron job ran twice and corrupted the export" incidents. Design a bulletproof single-instance lock for a Bash script.
I will provide:
- The script's purpose and worst-case runtime
- How it's triggered (cron, systemd timer, manual, or all of the above)
- What happens if two copies run at once (data corruption, double-charge, harmless)
- The target OS / coreutils version
Your job:
1. **Choose the locking primitive** — default to `flock` on a dedicated lock file descriptor (`exec 9>>"$LOCKFILE"; flock -n 9 || exit 0`; open with `>>` rather than `>` so a contending run does not truncate the context the current holder wrote into the file). Explain why this beats `mkdir` locks, PID files, and `set -o noclobber` for crash-safety: when the process dies the kernel closes its fd and drops the lock with it, so there is no stale lock after a `kill -9`.
2. **Decide the contention behavior** — `flock -n` (fail fast, skip this run) vs `flock -w 30` (wait then give up) vs blocking. Map each to the trigger type: cron overlaps should skip silently; a deploy hook should wait.
3. **Write the scaffold** in strict mode (`set -euo pipefail`), with a `LOCKFILE` under `/run` (tmpfs, cleared on reboot, normally writable only by root) or `${XDG_RUNTIME_DIR}` for per-user jobs, never `/tmp`. Log lock acquisition right after `flock` succeeds, and include a `trap` that logs release.
4. **Record context inside the lock** — write the PID, hostname, and start timestamp into the lock file so an operator running `cat` can see who holds it and for how long.
5. **Detect runaway holders** — show an optional watchdog: if the lock has been held longer than the worst-case runtime, log a warning (and optionally page) rather than silently piling up skipped runs.
6. **Exit codes** — distinguish "did real work" (0), "skipped, another instance held the lock" (0 or a sentinel like 75, which is `EX_TEMPFAIL` in `sysexits.h`), and "failed" (any other non-zero code), and explain how cron/Alertmanager should interpret each.
7. **Python equivalent** — provide a `fcntl.flock` context manager so a mixed bash/python codebase shares one convention.
Output: (a) the annotated Bash scaffold, (b) the Python context manager, (c) a test using two backgrounded copies that proves only one proceeds, (d) a one-paragraph rationale I can paste into a runbook.
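Be opinionated: prefer fd-based flock, fail-fast for cron, and never delete the lock file in the trap (once the file is unlinked, a run still waiting on the old inode and a new run that creates a fresh file lock different inodes, so both proceed at once).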
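
For orientation, here is a minimal sketch of the kind of Bash scaffold the prompt asks for. It is not a definitive implementation. The lock path `myjob.lock`, the `MAX_RUNTIME` default, the `SKIP_EXIT` value, and the `log` helper are all illustrative assumptions, and the sketch assumes `flock(1)` from util-linux plus GNU `date` and `stat` are installed.

```bash
#!/usr/bin/env bash
# Sketch only. The lock path, MAX_RUNTIME, SKIP_EXIT, and log() are illustrative
# assumptions; flock(1) from util-linux and GNU date/stat are assumed available.
set -euo pipefail

LOCKFILE="${LOCKFILE:-${XDG_RUNTIME_DIR:-/run}/myjob.lock}"  # hypothetical name
MAX_RUNTIME="${MAX_RUNTIME:-3600}"  # assumed worst-case runtime, in seconds
SKIP_EXIT=75                        # sentinel: another instance holds the lock

log() { printf '%s [%d] %s\n' "$(date -u +%FT%TZ)" "$$" "$*" >&2; }

acquire_lock() {
    local held_for
    # Append mode creates the file if needed without truncating it, so a
    # contending run cannot wipe the context the current holder recorded.
    exec 9>>"$LOCKFILE"
    if ! flock -n 9; then
        # The holder rewrites the file on acquisition, so mtime ~ its start time.
        held_for=$(( $(date +%s) - $(stat -c %Y "$LOCKFILE") ))
        log "skipping: lock held for ${held_for}s by: $(cat "$LOCKFILE")"
        if (( held_for > MAX_RUNTIME )); then
            log "WARNING: holder has exceeded MAX_RUNTIME=${MAX_RUNTIME}s"
        fi
        return "$SKIP_EXIT"
    fi
    # We hold the lock, so overwriting the contents is safe now.
    printf 'pid=%d host=%s start=%s\n' \
        "$$" "$(uname -n)" "$(date -u +%FT%TZ)" >"$LOCKFILE"
    # Log release on every exit path; the file itself is never deleted.
    trap 'log "lock released"' EXIT
    log "lock acquired"
}
```

A job would call `acquire_lock || exit $?` before doing any real work. Starting two copies in the background should then leave one running and the other exiting with the assumed sentinel 75.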