File Locking and Graceful Shutdown: The Two Habits That Separate Hobby Scripts from Production Ones

Two of the nastiest production incidents I’ve cleaned up were caused by the same two missing safeguards. The first: a backup cron job that ran every five minutes but sometimes took six, so two copies ran at once, corrupted the destination, and nobody noticed for a week. The second: a data-migration script that got killed mid-write during a deploy and left a half-written file that the next process happily read as truth. Both are completely preventable with about three lines of code each. This is the guide to those three lines.

Problem one: scripts that overlap themselves

Any script triggered on a schedule — cron, a systemd timer, a loop — can be started again before the previous run finishes. If the work isn’t safe to run concurrently (writing a file, touching a database, holding a connection), overlapping runs corrupt state or thrash resources. The fix is a lock that only one instance can hold.

flock in Bash

flock is the standard tool and it’s beautifully simple. The cleanest pattern wraps the whole script so the lock is held for the script’s lifetime and released automatically when it exits — even on a crash:

#!/usr/bin/env bash
set -euo pipefail

exec 200>/var/run/mybackup.lock
flock -n 200 || { echo "Another instance is running; exiting."; exit 0; }

# --- real work below; the lock is held until this process exits ---
echo "Running backup..."
./do_backup.sh

Here exec 200>file opens file descriptor 200 on the lock file, and flock -n 200 tries to take an exclusive lock on it without blocking. If another instance holds the lock, -n makes flock fail immediately and we exit cleanly. Crucially, when the script ends for any reason (normal exit, crash, kill), the OS closes the fd and releases the lock. There are no stale lock files to clean up, which is exactly the trap that mkdir-based or PID-file locking falls into. One caveat: child processes inherit fd 200, so a background process that outlives the script keeps the lock held until it exits too.

A one-liner form is great for cron lines:

flock -n /var/run/mybackup.lock ./backup.sh

If backup.sh is already running, this run does nothing: flock -n gives up immediately, exits with a non-zero status, and never starts backup.sh. That short prefix in front of your cron command would have prevented my overlapping-backup incident outright.

Locking in Python

Python’s standard library has fcntl for the same thing:

import fcntl, sys

lock_file = open("/var/run/myjob.lock", "w")
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    print("Another instance is running; exiting.")
    sys.exit(0)

# ... work ...  (lock released when the process exits and the fd closes)

LOCK_EX | LOCK_NB is an exclusive, non-blocking lock — the direct equivalent of flock -n. Keep a reference to lock_file for the whole run; if it gets garbage-collected, the lock releases early. That’s a subtle bug worth knowing about.
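One way to sidestep the garbage-collection trap is to tie the lock’s lifetime to a with block, so the file object stays referenced for exactly as long as the work runs. The following is a minimal sketch rather than a standard-library API: single_instance and AlreadyRunning are names invented here for illustration, and the demo lock path in the temp directory is an assumption.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


class AlreadyRunning(RuntimeError):
    """Raised when another holder already has the lock (hypothetical name)."""


@contextmanager
def single_instance(path):
    # Append mode creates the file if it is missing and never truncates it.
    lock_file = open(path, "a")
    try:
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise AlreadyRunning(path) from None
        # The generator frame keeps lock_file referenced for the whole block.
        yield lock_file
    finally:
        # Closing the descriptor releases the lock if we held it.
        lock_file.close()


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.gettempdir(), "myjob-demo.lock")
    try:
        with single_instance(demo_path):
            print("Lock held; doing work.")
    except AlreadyRunning:
        print("Another instance is running; exiting.")
```

Raising a dedicated exception instead of calling sys.exit inside the helper leaves the exit policy to the caller, which also makes the behavior easy to exercise in a test.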

Problem two: scripts that die mid-operation

When you deploy, restart a service, or hit Ctrl-C, the OS sends a signal — usually SIGTERM (polite “please stop”) or SIGINT (Ctrl-C). By default the process dies immediately, wherever it is. If that’s halfway through writing a file or in the middle of a batch, you’ve corrupted state. Graceful shutdown means catching the signal, finishing the current safe unit of work, cleaning up, and then exiting.

Signal handling in Bash with trap

trap runs a handler on a signal. The most valuable use is cleanup on exit — remove temp files, release resources — regardless of how the script ends:

#!/usr/bin/env bash
set -euo pipefail

# Placeholder work list and worker; substitute your own.
items=(alpha beta gamma)
process() { echo "Processing $1"; }

workdir=$(mktemp -d)
cleanup() {
  echo "Cleaning up $workdir"
  rm -rf "$workdir"
}
trap cleanup EXIT          # runs on normal exit, errors, and trapped signals (not SIGKILL)

STOP=0                     # defined before the signal trap can fire
trap 'echo "Interrupted; finishing current item..."; STOP=1' INT TERM

for item in "${items[@]}"; do
  [[ $STOP -eq 1 ]] && break   # stop at a safe boundary, not mid-item
  process "$item"
done

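trap cleanup EXIT is the workhorse: it fires on every exit the shell can observe (everything short of SIGKILL), so your temp dir gets removed. The second trap on INT TERM sets a flag instead of dying instantly, letting the loop finish the current item and stop at a clean boundary. Bash defers a trap until the foreground command it is waiting on completes, which is what makes this work for a SIGTERM sent to the script. Ctrl-C in a terminal is different: it delivers SIGINT to the whole foreground process group, so any external command the script is running receives it as well. That distinction, stopping between units of work rather than during one, is the entire point of graceful shutdown.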

Signal handling in Python

Python’s signal module mirrors this. The pattern I use sets a flag the main loop checks:

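import signal

shutdown = False

def handle(signum, frame):
    global shutdown
    print(f"Got {signal.Signals(signum).name}; stopping after current item.")
    shutdown = True

signal.signal(signal.SIGTERM, handle)
signal.signal(signal.SIGINT, handle)

# Placeholders so the example runs; substitute your own.
work_items = ["alpha", "beta", "gamma"]

def process(item):
    print(f"Processing {item}")

def cleanup():
    print("Cleaning up")

for item in work_items:
    if shutdown:
        break
    process(item)          # the handler only sets a flag, so the item runs to completion
cleanup()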

For guaranteed cleanup regardless of how you exit, pair this with a try/finally or a with block around resources. The flag decides when to stop; finally guarantees that cleanup runs.
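The pairing of flag and finally can be sketched as a single function. This is an illustration under assumptions, not a fixed recipe: ShutdownFlag and run are hypothetical names, and the demo simulates a deploy by having the process send SIGTERM to itself partway through.

```python
import os
import signal


class ShutdownFlag:
    """Records that SIGTERM or SIGINT arrived; the main loop polls it."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        self.requested = True


def run(work_items, process, cleanup):
    flag = ShutdownFlag()
    # Install the handlers and remember the previous ones for restoring later.
    previous = {
        sig: signal.signal(sig, flag.handle)
        for sig in (signal.SIGTERM, signal.SIGINT)
    }
    done = []
    try:
        for item in work_items:
            if flag.requested:
                break              # stop between items, never inside one
            process(item)
            done.append(item)
    finally:
        cleanup()                  # runs on break, normal end, or an exception
        for sig, old in previous.items():
            # signal.signal() returns None for handlers not set from Python.
            signal.signal(sig, old if old is not None else signal.SIG_DFL)
    return done


if __name__ == "__main__":
    def demo_process(item):
        if item == 2:
            # Simulate a stop request arriving while item 2 is in progress.
            os.kill(os.getpid(), signal.SIGTERM)
        print("processed", item)

    finished = run([1, 2, 3, 4], demo_process, lambda: print("cleanup ran"))
    print("finished items:", finished)
```

Keeping the flag in an object rather than a module-level global makes the loop reusable and lets the previous handlers be restored once the work is over. Note that signal.signal must be called from the main thread.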

Why this matters for systemd

If you run scripts as systemd services or timers (and you probably should, rather than cron), systemd sends SIGTERM whenever the unit is stopped or restarted, which is what most deploys do, and then SIGKILL if the process is still running after TimeoutStopSec. Handling SIGTERM means your service shuts down cleanly during every deploy instead of being killed mid-work. A service that doesn’t handle SIGTERM dies wherever it happens to be, so it risks corrupting state every time you restart it. I’ve seen exactly that turn a routine deploy into a data-recovery exercise.
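The SIGTERM-then-SIGKILL sequence is easy to imitate by hand, which gives you a way to check how a script behaves before a real deploy does it for you. The sketch below only mimics the ordering of signals for a single process; it is not how systemd is implemented. stop_like_systemd, start, and the two inline child scripts are hypothetical names made up for the demonstration, and the timeout values are arbitrary.

```python
import signal
import subprocess
import sys

# A child that handles SIGTERM by setting a flag and exiting cleanly.
GRACEFUL_CHILD = """
import signal, sys, time
stop = False
def handle(signum, frame):
    global stop
    stop = True
signal.signal(signal.SIGTERM, handle)
print("ready", flush=True)
while not stop:
    time.sleep(0.05)
sys.exit(0)
"""

# A child that ignores SIGTERM entirely, so only SIGKILL can stop it.
STUBBORN_CHILD = """
import signal, time
signal.signal(signal.SIGTERM, signal.SIG_IGN)
print("ready", flush=True)
while True:
    time.sleep(0.05)
"""


def start(source):
    proc = subprocess.Popen(
        [sys.executable, "-c", source], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()     # block until the child has set up its handler
    return proc


def stop_like_systemd(proc, timeout_stop_sec=5.0):
    """Send SIGTERM, wait up to the timeout, then fall back to SIGKILL."""
    proc.terminate()           # SIGTERM on POSIX
    try:
        code = proc.wait(timeout=timeout_stop_sec)
    except subprocess.TimeoutExpired:
        proc.kill()            # SIGKILL cannot be caught or ignored
        code = proc.wait()
    if proc.stdout is not None:
        proc.stdout.close()
    return code


if __name__ == "__main__":
    print("graceful child exit code:", stop_like_systemd(start(GRACEFUL_CHILD)))
    print("stubborn child exit code:",
          stop_like_systemd(start(STUBBORN_CHILD), timeout_stop_sec=0.5))
```

On POSIX, Popen.wait reports a negative number when the child was terminated by a signal, so the two cases are easy to tell apart: a clean exit returns the child’s own status, while a forced kill returns the negated SIGKILL number.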

Put them together

Real production scripts combine both: a flock lock so only one runs, a trap/signal handler so the one that’s running shuts down cleanly, and try/finally cleanup so nothing leaks. None of it is more than a handful of lines, and together they convert a script that “works on my machine” into one that survives the chaos of a real production environment — overlapping schedules, deploys, and operators hitting Ctrl-C.


Test locking and shutdown behavior — including sending SIGTERM mid-run — before relying on a script in production schedules or deploy pipelines.
