Python Graceful Shutdown and Signal Handling Prompt
Add robust SIGTERM/SIGINT handling to a long-running Python worker so it finishes in-flight work, releases resources, and exits cleanly within the container's grace period instead of being SIGKILLed.
- Target user: Engineers running Python workers, daemons, and batch jobs under Kubernetes or systemd
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a Python reliability engineer who has made dozens of workers shut down cleanly under Kubernetes SIGTERM and systemd `TimeoutStopSec`.

I will provide:
- The worker's main loop (poller, queue consumer, or batch job)
- The resources it holds (DB connections, file handles, locks, in-flight tasks)
- The platform grace period (e.g. K8s `terminationGracePeriodSeconds=30`)

Your job:
1. **Signal contract**: explain what SIGTERM, SIGINT, and SIGKILL mean here. SIGTERM is the graceful request, SIGINT is Ctrl-C in dev, and SIGKILL cannot be caught. The goal is to finish draining before the grace period expires and SIGKILL lands.
2. **Handler design**: install handlers via `signal.signal` that set a `threading.Event` or asyncio shutdown flag rather than doing real work inside the handler. In CPython the handler runs on the main thread between bytecode instructions and can interrupt arbitrary code, so logging, locking, or I/O inside it risks reentrancy bugs and deadlocks. The main loop checks the flag at safe points.
3. **Draining**: stop accepting new work immediately, let in-flight tasks finish up to a deadline, then force-cancel stragglers. Track a deadline (`grace_period - safety_margin`) and log how many items drained vs. were abandoned.
4. **Resource teardown**: close DB pools, flush buffers, release `flock`/advisory locks, and ack/nack queue messages in the right order. Use a single idempotent teardown function, guarded so that a second signal (an impatient operator hitting Ctrl-C twice) escalates from graceful to immediate shutdown.
5. **asyncio variant**: show `loop.add_signal_handler` driving an `asyncio.Event`, cancelling tasks, and awaiting `gather(..., return_exceptions=True)` with a timeout.
6. **Observability**: emit a structured log at shutdown start, at each resource close, and at exit with the final exit code; expose a "shutting down" state to readiness probes so the orchestrator stops routing traffic.
7. **Verification**: write a test that sends SIGTERM mid-work and asserts that in-flight items complete, resources close exactly once, and the exit code is 0.

Output as: (a) sync worker with signal handling, (b) the asyncio variant, (c) the double-signal escalation logic, (d) the SIGTERM integration test. Bias toward: doing minimal work in handlers, an explicit drain deadline, and idempotent teardown.
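Example of the shape a good answer to parts (a) and (c) might take. This is a minimal sketch, not a definitive implementation: `ShutdownController`, `run_worker`, `fetch`, `process`, and `closers` are hypothetical names chosen for illustration, and the 5-second default safety margin is an assumption. The handler only flips flags, the main loop checks them between items, and a second signal ends the drain early.

```python
import signal
import threading
import time


class ShutdownController:
    """Hypothetical helper: tracks shutdown state for a sync worker."""

    def __init__(self, grace_period: float, safety_margin: float = 5.0):
        self.requested = threading.Event()  # first signal: stop taking new work
        self.force = threading.Event()      # second signal: stop draining now
        self.budget = max(grace_period - safety_margin, 0.0)
        self.deadline = None                # monotonic time, set on first signal
        self._torn_down = False

    def handle(self, signum, frame):
        # Minimal work only: flip flags, no logging or I/O in the handler.
        if self.requested.is_set():
            self.force.set()
        else:
            self.deadline = time.monotonic() + self.budget
            self.requested.set()

    def install(self):
        # signal.signal must be called from the main thread.
        for sig in (signal.SIGTERM, signal.SIGINT):
            signal.signal(sig, self.handle)

    def out_of_time(self) -> bool:
        return self.force.is_set() or (
            self.deadline is not None and time.monotonic() >= self.deadline
        )

    def teardown(self, closers) -> bool:
        """Run each closer once; later calls are no-ops and return False."""
        if self._torn_down:
            return False
        self._torn_down = True
        for close in closers:
            close()
        return True


def run_worker(ctl, fetch, process, closers):
    """Process batches until shutdown is requested, then drain the current one.

    Returns (drained, abandoned): items finished after the first signal,
    and items dropped because the deadline passed or a second signal arrived.
    """
    drained = abandoned = 0
    while not ctl.requested.is_set():
        pending = list(fetch())
        if not pending:
            time.sleep(0.05)  # idle briefly instead of spinning
        while pending:
            if ctl.requested.is_set() and ctl.out_of_time():
                abandoned = len(pending)
                break
            process(pending.pop(0))
            if ctl.requested.is_set():
                drained += 1
    ctl.teardown(closers)
    return drained, abandoned
```

In a real entry point one would call `ctl.install()` before `run_worker(...)`, log the returned counts, and choose the exit code from them (for example 0 when nothing was abandoned). The loop polls `is_set()` rather than blocking on `Event.wait()`, so the handler never contends with a lock the main thread might already hold.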