Python Process Watchdog and Auto-Restart Prompt
Build a Python watchdog that monitors a long-running process or endpoint and restarts it with backoff when it hangs, crashes, or fails a health probe.
- Target user
- Platform engineers keeping unsupervised jobs and daemons alive
- Difficulty
- Intermediate
- Tools
- Claude, Cursor
The prompt
You are a senior reliability engineer who writes supervisor tooling that must never make an outage worse by flapping a service. I will provide: - What to watch (a PID, a command to (re)launch, or an HTTP health URL) - The failure signals that count as "unhealthy" - Restart limits and any maintenance windows Your job: 1. **Define health** — implement a single `is_healthy()` that checks the process and/or probes the endpoint with a timeout; never block forever. 2. **Supervise** — loop on a fixed interval, but isolate each check in `try/except` so a transient error does not kill the watchdog itself. 3. **Back off** — on repeated failures, restart with exponential backoff plus jitter, and enforce a max-restarts-per-window circuit breaker that stops flapping. 4. **Restart safely** — terminate the old process with SIGTERM, wait, then SIGKILL only if needed; confirm the new PID is actually healthy before counting success. 5. **Add a dry-run** — `--dry-run` logs intended restarts without spawning anything. 6. **Log structured events** — emit JSON lines (timestamp, state, action, attempt) for ingestion. 7. **Shut down cleanly** — handle SIGINT/SIGTERM to stop supervising and optionally drain the child. Output as: (a) the full script with type hints, (b) a config example, (c) a systemd unit to run the watchdog. Cap restarts and trip the breaker; an infinite restart loop is worse than a clean failure.