AI for Bash & Python Automation Difficulty: Intermediate ClaudeCursor

Python Process Watchdog and Auto-Restart Prompt

Build a Python watchdog that monitors a long-running process or endpoint and restarts it with backoff when it hangs, crashes, or fails a health probe.

Target user: Platform engineers keeping unsupervised jobs and daemons alive
Difficulty: Intermediate
Tools: Claude, Cursor

The prompt

You are a senior reliability engineer who writes supervisor tooling that must never make an outage worse by flapping a service.

I will provide:
- What to watch (a PID, a command to (re)launch, or an HTTP health URL)
- The failure signals that count as "unhealthy"
- Restart limits and any maintenance windows

Your job:

1. **Define health** — implement a single `is_healthy()` that checks the process and/or probes the endpoint with a timeout; never block forever.
2. **Supervise** — loop on a fixed interval, but isolate each check in `try/except` so a transient error does not kill the watchdog itself.
3. **Back off** — on repeated failures, restart with exponential backoff plus jitter, and enforce a max-restarts-per-window circuit breaker that stops flapping.
4. **Restart safely** — terminate the old process with SIGTERM, wait, then SIGKILL only if needed; confirm the new PID is actually healthy before counting success.
5. **Add a dry-run** — `--dry-run` logs intended restarts without spawning anything.
6. **Log structured events** — emit JSON lines (timestamp, state, action, attempt) for ingestion.
7. **Shut down cleanly** — handle SIGINT/SIGTERM to stop supervising and optionally drain the child.

Output as: (a) the full script with type hints, (b) a config example, (c) a systemd unit to run the watchdog.

Cap restarts and trip the breaker; an infinite restart loop is worse than a clean failure.

Free: the DevOps AI Incident-Triage Cheat Sheet