Python Heartbeat and Dead Man's Switch Prompt
Instrument a scheduled job to ping a heartbeat URL on success and alert when it goes silent — turning 'the cron job stopped running and nobody noticed' into a paged alert within minutes.
- Target user: SREs who need to know when a batch job silently stops firing
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who knows the worst cron failure is the one that produces no output because it never ran. Add dead-man's-switch heartbeat monitoring to a scheduled job.

I will provide:

- The job, its schedule, and its expected runtime envelope
- What "healthy" means (exit 0 is enough, or it must also have processed N items)
- The heartbeat backend (Healthchecks.io, Cronitor, Dead Man's Snitch, or self-hosted)
- Network constraints (egress proxy, air-gapped, allowed endpoints)

Your job:

1. **Model the signal**: distinguish start, success, and failure pings. Send a start ping at launch (so you measure duration and catch hung jobs), a success ping only after the work actually completed correctly, and a fail ping with a short reason on any handled error or via the error trap.
2. **Define success precisely**: push back on "exit 0 means healthy" when the real signal is "processed at least one record" or "wrote today's file." Gate the success ping on the meaningful condition, not just absence of crash, so a job that runs but does nothing still alerts.
3. **Make the ping itself resilient**: short timeout, a couple of retries with backoff, and fail-open: a heartbeat-endpoint outage must never crash the actual job or block its real work. Log the ping outcome but swallow its errors.
4. **Set the schedule alarm**: explain configuring the monitor's expected period and grace window to match the schedule plus the runtime envelope, so a normally-slow run does not false-page while a truly dead job pages fast.
5. **Both languages**: provide a reusable Python context manager (`with heartbeat(slug): do_work()`) that pings start/success/fail automatically, and a Bash equivalent using `curl --max-time` wired into the EXIT/ERR traps.
6. **Identify the run**: include a run ID, hostname, and optionally a tail of the log in the fail ping so the page links straight to context.
7. **Avoid the meta-failure**: note that the heartbeat config itself can rot (wrong slug, expired token); recommend a periodic "test the alerting" check and alerting on missing start pings, not only missing success pings.

Output:

(a) the Python heartbeat context manager with retries and fail-open behavior,
(b) the Bash trap-based version,
(c) the monitor schedule/grace configuration,
(d) a checklist for verifying the alert actually fires by simulating a missed run.

Be opinionated: success means real work done, the ping must never break the job, and alert on silence, not just on errors.
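
As a rough illustration of what deliverable (a) might look like, here is a minimal sketch using only the Python standard library. It rests on assumptions that are not part of the prompt: the backend is assumed to follow a Healthchecks.io-style URL scheme, where appending `/start` or `/fail` to the check's ping URL signals those states and a `rid` query parameter carries the run ID. The function takes the full ping URL rather than a bare slug, and the names `http_ping`, `safe_ping`, and the `send`/`sleep` parameters are hypothetical, added so the network call can be swapped out in tests. Treat it as a starting point to adapt to your backend, not a definitive implementation.

```python
import contextlib
import logging
import socket
import time
import urllib.request
import uuid

log = logging.getLogger("heartbeat")


def http_ping(url, body=b"", timeout=5.0):
    """POST a small body to the heartbeat URL; raises on network or HTTP error."""
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()


def safe_ping(url, body=b"", retries=3, backoff=1.0, send=http_ping, sleep=time.sleep):
    """Try the ping a few times with exponential backoff; never raise (fail-open)."""
    for attempt in range(retries):
        try:
            send(url, body)
            log.info("heartbeat ok: %s", url)
            return True
        except Exception as exc:  # swallowed: monitoring must not break the job
            log.warning("heartbeat attempt %d failed for %s: %s", attempt + 1, url, exc)
            if attempt < retries - 1:
                sleep(backoff * 2 ** attempt)
    return False


@contextlib.contextmanager
def heartbeat(base_url, send=http_ping, sleep=time.sleep):
    """Ping start on entry, success on clean exit, fail (with context) on exception."""
    rid = str(uuid.uuid4())
    query = f"?rid={rid}"  # assumption: backend accepts a run ID as ?rid=
    context = f"host={socket.gethostname()} run_id={rid}"
    safe_ping(f"{base_url}/start{query}", context.encode(), send=send, sleep=sleep)
    try:
        yield rid
    except Exception as exc:
        reason = f"{context} error={type(exc).__name__}: {exc}"[:1000]
        safe_ping(f"{base_url}/fail{query}", reason.encode(), send=send, sleep=sleep)
        raise  # the job's own error still propagates to the scheduler
    else:
        safe_ping(f"{base_url}{query}", context.encode(), send=send, sleep=sleep)


# Usage sketch (do_work and the URL are placeholders):
#
#     with heartbeat("https://hc-ping.com/your-check-uuid"):
#         processed = do_work()
#         if processed == 0:
#             raise RuntimeError("ran but processed 0 records")
```

Raising inside the `with` block when no real work was done is one way to gate the success ping on a meaningful condition: the exception routes the run to the fail ping instead. Note that `sys.exit()` and `KeyboardInterrupt` are not caught by `except Exception`, so in this sketch such runs send neither success nor fail and are caught only by the monitor's silence alarm.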