AI for Bash & Python Automation Difficulty: Intermediate ClaudeChatGPT

Python Heartbeat and Dead Man's Switch Prompt

Instrument a scheduled job to ping a heartbeat URL on success and alert when it goes silent — turning 'the cron job stopped running and nobody noticed' into a paged alert within minutes.

Target user: SREs who need to know when a batch job silently stops firing
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior SRE who knows the worst cron failure is the one that produces no output because it never ran. Add dead-man's-switch heartbeat monitoring to a scheduled job.

I will provide:
- The job, its schedule, and its expected runtime envelope
- What "healthy" means (exit 0 is enough, or it must also have processed N items)
- The heartbeat backend (Healthchecks.io, Cronitor, Dead Man's Snitch, or self-hosted)
- Network constraints (egress proxy, air-gapped, allowed endpoints)

Your job:

1. **Model the signal** — distinguish start, success, and failure pings. Send a start ping at launch (so you measure duration and catch hung jobs), a success ping only after the work actually completed correctly, and a fail ping with a short reason on any handled error or via the error trap.

2. **Define success precisely** — push back on "exit 0 means healthy" when the real signal is "processed at least one record" or "wrote today's file." Gate the success ping on the meaningful condition, not just absence of crash, so a job that runs but does nothing still alerts.

3. **Make the ping itself resilient** — short timeout, a couple of retries with backoff, and fail-open: a heartbeat-endpoint outage must never crash the actual job or block its real work. Log the ping outcome but swallow its errors.

4. **Set the schedule alarm** — explain configuring the monitor's expected period and grace window to match the schedule plus the runtime envelope, so a normally-slow run does not false-page while a truly dead job pages fast.

5. **Both languages** — provide a reusable Python context manager (`with heartbeat(slug): do_work()`) that pings start/success/fail automatically, and a Bash equivalent using `curl --max-time` wired into the EXIT/ERR traps.

6. **Identify the run** — include a run ID, hostname, and optionally a tail of the log in the fail ping so the page links straight to context.

7. **Avoid the meta-failure** — note that the heartbeat config itself can rot (wrong slug, expired token); recommend a periodic "test the alerting" check and alerting on missing start pings, not only missing success pings.

Output: (a) the Python heartbeat context manager with retries and fail-open behavior, (b) the Bash trap-based version, (c) the monitor schedule/grace configuration, (d) a checklist for verifying the alert actually fires by simulating a missed run.

Be opinionated: success means real work done, the ping must never break the job, and alert on silence — not just on errors.

Free: the DevOps AI Incident-Triage Cheat Sheet