Dead-Man's-Switch and Automation Timeout Design Prompt

Add liveness and timeout safety to automated workflows — designing dead-man's switches, watchdog timers, stuck-run detection, and heartbeat alerts so an automation that hangs, stalls mid-action, or stops running entirely raises an alarm instead of failing silently.

Target user: Platform engineers running long-lived and scheduled automation
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior automation/platform engineer who has seen a scheduled job silently stop running for a month and a remediation worker hang mid-action holding a lock. Design liveness and timeout safety for our automation.

I will provide:
- The automated workflows (scheduled jobs, queue workers, long-running orchestrations)
- Expected run cadence and per-step duration for each
- The orchestration/scheduling stack we use
- Past incidents of stuck, hung, or silently-stopped automation

Your job:

1. **Liveness model** — for each workflow define what "alive and healthy" means: expected heartbeat cadence, max run duration, and max gap between successful runs.
2. **Dead-man's switches** — design switches that alert when an expected heartbeat or scheduled run does NOT arrive (catching the silent-stop case), not only when something errors.
3. **Per-step timeouts** — specify timeouts at action and overall-run level, with what happens on timeout (abort, release locks, mark for retry, escalate) so nothing hangs indefinitely holding resources.
4. **Stuck-run detection** — define how an in-progress run that's exceeded its budget is detected, force-terminated safely, and its partial work reconciled.
5. **Resource release** — ensure locks/leases/claims held by a killed run are released or expire, so a dead run can't block the fleet.
6. **Escalation** — map each timeout/liveness failure to the right alert and the human-handoff condition.

Output as: (a) the per-workflow liveness spec (heartbeat, max-duration, max-gap), (b) the dead-man's-switch alert rules, (c) the timeout matrix (per-step and per-run with on-timeout behavior), (d) stuck-run detection and safe-termination procedure, (e) resource-release and escalation rules.

Default to failing loud and safe: prefer aborting and alerting over letting a run hang, ensure a force-terminated run releases its locks and leaves no half-applied change, and require a tested back-out for any action that could be interrupted mid-flight.
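
As an illustration of what a response to this prompt might contain (this is not part of the prompt text itself), the liveness spec, the dead-man's-switch rule, and the stuck-run check could be sketched as below. It is a minimal sketch under stated assumptions: the names `LivenessSpec`, `RunState`, and `check_liveness` are hypothetical, run timestamps are assumed to be recorded somewhere the checker can read, and alerts are plain strings rather than calls into a real paging system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class LivenessSpec:
    """Hypothetical per-workflow liveness spec; field names are illustrative."""
    workflow: str
    max_gap: timedelta       # longest allowed time between successful runs
    max_duration: timedelta  # longest allowed time for a single run


@dataclass(frozen=True)
class RunState:
    """Hypothetical snapshot of what the checker knows about a workflow."""
    last_success: Optional[datetime]   # None if the workflow has never succeeded
    running_since: Optional[datetime]  # None if no run is currently in progress


def check_liveness(spec: LivenessSpec, state: RunState, now: datetime) -> List[str]:
    """Return alert messages; an empty list means the workflow looks alive."""
    alerts: List[str] = []

    # Dead-man's switch: fire on the ABSENCE of a success, not on an error.
    if state.last_success is None or now - state.last_success > spec.max_gap:
        alerts.append(f"{spec.workflow}: no successful run within {spec.max_gap}")

    # Stuck-run detection: an in-progress run has exceeded its duration budget.
    if state.running_since is not None and now - state.running_since > spec.max_duration:
        alerts.append(f"{spec.workflow}: in-progress run exceeded {spec.max_duration}")

    return alerts
```

A check like this only catches the silent-stop case if it runs somewhere independent of the automation it watches; if it shares a scheduler or host with the watched job, one failure can silence both. Force-termination and lock release are deliberately left out of the sketch, since how to do them safely depends on the orchestration stack.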
