Dead-Man's-Switch and Automation Timeout Design Prompt
Add liveness and timeout safety to automated workflows by designing dead-man's switches, watchdog timers, stuck-run detection, and heartbeat alerts, so that an automation that hangs, stalls mid-action, or stops running entirely raises an alarm instead of failing silently.
- Target user: Platform engineers running long-lived and scheduled automation
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior automation/platform engineer who has seen a scheduled job silently stop running for a month and a remediation worker hang mid-action holding a lock. Design liveness and timeout safety for our automation.

I will provide:
- The automated workflows (scheduled jobs, queue workers, long-running orchestrations)
- Expected run cadence and per-step duration for each
- The orchestration/scheduling stack we use
- Past incidents of stuck, hung, or silently-stopped automation

Your job:
1. **Liveness model**: for each workflow, define what "alive and healthy" means: expected heartbeat cadence, max run duration, and max gap between successful runs.
2. **Dead-man's switches**: design switches that alert when an expected heartbeat or scheduled run does NOT arrive (catching the silent-stop case), not only when something errors.
3. **Per-step timeouts**: specify timeouts at action and overall-run level, with what happens on timeout (abort, release locks, mark for retry, escalate) so nothing hangs indefinitely holding resources.
4. **Stuck-run detection**: define how an in-progress run that has exceeded its budget is detected, force-terminated safely, and its partial work reconciled.
5. **Resource release**: ensure locks/leases/claims held by a killed run are released or expire, so a dead run can't block the fleet.
6. **Escalation**: map each timeout/liveness failure to the right alert and the human-handoff condition.

Output as:
(a) the per-workflow liveness spec (heartbeat, max-duration, max-gap),
(b) the dead-man's-switch alert rules,
(c) the timeout matrix (per-step and per-run with on-timeout behavior),
(d) stuck-run detection and safe-termination procedure,
(e) resource-release and escalation rules.

Default to failing loud and safe: prefer aborting and alerting over letting a run hang, ensure a force-terminated run releases its locks and leaves no half-applied change, and require a tested back-out for any action that could be interrupted mid-flight.
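This note is not part of the prompt. It is a minimal Python sketch of how the liveness terms in the prompt (max gap between successful runs, max run duration) might translate into a dead-man's-switch check. Every name here (`LivenessSpec`, `RunState`, `check_liveness`, and the alert labels `missed_run` and `stuck_run`) is hypothetical and chosen for illustration. It also assumes that timestamps are timezone-aware and that run state is recorded somewhere an independent checker can read it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class LivenessSpec:
    """Hypothetical per-workflow liveness spec; field names are illustrative."""
    workflow: str
    max_gap: timedelta       # longest allowed time between successful runs
    max_duration: timedelta  # longest allowed time for a single run


@dataclass(frozen=True)
class RunState:
    """Hypothetical snapshot of what the scheduler has recorded for a workflow."""
    last_success: Optional[datetime]   # None if the workflow has never succeeded
    running_since: Optional[datetime]  # None if no run is currently in progress


def check_liveness(spec: LivenessSpec, state: RunState, now: datetime) -> List[str]:
    """Return alert labels for a workflow; an empty list means it looks healthy."""
    alerts: List[str] = []

    # Dead-man's switch: alert on the ABSENCE of a recent success,
    # which catches a job that silently stopped being scheduled.
    if state.last_success is None or now - state.last_success > spec.max_gap:
        alerts.append("missed_run")

    # Stuck-run detection: an in-progress run has exceeded its time budget.
    if state.running_since is not None and now - state.running_since > spec.max_duration:
        alerts.append("stuck_run")

    return alerts
```

A checker like this would need to run on a schedule independent of the workflows it watches, since a dead-man's switch that shares a scheduler with the job it monitors can stop along with it. Safe termination and lock release for a `stuck_run` are left out of the sketch.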