Slack Scheduled Job (Cron) Failure Notification & Retry Prompt
Design Slack notifications for failed scheduled jobs and CronJobs that report failure context, missed runs, and offer guarded retry/skip actions.
- Target user: Engineers operating batch jobs, cron, and Kubernetes CronJobs
- Difficulty: Beginner
- Tools: Claude, ChatGPT
The prompt
You are a senior platform engineer who has wrangled hundreds of scheduled jobs and built Slack notifications that make batch failures obvious and quick to act on.

I will provide:
- Where our jobs run (Kubernetes CronJobs, system cron, Airflow, cloud schedulers)
- How job results surface today (exit codes, logs, a wrapper script)
- Which jobs are critical vs best-effort, and their owners
- Slack constraints (webhook or bot token, channel layout)
- Pain points (silent failures, alert spam from flaky jobs, no missed-run detection)

Your job:
1. **What deserves an alert**: always alert on failed runs of critical jobs; alert on best-effort jobs only after N consecutive failures; and, crucially, alert on missed runs of critical jobs (a job that never started when scheduled).
2. **Missed-run detection**: explain a heartbeat/dead-man's-switch approach in which each successful run pings a tracker and a watcher alerts when an expected run is overdue.
3. **Message design**: use Block Kit with a header (job name + schedule + status emoji), a section with exit code, duration, run start time, and owner mention, and a context block with links to the logs and the run history.
4. **Failure context**: include the last lines of stderr, the meaning of the exit code if known, and whether prior runs succeeded (e.g. "first failure in 30 days" vs "5th in a row").
5. **Action buttons**: Retry Now (guarded: confirms it is safe to re-run and respects idempotency), Skip This Run, View Logs, and Acknowledge.
6. **Flap suppression**: for jobs that fail intermittently, batch repeated failures into a single updating message rather than posting one per run.
7. **Recovery notices**: post a quiet "recovered after N failures" message when a previously failing job succeeds, then resolve the thread.
8. **Validation**: confirm that a genuinely missed run (the job never started) produces an alert, since failure-only monitoring would miss it.
Output as: (a) the job wrapper that reports success/failure to Slack, (b) the heartbeat/missed-run watcher logic, (c) Block Kit JSON for one failure message, (d) the critical-vs-best-effort job config schema, (e) a rollout plan that starts with a single job.

Bias toward: catching silent and missed runs, staying quiet for flaky best-effort jobs, and keeping retries guarded by idempotency checks.
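To make the heartbeat idea in step 2 concrete, here is a minimal Python sketch of the overdue check a watcher might run. It is an illustration under stated assumptions, not a reference implementation: the names `is_overdue` and `overdue_jobs`, the in-memory dictionaries, and the single shared grace period are all hypothetical, and a real setup would read heartbeats from whatever tracker the jobs ping.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def is_overdue(
    last_success: Optional[datetime],
    interval: timedelta,
    grace: timedelta,
    now: datetime,
) -> bool:
    """Return True when the expected success heartbeat has not arrived in time.

    Assumption: each successful run records a timestamp, and a run is
    expected once per `interval`, with `grace` allowed for normal jitter.
    """
    if last_success is None:
        # No heartbeat was ever recorded, so a job that never started
        # is reported instead of staying silent.
        return True
    return now - last_success > interval + grace


def overdue_jobs(
    heartbeats: Dict[str, Optional[datetime]],
    interval_by_job: Dict[str, timedelta],
    grace: timedelta,
    now: datetime,
) -> List[str]:
    """List the jobs (hypothetical names) whose heartbeat is overdue."""
    return sorted(
        name
        for name, interval in interval_by_job.items()
        if is_overdue(heartbeats.get(name), interval, grace, now)
    )
```

Because only successful runs record a heartbeat in this sketch, a job that keeps failing also becomes overdue, which gives a second signal alongside the failure alert. A job that appears in the schedule but has no heartbeat entry is treated as overdue, which is the case that failure-only monitoring misses.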