
Slack Scheduled Job (Cron) Failure Notification & Retry Prompt

Design Slack notifications for failed scheduled jobs and CronJobs that report failure context, missed runs, and offer guarded retry/skip actions.

Target user: Engineers operating batch jobs, cron, and Kubernetes CronJobs
Difficulty: Beginner
Tools: Claude, ChatGPT

The prompt

You are a senior platform engineer who has wrangled hundreds of scheduled jobs and built Slack notifications that make batch failures obvious and quick to act on.

I will provide:
- Where our jobs run (Kubernetes CronJobs, system cron, Airflow, cloud schedulers)
- How job results surface today (exit codes, logs, a wrapper script)
- Which jobs are critical vs best-effort, and their owners
- Slack constraints (webhook or bot token, channel layout)
- Pain points (silent failures, alert spam from flaky jobs, no missed-run detection)

Your job:

1. **What deserves an alert** — failed runs of critical jobs always; best-effort jobs only after N consecutive failures; and crucially, missed runs (a job that never started when scheduled) for critical jobs.

2. **Missed-run detection** — explain a heartbeat/dead-man's-switch approach: each successful run pings a tracker, and a watcher alerts when an expected run is overdue.

3. **Message design** — Block Kit: header (job name + schedule + status emoji), section with exit code, duration, run start time, and owner mention; context block with a link to logs and the run history.

4. **Failure context** — include the last lines of stderr, the exit code meaning if known, and whether prior runs succeeded (e.g. "first failure in 30 days" vs "5th in a row").

5. **Action buttons** — Retry Now (guarded: confirms it is safe to re-run, respects idempotency), Skip This Run, View Logs, and Acknowledge.

6. **Flap suppression** — for jobs that fail intermittently, batch repeated failures into a single updating message rather than one per run.

7. **Recovery notices** — post a quiet "recovered after N failures" message when a previously failing job succeeds, then resolve the thread.

8. **Validation** — confirm that a genuinely missed run (the job never started at its scheduled time) produces an alert, since failure-only monitoring would never see it.

Output as: (a) the job wrapper that reports success/failure to Slack, (b) the heartbeat/missed-run watcher logic, (c) Block Kit JSON for one failure message, (d) the critical-vs-best-effort job config schema, (e) a rollout plan that starts with a single job.

Bias toward: catching silent and missed runs, staying quiet for flaky best-effort jobs, and always guarding retries with an idempotency check.
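
Outside the prompt itself, it can help to know roughly what a good answer to items 1 and 2 looks like. The sketch below is one possible shape for the alert policy and the dead-man's-switch check, not a reference implementation. `JobConfig`, its fields, and both function names are hypothetical, and the default threshold of 3 stands in for the prompt's unspecified N.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobConfig:
    # All field names here are illustrative assumptions, not a fixed schema.
    name: str
    critical: bool
    interval: timedelta         # expected time between successful runs
    grace: timedelta            # extra slack before a run counts as missed
    failure_threshold: int = 3  # the "N" for best-effort jobs (assumed default)


def should_alert_on_failure(job: JobConfig, consecutive_failures: int) -> bool:
    """Critical jobs alert on every failure; best-effort jobs only after N in a row."""
    if job.critical:
        return consecutive_failures >= 1
    return consecutive_failures >= job.failure_threshold


def is_run_overdue(job: JobConfig, last_success: Optional[datetime], now: datetime) -> bool:
    """Dead-man's switch: overdue when no success ping arrived within interval + grace."""
    if not job.critical:
        return False  # missed-run alerts are reserved for critical jobs
    if last_success is None:
        return True   # the job has never reported a successful run
    return now - last_success > job.interval + job.grace
```

Because only successful runs ping the tracker, the same check also fires when a job starts but keeps failing, so the watcher and the failure alert may need de-duplicating in a real setup.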

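
As a rough yardstick for judging the Block Kit output the prompt asks for, the following sketch builds one failure message as a Python list of blocks. The block types (`header`, `section`, `context`, `actions`) and the button `confirm` dialog are standard Block Kit. The function name, its parameters, and every `action_id` are hypothetical placeholders.

```python
import json
from typing import Any

FENCE = "`" * 3  # Slack mrkdwn code fence, built at runtime to keep this example easy to embed


def build_failure_blocks(job: str, schedule: str, exit_code: int, duration_s: float,
                         started_at: str, owner_id: str, stderr_tail: str, streak: str,
                         logs_url: str, history_url: str) -> list[dict[str, Any]]:
    """Return Block Kit blocks for a single failed run (illustrative layout only)."""
    # Confirmation dialog that guards the retry button against accidental re-runs.
    retry_confirm = {
        "title": {"type": "plain_text", "text": "Retry this job?"},
        "text": {"type": "plain_text", "text": "Only retry if a second run is safe (the job is idempotent)."},
        "confirm": {"type": "plain_text", "text": "Retry"},
        "deny": {"type": "plain_text", "text": "Cancel"},
    }

    def button(label: str, action_id: str, **extra: Any) -> dict[str, Any]:
        return {"type": "button", "text": {"type": "plain_text", "text": label},
                "action_id": action_id, **extra}

    return [
        {"type": "header",
         "text": {"type": "plain_text", "text": f":x: {job} failed ({schedule})"[:150], "emoji": True}},
        {"type": "section", "fields": [
            {"type": "mrkdwn", "text": f"*Exit code:* {exit_code}"},
            {"type": "mrkdwn", "text": f"*Duration:* {duration_s:.1f}s"},
            {"type": "mrkdwn", "text": f"*Started:* {started_at}"},
            {"type": "mrkdwn", "text": f"*Owner:* <@{owner_id}>"},
        ]},
        # Keep only the end of stderr so the section stays under Slack's text limit.
        {"type": "section",
         "text": {"type": "mrkdwn", "text": f"*{streak}*\n{FENCE}{stderr_tail[-2500:]}{FENCE}"}},
        {"type": "context", "elements": [
            {"type": "mrkdwn", "text": f"<{logs_url}|Logs> | <{history_url}|Run history>"},
        ]},
        {"type": "actions", "elements": [
            button("Retry Now", "job_retry", value=job, style="danger", confirm=retry_confirm),
            button("Skip This Run", "job_skip", value=job),
            button("View Logs", "job_logs", url=logs_url),
            button("Acknowledge", "job_ack", value=job),
        ]},
    ]
```

Calling `json.dumps(build_failure_blocks(...))` gives the `blocks` array to pass to `chat.postMessage`. The confirm dialog only guards against accidental clicks; the handler behind the hypothetical `job_retry` action would still need its own idempotency check before re-running anything.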