Slack Bot Health Check & Heartbeat Self-Monitoring Prompt
Design self-monitoring and heartbeat checks for a Slack bot so that the bot's own outages (token expiry, socket drops, queue backlog) are detected and surfaced externally.
- Target user: Engineers operating a production Slack bot who need to monitor the monitor
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior reliability engineer who has run business-critical Slack bots and learned the hard way that a Slack bot which monitors everything else must also monitor itself.

I will provide:
- Our bot architecture (Bolt, Socket Mode vs HTTP events, queues, workers)
- What the bot is responsible for (alert routing, ChatOps, approvals)
- Current observability stack (Prometheus, a status endpoint, external uptime checks)
- Slack constraints (token types, rate-limit exposure)
- Pain points (silent bot outages, expired tokens, unnoticed event lag)

Your job:
1. **The core problem**: a Slack bot cannot reliably report its own death *into Slack*, because if it is down it cannot post. Design health signaling that lives OUTSIDE Slack (an external uptime monitor, a separate heartbeat channel, or a secondary alerting path).
2. **Heartbeat design**: the bot emits a heartbeat (to Prometheus pushgateway, a healthcheck URL, or a dead-man's-switch service like a cron-monitor) on a fixed interval. An external watcher alerts via a different channel (email, PagerDuty, SMS) when the heartbeat stops.
3. **Internal health signals to track**: Socket Mode connection state and reconnect count, event-handler latency and error rate, queue depth / backlog, rate-limit (429) frequency, token validity (`auth.test`), and worker liveness.
4. **Token expiry guard**: proactively check token validity and OAuth refresh status; alert BEFORE a token expires, not after the bot silently stops posting.
5. **Degraded-mode behavior**: define what the bot does when partially impaired (e.g. event backlog) and how it self-reports degradation to a status surface.
6. **External alert path**: wire the dead-man's-switch so loss of heartbeat pages on-call through a path that does not depend on the bot or Slack itself.
7. **Dashboards**: a concise health dashboard (connection, latency, queue, error rate, last heartbeat) for quick triage.
8. **Validation**: run a chaos test: kill the bot and confirm the external watcher pages on-call within the expected window.

Output as:
(a) the heartbeat emitter + external watcher design,
(b) the list of health metrics with alert thresholds,
(c) the token-expiry pre-warning logic,
(d) a degraded-mode state machine,
(e) a chaos-test plan proving outages are detected.

Bias toward: detecting the bot's own death through a path independent of Slack, alerting on token expiry before it bites, no silent failures.
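
Example: heartbeat staleness check. To make the heartbeat design item in the prompt concrete, here is a minimal sketch of the check an external watcher might run. The class name `HeartbeatWatcher`, the `missed_beats` tolerance, and the 30-second interval are illustrative assumptions, not part of any Slack or monitoring-vendor API; a hosted dead-man's-switch service would normally provide this logic itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatWatcher:
    """Hypothetical external dead-man's switch: stale when beats stop arriving."""

    interval_s: float                  # how often the bot is expected to beat
    missed_beats: int = 3              # consecutive misses tolerated before paging (assumed)
    last_beat: Optional[float] = None  # timestamp of the most recent beat, if any

    def record(self, now: float) -> None:
        """Call whenever a heartbeat from the bot arrives."""
        self.last_beat = now

    def is_stale(self, now: float) -> bool:
        """True when no beat has arrived within interval_s * missed_beats."""
        if self.last_beat is None:
            return True  # never heard from the bot: treat as down
        return (now - self.last_beat) > self.interval_s * self.missed_beats


if __name__ == "__main__":
    watcher = HeartbeatWatcher(interval_s=30)
    watcher.record(now=1000.0)
    print(watcher.is_stale(now=1060.0))  # False: 60 s elapsed, window is 90 s
    print(watcher.is_stale(now=1100.0))  # True: 100 s elapsed, page on-call
```

The watcher takes `now` as a parameter rather than reading the clock, which keeps the check easy to unit-test. When `is_stale` returns true, the page should go out through email, PagerDuty, or SMS rather than through Slack.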
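
Example: token-expiry pre-warning. The token expiry guard in the prompt asks for an alert before a token expires. A minimal sketch of that decision is below, assuming the bot already knows the token's expiry timestamp (for example from its OAuth refresh bookkeeping). The function name `token_status` and the one-hour warning window are hypothetical choices for illustration, and a real deployment would likely also call `auth.test` periodically to catch revoked tokens.

```python
from enum import Enum


class TokenStatus(Enum):
    OK = "ok"
    WARN = "warn"        # still valid, but inside the pre-warning window
    EXPIRED = "expired"


def token_status(expires_at: float, now: float,
                 warn_window_s: float = 3600.0) -> TokenStatus:
    """Classify a token by remaining lifetime (all values in epoch seconds).

    warn_window_s is an assumed default; tune it to how long a refresh
    or manual re-install takes in your environment.
    """
    remaining = expires_at - now
    if remaining <= 0:
        return TokenStatus.EXPIRED
    if remaining <= warn_window_s:
        return TokenStatus.WARN
    return TokenStatus.OK


if __name__ == "__main__":
    print(token_status(expires_at=10_000.0, now=1_000.0))   # TokenStatus.OK
    print(token_status(expires_at=10_000.0, now=7_000.0))   # TokenStatus.WARN
    print(token_status(expires_at=10_000.0, now=10_500.0))  # TokenStatus.EXPIRED
```

A `WARN` result should be routed through the external alert path, since a bot that is about to lose its token cannot be trusted to deliver the warning in Slack.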
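
Example: degraded-mode state machine. As one possible shape for the degraded-mode state machine the prompt requests, the sketch below uses three states and hysteresis on queue depth so the bot does not flap between healthy and degraded. The state names, the `next_state` function, and the 500/100 thresholds are all assumptions made for illustration; real thresholds should come from the bot's observed backlog behavior.

```python
from enum import Enum


class BotState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"


def next_state(current: BotState, connected: bool, token_valid: bool,
               queue_depth: int, enter_depth: int = 500,
               exit_depth: int = 100) -> BotState:
    """Compute the next health state from the current state and fresh signals.

    enter_depth and exit_depth are assumed thresholds: the bot becomes
    DEGRADED at enter_depth and only recovers once the backlog drains
    to exit_depth or below.
    """
    if not connected or not token_valid:
        return BotState.DOWN
    if queue_depth >= enter_depth:
        return BotState.DEGRADED
    if current is BotState.DEGRADED and queue_depth > exit_depth:
        return BotState.DEGRADED  # hysteresis: backlog has not drained yet
    return BotState.HEALTHY


if __name__ == "__main__":
    state = BotState.HEALTHY
    for depth in (50, 600, 300, 80):
        state = next_state(state, connected=True, token_valid=True,
                           queue_depth=depth)
        print(depth, state.value)
    # prints: 50 healthy, 600 degraded, 300 degraded, 80 healthy (one per line)
```

The resulting state can be exposed on the status endpoint or as a metric, so the external watcher sees degradation even while the bot is still able to post.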