Slack Bot Health Check & Heartbeat Self-Monitoring Prompt

Design self-monitoring and heartbeat checks for a Slack bot so that the bot's own outages (token expiry, socket drops, queue backlog) are detected and surfaced externally.

Target user: Engineers operating a production Slack bot who need to monitor the monitor
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior reliability engineer who has run business-critical Slack bots and learned the hard way that a Slack bot which monitors everything else must also monitor itself.

I will provide:
- Our bot architecture (Bolt, Socket Mode vs HTTP events, queues, workers)
- What the bot is responsible for (alert routing, ChatOps, approvals)
- Current observability stack (Prometheus, a status endpoint, external uptime checks)
- Slack constraints (token types, rate-limit exposure)
- Pain points (silent bot outages, expired tokens, unnoticed event lag)

Your job:

1. **The core problem** — a Slack bot cannot reliably report its own death *into Slack*, because if it is down it cannot post. Design health signaling that lives OUTSIDE Slack (an external uptime monitor, a separate heartbeat channel, or a secondary alerting path).

2. **Heartbeat design** — the bot emits a heartbeat (to Prometheus pushgateway, a healthcheck URL, or a dead-man's-switch service like a cron-monitor) on a fixed interval. An external watcher alerts via a different channel (email, PagerDuty, SMS) when the heartbeat stops.

3. **Internal health signals to track** — Socket Mode connection state and reconnect count, event-handler latency and error rate, queue depth / backlog, rate-limit (429) frequency, token validity (`auth.test`), and worker liveness.

4. **Token expiry guard** — proactively check token validity and OAuth refresh status; alert BEFORE a token expires, not after the bot silently stops posting.

5. **Degraded-mode behavior** — define what the bot does when partially impaired (e.g. event backlog) and how it self-reports degradation to a status surface.

6. **External alert path** — wire the dead-man's-switch so loss of heartbeat pages on-call through a path that does not depend on the bot or Slack itself.

7. **Dashboards** — a concise health dashboard (connection, latency, queue, error rate, last heartbeat) for quick triage.

8. **Validation** — run a chaos test: kill the bot and confirm the external watcher pages on-call within the expected window.

Output as: (a) the heartbeat emitter + external watcher design, (b) the list of health metrics with alert thresholds, (c) the token-expiry pre-warning logic, (d) a degraded-mode state machine, (e) a chaos-test plan proving outages are detected.

Bias toward: detecting the bot's own death through a path independent of Slack, alerting on token expiry before it bites, no silent failures.
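As a rough illustration of the kind of design items 2, 4, and 6 ask for, the sketch below models an external dead-man's-switch watcher and a token pre-expiry check in plain Python. It is a minimal sketch under stated assumptions, not a production implementation: `DeadMansSwitch`, `token_needs_warning`, the `page` hook, and the interval and grace values are all hypothetical names and numbers chosen for illustration, and no Slack or pager API is actually called.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeadMansSwitch:
    """External watcher that pages when the bot's heartbeat goes quiet."""

    interval_s: float                 # expected heartbeat period (assumed value)
    grace_s: float                    # slack allowed before paging (assumed value)
    page: Callable[[str], None]       # hypothetical pager hook; must not route through Slack
    clock: Callable[[], float] = time.monotonic
    last_beat: float = field(init=False, default=0.0)
    paged: bool = field(init=False, default=False)

    def __post_init__(self) -> None:
        # Start the timer at construction so a bot that never boots is also caught.
        self.last_beat = self.clock()

    def beat(self) -> None:
        """Record a heartbeat from the bot and clear any active outage."""
        self.last_beat = self.clock()
        self.paged = False

    def check(self) -> bool:
        """Return True if the heartbeat is overdue; page once per outage."""
        silence = self.clock() - self.last_beat
        overdue = silence > self.interval_s + self.grace_s
        if overdue and not self.paged:
            self.page(f"Slack bot heartbeat missing for {silence:.0f}s")
            self.paged = True
        return overdue


def token_needs_warning(expires_at: float, now: float, warn_window_s: float) -> bool:
    """Return True once the token is inside the pre-expiry warning window."""
    return expires_at - now <= warn_window_s
```

In a real deployment the watcher would run on a host separate from the bot, the bot would trigger `beat()` through something like an HTTP ping, and `check()` would run on a timer. The `paged` flag keeps a single outage from paging on-call repeatedly, and `token_needs_warning` would be fed an expiry time obtained from whatever token metadata the bot has available.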
