Slack Bot Graceful Shutdown & Message Drain Prompt

Design clean shutdown for a Slack bot so in-flight events, scheduled jobs, and Socket Mode connections drain without dropping acks, double-processing, or leaving users hanging during deploys.

Target user: Engineers running Slack bots in production
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a reliability engineer who has debugged Slack bots that dropped events, double-posted, or hung mid-deploy. You design shutdown as carefully as startup.

I will provide:
- The bot's runtime (Bolt JS/Python, custom) and connection mode (Socket Mode or HTTP/Events API)
- How it's deployed (Kubernetes, systemd, serverless)
- Background work it runs (queues, scheduled sweeps, long LLM calls)
- Current symptoms during restarts (lost events, duplicate messages, 5xx responses)

Your job:

1. **The 3-second ack contract** — explain that Slack expects a 200/ack within 3 seconds and retries on failure (`X-Slack-Retry-Num`). Show how to ack fast and process async so shutdown never strands an unacked request.

2. **Signal handling** — wire SIGTERM/SIGINT to start a drain: stop accepting new work, finish in-flight handlers, flush queues, then exit. Respect the platform's grace period (e.g. Kubernetes `terminationGracePeriodSeconds`) and set it generously enough for the longest handler.

3. **Socket Mode specifics** — close the WebSocket cleanly so Slack stops sending to this instance, and ensure another replica is connected first (overlap) to avoid an event gap during rolling deploys.

4. **HTTP/Events specifics** — stop the listener from accepting new connections, drain in-flight requests, and rely on Slack's retry for anything that arrives during the gap (made safe by idempotency below).

5. **Idempotency** — dedupe on `event_id` (or `client_msg_id` for user-posted messages) so Slack retries during restart don't double-post or double-execute side effects. Show the dedupe store and TTL.

6. **In-flight long work** — for long LLM/API calls, decide: finish, checkpoint, or cancel-and-resume. Define timeouts so drain can't hang forever; force-exit after a hard deadline.

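7. **Readiness/liveness** — fail readiness immediately on drain so the load balancer stops routing new requests to this instance, while liveness stays up until drain completes so the platform doesn't kill the process mid-drain. Note that readiness only affects HTTP traffic; a Socket Mode instance keeps receiving events until it closes its WebSocket.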

Output: (a) a shutdown sequence diagram, (b) signal-handler pseudocode for our runtime, (c) the idempotency/dedupe design, (d) recommended grace-period and timeout values, (e) a deploy checklist that overlaps replicas.

Goal: zero dropped or duplicated events across every restart.
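
For orientation, here is a minimal Python sketch of the kind of answer the prompt asks for in (b) and (c): a TTL dedupe store keyed on `event_id` and a drain controller with a hard deadline. The names `DedupeStore`, `Drainer`, and `main`, the 600-second TTL, and the 25-second deadline are illustrative assumptions, not values from Slack or Bolt. The in-memory store only works for a single process; multiple replicas would need a shared store.

```python
import asyncio
import signal
import time


class DedupeStore:
    """Hypothetical in-memory seen-set with a TTL (single process only)."""

    def __init__(self, ttl_seconds: float = 600.0, clock=time.monotonic):
        self._ttl = ttl_seconds  # assumed value; tune to your retry window
        self._clock = clock
        self._seen: dict[str, float] = {}  # event_id -> expiry time

    def first_time(self, event_id: str) -> bool:
        """Return True the first time an id is seen within the TTL."""
        now = self._clock()
        # Drop expired entries so the store does not grow without bound.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        if event_id in self._seen:
            return False
        self._seen[event_id] = now + self._ttl
        return True


class Drainer:
    """Hypothetical tracker for background work started after the ack."""

    def __init__(self, hard_deadline_seconds: float = 25.0):
        self.draining = False
        self._deadline = hard_deadline_seconds  # keep below the platform grace period
        self._tasks: set[asyncio.Task] = set()

    def submit(self, coro) -> bool:
        """Schedule work on the running loop; refuse it once draining."""
        if self.draining:
            coro.close()  # avoid a "never awaited" warning
            return False
        task = asyncio.get_running_loop().create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return True

    async def drain(self) -> int:
        """Stop intake, wait up to the deadline, cancel the rest.

        Returns the number of tasks that had to be cancelled.
        """
        self.draining = True
        if not self._tasks:
            return 0
        _, pending = await asyncio.wait(set(self._tasks), timeout=self._deadline)
        for task in pending:
            task.cancel()
        if pending:
            await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)


async def main() -> None:
    """Wire SIGTERM/SIGINT to a drain (add_signal_handler is Unix only)."""
    drainer = Drainer()
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)
    # ... start the bot here; handlers ack first, then call drainer.submit(...)
    await stop.wait()
    cancelled = await drainer.drain()
    print(f"drain finished, cancelled {cancelled} task(s)")
```

Running `asyncio.run(main())` would start the loop and wait for a signal. In this sketch a handler would check `first_time(event_id)` before doing side effects and pass long work to `submit()` only after acking, so anything cancelled at the deadline can be recovered by a retry or a resume step.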
