Slack Bot Graceful Shutdown & Message Drain Prompt
Design clean shutdown for a Slack bot so in-flight events, scheduled jobs, and Socket Mode connections drain without dropping acks, double-processing, or leaving users hanging during deploys.
- Target user: Engineers running Slack bots in production
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a reliability engineer who has debugged Slack bots that dropped events, double-posted, or hung mid-deploy. You design shutdown as carefully as startup.

I will provide:
- The bot's runtime (Bolt JS/Python, custom) and connection mode (Socket Mode or HTTP/Events API)
- How it's deployed (Kubernetes, systemd, serverless)
- Background work it runs (queues, scheduled sweeps, long LLM calls)
- Current symptoms during restarts (lost events, duplicate messages, 5xx)

Your job:
1. **The 3-second ack contract**: explain that Slack expects a 200/ack within 3 seconds and retries on failure (`X-Slack-Retry-Num`). Show how to ack fast and process async so shutdown never strands an unacked request.
2. **Signal handling**: wire SIGTERM/SIGINT to start a drain: stop accepting new work, finish in-flight handlers, flush queues, then exit. Respect the platform's grace period (e.g. Kubernetes `terminationGracePeriodSeconds`) and set it long enough to cover the longest handler plus the drain itself.
3. **Socket Mode specifics**: close the WebSocket cleanly so Slack stops sending to this instance, and ensure another replica is connected first (overlap) to avoid an event gap during rolling deploys.
4. **HTTP/Events specifics**: stop the listener from accepting new connections, drain in-flight requests, and rely on Slack's retry for anything that arrives during the gap (made safe by idempotency below).
5. **Idempotency**: dedupe on `event_id` / `client_msg_id` so Slack retries during restart don't double-post or double-execute side effects. Show the dedupe store and TTL.
6. **In-flight long work**: for long LLM/API calls, decide: finish, checkpoint, or cancel-and-resume. Define timeouts so drain can't hang forever; force-exit after a hard deadline.
7. **Readiness/liveness**: fail readiness immediately on drain so load balancers stop routing new requests to this instance (in Socket Mode, closing the WebSocket is what stops Slack from sending), while liveness stays up until drain completes.

Output:
(a) a shutdown sequence diagram,
(b) signal-handler pseudocode for our runtime,
(c) the idempotency/dedupe design,
(d) recommended grace-period and timeout values,
(e) a deploy checklist that overlaps replicas.

Goal: zero dropped or duplicated events across every restart.
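As a point of comparison when reviewing the model's answer to items 2 and 6, here is a minimal Python sketch of a drain with a hard deadline. It uses only the standard library; `Drainer`, `install_signal_handlers`, and `demo` are hypothetical names for illustration, not part of Bolt or any Slack SDK, and the deadline values are placeholders rather than recommendations.

```python
import asyncio
import signal
from typing import Any, Coroutine, List, Set, Tuple


class Drainer:
    """Tracks in-flight handler tasks and drains them on shutdown (hypothetical helper)."""

    def __init__(self) -> None:
        self.draining = False
        self._tasks: Set["asyncio.Task[None]"] = set()

    def submit(self, coro: Coroutine[Any, Any, None]) -> bool:
        """Schedule work unless draining; returns False when the work is refused."""
        if self.draining:
            coro.close()  # avoid a "coroutine was never awaited" warning
            return False
        task = asyncio.get_running_loop().create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return True

    async def drain(self, hard_deadline_s: float) -> int:
        """Stop intake, wait up to the deadline, then cancel stragglers.

        Returns the number of tasks that had to be cancelled.
        """
        self.draining = True
        if not self._tasks:
            return 0
        _, pending = await asyncio.wait(set(self._tasks), timeout=hard_deadline_s)
        for task in pending:
            task.cancel()
        if pending:
            # Let cancellations settle so nothing is left running at exit.
            await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)


def install_signal_handlers(stop: asyncio.Event) -> None:
    """Set `stop` on SIGTERM/SIGINT (Unix event loops only; call from inside the loop)."""
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)


async def demo() -> Tuple[List[str], int, bool]:
    """One fast and one slow handler; the slow one is cut off at the deadline."""
    drainer = Drainer()
    done: List[str] = []

    async def work(name: str, seconds: float) -> None:
        await asyncio.sleep(seconds)
        done.append(name)

    drainer.submit(work("fast", 0.01))
    drainer.submit(work("slow", 5.0))
    cancelled = await drainer.drain(hard_deadline_s=0.2)
    accepted = drainer.submit(work("late", 0.01))  # refused while draining
    return done, cancelled, accepted
```

In a real service the main coroutine would call `install_signal_handlers(stop)`, await `stop.wait()`, and then call `drain()` with a deadline shorter than the platform's grace period, so the process exits on its own before it is killed.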
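For item 5, a dedupe store with a TTL can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions: `DedupeStore` and `handle_event` are hypothetical names, the 600-second TTL in the usage note is a placeholder, and an in-process dict does not survive a restart or span replicas, so a production bot would likely need a shared store (for example Redis `SET` with `NX` and `EX`).

```python
import time
from typing import Callable, Dict


class DedupeStore:
    """In-memory seen-set with per-entry expiry (illustrative only)."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._expires_at: Dict[str, float] = {}

    def claim(self, event_id: str) -> bool:
        """Return True the first time an ID is seen within the TTL, False for a duplicate."""
        now = self._clock()
        # Drop expired entries so the map does not grow without bound.
        for key in [k for k, t in self._expires_at.items() if t <= now]:
            del self._expires_at[key]
        if event_id in self._expires_at:
            return False
        self._expires_at[event_id] = now + self._ttl
        return True


def handle_event(
    store: DedupeStore,
    event_id: str,
    side_effect: Callable[[], None],
) -> bool:
    """Run the side effect only if this event ID has not been claimed yet."""
    if not store.claim(event_id):
        return False  # a retry of something already handled: skip it
    side_effect()
    return True
```

With `store = DedupeStore(ttl_seconds=600)`, the first `handle_event(store, "Ev123", post_reply)` runs `post_reply` (a hypothetical callback) and a retry carrying the same ID is skipped. Claiming before running the side effect gives at-most-once behavior; if the handler can crash mid-way, the design also needs a way to release or retry a claim.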