ChatOps Incident Automation Bot Workflow Prompt
Design an incident-management ChatOps bot that spins up the channel, pages the right people, tracks state, posts the timeline, and drives the incident lifecycle from declare to resolve — so responders coordinate in chat instead of fighting tooling.
- Target user: Platform engineers building incident automation / ChatOps
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a platform engineer who has built incident-management bots that turn a chaotic chat thread into a structured, auditable response. Help me design the bot's workflow and commands.

I will provide:
- Chat platform (Slack/Teams/Discord) and paging tool (PagerDuty/Opsgenie)
- Existing incident process and severity levels
- Where incidents are tracked (ticketing, status page, doc)
- Team size and on-call structure

Your job:
1. **Define the command surface**: design `/incident declare`, `set-sev`, `assign-ic`, `add-note`, `page`, `status`, `resolve`. For each: arguments, who can run it, and what it does. Keep it small; every command is a thing to teach at 3am.
2. **Automate the channel setup**: on declare, the bot should create/dedicate a channel, post a pinned summary (sev, IC, status, links), invite responders, and open the tracking ticket, all in one step.
3. **Model incident state**: a clear state machine (investigating → identified → monitoring → resolved), with the bot enforcing legal transitions and timestamping each one so the timeline builds itself.
4. **Auto-build the timeline**: every command and key message becomes a timestamped timeline entry the bot can export to the postmortem. This is the bot's highest-value feature; design for it.
5. **Bridge to paging and status page**: `page` triggers the on-call tool; a sev/status change can update the public status page (with a human confirmation gate before anything goes public).
6. **Reduce noise, don't add it**: the bot should summarize and pin, not spam. Define what it posts vs what it silently records.
7. **Handle the unhappy paths**: duplicate declares, the IC going offline, a stuck incident, and bot downtime with a fallback to a manual runbook.

Output:
(a) a command reference table
(b) the state machine diagram (states + transitions)
(c) the declare→channel automation sequence
(d) the timeline export format for postmortems
(e) the human-confirmation gates for anything customer-facing

Bias toward: a tiny memorable command set, the bot building the timeline automatically, and a human gate before anything goes public.
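
Illustration (not part of the prompt text to copy): the following is a minimal sketch of what steps 3 and 4 are asking for, a state machine that rejects illegal transitions and records each accepted one as a timestamped timeline entry. The four state names come from the prompt. Everything else is an assumption made for illustration: the `Incident` class, its method names, the backward transitions allowed in `TRANSITIONS`, and the pipe-separated export format. A real bot would persist this and wire it to chat commands.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# States are taken from the prompt. The allowed moves are an assumption:
# forward progress, plus stepping back one state when a fix does not hold.
TRANSITIONS = {
    "investigating": {"identified"},
    "identified": {"investigating", "monitoring"},
    "monitoring": {"identified", "resolved"},
    "resolved": set(),
}


@dataclass
class Incident:
    """Hypothetical in-memory incident record with a self-building timeline."""
    title: str
    state: str = "investigating"
    timeline: list = field(default_factory=list)

    def record(self, actor, text, at=None):
        """Append a timestamped entry; `at` defaults to the current UTC time."""
        at = at or datetime.now(timezone.utc)
        self.timeline.append((at, actor, text))

    def transition(self, new_state, actor, at=None):
        """Move to `new_state` if legal, and log the change to the timeline."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        old_state = self.state
        self.state = new_state
        self.record(actor, f"state {old_state} -> {new_state}", at)

    def export_timeline(self):
        """Render the timeline as one 'timestamp | actor | text' line per entry."""
        return "\n".join(
            f"{at.isoformat()} | {actor} | {text}"
            for at, actor, text in self.timeline
        )


if __name__ == "__main__":
    incident = Incident("checkout errors")
    incident.record("alice", "declared incident")
    incident.transition("identified", "alice")
    print(incident.export_timeline())
```

Because `transition` writes its own timeline entry, responders never log state changes by hand, which is the "timeline builds itself" property the prompt biases toward.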