Slack Canvas as Living Service Runbook Prompt
Design channel-tabbed Slack Canvases that serve as living operational runbooks per service — owners, dashboards, escalation, common failures, and copy-paste recovery steps that stay current instead of rotting in a wiki.
- Target user: SRE and platform teams who want runbooks where on-call actually looks
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are an SRE who has accepted that nobody opens the wiki at 3am; they look in the service's Slack channel. You will design a Slack Canvas runbook that lives where on-call already is and stays current.

I will provide:
- The service(s) and their channels
- Where runbook content lives today (Confluence, Notion, repo markdown)
- Our dashboards, alerting, and escalation tooling
- How often runbooks go stale and why

Your job:
1. **Channel canvas vs. standalone canvas**: recommend pinning a single channel canvas per service channel as the on-call entry point. Justify why the channel canvas beats a buried wiki link.
2. **Runbook canvas template**: design the exact sections, top to bottom for 3am usefulness: TL;DR (what this service does), Owners + escalation (with @usergroups), Dashboards (deep links), Health checks (copy-paste commands), Top 5 known failures (symptom → cause → fix), Dependencies (upstream/downstream), Deploy/rollback steps, and an Incident quick-start. Each section must be scannable in seconds.
3. **Copy-paste recovery blocks**: recovery steps must be literal, runnable commands in code blocks with placeholders clearly marked, not prose. Show the formatting.
4. **Freshness mechanism**: the hardest part. Define a "last reviewed" stamp, a scheduled bot reminder to the owning usergroup every N days, and a CI/automation hook that flags the canvas when a linked dashboard or deploy pipeline changes. Describe how you detect drift.
5. **Sync vs. source-of-truth**: decide whether the canvas is authored directly or generated/mirrored from repo markdown via the canvas API. Give the tradeoffs and a recommendation.
6. **Linking into incidents**: when an incident channel spins up, auto-link the relevant service runbook canvas into the incident channel canvas so responders start with context.
7. **Adoption metrics**: canvas view counts, time-since-last-review per service, % of incidents where the runbook was referenced, and stale-runbook backlog.

Output as: (a) the canvas section template with example content for one real service, (b) the canvas API calls to create/update it programmatically, (c) the freshness/reminder automation, (d) an incident-linking flow, (e) anti-patterns (prose walls, dead links, no owner). Bias toward terse, copy-pasteable, owner-accountable content.
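
For orientation, the freshness mechanism in item 4 could look roughly like the sketch below. It is not part of the prompt and it makes several assumptions that are not in the original: the stamp is a literal `Last reviewed: YYYY-MM-DD` line in the canvas markdown, the review window is 90 days, drift is approximated by fingerprinting the set of URLs linked from the canvas, and the function names and the usergroup ID `S0123ABC` are hypothetical. It deliberately stops short of calling Slack; fetching the canvas content and posting the reminder are left to whatever bot or CI job a team already runs.

```python
import hashlib
import re
from datetime import date, timedelta
from typing import Optional

# Assumed stamp format; the prompt only asks for "a last reviewed stamp".
STAMP_RE = re.compile(r"Last reviewed:\s*(\d{4}-\d{2}-\d{2})")
# Rough URL matcher; stops at whitespace and common closing punctuation.
URL_RE = re.compile(r"https?://[^\s)>\]]+")


def last_reviewed(canvas_markdown: str) -> Optional[date]:
    """Return the date in the first 'Last reviewed:' stamp, or None if absent."""
    match = STAMP_RE.search(canvas_markdown)
    return date.fromisoformat(match.group(1)) if match else None


def is_stale(canvas_markdown: str, today: date, max_age_days: int = 90) -> bool:
    """A canvas with no stamp, or a stamp older than max_age_days, is stale."""
    reviewed = last_reviewed(canvas_markdown)
    if reviewed is None:
        return True
    return (today - reviewed) > timedelta(days=max_age_days)


def link_fingerprint(text: str) -> str:
    """Hash the sorted set of URLs so two texts can be compared for link drift."""
    urls = sorted(set(URL_RE.findall(text)))
    return hashlib.sha256("\n".join(urls).encode("utf-8")).hexdigest()


def reminder_text(service: str, usergroup_id: str,
                  canvas_markdown: str, today: date) -> str:
    """Build a reminder that mentions the owning usergroup (Slack subteam syntax)."""
    reviewed = last_reviewed(canvas_markdown)
    when = reviewed.isoformat() if reviewed else "never"
    age = f" ({(today - reviewed).days} days ago)" if reviewed else ""
    return (
        f"<!subteam^{usergroup_id}> The {service} runbook canvas was last "
        f"reviewed {when}{age}. Please review it and update the stamp."
    )


if __name__ == "__main__":
    # Hypothetical canvas content for an example service.
    canvas = (
        "# payments-api runbook\n"
        "Last reviewed: 2024-01-10\n"
        "Dashboard: https://grafana.example.com/d/abc\n"
    )
    today = date(2024, 6, 1)
    if is_stale(canvas, today):
        print(reminder_text("payments-api", "S0123ABC", canvas, today))
```

A scheduled job could run `is_stale` per service and post `reminder_text` to the service channel, while a CI step could compare `link_fingerprint` of the repo's dashboard and pipeline config against the canvas and flag a mismatch as drift.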