Slack Observability Dashboard Surface Prompt
Surface real-time observability data directly in Slack — live SLO status, recent deploys, current incidents, top-error queries — via slash commands and scheduled posts that bring the dashboard to where engineers already are.
- Target user: SRE / platform leads reducing the gap between alerts and dashboards
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who has reduced incident MTTR by bringing observability data directly into Slack, instead of asking engineers to context-switch to Grafana or Datadog every time.
I will provide:
- Observability tools (Grafana / Datadog / New Relic / Prometheus / OpenTelemetry)
- Most-asked dashboard questions during incidents
- Existing Slack alert pipelines
- Pain points (e.g., engineers hunting for the right dashboard URL while the incident is hot)
Your job:
1. **What's worth surfacing in Slack**: be opinionated.
   - **Always**: current SLO state per critical service, current incident count, deploy clock
   - **On request via slash command**: error rate for service X, top errors in the last 1h, deploy diff
   - **Scheduled**: daily SLO digest, weekly DORA metrics, monthly capacity trend
   - **Avoid**: full dashboards as screenshots (use links), every metric (signal-to-noise)
2. **Slash command catalog**:
   - `/status <service>`: current SLO state, recent deploys, current incidents
   - `/errors <service> [window]`: top 10 errors with counts + sample messages
   - `/deploy <service>`: last 5 deploys with success/failure + duration
   - `/slo <service>`: burn rate per window, budget remaining, recent reset history
   - `/dash <service>`: direct link to the right dashboard with the time window pre-filtered
   - `/oncall <team>`: who's on call now + escalation chain
3. **Output format**: each command returns a Block Kit message with:
   - Header: service + the question answered + as-of timestamp
   - Section block `fields`: 3-5 key numbers
   - Sparkline or mini-trend (Unicode block characters such as ▁▃▅▇ render in Block Kit text)
   - Link to the full dashboard with the time window applied
   - Refresh button (re-runs the command and edits the message in place)
4. **Scheduled posts**:
   - **Morning SLO digest** (each weekday at 9am team-local time, posted to the team channel): services with < 50% error budget remaining, fast-burn alerts in the last 24h
   - **Weekly DORA** (Mondays): deploy frequency, lead time, change failure rate, MTTR
   - **Incident heatmap** (Fridays): incidents by service, this week vs. last week
5. **Time-window magic**: when surfacing dashboard data:
   - Default to "last 1 hour" for live commands
   - For incident-context commands (called from within an incident channel), default to "since incident started"
   - For deploy-related commands, default to "since last deploy"
6. **Caching strategy**:
   - Aggregate metrics: 1-min cache (don't hammer the metrics backend)
   - Per-service status: 30-second cache
   - Top errors: 5-min cache (errors don't change that fast)
   - Pure URL generators (`/dash`): no cache
7. **Permission model**:
   - Read commands (`/status`, `/errors`): any engineer
   - Write/mutating commands (`/silence`, `/ack`): on-call only
8. **Error handling**:
   - Metrics backend down → return a graceful "metrics unavailable, here's the dashboard link" message
   - Slow response → acknowledge within Slack's 3-second slash-command window, return preliminary data with a "refreshing..." indicator, then update the message
   - User asks about a non-existent service → fuzzy-match and suggest similar names
9. **Anti-patterns to avoid**:
   - Dumping all metrics into a single command
   - No link back to the full dashboard (people will want depth)
   - Slow responses that exceed Slack's 3-second slash-command timeout
   - Caching that surfaces stale data during a fast-moving incident
   - Replacing dashboards instead of complementing them
10. **Measure success**: share of incident-channel messages asking "what's the URL for the X dashboard?" before vs. after; % of engineers using slash commands during incidents; MTTR delta.
Output as: (a) slash command catalog with response formats, (b) Block Kit JSON for one command output, (c) scheduled post schedule + content, (d) time-window heuristics, (e) caching strategy, (f) permission matrix, (g) error handling, (h) success metrics.
Bias toward: bring the answer to where the question is asked; complement dashboards, don't replace them; fast > comprehensive.
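Slack delivers everything typed after a slash command as a single `text` string. A minimal sketch of parsing `/errors <service> [window]` from the prompt's catalog, assuming windows are written as a positive number plus `m`, `h` or `d` and that an omitted window falls back to the one-hour live default; `ErrorsQuery` and `parse_errors_args` are hypothetical names:

```python
import re
from dataclasses import dataclass
from datetime import timedelta

# Assumed default for live commands ("last 1 hour").
DEFAULT_WINDOW = timedelta(hours=1)

# Assumed window syntax: a positive integer followed by m (minutes), h (hours) or d (days).
_WINDOW_RE = re.compile(r"^([1-9]\d*)([mhd])$")
_UNIT_KWARG = {"m": "minutes", "h": "hours", "d": "days"}


@dataclass
class ErrorsQuery:
    service: str
    window: timedelta


def parse_errors_args(text: str) -> ErrorsQuery:
    """Parse the `text` field Slack sends for `/errors <service> [window]`."""
    parts = text.split()
    if not 1 <= len(parts) <= 2:
        raise ValueError("usage: /errors <service> [window], e.g. /errors checkout 30m")
    service = parts[0].lower()
    if len(parts) == 1:
        return ErrorsQuery(service, DEFAULT_WINDOW)
    match = _WINDOW_RE.match(parts[1].lower())
    if match is None:
        raise ValueError(f"unrecognized window {parts[1]!r}; try 30m, 1h or 2d")
    amount, unit = int(match.group(1)), match.group(2)
    return ErrorsQuery(service, timedelta(**{_UNIT_KWARG[unit]: amount}))
```

For example, `parse_errors_args("checkout 30m")` yields service `checkout` with a 30-minute window. Rejecting malformed input with a usage hint, rather than silently falling back to the default, keeps an engineer from reading numbers for a window they did not ask for.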
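The `/slo` command's burn rate is conventionally the observed error ratio divided by the error-budget ratio (1 minus the SLO target), so 1.0 means the budget is being spent exactly on schedule. A sketch assuming request-based SLIs with total/bad counts per window and a `period` that holds counts for the SLO period so far; the 14.4 threshold over a 1h window, confirmed by a 5m window, is the multiwindow example from the Google SRE Workbook's chapter on alerting on SLOs and is used here only as an assumed default:

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    total: int  # requests observed in the window
    bad: int    # requests that violated the SLI


# Assumed threshold: spending 2% of a 30-day budget in one hour is a 14.4x burn.
FAST_BURN_THRESHOLD = 14.4


def burn_rate(window: WindowCounts, slo_target: float) -> float:
    """Error ratio divided by budget ratio; 1.0 spends the budget exactly on schedule."""
    if window.total == 0:
        return 0.0
    return (window.bad / window.total) / (1.0 - slo_target)


def budget_remaining(period: WindowCounts, slo_target: float) -> float:
    """Unspent fraction of the period's error budget; negative once it is blown."""
    allowed_bad = (1.0 - slo_target) * period.total
    if allowed_bad == 0:
        return 1.0
    return 1.0 - period.bad / allowed_bad


def is_fast_burn(last_1h: WindowCounts, last_5m: WindowCounts, slo_target: float) -> bool:
    """Both windows must exceed the threshold; the short one lets the signal clear soon after a burn stops."""
    return (burn_rate(last_1h, slo_target) >= FAST_BURN_THRESHOLD
            and burn_rate(last_5m, slo_target) >= FAST_BURN_THRESHOLD)
```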
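For the sparkline in item 3 of the prompt, a common approach is to min-max scale the samples onto the eight Unicode block elements U+2581 to U+2588. A minimal sketch; `sparkline` is a hypothetical helper and the input is assumed to be evenly spaced samples (for example, per-minute error counts):

```python
from typing import Sequence

# The eight Unicode "lower block" characters, U+2581 through U+2588.
BARS = "\u2581\u2582\u2583\u2584\u2585\u2586\u2587\u2588"


def sparkline(values: Sequence[float]) -> str:
    """Render evenly spaced samples as a one-line trend using block characters."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series gets a flat line rather than a divide-by-zero.
        return BARS[0] * len(values)
    scale = (len(BARS) - 1) / (hi - lo)
    return "".join(BARS[round((v - lo) * scale)] for v in values)
```

Because each sparkline is scaled to its own min and max, small wiggles can look dramatic, so it belongs next to the absolute numbers in the message rather than in place of them.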
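For output item (b), one possible `/status` response is assembled below as a Python dict and serialized to Block Kit JSON, using `header`, `section` (with `fields` for the key numbers) and `actions` blocks. The field labels, the `refresh_status` action ID and the Grafana URL are hypothetical, and the values in the `__main__` block are placeholders, not real data. A handler for the Refresh button would re-run the query and replace the original message, for example through the interaction's `response_url` with `replace_original` set:

```python
import json
from datetime import datetime, timezone
from typing import Dict


def status_message(service: str, facts: Dict[str, str], trend: str,
                   dashboard_url: str, as_of: datetime) -> dict:
    """Build a slash-command response for /status <service> as Block Kit."""
    if not 1 <= len(facts) <= 10:
        raise ValueError("a section block accepts between 1 and 10 fields")
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return {
        "response_type": "in_channel",  # visible to the channel, not just the caller
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text", "text": f"{service}: current status as of {stamp}"}},
            {"type": "section",
             "fields": [{"type": "mrkdwn", "text": f"*{label}*\n{value}"}
                        for label, value in facts.items()]},
            {"type": "section",
             "text": {"type": "mrkdwn", "text": f"Error rate, last 1h: {trend}"}},
            {"type": "actions",
             "elements": [
                 {"type": "button", "action_id": "open_dashboard",
                  "text": {"type": "plain_text", "text": "Open dashboard"},
                  "url": dashboard_url},
                 {"type": "button", "action_id": "refresh_status",
                  "text": {"type": "plain_text", "text": "Refresh"},
                  "value": service},
             ]},
        ],
    }


if __name__ == "__main__":
    msg = status_message(
        service="checkout",
        facts={"Availability SLO": "99.95% (target 99.9%)",
               "Budget left": "62%",
               "Open incidents": "0",
               "Last deploy": "42 min ago"},
        trend="\u2581\u2582\u2581\u2583\u2587\u2585\u2582",
        dashboard_url="https://grafana.example.com/d/checkout?from=now-1h&to=now",
        as_of=datetime.now(timezone.utc),
    )
    print(json.dumps(msg, indent=2, ensure_ascii=False))
```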
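The morning SLO digest in item 4 boils down to a scheduling rule plus a filter. A sketch assuming the scheduler passes a timezone-aware datetime already in the team's local zone and that per-service budget and alert counts are fetched elsewhere; `ServiceSlo`, `next_digest_time` and `digest_lines` are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass
class ServiceSlo:
    name: str
    budget_remaining: float  # fraction of the period's error budget left, 0.0 to 1.0
    fast_burn_alerts_24h: int


def next_digest_time(now_local: datetime, hour: int = 9) -> datetime:
    """Next weekday (Mon-Fri) at `hour`:00 in now_local's timezone."""
    candidate = now_local.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now_local:
        candidate += timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def digest_lines(services: Sequence[ServiceSlo], threshold: float = 0.5) -> List[str]:
    """Services under the budget threshold or with recent fast burns, worst budget first."""
    flagged = sorted(
        (s for s in services if s.budget_remaining < threshold or s.fast_burn_alerts_24h > 0),
        key=lambda s: s.budget_remaining,
    )
    if not flagged:
        return [f"All services have at least {threshold:.0%} error budget left "
                "and no fast-burn alerts in the last 24h."]
    return [f"\u2022 {s.name}: {s.budget_remaining:.0%} budget left, "
            f"{s.fast_burn_alerts_24h} fast-burn alert(s) in 24h" for s in flagged]
```

With a `zoneinfo.ZoneInfo` timezone attached, adding `timedelta(days=1)` advances wall-clock time, so the digest stays at 09:00 local across DST changes, which is what a "team-local" schedule wants.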
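For the weekly DORA post, the four metrics can be computed from deploy and incident records. This sketch fills in choices the prompt leaves open, all of them assumptions: lead time runs from the oldest commit shipped in each deploy, a deploy counts as failed if an upstream process flagged it (rollback, hotfix or incident), and MTTR is the mean start-to-resolve duration. The record types are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional, Sequence


@dataclass
class Deploy:
    deployed_at: datetime
    oldest_commit_at: datetime  # assumption: lead time starts at the oldest commit shipped
    failed: bool                # assumption: set upstream on rollback, hotfix or incident


@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime


@dataclass
class DoraWeek:
    deploys_per_day: float
    median_lead_time: Optional[timedelta]
    change_failure_rate: Optional[float]
    mttr: Optional[timedelta]


def weekly_dora(deploys: Sequence[Deploy], incidents: Sequence[Incident]) -> DoraWeek:
    """Summarize one week of deploys and incidents; None where there is no data."""
    lead_times = [d.deployed_at - d.oldest_commit_at for d in deploys]
    restores = [i.resolved_at - i.started_at for i in incidents]
    return DoraWeek(
        deploys_per_day=len(deploys) / 7,
        median_lead_time=median(lead_times) if lead_times else None,
        change_failure_rate=(sum(d.failed for d in deploys) / len(deploys)) if deploys else None,
        mttr=(sum(restores, timedelta()) / len(restores)) if restores else None,
    )
```

Returning `None` instead of zero for an empty week avoids posting a misleading "0% change failure rate" when nothing shipped.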
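The time-window heuristics in item 5 amount to a precedence rule. The prompt does not rank them, so this sketch assumes incident context wins over deploy context, which wins over the one-hour default; the resolved window then feeds the `/dash` link. Grafana dashboard URLs accept `from` and `to` as epoch milliseconds, while other backends use different parameters; `CommandContext`, `resolve_window` and `dashboard_link` are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple
from urllib.parse import urlencode

LIVE_DEFAULT = timedelta(hours=1)


@dataclass
class CommandContext:
    now: datetime
    incident_started_at: Optional[datetime] = None  # set when invoked in an incident channel
    last_deploy_at: Optional[datetime] = None
    deploy_related: bool = False


def resolve_window(ctx: CommandContext) -> Tuple[datetime, datetime]:
    """Pick the default (start, end) window for a command invocation."""
    if ctx.incident_started_at is not None:
        start = ctx.incident_started_at
    elif ctx.deploy_related and ctx.last_deploy_at is not None:
        start = ctx.last_deploy_at
    else:
        start = ctx.now - LIVE_DEFAULT
    return start, ctx.now


def dashboard_link(base_url: str, start: datetime, end: datetime) -> str:
    """Append Grafana-style from/to parameters (epoch ms); base_url must have no query string."""
    params = {"from": int(start.timestamp() * 1000), "to": int(end.timestamp() * 1000)}
    return f"{base_url}?{urlencode(params)}"
```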
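The caching tiers in item 6 map onto a per-kind TTL table. A minimal in-process sketch (a bot running several instances would need a shared store instead); the `bypass` flag is an assumption, one way to let the Refresh button skip the cache so a fast-moving incident never shows stale numbers, and `TieredCache` is a hypothetical name:

```python
import time
from typing import Any, Callable, Dict, Tuple

# TTLs from the prompt's caching strategy; 0 means never cache.
TTL_SECONDS: Dict[str, float] = {
    "aggregate": 60,    # aggregate metrics
    "status": 30,       # per-service status
    "top_errors": 300,  # top errors
    "dash_url": 0,      # pure URL generators
}


class TieredCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get_or_fetch(self, kind: str, key: str, fetch: Callable[[], Any],
                     bypass: bool = False) -> Any:
        """Return a cached value younger than the kind's TTL, else call fetch() and store it."""
        ttl = TTL_SECONDS[kind]
        now = self._clock()
        entry = self._entries.get((kind, key))
        if ttl > 0 and not bypass and entry is not None and now - entry[0] < ttl:
            return entry[1]
        value = fetch()
        if ttl > 0:
            self._entries[(kind, key)] = (now, value)
        return value
```

Entries are never evicted, which is acceptable for a bounded service catalog but not for arbitrary user-supplied keys.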
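The permission model in item 7 fits a small deny-by-default lookup. The prompt names only `/status`, `/errors`, `/silence` and `/ack`; treating the rest of the catalog as read-only, assuming anyone able to invoke the command is an engineer, and sourcing the on-call roster from the paging tool are all assumptions:

```python
from typing import AbstractSet

# Read-only commands: any engineer. The prompt names /status and /errors;
# the rest of the catalog is assumed to be read-only too.
READ_COMMANDS = frozenset({"/status", "/errors", "/deploy", "/slo", "/dash", "/oncall"})
# Mutating commands: current on-call only.
ON_CALL_COMMANDS = frozenset({"/silence", "/ack"})


def is_allowed(command: str, user_id: str, on_call_now: AbstractSet[str]) -> bool:
    """Deny by default; `on_call_now` holds Slack user IDs currently on call."""
    if command in READ_COMMANDS:
        return True
    if command in ON_CALL_COMMANDS:
        return user_id in on_call_now
    return False
```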
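Two of the error-handling rules in item 8 can be sketched with the standard library: `difflib.get_close_matches` suggests similar service names, and a fallback still returns the dashboard link when the metrics call fails. Which exceptions mean "backend down" depends on the client library, so catching `ConnectionError` and `TimeoutError` is an assumption, as is the lowercase service catalog; `fetch_status` stands in for a hypothetical backend call:

```python
import difflib
from typing import Callable, Sequence


def unknown_service_reply(requested: str, known_services: Sequence[str]) -> str:
    """Suggest up to three similar service names for a mistyped one."""
    matches = difflib.get_close_matches(requested.lower(), known_services, n=3, cutoff=0.6)
    if not matches:
        return f"No service named `{requested}`, and nothing similar was found."
    suggestions = ", ".join(f"`{m}`" for m in matches)
    return f"No service named `{requested}`. Did you mean {suggestions}?"


def status_or_fallback(service: str, fetch_status: Callable[[str], str],
                       dashboard_url: str) -> str:
    """Return live status text, or a graceful message that still links the dashboard."""
    try:
        return fetch_status(service)
    except (ConnectionError, TimeoutError):
        # Slack mrkdwn link syntax: <url|label>
        return (f":warning: Metrics are unavailable for `{service}` right now. "
                f"<{dashboard_url}|Open the dashboard>")
```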