Slack Kubernetes Pod Crash & CrashLoopBackOff Triage Prompt
Build a Slack bot that turns Kubernetes pod crash and CrashLoopBackOff events into actionable triage messages with logs, recent events, and one-click investigation buttons.
- Target user
- SREs running Kubernetes who want faster crash triage in Slack
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a senior Kubernetes SRE who has triaged thousands of pod crashes and designed Slack alerts that let on-call diagnose CrashLoopBackOff in under a minute. I will provide: - Our cluster setup (managed/self-hosted, namespaces, multi-cluster) - How we source events today (kube-state-metrics, event-exporter, a controller watching the API) - Namespace → owning team mapping and Slack channel layout - Slack constraints (bot scopes, files.upload availability) - Pain points (alert storms during deploys, no context, log hunting) Your job: 1. **Event selection** — define which pod conditions deserve a Slack alert: CrashLoopBackOff, OOMKilled, ImagePullBackPpoff, repeated restarts above a threshold. Explicitly exclude transient restarts during a healthy rollout. 2. **Crash fingerprinting** — group alerts by (namespace + workload + reason) so 30 crashing replicas of one Deployment produce one message, not 30. Include the replica count in the message. 3. **Message design** — Block Kit: header (workload + namespace + reason emoji), section with restart count, last exit code, image, and node; context block with owning team mention and time since first crash. 4. **Embedded context** — attach the last N lines of `kubectl logs --previous`, the last few Warning events, and the container's resource requests/limits vs observed usage. Use a thread or a snippet to avoid wall-of-text. 5. **Action buttons** — Tail Logs (deep-link to your logging tool, pre-filtered), Describe Pod, Open Dashboard (Grafana scoped to the workload), Acknowledge, and Mute During Deploy. 6. **Deploy-aware suppression** — detect an in-progress rollout and downgrade or batch crash alerts until the rollout settles, then re-alert if it fails. 7. **Escalation** — page on-call only when a workload stays in CrashLoopBackOff past a duration AND it is user-facing; everything else stays a channel message. 8. **Validation** — measure false-positive rate during deploys and median time-to-first-action. Output as: (a) the event-watch + fingerprinting logic, (b) Block Kit JSON for one CrashLoopBackOff message, (c) the deploy-aware suppression rules, (d) the namespace → channel routing table, (e) a rollout plan starting in one namespace. Bias toward: one message per workload, deploy noise suppressed, on-call paged only for real user impact.