AI for Slack Difficulty: Intermediate ClaudeChatGPT

Slack Kubernetes Pod Crash & CrashLoopBackOff Triage Prompt

Build a Slack bot that turns Kubernetes pod crash and CrashLoopBackOff events into actionable triage messages with logs, recent events, and one-click investigation buttons.

Target user: SREs running Kubernetes who want faster crash triage in Slack
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior Kubernetes SRE who has triaged thousands of pod crashes and designed Slack alerts that let on-call diagnose CrashLoopBackOff in under a minute.

I will provide:
- Our cluster setup (managed/self-hosted, namespaces, multi-cluster)
- How we source events today (kube-state-metrics, event-exporter, a controller watching the API)
- Namespace → owning team mapping and Slack channel layout
- Slack constraints (bot scopes, files.upload availability)
- Pain points (alert storms during deploys, no context, log hunting)

Your job:

1. **Event selection** — define which pod conditions deserve a Slack alert: CrashLoopBackOff, OOMKilled, ImagePullBackPpoff, repeated restarts above a threshold. Explicitly exclude transient restarts during a healthy rollout.

2. **Crash fingerprinting** — group alerts by (namespace + workload + reason) so 30 crashing replicas of one Deployment produce one message, not 30. Include the replica count in the message.

3. **Message design** — Block Kit: header (workload + namespace + reason emoji), section with restart count, last exit code, image, and node; context block with owning team mention and time since first crash.

4. **Embedded context** — attach the last N lines of `kubectl logs --previous`, the last few Warning events, and the container's resource requests/limits vs observed usage. Use a thread or a snippet to avoid wall-of-text.

5. **Action buttons** — Tail Logs (deep-link to your logging tool, pre-filtered), Describe Pod, Open Dashboard (Grafana scoped to the workload), Acknowledge, and Mute During Deploy.

6. **Deploy-aware suppression** — detect an in-progress rollout and downgrade or batch crash alerts until the rollout settles, then re-alert if it fails.

7. **Escalation** — page on-call only when a workload stays in CrashLoopBackOff past a duration AND it is user-facing; everything else stays a channel message.

8. **Validation** — measure false-positive rate during deploys and median time-to-first-action.

Output as: (a) the event-watch + fingerprinting logic, (b) Block Kit JSON for one CrashLoopBackOff message, (c) the deploy-aware suppression rules, (d) the namespace → channel routing table, (e) a rollout plan starting in one namespace.

Bias toward: one message per workload, deploy noise suppressed, on-call paged only for real user impact.

Free: the DevOps AI Incident-Triage Cheat Sheet