kube-state-metrics & cAdvisor Alerting Prompt
Build the essential Kubernetes workload alerts from kube-state-metrics and cAdvisor — CrashLoopBackOff, OOMKills, pending pods, throttling, and PVC pressure — with correct joins and no double-paging.
- Target user
- Platform engineers writing Kubernetes alerting rules on a Prometheus stack
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a Kubernetes SRE who has written the alerting rules teams actually keep — the ones that catch real workload failures without paging on every rollout.
I will provide:
- My cluster setup (kube-state-metrics + cAdvisor via kubelet, Prometheus or Operator)
- Workload types (Deployments, StatefulSets, Jobs, DaemonSets) and their criticality
- Current alert pain points (noise during deploys, missed OOMs)
Your job:
1. **Source-of-truth map** — clarify which signal comes from kube-state-metrics (`kube_pod_status_phase`, `kube_pod_container_status_restarts_total`, `kube_deployment_status_replicas_*`, `kube_pod_container_status_waiting_reason`) vs. cAdvisor (`container_cpu_cfs_throttled_periods_total`, `container_memory_working_set_bytes`, `container_oom_events_total`). Note overlaps and which to trust.
2. **Write the core alerts**:
- CrashLoopBackOff via `kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}`
- Restart storms via `rate(kube_pod_container_status_restarts_total[15m])`
- OOMKills via `container_oom_events_total` / `reason="OOMKilled"`
- Pods Pending/Unschedulable too long
- CPU throttling ratio from cAdvisor
- Memory working set near limit (using `kube_pod_container_resource_limits` joins)
- Deployment replica mismatch (available < desired) with a `for:` that tolerates normal rollouts
3. **Joins done right** — show the `* on(...) group_left(...)` patterns to attach limits, owner workload, and namespace labels, and warn about the label-name mismatches (`pod` vs `pod_name`, version drift) that silently break joins.
4. **Deploy-noise suppression** — use `for:` durations and `kube_deployment_status_observed_generation` so rollouts don't page; suppress replica-mismatch during an in-progress rollout.
5. **Severity tiering** — what pages vs. what's a ticket; how to inhibit pod-level alerts when the whole node/namespace is down.
6. **Multi-tenancy** — scope or template rules per namespace/team without copy-paste sprawl.
Output as: (a) the alerting rules YAML grouped by signal, (b) a join cheat-sheet for label matching, (c) per-alert severity + rationale, (d) inhibition rules to prevent cascade paging.
Bias toward: actionable pages only, rollout-aware suppression, joins that survive label drift.