AI for Prometheus & Monitoring Difficulty: Intermediate ClaudeChatGPT

kube-state-metrics & cAdvisor Alerting Prompt

Build the essential Kubernetes workload alerts from kube-state-metrics and cAdvisor — CrashLoopBackOff, OOMKills, pending pods, throttling, and PVC pressure — with correct joins and no double-paging.

Target user: Platform engineers writing Kubernetes alerting rules on a Prometheus stack
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a Kubernetes SRE who has written the alerting rules teams actually keep — the ones that catch real workload failures without paging on every rollout.

I will provide:
- My cluster setup (kube-state-metrics + cAdvisor via kubelet, Prometheus or Operator)
- Workload types (Deployments, StatefulSets, Jobs, DaemonSets) and their criticality
- Current alert pain points (noise during deploys, missed OOMs)

Your job:

1. **Source-of-truth map** — clarify which signal comes from kube-state-metrics (`kube_pod_status_phase`, `kube_pod_container_status_restarts_total`, `kube_deployment_status_replicas_*`, `kube_pod_container_status_waiting_reason`) vs. cAdvisor (`container_cpu_cfs_throttled_periods_total`, `container_memory_working_set_bytes`, `container_oom_events_total`). Note overlaps and which to trust.

2. **Write the core alerts**:
   - CrashLoopBackOff via `kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}`
   - Restart storms via `rate(kube_pod_container_status_restarts_total[15m])`
   - OOMKills via `container_oom_events_total` / `reason="OOMKilled"`
   - Pods Pending/Unschedulable too long
   - CPU throttling ratio from cAdvisor
   - Memory working set near limit (using `kube_pod_container_resource_limits` joins)
   - Deployment replica mismatch (available < desired) with a `for:` that tolerates normal rollouts

3. **Joins done right** — show the `* on(...) group_left(...)` patterns to attach limits, owner workload, and namespace labels, and warn about the label-name mismatches (`pod` vs `pod_name`, version drift) that silently break joins.

4. **Deploy-noise suppression** — use `for:` durations and `kube_deployment_status_observed_generation` so rollouts don't page; suppress replica-mismatch during an in-progress rollout.

5. **Severity tiering** — what pages vs. what's a ticket; how to inhibit pod-level alerts when the whole node/namespace is down.

6. **Multi-tenancy** — scope or template rules per namespace/team without copy-paste sprawl.

Output as: (a) the alerting rules YAML grouped by signal, (b) a join cheat-sheet for label matching, (c) per-alert severity + rationale, (d) inhibition rules to prevent cascade paging.

Bias toward: actionable pages only, rollout-aware suppression, joins that survive label drift.

Free: the DevOps AI Incident-Triage Cheat Sheet