AI for Kubernetes & Helm

Kubernetes OOMKilled Memory Limit Diagnosis Prompt

Diagnose why containers are OOMKilled — distinguish container limit kills from node-level memory pressure, working-set growth, and JVM/heap-vs-RSS gaps, then right-size limits.

Target user: platform engineers running Kubernetes in production
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior Kubernetes platform engineer who has root-caused hundreds of OOMKilled events and knows the difference between a cgroup limit kill, a node OOM-killer kill, and a runtime that simply leaks.

I will provide:
- `kubectl describe pod` output showing `Last State: Terminated, Reason: OOMKilled` and exit code
- The container's resource requests/limits and runtime (JVM, Go, Node, Python)
- Memory metrics over time (working_set_bytes, RSS, cache) if available

Your job:

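1. **Classify the kill** — confirm whether it was a container cgroup-limit OOM or a node-level OOM-killer event. Exit code 137 (128 + SIGKILL) only shows that the process was killed, so check node conditions and `dmesg`/kernel logs to tell the two apart; they need different fixes.
2. **Separate working set from cache** — explain that the kernel charges page cache as well as RSS to the container's cgroup, and that `container_memory_working_set_bytes` (total usage minus inactive file cache, i.e. RSS plus cache that is not readily reclaimable) is the metric to compare against the limit, not RSS alone, so cache-heavy workloads can be killed while RSS is still well below the limit.
3. **Runtime-specific traps** — for JVM, compare `-Xmx` with the container limit and check whether `-XX:MaxRAMPercentage`/`-XX:+UseContainerSupport` is set; for Node, check `--max-old-space-size`; for Go, check whether `GOMEMLIMIT` is set (it is a soft limit); flag heap-vs-RSS gaps, since non-heap memory (metaspace, thread stacks, native buffers) also counts against the limit.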
4. **Right-size requests and limits** — recommend a limit at the observed p99 working set plus headroom, set requests to steady-state, and explain QoS implications (Guaranteed vs Burstable) for eviction order.
5. **Decide whether it is a leak or under-provisioning** — distinguish a slow monotonic climb (leak → fix the app) from a legitimate steady-state above the limit (raise the limit).
6. **Mitigations** — suggest VPA in recommendation mode, restart policies, and whether removing the limit (and relying on requests + node headroom) is appropriate.
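
As a rough illustration of steps 1 and 2, the sketch below classifies a kernel log line and derives a working set from raw cgroup numbers. It is a minimal example under stated assumptions, not the kubelet's or the kernel's own logic: the function names are hypothetical, the matched log prefixes are the ones recent Linux kernels print (wording can differ between kernel versions), and the working-set formula follows the cAdvisor definition (usage minus inactive file cache, floored at zero).

```python
def classify_oom_log_line(line: str) -> str:
    """Classify one kernel log line (e.g. from `dmesg` or `journalctl -k`).

    Assumption: the prefixes below match the running kernel's OOM messages.
    """
    # Check the more specific cgroup message first.
    if "Memory cgroup out of memory" in line:
        return "cgroup-limit"
    if "Out of memory:" in line:
        return "node-level"
    return "not-an-oom-line"


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Working set as total cgroup usage minus inactive file cache, never negative."""
    return max(usage_bytes - inactive_file_bytes, 0)
```

A container reporting 900 MiB of usage with 300 MiB of inactive file cache therefore has a 600 MiB working set, which is the figure to compare against the limit.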

Output as: a kill classification, a root-cause statement, a concrete requests/limits YAML patch, and a verification step.
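
To make the requested requests/limits patch concrete, here is one way the sizing rule from step 4 could be turned into a patch document. Treat it as a sketch: the helper names, the 25% headroom, the 64 MiB rounding step, and the use of the median as "steady state" are all illustrative assumptions, and the patch shape assumes a Deployment-style workload.

```python
import json
import math

MIB = 1024 * 1024


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def round_up_mib(value_bytes, step_mib=64):
    """Round a byte count up to the next multiple of step_mib, returned in MiB."""
    return math.ceil(value_bytes / (step_mib * MIB)) * step_mib


def build_memory_patch(container, working_set_samples, headroom=0.25):
    """Build a patch dict: request = median, limit = p99 plus headroom."""
    steady = nearest_rank_percentile(working_set_samples, 50)
    peak = nearest_rank_percentile(working_set_samples, 99)
    resources = {
        "requests": {"memory": f"{round_up_mib(steady)}Mi"},
        "limits": {"memory": f"{round_up_mib(peak * (1 + headroom))}Mi"},
    }
    containers = [{"name": container, "resources": resources}]
    return {"spec": {"template": {"spec": {"containers": containers}}}}


if __name__ == "__main__":
    # Hypothetical samples: mostly 400 MiB with occasional 600 MiB peaks.
    samples = [400 * MIB] * 90 + [600 * MIB] * 10
    print(json.dumps(build_memory_patch("app", samples), indent=2))
```

The printed JSON is also valid YAML, so it can serve as the patch body directly. Because the request and limit differ here, the pod would land in the Burstable QoS class; setting them equal (and doing the same for CPU) is what produces Guaranteed.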

Never recommend simply doubling the limit without identifying whether the growth is bounded — an unbounded leak will OOM at any limit.
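
Whether growth is bounded can be checked roughly before anyone touches the limit. The sketch below fits a least-squares slope to working-set samples and asks whether the second half of the window is still climbing. It is a heuristic under stated assumptions: the function names and the 5% growth threshold are hypothetical, and a real analysis should cover a window long enough to include normal traffic cycles.

```python
def slope_per_hour(samples):
    """Least-squares slope in bytes/hour.

    samples: list of (unix_seconds, working_set_bytes) with at least two
    distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var * 3600


def looks_unbounded(samples, min_growth_fraction=0.05):
    """True if the second half of the window still grows by more than
    min_growth_fraction of its own mean, i.e. no plateau is visible yet."""
    half = samples[len(samples) // 2:]
    hours = (half[-1][0] - half[0][0]) / 3600
    mean_y = sum(y for _, y in half) / len(half)
    return slope_per_hour(half) * hours > min_growth_fraction * mean_y


if __name__ == "__main__":
    # Hypothetical hourly samples: a steady climb versus a warm-up that plateaus.
    leak = [(i * 3600, 100_000_000 + i * 10_000_000) for i in range(24)]
    warmup = [(i * 3600, min(100_000_000 + i * 50_000_000, 400_000_000)) for i in range(24)]
    print(looks_unbounded(leak), looks_unbounded(warmup))
```

A steady climb that never flattens points at the application; a curve that rises and then holds is a candidate for a higher limit.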
