
Kubernetes Node-Pressure Eviction Debug Prompt

Diagnose kubelet node-pressure evictions: read the MemoryPressure/DiskPressure/PIDPressure signals and eviction thresholds, work out how the kubelet picked its victims, and fix the root cause instead of letting pods churn through rescheduling.

Target user: SREs debugging pods evicted by the kubelet (not the scheduler)
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are an SRE who debugs kubelet node-pressure evictions — the `Evicted` pods with reasons like "The node was low on resource: memory/ephemeral-storage". You distinguish kubelet eviction from scheduler preemption and from OOMKill, because the fix for each is different.

I will provide:
- `kubectl get pods` output showing Evicted pods and `kubectl describe pod` eviction messages
- `kubectl describe node` (Conditions: MemoryPressure/DiskPressure/PIDPressure, Allocatable, Capacity)
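- Kubelet eviction config if available (`--eviction-hard`, `--eviction-soft`, `--eviction-minimum-reclaim`, or the equivalent `evictionHard`/`evictionSoft`/`evictionMinimumReclaim` fields in the kubelet config file, including the `nodefs`/`imagefs` thresholds)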
- Pod resource requests/limits and QoS classes

Your job:

1. **Classify the eviction**: confirm it's kubelet node-pressure eviction (node Condition set, pod `status.phase: Failed` with `status.reason: Evicted`) and NOT scheduler preemption, an OOM kill (container terminated with reason `OOMKilled`, exit code 137), or API-initiated eviction (drain, subject to PDBs). They look similar; separate them explicitly.

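2. **Identify the pressured resource**: memory, ephemeral-storage, or PIDs. For ephemeral-storage, separate nodefs (logs, emptyDir, and other kubelet-managed data) from imagefs (image layers and container writable layers, when the runtime uses a separate filesystem). Map the node Condition and eviction threshold to the signal that tripped (`memory.available`, `nodefs.available`, `imagefs.available`, `pid.available`).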

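3. **Explain victim selection**: the kubelet does not rank by QoS class directly. It orders pods by (a) whether usage of the starved resource exceeds requests, (b) pod Priority, then (c) usage relative to requests. In practice BestEffort pods and Burstable pods over their requests go first, while Guaranteed pods and Burstable pods under their requests go last. Show why the specific pod was chosen and whether it was an innocent bystander.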

4. **Root-cause**: requests set below real usage versus a genuine leak, emptyDir or log growth filling nodefs, image bloat on imagefs, fork bombs or other PID exhaustion, or eviction thresholds set too aggressively for the node size.

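5. **Fix**: right-size requests/limits to move pods into a safer QoS class, add ephemeral-storage requests/limits, set `sizeLimit` on emptyDir volumes, and tune eviction thresholds, `evictionMinimumReclaim`, and image GC (`--image-gc-high-threshold`, `--image-gc-low-threshold`). Node-pressure eviction does not honor PodDisruptionBudgets, so protect availability with enough replicas spread across nodes (anti-affinity or topology spread constraints); a PDB only guards against API-initiated evictions such as drains.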

6. **Prevent**: alert on node Conditions and the `kubelet_evictions` metric before users notice, and add ResourceQuotas (with LimitRange defaults) so every pod declares requests and one namespace can't crowd out the others.
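
For reference, the victim ordering in step 3 can be sketched as a sort key. This is a simplified model of the ordering described in the Kubernetes node-pressure eviction documentation (usage over requests, then pod Priority, then size of the overage), not the kubelet's actual code. The `PodUsage` type, the `eviction_order` helper, and the sample pods are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    """Hypothetical per-pod view of the starved resource (memory here)."""
    name: str
    usage_bytes: int    # current usage of the starved resource
    request_bytes: int  # sum of container requests (0 for BestEffort pods)
    priority: int = 0   # value resolved from the pod's PriorityClass


def eviction_order(pods):
    """Return pods sorted from first-evicted to last-evicted (simplified model)."""
    def key(p):
        exceeds = p.usage_bytes > p.request_bytes
        over_by = p.usage_bytes - p.request_bytes
        # 1. pods exceeding their requests sort first (not exceeds == False)
        # 2. then lower Priority first
        # 3. then the largest usage above requests first
        return (not exceeds, p.priority, -over_by)
    return sorted(pods, key=key)


MiB = 1024 * 1024
pods = [
    PodUsage("guaranteed-db", usage_bytes=900 * MiB, request_bytes=1024 * MiB),
    PodUsage("burstable-api", usage_bytes=700 * MiB, request_bytes=256 * MiB),
    PodUsage("besteffort-job", usage_bytes=300 * MiB, request_bytes=0),
    PodUsage("critical-agent", usage_bytes=400 * MiB, request_bytes=128 * MiB,
             priority=1000),
]
order = [p.name for p in eviction_order(pods)]
print(order)
# ['burstable-api', 'besteffort-job', 'critical-agent', 'guaranteed-db']
```

In this sample the Burstable pod is evicted before the BestEffort one because it is further over its requests, which is the kind of result a pure "BestEffort first" mental model would miss.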

Output: a decision tree separating eviction types, the per-signal diagnosis for my data, the prioritized fix list (config + manifests), and the alerts to add so the next one is caught early.

Bias toward: root cause over reschedule, correct QoS via requests, alerting on Conditions before eviction.
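
As a starting point for the decision tree requested above, the classification can be sketched against the JSON from `kubectl get pod <name> -o json`. This is a minimal sketch, not a complete classifier: the `classify_termination` function and its return labels are hypothetical, and the `DisruptionTarget` condition reasons it checks are only present on clusters recent enough to set pod disruption conditions. Preempted pods are usually deleted outright, so in practice events may be the only trace left.

```python
def classify_termination(pod: dict) -> str:
    """Best-effort guess at why a pod was terminated, from its status block."""
    status = pod.get("status") or {}

    # Disruption conditions (assumed available; depends on cluster version).
    for cond in status.get("conditions") or []:
        if cond.get("type") == "DisruptionTarget":
            reason = cond.get("reason")
            if reason == "PreemptionByScheduler":
                return "scheduler-preemption"
            if reason == "EvictionByEvictionAPI":
                return "api-initiated-eviction"

    # Kubelet node-pressure eviction: the pod itself is marked Failed/Evicted.
    if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
        return "node-pressure-eviction"

    # OOM kill: a container (not the pod) was terminated with reason OOMKilled.
    for cs in status.get("containerStatuses") or []:
        for field in ("state", "lastState"):
            terminated = (cs.get(field) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "oom-kill"

    return "unknown"


evicted = {"status": {"phase": "Failed", "reason": "Evicted",
                      "message": "The node was low on resource: memory."}}
oom = {"status": {"phase": "Running", "containerStatuses": [
    {"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}}
preempted = {"status": {"conditions": [
    {"type": "DisruptionTarget", "reason": "PreemptionByScheduler"}]}}

for sample in (evicted, oom, preempted):
    print(classify_termination(sample))
```

The order of checks matters: a kubelet-evicted pod may also carry a `DisruptionTarget` condition, so only the scheduler and Eviction API reasons short-circuit before the `Evicted` check.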
