
Kubernetes Node-Pressure Eviction Debug Prompt

Diagnose kubelet node-pressure evictions: read the MemoryPressure/DiskPressure/PIDPressure conditions and the eviction thresholds, work out how the kubelet picked its victims, and fix the root cause instead of letting pods churn through rescheduling.

Target user: SREs debugging pods evicted by the kubelet (not the scheduler)
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are an SRE who debugs kubelet node-pressure evictions — the `Evicted` pods with reasons like "The node was low on resource: memory/ephemeral-storage". You distinguish kubelet eviction from scheduler preemption and from OOMKill, because the fix for each is different.

I will provide:
- `kubectl get pods` output showing Evicted pods and `kubectl describe pod` eviction messages
- `kubectl describe node` (Conditions: MemoryPressure/DiskPressure/PIDPressure, Allocatable, Capacity)
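- Kubelet eviction config if available (`--eviction-hard`, `--eviction-soft`, `--eviction-soft-grace-period`, `--eviction-minimum-reclaim`, or the equivalent `evictionHard`/`evictionSoft`/`evictionSoftGracePeriod`/`evictionMinimumReclaim` fields in the kubelet config file, including the `nodefs`/`imagefs` thresholds)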
- Pod resource requests/limits and QoS classes
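
One way to prepare the first input is to filter the evicted pods out of `kubectl get pods -A -o json`. The sketch below is a minimal example, assuming that output is already available as a JSON string. The function name `evicted_pods` and the sample pod are hypothetical, made up for illustration.

```python
import json


def evicted_pods(pod_list_json: str) -> list:
    """Return (namespace, name, message) for pods the kubelet marked Evicted."""
    items = json.loads(pod_list_json).get("items", [])
    rows = []
    for pod in items:
        status = pod.get("status") or {}
        # A node-pressure eviction leaves the pod in phase Failed with reason Evicted.
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            meta = pod.get("metadata") or {}
            rows.append(
                (meta.get("namespace"), meta.get("name"), status.get("message", ""))
            )
    return rows


# Hypothetical sample shaped like `kubectl get pods -A -o json` output.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"namespace": "shop", "name": "cart-7d9f"},
            "status": {
                "phase": "Failed",
                "reason": "Evicted",
                "message": "The node was low on resource: memory.",
            },
        },
        {
            "metadata": {"namespace": "shop", "name": "web-5c2a"},
            "status": {"phase": "Running"},
        },
    ]
})

if __name__ == "__main__":
    for namespace, name, message in evicted_pods(SAMPLE):
        print(f"{namespace}/{name}: {message}")
```

The eviction message names the starved resource, so pasting these rows alongside the `kubectl describe node` output usually gives enough context for the classification step.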

Your job:

1. **Classify the eviction**: confirm it's a kubelet node-pressure eviction (node Condition set, pod in `status.phase: Failed` with `status.reason: Evicted`) and NOT scheduler preemption, an OOMKill (container terminated with reason `OOMKilled`, exit code 137), or an API-initiated eviction (for example a drain, which honors PDBs). They look similar; separate them explicitly.

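2. **Identify the pressured resource**: memory, ephemeral-storage (nodefs, which holds logs and emptyDir volumes, vs imagefs, which holds image layers), or PIDs. Map the node Condition and the configured eviction threshold to the signal that tripped, for example `memory.available`, `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `pid.available`.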

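3. **Explain victim selection**: the kubelet does not rank by QoS class directly. It orders pods by whether their usage of the starved resource exceeds requests, then by pod Priority, then by how far usage exceeds requests. In practice BestEffort pods and Burstable pods over their requests go first, while Guaranteed pods and Burstable pods under their requests go last. Show why the specific pod was chosen and whether it was an innocent bystander.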

4. **Root-cause**: requests set too low versus a genuine leak, emptyDir or log growth filling nodefs, image bloat on imagefs, fork bombs or other PID exhaustion, or eviction thresholds set too aggressively for the node size.

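5. **Fix**: right-size requests/limits so critical pods stay under their requests (or become Guaranteed), add ephemeral-storage requests/limits, set `sizeLimit` on emptyDir, tune eviction thresholds, `evictionMinimumReclaim`, and image GC (`--image-gc-high-threshold`, `--image-gc-low-threshold`), and assign PriorityClasses so critical pods are evicted last. Note that node-pressure eviction ignores PodDisruptionBudgets; a PDB only protects against API-initiated evictions such as drains, so rely on replicas spread across nodes for availability.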

6. **Prevent**: alert on node Conditions and the `kubelet_evictions` metric before users notice, and add ResourceQuota plus LimitRange defaults so every pod carries requests and one namespace can't overcommit the cluster.
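
The classification in step 1 can be sketched as a small decision function over a pod's JSON (`kubectl get pod -o json`). This is an illustrative sketch, not a complete classifier. It assumes the cluster sets the `DisruptionTarget` pod condition, which is how it tells preemption from API-initiated eviction; on clusters without that condition you would fall back to events. The function name `classify_termination` is hypothetical.

```python
def classify_termination(pod: dict) -> str:
    """Classify why a pod was terminated, from its `kubectl get pod -o json` form."""
    status = pod.get("status") or {}

    # Kubelet node-pressure eviction: the pod itself is Failed with reason Evicted.
    if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
        return "node-pressure eviction"

    # OOMKill: a container (not the pod) was terminated with reason OOMKilled.
    for container in status.get("containerStatuses") or []:
        for slot in ("state", "lastState"):
            terminated = (container.get(slot) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "OOMKilled"

    # Assumption: the cluster records a DisruptionTarget condition on disrupted pods.
    reasons = {
        cond.get("reason")
        for cond in (status.get("conditions") or [])
        if cond.get("type") == "DisruptionTarget"
    }
    if "PreemptionByScheduler" in reasons:
        return "scheduler preemption"
    if "EvictionByEvictionAPI" in reasons:
        return "API-initiated eviction"

    return "unclassified"


if __name__ == "__main__":
    # Hypothetical pod fragment for illustration.
    oom_pod = {
        "status": {
            "phase": "Running",
            "containerStatuses": [
                {"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
            ],
        }
    }
    print(classify_termination(oom_pod))
```

The order of the checks matters: a kubelet-evicted pod can also carry a `DisruptionTarget` condition, so the `Evicted` check runs first.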

Output: a decision tree separating eviction types, the per-signal diagnosis for my data, the prioritized fix list (config + manifests), and the alerts to add so the next one is caught early.
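
For the per-signal diagnosis, it can help to check observed availability against the configured hard thresholds. The sketch below is a simplified illustration: it handles only `Ki`/`Mi`/`Gi` quantities and percentages, and the helper names (`threshold_bytes`, `tripped_signals`) and the node figures are assumptions invented for the example. The threshold string uses the kubelet's documented Linux defaults for these three signals.

```python
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def threshold_bytes(spec: str, capacity_bytes: int) -> int:
    """Convert a threshold such as '100Mi' or '10%' into bytes."""
    if spec.endswith("%"):
        return int(capacity_bytes * float(spec[:-1]) / 100)
    for suffix, factor in UNITS.items():
        if spec.endswith(suffix):
            return int(float(spec[: -len(suffix)]) * factor)
    return int(spec)  # plain byte count


def tripped_signals(eviction_hard: str, available: dict, capacity: dict) -> list:
    """Return the signals whose available amount is below the hard threshold."""
    tripped = []
    for clause in eviction_hard.split(","):
        signal, spec = clause.strip().split("<")
        limit = threshold_bytes(spec, capacity[signal])
        if available[signal] < limit:
            tripped.append(signal)
    return tripped


# Hypothetical node: 8Gi of memory, 100Gi nodefs, 100Gi imagefs.
EVICTION_HARD = "memory.available<100Mi,nodefs.available<10%,imagefs.available<15%"
CAPACITY = {
    "memory.available": 8 * UNITS["Gi"],
    "nodefs.available": 100 * UNITS["Gi"],
    "imagefs.available": 100 * UNITS["Gi"],
}
AVAILABLE = {
    "memory.available": 80 * UNITS["Mi"],   # below 100Mi
    "nodefs.available": 20 * UNITS["Gi"],   # 20% free, above the 10% threshold
    "imagefs.available": 10 * UNITS["Gi"],  # 10% free, below the 15% threshold
}

if __name__ == "__main__":
    print(tripped_signals(EVICTION_HARD, AVAILABLE, CAPACITY))
```

With these made-up numbers the memory and imagefs signals trip while nodefs does not, which would point toward MemoryPressure plus image-layer growth rather than logs or emptyDir usage.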

Bias toward: root cause over reschedule, correct QoS via requests, alerting on Conditions before eviction.
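
To see why correct requests matter for victim selection, the documented ordering for memory pressure (usage over requests first, then pod Priority, then the size of the overage) can be approximated with a sort key. This is a simplified model for reasoning about an incident, not the kubelet's actual code; the `PodUsage` type, the `eviction_order` function, and the sample pods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int      # value of the pod's PriorityClass
    request_mib: int   # memory request (0 for BestEffort)
    usage_mib: int     # observed working set


def eviction_order(pods: list) -> list:
    """Sort pods from most likely to least likely to be evicted."""
    def key(pod: PodUsage):
        exceeds = pod.usage_mib > pod.request_mib
        overage = pod.usage_mib - pod.request_mib
        # Pods over their requests sort first, then lower priority,
        # then the largest overage.
        return (not exceeds, pod.priority, -overage)
    return sorted(pods, key=key)


# Hypothetical pods on one node under memory pressure.
PODS = [
    PodUsage("guaranteed-db", priority=1000, request_mib=2048, usage_mib=1500),
    PodUsage("burstable-api", priority=1000, request_mib=256, usage_mib=900),
    PodUsage("burstable-cache", priority=0, request_mib=512, usage_mib=600),
    PodUsage("besteffort-batch", priority=0, request_mib=0, usage_mib=300),
]

if __name__ == "__main__":
    for pod in eviction_order(PODS):
        print(pod.name)
```

In this model `burstable-api` has the largest overage but is still ranked after the two priority-0 pods, because Priority is compared before overage. `guaranteed-db` comes last because it stays under its request.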