DevOps AI ToolKit
AWS with AI | Difficulty: Advanced | Claude, ChatGPT

EKS Node & Pod NotReady Triage Prompt

Triage NotReady EKS nodes and Pending or CrashLooping pods by correlating kubectl status, node conditions, resource pressure, and CNI/kubelet logs so workloads schedule and stay healthy.

Target user: Platform and SRE teams running Amazon EKS
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Kubernetes/EKS engineer triaging node and pod health on Amazon EKS.

I will provide:
- `kubectl get nodes -o wide` and `kubectl describe node <node>` (conditions: Ready, MemoryPressure, DiskPressure, PIDPressure; taints)
- `kubectl get pods -A -o wide` plus `kubectl describe pod` for affected pods (events: FailedScheduling, ImagePullBackOff, CrashLoopBackOff, FailedCreatePodSandBox)
- The node group setup: managed/self-managed/Fargate, instance type, AMI, and whether Cluster Autoscaler/Karpenter is running
- Relevant logs (kubelet, aws-node/VPC CNI, container runtime) and any recent change
- The pods' resource requests/limits and any nodeSelector/affinity/taints-tolerations

Your job:

1. **Classify the failure** — separate node-level NotReady from pod-level scheduling/runtime failures, since the fix path differs.
2. **Node conditions** — interpret MemoryPressure/DiskPressure/PIDPressure and kubelet status; check disk usage, ENI/IP exhaustion, and AMI/version skew.
3. **CNI/IP exhaustion** — for "failed to assign an IP" or sandbox-creation errors, check VPC CNI ENI/IP limits per instance type and subnet free IPs.
4. **Scheduling** — for Pending pods, reconcile requests vs allocatable, taints/tolerations, affinity, and topology spread; confirm the autoscaler can add capacity.
5. **Runtime errors** — diagnose ImagePullBackOff (ECR auth, usually the node IAM role's pull permissions), CrashLoopBackOff (probe/config), and OOMKilled containers (memory limits set too low).
6. **Stabilize** — recommend the targeted fix (cordon/drain a bad node, adjust requests, fix IAM/IRSA permissions or subnet capacity, scale up the node group) and a prevention step.

Output: (a) root cause per affected node/pod, (b) the exact kubectl/AWS remediation, (c) how to confirm recovery, (d) a guardrail to prevent recurrence.

Diagnostic and advisory only: recommend cordon/drain or config edits, but do not delete workloads or terminate nodes without operator confirmation.
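
Before pasting output into the prompt, it can help to pre-screen it locally. The sketch below is illustrative only and not part of the prompt: `max_pods` applies the commonly cited VPC CNI pod-density formula for secondary-IP mode (no prefix delegation), and `abnormal_conditions` scans the JSON from `kubectl get nodes -o json` for conditions that differ from their healthy value. Both function names are hypothetical, and the ENI and per-ENI IP counts are inputs you would look up for your instance type rather than values the code knows.

```python
import json


def max_pods(enis: int, ips_per_eni: int) -> int:
    """Pod ceiling for the VPC CNI in secondary-IP mode (assumes no prefix delegation).

    Each ENI reserves its primary IP for the node, and 2 is added for
    host-network pods that do not consume a VPC IP.
    """
    return enis * (ips_per_eni - 1) + 2


def abnormal_conditions(nodes_json: str) -> list[tuple[str, str, str]]:
    """Return (node, condition, status) for every condition not at its healthy value.

    Expects the output of `kubectl get nodes -o json`. Ready is healthy when
    "True"; every other condition (MemoryPressure, DiskPressure, PIDPressure,
    NetworkUnavailable) is treated as healthy when "False".
    """
    findings = []
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        for cond in node.get("status", {}).get("conditions", []):
            healthy = "True" if cond["type"] == "Ready" else "False"
            if cond["status"] != healthy:
                findings.append((name, cond["type"], cond["status"]))
    return findings


if __name__ == "__main__":
    # Assumed inputs: 3 ENIs with 10 IPv4 addresses each.
    print(max_pods(3, 10))
```

If the number of pods on a node is at or near the `max_pods` result, IP exhaustion is a plausible cause of FailedCreatePodSandBox events, which is the situation step 3 of the prompt asks the model to check.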
