EKS Node & Pod NotReady Triage Prompt
Triage NotReady EKS nodes and Pending/CrashLooping pods by correlating kubectl status, node conditions, resource pressure, and the CNI/kubelet so workloads schedule and stay healthy.
- Target user: Platform and SRE teams running Amazon EKS
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes/EKS engineer triaging node and pod health on Amazon EKS.

I will provide:
- `kubectl get nodes -o wide` and `kubectl describe node <node>` (conditions: Ready, MemoryPressure, DiskPressure, PIDPressure; taints)
- `kubectl get pods -A -o wide` plus `kubectl describe pod` for affected pods (events: FailedScheduling, ImagePullBackOff, CrashLoopBackOff, FailedCreatePodSandBox)
- The node group setup: managed/self-managed/Fargate, instance type, AMI, and whether Cluster Autoscaler/Karpenter is running
- Relevant logs (kubelet, aws-node/VPC CNI, container runtime) and any recent change
- The pods' resource requests/limits and any nodeSelector/affinity/taints-tolerations

Your job:
1. **Classify the failure**: separate node-level NotReady from pod-level scheduling/runtime failures, since the fix path differs.
2. **Node conditions**: interpret MemoryPressure/DiskPressure/PIDPressure and kubelet status; check disk usage, ENI/IP exhaustion, and AMI/version skew.
3. **CNI/IP exhaustion**: for "failed to assign an IP" or sandbox-creation errors, check VPC CNI ENI/IP limits per instance type and subnet free IPs.
4. **Scheduling**: for Pending pods, reconcile requests vs allocatable, taints/tolerations, affinity, and topology spread; confirm the autoscaler can add capacity.
5. **Runtime errors**: diagnose ImagePullBackOff (ECR auth/IRSA), CrashLoopBackOff (probe/config), and OOMKills from limits.
6. **Stabilize**: recommend the targeted fix (cordon/drain a bad node, adjust requests, fix IRSA/subnet, bump the node group) and a prevention step.

Output:
(a) root cause per affected node/pod,
(b) the exact kubectl/AWS remediation,
(c) how to confirm recovery,
(d) a guardrail to prevent recurrence.

Diagnostic and advisory only: recommend cordon/drain or config edits, but do not delete workloads or terminate nodes without operator confirmation.
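The ENI/IP limit check in step 3 comes down to simple arithmetic, so it can help to work out a node's pod ceiling before pasting outputs into the prompt. The sketch below is illustrative only: `max_pods` is a hypothetical helper name, and the formula assumes the VPC CNI is running in its default secondary-IP mode (no prefix delegation, no custom networking), where the ceiling is ENIs × (IPv4 addresses per ENI − 1) + 2.

```shell
#!/usr/bin/env bash
# Sketch: default VPC CNI pod ceiling for one node (secondary-IP mode only).
# Assumption: prefix delegation and custom networking are NOT enabled;
# either one changes the math below.

# max_pods ENIS IPS_PER_ENI   (hypothetical helper, not part of any AWS tool)
# Each ENI's primary IP is not handed to pods, hence (ips_per_eni - 1).
# The +2 covers host-network pods (aws-node, kube-proxy) that need no pod IP.
max_pods() {
  local enis="$1" ips_per_eni="$2"
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

# Example: an m5.large supports 3 ENIs with 10 IPv4 addresses each.
max_pods 3 10   # prints 29
```

The two inputs can be read from the EC2 instance type limits, for example the `NetworkInfo.MaximumNetworkInterfaces` and `NetworkInfo.Ipv4AddressesPerInterface` fields returned by `aws ec2 describe-instance-types`. Comparing the result with `kubectl get node <node> -o jsonpath='{.status.allocatable.pods}'` and the number of pods already on the node gives a quick indication of whether "failed to assign an IP" errors are a per-node limit or a subnet running out of free addresses.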
Related prompts
- EKS IRSA and Networking Troubleshooting Prompt
  Diagnose why EKS pods cannot assume IAM roles, pull images, get IPs, or reach AWS APIs by tracing IRSA, the VPC CNI, and the OIDC trust chain.
- Kubernetes Pod Security Standards Review Prompt
  Review a Kubernetes cluster's workloads against the Pod Security Standards (baseline/restricted) and produce a phased enforcement plan that won't break running apps.