Troubleshooting EKS With AI: IRSA, Networking, and Scheduling
EKS failures span Kubernetes and AWS at once. Here's how to use AI to triage IRSA, CNI networking, and pod scheduling problems without guessing across layers.
- #aws
- #ai
- #eks
- #kubernetes
- #troubleshooting
The thing that makes EKS uniquely painful to debug is that a single failure lives in two worlds at once. A pod that won’t start could be a Kubernetes scheduling problem, or an AWS IAM trust-policy problem, or a VPC CNI IP-exhaustion problem, and the error messages rarely tell you which world to look in. I spent a frustrating afternoon last quarter on a pod stuck in CreateContainerError that turned out to be an IRSA misconfiguration — the kind of cross-boundary bug where you check Kubernetes RBAC for an hour before realizing the actual problem was an AWS OIDC condition with a typo.
EKS is where AI-assisted debugging shines hardest precisely because the problem spans layers. A model can hold the Kubernetes manifest, the IAM trust policy, and the CNI config in context at the same time and tell you which boundary the failure is crossing — something I struggle to do under pressure. As always, it’s a reasoning partner, not an actor: it points at the layer, I confirm with a real command, I make the fix.
Triage starts with the symptom shape, not the logs
Before pasting anything into a model, classify the failure, because each class has a different evidence set. A pod stuck in Pending is almost always scheduling or IP exhaustion. A pod that runs but gets AccessDenied on AWS calls is IRSA. Pods that can’t reach each other or the internet are a CNI/networking problem. Get the class wrong and you waste the model’s context on irrelevant data.
kubectl get pods -n payments -o wide
kubectl describe pod payments-api-xxxx -n payments
kubectl get events -n payments --sort-by='.lastTimestamp' | tail -20
The describe output and recent events are the highest-signal artifacts. That’s what goes to AI first.
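The classification step can be sketched as a small lookup. This is a minimal illustration, not a complete rule set: the `classify` function, the class names, and the keyword lists are all hypothetical, and a real failure may need more context than a status and one error line.

```python
from typing import NamedTuple


class Triage(NamedTuple):
    failure_class: str
    evidence: tuple  # what to gather before involving a model


PENDING = Triage("scheduling-or-ip-exhaustion",
                 ("kubectl describe pod events", "subnet free-IP count"))
IRSA = Triage("irsa",
              ("service account role-arn annotation", "IAM role trust policy"))
NETWORK = Triage("cni-networking",
                 ("aws-node logs", "subnet free-IP count"))
UNKNOWN = Triage("unknown", ("kubectl describe pod events",))

# Illustrative keywords only; extend them for your own workloads.
NETWORK_HINTS = ("timeout", "timed out", "no route to host")


def classify(pod_phase: str, symptom: str = "") -> Triage:
    """Map a pod phase plus one error line to a failure class."""
    text = symptom.lower()
    if pod_phase == "Pending":
        return PENDING
    if "accessdenied" in text.replace(" ", ""):
        return IRSA
    if any(hint in text for hint in NETWORK_HINTS):
        return NETWORK
    return UNKNOWN


print(classify("Running", "AccessDenied: not authorized to perform s3:GetObject"))
```

The returned `evidence` tuple is the useful part: it is the checklist of what to collect for that class before opening a chat window.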
IRSA: the trust policy is usually the liar
IRSA bugs are my most common EKS support load, and they’re nasty because the Kubernetes side looks perfect — the service account exists, the annotation is there — while the AWS side silently rejects the token. Gather both halves:
# Kubernetes side: the SA annotation
kubectl get sa payments-api -n payments \
-o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
# AWS side: the role's trust policy
aws iam get-role --role-name payments-api-irsa \
--query 'Role.AssumeRolePolicyDocument'
Then hand both to the model with a prompt that knows the exact mechanics of how IRSA tokens get validated:
This is an EKS IRSA setup. The pod’s service account has the role-arn annotation below, and the role’s trust policy is below. IRSA works by the pod presenting a projected OIDC token that AWS validates against this trust policy’s Federated principal and StringEquals conditions. Check: (1) does the trust policy’s :sub condition exactly match system:serviceaccount:<namespace>:<serviceaccountname>, (2) does it include the :aud = sts.amazonaws.com condition, (3) does the OIDC provider ARN in the principal match the cluster’s actual OIDC issuer. Point out any mismatch character by character.
The model nailed it:
The trust policy’s sub condition is system:serviceaccount:payment:payments-api but the pod runs in namespace payments. The condition reads “payment” (singular). AWS rejects the AssumeRoleWithWebIdentity call because the projected token’s sub claim is system:serviceaccount:payments:payments-api. Fix the trust policy namespace.
A missing s. I would have read past that fifteen times. Here’s the corrected condition:
{
"Condition": {
"StringEquals": {
"oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:payments:payments-api",
"oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:aud": "sts.amazonaws.com"
}
}
}
Confirm the fix with a real check rather than assuming. If the container image ships the AWS CLI, exec into the pod and make an actual AWS call:
kubectl exec -it payments-api-xxxx -n payments -- aws sts get-caller-identity
If that returns the assumed-role ARN, you’re done. The model found the layer; the live call proved the fix.
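The three checks from the prompt are mechanical enough to script as a pre-flight before asking a model at all. Here is a minimal sketch, assuming the trust policy has already been fetched and parsed into a dict. The function name `check_irsa_trust` and the account ID `123456789012` are placeholders of mine, and the sketch only inspects `StringEquals`, so a policy that uses `StringLike` would need extra handling.

```python
def check_irsa_trust(policy, issuer, namespace, service_account):
    """Return a list of mismatches; an empty list means all three checks passed."""
    expected_sub = f"system:serviceaccount:{namespace}:{service_account}"
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    for stmt in statements:
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRoleWithWebIdentity" not in actions:
            continue
        problems = []
        principal = stmt.get("Principal", {})
        federated = principal.get("Federated", "") if isinstance(principal, dict) else ""
        # Check 3: the provider ARN must end with the cluster's issuer host/path.
        if not isinstance(federated, str) or not federated.endswith(":oidc-provider/" + issuer):
            problems.append(f"Federated principal {federated!r} does not match issuer {issuer!r}")
        conditions = stmt.get("Condition", {}).get("StringEquals", {})
        # Check 1: the sub claim must match namespace and service account exactly.
        sub = conditions.get(issuer + ":sub")
        if sub != expected_sub:
            problems.append(f"sub is {sub!r}, expected {expected_sub!r}")
        # Check 2: the aud condition must be present and point at STS.
        aud = conditions.get(issuer + ":aud")
        if aud != "sts.amazonaws.com":
            problems.append(f"aud is {aud!r}, expected 'sts.amazonaws.com'")
        return problems
    return ["no statement allows sts:AssumeRoleWithWebIdentity"]


ISSUER = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": f"arn:aws:iam::123456789012:oidc-provider/{ISSUER}"},
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {"StringEquals": {
            f"{ISSUER}:sub": "system:serviceaccount:payment:payments-api",  # the typo
            f"{ISSUER}:aud": "sts.amazonaws.com",
        }},
    }],
}

for problem in check_irsa_trust(trust_policy, ISSUER, "payments", "payments-api"):
    print(problem)
```

A script like this catches the exact-match class of bug. The model is still the better tool when the policy is structured in a way the script does not anticipate.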
CNI networking and IP exhaustion
The VPC CNI gives every pod a real VPC IP, which means a busy node can exhaust the ENIs’ IP capacity and new pods stick in ContainerCreating with no obvious cause. Pull the CNI’s view:
# Recent VPC CNI (aws-node) logs, filtered to lines that mention IPs or ENIs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=50 | grep -i "ip\|eni"
# Subnet free IPs for the node's subnet
aws ec2 describe-subnets --subnet-ids subnet-eks-a \
--query 'Subnets[0].AvailableIpAddressCount'
Feed the aws-node logs plus the subnet free-IP count to AI and ask whether the symptom is IP exhaustion or a CNI config issue. It’s good at distinguishing “no IPs available in the subnet” (fix: add a secondary CIDR or enable prefix delegation) from “ENI attach failing on IAM perms” (fix: the node role). It’ll also recommend prefix delegation with the exact env var (ENABLE_PREFIX_DELEGATION=true). Confirm that against the CNI version and instance types you’re running before applying it, because prefix delegation is version-gated and only works on Nitro-based instances.
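It helps to know the node's IP ceiling before reading the logs. The sketch below applies the commonly documented VPC CNI pod-capacity formula: ENIs times usable addresses per ENI, plus two for host-network pods. The `max_pods` helper is my own name, and the m5.large limits (3 ENIs, 10 IPv4 addresses each) are used only as an example. The prefix-delegation figure is the raw address count before any max-pods cap that the kubelet or your node configuration applies.

```python
PREFIX_SIZE = 16  # a /28 prefix holds 16 addresses


def max_pods(enis: int, ips_per_eni: int, prefix_delegation: bool = False) -> int:
    """Estimate how many pods a node can address under the VPC CNI."""
    slots = ips_per_eni - 1  # each ENI's primary IP is not handed to pods
    per_eni = slots * PREFIX_SIZE if prefix_delegation else slots
    return enis * per_eni + 2  # +2 for host-network pods such as aws-node and kube-proxy


# Example: m5.large supports 3 ENIs with 10 IPv4 addresses each.
print(max_pods(3, 10))
print(max_pods(3, 10, prefix_delegation=True))
```

If the node is already at its ceiling, the subnet's free-IP count is irrelevant. If it is well under and pods still fail to get addresses, the subnet or the node role is the better suspect.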
Scheduling: read the describe, not your assumptions
For a Pending pod, the describe events tell the whole story, and AI is excellent at translating cryptic scheduler messages.
kubectl describe pod payments-api-xxxx -n payments | grep -A10 Events
0/6 nodes are available: 3 Insufficient cpu, 2 node(s) had untolerated taint {dedicated: gpu}, 1 node(s) didn't match Pod's node affinity/selector.
Paste that and ask the model to break down each reason and what to change. It’ll tell you that you need either smaller CPU requests, a toleration for the GPU taint, or relaxed node affinity — and crucially it’ll tell you which is appropriate for your manifest if you paste that too. The judgment call (do I really want this pod on the GPU nodes?) stays with you.
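The scheduler message is also structured enough to split by machine, which is handy when you are comparing several Pending pods at once. This is a rough sketch that assumes the single-sentence form shown above. The `parse_scheduler_message` name is hypothetical, and messages with extra trailing clauses (some Kubernetes versions append preemption details) would need more handling.

```python
import re

HEADER = re.compile(r"^(\d+)/(\d+) nodes are available: (.+?)\.?$")


def parse_scheduler_message(message: str) -> dict:
    """Split a FailedScheduling message into per-reason node counts."""
    match = HEADER.match(message.strip())
    if match is None:
        raise ValueError("unrecognized scheduler message")
    reasons = {}
    # Reasons are comma-separated and each one starts with a node count.
    for part in re.split(r",\s+(?=\d+\s)", match.group(3)):
        count, reason = part.split(" ", 1)
        reasons[reason] = reasons.get(reason, 0) + int(count)
    return {
        "available": int(match.group(1)),
        "total": int(match.group(2)),
        "reasons": reasons,
    }


msg = ("0/6 nodes are available: 3 Insufficient cpu, "
       "2 node(s) had untolerated taint {dedicated: gpu}, "
       "1 node(s) didn't match Pod's node affinity/selector.")
for reason, nodes in parse_scheduler_message(msg)["reasons"].items():
    print(nodes, reason)
```

The per-reason counts show how much of the fleet each constraint rules out. That is usually the first thing to weigh when choosing between smaller requests, a toleration, or looser affinity.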
The control line
EKS troubleshooting with AI works because the model can reason across the Kubernetes/AWS boundary that trips humans up, and because so many of these bugs are a single wrong character in a config you’re staring at. But it can’t see your cluster state — it sees what you paste. So the loop is: classify the symptom, gather the right evidence for that class, let AI locate the broken layer, then prove the fix with a live kubectl exec or get-caller-identity before you call it done.
For the IAM and OIDC pieces specifically, the same trust-policy rigor shows up in writing least-privilege IAM policies with AI, and I keep my EKS triage prompts alongside the rest in the prompts collection.