Troubleshooting GKE With AI: Workload Identity and Networking
GKE failures hide across Kubernetes, GCP IAM, and VPC layers at once. Here's how I use AI to untangle Workload Identity errors and pod networking on Google Kubernetes Engine.
- #gcp
- #ai
- #gke
- #kubernetes
- #networking
A pod was crash-looping with a 403 PermissionDenied from Cloud Storage, and the team’s first instinct was “the bucket IAM is wrong.” It wasn’t. The bucket binding was fine. The Kubernetes service account wasn’t annotated to map to the Google service account, so the pod was authenticating as the node’s default identity, which had no access to that bucket at all. This is the defining pain of GKE: a single failure spans three systems — Kubernetes RBAC, GCP IAM via Workload Identity, and the VPC-native network — and the error message only ever shows you one layer. AI is good here precisely because it can hold all three layers at once if you give it the evidence from each.
Workload Identity: the four things that must line up
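Workload Identity errors are almost always a broken link in a chain of four things: the pod runs as the KSA, the KSA annotation points at the GSA, the GSA lets that KSA impersonate it, and the GSA has access to the resource. I collect the evidence for the last three and let the model check the chain rather than eyeballing each annotation.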
# 1. The KSA and its annotation
kubectl get serviceaccount app-ksa -n production -o yaml
# 2. The GSA's IAM policy (who can impersonate it)
gcloud iam service-accounts get-iam-policy app-gsa@my-proj.iam.gserviceaccount.com
# 3. The bucket / resource binding
gcloud storage buckets get-iam-policy gs://app-data --format=json
Prompt: “Here are three things from a GKE Workload Identity setup: a Kubernetes ServiceAccount YAML, the GCP service account IAM policy, and a bucket IAM policy. Verify the full chain: the KSA annotation iam.gke.io/gcp-service-account points to the right GSA, the GSA grants roles/iam.workloadIdentityUser to the member serviceAccount:my-proj.svc.id.goog[production/app-ksa], and the GSA has access to the bucket. Tell me exactly which link is broken and the precise format the member string should have.”
The member string format — PROJECT.svc.id.goog[NAMESPACE/KSA_NAME] — is where people slip: wrong namespace, missing brackets, project number instead of project ID. The model catches the exact string mismatch instantly, which is faster than me diffing two near-identical lines at 11pm.
# The binding that's usually missing
gcloud iam service-accounts add-iam-policy-binding \
app-gsa@my-proj.iam.gserviceaccount.com \
--role="roles/iam.workloadIdentityUser" \
--member="serviceAccount:my-proj.svc.id.goog[production/app-ksa]"
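The format rules for the member string can also be checked mechanically before the binding is applied. This is a minimal sketch in plain POSIX shell, not a gcloud feature: wi_member and check_wi_member are hypothetical helper names, and the checks only cover format slips (prefix, brackets, a numeric project). A wrong but well-formed namespace still has to be caught by comparing against the cluster.

```shell
# Build the member string Workload Identity expects.
# Arguments: project ID (not the project number), namespace, KSA name.
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Report common format slips in a candidate member string.
# Prints "ok" or a short reason. Pure string check, no API calls.
check_wi_member() {
  candidate=$1
  case $candidate in
    serviceAccount:*) ;;
    *) echo "missing serviceAccount: prefix"; return 1 ;;
  esac
  rest=${candidate#serviceAccount:}
  case $rest in
    *.svc.id.goog"["*/*"]") ;;
    *) echo "expected PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]"; return 1 ;;
  esac
  # Everything before .svc.id.goog is the project part.
  project=${rest%%.svc.id.goog*}
  case $project in
    '') echo "missing project ID"; return 1 ;;
    *[!0-9]*) ;;  # contains a non-digit, so it is not a project number
    *) echo "project number used instead of project ID"; return 1 ;;
  esac
  echo "ok"
}
```

For example, check_wi_member "$(wi_member my-proj production app-ksa)" prints ok, while a string whose project part is all digits is flagged as a project number.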
I verify the impersonation actually works from inside a pod before declaring victory:
kubectl run -it --rm wi-test --image=google/cloud-sdk:slim \
--overrides='{"spec":{"serviceAccountName":"app-ksa"}}' \
-n production -- gcloud auth list
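If that lists the GSA as the active account, the KSA-to-GSA half of the chain is whole; access to the bucket still depends on the resource binding. AI tells me where to look; the cluster tells me if I’m right.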
Pod networking: it’s usually IP exhaustion or a NetworkPolicy
VPC-native GKE assigns every pod a real VPC IP from a secondary range. When pods won’t schedule or can’t talk to each other, the cause is often a range that’s run dry or a NetworkPolicy that’s quietly dropping traffic. Gather the facts and let AI do the arithmetic:
gcloud container clusters describe prod-cluster --region=us-central1 \
--format="yaml(ipAllocationPolicy)"
kubectl get networkpolicy -A
Prompt: “This GKE cluster uses these secondary ranges for pods and services (pasted). The cluster has 40 nodes with a max of 110 pods per node. Calculate whether the pod secondary range can accommodate the worst case, accounting for GKE reserving a /24-equivalent block per node. If it’s too small, tell me the minimum CIDR size I need.”
GKE allocates a fixed block of IPs per node regardless of how many pods actually run, so a range that looks huge runs out far sooner than naive math suggests. Having AI do that calculation explicitly — with the per-node reservation baked in — has saved me from cluster expansions that would have silently failed.
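The same arithmetic is easy to run locally as a cross-check on the model's answer. This is a rough sketch in POSIX shell under one assumption: that the per-node block is the smallest power of two at least twice the max pods per node, which matches the /24 at 110 pods mentioned in the prompt. The function names are hypothetical, and the sketch ignores extra nodes created during surge upgrades.

```shell
# IPs reserved per node: the smallest power of two that is at least
# twice the max pods per node (110 -> 256 addresses, i.e. a /24).
per_node_ips() {
  need=$(( $1 * 2 ))
  size=1
  while [ "$size" -lt "$need" ]; do
    size=$(( size * 2 ))
  done
  echo "$size"
}

# Longest prefix (smallest range) whose address count covers
# node count * per-node block. Arguments: nodes, max pods per node.
min_pod_prefix() {
  total=$(( $1 * $(per_node_ips "$2") ))
  prefix=32
  capacity=1
  while [ "$capacity" -lt "$total" ]; do
    capacity=$(( capacity * 2 ))
    prefix=$(( prefix - 1 ))
  done
  echo "/$prefix"
}

min_pod_prefix 40 110   # prints /18
```

For the 40-node case in the prompt, that is 40 × 256 = 10,240 addresses, so a /19 (8,192 addresses) is too small and a /18 (16,384) is the minimum.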
For NetworkPolicy debugging, I paste the policies plus the symptom:
Prompt: “Here are the NetworkPolicies in namespace production. A pod with label app=frontend cannot reach a pod with label app=api on port 8080. NetworkPolicies are deny-by-default once any policy selects a pod. Tell me which policy is selecting the api pod, whether an ingress rule permits the frontend, and the minimal policy edit to allow it.”
The “deny-by-default once selected” rule is the part everyone forgets: the first NetworkPolicy that selects a pod silently blocks all traffic to or from it, in the directions that policy covers, unless a rule explicitly allows it. The model reasons about selector overlap reliably, which is tedious to do by hand across a dozen policies.
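The selector check at the heart of that reasoning is simple enough to sketch, which helps when sanity-checking what the model says about which policy selects which pod. This is an illustrative POSIX shell helper, not how Kubernetes implements it: selector_matches is a hypothetical name, labels are passed as comma-separated key=value lists, and only matchLabels-style selectors are covered (no matchExpressions).

```shell
# Return 0 if every key=value in the selector appears in the pod's labels.
# $1: pod labels, $2: selector labels, both as comma-separated key=value.
# An empty selector matches every pod, mirroring an empty podSelector.
selector_matches() {
  pod_labels=",$1,"
  old_ifs=$IFS
  IFS=,
  for pair in $2; do
    case $pod_labels in
      *",$pair,"*) ;;                    # this selector label is present
      *) IFS=$old_ifs; return 1 ;;       # one missing label means no match
    esac
  done
  IFS=$old_ifs
  return 0
}

# The api pod is selected by a policy whose podSelector is app=api.
selector_matches "app=api,tier=backend" "app=api" && echo "selected"
```

A pod that matches at least one policy's selector is the one that falls under deny-by-default, so this is the first question to answer for the api pod.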
Node and scheduling failures
When pods sit Pending, the events have the answer but it’s buried. I pipe it straight in:
# Events are printed at the end of the describe output, so keep the tail
kubectl describe pod stuck-pod -n production | tail -n 40
Prompt: “This is kubectl describe output for a Pending GKE pod. Read the Events. Tell me in one line whether this is insufficient resources, a taint/toleration mismatch, a node selector issue, or a PVC binding problem, and the single command to confirm.”
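The four buckets in that prompt can be roughed out with plain pattern matching, which gives a quick first guess to compare against the model's answer. This is a sketch under assumptions: classify_pending is a hypothetical helper, the patterns are based on typical scheduler event wording that varies across Kubernetes versions, and when a message lists several reasons only the first matching bucket is reported.

```shell
# Map a FailedScheduling event message to one of four buckets.
# $1: the event message text. First matching pattern wins.
classify_pending() {
  case $1 in
    *Insufficient*)
      echo "insufficient resources" ;;
    *taint*)
      echo "taint/toleration mismatch" ;;
    *"node affinity/selector"*|*"node selector"*)
      echo "node selector issue" ;;
    *PersistentVolumeClaim*)
      echo "PVC binding problem" ;;
    *)
      echo "unclassified" ;;
  esac
}

# Illustrative message, not captured from a real cluster.
classify_pending "0/3 nodes are available: 3 Insufficient cpu."
```

Anything that comes back unclassified is a good candidate for pasting into the prompt in full.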
Where I keep control
GKE troubleshooting with AI works because every layer produces structured evidence — YAML, IAM policy JSON, describe output — that a model can correlate far faster than I can tab between three consoles. But the model can’t run kubectl auth can-i, can’t see the live data plane, and can’t know your security intent. So the loop is: collect evidence from all three layers, let AI tell me which link is broken, then confirm against the live cluster before I change anything. I never apply an RBAC or IAM change on the model’s say-so alone.
The reusable versions of these prompts live in my prompts collection, and the broader GCP with AI series covers the IAM and VPC pieces a GKE incident usually drags in. The cluster will tell you the truth eventually — AI just helps you ask the right three layers at once.