Troubleshooting GKE With AI: Workload Identity and Networking

A pod was crash-looping with a 403 PermissionDenied from Cloud Storage, and the team’s first instinct was “the bucket IAM is wrong.” It wasn’t. The bucket binding was fine. The Kubernetes service account wasn’t annotated to map to the Google service account, so the pod was authenticating as the node’s default identity, which had no access to that bucket at all. This is the defining pain of GKE: a single failure spans three systems — Kubernetes RBAC, GCP IAM via Workload Identity, and the VPC-native network — and the error message only ever shows you one layer. AI is good here precisely because it can hold all three layers at once if you give it the evidence from each.

Workload Identity: the four things that must line up

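Workload Identity errors are almost always a broken link in a chain of four: the pod runs as the intended KSA, the KSA is annotated with the GSA, the GSA lets that KSA impersonate it, and the GSA holds a role on the target resource. The first link is just the pod spec's serviceAccountName; the commands below collect the evidence for the other three. I let the model check the chain rather than eyeballing each annotation.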

# 1. The KSA and its annotation
kubectl get serviceaccount app-ksa -n production -o yaml

# 2. The GSA's IAM policy (who can impersonate it)
gcloud iam service-accounts get-iam-policy app-gsa@my-proj.iam.gserviceaccount.com

# 3. The bucket / resource binding
gcloud storage buckets get-iam-policy gs://app-data --format=json

Prompt: “Here are three things from a GKE Workload Identity setup: a Kubernetes ServiceAccount YAML, the GCP service account IAM policy, and a bucket IAM policy. Verify the full chain: KSA annotation iam.gke.io/gcp-service-account points to the right GSA, the GSA grants roles/iam.workloadIdentityUser to the member serviceAccount:my-proj.svc.id.goog[production/app-ksa], and the GSA has access to the bucket. Tell me exactly which link is broken and the precise format the member string should be.”

The member string format — PROJECT.svc.id.goog[NAMESPACE/KSA_NAME] — is where people slip: wrong namespace, missing brackets, project number instead of project ID. The model catches the exact string mismatch instantly, which is faster than me diffing two near-identical lines at 11pm.
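A quick local check can catch the same slips before the policy even goes into a prompt. The sketch below is a minimal one in POSIX shell; `wi_member` and `wi_member_check` are hypothetical helper names, not part of gcloud or kubectl, and the check only covers the format mistakes listed above (it does not confirm that the project, namespace, or KSA exist).

```shell
# Hypothetical helpers (not part of gcloud or kubectl).

# Build the member string for a given project ID, namespace, and KSA name.
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Check a candidate member string for the common format mistakes.
# Prints "ok" or a short reason; returns non-zero when something looks wrong.
wi_member_check() {
  wi_candidate="$1"
  case "$wi_candidate" in
    serviceAccount:?*.svc.id.goog\[?*/?*\]) ;;
    *)
      echo "malformed: expected serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]"
      return 1
      ;;
  esac
  # Isolate the PROJECT part between the prefix and ".svc.id.goog".
  wi_rest="${wi_candidate#serviceAccount:}"
  wi_project="${wi_rest%%.svc.id.goog*}"
  case "$wi_project" in
    *[!0-9]*) echo "ok" ;;
    *)
      echo "project looks like a project number, not a project ID"
      return 1
      ;;
  esac
}

wi_member my-proj production app-ksa
```

Comparing the output of `wi_member` against the members listed in the GSA's IAM policy is then a plain string match.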

# The binding that's usually missing
gcloud iam service-accounts add-iam-policy-binding \
  app-gsa@my-proj.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:my-proj.svc.id.goog[production/app-ksa]"

I verify the impersonation actually works from inside a pod before declaring victory:

kubectl run -it --rm wi-test --image=google/cloud-sdk:slim \
  --overrides='{"spec":{"serviceAccountName":"app-ksa"}}' \
  -n production -- gcloud auth list

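If that lists the GSA as the active account, the KSA-to-GSA links are whole; the bucket binding still needs its own check against the resource. AI tells me where to look; the cluster tells me if I’m right.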

Pod networking: it’s usually IP exhaustion or a NetworkPolicy

VPC-native GKE assigns every pod a real VPC IP from a secondary range. When pods won’t schedule or can’t talk to each other, the cause is often a range that’s run dry or a NetworkPolicy that’s quietly dropping traffic. Gather the facts and let AI do the arithmetic:

gcloud container clusters describe prod-cluster --region=us-central1 \
  --format="yaml(ipAllocationPolicy)"
kubectl get networkpolicy -A

Prompt: “This GKE cluster uses these secondary ranges for pods and services (pasted). The cluster has 40 nodes with a max of 110 pods per node. Calculate whether the pod secondary range can accommodate the worst case, accounting for GKE reserving a /24-equivalent block per node. If it’s too small, tell me the minimum CIDR size I need.”

GKE allocates a fixed block of IPs per node regardless of how many pods actually run, so a range that looks huge runs out far sooner than naive math suggests. Having AI do that calculation explicitly — with the per-node reservation baked in — has saved me from cluster expansions that would have silently failed.
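The arithmetic is small enough to sanity-check by hand. The sketch below assumes the per-node block is the smallest power of two that holds at least twice the max pods per node, which is what yields a /24 for 110 pods. The function names are made up for illustration, and the result ignores headroom for surge upgrades and autoscaling.

```shell
# Hypothetical helpers for a back-of-the-envelope pod range check.

# Smallest power of two greater than or equal to $1.
next_pow2() {
  p=1
  while [ "$p" -lt "$1" ]; do p=$((p * 2)); done
  echo "$p"
}

# IPs set aside per node: at least 2x max pods, rounded up to a power of two.
per_node_block() {
  next_pow2 $(( $1 * 2 ))
}

# Smallest prefix length whose range holds nodes * per-node block.
min_pod_prefix() {
  nodes="$1"
  max_pods="$2"
  needed=$(( nodes * $(per_node_block "$max_pods") ))
  size=$(next_pow2 "$needed")
  prefix=32
  while [ "$size" -gt 1 ]; do
    size=$((size / 2))
    prefix=$((prefix - 1))
  done
  echo "$prefix"
}

min_pod_prefix 40 110   # prints 18
```

For the 40-node example, each node takes 256 addresses, so the cluster needs 10,240. A /19 holds only 8,192, so the pod range must be at least a /18.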

For NetworkPolicy debugging, I paste the policies plus the symptom:

Prompt: “Here are the NetworkPolicies in namespace production. A pod with label app=frontend cannot reach a pod with label app=api on port 8080. NetworkPolicies are deny-by-default once any policy selects a pod. Tell me which policy is selecting the api pod, whether an ingress rule permits the frontend, and the minimal policy edit to allow it.”

The “deny-by-default once selected” rule is the part everyone forgets: the first NetworkPolicy that selects a pod silently blocks all traffic to or from it, in the directions that policy covers, unless a rule explicitly allows it. Pods that no policy selects stay open. The model reasons about selector overlap reliably, which is tedious to do by hand across a dozen policies.
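For the frontend-to-api symptom in that prompt, the minimal fix usually looks something like the policy below. This is a sketch under assumptions: the policy name and file name are invented, both pods are assumed to live in the production namespace, and the labels and port are taken from the prompt. If an egress policy also selects the frontend pod, that side needs its own allow rule.

```shell
# Write an illustrative ingress policy to a file for review.
# Name and file name are hypothetical; labels and port come from the symptom.
cat > allow-frontend-to-api.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  # Selects the api pods, i.e. the destination of the blocked traffic.
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Same-namespace pods labelled app=frontend.
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF
```

Running `kubectl apply --dry-run=server -f allow-frontend-to-api.yaml` validates the file against the cluster without changing anything, which fits the confirm-before-change loop.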

Node and scheduling failures

When pods sit Pending, the events have the answer but it’s buried. I pipe it straight in:

# Events are printed last in describe output, so keep the tail, not the head
kubectl describe pod stuck-pod -n production | tail -n 60

Prompt: “This is kubectl describe output for a Pending GKE pod. Read the Events. Tell me in one line whether this is insufficient resources, a taint/toleration mismatch, a node selector issue, or a PVC binding problem, and the single command to confirm.”
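The same four-way triage can be roughed out locally as a cross-check on the model's answer. The sketch below matches typical scheduler message fragments; `classify_pending` is a hypothetical helper, the exact wording varies between Kubernetes versions, and when an event lists several reasons only the first match is reported.

```shell
# Hypothetical helper: classify scheduler event text read from stdin.
# Patterns are typical FailedScheduling fragments; wording varies by version.
classify_pending() {
  events=$(cat)
  case "$events" in
    *Insufficient*)
      echo "insufficient resources" ;;
    *"untolerated taint"*|*"didn't tolerate"*)
      echo "taint/toleration mismatch" ;;
    *"node affinity/selector"*|*"didn't match node selector"*)
      echo "node selector issue" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "PVC binding problem" ;;
    *)
      echo "unknown" ;;
  esac
}
```

A usage example, feeding it only the pod's events so the spec sections cannot cause false matches: `kubectl get events -n production --field-selector involvedObject.name=stuck-pod | classify_pending`.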

Where I keep control

GKE troubleshooting with AI works because every layer produces structured evidence — YAML, IAM policy JSON, describe output — that a model can correlate far faster than I can tab between three consoles. But the model can’t run kubectl auth can-i, can’t see the live data plane, and can’t know your security intent. So the loop is: collect evidence from all three layers, let AI tell me which link is broken, then confirm against the live cluster before I change anything. I never apply an RBAC or IAM change on the model’s say-so alone.

The reusable versions of these prompts live in my prompts collection, and the broader GCP with AI series covers the IAM and VPC pieces a GKE incident usually drags in. The cluster will tell you the truth eventually; AI just helps you question all three layers at once.
