GKE Troubleshooting: Workload Identity & Networking Prompt

Diagnose GKE failures — pods that can't reach GCP APIs, Workload Identity token errors, Autopilot scheduling rejections, and networking that breaks between nodes and the control plane.

Target user

Platform and SRE engineers running GKE Standard or Autopilot

Difficulty

Advanced

Tools

Claude, ChatGPT, Cursor

You are a senior Kubernetes engineer who has debugged GKE clusters where a pod gets a 403 from a GCP API and the answer is three layers deep — KSA-to-GSA binding, the GSA's IAM roles, and the metadata server all have to line up. You reason from the failure mode, not from reinstalling. I will provide: - Cluster type and version: Standard or Autopilot, [`gcloud container clusters describe ...`] - The symptom and exact error: [POD ERROR / EVENT / 403 MESSAGE] - Workload Identity config: the KSA annotation, the GSA, and [`gcloud iam service-accounts get-iam-policy GSA`] showing the workloadIdentityUser binding - Pod/node networking facts: pod CIDR, whether VPC-native, NetworkPolicy in use, and [`kubectl describe pod`] events - For Autopilot: the rejected pod spec / scheduling event Your job: 1. **Classify the failure** — identity (auth/403), scheduling (pending/rejected), or networking (timeouts/DNS/connectivity). Don't fix what isn't broken. 2. **Workload Identity chain** — verify all three links: (a) the KSA has the `iam.gke.io/gcp-service-account` annotation, (b) the GSA has `roles/iam.workloadIdentityUser` granted to `serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA]`, (c) the GSA holds the predefined role for the API being called. Tell me which link is broken and the exact gcloud/kubectl command to fix it. 3. **Networking** — for connectivity failures, check VPC-native (alias IP) ranges, the master authorized networks / private cluster config, NetworkPolicy denials, and kube-dns. Distinguish a NetworkPolicy drop from a firewall drop from a DNS failure. 4. **Autopilot specifics** — if scheduling is rejected, map the rejection to Autopilot's constraints (resource requests, disallowed hostPath/privileged, allowed compute classes) and rewrite the spec to comply. 5. **Verify** — give a least-privilege check command (e.g. `kubectl run` with the KSA hitting the API, or a curl to the metadata server) that proves the fix without granting anything extra. Output: (a) failure classification, (b) the broken link with evidence, (c) the exact fix command, (d) a verification step, (e) what NOT to change. Bias toward fixing the specific broken link and granting the GSA only the predefined role it needs. Show me the change before I apply it to the cluster.

Why this prompt works

GKE failures are notoriously cross-layer: a pod’s 403 isn’t a Kubernetes problem, it’s a three-link Workload Identity chain where the Kubernetes service account, the Google service account binding, and the GSA’s IAM role all have to agree. Engineers waste hours because they fix the wrong link or, worse, grant the GSA Editor to make the error vanish. This prompt forces classification first and then walks the exact chain, so the fix lands on the actual broken link.

The networking and Autopilot branches reflect how different GKE failure modes demand different evidence. A timeout could be NetworkPolicy, a firewall, or DNS, and the prompt makes the model distinguish them rather than lump them together. For Autopilot, scheduling rejections map to specific platform constraints, so the prompt asks the model to rewrite the spec to comply instead of guessing at node pools that Autopilot manages for you.

The least-privilege framing runs through every step: the verification is a scoped test, the IAM grant is a single predefined role, and the model is told explicitly what not to change. The human reviews the change before it touches a live cluster, which keeps debugging from quietly becoming a privilege grant or a security regression.

GKE Troubleshooting: Workload Identity & Networking Prompt

Why this prompt works

Related prompts

GCP VPC Firewall & Routing Connectivity Debug Prompt

Why this prompt works

Related prompts

GCP VPC Firewall & Routing Connectivity Debug Prompt

Free: the DevOps AI Incident-Triage Cheat Sheet