EndpointSlice & Service Discovery Debug Prompt

Debug Services that route to no pods or stale pods — empty EndpointSlices, failing readiness gates, selector mismatches, and headless/StatefulSet DNS resolution.

Target user: Engineers debugging why a Service isn't routing to healthy pods
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Kubernetes networking engineer who has debugged every reason a Service silently routes to nothing.

I will provide:
- The `Service` spec (selector, ports, type, `publishNotReadyAddresses`)
- The target pods (labels, readiness probe, status)
- Output of `kubectl get endpointslices -l kubernetes.io/service-name=<svc>` and `kubectl describe svc`
- Symptoms (connection refused, intermittent 503s, DNS NXDOMAIN, traffic to dead pods)
- Whether it's a normal, headless, or StatefulSet-backed Service

Diagnose in this order:

1. **Selector vs labels** — confirm that every key/value pair in the Service selector appears in the pod labels (a single typo, or an extra selector key the pods lack, matches nothing; extra labels on the pod are harmless). Show the `kubectl get pods -l <selector>` command that proves which pods the Service claims.

2. **EndpointSlice contents** — read the EndpointSlice: are addresses present? Check the `conditions` (ready / serving / terminating) per endpoint. Teach me to distinguish "no pods matched" from "pods matched but not Ready".

3. **Readiness probes and gates** — if endpoints exist but show `ready: false`, trace the readiness probe and any `readinessGates` conditions in the pod spec; explain how a failing probe pulls a pod out of rotation and when `publishNotReadyAddresses` is appropriate.

4. **Port mapping** — verify `targetPort` resolves to a real containerPort or named port; named-port mismatches produce empty or wrong endpoints.

5. **Headless / StatefulSet DNS** — for `clusterIP: None`, explain per-pod A records (`<pod-name>.<service>.<namespace>.svc.cluster.local`, e.g. `web-0.web.default.svc.cluster.local`), why a not-Ready pod is absent from DNS, and the `publishNotReadyAddresses` trade-off for clustered apps that need peers during startup.

6. **kube-proxy / dataplane** — when EndpointSlices look correct but traffic still fails: stale conntrack entries, the kube-proxy mode in use (iptables vs IPVS), or a CNI issue. Give the commands to confirm the dataplane programmed the endpoints.

7. **Topology routing** — if using `trafficDistribution`/topology-aware hints, explain how zone routing can starve a Service of endpoints in one zone.

Output as: (a) one-sentence root cause, (b) the ordered diagnostic commands, (c) the corrected manifest or label fix, (d) a verification (curl the ClusterIP, dig the DNS name), (e) one alert that would have caught zero-ready-endpoints sooner.

Be precise about which layer (selector, readiness, DNS, dataplane) each symptom points to.
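
Outside the prompt itself, the distinction that steps 1 to 3 ask for ("no pods matched" versus "pods matched but not Ready") can be sketched in a few lines of Python. This is a minimal illustration, not part of any Kubernetes client: `selector_matches` and `classify` are hypothetical helper names, and the inputs are assumed to be simplified dicts taken from `kubectl get pods -o json` and `kubectl get endpointslices -o json` (pod `labels`, and the slice's `endpoints[]` entries with their `conditions`).

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Equality-based match: every selector pair must be present in the pod's labels.

    Extra labels on the pod are ignored; an extra selector key is not.
    """
    return all(labels.get(key) == value for key, value in selector.items())


def classify(selector: Dict[str, str], pods: List[dict], endpoints: List[dict]) -> str:
    """Return the layer to investigate first (hypothetical names, not a Kubernetes API)."""
    if not selector:
        # A Service without a selector gets no automatically managed EndpointSlices.
        return "no-selector"
    matched = [p for p in pods if selector_matches(selector, p.get("labels", {}))]
    if not matched:
        return "selector"            # step 1: no pods matched
    if not endpoints:
        return "endpoints-missing"   # pods matched, but the slice lists nothing
    # An unset `ready` condition is treated as ready, as the EndpointSlice API describes.
    ready = [e for e in endpoints if (e.get("conditions") or {}).get("ready") is not False]
    if not ready:
        return "readiness"           # steps 2-3: pods matched but none are Ready
    return "dataplane"               # endpoints look healthy; continue with steps 6-7


pods = [{"labels": {"app": "web", "tier": "frontend"}}]
not_ready = [{"addresses": ["10.0.0.5"], "conditions": {"ready": False}}]
print(classify({"app": "wbe"}, pods, []))         # prints "selector" (typo in the selector)
print(classify({"app": "web"}, pods, not_ready))  # prints "readiness"
```

The result names are only labels for the layers the prompt already lists; the `endpoints-missing` case deliberately stays vague because several causes (pods without an IP, port resolution, controller lag) would need to be checked against the real cluster.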
