Debugging Kubernetes Service Connectivity With an AI Copilot

“Connection refused” is the most ambiguous error in Kubernetes. Much the same symptom, a refused or hung connection, comes back whether the Service has no endpoints, the pod isn’t ready, a NetworkPolicy is dropping the packet, DNS resolved to the wrong place, or the app simply isn’t listening on the port you think. Each cause has a different fix, and the only way to tell them apart is to walk the request path layer by layer, which is exactly the methodical, tedious work I now hand to an AI copilot.

The model is a fast junior engineer who’s debugged this path a thousand times. It knows the checklist cold and reads kubectl output faster than I do. It stays read-only: it tells me which command to run next and interprets the output, while I’m the one running commands and making any change to the cluster.

Walk the path, top to bottom

The request path inside a cluster is roughly: DNS → Service → Endpoints → Pod → container port → app listening. A break at any layer looks much the same from the client, so I gather one piece of evidence per layer and let the model assemble the picture. The single most diagnostic check is the Service’s endpoints:

kubectl get endpoints my-svc -n prod

If that’s empty, the Service has no backends and nothing downstream matters. I paste it with context:

The Service my-svc returns connection refused. Its endpoints list is empty. Here’s the Service spec and the pod labels. Why are there no endpoints?

An empty endpoints list almost always means the Service selector doesn’t match any ready pod’s labels — and the model spots a label typo instantly.

Selector mismatch is the number one cause

Most “no endpoints” cases are a selector that doesn’t match. I give the model both sides:

kubectl get svc my-svc -n prod -o yaml | grep -A3 selector
kubectl get pods -n prod --show-labels

The Service selector is app=payments,tier=backend. Here are my pod labels. Which pods should match, and do any actually match?

The model does the set-matching that’s easy to get wrong by eye — a Service selecting tier: backend when the pods are labeled tier: api produces zero endpoints and a very confusing afternoon. It’s a trivial diff for the model and a classic human blind spot.
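The matching rule itself is small enough to sketch. The Python below is a minimal illustration, assuming the equality-based selectors that Services use; selector_matches and explain_mismatch are hypothetical helpers (not part of any Kubernetes client), and the pod names and labels are made-up sample data.

```python
def selector_matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True when every key/value pair in the selector appears in the pod's labels.

    Assumes a non-empty selector: a Service without a selector gets no
    automatically managed endpoints at all.
    """
    return all(labels.get(key) == value for key, value in selector.items())


def explain_mismatch(selector: dict[str, str], labels: dict[str, str]) -> dict:
    """Map each failing selector key to (expected value, actual value or None)."""
    return {
        key: (want, labels.get(key))
        for key, want in selector.items()
        if labels.get(key) != want
    }


# Hypothetical sample data standing in for the two kubectl outputs.
selector = {"app": "payments", "tier": "backend"}
pods = {
    "payments-7d9f-abcde": {"app": "payments", "tier": "api"},
    "payments-7d9f-fghij": {
        "app": "payments",
        "tier": "backend",
        "pod-template-hash": "7d9f",  # extra labels do not prevent a match
    },
}

matching = [name for name, labels in pods.items() if selector_matches(selector, labels)]
```

Extra labels on a pod never block a match; only a missing or different value for a selected key does, which is why the tier: api pod drops out.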

Endpoints exist but it still refuses

If endpoints are present, the break is lower: the pod isn’t ready, the container isn’t listening, or the port mapping is wrong. A failing readiness probe is the usual culprit, because an unready pod is removed from the endpoints:

kubectl get pods -n prod -o wide
kubectl get endpoints my-svc -n prod -o yaml

I ask the model to cross-check:

These pods are Running but the endpoints subset only lists one of three. Why would a Running pod be excluded from endpoints?

The answer — readiness probe failing, so the pod is Running but not Ready, so it’s not in the endpoints — is something the model reaches for immediately, then it points me at the probe definition to confirm.
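The same cross-check can be sketched against the Endpoints object itself, which keeps ready backends under addresses and unready ones under notReadyAddresses. This is a rough Python illustration, assuming the JSON form of the object (-o json rather than -o yaml); split_endpoints is a hypothetical helper, and the IPs and pod names are invented.

```python
import json


def _backend_name(address: dict) -> str:
    """Prefer the pod name from targetRef, falling back to the IP."""
    return (address.get("targetRef") or {}).get("name") or address.get("ip", "?")


def split_endpoints(endpoints: dict) -> tuple[list[str], list[str]]:
    """Return (ready, not_ready) backend names from an Endpoints object."""
    ready, not_ready = [], []
    for subset in endpoints.get("subsets") or []:
        for address in subset.get("addresses") or []:
            ready.append(_backend_name(address))
        for address in subset.get("notReadyAddresses") or []:
            not_ready.append(_backend_name(address))
    return ready, not_ready


# Invented sample: one Ready pod, two that are Running but failing readiness.
sample = json.loads("""
{
  "subsets": [{
    "addresses": [
      {"ip": "10.0.1.11", "targetRef": {"kind": "Pod", "name": "payments-a"}}
    ],
    "notReadyAddresses": [
      {"ip": "10.0.1.12", "targetRef": {"kind": "Pod", "name": "payments-b"}},
      {"ip": "10.0.1.13", "targetRef": {"kind": "Pod", "name": "payments-c"}}
    ],
    "ports": [{"port": 8080, "protocol": "TCP"}]
  }]
}
""")

ready, not_ready = split_endpoints(sample)
```

Anything that lands in not_ready is a pod the selector found but the readiness probe rejected, so the probe definition is the next thing to read.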

Pro Tip: targetPort is the field that quietly breaks everything. The Service port is what clients hit; targetPort is the container port traffic forwards to. If targetPort is 80 but your app listens on 8080, endpoints look healthy and connections still refuse. Always ask the model to verify targetPort against the container’s actual listening port.
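A manifest-level version of that check might look like the sketch below. It is an assumption-laden illustration: mismatched_ports and resolve_target_port are hypothetical helpers, the port values are sample data, and it only compares targetPort against the containerPort entries declared in the pod spec. Those declarations are informational, so the app’s real listening port still has to be confirmed from inside the container.

```python
def resolve_target_port(service_port: dict, container_ports: list[dict]):
    """Return the numeric port a Service port forwards to.

    targetPort defaults to the Service port when omitted, and may be a
    container port name instead of a number. Returns None when a named
    targetPort is not declared on the container.
    """
    target = service_port.get("targetPort", service_port["port"])
    if isinstance(target, int):
        return target
    for container_port in container_ports:
        if container_port.get("name") == target:
            return container_port["containerPort"]
    return None


def mismatched_ports(service_ports: list[dict], container_ports: list[dict]) -> list[tuple]:
    """List (port, targetPort) pairs that hit no declared containerPort."""
    declared = {cp["containerPort"] for cp in container_ports}
    problems = []
    for service_port in service_ports:
        resolved = resolve_target_port(service_port, container_ports)
        if resolved is None or resolved not in declared:
            problems.append(
                (service_port["port"], service_port.get("targetPort", service_port["port"]))
            )
    return problems


# Sample data: the Service forwards to 80, the container declares 8080.
container_ports = [{"name": "http", "containerPort": 8080}]
broken = mismatched_ports([{"port": 80, "targetPort": 80}], container_ports)
fixed = mismatched_ports([{"port": 80, "targetPort": "http"}], container_ports)
```

Using the port name as targetPort, as in the second call, keeps the Service correct even if the numeric container port changes later.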

Test from inside, the right way

When the YAML looks correct but it still fails, you test from inside the cluster. I spin up a throwaway debug pod and check each hop, then feed the output to the model:

# Run the pod in the Service's namespace so the short name my-svc resolves.
kubectl run nettest -n prod --rm -it --image=nicolaka/netshoot --restart=Never -- \
  bash -c "nslookup my-svc && curl -v http://my-svc:80"

The netshoot image bundles the usual networking tools (nslookup, dig, curl, tcpdump and more), so one pod covers every hop. I paste the full output:

Here’s DNS resolution and a curl from inside the cluster to the Service. DNS resolves to the ClusterIP but curl hangs. What layer is broken?

A hang rather than a refusal after good DNS usually means a NetworkPolicy is dropping the packet, and the model knows to ask for the policies next.
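The distinction between a failed lookup, a refusal, and a hang can be made explicit with a small probe. This is a sketch using only Python’s standard socket module; probe and NEXT_LAYER are hypothetical names, the hints simply restate the rules of thumb in this article, and the three-second timeout is an arbitrary assumption.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt by how it fails."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"  # the name did not resolve at all
    family, socktype, proto, _, sockaddr = infos[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
        except ConnectionRefusedError:
            return "refused"  # something answered and rejected the connection
        except (socket.timeout, TimeoutError):
            return "timeout"  # packets silently dropped somewhere on the path
        except OSError:
            return "unreachable"  # e.g. no route to host
        return "connected"


# Rules of thumb from the article, not guarantees.
NEXT_LAYER = {
    "dns_failure": "DNS: check the Service name and namespace",
    "refused": "endpoints or targetPort: check backends and the listening port",
    "timeout": "NetworkPolicy: dump the policies in the namespace",
    "unreachable": "routing: look at the CNI and node networking",
    "connected": "TCP path works: look at the application itself",
}
```

Run from inside a debug pod, probe("my-svc", 80) would give the same signal as the curl above, in a form that is easy to paste alongside the rest of the evidence.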

NetworkPolicy is the silent dropper

If you run default-deny NetworkPolicies, a missing ingress rule drops traffic with no error at all — just a timeout. I dump the relevant policies:

kubectl get networkpolicy -n prod -o yaml

These NetworkPolicies are in the namespace. The client pod has labels app=frontend. Is there a policy that would block it from reaching pods labeled app=payments on port 8080?

The model traces the podSelector and ingress rules and tells me whether the frontend is allowed. That policy reasoning is fiddly enough by hand that I’m grateful to offload it.
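To show the shape of that reasoning, here is a deliberately simplified Python sketch. It assumes client and target pods share a namespace and only understands matchLabels pod selectors and numeric ports; namespaceSelector, ipBlock, matchExpressions, named ports and protocols are ignored, so treat it as an illustration rather than a faithful evaluator. ingress_allowed and its helpers are hypothetical, and the two policies are sample data.

```python
def labels_match(match_labels: dict, labels: dict) -> bool:
    """An empty selector matches every pod."""
    return all(labels.get(k) == v for k, v in match_labels.items())


def _peer_ok(rule: dict, client_labels: dict) -> bool:
    peers = rule.get("from")
    if not peers:
        return True  # a rule without "from" allows all sources
    for peer in peers:
        # Simplification: only plain same-namespace podSelector peers count.
        if "podSelector" in peer and "namespaceSelector" not in peer and "ipBlock" not in peer:
            selector = peer["podSelector"] or {}
            if labels_match(selector.get("matchLabels") or {}, client_labels):
                return True
    return False


def _port_ok(rule: dict, port: int) -> bool:
    ports = rule.get("ports")
    if not ports:
        return True  # a rule without "ports" allows all ports
    return any(p.get("port") in (None, port) for p in ports)


def ingress_allowed(policies: list[dict], client_labels: dict, target_labels: dict, port: int) -> bool:
    """True if the client may reach the target pod on the given port."""
    applicable = [
        p for p in policies
        if labels_match((p["spec"].get("podSelector") or {}).get("matchLabels") or {}, target_labels)
        and "Ingress" in p["spec"].get("policyTypes", ["Ingress"])
    ]
    if not applicable:
        return True  # no policy selects the target, so it is not isolated
    return any(
        _peer_ok(rule, client_labels) and _port_ok(rule, port)
        for policy in applicable
        for rule in policy["spec"].get("ingress") or []
    )


# Sample policies: a namespace-wide default deny plus one allow rule.
default_deny = {"spec": {"podSelector": {}, "policyTypes": ["Ingress"]}}
allow_frontend = {"spec": {
    "podSelector": {"matchLabels": {"app": "payments"}},
    "policyTypes": ["Ingress"],
    "ingress": [{
        "from": [{"podSelector": {"matchLabels": {"app": "frontend"}}}],
        "ports": [{"protocol": "TCP", "port": 8080}],
    }],
}}
```

Policies are additive: once any policy selects the target pod, traffic passes only if at least one rule in some selecting policy allows it, which is why the default-deny policy alone blocks everything.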

The model diagnoses, you fix

Every command here only inspects the cluster: get, logs, and a throwaway debug pod that removes itself when it exits. The model never runs them and never touches the fix. When the diagnosis lands on “the Service targetPort is wrong,” I edit the manifest and apply it after reviewing the diff. The AI doesn’t get a kubeconfig and doesn’t get to kubectl edit svc on its own. Its job is to compress the layer-by-layer walk from twenty minutes to two and tell me which knob to turn; turning it is mine.

For the live version of this triage, the incident response dashboard wraps the connectivity walk in an auditable flow.

kube-proxy and the cases that aren’t your fault

Most connectivity failures are configuration — a selector, a port, a policy. But a stubborn minority live below your manifests, in kube-proxy and the CNI. When endpoints are correct, the pod is ready, the policy allows it, and traffic still doesn’t flow, the problem may be that kube-proxy isn’t programming the iptables or IPVS rules that route the ClusterIP. I give the model the lower-level evidence:

kubectl get pods -n kube-system -l k8s-app=kube-proxy
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50

Endpoints are healthy and the NetworkPolicy allows the traffic, but the ClusterIP is unreachable from another pod. Here are the kube-proxy pods and logs. Could this be a kube-proxy or CNI issue rather than my config, and how do I confirm it?

The model knows to check whether kube-proxy is even running on the node hosting the client pod, and to suggest testing pod-IP-to-pod-IP directly (bypassing the Service) to isolate whether the break is in routing or in Service programming. Knowing when to stop blaming your own YAML and look at the platform is a real skill, and the model is a useful second opinion on when you’ve crossed that line.
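The isolation test reduces to a two-input decision, sketched below in Python. The function name and the wording of each verdict are my own assumptions; the logic only restates the pod-IP versus ClusterIP comparison described above, with both tests run from the same client pod.

```python
def isolate(pod_ip_ok: bool, cluster_ip_ok: bool) -> str:
    """Interpret a direct pod-IP test alongside a ClusterIP test."""
    if pod_ip_ok and cluster_ip_ok:
        return "path works from this client: retest from the failing client or node"
    if pod_ip_ok and not cluster_ip_ok:
        # Routing between pods is fine, so the Service translation is suspect.
        return "Service programming: check kube-proxy on the client's node"
    if not pod_ip_ok and not cluster_ip_ok:
        # Even the direct path fails, so the Service layer is not the cause.
        return "pod-to-pod routing or the pod itself: check the CNI and the app"
    return "inconsistent result: retest against the same backend pod"
```

The last branch covers the odd case where the ClusterIP works but the direct test fails, which usually means the two tests reached different backend pods.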

Build a reusable connectivity-check prompt

Because this walk is the same every time, I keep it as a saved prompt rather than re-explaining the path in each incident. The prompt lays out the layers — DNS, Service, endpoints, readiness, targetPort, NetworkPolicy, kube-proxy — and asks the model to tell me, given whatever outputs I paste, which layer to investigate next and the single command to confirm it. Turning the methodology into a template means I get consistent, ordered triage even at 3 a.m. when I’m not thinking clearly. The prompt library has connectivity-debugging templates along these lines, and the prompt workspace is where I keep and refine my own. A good saved prompt is worth more than remembering the whole checklist under pressure.
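As a rough idea of what such a template can look like, the Python below assembles a prompt from whatever command outputs are at hand. The wording is illustrative only, not the text of any template in the prompt library; build_prompt is a hypothetical helper and the sample kubectl output is made up.

```python
LAYERS = ["DNS", "Service", "endpoints", "readiness", "targetPort", "NetworkPolicy", "kube-proxy"]


def build_prompt(service: str, namespace: str, outputs: dict[str, str]) -> str:
    """Build a triage prompt from a mapping of command -> pasted output."""
    lines = [
        f"Service {service} in namespace {namespace} is unreachable.",
        "Walk these layers in order: " + " -> ".join(LAYERS) + ".",
        "You are read-only: suggest commands, I run them.",
        "",
    ]
    for command, output in outputs.items():
        # Keep each command next to its output so the model can cite it.
        lines += [f"$ {command}", output.strip(), ""]
    lines.append("Which layer should I investigate next, and what single command confirms it?")
    return "\n".join(lines)


# Made-up sample output for illustration.
prompt = build_prompt(
    "my-svc",
    "prod",
    {"kubectl get endpoints my-svc -n prod": "NAME     ENDPOINTS   AGE\nmy-svc   <none>      3d\n"},
)
```

Keeping the layer order in one constant means the checklist changes in a single place when the walk does.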

Conclusion

“Connection refused” is ambiguous because so many layers fail in nearly the same way. The fix is to walk the path (DNS, Service, endpoints, pod, port, policy), and that methodical checklist is exactly what an AI copilot accelerates. It reads endpoints, catches selector and targetPort mismatches, interprets netshoot output, and reasons about NetworkPolicy drops. Keep it read-only: it tells you the broken layer, you run the commands and own the fix. That turns the most frustrating error in Kubernetes into a fast, structured diagnosis.

For deeper networking, “Troubleshooting Kubernetes DNS and service networking” and “Kubernetes network policies: default deny and beyond” go further than any chat session can.