Kubernetes ProxyTerminatingEndpoints Zero-Drop Rollout Prompt
Diagnose connection drops during rollouts and node drains caused by traffic routed away from terminating pods, and fix them with terminating-endpoint routing and preStop drains.
- Target user: Engineers chasing 5xx spikes during deploys
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes networking engineer who has chased the "we drop a few hundred requests every deploy" bug to its root in endpoint state. I want a systematic diagnosis and fix. I will provide:

- The Service and Deployment specs (ports, readiness/preStop, terminationGracePeriodSeconds)
- How traffic reaches the service (ClusterIP, LoadBalancer with externalTrafficPolicy, ingress)
- The symptom timing relative to rollout/drain

Your job:

1. **Walk the termination sequence**: when a pod enters Terminating, the endpoint controller marks the EndpointSlice entry `ready: false` and `terminating: true`, kube-proxy stops sending NEW traffic, but in-flight connections and the pod's own `preStop`/SIGTERM handling determine whether requests are dropped.
2. **Explain ProxyTerminatingEndpoints**: with externalTrafficPolicy: Local and a node draining, kube-proxy can route to *terminating but still-serving* local endpoints rather than blackholing, preventing drops when the last ready endpoint on a node is going away.
3. **Audit the pod's drain hygiene**: a `preStop` sleep long enough to outlast endpoint propagation, SIGTERM handling that stops accepting new connections but finishes in-flight ones, and `terminationGracePeriodSeconds` that exceeds preStop + drain.
4. **Check the ordering trap**: SIGTERM and EndpointSlice update race; the standard fix is a `preStop` sleep so the pod keeps serving until kube-proxy across all nodes has observed the removal.
5. **Cover externalTrafficPolicy**: Local preserves client IP but blackholes if the only local endpoint terminates without ProxyTerminatingEndpoints; Cluster spreads but SNATs.
6. **Produce the fixed manifests** with preStop, grace period, and any feature/Service field changes, plus a way to reproduce and verify (continuous curl through a rollout).
7. **Mark DESTRUCTIVE** any grace-period or drain change that could leave pods hanging on shutdown.

Output format: the termination timeline, root-cause call-out, fixed YAML, and a verification loop. Do not apply changes — give me the rollout test to run.

---

Service + Deployment:

```yaml
[PASTE]
```

Traffic path: [DESCRIBE]

Symptom timing: [DESCRIBE]
Why this prompt works
“We drop a handful of requests on every deploy” is one of the most under-diagnosed Kubernetes problems because the cause lives in a race nobody watches: the gap between a pod receiving SIGTERM and every node’s kube-proxy learning that the pod’s endpoint is gone. During that window the pod has stopped serving (or been killed) while some proxies still send it traffic. The fix is rarely “add more replicas” — it’s drain hygiene plus, in the externalTrafficPolicy: Local case, routing to terminating-but-serving endpoints instead of blackholing.
This prompt works because it makes the assistant lay out the termination timeline explicitly before proposing anything, which is the only way to see where the drop happens. It ties together the three knobs that actually matter — preStop, terminationGracePeriodSeconds, and SIGTERM handling — and forces them into the correct ordering relationship rather than treating them as independent tunables. The externalTrafficPolicy and ProxyTerminatingEndpoints section catches the subtler single-node-endpoint blackhole that a generic “graceful shutdown” answer would miss entirely.
The verification loop is what turns it from theory into something you can trust: a continuous curl through a rollout either drops connections or it doesn’t. Keep the changes human-applied, since grace-period mistakes can leave pods hanging on shutdown. More service-networking debugging lives in the Kubernetes & Helm guides and the prompt library.
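The fixed manifests the prompt asks for, as an added example that is not part of the original page: the shutdown budget is laid out so that the preStop sleep plus the in-flight drain fits inside terminationGracePeriodSeconds, and the Service keeps externalTrafficPolicy: Local so the ProxyTerminatingEndpoints path matters. The names, image, ports and the 10 s / 15 s / 30 s figures are assumptions; the SIGTERM handling itself lives in the application, not in YAML.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web   # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      # Shutdown budget (assumed figures): preStop sleep 10s + in-flight
      # drain of at most 15s must finish before the 30s grace period ends,
      # otherwise the kubelet SIGKILLs the container mid-request.
      terminationGracePeriodSeconds: 30
      containers:
        - name: web
          image: registry.example/web:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 5
          lifecycle:
            preStop:
              # Pod keeps serving while the EndpointSlice update reaches
              # every node's kube-proxy; SIGTERM is sent only after this
              # hook returns, so the app then drains in-flight requests.
              exec:
                command: ["sh", "-c", "sleep 10"]
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  selector: { app: web }
  # Local preserves the client IP. With ProxyTerminatingEndpoints enabled,
  # kube-proxy routes to a terminating-but-still-serving local pod instead
  # of blackholing when it is the node's last ready endpoint.
  externalTrafficPolicy: Local
  ports:
    - port: 80
      targetPort: 8080
```

Verify by running `kubectl rollout restart deployment/web` while a loop of `curl -s -o /dev/null -w '%{http_code}\n' http://<lb-ip>/` counts non-200 responses; a correct budget produces none across the whole rollout.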
Related prompts
- **EndpointSlice & Service Discovery Debug Prompt**: Debug Services that route to no pods or stale pods — empty EndpointSlices, failing readiness gates, selector mismatches, and headless/StatefulSet DNS resolution.
- **Kubernetes LoadBalancer / NodePort Service Debug Prompt**: Diagnose LoadBalancer service issues — stuck Pending, externalTrafficPolicy: Local pitfalls, source IP preservation, cloud provider quirks, NodePort range collisions.
- **Kubernetes Pod Lifecycle & Graceful Shutdown Prompt**: Design and debug pod lifecycle — preStop hooks, terminationGracePeriodSeconds, SIGTERM handling, connection draining, readiness probe behavior on shutdown.