Kubernetes StatefulSet Debug Prompt
Diagnose StatefulSet issues — ordered deployment stuck, headless Service not resolving, PVC claim template misbehavior, scale-down problems, partition rollouts.
- Target user: Kubernetes platform engineers running stateful workloads (databases, queues, ZooKeeper)
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes engineer who has run StatefulSets in production for databases (PostgreSQL, MongoDB, Cassandra), queues (Kafka, RabbitMQ), and coordination (etcd, ZooKeeper). You know that "ordered" deployment is a sharp tool and that scaling down a StatefulSet does NOT delete PVCs by default.

I will provide:
- The StatefulSet name and what it runs (DB, queue, etc.)
- The symptom (stuck at pod N, headless Service issues, PVC claim template not creating, pod-N-network-identity broken, scale-down stuck)
- `kubectl get sts <name>` and `kubectl describe sts <name>`
- `kubectl get pods -l app=<label>` (ordered numbering should be visible)
- The headless Service: `kubectl get svc <name>-headless -o yaml`
- PVC inventory: `kubectl get pvc -l app=<label>`

Your job:

1. **Verify StatefulSet contract**:
   - **Stable network identity**: `<pod-name>.<headless-svc>.<namespace>.svc.cluster.local`
   - **Ordered creation**: pod 0 must be Ready before pod 1 starts (default policy `OrderedReady`)
   - **Ordered termination**: highest-ordinal pod terminates first
   - **PersistentVolumeClaim templates**: each pod gets its own PVC named `<vct-name>-<sts-name>-<ordinal>`
   - **Headless Service**: `clusterIP: None`; required for stable DNS per pod
2. **For "stuck at pod N"** during scale-up:
   - Pod N-1 is not Ready → check its readiness, init containers, image
   - PVC binding stuck → see [PVC storage troubleshooting](/prompts/kubernetes-pvc-storage-troubleshooting/)
   - `podManagementPolicy: Parallel` could be set if you want concurrent starts (and don't need ordering)
3. **For headless Service issues**:
   - Service `clusterIP: None` confirmed?
   - `kubectl get endpoints <svc>` shows per-pod IPs?
   - DNS resolution from another pod: `kubectl exec ... -- nslookup <pod>.<svc>.<ns>`
   - Pod hostname matches expected (`<sts-name>-<ordinal>`)?
4. **For PVC claim template not creating PVC**:
   - `volumeClaimTemplates:` defines what each pod gets; verify the spec is valid
   - StorageClass exists?
   - The controller creates each PVC just before it creates the pod; with `volumeBindingMode: WaitForFirstConsumer` the PVC stays Pending until the pod is scheduled, so a pod stuck before Running often goes with a Pending PVC
5. **For scale-down issues**:
   - **PVCs persist after pod deletion by default**: this is the desired behavior; scaling down to 0 doesn't lose data
   - **`persistentVolumeClaimRetentionPolicy`** (beta and enabled by default since 1.27, GA in 1.32): `whenScaled` and `whenDeleted` control PVC lifecycle
   - **Scale-down blocked by an unhealthy pod**: under `OrderedReady`, the controller does not scale down while any pod it manages is not Running and Ready. A PodDisruptionBudget (`kubectl get pdb`) blocks evictions such as `kubectl drain`, not the controller's own scale-down
   - **Pod stuck Terminating** with high ordinal → finalizer or graceful termination hang
6. **For partition rollouts** (canary in StatefulSets):
   - `updateStrategy.rollingUpdate.partition: N`: pods with ordinal >= N get the new template; others stay on old
   - Decrease partition over time to roll forward
   - Useful for testing on highest-ordinal pod first
7. **For app-specific data corruption / re-elect needed** (Cassandra/Kafka/etcd):
   - This prompt is K8s-level; app-level requires app expertise
   - But: ensure data PVCs aren't accidentally deleted; check `persistentVolumeClaimRetentionPolicy`
8. **For pod identity changes** after reschedule:
   - Pod name should be stable (`<sts>-N`); IP changes are normal
   - Use the FQDN for client connections, not IPs
   - Headless Service must exist BEFORE StatefulSet creation for DNS to wire correctly

Mark DESTRUCTIVE: deleting PVCs of a StatefulSet (data loss), setting `persistentVolumeClaimRetentionPolicy.whenDeleted: Delete` (data lost on STS delete), force-deleting a pod with `--grace-period=0` (may cause split-brain in clustered apps).

---

StatefulSet workload: [DESCRIBE: DB / queue / coordination]

Symptom: [DESCRIBE]

`kubectl describe sts <name>`:
```
[PASTE]
```

`kubectl get pods -l app=<label> -o wide`:
```
[PASTE]
```

Headless Service: `kubectl get svc <name>-headless -o yaml`:
```yaml
[PASTE]
```

PVCs:
```
[PASTE `kubectl get pvc -l app=<label>`]
```

Update strategy: `kubectl get sts <name> -o yaml | yq '.spec.updateStrategy'`:
```
[PASTE]
```
Why this prompt works
StatefulSet quirks (ordered deployment, stable DNS, PVC retention) trip up engineers familiar with Deployments. This prompt walks the contract and forces a per-component check (headless Service, PVC, ordering).
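The naming half of that contract is mechanical, so it can help to compute the expected names before comparing them with what the cluster reports. The following is a minimal sketch, not part of the original prompt: the function names are hypothetical, the cluster domain is assumed to be `cluster.local`, and ordinals are assumed to start at 0.

```shell
# Hypothetical helpers that derive the names the StatefulSet contract promises.
# Assumptions: cluster domain defaults to cluster.local; ordinals start at 0.

# sts_pod_fqdn <sts-name> <ordinal> <headless-svc> <namespace> [cluster-domain]
sts_pod_fqdn() {
  printf '%s-%s.%s.%s.svc.%s\n' "$1" "$2" "$3" "$4" "${5:-cluster.local}"
}

# sts_pvc_name <volume-claim-template-name> <sts-name> <ordinal>
sts_pvc_name() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# sts_expected_names <sts-name> <replicas> <headless-svc> <namespace> <vct-name>
# Prints one "<pod-fqdn> <pvc-name>" pair per ordinal.
sts_expected_names() {
  i=0
  while [ "$i" -lt "$2" ]; do
    printf '%s %s\n' \
      "$(sts_pod_fqdn "$1" "$i" "$3" "$4")" \
      "$(sts_pvc_name "$5" "$1" "$i")"
    i=$((i + 1))
  done
}
```

For example, `sts_expected_names db 3 db-headless default data` lists the three FQDNs and PVC names to look for in the `kubectl get pods` and `kubectl get pvc` output.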
How to use it
- Verify the headless Service first — without it, DNS is broken and clustered apps can’t bootstrap.
- Check PVCs separately from pods. A PVC stuck in Pending is a common cause of stuck pods.
- Mind the ordering: pod N depends on pod N-1’s readiness.
- For app-level cluster issues (Cassandra ring, Kafka rebalance), pair with app-specific debugging.
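The ordering rule above suggests a quick triage step: find the lowest-ordinal pod that is not fully Ready, since under `OrderedReady` that pod is what holds back everything after it. A rough sketch follows, assuming the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...); the function name is hypothetical.

```shell
# first_blocking_pod (hypothetical helper)
# Reads `kubectl get pods --no-headers` style lines on stdin and prints
# "<pod> <ready> <status>" for the lowest-ordinal pod that is not fully
# Ready. Prints nothing when every pod is Ready and Running.
first_blocking_pod() {
  awk '
    {
      n = split($1, parts, "-")   # ordinal is the last dash-separated field
      ord = parts[n] + 0
      split($2, r, "/")           # READY column looks like "1/1"
      if (r[1] != r[2] || $3 != "Running") {
        if (best == "" || ord < bestord) {
          best = $1; bestord = ord; ready = $2; status = $3
        }
      }
    }
    END { if (best != "") print best, ready, status }
  '
}
```

Typical use would be `kubectl get pods -l app=<label> --no-headers | first_blocking_pod`, followed by `kubectl describe pod` on whatever it prints.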
Useful commands
# StatefulSet state
kubectl get sts -A
kubectl get sts <name> -o yaml
kubectl describe sts <name>
# Pods in ordinal order
kubectl get pods -l app=<label> -o wide
# Expected: <sts>-0, <sts>-1, <sts>-2 with stable names
# Headless Service
kubectl get svc <name>-headless -o yaml | yq '.spec.clusterIP' # should be None
kubectl get endpoints <name>-headless
# DNS resolution from inside cluster
kubectl run dnstest --rm -it --image=busybox:1.28 --restart=Never -- \
  nslookup <sts>-0.<headless-svc>.<ns>.svc.cluster.local
# PVCs
kubectl get pvc -l app=<label>
kubectl describe pvc <pvc>
# Update strategy
kubectl get sts <name> -o yaml | yq '.spec.updateStrategy'
# Partition rollout
kubectl patch sts <name> -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
# Scale
kubectl scale sts <name> --replicas=5
# Per-pod logs (replicas may differ)
kubectl logs <sts>-0
kubectl logs <sts>-0 -c <sidecar>
# Force restart specific pod (controller recreates with same identity)
kubectl delete pod <sts>-0
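Before running a scale-down with the command above, it can be useful to list which pods will go away and in what order. This is a small sketch of the ordered-termination rule (highest ordinal first); the function name is hypothetical and ordinals are assumed to start at 0.

```shell
# scale_down_order <sts-name> <current-replicas> <target-replicas>
# Hypothetical helper: prints the pods a scale-down would remove, in
# termination order (highest ordinal first). Prints nothing when the
# target is not lower than the current replica count.
scale_down_order() {
  i=$(( $2 - 1 ))
  while [ "$i" -ge "$3" ]; do
    printf '%s-%s\n' "$1" "$i"
    i=$((i - 1))
  done
}
```

`scale_down_order db 5 2` prints `db-4`, `db-3`, `db-2`, one per line. Their PVCs stay behind unless the retention policy says otherwise.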
Common findings this catches
- Pod 0 stuck Pending → PVC stuck Pending → check StorageClass and provisioner. See PVC troubleshooting.
- Pod 1 won’t start → Pod 0 not Ready. Investigate Pod 0’s readiness probe.
- DNS `<sts>-0.<headless>.<ns>` doesn’t resolve → headless Service `clusterIP` not `None`, or selector doesn’t match pods.
- Scale-down stuck → a remaining pod is not Ready (the controller waits under `OrderedReady`), or the terminating pod’s graceful shutdown is hanging. If the stuck operation is a node drain rather than a scale-down, check for a blocking PDB.
- Old pod template still running after partition rollout → `partition` value not yet decremented past those ordinals.
- PVCs accidentally deleted along with the STS (for example, during a delete-and-recreate update) → `persistentVolumeClaimRetentionPolicy.whenDeleted: Delete` was set; restore from backup.
- App can’t find its peer pods → clients using IPs instead of FQDNs; switch to `<sts>-N.<headless>` DNS.
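Because PVCs outlive scaled-down pods by default, a PVC inventory often contains claims for ordinals that no longer have a pod. The sketch below flags those, assuming PVC names follow `<vct-name>-<sts-name>-<ordinal>` and ordinals start at 0; the function name is hypothetical. It only lists names. Deleting anything it prints is a DESTRUCTIVE step.

```shell
# retained_pvcs <vct-name> <sts-name> <replicas>
# Hypothetical helper: reads PVC names on stdin (one per line) and prints
# those named <vct-name>-<sts-name>-<ordinal> whose ordinal is >= replicas,
# i.e. claims left behind by an earlier scale-down.
retained_pvcs() {
  prefix="$1-$2-"
  while IFS= read -r pvc; do
    case "$pvc" in
      "$prefix"*)
        ord=${pvc#"$prefix"}
        case "$ord" in
          ''|*[!0-9]*) continue ;;   # suffix is not a plain ordinal; skip
        esac
        [ "$ord" -ge "$3" ] && printf '%s\n' "$pvc"
        ;;
    esac
  done
  return 0
}
```

One way to feed it is `kubectl get pvc -l app=<label> --no-headers -o custom-columns=NAME:.metadata.name | retained_pvcs data db 3`.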
Patterns
Production StatefulSet (PostgreSQL-style)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db-headless # MUST match the headless Service
  replicas: 3
  podManagementPolicy: OrderedReady
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain # don't delete on scale-down
    whenDeleted: Retain # don't delete on STS delete
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0 # all pods get updated
  selector:
    matchLabels: { app: db }
  template:
    metadata:
      labels: { app: db }
    spec:
      terminationGracePeriodSeconds: 300
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - { name: pg, containerPort: 5432 }
          readinessProbe:
            exec:
              command: ["pg_isready", "-h", "localhost"]
            initialDelaySeconds: 30
          volumeMounts:
            - { name: data, mountPath: /var/lib/postgresql/data }
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi
---
apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None # MUST be None for headless
  selector: { app: db }
  ports:
    - { name: pg, port: 5432 }
Partition canary
# Stage 1: update only pod with ordinal 2 (highest)
kubectl patch sts db -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
# Apply new template (creates a new controller revision; updated pods get a new controller-revision-hash label)
kubectl set image sts/db postgres=postgres:17
# Observe pod 2 only
kubectl describe pod db-2
# Stage 2: roll forward to pod 1
kubectl patch sts db -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'
# Stage 3: full rollout
kubectl patch sts db -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
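To reason about a canary stage before patching, the partition rule (ordinals >= partition move to the new template) can be written out directly. This is an illustrative sketch with hypothetical function names; it predicts the intended state only and does not query the cluster.

```shell
# partition_plan <sts-name> <replicas> <partition>
# Hypothetical helper: prints each pod with "new" if its ordinal is
# >= partition (it receives the updated template), else "old".
partition_plan() {
  i=0
  while [ "$i" -lt "$2" ]; do
    if [ "$i" -ge "$3" ]; then rev=new; else rev=old; fi
    printf '%s-%s %s\n' "$1" "$i" "$rev"
    i=$((i + 1))
  done
}

# partition_patch <partition>
# Hypothetical helper: prints the JSON body for `kubectl patch sts <name> -p`.
partition_patch() {
  printf '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":%d}}}}\n' "$1"
}
```

For stage 2 above, `partition_plan db 3 1` predicts `db-0 old`, `db-1 new`, `db-2 new`, and `kubectl patch sts db -p "$(partition_patch 1)"` applies it. The actual state can then be compared with `kubectl get pods -l app=db -L controller-revision-hash`.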
When to escalate
- App-level cluster state (split-brain, election timeout, data inconsistency) — engage app/DBA team.
- Persistent volume corruption across multiple ordinals — likely storage backend issue.
- Headless Service interacting badly with a service mesh — coordinate with mesh owner.
Related prompts
- Kubernetes Deployment Rollout Debug Prompt: diagnose stuck Deployment rollouts (`ProgressDeadlineExceeded`, replica set churn, maxSurge/maxUnavailable misconfig, image pull pacing, and stuck-mid-rollout recovery).
- Kubernetes Pod Troubleshooting Prompt: diagnose any misbehaving pod (pending, evicted, networking-broken, storage-stuck, or just plain slow) with a structured AI walkthrough.
- Kubernetes PV / PVC / CSI Storage Troubleshooting Prompt: diagnose stuck PVCs, failed pod mounts, StorageClass provisioning errors, CSI driver crashes, and orphaned volume cleanups.