Kubernetes StatefulSet Debug Prompt

Diagnose StatefulSet issues — ordered deployment stuck, headless Service not resolving, PVC claim template misbehavior, scale-down problems, partition rollouts.

Target user

Kubernetes platform engineers running stateful workloads (databases, queues, ZooKeeper)

Difficulty

Intermediate

Tools

Claude, ChatGPT

You are a senior Kubernetes engineer who has run StatefulSets in production for databases (PostgreSQL, MongoDB, Cassandra), queues (Kafka, RabbitMQ), and coordination services (etcd, ZooKeeper). You know that "ordered" deployment is a sharp tool and that scaling down a StatefulSet does NOT delete PVCs by default.

I will provide:

- The StatefulSet name and what it runs (DB, queue, etc.)
- The symptom (stuck at pod N, headless Service issues, PVC claim template not creating PVCs, per-pod network identity broken, scale-down stuck)
- `kubectl get sts <name>` and `kubectl describe sts <name>`
- `kubectl get pods -l app=<label>` (ordered numbering should be visible)
- The headless Service: `kubectl get svc <name>-headless -o yaml`
- PVC inventory: `kubectl get pvc -l app=<label>`

Your job:

1. **Verify StatefulSet contract**:
   - **Stable network identity**: `<pod-name>.<headless-svc>.<namespace>.svc.cluster.local`
   - **Ordered creation**: pod 0 must be Ready before pod 1 starts (default policy `OrderedReady`)
   - **Ordered termination**: highest-ordinal pod terminates first
   - **PersistentVolumeClaim templates**: each pod gets its own PVC named `<vct-name>-<sts-name>-<ordinal>`
   - **Headless Service**: `clusterIP: None`; required for stable DNS per pod
2. **For "stuck at pod N"** during scale-up:
   - Pod N-1 is not Ready → check its readiness probe, init containers, image
   - PVC binding stuck → see [PVC storage troubleshooting](/prompts/kubernetes-pvc-storage-troubleshooting/)
   - `podManagementPolicy: Parallel` allows concurrent starts when ordering is not needed; the field is immutable on an existing StatefulSet, so changing it means recreating the object
3. **For headless Service issues**:
   - Service `clusterIP: None` confirmed?
   - `kubectl get endpoints <svc>` shows per-pod IPs?
   - DNS resolution from another pod: `kubectl exec ... -- nslookup <pod>.<svc>.<ns>`
   - Pod hostname matches expected (`<sts-name>-<ordinal>`)?
4. **For PVC claim template not creating PVC**:
   - `volumeClaimTemplates:` defines what each pod gets; verify the spec is valid
   - StorageClass exists?
   - The controller creates each PVC just before it creates the pod; if the pod is stuck Pending, check whether its PVC is Pending too
5. **For scale-down issues**:
   - **PVCs persist after pod deletion by default**: this is the desired behavior; scaling down to 0 doesn't lose data
   - **`persistentVolumeClaimRetentionPolicy`** (beta and enabled by default since 1.27, GA in 1.32): `whenScaled` and `whenDeleted` control PVC lifecycle
   - **Scale-down waits on unhealthy pods**: with `OrderedReady`, the controller removes the next ordinal only once the remaining pods are Running and Ready. A PodDisruptionBudget (`kubectl get pdb`) blocks evictions such as node drains, not `kubectl scale`
   - **Pod stuck Terminating** with high ordinal → finalizer or graceful termination hang
6. **For partition rollouts** (canary in StatefulSets):
   - `updateStrategy.rollingUpdate.partition: N`: pods with ordinal >= N get the new template; others stay on the old one
   - Decrease partition over time to roll forward
   - Useful for testing on the highest-ordinal pod first
7. **For app-specific data corruption / re-election needed** (Cassandra/Kafka/etcd):
   - This prompt is K8s-level; app-level problems require app expertise
   - But: ensure data PVCs aren't accidentally deleted; check `persistentVolumeClaimRetentionPolicy`
8. **For pod identity changes** after reschedule:
   - Pod name should be stable (`<sts>-N`); IP changes are normal
   - Use the FQDN for client connections, not IPs
   - The headless Service should exist before the StatefulSet's pods start, so per-pod DNS records are available when clustered apps bootstrap

Mark DESTRUCTIVE: deleting PVCs of a StatefulSet (data loss), setting `persistentVolumeClaimRetentionPolicy.whenDeleted: Delete` (data lost on STS delete), force-deleting a pod with `--grace-period=0` (may cause split-brain in clustered apps).

---

StatefulSet workload: [DESCRIBE: DB / queue / coordination]

Symptom: [DESCRIBE]

`kubectl describe sts <name>`:

```
[PASTE]
```

`kubectl get pods -l app=<label> -o wide`:

```
[PASTE]
```

Headless Service: `kubectl get svc <name>-headless -o yaml`:

```yaml
[PASTE]
```

PVCs:

```
[PASTE `kubectl get pvc -l app=<label>`]
```

Update strategy: `kubectl get sts <name> -o yaml | yq '.spec.updateStrategy'`:

```
[PASTE]
```

Why this prompt works

StatefulSet quirks (ordered deployment, stable DNS, PVC retention) trip up engineers familiar with Deployments. This prompt walks the contract and forces a per-component check (headless Service, PVC, ordering).
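
As a rough illustration of the naming part of that contract, the sketch below prints the pod name, per-pod DNS name, and PVC name expected for each ordinal. The function name `sts_expected_names` and the sample values (`db`, `db-headless`, `prod`, `data`) are assumptions made for this example; it also assumes the default `cluster.local` cluster domain and ordinals starting at 0.

```shell
# Print the identities a StatefulSet should produce, one ordinal per line:
#   <pod>  <pod FQDN>  <pvc>
# Args: sts name, headless Service name, namespace, volumeClaimTemplate name, replicas.
# Assumes the default cluster domain (cluster.local) and ordinals starting at 0.
sts_expected_names() {
  sts=$1 svc=$2 ns=$3 vct=$4 replicas=$5
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf '%s %s %s\n' \
      "${sts}-${i}" \
      "${sts}-${i}.${svc}.${ns}.svc.cluster.local" \
      "${vct}-${sts}-${i}"
    i=$((i + 1))
  done
}

# Hypothetical values, for illustration only.
sts_expected_names db db-headless prod data 3
```

Comparing this list against `kubectl get pods` and `kubectl get pvc` output is a quick way to spot a missing ordinal or a PVC that belongs to a different claim template.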

How to use it

Verify the headless Service first — without it, DNS is broken and clustered apps can’t bootstrap.
Check PVCs separately from pods. A PVC stuck Pending is a common cause of pods stuck Pending.
Mind the ordering: pod N depends on pod N-1’s readiness.
For app-level cluster issues (Cassandra ring, Kafka rebalance), pair with app-specific debugging.
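
The ordering check in the list above can be sketched as a small filter over `kubectl get pods --no-headers` output. The function name `first_unready_pod` and the sample rows are made up for illustration; the sketch assumes the default column layout (NAME, READY, STATUS, ...) and treats any pod that is not `Running` with all containers ready as not Ready.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the
# lowest-ordinal pod that is not fully Ready. With OrderedReady, that pod is
# the one holding back every higher ordinal.
first_unready_pod() {
  awk '
    {
      n = split($1, parts, "-")        # ordinal is the last dash-separated field
      ord = parts[n] + 0
      split($2, ready, "/")            # READY column looks like "1/1"
      if (ready[1] != ready[2] || $3 != "Running") {
        if (best == "" || ord < bestord) { best = $1; bestord = ord }
      }
    }
    END { if (best != "") print best }
  '
}

# Sample input (made up); in practice pipe in:
#   kubectl get pods -l app=<label> --no-headers
first_unready_pod <<'EOF'
db-0   1/1   Running   0   10m
db-1   0/1   Running   3   8m
EOF
# prints: db-1
```

Ordinals are compared numerically, so `db-10` sorts after `db-2` even though the plain `kubectl` listing orders them the other way.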

Useful commands

# StatefulSet state
kubectl get sts -A
kubectl get sts <name> -o yaml
kubectl describe sts <name>

# Pods in ordinal order
kubectl get pods -l app=<label> -o wide
# Expected: <sts>-0, <sts>-1, <sts>-2 with stable names

# Headless Service
kubectl get svc <name>-headless -o yaml | yq '.spec.clusterIP'    # should be None
kubectl get endpoints <name>-headless

# DNS resolution from inside cluster
kubectl run dnstest --rm -it --image=busybox:1.28 --restart=Never -- \
    nslookup <sts>-0.<headless-svc>.<ns>.svc.cluster.local

# PVCs
kubectl get pvc -l app=<label>
kubectl describe pvc <pvc>

# Update strategy
kubectl get sts <name> -o yaml | yq '.spec.updateStrategy'

# Partition rollout
kubectl patch sts <name> -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'

# Scale
kubectl scale sts <name> --replicas=5

# Per-pod logs (replicas may differ)
kubectl logs <sts>-0
kubectl logs <sts>-0 -c <sidecar>

# Restart a specific pod (controller recreates it with the same name and PVC)
kubectl delete pod <sts>-0
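
Because PVCs are retained on scale-down by default, the PVC listing above can contain claims for ordinals that no longer have a pod. A minimal sketch for spotting them follows; the function name `retained_pvcs` and the sample names are hypothetical, and it assumes the `<vct-name>-<sts-name>-<ordinal>` naming described in the prompt.

```shell
# Read PVC names on stdin and print those that belong to ordinals at or above
# the current replica count, i.e. claims left behind by an earlier scale-down.
# Args: volumeClaimTemplate name, sts name, current replicas.
retained_pvcs() {
  awk -v prefix="$1-$2-" -v replicas="$3" '
    index($1, prefix) == 1 {
      ord = substr($1, length(prefix) + 1)
      if (ord ~ /^[0-9]+$/ && ord + 0 >= replicas + 0) print $1
    }
  '
}

# Sample input (made up); in practice pipe in:
#   kubectl get pvc -l app=<label> --no-headers -o custom-columns=NAME:.metadata.name
printf 'data-db-0\ndata-db-1\ndata-db-3\n' | retained_pvcs data db 2
# prints: data-db-3
```

The output is only a list for review. Deleting any of these claims is a destructive action and should follow the same caution as the rest of this prompt.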

Common findings this catches

Pod 0 stuck Pending → PVC stuck Pending → check StorageClass and provisioner. See PVC troubleshooting.
Pod 1 won’t start → Pod 0 not Ready. Investigate Pod 0’s readiness probe.
DNS <sts>-0.<headless>.<ns> doesn’t resolve → headless Service clusterIP not None, selector doesn’t match pods, or the StatefulSet’s serviceName doesn’t match the Service name.
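Scale-down stuck → a remaining pod is not Ready (the controller waits for healthy pods before removing the next ordinal), or the departing pod’s graceful shutdown or a finalizer is hanging. A PDB blocks evictions such as node drains, not kubectl scale.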
Old pod template still running after partition rollout → partition value not yet decremented past those ordinals.
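PVCs accidentally deleted along with the StatefulSet → persistentVolumeClaimRetentionPolicy.whenDeleted: Delete was set and the STS was deleted (for example, a delete-and-recreate done as part of an update); restore from backup.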
App can’t find its peer pods → clients using IPs instead of FQDNs; switch to <sts>-N.<headless> DNS.
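
For the partition finding above, it can help to write down which pods a given partition value is expected to move. The sketch below applies the rule from the prompt (ordinal >= partition gets the new template) and lists pods highest ordinal first, the order in which a rolling update proceeds. The function name `pods_on_new_template` and the sample arguments are assumptions for illustration.

```shell
# List the pods that should be running the new template for a given partition.
# Rule: ordinal >= partition is updated; lower ordinals stay on the old template.
# Args: sts name, replicas, partition.
pods_on_new_template() {
  sts=$1 replicas=$2 partition=$3
  i=$((replicas - 1))
  while [ "$i" -ge "$partition" ]; do
    echo "${sts}-${i}"
    i=$((i - 1))
  done
}

pods_on_new_template db 3 2   # prints: db-2
```

If a pod outside this list is on the new template, or a pod inside it is not, the partition value on the live object is probably not what was intended.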

Patterns

Production StatefulSet (PostgreSQL-style)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db-headless         # MUST match the headless Service
  replicas: 3
  podManagementPolicy: OrderedReady
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain             # don't delete on scale-down
    whenDeleted: Retain             # don't delete on STS delete
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0                  # all pods get updated
  selector:
    matchLabels: { app: db }
  template:
    metadata:
      labels: { app: db }
    spec:
      terminationGracePeriodSeconds: 300
      containers:
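      - name: postgres
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD       # official image won't initialize an empty data dir without a password
          valueFrom:
            secretKeyRef: { name: db-credentials, key: password }   # hypothetical Secret
        - name: PGDATA                  # subdirectory keeps initdb clear of the volume's lost+found
          value: /var/lib/postgresql/data/pgdata
        ports:
        - { name: pg, containerPort: 5432 }
        readinessProbe:
          exec:
            command: ["pg_isready", "-h", "localhost"]
          initialDelaySeconds: 30
        volumeMounts:
        - { name: data, mountPath: /var/lib/postgresql/data }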
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 50Gi
---
apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None                   # MUST be None for headless
  selector: { app: db }
  ports:
  - { name: pg, port: 5432 }

Partition canary

# Stage 1: update only pod with ordinal 2 (highest)
kubectl patch sts db -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'

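# Apply new template (creates a new revision; updated pods get a new controller-revision-hash label)
# Image tag is illustrative: a PostgreSQL major-version bump also needs pg_upgrade or dump/restore
kubectl set image sts/db postgres=postgres:17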

# Observe pod 2 only
kubectl describe pod db-2

# Stage 2: roll forward to pod 1
kubectl patch sts db -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'

# Stage 3: full rollout
kubectl patch sts db -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
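
For StatefulSets with more replicas, the stages above can be generated rather than typed by hand. The following is a minimal sketch that only prints the patch commands, one per stage, so each can be reviewed and run after the previous stage has been verified. The function name `partition_rollout_plan` is hypothetical.

```shell
# Print (not run) the patch commands for a staged partition rollout,
# starting at the highest ordinal and ending at partition 0.
# Args: sts name, replicas.
partition_rollout_plan() {
  sts=$1 replicas=$2
  p=$((replicas - 1))
  while [ "$p" -ge 0 ]; do
    printf "kubectl patch sts %s -p '{\"spec\":{\"updateStrategy\":{\"rollingUpdate\":{\"partition\":%d}}}}'\n" "$sts" "$p"
    p=$((p - 1))
  done
}

partition_rollout_plan db 3
```

Printing instead of executing is a deliberate choice here: a canary is only useful if someone checks the updated pod between stages.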

When to escalate

App-level cluster state (split-brain, election timeout, data inconsistency) — engage app/DBA team.
Persistent volume corruption across multiple ordinals — likely storage backend issue.
Headless Service interacting badly with a service mesh — coordinate with mesh owner.

Related prompts

Kubernetes PV / PVC / CSI Storage Troubleshooting Prompt

Kubernetes Pod Troubleshooting Prompt

Kubernetes Deployment Rollout Debug Prompt

Kubernetes Volume Populators & dataSourceRef Design Prompt
