Kubernetes StatefulSet Debug Prompt

Diagnose StatefulSet issues — ordered deployment stuck, headless Service not resolving, PVC claim template misbehavior, scale-down problems, partition rollouts.

Target user
Kubernetes platform engineers running stateful workloads (databases, queues, ZK)
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are a senior Kubernetes engineer who has run StatefulSets in production for databases (PostgreSQL, MongoDB, Cassandra), queues (Kafka, RabbitMQ), and coordination (etcd, ZooKeeper). You know that "ordered" deployment is a sharp tool and that scaling down a StatefulSet does NOT delete PVCs by default.

I will provide:
- The StatefulSet name and what it runs (DB, queue, etc.)
- The symptom (stuck at pod N, headless Service issues, PVC claim template not creating, pod-N-network-identity broken, scale-down stuck)
- `kubectl get sts <name>` and `kubectl describe sts <name>`
- `kubectl get pods -l app=<label>` (ordered numbering should be visible)
- The headless Service: `kubectl get svc <name>-headless -o yaml`
- PVC inventory: `kubectl get pvc -l app=<label>`

Your job:

1. **Verify StatefulSet contract**:
   - **Stable network identity**: `<pod-name>.<headless-svc>.<namespace>.svc.cluster.local`
   - **Ordered creation**: pod 0 must be Ready before pod 1 starts (default policy `OrderedReady`)
   - **Ordered termination**: highest-ordinal pod terminates first
   - **PersistentVolumeClaim templates**: each pod gets its own PVC named `<vct-name>-<sts-name>-<ordinal>`
   - **Headless Service**: `clusterIP: None`; required for stable DNS per pod
2. **For "stuck at pod N"** during scale-up:
   - Pod N-1 is not Ready → check its readiness, init containers, image
   - PVC binding stuck → see [PVC storage troubleshooting](/prompts/kubernetes-pvc-storage-troubleshooting/)
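   - `podManagementPolicy: Parallel` starts and terminates pods concurrently if the app doesn't need ordering; the field is immutable, so changing it means recreating the StatefulSet, and it affects scaling only, not rolling updates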
3. **For headless Service issues**:
   - Service `clusterIP: None` confirmed?
   - `kubectl get endpoints <svc>` shows per-pod IPs?
   - DNS resolution from another pod: `kubectl exec ... -- nslookup <pod>.<svc>.<ns>`
   - Pod hostname matches expected (`<sts-name>-<ordinal>`)?
4. **For PVC claim template not creating PVC**:
   - `volumeClaimTemplates:` defines what each pod gets — verify the spec is valid
   - StorageClass exists?
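   - The controller creates each PVC just before its pod; with `volumeBindingMode: WaitForFirstConsumer` the PVC stays Pending until that pod is scheduled, so an unschedulable pod and a Pending PVC show up together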
5. **For scale-down issues**:
   - **PVCs persist after pod deletion by default** — desired behavior; scaling down to 0 doesn't lose data
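   - **`persistentVolumeClaimRetentionPolicy`** (beta and on by default since 1.27, GA in 1.32): `whenScaled` and `whenDeleted` control PVC lifecycle
   - **Scale-down blocked by an unhealthy pod**: with `OrderedReady`, the controller won't remove the highest ordinal until the remaining pods are Running and Ready. A PodDisruptionBudget does not block `kubectl scale`; it only gates evictions such as `kubectl drain` (check with `kubectl get pdb`)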
   - **Pod stuck Terminating** with high ordinal → finalizer or graceful termination hang
6. **For partition rollouts** (canary in StatefulSets):
   - `updateStrategy.rollingUpdate.partition: N` — pods with ordinal >= N get the new template; others stay on old
   - Decrease partition over time to roll forward
   - Useful for testing on highest-ordinal pod first
7. **For app-specific data corruption / re-elect needed** (Cassandra/Kafka/etcd):
   - This prompt is K8s-level; app-level requires app expertise
   - But: ensure data PVCs aren't accidentally deleted; check `persistentVolumeClaimRetentionPolicy`
8. **For pod identity changes** after reschedule:
   - Pod name should be stable (`<sts>-N`); IP changes are normal
   - Use the FQDN for client connections, not IPs
   - Headless Service must exist BEFORE StatefulSet creation for DNS to wire correctly

Mark DESTRUCTIVE: deleting PVCs of a StatefulSet (data loss), setting `persistentVolumeClaimRetentionPolicy.whenDeleted: Delete` (data lost on STS delete), force-deleting a pod with `--grace-period=0` (may cause split-brain in clustered apps).

---

StatefulSet workload: [DESCRIBE — DB / queue / coordination]
Symptom: [DESCRIBE]
`kubectl describe sts <name>`:
```
[PASTE]
```
`kubectl get pods -l app=<label> -o wide`:
```
[PASTE]
```
Headless Service: `kubectl get svc <name>-headless -o yaml`:
```yaml
[PASTE]
```
PVCs:
```
[PASTE `kubectl get pvc -l app=<label>`]
```
Update strategy: `kubectl get sts <name> -o yaml | yq '.spec.updateStrategy'`:
```
[PASTE]
```

Why this prompt works

StatefulSet quirks (ordered deployment, stable DNS, PVC retention) trip up engineers familiar with Deployments. This prompt walks the contract and forces a per-component check (headless Service, PVC, ordering).
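
The contract is easiest to check when you know the exact names to look for. A minimal sketch in plain shell, assuming a StatefulSet `db` with a claim template `data`, a headless Service `db-headless`, the `default` namespace, and the default `cluster.local` cluster domain (the helper names are illustrative, not part of kubectl):

```shell
# Hypothetical helpers: derive the names the StatefulSet contract promises.

# Usage: pod_name <sts> <ordinal>
pod_name() { printf '%s-%s\n' "$1" "$2"; }

# Usage: pod_fqdn <sts> <ordinal> <headless-svc> <namespace>
# Assumes the default cluster domain, cluster.local.
pod_fqdn() { printf '%s-%s.%s.%s.svc.cluster.local\n' "$1" "$2" "$3" "$4"; }

# Usage: pvc_name <volumeClaimTemplate-name> <sts> <ordinal>
pvc_name() { printf '%s-%s-%s\n' "$1" "$2" "$3"; }

# Print what a 3-replica StatefulSet "db" should own (pod, DNS name, PVC).
i=0
while [ "$i" -lt 3 ]; do
  printf '%s  %s  %s\n' \
    "$(pod_name db "$i")" \
    "$(pod_fqdn db "$i" db-headless default)" \
    "$(pvc_name data db "$i")"
  i=$((i + 1))
done
```

Compare the printed names against `kubectl get pods` and `kubectl get pvc`; anything missing or extra points to the component to inspect first.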

How to use it

  1. Verify the headless Service first — without it, DNS is broken and clustered apps can’t bootstrap.
  2. Check PVCs separately from pods. A stuck PVC is one of the most common causes of stuck pods.
  3. Mind the ordering: pod N depends on pod N-1’s readiness.
  4. For app-level cluster issues (Cassandra ring, Kafka rebalance), pair with app-specific debugging.
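
Step 3 can be sketched as a small filter that finds the pod everything else is waiting on. This assumes the default `OrderedReady` policy and pod names ending in `-<ordinal>`; `first_not_ready` and `sts_blocker` are hypothetical helper names, and the label selector is whatever your StatefulSet uses:

```shell
# Hypothetical helper: read "<pod-name> <Ready-status>" lines on stdin and print
# the lowest-ordinal pod whose Ready condition is not "True".
first_not_ready() {
  awk '{
         n = $1; sub(/.*-/, "", n)           # ordinal = text after the last "-"
         print n, $1, ($2 == "" ? "Unknown" : $2)
       }' |
    sort -n -k1,1 |                          # numeric sort so 10 comes after 2
    awk '$3 != "True" { print $2; exit }'
}

# Usage: sts_blocker <label-selector>, e.g. sts_blocker app=db
# Feeds live pod status into the filter above.
sts_blocker() {
  kubectl get pods -l "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    first_not_ready
}
```

An empty result means every listed pod is Ready, so the ordering guarantee is not what is holding things up. Otherwise, run `kubectl describe pod` on the printed name before looking at any higher ordinal.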

Useful commands

# StatefulSet state
kubectl get sts -A
kubectl get sts <name> -o yaml
kubectl describe sts <name>

# Pods in ordinal order
kubectl get pods -l app=<label> -o wide
# Expected: <sts>-0, <sts>-1, <sts>-2 with stable names

# Headless Service
kubectl get svc <name>-headless -o yaml | yq '.spec.clusterIP'    # should be None
kubectl get endpoints <name>-headless

# DNS resolution from inside cluster
kubectl run dnstest --rm -it --image=busybox:1.28 --restart=Never -- \
    nslookup <sts>-0.<headless-svc>.<ns>.svc.cluster.local

# PVCs
kubectl get pvc -l app=<label>
kubectl describe pvc <pvc>

# Update strategy
kubectl get sts <name> -o yaml | yq '.spec.updateStrategy'

# Partition rollout
kubectl patch sts <name> -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'

# Scale
kubectl scale sts <name> --replicas=5

# Per-pod logs (replicas may differ)
kubectl logs <sts>-0
kubectl logs <sts>-0 -c <sidecar>

# Force restart specific pod (controller recreates with same identity)
kubectl delete pod <sts>-0

Common findings this catches

  • Pod 0 stuck Pending → PVC stuck Pending → check StorageClass and provisioner. See PVC troubleshooting.
  • Pod 1 won’t start → Pod 0 not Ready. Investigate Pod 0’s readiness probe.
  • DNS <sts>-0.<headless>.<ns> doesn’t resolve → headless Service clusterIP not None, or selector doesn’t match pods.
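  • Scale-down stuck → a lower-ordinal pod is not Ready (OrderedReady waits for it), or the terminating pod’s graceful shutdown is hanging. A PDB gates evictions such as node drains, not the scale-down itself.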
  • Old pod template still running after partition rollout → partition value not yet decremented past those ordinals.
  • PVCs accidentally deleted on STS delete → persistentVolumeClaimRetentionPolicy.whenDeleted: Delete was set; restore from backup.
  • App can’t find its peer pods → clients using IPs instead of FQDNs; switch to <sts>-N.<headless> DNS.
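
Because PVCs are retained by default, a scaled-down StatefulSet usually leaves claims behind for ordinals that no longer have a pod. A rough sketch for listing them, assuming PVC names end in `-<ordinal>` and carry the same label as the pods; `orphaned_pvcs` and `list_orphans` are hypothetical helper names:

```shell
# Hypothetical helper: given the current replica count, read PVC names on stdin
# and print those whose ordinal is >= replicas (retained, but no pod uses them).
orphaned_pvcs() {
  awk -v r="$1" '{
    n = $1; sub(/.*-/, "", n)                # ordinal = text after the last "-"
    if (n ~ /^[0-9]+$/ && n + 0 >= r + 0) print $1
  }'
}

# Usage: list_orphans <sts> <label-selector>, e.g. list_orphans db app=db
list_orphans() {
  replicas="$(kubectl get sts "$1" -o jsonpath='{.spec.replicas}')"
  kubectl get pvc -l "$2" -o name |
    sed 's|.*/||' |                          # strip the "persistentvolumeclaim/" prefix
    orphaned_pvcs "$replicas"
}
```

The output is only a list of candidates. These claims still hold data and are reattached if you scale back up, so deleting any of them remains a DESTRUCTIVE step.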

Patterns

Production StatefulSet (PostgreSQL-style)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db-headless         # MUST match the headless Service
  replicas: 3
  podManagementPolicy: OrderedReady
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain             # don't delete on scale-down
    whenDeleted: Retain             # don't delete on STS delete
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0                  # all pods get updated
  selector:
    matchLabels: { app: db }
  template:
    metadata:
      labels: { app: db }
    spec:
      terminationGracePeriodSeconds: 300
      containers:
      - name: postgres
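        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD     # the postgres image won't initialize without it
          valueFrom:
            secretKeyRef: { name: db-credentials, key: password }   # placeholder Secret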
        ports:
        - { name: pg, containerPort: 5432 }
        readinessProbe:
          exec:
            command: ["pg_isready", "-h", "localhost"]
          initialDelaySeconds: 30
        volumeMounts:
        - { name: data, mountPath: /var/lib/postgresql/data }
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 50Gi
---
apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None                   # MUST be None for headless
  selector: { app: db }
  ports:
  - { name: pg, port: 5432 }
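
The two "MUST" comments in the manifest can be verified against a live cluster. A minimal sketch, assuming the StatefulSet and its Service are in the current namespace; `is_headless` and `check_sts_service` are hypothetical helper names:

```shell
# Hypothetical helper: succeed only if the given clusterIP value marks a headless Service.
is_headless() { [ "$1" = "None" ]; }

# Usage: check_sts_service <sts>
# Looks up the Service named in spec.serviceName and checks that it is headless.
check_sts_service() {
  svc="$(kubectl get sts "$1" -o jsonpath='{.spec.serviceName}')"
  ip="$(kubectl get svc "$svc" -o jsonpath='{.spec.clusterIP}')" || {
    echo "Service $svc not found"
    return 1
  }
  if is_headless "$ip"; then
    echo "OK: $svc is headless"
  else
    echo "NOT headless: $svc has clusterIP $ip"
    return 1
  fi
}
```

Running `check_sts_service db` against the manifest above should report `db-headless` as headless. A "not found" result usually means `serviceName` and the Service's `metadata.name` have drifted apart.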

Partition canary

# Stage 1: update only pod with ordinal 2 (highest)
kubectl patch sts db -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'

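# Apply new template (creates a new revision; see the controller-revision-hash pod label)
# Tag is illustrative: a real PostgreSQL major-version upgrade also needs pg_upgrade or dump/restore
kubectl set image sts/db postgres=postgres:17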

# Observe pod 2 only
kubectl describe pod db-2

# Stage 2: roll forward to pod 1
kubectl patch sts db -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'

# Stage 3: full rollout
kubectl patch sts db -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
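
To predict which pods each stage touches, the partition rule (ordinal >= partition gets the new template) can be written out directly. A small sketch; `updated_ordinals` and `show_revisions` are hypothetical helper names, and the label selector is an assumption about how your pods are labeled:

```shell
# Hypothetical helper: print the ordinals that a RollingUpdate with the given
# partition moves to the new revision (ordinal >= partition).
# Usage: updated_ordinals <replicas> <partition>
updated_ordinals() {
  i="$2"
  while [ "$i" -lt "$1" ]; do
    echo "$i"
    i=$((i + 1))
  done
}

# Usage: show_revisions <sts> <label-selector>, e.g. show_revisions db app=db
# Prints current -> target revision, then each pod's revision label.
show_revisions() {
  kubectl get sts "$1" \
    -o jsonpath='{.status.currentRevision}{" -> "}{.status.updateRevision}{"\n"}'
  kubectl get pods -l "$2" -L controller-revision-hash
}
```

For the three-replica example, `updated_ordinals 3 2` prints only `2`, matching Stage 1, and `updated_ordinals 3 0` prints all three ordinals. Pods whose `controller-revision-hash` still equals the current revision have not been rolled yet.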

When to escalate

  • App-level cluster state (split-brain, election timeout, data inconsistency) — engage app/DBA team.
  • Persistent volume corruption across multiple ordinals — likely storage backend issue.
  • Headless Service interacting badly with a service mesh — coordinate with mesh owner.
