Running StatefulSets in Production Without Surprises

The first StatefulSet I ran in anger was a three-node database, and I treated it like a Deployment because the YAML looked almost identical. Then I tried to roll out a config change and watched the update stall halfway, with one pod stuck and the rollout refusing to proceed. That’s when I learned StatefulSets aren’t “Deployments with stable hostnames” — they have their own ordering, identity, and storage semantics, and the operational rules that follow are genuinely different.

If you’re running stateful workloads on Kubernetes — databases, queues, anything where pod identity and durable storage matter — these are the behaviors that trip people up, and how to work with them instead of against them.

Stable identity is the whole point

A Deployment’s pods are cattle: interchangeable, randomly named, freely replaced. A StatefulSet’s pods are named and ordered — db-0, db-1, db-2 — and each keeps that identity across restarts and rescheduling. That stable identity comes with three guarantees the application can rely on:

A stable network name. Paired with a headless Service, each pod gets a predictable DNS name like db-0.db.payments.svc.cluster.local that survives restarts. Clustered databases need this to find their peers.
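Ordered, sequential operations. With the default OrderedReady policy, pods are created 0, 1, 2 and deleted 2, 1, 0. A pod isn’t created until its predecessor is Running and Ready, and a pod isn’t terminated until every higher-ordinal pod has shut down completely.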
Stable storage. Each pod gets its own PersistentVolumeClaim that follows it. db-0 always reattaches to db-0’s data.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db          # headless Service for stable DNS
  replicas: 3
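  podManagementPolicy: OrderedReady   # the default, shown for clarity
  selector:
    matchLabels:
      app: db              # required; must match the template labels
  template:
    metadata:
      labels:
        app: db
    spec: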
      containers:
        - name: db
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        resources: { requests: { storage: 50Gi } }

The volumeClaimTemplates block is the key difference from a Deployment: each replica gets its own PVC minted from this template, named data-db-0, data-db-1, and so on.
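
The serviceName: db field in the manifest points at a headless Service that has to be created separately. A minimal sketch, assuming the pods are labeled app: db (the label the kubectl commands later in this article select on), that the namespace is payments as in the example DNS name, and that the database listens on the default Postgres port 5432:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db               # must match spec.serviceName in the StatefulSet
  namespace: payments    # assumed from the example DNS name
spec:
  clusterIP: None        # headless: DNS resolves to individual pod IPs
  selector:
    app: db              # assumed pod label
  ports:
    - name: postgres
      port: 5432         # assumed: default Postgres port
```

With clusterIP: None there is no load-balanced virtual IP. DNS returns per-pod records instead, which is what gives db-0 its own resolvable name.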

The surprises, and how to handle them

PVCs outlive the StatefulSet. Delete the StatefulSet and the PVCs stay — by design, so you don’t lose data on an accidental delete. But it means scaling down from 5 to 3 leaves data-db-3 and data-db-4 lying around. Scale back up and the new pods reattach to that old data, which may or may not be what you want. Cleaning up requires deleting the orphaned PVCs by hand:

kubectl get pvc -l app=db        # see all of them, including orphans
kubectl delete pvc data-db-4 data-db-3   # only after you're sure

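Newer Kubernetes (enabled by default since v1.27) offers persistentVolumeClaimRetentionPolicy to automate this. It has two independent settings, whenScaled and whenDeleted, and each takes Retain (the default) or Delete. Set it deliberately, because the wrong choice deletes data you wanted to keep.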
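
As a sketch of one deliberate combination (the values are an illustration, not a recommendation for your data), this fragment keeps every PVC if the StatefulSet itself is deleted but cleans up the claims of pods removed by a scale-down:

```yaml
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # keep PVCs if the StatefulSet itself is deleted
    whenScaled: Delete    # remove PVCs of pods dropped by a scale-down
```

With whenScaled: Delete, scaling from 5 to 3 removes data-db-3 and data-db-4 along with their pods, so a later scale-up starts those ordinals on empty volumes.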

Ordered rollouts can stall, and that’s a feature. With the default RollingUpdate strategy, the controller updates pods one at a time in reverse ordinal order and will not move on until the updated pod is Running and Ready. My stalled rollout was the StatefulSet protecting me: a bad config made db-2 unhealthy, so it refused to touch db-1. Treat the stall as a signal, not a bug, and investigate the stuck pod rather than forcing the rollout. One catch under OrderedReady: after you fix or revert the template, the controller keeps waiting on the broken pod, so you also have to delete that pod before the rollout resumes.

kubectl rollout status statefulset/db
kubectl describe pod db-2     # why won't it become Ready?
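
Ready means whatever the pod’s readinessProbe says it means, so the rollout gate is only as good as that probe. A minimal sketch for the Postgres container in the earlier manifest, assuming the pg_isready utility that ships in the official postgres image and the default postgres user; the timings are illustrative, not tuned:

```yaml
spec:
  template:
    spec:
      containers:
        - name: db
          image: postgres:16
          readinessProbe:
            exec:
              # pg_isready exits 0 once the server accepts connections
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 5   # illustrative timing
            periodSeconds: 10
            failureThreshold: 3
```

This only confirms that the server accepts connections. A clustered database usually warrants a probe that also checks replication or membership state, which depends on the database.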

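For genuinely independent replicas you can set podManagementPolicy: Parallel so pods are created and deleted together instead of in sequence, but only when the app doesn’t require ordered startup. It affects scaling only; rolling updates still go one pod at a time by default. The field can’t be changed on an existing StatefulSet, so choose it at creation.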

Scaling doesn’t behave the way it does for a Deployment. Scaling a StatefulSet up adds the next ordinal; scaling down removes the highest ordinal first. For a clustered database, removing db-2 may mean removing a member that holds data or counts toward quorum. You often need to drain or decommission the member in the application before scaling down the StatefulSet, or you’ll leave the cluster in a degraded state. The StatefulSet manages pods and storage; it knows nothing about your database’s replication. That gap is yours to bridge.

Updates need partitions for safety

RollingUpdate gives you one pod at a time, but for a careful canary across a stateful cluster, use a partitioned update. Setting updateStrategy.rollingUpdate.partition: N means only pods with ordinal >= N get the new version; pods below N stay on the old version even if they are deleted and recreated:

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2     # only db-2 updates; db-0, db-1 stay on old version

Bump the partition down (2 → 1 → 0) as each tier proves healthy. This is how you canary a database image upgrade: update the highest ordinal, verify it joins the cluster cleanly and serves traffic, then let the next one go. It turns a scary all-at-once upgrade into a controlled, reversible one.

Protect them like the pets they are

Stateful pods are not cattle, so the autoscaling and disruption story is different:

PodDisruptionBudgets are effectively mandatory. A node drain or cluster-autoscaler consolidation that evicts two of three quorum members at once will take the cluster down. A PDB with minAvailable set to your quorum floor prevents it. PDBs only limit voluntary evictions like these; they do nothing for a node that simply fails.
Anti-affinity spreads them. Without it, the scheduler may pack db-0, db-1, and db-2 onto one node, and that node’s failure takes the whole cluster. Use pod anti-affinity or topology spread constraints to force them across nodes and zones.
Mind the storage class. ReadWriteOnce means only one node can mount the volume at a time, and most cloud block storage is also tied to a single zone. If db-0’s zone goes down, its PVC can’t attach anywhere else, so the pod stays Pending until the zone recovers. Design replication so losing a zone loses at most one replica.
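
The first two protections can be sketched as manifests. Both assume the pods are labeled app: db and that quorum for a three-member cluster is two; the PDB name is hypothetical. First, a budget that never lets voluntary evictions drop the set below two ready pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb            # hypothetical name
spec:
  minAvailable: 2         # assumed quorum floor for 3 replicas
  selector:
    matchLabels:
      app: db             # assumed pod label
```

Second, a fragment for the StatefulSet’s pod template that refuses to place two replicas on the same node and spreads them evenly across zones:

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: db
              topologyKey: kubernetes.io/hostname    # one replica per node
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # balance across zones
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: db
```

Both rules are hard requirements here, so a replica stays Pending when no node satisfies them. That is usually the right trade for a quorum-based database, but it is a design choice rather than a universal default.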

When not to run it on Kubernetes at all

Worth saying plainly: a managed database is often the better call. If your cloud offers managed Postgres or a managed queue, the operational burden of running it as a StatefulSet — backups, failover, version upgrades, the quorum dance — frequently isn’t worth it. Run stateful workloads on Kubernetes when you have a real reason (cost, portability, an operator that genuinely automates day-2), not by default.

Where AI helps

StatefulSet incidents are diagnostic puzzles spanning pods, PVCs, and the app’s own clustering. I paste the StatefulSet spec, the stuck pod’s events, and the PVC list and ask the model to explain why a rollout stalled or why a pod is Pending — it’s quick to spot the zone-bound PVC or the missing anti-affinity. It also helps reason about a safe scale-down sequence for a specific database. Run StatefulSet and PDB manifests through our AI code review tool to catch the production traps: a missing PDB, no anti-affinity, or a retention policy that would silently delete data.

StatefulSets are the right tool when identity and storage matter — you just have to respect that they play by different rules than Deployments. Mind the PVCs, treat ordering as a safety feature, and protect the pods like the pets they are. For more, see our Kubernetes and Helm guides.

AI-assisted diagnoses are assistive, not authoritative. Always confirm scale-down and upgrade steps against your database’s own requirements before acting on production data.