AI for Kubernetes & Helm · By James Joyner IV · 10 min read

CSI Volume Snapshots for Backing Up Stateful Kubernetes Workloads

Stateful pods need point-in-time backups, not just replicas. Learn how CSI VolumeSnapshots, snapshot classes, and restore flows protect Kubernetes data.

  • #kubernetes
  • #storage
  • #csi
  • #backup
  • #stateful

A teammate once asked me how we would recover the Postgres StatefulSet if someone ran a bad DELETE against production. My answer — “we have three replicas” — was wrong, and I knew it the moment I said it. Replicas protect you from a node dying. They do nothing about a logical mistake that gets faithfully copied to every replica. For that you need a point-in-time snapshot you can roll back to. In Kubernetes, the native way to get one is CSI volume snapshots.

CSI (the Container Storage Interface) added snapshot support so that any storage backend whose driver implements it (EBS, GCE PD, Ceph, Portworx) exposes the same snapshot API to Kubernetes. The snapshot CRDs and the snapshot controller are not part of core Kubernetes, so confirm your distribution ships them or install them alongside the driver. You declare a snapshot as a resource, the driver does the underlying work, and you get an object you can restore from. I drafted my first snapshot class with an AI assistant, but the restore drill is something you rehearse with human hands on the keyboard.

The three objects you need to understand

CSI snapshots involve three kinds of resource, and the relationship trips people up:

  • VolumeSnapshotClass — cluster-wide, defines how snapshots are taken (which driver, deletion policy). Analogous to a StorageClass.
  • VolumeSnapshot — namespaced, a request for a snapshot of a specific PVC. This is what you create.
  • VolumeSnapshotContent — cluster-wide, the actual snapshot the driver provisioned. Bound to a VolumeSnapshot like a PV binds to a PVC.
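
To make the binding concrete, here is a minimal sketch that follows a VolumeSnapshot to its bound VolumeSnapshotContent and prints the handle the driver reported for the backend snapshot. The function name `trace_snapshot` and its arguments are hypothetical; the two status fields it reads come from the snapshot.storage.k8s.io/v1 API.

```shell
# Sketch only: trace_snapshot is an illustrative helper, not part of kubectl.
trace_snapshot() {
  local namespace="$1" name="$2" content

  # The namespaced VolumeSnapshot records which cluster-wide content it is bound to.
  content="$(kubectl get volumesnapshot "$name" -n "$namespace" \
    -o jsonpath='{.status.boundVolumeSnapshotContentName}')"
  if [ -z "$content" ]; then
    echo "snapshot $namespace/$name is not bound yet" >&2
    return 1
  fi

  # snapshotHandle is the driver's own ID for the snapshot on the backend.
  printf 'content=%s handle=%s\n' "$content" \
    "$(kubectl get volumesnapshotcontent "$content" \
      -o jsonpath='{.status.snapshotHandle}')"
}

# Usage: trace_snapshot data postgres-data-2026-06-17
```

Note that the second lookup takes no namespace, because VolumeSnapshotContent is cluster-wide.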

First, the class. Note that deletionPolicy: Retain keeps the underlying cloud snapshot if the Kubernetes object is deleted, which is what you want for real backups:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshots
driver: ebs.csi.aws.com
deletionPolicy: Retain

Taking a snapshot

Point a VolumeSnapshot at an existing PVC and the driver gets to work:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-2026-06-17
  namespace: data
spec:
  volumeSnapshotClassName: ebs-snapshots
  source:
    persistentVolumeClaimName: postgres-data-postgres-0

Check that it actually completed before you trust it:

kubectl get volumesnapshot postgres-data-2026-06-17 -n data \
  -o jsonpath='{.status.readyToUse}{"\n"}'

The field you care about is readyToUse: true. A VolumeSnapshot object exists the instant you apply it, but the snapshot is not usable until the driver reports ready. I have seen people kick off a restore against a snapshot that was still uploading. Always check the status.
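
If you script this check, a small polling loop is one way to block until the driver reports ready. This is a sketch under assumptions: the function name `wait_for_snapshot`, the default of 60 attempts, and the 10-second delay are all illustrative choices, not recommendations from any driver's documentation.

```shell
# Sketch only: poll .status.readyToUse until it is "true" or we give up.
wait_for_snapshot() {
  local namespace="$1" name="$2" attempts="${3:-60}" delay="${4:-10}"
  local ready i=0

  while [ "$i" -lt "$attempts" ]; do
    # The field is empty until the driver has populated the status.
    ready="$(kubectl get volumesnapshot "$name" -n "$namespace" \
      -o jsonpath='{.status.readyToUse}' 2>/dev/null)"
    if [ "$ready" = "true" ]; then
      echo "snapshot $namespace/$name is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done

  echo "snapshot $namespace/$name not ready after $attempts checks" >&2
  return 1
}

# Usage: wait_for_snapshot data postgres-data-2026-06-17 && echo "safe to restore"
```

The non-zero exit on timeout matters more than the loop itself: it lets a restore script stop instead of proceeding against a half-finished snapshot.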

Pro Tip: Most CSI drivers take crash-consistent snapshots, not application-consistent ones. For a database, that is like pulling the power cord — recovery usually works because of the write-ahead log, but “usually” is not a backup strategy. Flush or quiesce the database first, or use a tool that brackets the snapshot with a freeze, before you rely on it for production data.

Restoring is just a new PVC

You do not “restore in place.” You create a new PVC whose data source is the snapshot, then point a workload at it:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
  namespace: data
spec:
  storageClassName: gp3
  dataSource:
    name: postgres-data-2026-06-17
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi

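The new PVC provisions a volume pre-populated from the snapshot. It must live in the same namespace as the VolumeSnapshot, its storage class must be served by the CSI driver that took the snapshot, and the requested storage must be at least the snapshot's restoreSize (normally the size of the source volume). From there you spin up a recovery Pod against postgres-data-restored, verify the data, and only then decide whether to cut traffic over. Keeping the restore as a separate PVC means you never destroy the thing you are trying to recover from.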
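
As a rough sketch of the first steps of a drill, the helpers below read the size recorded on the snapshot and start a throwaway Pod that mounts the restored PVC. Everything named here is an assumption for illustration: the function names, the Pod name `restore-inspect`, the `busybox:1.36` image, and the `/restore` mount path. The Pod only lets you look at files; verifying a database still means starting Postgres against the restored volume.

```shell
# Sketch only: print the size the snapshot reports for a restored volume.
restore_size() {
  local namespace="$1" name="$2"
  kubectl get volumesnapshot "$name" -n "$namespace" \
    -o jsonpath='{.status.restoreSize}{"\n"}'
}

# Sketch only: start a short-lived Pod that mounts the restored PVC at /restore.
inspect_restored_pvc() {
  local namespace="$1" pvc="$2"
  kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: restore-inspect
  namespace: $namespace
spec:
  restartPolicy: Never
  containers:
    - name: inspect
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /restore
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: $pvc
EOF
}

# Usage:
#   restore_size data postgres-data-2026-06-17
#   inspect_restored_pvc data postgres-data-restored
#   kubectl exec -n data restore-inspect -- ls /restore
```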

Scheduling snapshots

Native CSI has no built-in scheduler: a VolumeSnapshot is a one-shot request. For periodic backups you either run a CronJob that templates and applies a timestamped VolumeSnapshot, or you adopt a dedicated snapshot scheduler or a backup tool such as Velero, which can create CSI snapshots on a schedule. A minimal CronJob approach:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-snapshot
  namespace: data
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshotter
          restartPolicy: OnFailure
          containers:
            - name: snap
              image: bitnami/kubectl:1.30
              command:
                - /bin/sh
                - -c
                - |
                  kubectl create -f - <<EOF
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: postgres-$(date +%Y%m%d-%H%M%S)
                    namespace: data
                  spec:
                    volumeSnapshotClassName: ebs-snapshots
                    source:
                      persistentVolumeClaimName: postgres-data-postgres-0
                  EOF

That ServiceAccount needs RBAC to create volumesnapshots in the namespace — and nothing more. Scope it tightly; I walk through that mindset in our piece on RBAC without the headaches.
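
One way to express that scope, sketched with kubectl's imperative commands: a Role that allows only create on volumesnapshots, bound to the CronJob's ServiceAccount, plus an impersonated check that the grant works. The function names and the default Role name `snapshot-creator` are hypothetical.

```shell
# Sketch only: grant a ServiceAccount permission to create VolumeSnapshots
# in one namespace and nothing else.
grant_snapshot_create() {
  local namespace="$1" sa="$2" role="${3:-snapshot-creator}"

  # Single verb, single resource, namespaced Role (not a ClusterRole).
  kubectl create role "$role" -n "$namespace" \
    --verb=create \
    --resource=volumesnapshots.snapshot.storage.k8s.io

  kubectl create rolebinding "$role" -n "$namespace" \
    --role="$role" \
    --serviceaccount="$namespace:$sa"
}

# Sketch only: ask the API server whether the ServiceAccount may create snapshots.
check_snapshot_create() {
  local namespace="$1" sa="$2"
  kubectl auth can-i create volumesnapshots.snapshot.storage.k8s.io \
    -n "$namespace" --as="system:serviceaccount:$namespace:$sa"
}

# Usage:
#   grant_snapshot_create data snapshotter
#   check_snapshot_create data snapshotter
```

If you later add pruning to the same job, it would need list and delete as well, which is a reason to consider a separate ServiceAccount for that step.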

Don’t forget retention

Snapshots cost money on every cloud, and a CronJob that only creates them will quietly bankrupt you. Pair creation with a pruning step that deletes snapshots older than your retention window. With deletionPolicy: Retain, deleting the Kubernetes object leaves the cloud snapshot — so your pruning has to handle the backend too, or you accumulate orphaned snapshots that no longer show up in kubectl get volumesnapshot.
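
A minimal sketch of the Kubernetes half of that pruning step follows. It assumes GNU date and awk are available, and it treats every VolumeSnapshot in the namespace as prunable; in practice you would probably restrict it with a label selector. The function names are hypothetical. It deliberately does not touch the backend, so under Retain the cloud snapshots still need their own cleanup.

```shell
# Sketch only: read "name creationTimestamp" lines on stdin and print the
# names older than the cutoff. Both sides are UTC timestamps in the same
# format (e.g. 2026-06-17T00:00:00Z), so plain string comparison orders them.
expired_snapshots() {
  local cutoff="$1"
  awk -v cutoff="$cutoff" 'NF == 2 && $2 < cutoff { print $1 }'
}

# Sketch only: delete VolumeSnapshot objects older than N days in a namespace.
prune_snapshots() {
  local namespace="$1" days="$2" cutoff

  # GNU date syntax; other date implementations need a different flag.
  cutoff="$(date -u -d "-${days} days" +%Y-%m-%dT%H:%M:%SZ)"

  kubectl get volumesnapshot -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' |
    expired_snapshots "$cutoff" |
    while read -r name; do
      # Removes only the Kubernetes object when the class uses Retain.
      kubectl delete volumesnapshot "$name" -n "$namespace" </dev/null
    done
}

# Usage: prune_snapshots data 14
```

Splitting the age filter from the delete loop keeps the risky part small: you can pipe the listing through `expired_snapshots` alone and read the names before anything is removed.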

Where AI fits

The snapshot class, the restore PVC, the CronJob template — these are great tasks to hand to an assistant. I describe the driver and retention I want and let a tool like Claude or GitHub Copilot generate the YAML, then I read every field. The AI is a fast junior engineer: superb at boilerplate, blind to whether your database needs quiescing first. It never touches the cluster. I run the restore drill myself, against a non-production namespace, with my own credentials — the model never gets a kubeconfig. A restore is the one operation you cannot afford to get wrong on the day you need it, so rehearse it by hand. If you want generated backup manifests reviewed before they land, the code review dashboard catches missing deletion policies and over-broad RBAC.

Conclusion

Replicas survive hardware failure; snapshots survive human failure. Get comfortable with the class/snapshot/content trio, always check readyToUse, restore into a fresh PVC, and quiesce databases before you trust a snapshot. Let AI draft the YAML, keep the restore drill in human hands, and never give the model your cluster credentials. More storage patterns live under kubernetes-helm, and you can grab ready-made prompts from the prompt packs.
