etcd Backup and Restore for Kubernetes Clusters

Every Kubernetes object you’ve ever created (every Deployment, Secret, ConfigMap, RBAC binding, and CRD instance) lives in etcd. The API server is stateless; etcd is the truth. If you run a managed cluster (EKS, GKE, AKS), the provider handles etcd and you can skip most of this. But if you self-manage a control plane with kubeadm, k3s, or on bare metal, etcd is the single component whose loss means losing the entire cluster’s state. Backing it up is not optional, and neither is testing the restore, which is the part people skip.

I’ve watched a team discover during a real outage that their nightly etcd backup had been writing zero-byte files for three weeks because a cert had rotated and the backup job failed silently. The backup that’s never restored is a hope, not a plan. Let’s make it a plan.

Taking a snapshot

etcd backups are point-in-time snapshots taken with etcdctl snapshot save. The command needs the etcd endpoint and the client certs; on a kubeadm cluster the certs live under /etc/kubernetes/pki/etcd/:

ETCDCTL_API=3 etcdctl snapshot save /backups/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Always verify the snapshot immediately — this is what catches the zero-byte-file failure:

ETCDCTL_API=3 etcdctl snapshot status /backups/etcd-20260612-020000.db \
  --write-out=table

That prints the hash, revision, total keys, and size. A healthy snapshot has a sane key count and a non-trivial size. A backup job that doesn’t run snapshot status and assert on the output is a backup job you can’t trust.
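One way to assert on the output is to parse the JSON form of snapshot status and compare it against floors you pick. The sketch below is a minimal example under stated assumptions: it expects the totalKey and totalSize fields in the --write-out=json output, and the thresholds (100 keys, 1 MiB) are hypothetical placeholders you would tune to your cluster. The function names are illustrative, not part of etcdctl.

```shell
#!/usr/bin/env bash
# Pull one integer field out of the flat JSON object that
# `etcdctl snapshot status --write-out=json` prints.
# Prints nothing if the field is missing.
json_int_field() {
  printf '%s\n' "$1" \
    | sed -n "s/.*\"$2\":[[:space:]]*\([0-9][0-9]*\).*/\1/p" \
    | head -n 1
}

# Succeed only if the snapshot reports at least min_keys keys and
# min_bytes bytes. The default thresholds are hypothetical.
snapshot_looks_sane() {
  json="$1"
  min_keys="${2:-100}"
  min_bytes="${3:-1048576}"
  keys="$(json_int_field "$json" totalKey)"
  size="$(json_int_field "$json" totalSize)"
  # A missing field means the output was not what we expected: fail.
  [ -n "$keys" ] && [ -n "$size" ] || return 1
  [ "$keys" -ge "$min_keys" ] && [ "$size" -ge "$min_bytes" ]
}
```

In a backup job this could be called as snapshot_looks_sane "$(ETCDCTL_API=3 etcdctl snapshot status "$SNAP" --write-out=json)" || exit 1, so that a near-empty snapshot stops the job before anything is uploaded.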

Automate it, and ship it off the node

A snapshot sitting on the same disk as etcd protects you from logical corruption but not from losing the node. Automate the snapshot and ship it somewhere else.

#!/usr/bin/env bash
set -euo pipefail
SNAP="/backups/etcd-$(date +%Y%m%d-%H%M%S).db"
ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# fail loudly if the snapshot is empty or unreadable (set -e aborts on either check)
test -s "$SNAP"
ETCDCTL_API=3 etcdctl snapshot status "$SNAP" --write-out=json > /dev/null
# offsite, encrypted
aws s3 cp "$SNAP" "s3://acme-etcd-backups/$(hostname)/" --sse aws:kms
find /backups -name 'etcd-*.db' -mtime +7 -delete

Run it from a systemd timer (more reliable than cron for this) every few hours. Critical detail: etcd snapshots contain Secrets in whatever form etcd stores them. If you haven’t enabled encryption-at-rest in the API server, your Secrets are effectively plaintext in the snapshot. Encrypt the backups in transit and at rest, and lock down who can read the bucket as tightly as you’d lock down cluster-admin.
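For the systemd timer, a minimal sketch might look like the following. It assumes the backup script is installed at /usr/local/bin/etcd-backup.sh and that the unit name etcd-backup is free; both names, and the 6-hour schedule, are hypothetical. The helper only writes the two unit files, so you can inspect them before enabling anything.

```shell
#!/usr/bin/env bash
# Write a oneshot service and a timer that fires every 6 hours.
# Assumptions (hypothetical): script path /usr/local/bin/etcd-backup.sh,
# unit name "etcd-backup". Pass a directory to write somewhere other
# than /etc/systemd/system (useful for review before installing).
write_backup_units() {
  unit_dir="${1:-/etc/systemd/system}"

  cat > "$unit_dir/etcd-backup.service" <<'EOF'
[Unit]
Description=Snapshot etcd and ship it offsite

[Service]
Type=oneshot
ExecStart=/usr/local/bin/etcd-backup.sh
EOF

  cat > "$unit_dir/etcd-backup.timer" <<'EOF'
[Unit]
Description=Run etcd-backup.service every 6 hours

[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

After running write_backup_units as root, systemctl daemon-reload followed by systemctl enable --now etcd-backup.timer activates the schedule. Because the service is a oneshot, a non-zero exit from the script marks the unit as failed, which shows up in systemctl --failed and the journal rather than disappearing silently.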

How frequently?

The backup interval sets your worst-case data loss (RPO). If you snapshot every 6 hours, a control-plane loss can cost up to 6 hours of cluster state — every object created since the last snapshot. For most clusters where workloads are also defined in Git (GitOps), that’s survivable because you can re-sync. For clusters with state that only lives in etcd — dynamically created resources, operator-managed objects — snapshot more often. Match the interval to how much state you can afford to recreate.
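An RPO only holds if the snapshots actually keep arriving, so it can be worth checking freshness separately from the backup job itself. The sketch below is one hedged way to do that: it succeeds only if a non-empty etcd-*.db file newer than a given number of minutes exists in a directory. The function name and the 360-minute figure in the usage note are assumptions, and it relies on find supporting -mmin (GNU, BSD, and BusyBox find do).

```shell
#!/usr/bin/env bash
# Succeed only if the directory holds at least one non-empty snapshot
# modified within the last max_age_min minutes.
# -size +0 skips zero-byte files; -mmin -N means "newer than N minutes".
has_fresh_snapshot() {
  dir="$1"
  max_age_min="$2"
  found="$(find "$dir" -type f -name 'etcd-*.db' -size +0 \
             -mmin "-$max_age_min" | head -n 1)"
  [ -n "$found" ]
}
```

With a 6-hour interval, something like has_fresh_snapshot /backups 360 || echo "etcd backup is stale" >&2 could feed whatever alerting you already have. Running it from a different unit than the backup means a broken backup job cannot also silence the check.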

The restore — practice it before you need it

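Restoring is fundamentally different from saving: you don’t restore into a running etcd, you restore into a new data directory and then point a fresh etcd at it. (On etcd 3.5 and later, snapshot status and snapshot restore are deprecated in etcdctl in favor of the equivalent etcdutl commands, so check which binary your version expects.) On a single control-plane node, the sequence is: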

# 1. stop the control plane (kubeadm runs these as static pods)
mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
#    wait for the etcd and kube-apiserver containers to exit before continuing (check with: crictl ps)

# 2. restore the snapshot to a new data dir
ETCDCTL_API=3 etcdctl snapshot restore /backups/etcd-20260612-020000.db \
  --data-dir=/var/lib/etcd-restored

# 3. point etcd at the restored dir
#    edit /etc/kubernetes/manifests.bak/etcd.yaml: set the etcd-data volume's hostPath path to /var/lib/etcd-restored

# 4. bring the control plane back
mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests

For a multi-node etcd cluster it’s more involved: every member is restored from the same snapshot file, each with its own --name and --initial-advertise-peer-urls, and all with the same --initial-cluster (and --initial-cluster-token) describing the full topology. The exact flags depend on your peer URLs, which is precisely why you do not want to be reading the docs for the first time during an outage.
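To make the multi-node flags concrete, here is a sketch that builds the --initial-cluster string and prints (without running) the restore command for each member. Everything topology-specific is an assumption for illustration: the member names cp1 to cp3, the 10.0.0.x addresses, peer port 2380, the token etcd-restore-drill, and the data directory. Substitute your own values, and treat the printed commands as a draft to verify in a drill.

```shell
#!/usr/bin/env bash
# Build the --initial-cluster value from "name=ip" pairs.
# Assumes HTTPS peers on port 2380 (hypothetical for your topology).
build_initial_cluster() {
  out=""
  for pair in "$@"; do
    name="${pair%%=*}"   # text before the first "="
    ip="${pair#*=}"      # text after the first "="
    out="${out:+$out,}${name}=https://${ip}:2380"
  done
  printf '%s\n' "$out"
}

# Print the restore command for one member. It is printed, not run,
# so it can be reviewed and pasted on the right node.
# The token and data dir are hypothetical placeholders.
print_restore_cmd() {
  name="$1"; ip="$2"; cluster="$3"; snap="$4"
  printf 'etcdctl snapshot restore %s --name=%s --initial-cluster=%s --initial-cluster-token=%s --initial-advertise-peer-urls=https://%s:2380 --data-dir=%s\n' \
    "$snap" "$name" "$cluster" "etcd-restore-drill" "$ip" "/var/lib/etcd-restored"
}
```

For example, cluster="$(build_initial_cluster cp1=10.0.0.11 cp2=10.0.0.12 cp3=10.0.0.13)" followed by print_restore_cmd cp2 10.0.0.12 "$cluster" /backups/etcd-snapshot.db prints the command for the second member. Generating all members from one list keeps the --initial-cluster string identical everywhere, which is the part most easily mistyped by hand.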

Run a restore drill on a throwaway cluster. Stand up a kubeadm cluster, take a snapshot, deliberately wipe etcd, and restore. Time it. Write down the exact commands for your topology. The drill surfaces the gotchas — cert paths, peer URL mismatches, the static-pod dance — while the stakes are zero. A team that has restored once in practice recovers in minutes; a team that never has spends the outage learning.

etcd health between disasters

Backups are insurance; healthy etcd is prevention. Watch a few signals:

ETCDCTL_API=3 etcdctl endpoint health --cluster \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table ...

Alert on the database size approaching its quota (the default is 2 GiB, and 8 GiB is the suggested maximum; exceeding it raises a NOSPACE alarm, after which etcd accepts only reads and deletes until you free space and disarm the alarm), on rising fsync latency (etcd is brutally sensitive to slow disks, so give it fast SSDs), and on any loss of quorum. A degraded etcd is a control plane on borrowed time.
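The quota check reduces to simple arithmetic once you have the database size in bytes. The sketch below assumes you have already extracted that number (for instance from the dbSize field of endpoint status --write-out=json, which is an assumption about your etcd version's output) and that you know the quota your etcd runs with. The function names and the 80 percent default threshold are hypothetical.

```shell
#!/usr/bin/env bash
# Integer percentage of the quota currently in use.
quota_pct() {
  used_bytes="$1"
  quota_bytes="$2"
  echo $(( ${used_bytes} * 100 / ${quota_bytes} ))
}

# Succeed (exit 0) when usage is at or above the threshold percent.
# The default threshold of 80 is a hypothetical starting point.
quota_near_limit() {
  [ "$(quota_pct "$1" "$2")" -ge "${3:-80}" ]
}
```

With the default 2 GiB quota (2147483648 bytes), quota_near_limit "$db_size" 2147483648 && echo "etcd db nearing quota" >&2 would flag the condition well before the alarm trips, leaving time to compact and defragment.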

Where AI helps

The restore procedure is exactly the kind of high-stakes, low-frequency task where it pays to have a second reviewer. I draft the restore runbook for my specific topology and have AI check it against the etcd docs for the flags I’m likely to fumble — the --initial-cluster string for a 3-node restore is a classic place to get it subtly wrong. It’s also good at reading endpoint status output and explaining whether a leader-change or high-latency reading is normal. Run your backup scripts and restore runbook through our AI code review tool to catch the silent-failure traps, like a snapshot job that never asserts on snapshot status.

etcd is the one component where “we have backups” and “we can restore” are different sentences. Snapshot often, ship offsite encrypted, and — above all — practice the restore before the day you need it. For more on running clusters safely, see our Kubernetes and Helm guides.

AI-assisted runbooks are assistive, not authoritative. Always validate restore procedures on a non-production cluster before relying on them.