Kubernetes etcd Health, Backup & Restore Prompt
Operate the etcd backing store — health checks, snapshot backup, defragmentation, leader election issues, restore from snapshot.
- Target user: Cluster admins on self-managed Kubernetes
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior cluster admin who has operated etcd in production: quorum failures, defragmentation under load, restore-from-snapshot drills. You know that etcd is the single most critical component of the control plane, and that recovery requires care.

I will provide:
- The cluster type (self-managed kubeadm, kops, RKE, etc.)
- The etcd topology (stacked with control plane, external dedicated etcd cluster, single-node)
- The symptom (high latency, member down, quorum lost, disk full, backup question)
- `etcdctl endpoint health`, `etcdctl member list`, `etcdctl endpoint status` output
- etcd version (`etcdctl version`)
- Disk type and free space on the etcd data directory

Your job:

1. **For health check / monitoring**:
   - `etcdctl endpoint health` reports healthy or unhealthy per endpoint
   - `etcdctl endpoint status -w table` shows leader, term, DB size, raft index
   - Key signals: DB size approaching `--quota-backend-bytes` (default 2GB; 8GB is the suggested maximum), heartbeat send failures, leader changes
   - Monitor via Prometheus (etcd exposes metrics at `/metrics`)
2. **For backup**:
   - **Snapshot** via `etcdctl snapshot save` (works against any etcd member, but must target a single endpoint)
   - Snapshots include all keys; restore creates a new cluster from them
   - Frequency: at least daily; for high-change clusters every 1-6h
   - Verify with `etcdctl snapshot status <file>`
   - Store off-host immediately (a snapshot on the etcd disk is useless if that disk fails)
3. **For restore**:
   - Stop ALL etcd members
   - Run `etcdctl snapshot restore <file>` on each member, from the same snapshot file, to create a new data dir (on etcd 3.5+, `etcdutl snapshot restore` is the preferred form)
   - Update each member's config with the new `--initial-cluster` and `--initial-cluster-token`
   - Start members; verify quorum
   - **Restore is destructive**: it replaces existing data; only use it for recovery
4. **For defragmentation**:
   - etcd doesn't return space to the filesystem after deletes; defrag rebuilds the backend file
   - `etcdctl defrag --endpoints=<endpoint>` per member, one member at a time
   - Run on FOLLOWERS first, then the leader (which causes brief unavailability)
   - Should be regular (weekly to monthly depending on churn)
   - DB size approaching `--quota-backend-bytes` (default 2GB) → compact and defrag urgently
5. **For member replacement / scale**:
   - Add member: `etcdctl member add`, then start the new member with `--initial-cluster-state=existing`
   - Remove member: `etcdctl member remove <id>`, then stop the member
   - **Always maintain an odd number** of members for quorum (3 or 5)
   - Single-node etcd has no HA; quorum = 1
6. **For quorum loss**:
   - A 3-node cluster needs 2 healthy members for quorum; 2 down → cluster halted
   - Recovery from majority loss: restore from snapshot, or start a surviving member with `--force-new-cluster`
   - `--force-new-cluster` is an `etcd` server flag, not an `etcdctl` command; it rebuilds a one-member cluster from the surviving member and discards the other members' state
7. **For latency / slow apply**:
   - etcd is sensitive to disk fsync latency and needs fast SSD (NVMe preferred)
   - Check the `etcd_disk_wal_fsync_duration_seconds` metric: p99 should stay under 10ms; p99 > 100ms is bad
   - Check the network between members: RTT > 100ms causes leader instability
8. **For DB size limits**:
   - Default `--quota-backend-bytes` is 2GB
   - Approaching the quota → revisions are accumulating; compact + defrag
   - At the quota → etcd raises a NOSPACE alarm and rejects writes, so apiserver writes fail; after compact + defrag, clear it with `etcdctl alarm disarm`
9. **Compaction**:
   - `etcdctl compact <revision>` irreversibly removes older revisions
   - Auto-compaction via `--auto-compaction-mode=periodic --auto-compaction-retention=1h`

Mark DESTRUCTIVE: `etcdctl snapshot restore` (replaces data), `etcd --force-new-cluster` (loses other members), `etcdctl compact` to a recent revision (loses history needed for audit), removing a member while another is down (loses quorum).

---

Cluster + etcd topology: [DESCRIBE: kubeadm stacked / external / count]

Symptom + scope: [DESCRIBE]

`etcdctl endpoint health` and `endpoint status -w table`:
```
[PASTE]
```

`etcdctl member list`:
```
[PASTE]
```

Disk type + free space on data dir: [DESCRIBE]

Backup schedule + last successful: [DESCRIBE]
Why this prompt works
etcd is invisible until it’s broken — and when it breaks, the whole cluster is down. Most teams don’t drill restores. This prompt walks the lifecycle from health checks to backup, defrag, and restore.
How to use it
- For managed clusters (EKS/GKE/AKS), etcd is provider-managed — you don’t operate it. Skip this prompt; focus on workload concerns.
- For self-managed, treat etcd as your highest-priority component.
- Test restores in a non-production cluster before you need to do one in prod.
- Monitor backups — silent backup failure is the most common cause of lost-data incidents.
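Monitoring backups can start with something as small as checking that the newest snapshot file is non-empty and recent. The following is a minimal sketch, not a full monitoring setup; the function name `snapshot_is_fresh` and the idea of passing a maximum age in minutes are illustrative assumptions, and it relies on `find -mmin`, which GNU, BSD, and BusyBox `find` support.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE exists, is non-empty,
# and was modified within the last MAX_AGE_MIN minutes.
snapshot_is_fresh() {
  file=$1
  max_age_min=$2
  # -s: the file exists and has a size greater than zero
  [ -s "$file" ] || return 1
  # find prints the path only when its mtime is newer than the cutoff
  [ -n "$(find "$file" -mmin "-$max_age_min" 2>/dev/null)" ]
}
```

A daily job could then alert with something like `snapshot_is_fresh /backup/etcd-<date>.db 1560 || echo "etcd backup is stale" >&2`, where 1560 minutes (26 hours) is an arbitrary grace period for a daily schedule.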
Useful commands
# etcdctl setup (typical kubeadm)
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
# Health
sudo -E etcdctl endpoint health
sudo -E etcdctl endpoint status -w table
# Members
sudo -E etcdctl member list -w table
# Backup
sudo -E etcdctl snapshot save /backup/etcd-$(date +%F-%H%M).db
sudo -E etcdctl snapshot status /backup/etcd-<date>.db
# Move off-host
aws s3 cp /backup/etcd-<date>.db s3://etcd-backups/
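# Defragmentation (one endpoint at a time; list followers first, leader last)
# The env var is overridden per iteration because older etcdctl versions reject
# --endpoints when ETCDCTL_ENDPOINTS is also set.
for EP in https://127.0.0.1:2379 https://etcd2:2379 https://etcd3:2379; do
  ETCDCTL_ENDPOINTS="$EP" sudo -E etcdctl defrag
done
# Compaction (manual)
# Compacting to the current revision discards ALL older revisions; pick an
# earlier revision if history still matters.
REV=$(sudo -E etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
sudo -E etcdctl compact "$REV"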
# Database statistics
sudo -E etcdctl endpoint status --write-out=json | jq
# Member add
sudo -E etcdctl member add etcd4 --peer-urls=https://etcd4:2380
# Member remove
sudo -E etcdctl member remove <member-id>
# Restore (full procedure below)
sudo -E etcdctl snapshot restore /backup/etcd-<date>.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-1 \
  --initial-cluster=etcd-1=https://10.0.0.1:2380,etcd-2=https://10.0.0.2:2380,etcd-3=https://10.0.0.3:2380 \
  --initial-cluster-token=etcd-cluster-new \
  --initial-advertise-peer-urls=https://10.0.0.1:2380
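The defragmentation loop above visits endpoints in whatever order they are listed, while the prompt recommends followers first and the leader last. A small sketch of that ordering step follows; `order_for_defrag` is a hypothetical helper, and its input format (one `<endpoint> <true|false>` pair per line, where the second column says whether that member is the leader) is an assumption made for this example.

```shell
#!/bin/sh
# Hypothetical helper: read "<endpoint> <is_leader>" pairs on stdin and print
# the endpoints with every follower first and the leader last.
order_for_defrag() {
  awk '
    NF == 0      { next }                       # skip blank lines
    $2 == "true" { leaders[++n] = $1; next }    # hold the leader back
                 { print $1 }                   # followers pass straight through
    END          { for (i = 1; i <= n; i++) print leaders[i] }
  '
}

# One possible way to produce the pairs, assuming the etcd 3.4/3.5 JSON layout
# in which .Status.leader holds the leader's member ID:
#   sudo -E etcdctl endpoint status --cluster -w json \
#     | jq -r '.[] | "\(.Endpoint) \(.Status.leader == .Status.header.member_id)"' \
#     | order_for_defrag
```

Keeping the ordering separate from the defrag call makes it easy to review the planned order before anything touches the cluster.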
Backup automation pattern
# CronJob to snapshot daily (kubeadm-style etcd container)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          hostNetwork: true
          containers:
            - name: backup
              # Assumption: the image must provide a shell, etcdctl, and the AWS CLI
              # (plus AWS credentials). The stock etcd image does not ship the AWS
              # CLI, so substitute an image that bundles both tools.
              image: registry.k8s.io/etcd:3.5.16-0
              command:
                - sh
                - -c
                - |
                  # Compute the file name once so save and upload agree
                  SNAP=/backup/etcd-$(date +%F).db
                  ETCDCTL_API=3 etcdctl \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/etcd-certs/ca.crt \
                    --cert=/etc/etcd-certs/server.crt \
                    --key=/etc/etcd-certs/server.key \
                    snapshot save "$SNAP" && \
                  aws s3 cp "$SNAP" s3://etcd-backups/
              volumeMounts:
                - { name: etcd-certs, mountPath: /etc/etcd-certs, readOnly: true }
                - { name: backup, mountPath: /backup }
          volumes:
            - name: etcd-certs
              hostPath: { path: /etc/kubernetes/pki/etcd }
            - name: backup
              emptyDir: {}
          restartPolicy: OnFailure
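The job's command chains save and upload, but it never verifies the snapshot before shipping it. A hedged sketch of a save-then-verify wrapper that could replace the inline command is shown here; `take_snapshot` is a hypothetical name, and it assumes the connection settings are supplied through `ETCDCTL_*` environment variables as in the setup under "Useful commands".

```shell
#!/bin/sh
# Hypothetical wrapper: save a snapshot, then report success only if the file
# is non-empty and `etcdctl snapshot status` accepts it.
take_snapshot() {
  dir=$1
  snap="$dir/etcd-$(date +%F-%H%M).db"
  # etcdctl chatter goes to stderr so stdout carries only the final path
  etcdctl snapshot save "$snap" >&2 || return 1
  [ -s "$snap" ] || { echo "snapshot is empty: $snap" >&2; return 1; }
  etcdctl snapshot status "$snap" >&2 || return 1
  # Print the verified path so the caller can ship it off-host
  echo "$snap"
}
```

A caller might then run `SNAP=$(take_snapshot /backup) && aws s3 cp "$SNAP" s3://etcd-backups/`, so a failed save or a failed verification stops the upload and surfaces as a non-zero exit code.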
Restore procedure (3-node cluster, all members lost)
# 1. Verify snapshot
sudo -E etcdctl snapshot status /backup/etcd-<date>.db
# 2. Stop all etcd members (kubeadm: move the static pod manifest away)
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/ # on each control plane
# 3. Copy the SAME snapshot file to every node, then restore on each member
#    with its own --name and peer URL
# On node 1:
sudo -E etcdctl snapshot restore /backup/etcd-<date>.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-cp1 \
  --initial-cluster=etcd-cp1=https://10.0.0.1:2380,etcd-cp2=https://10.0.0.2:2380,etcd-cp3=https://10.0.0.3:2380 \
  --initial-cluster-token=etcd-cluster-restored-1 \
  --initial-advertise-peer-urls=https://10.0.0.1:2380
# Repeat on cp2, cp3 with adjusted --name and --initial-advertise-peer-urls;
# --initial-cluster and --initial-cluster-token stay identical on all nodes
# 4. Edit the saved manifest (/tmp/etcd.yaml) on each node so etcd uses the
#    restored directory (the --data-dir flag and/or the hostPath volume that
#    backs it) and the same --initial-cluster-token
# 5. Put the manifest back so kubelet starts etcd again:
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
# 6. Verify cluster
sudo -E etcdctl endpoint health
sudo -E etcdctl member list
# 7. Restart apiserver + kube-scheduler + kube-controller-manager
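Step 3 repeats one command per node with only `--name` and `--initial-advertise-peer-urls` changing, which is easy to get wrong by hand. The sketch below prints the per-member commands for review rather than running them; `print_restore_commands` is a hypothetical helper, and the fixed `--data-dir=/var/lib/etcd-restored` simply mirrors the procedure above.

```shell
#!/bin/sh
# Hypothetical generator: print one `etcdctl snapshot restore` command per
# member. Arguments: snapshot path, cluster token, then name=peer-url pairs.
print_restore_commands() {
  snapshot=$1
  token=$2
  shift 2
  # --initial-cluster is the same comma-joined list on every member
  cluster=$(printf '%s,' "$@")
  cluster=${cluster%,}
  for member in "$@"; do
    name=${member%%=*}   # text before the first "="
    url=${member#*=}     # text after the first "="
    printf 'etcdctl snapshot restore %s --data-dir=/var/lib/etcd-restored --name=%s --initial-cluster=%s --initial-cluster-token=%s --initial-advertise-peer-urls=%s\n' \
      "$snapshot" "$name" "$cluster" "$token" "$url"
  done
}
```

For the three-node example, `print_restore_commands /backup/etcd-<date>.db etcd-cluster-restored-1 etcd-cp1=https://10.0.0.1:2380 etcd-cp2=https://10.0.0.2:2380 etcd-cp3=https://10.0.0.3:2380` emits three lines, one to run on each control-plane node.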
Common findings this catches
- DB size approaching 2GB → defrag + auto-compaction.
- Heartbeat failures in logs → network jitter or slow disk; check wal_fsync_duration_seconds.
- Leader election storms → unstable network; check inter-node RTT.
- Single-node etcd in production → urgent: convert to 3-node.
- No backup scheduled → set up CronJob; verify off-host shipping works.
- Snapshot in same cluster as etcd (no off-host backup) → useless for cluster-wide disaster; ship off.
- Restore drill never done → impossible to know if backups work; schedule a drill in a non-prod cluster.
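The "DB size approaching 2GB" finding can be turned into a number by comparing the DB size against the quota. This is a minimal sketch under stated assumptions: the helper names and the 80% / 95% thresholds are illustrative, not etcd defaults, and the DB size is expected in bytes (for example the `dbSize` field of `etcdctl endpoint status -w json`).

```shell
#!/bin/sh
# 2 GiB, the default --quota-backend-bytes when the flag is unset
DEFAULT_QUOTA_BYTES=2147483648

# Hypothetical helper: integer percentage of the quota used by the DB file.
db_usage_percent() {
  db_bytes=$1
  quota_bytes=${2:-$DEFAULT_QUOTA_BYTES}
  echo $(( $db_bytes * 100 / $quota_bytes ))
}

# Hypothetical thresholds (80% warn, 95% critical); tune them to your churn.
db_usage_level() {
  pct=$(db_usage_percent "$@")
  if [ "$pct" -ge 95 ]; then
    echo critical
  elif [ "$pct" -ge 80 ]; then
    echo warn
  else
    echo ok
  fi
}
```

Pass the cluster's actual quota as the second argument when `--quota-backend-bytes` has been raised above the default.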
When to escalate
- Permanent quorum loss with no recent snapshot → consult etcd-experienced engineer; data loss may be inevitable.
- Cluster-wide latency correlated with etcd metrics — fix etcd first; cascade affects everything.
- Major version upgrade of etcd — back up first; test restore; coordinate with K8s upgrade if applicable.
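Whether a failure amounts to quorum loss follows from simple arithmetic: a cluster of n members needs floor(n/2) + 1 of them healthy. A short sketch of that calculation, with hypothetical function names:

```shell
#!/bin/sh
# Members that must be healthy for the cluster to accept writes: floor(n/2) + 1
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while quorum is still held: n - quorum
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For example, `fault_tolerance 3` and `fault_tolerance 4` both yield 1, which is why adding a fourth member buys no extra resilience and odd member counts are preferred.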