You are a senior cluster admin who has operated etcd in production: quorum failures, defragmentation under load, restore-from-snapshot drills. You know that etcd is the single most critical control-plane component and that recovery requires care.

I will provide:
- The cluster type (self-managed kubeadm, kops, RKE, etc.)
- The etcd topology (stacked with control plane, external dedicated etcd cluster, single-node)
- The symptom (high latency, member down, quorum lost, disk full, backup question)
- `etcdctl endpoint health`, `etcdctl member list`, `etcdctl endpoint status` output
- etcd version (`etcdctl version`)
- Disk type and free space on the etcd data directory

Your job:

1. **For health check / monitoring**:
   - `etcdctl endpoint health` reports healthy/unhealthy per endpoint
   - `etcdctl endpoint status -w table` shows leader, term, DB size, raft index
   - Key metrics: DB size approaching `--quota-backend-bytes` (default 2GB; 8GB is the suggested maximum), heartbeat send failures, leader changes
   - Monitor via Prometheus (etcd exposes metrics itself at `/metrics`)
2. **For backup**:
   - **Snapshot** via `etcdctl snapshot save` (works against any single etcd member)
   - Snapshots include all keys; restore creates a new cluster from them
   - Frequency: at least daily; for high-change clusters every 1-6h
   - Verify with `etcdctl snapshot status <file>` (deprecated since etcd 3.5 in favor of `etcdutl snapshot status <file>`)
   - Store off-host immediately (a snapshot on the etcd disk is useless if that disk fails)
3. **For restore**:
   - Stop ALL etcd members
   - On each member, run `etcdctl snapshot restore <file>` (or `etcdutl snapshot restore <file>` on etcd 3.5+) to create a new data dir
   - Update each member's config with the new data dir and the same `--initial-cluster` and `--initial-cluster-token` used for the restore
   - Start members; verify quorum
   - **Restore is destructive**: it replaces existing data; only use it for recovery
4. **For defragmentation**:
   - etcd doesn't return freed space to the filesystem after deletes and compaction; defrag rebuilds the backend file
   - `etcdctl defrag --endpoints=<endpoint>` per member, one member at a time
   - Run on FOLLOWERS first, then the leader (the member being defragmented is briefly unavailable)
   - Should be regular (weekly to monthly depending on churn)
   - DB size approaching `--quota-backend-bytes` (default 2GB) → compact, then defrag urgently
5. **For member replacement / scale**:
   - Add member: `etcdctl member add`, then start the new member with `--initial-cluster-state=existing`
   - Remove member: `etcdctl member remove <id>`, then stop the member
   - **Always maintain an odd number** of members for quorum (3 or 5)
   - Single-node etcd has no HA; quorum = 1
6. **For quorum loss**:
   - A 3-node cluster needs 2 healthy members for quorum; 2 down → cluster halted
   - Recovery from majority loss: restore from snapshot, or force a new cluster from a surviving member
   - `--force-new-cluster` is an etcd server flag, not an etcdctl option: starting a surviving member with it rebuilds a one-member cluster from that member's data and discards the other members
7. **For latency / slow apply**:
   - etcd is sensitive to disk fsync latency and needs a fast SSD (NVMe preferred)
   - Check the `etcd_disk_wal_fsync_duration_seconds` metric: p99 should stay under about 10ms; sustained values well above that point to a slow disk
   - Check the network between members: RTT > 100ms (the default heartbeat interval) causes leader instability
8. **For DB size limits**:
   - Default `--quota-backend-bytes` is 2GB
   - Approaching → revisions accumulating; compact + defrag
   - Hard cap → etcd raises a NOSPACE alarm and rejects writes, so apiserver writes fail and the cluster is degraded; after compacting and defragmenting, clear the alarm with `etcdctl alarm disarm`
9. **Compaction**:
   - `etcdctl compact <revision>` irreversibly removes older revisions
   - Auto-compaction via `--auto-compaction-mode=periodic --auto-compaction-retention=1h`

Mark DESTRUCTIVE: `etcdctl snapshot restore` (replaces data), `--force-new-cluster` (loses other members), `etcdctl compact` to a recent revision (loses history needed for audit), removing a member while another is down (loses quorum).

---

Cluster + etcd topology: [DESCRIBE: kubeadm stacked / external / count]

Symptom + scope: [DESCRIBE]

`etcdctl endpoint health` and `endpoint status -w table`:
```
[PASTE]
```

`etcdctl member list`:
```
[PASTE]
```

Disk type + free space on data dir: [DESCRIBE]

Backup schedule + last successful: [DESCRIBE]

Why this prompt works

etcd is invisible until it’s broken — and when it breaks, the whole cluster is down. Most teams don’t drill restores. This prompt walks the lifecycle from health checks to backup, defrag, and restore.

How to use it

For managed clusters (EKS/GKE/AKS), etcd is provider-managed — you don’t operate it. Skip this prompt; focus on workload concerns.
For self-managed clusters, treat etcd as your highest-priority component.
Test restores in a non-production cluster before you need to do one in prod.
Monitor backups — silent backup failure is the most common cause of lost-data incidents.
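One way to catch a silently failing backup job is to alert on the age of the newest snapshot instead of trusting the job's exit code. The sketch below assumes snapshots land as `*.db` files in a single directory and that GNU `stat` is available; the function names and the 26-hour default threshold are illustrative, not part of etcd.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: check how old the newest snapshot in a directory is.

# Print the age in whole hours of the newest *.db file; return 1 if none exist.
newest_snapshot_age_hours() {
    local dir=$1 newest now mtime
    newest=$(ls -1t "$dir"/*.db 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$newest")   # GNU stat; BSD/macOS would need `stat -f %m`
    echo $(( (now - mtime) / 3600 ))
}

# Print OK/CRITICAL and return 0/2, in the style of a monitoring check.
# Second argument is the maximum allowed age in hours (assumed default: 26).
check_backup_fresh() {
    local dir=$1 max_hours=${2:-26} age
    if ! age=$(newest_snapshot_age_hours "$dir"); then
        echo "CRITICAL: no snapshot found in $dir"
        return 2
    fi
    if [ "$age" -gt "$max_hours" ]; then
        echo "CRITICAL: newest snapshot is ${age}h old (limit ${max_hours}h)"
        return 2
    fi
    echo "OK: newest snapshot is ${age}h old"
}
```

Calling `check_backup_fresh /backup 26` from cron or a monitoring agent would flag a daily job that has stopped producing files. Pointing it at a mount of the off-host copy, where one exists, also covers the shipping step.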

Useful commands

# etcdctl setup (typical kubeadm)
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# Health
sudo -E etcdctl endpoint health
sudo -E etcdctl endpoint status -w table

# Members
sudo -E etcdctl member list -w table

# Backup
sudo -E etcdctl snapshot save /backup/etcd-$(date +%F-%H%M).db
sudo -E etcdctl snapshot status /backup/etcd-<date>.db
# Move off-host
aws s3 cp /backup/etcd-<date>.db s3://etcd-backups/

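# Defragmentation (one member at a time; order the list so the leader comes last)
# The endpoint is passed via ETCDCTL_ENDPOINTS because etcdctl can reject a flag
# that is also set through an ETCDCTL_* environment variable
for EP in https://127.0.0.1:2379 https://etcd2:2379 https://etcd3:2379; do
    ETCDCTL_ENDPOINTS="$EP" sudo -E etcdctl defrag
done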

# Compaction (manual): compacts up to the current revision, which irreversibly discards all older revisions
REV=$(sudo -E etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
sudo -E etcdctl compact "$REV"

# Database statistics
sudo -E etcdctl endpoint status --write-out=json | jq

# Member add
sudo -E etcdctl member add etcd4 --peer-urls=https://etcd4:2380

# Member remove
sudo -E etcdctl member remove <member-id>

# Restore (full procedure below); on etcd 3.5+ `etcdutl snapshot restore` accepts the same flags and is the preferred form
sudo -E etcdctl snapshot restore /backup/etcd-<date>.db \
    --data-dir=/var/lib/etcd-restored \
    --name=etcd-1 \
    --initial-cluster=etcd-1=https://10.0.0.1:2380,etcd-2=https://10.0.0.2:2380,etcd-3=https://10.0.0.3:2380 \
    --initial-cluster-token=etcd-cluster-new \
    --initial-advertise-peer-urls=https://10.0.0.1:2380
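The defragmentation loop above leaves the follower/leader ordering to the operator. As a rough sketch, the order could be derived from `etcdctl endpoint status -w json`, where a member is the leader when its `Status.leader` equals its own `Status.header.member_id`. The function name `defrag_order` is hypothetical, and the sketch assumes `jq` is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `etcdctl endpoint status -w json` on stdin and
# print one endpoint per line, followers first and the leader last.
defrag_order() {
    # jq sorts booleans with false before true, so followers come out first.
    # Member IDs are large integers; jq compares them as doubles, which is
    # fine for an equality check between a member's own ID and the leader ID.
    jq -r '
      map({ep: .Endpoint,
           is_leader: (.Status.leader == .Status.header.member_id)})
      | sort_by(.is_leader)
      | .[].ep'
}
```

Feeding it the status of every member, for example `sudo -E etcdctl endpoint status --cluster -w json | defrag_order`, yields a list that can drive the loop one endpoint at a time.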

Backup automation pattern

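# CronJob to snapshot daily (kubeadm-style etcd container)
# Assumes the image provides sh, etcdctl, and the aws CLI; the stock registry.k8s.io/etcd
# image does not ship the aws CLI (and may lack a shell), so a custom image may be needed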
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          hostNetwork: true
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.16-0
            command:
            - sh
            - -c
            - |
              # Compute the file name once so save and upload use the same path
              SNAP=/backup/etcd-$(date +%F).db
              ETCDCTL_API=3 etcdctl \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/etcd-certs/ca.crt \
                --cert=/etc/etcd-certs/server.crt \
                --key=/etc/etcd-certs/server.key \
                snapshot save "$SNAP" && \
              aws s3 cp "$SNAP" s3://etcd-backups/
            volumeMounts:
            - { name: etcd-certs, mountPath: /etc/etcd-certs, readOnly: true }
            - { name: backup, mountPath: /backup }
          volumes:
          - name: etcd-certs
            hostPath: { path: /etc/kubernetes/pki/etcd }
          - name: backup
            emptyDir: {}
          restartPolicy: OnFailure

Restore procedure (3-node cluster, all members lost)

# 1. Verify snapshot
sudo -E etcdctl snapshot status /backup/etcd-<date>.db

# 2. Stop all etcd members (kubeadm: move static pod manifest)
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/   # on each control plane

# 3. Restore on each member with unique --name and peer URL
# On node 1:
sudo -E etcdctl snapshot restore /backup/etcd-<date>.db \
    --data-dir=/var/lib/etcd-restored \
    --name=etcd-cp1 \
    --initial-cluster=etcd-cp1=https://10.0.0.1:2380,etcd-cp2=https://10.0.0.2:2380,etcd-cp3=https://10.0.0.3:2380 \
    --initial-cluster-token=etcd-cluster-restored-1 \
    --initial-advertise-peer-urls=https://10.0.0.1:2380

# Repeat on cp2, cp3 with adjusted --name and --initial-advertise-peer-urls

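# 4. While the manifest is still in /tmp, point it at the restored data: either change the
#    data-dir hostPath to /var/lib/etcd-restored or move the restored directory to the original path.
#    If the manifest sets --initial-cluster or --initial-cluster-token, keep them identical to step 3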
# 5. Restore manifest:
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/

# 6. Verify cluster
sudo -E etcdctl endpoint health
sudo -E etcdctl member list

# 7. Restart apiserver + kube-scheduler + kube-controller-manager
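Step 3 has to be repeated per node with only `--name` and `--initial-advertise-peer-urls` changing, which is easy to get wrong by hand. A minimal sketch that prints (but does not run) the command for each member is shown below; the function names are hypothetical, and the node names, IPs, and data directory simply mirror the example values above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: generate per-member restore commands for review.

# Build the --initial-cluster value from name=ip pairs (peer port 2380 assumed).
build_initial_cluster() {
    local pair out=""
    for pair in "$@"; do
        out+="${out:+,}${pair%%=*}=https://${pair#*=}:2380"
    done
    echo "$out"
}

# Args: snapshot path, cluster token, then name=ip pairs.
# Prints one restore command per member; nothing is executed.
print_restore_commands() {
    local snapshot=$1 token=$2; shift 2
    local cluster pair fmt
    cluster=$(build_initial_cluster "$@")
    fmt='etcdctl snapshot restore %s --data-dir=/var/lib/etcd-restored --name=%s'
    fmt+=' --initial-cluster=%s --initial-cluster-token=%s'
    fmt+=' --initial-advertise-peer-urls=https://%s:2380\n'
    for pair in "$@"; do
        printf "$fmt" "$snapshot" "${pair%%=*}" "$cluster" "$token" "${pair#*=}"
    done
}
```

For the three-node example, `print_restore_commands /backup/etcd-<date>.db etcd-cluster-restored-1 etcd-cp1=10.0.0.1 etcd-cp2=10.0.0.2 etcd-cp3=10.0.0.3` prints three lines, one to run on each control-plane node. Every line shares the same `--initial-cluster` and token, which is what lets the restored members form one cluster.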

Common findings this catches

DB size approaching the 2GB default quota → compact, then defrag; enable auto-compaction.
Heartbeat failures in logs → network jitter or slow disk; check etcd_disk_wal_fsync_duration_seconds.
Leader election storms → unstable network; check inter-node RTT.
Single-node etcd in production → urgent: convert to 3-node.
No backup scheduled → set up CronJob; verify off-host shipping works.
Snapshot in same cluster as etcd (no off-host backup) → useless for cluster-wide disaster; ship off.
Restore drill never done → impossible to know if backups work; schedule a drill in a non-prod cluster.
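The "DB size approaching the quota" finding comes down to a ratio between the DB size reported by `etcdctl endpoint status` and `--quota-backend-bytes`. A small sketch of that check follows; the function names and the 80%/95% thresholds are assumptions chosen for illustration, and the 2 GiB fallback matches the default quota mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: classify DB size against the backend quota.

# Args: db size in bytes, optional quota in bytes (defaults to 2 GiB).
quota_usage_percent() {
    local db_bytes=$1 quota_bytes=${2:-2147483648}
    echo $(( db_bytes * 100 / quota_bytes ))
}

# Prints OK / WARN / CRITICAL with the integer percentage used.
classify_db_size() {
    local pct
    pct=$(quota_usage_percent "$@")
    if [ "$pct" -ge 95 ]; then
        echo "CRITICAL ${pct}%"
    elif [ "$pct" -ge 80 ]; then
        echo "WARN ${pct}%"
    else
        echo "OK ${pct}%"
    fi
}
```

With the default quota, `classify_db_size 1073741824` reports `OK 50%`. A cluster started with a larger `--quota-backend-bytes` would pass that value as the second argument.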

When to escalate

Permanent quorum loss with no recent snapshot → consult an etcd-experienced engineer; data loss may be inevitable.
Cluster-wide latency correlated with etcd metrics — fix etcd first; cascade affects everything.
Major version upgrade of etcd — back up first; test restore; coordinate with K8s upgrade if applicable.
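Whether an outage counts as quorum loss is plain arithmetic: a cluster of n members needs floor(n/2) + 1 healthy members, so it tolerates n minus that many failures. The helper names below are hypothetical and only illustrate the calculation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for quorum arithmetic.

# Members required for quorum in a cluster of $1 members.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Members that can fail before quorum is lost.
failure_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Succeeds if $2 healthy members out of $1 total still form a quorum.
has_quorum() {
    local total=$1 healthy=$2
    [ "$healthy" -ge "$(quorum_size "$total")" ]
}

# Usage sketch:
#   quorum_size 3          # prints 2
#   failure_tolerance 4    # prints 1 (no better than 3 members)
#   has_quorum 3 1 || echo "quorum lost: restore or force a new cluster"
```

The 4-member case shows why odd sizes are recommended: a fourth member raises the quorum to 3 without tolerating any additional failure.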


Kubernetes etcd Health, Backup & Restore Prompt
