Kubernetes Node Cordon, Drain & Maintenance Runbook Prompt
Produce a safe, repeatable runbook for taking a node out of service for patching or hardware work, respecting PodDisruptionBudgets, local storage, and DaemonSets.
- Target user: On-call engineers and SREs performing node maintenance
- Difficulty: Beginner
- Tools: Claude, ChatGPT
The prompt
You are a careful SRE who has both drained nodes uneventfully and watched a single careless `--force` evict a stateful pod into data loss.

I will provide:
- Cluster size, node role (control-plane vs worker), and managed vs self-managed
- Workloads on the node (stateless, StatefulSets, DaemonSets, pods with local storage)
- Whether PodDisruptionBudgets exist and the maintenance window

Your job:
1. **Pre-flight**: capture current state with `kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>` (without `--all-namespaces` only the current namespace is listed), check PDBs with `kubectl get pdb --all-namespaces`, and confirm there is enough spare capacity elsewhere to host the evicted pods.
2. **Cordon first**: run `kubectl cordon <node>` to stop new scheduling, and explain why cordon-then-observe is safer than cordon+drain in one motion.
3. **Drain correctly**: run `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data` and explain each flag. Discuss when `--force` is acceptable (only for unmanaged standalone pods you accept losing, since nothing will recreate them) and when it is dangerous.
4. **Respect PDBs**: interpret a drain that stalls on `Cannot evict pod as it would violate the pod's disruption budget`. The fix is more replicas or a wider window, not bypassing the PDB.
5. **Local storage caveat**: flag pods using `emptyDir` or local PVs. Draining deletes emptyDir data, and pods bound to a local PV cannot reschedule onto another node, so they stay Pending until this node returns. Decide per pod.
6. **Maintenance & return**: do the work, then run `kubectl uncordon <node>`, verify the node goes Ready and receives pods again, and confirm no workloads are stuck Pending.

Output as:
(a) the ordered command runbook with verification after each step,
(b) a decision table for force/local-storage cases,
(c) the abort/back-out procedure (uncordon, investigate) if the drain stalls.

Never add `--force` reflexively to clear a stalled drain: a stall usually means a PDB or local-storage pod is correctly protecting state.
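For orientation, the ordered steps above could be sketched as a small set of shell functions. This is a minimal illustration of the kind of runbook skeleton the prompt aims to elicit, not a definitive implementation. The function names, the `DRY_RUN` switch, and the `300s`/`120s` timeouts are assumptions introduced here; only the `kubectl` subcommands and flags are standard.

```shell
#!/usr/bin/env bash
# Sketch of the cordon -> drain -> uncordon flow.
# Function names, DRY_RUN, and the timeout values are illustrative assumptions.

# Print the command instead of running it when DRY_RUN=1 (hypothetical switch).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: capture what is on the node and which PDBs may block eviction.
preflight() {
  run kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
  run kubectl get pdb --all-namespaces
}

# Step 2: stop new scheduling, then check the node status.
cordon_node() {
  run kubectl cordon "$1"
  run kubectl get node "$1"   # STATUS should include SchedulingDisabled
}

# Step 3: evict pods. Deliberately no --force: a stall should be
# investigated (PDB, local storage), not bypassed.
drain_node() {
  run kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}

# Step 6 (also the back-out path): make the node schedulable again and verify.
return_node() {
  run kubectl uncordon "$1"
  run kubectl wait --for=condition=Ready "node/$1" --timeout=120s
  run kubectl get pods --all-namespaces --field-selector status.phase=Pending
}
```

With `DRY_RUN=1`, calling `drain_node worker-1` prints the drain command instead of running it, which is a cheap way to review the sequence before touching a real node (`worker-1` is a placeholder name). If the drain times out, `return_node` doubles as the back-out step before investigating.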