
Kubernetes Node Cordon, Drain & Maintenance Runbook Prompt

Produce a safe, repeatable runbook for taking a node out of service for patching or hardware work, respecting PodDisruptionBudgets, local storage, and DaemonSets.

Target user: On-call engineers and SREs performing node maintenance
Difficulty: Beginner
Tools: Claude, ChatGPT

The prompt

You are a careful SRE who has both drained nodes uneventfully and watched a single careless `--force` evict a stateful pod into data loss.

I will provide:
- Cluster size, node role (control-plane vs worker), and managed vs self-managed
- Workloads on the node (stateless, StatefulSets, DaemonSets, pods with local storage)
- Whether PodDisruptionBudgets exist and the maintenance window

Your job:

1. **Pre-flight** — capture current state: `kubectl get pods -A -o wide --field-selector spec.nodeName=<node>` (without `-A` only the current namespace is listed), list PDBs with `kubectl get pdb -A`, and confirm the remaining nodes have enough spare capacity to host the evicted pods.

2. **Cordon first** — `kubectl cordon <node>` to stop new scheduling, and explain why cordon-then-observe is safer than going straight to `kubectl drain`, which cordons and starts evicting in one motion.

3. **Drain correctly** — run `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data` and explain each flag. Discuss when `--force` is acceptable (only for unmanaged standalone pods you accept losing) and when it is dangerous.

4. **Respect PDBs** — interpret a drain that stalls and keeps retrying on `Cannot evict pod as it would violate the pod's disruption budget`; the fix is more replicas or a wider window, not bypassing the PDB.

5. **Local storage caveat** — flag pods using `emptyDir` or local PVs: eviction deletes `emptyDir` data permanently, and a pod bound to a local PV is pinned to this node, so it stays Pending until the node returns. Decide per pod.

6. **Maintenance & return** — do the work, then `kubectl uncordon <node>`, verify the node reports Ready and accepts newly scheduled pods (evicted pods are not moved back automatically), and confirm no workloads are stuck Pending.

Output as: (a) the ordered command runbook with verification after each step, (b) a decision table for force/local-storage cases, (c) the abort/back-out procedure (uncordon, investigate) if drain stalls.

Never add `--force` reflexively to clear a stalled drain — a stall usually means a PDB or local-storage pod is correctly protecting state.
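
Outside the prompt itself, the command sequence it asks for can be sketched as a small set of shell helpers. This is a minimal illustration, not a definitive runbook: the function names, the `DRY_RUN` switch, and the `DRAIN_TIMEOUT` default of `10m` are assumptions introduced here, and the helpers only print commands unless `DRY_RUN=0` is set.

```shell
# Sketch of the cordon -> observe -> drain -> uncordon sequence.
# Hypothetical helper names; DRY_RUN defaults to 1, so commands are
# printed with a leading "+" instead of being executed.
DRY_RUN="${DRY_RUN:-1}"
DRAIN_TIMEOUT="${DRAIN_TIMEOUT:-10m}"  # assumption: size this to your window

# Print the command in dry-run mode, otherwise execute it as given.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: capture what runs on the node, which PDBs exist, and node status.
preflight() {
  run kubectl get pods -A -o wide --field-selector "spec.nodeName=$1"
  run kubectl get pdb -A
  run kubectl get nodes
}

# Step 2: stop new scheduling, then pause and observe before draining.
cordon_node() {
  run kubectl cordon "$1"
}

# Step 3: evict pods. Deliberately no --force: a stalled drain should be
# investigated (PDB, local storage), not overridden.
drain_node() {
  run kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data \
    --timeout="$DRAIN_TIMEOUT"
}

# Step 6 (also the back-out path): return the node to service and check
# for pods left Pending.
return_node() {
  run kubectl uncordon "$1"
  run kubectl wait --for=condition=Ready "node/$1" --timeout=5m
  run kubectl get pods -A --field-selector status.phase=Pending
}
```

A cautious way to use it is to source the file, call `preflight <node>` and `drain_node <node>` in dry-run mode to review the exact commands, and only then rerun each step with `DRY_RUN=0`. The `--timeout` on the drain makes a PDB stall surface as a failure instead of retrying indefinitely, which is the cue to run `return_node` and investigate.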
