Kubernetes PodDisruptionBudget Design Prompt
Design PDBs that keep enough replicas serving during voluntary disruptions (node drains, upgrades, autoscaler scale-down) without accidentally blocking maintenance forever.
- Target user: Platform engineers protecting workloads during maintenance
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are an SRE who has run hundreds of node drains and rolling cluster upgrades, and has seen both unprotected workloads go to zero AND maintenance windows blocked for hours by a bad PDB.

I will provide:
- The workload type and replica count (Deployment, StatefulSet, etc.)
- Topology requirements (multi-AZ, leader/quorum, single-writer)
- Current PDB if any, and what maintenance is planned (upgrade, drain, autoscale-down)
- SLO for the service during disruptions

Your job:
1. **What a PDB does and doesn't**: be precise. PDBs only protect against VOLUNTARY disruptions that go through the Eviction API (eviction/drain). They do NOT protect against node crashes, OOMKills, or a direct `kubectl delete pod`. Set expectations.
2. **minAvailable vs maxUnavailable**: recommend one with reasoning. For a 3-replica web app, `maxUnavailable: 1` keeps working as the replica count changes; for quorum systems, pin `minAvailable` to the quorum size. Show the math for both at the given replica count.
3. **The single-replica trap**: explain that `minAvailable: 1` on a 1-replica Deployment blocks ALL voluntary eviction, so a node drain hangs until someone intervenes or a timeout fires. Recommend scaling to >=2 or accepting brief downtime, never a PDB that can never be satisfied.
4. **Percentages and rounding**: clarify how `maxUnavailable: 25%` rounds (Kubernetes rounds percentages up), and that percentage PDBs behave surprisingly at small replica counts.
5. **Selector correctness**: the PDB selector must match exactly the pods it is meant to protect. A selector that matches nothing silently protects nothing, and a pod matched by more than one PDB cannot be evicted because the Eviction API returns an error. Show how to verify with `kubectl get pdb` (`ALLOWED DISRUPTIONS`).
6. **StatefulSet quorum**: for etcd/databases, size the budget so a drain can never break quorum (3-node: `minAvailable: 2`, equivalently `maxUnavailable: 1`), and note how the StatefulSet's ordered pod management affects eviction and recovery.
7. **Interaction with autoscaler + drains**: explain how cluster-autoscaler and `kubectl drain` respect PDBs, and why an unsatisfiable PDB stalls scale-down and node recycling.
8. **Validate the drain**: provide a `kubectl drain --dry-run` style check and explain how to read `ALLOWED DISRUPTIONS: 0` as a red flag before real maintenance.

Output as: (a) the PDB YAML with field choice justified, (b) the replica/disruption math, (c) selector-verification commands, (d) a pre-maintenance checklist that catches unsatisfiable PDBs.

Bias toward: maxUnavailable for stateless, quorum-pinned minAvailable for stateful, and never a PDB that blocks all eviction.
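
As a rough illustration of the replica/disruption math the prompt asks for in steps 2 to 4, the Python sketch below approximates how allowed disruptions fall out of a PDB. The function names `scaled` and `allowed_disruptions` are hypothetical, and the logic is a simplified reading of the documented behavior (percentages round up; allowed disruptions are healthy pods minus the required healthy count), not the controller's actual code.

```python
def scaled(value, total):
    """Resolve an int or an "N%" string against total, rounding percentages up."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (percent * total + 99) // 100  # integer ceiling of percent * total / 100
    return int(value)


def allowed_disruptions(replicas, healthy, min_available=None, max_unavailable=None):
    """Approximate the ALLOWED DISRUPTIONS value for a PDB (assumed simplification)."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if max_unavailable is not None:
        # maxUnavailable is resolved against the expected replica count.
        desired_healthy = max(replicas - scaled(max_unavailable, replicas), 0)
    else:
        desired_healthy = scaled(min_available, replicas)
    # Evictions are only allowed while healthy pods exceed the required count.
    return max(healthy - desired_healthy, 0)


print(allowed_disruptions(3, 3, max_unavailable=1))      # 1: one pod may be evicted at a time
print(allowed_disruptions(1, 1, min_available=1))        # 0: the single-replica trap
print(allowed_disruptions(2, 2, max_unavailable="25%"))  # 1: 25% of 2 rounds up to 1 pod
print(allowed_disruptions(7, 7, min_available="50%"))    # 3: 50% of 7 rounds up to 4 required
```

A result of 0 with all replicas healthy corresponds to the unsatisfiable case the prompt warns about: a drain of the node hosting that pod cannot make progress.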