Kubernetes Spot Node Interruption Handling Prompt

Design graceful handling of spot/preemptible node interruptions — termination handlers, PodDisruptionBudgets, topology spread, and checkpointing — so spot savings don't cause request-dropping or job loss.

Target user: platform engineers running cost-optimized spot/preemptible Kubernetes node pools
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior Kubernetes platform engineer who has run production workloads on AWS Spot, GCP preemptible, and Azure Spot nodes, and you know how to absorb the 30-to-120-second interruption notice without dropping traffic or losing work.

I will provide:
- The workload type (stateless web, queue worker, batch job, stateful service)
- Current spot setup (node pool, termination handler in use or not, PDBs, replica counts)
- The cloud provider and its interruption-notice window

Your job:

1. **Establish the interruption signal**: explain the provider's notice (e.g. the 2-minute AWS Spot interruption notice or the 30-second GCP preemption notice) and that a termination handler (AWS Node Termination Handler, Karpenter's native interruption handling, or the cloud equivalent) must cordon and drain the node on that signal.
2. **Protect availability with PDBs**: recommend a PodDisruptionBudget that keeps a minimum number of replicas serving during voluntary drains, and warn that a spot reclaim is involuntary: the handler's drain honors the PDB, but the node is terminated when the notice expires regardless, so PDBs are best-effort, not a guarantee.
3. **Spread across failure domains**: use `topologySpreadConstraints` across zones and node pools (and a mix of spot + on-demand) so a single spot capacity reclaim can't take all replicas.
4. **Make pods drain cleanly**: verify `terminationGracePeriodSeconds`, `preStop` hooks, and readiness probes or gates so in-flight requests finish and the pod is removed from endpoints before the node dies. The grace period, including `preStop` time, must fit inside the notice window.
5. **Handle stateful/batch work**: recommend checkpointing, idempotent job design, and `restartPolicy` plus `backoffLimit` settings so a reclaimed job resumes instead of losing progress.
6. **Right-size the spot/on-demand mix**: suggest a base of on-demand for critical capacity with spot for elastic headroom, plus a fallback when spot capacity is unavailable.
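
As a reference for step 4, a pod template fragment might look like the sketch below. Everything specific in it is an illustrative assumption rather than a recommendation: a hypothetical `web` container on port 8080 with a `/healthz` endpoint, a 60-second grace period sized for a 2-minute notice, and an image that ships a `sleep` binary. Adapt the numbers to the workload and to the provider's actual notice window.

```yaml
# Fragment: merge under spec.template.spec of a Deployment.
# Names, ports, paths, and timings are hypothetical.
terminationGracePeriodSeconds: 60   # must fit inside the interruption notice
containers:
  - name: web
    image: example.com/web:1.0.0    # placeholder image
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz              # assumed health endpoint
        port: 8080
      periodSeconds: 5
    lifecycle:
      preStop:
        exec:
          # Pause before SIGTERM so endpoint removal can propagate
          # while the container keeps serving in-flight requests.
          # This time counts against terminationGracePeriodSeconds.
          command: ["sleep", "10"]
```

With a 30-second notice, the same shape applies but the grace period and `preStop` delay have to shrink accordingly.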

Output as: an interruption-handling design, the PDB and topology-spread YAML, and a resilience checklist mapped to the workload type.
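
The PDB and topology-spread YAML should resemble the following sketch. It assumes a hypothetical stateless Deployment named `web` with six replicas, and it assumes nodes carry the `karpenter.sh/capacity-type` label; other provisioners label spot capacity differently, so substitute the label the cluster actually uses.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                 # hypothetical name
spec:
  minAvailable: 4               # honored by drains, not by the reclaim itself
  selector:
    matchLabels:
      app: web
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        # Hard requirement: keep replicas balanced across zones.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
        # Soft preference: balance across spot and on-demand capacity.
        # The label key is an assumption; it varies by provisioner.
        - maxSkew: 2
          topologyKey: karpenter.sh/capacity-type
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: example.com/web:1.0.0   # placeholder image
```

The capacity-type constraint is deliberately soft (`ScheduleAnyway`) so pods still schedule when spot capacity is unavailable.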

Never put a single-replica or non-checkpointed critical workload solely on spot — an involuntary reclaim will drop it with no recovery.
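
For batch work (step 5), a resumable Job might be sketched as follows. It assumes the application itself writes and restores checkpoints; the `--checkpoint-dir` flag, the image, and the PVC name are hypothetical. It also assumes a cluster version that supports `podFailurePolicy` and a termination handler that drains through the eviction API, which is what sets the `DisruptionTarget` condition on evicted pods.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: report-batch                  # hypothetical name
spec:
  backoffLimit: 4                     # retries for genuine failures
  podFailurePolicy:
    rules:
      # Do not count evictions caused by node drains against backoffLimit.
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never            # required when podFailurePolicy is set
      terminationGracePeriodSeconds: 90   # time to flush a final checkpoint
      containers:
        - name: worker
          image: example.com/report-batch:1.0.0      # placeholder image
          args: ["--checkpoint-dir=/checkpoints"]    # hypothetical flag
          volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: report-batch-checkpoints      # hypothetical PVC
```

The replacement pod mounts the same volume, so the worker can resume from the last checkpoint instead of starting over.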
