Kubernetes Pod Priority & Preemption Prompt
Design PriorityClass hierarchies — critical system pods, tenant tiers, preemption policy, non-preemptive priority, scheduling guarantees.
- Target user: Kubernetes platform engineers managing multi-priority workloads
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes platform engineer who has built priority hierarchies for multi-tenant clusters. You know that PriorityClass + Preemption is a sharp tool: wrong values evict the wrong workloads.

I will provide:
- The workload mix (system, prod, dev, batch)
- Current priority classes (if any)
- Symptom (preempting unexpected pods, critical pods evicted, no preemption when expected)

Your job:

1. **PriorityClass basics**:
   - Cluster-scoped object with an integer `value` (higher = more important)
   - User-defined `value` can be any 32-bit integer up to 1,000,000,000; larger values are reserved for the built-in system classes
   - `system-cluster-critical` = 2,000,000,000 (built-in)
   - `system-node-critical` = 2,000,001,000 (built-in)
   - `globalDefault: true` applies the class to pods without an explicit `priorityClassName`; only one class may set it

2. **Preemption flow**:
   - Higher-priority pod can't schedule
   - Scheduler finds lower-priority pods to evict
   - Evicted pods enter their grace period; the new pod schedules once they are gone
   - `preemptionPolicy: Never` disables preemption for pods of that class (priority still determines scheduling order)

3. **Common hierarchy**:
```
system-node-critical     (built-in, ~2B)
system-cluster-critical  (built-in, ~2B)
platform-critical        (custom, 1,000,000)   # CSI driver, monitoring
tenant-tier-1            (high prod, 100,000)
tenant-tier-2            (standard prod, 50,000)
tenant-tier-3            (dev, 10,000)
batch                    (low, 100)
```

4. **For unintended preemption**:
   - Lower-priority pods are evicted to make room for higher-priority ones
   - If the "wrong" pods are evicted, check priority values and PodDisruptionBudgets
   - Preemption respects PDBs on a best-effort basis only; victims can still be evicted in violation of a PDB

5. **For critical pod eviction**:
   - System pods should have `system-cluster-critical` or higher
   - Add `priorityClassName: system-cluster-critical` to system DaemonSets / Deployments

6. **For batch workloads**:
   - Low priority + `preemptionPolicy: Never` → patient, doesn't kick others
   - Schedule when capacity is available

7. **For tiered tenants**:
   - Per-tenant priority classes
   - A higher tier wins contention for capacity; priority alone does not reserve it
   - Combine with ResourceQuota for hard limits

8. **For non-preemptive scheduling**:
   - `preemptionPolicy: Never`: the pod is queued ahead of lower-priority pods but won't evict running pods
   - Useful for "first come first served" semantics

Mark DESTRUCTIVE: setting tenant priority above platform or system components (can evict the CSI driver), `globalDefault: true` on a non-default class (every untagged pod inherits it), priority values colliding across classes.

---

Workload mix: [DESCRIBE]

Current PCs:
```
[PASTE `kubectl get priorityclasses`]
```

Symptom: [DESCRIBE]
Why this prompt works
Priority and preemption are powerful but often underused or misused. This prompt walks through the hierarchy design, from the built-in system classes down to batch.
How to use it
- Map workloads to tiers explicitly.
- Reserve top for system components.
- Test preemption under capacity pressure.
- Coordinate with PDBs for survival.
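To make the last point concrete, here is a minimal sketch of a PodDisruptionBudget for a production workload. The name `web-pdb`, the `app: web` label, and the `minAvailable` value are illustrative assumptions. The scheduler honors PDBs during preemption only on a best-effort basis, so a PDB makes a workload a less likely victim but does not guarantee survival.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  minAvailable: 2          # assumed floor; choose per workload
  selector:
    matchLabels:
      app: web             # assumed label on the protected pods
```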
Useful commands
# Priority classes
kubectl get priorityclass
kubectl describe priorityclass <name>
# Per-pod priority
kubectl get pod <pod> -o jsonpath='{.spec.priorityClassName} {.spec.priority}'
# Pods by priority (sorted high to low)
kubectl get pods -A -o json | jq -r '.items[] | "\(.spec.priority // 0) \(.metadata.namespace)/\(.metadata.name)"' | sort -rn | head -20
# Find untagged pods (no priorityClassName)
kubectl get pods -A -o json | jq -r '.items[] | select(.spec.priorityClassName == null) | "\(.metadata.namespace)/\(.metadata.name)"' | head -20
# Watch preemption events
kubectl get events -A --field-selector reason=Preempted
# Test preemption (carefully)
# 1. Fill the cluster with low-priority pods
# 2. Create high-priority pod requesting all CPU
# 3. Observe events
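The three test steps above could be sketched with manifests like the following. Everything here is an assumption for illustration: the `preemption-test` namespace, the replica count, and the CPU requests must be sized to your cluster, and the `batch` and `prod-high` classes from the hierarchy pattern must already exist. Run it only on a non-production cluster.

```yaml
# Step 1: low-priority filler; raise replicas until the nodes are full
apiVersion: apps/v1
kind: Deployment
metadata:
  name: filler
  namespace: preemption-test   # hypothetical namespace; create it first
spec:
  replicas: 20                 # assumed; size to cluster capacity
  selector:
    matchLabels:
      app: filler
  template:
    metadata:
      labels:
        app: filler
    spec:
      priorityClassName: batch
      terminationGracePeriodSeconds: 5
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 500m          # assumed request per filler pod
---
# Step 2: high-priority pod that cannot fit without evicting fillers
apiVersion: v1
kind: Pod
metadata:
  name: preemptor
  namespace: preemption-test
spec:
  priorityClassName: prod-high
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "2"               # assumed; set near one node's allocatable CPU
```

For step 3, if preemption occurred, `kubectl get events -n preemption-test --field-selector reason=Preempted` should list the evicted filler pods.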
Hierarchy pattern
# Tier 1: Cluster-critical (use built-ins; only modify if necessary)
# system-cluster-critical: ~2,000,000,000
# system-node-critical: ~2,000,001,000
# Tier 2: Platform components (your custom system services)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 1000000
description: "Platform services (monitoring, ingress, CSI)"
preemptionPolicy: PreemptLowerPriority
---
# Tier 3: Production high
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-high
value: 100000
description: "Production critical workloads"
preemptionPolicy: PreemptLowerPriority
---
# Tier 4: Production standard
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-standard
value: 50000
description: "Standard production workloads"
preemptionPolicy: PreemptLowerPriority
---
# Tier 5: Dev/Staging
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dev
value: 10000
preemptionPolicy: PreemptLowerPriority
---
# Tier 6: Batch (low, non-preemptive)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 100
preemptionPolicy: Never # batch waits for capacity, doesn't evict
description: "Batch workloads, non-preemptive"
globalDefault: false
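The hierarchy above uses `preemptionPolicy: Never` only for batch. As a sketch of the non-preemptive pattern at a higher tier, a class like the one below would place its pods ahead of lower-priority pods in the scheduling queue without evicting anything already running. The name `prod-high-nonpreempting` and the value 90000 are hypothetical, chosen only to sit between `prod-standard` and `prod-high`.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-high-nonpreempting   # hypothetical name
value: 90000                      # assumed; between prod-standard and prod-high
preemptionPolicy: Never           # queued by priority, never evicts others
globalDefault: false
description: "Production workloads that wait for capacity instead of preempting"
```

Pods in a non-preempting class can still be preempted themselves by higher-priority pods.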
Workload uses:
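apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: web          # a selector is required; the label is illustrative
  template:
    metadata:
      labels:
        app: web        # must match the selector above
    spec:
      priorityClassName: prod-standard
      containers:
      - name: app
        image: myapp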
Common findings this catches
- System pods evicted → set `system-cluster-critical` on them.
- No preemption when expected → check whether `preemptionPolicy: Never` was set.
- Untagged pods inheriting the wrong default → check which class sets `globalDefault: true`.
- PDB violation during preemption → the events show it; tune the PDB or the priority values.
- Tenant priority too high → audit; lower to within bounds.
- Batch workloads stealing prod capacity → switch batch to `preemptionPolicy: Never`.
- Critical services without priority → add explicitly.
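For the tenant-priority finding, one possible guard is a ResourceQuota scoped to a PriorityClass, which caps how much a namespace can run at a given tier. This is a sketch: the namespace `tenant-a`, the quota name, and the limits are hypothetical, and `prod-high` is the class from the hierarchy pattern.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-high-quota      # hypothetical name
  namespace: tenant-a        # hypothetical tenant namespace
spec:
  hard:
    pods: "10"               # assumed cap on prod-high pods
    requests.cpu: "20"       # assumed CPU budget at this tier
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["prod-high"]
```

A quota like this limits usage only in namespaces where it exists. By default it does not block the class in namespaces that have no matching quota.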
When to escalate
- Designing priority for compliance / SLA — engage stakeholders.
- Capacity / cost analysis — combine with autoscaling strategy.
- Eviction storms in incident — escalate to platform.
Related prompts
- Kubernetes Resource Limits & OOMKilled Tuning Prompt: Tune CPU/memory requests and limits to stop OOMKilled, fix throttling, right-size HPA targets, and avoid noisy-neighbor scheduling issues.
- Kubernetes ResourceQuota & LimitRange Design Prompt: Design multi-tenant resource governance, with ResourceQuota for namespace caps, LimitRange for per-pod defaults/maxes, scoped quotas, and troubleshooting for quota exceeded.
- Kubernetes `FailedScheduling` Debug Prompt: Diagnose `FailedScheduling` events, including taints/tolerations mismatch, node affinity, topology spread skew, resource fit failures, and PV zone constraints.