Kubernetes DaemonSet Debug Prompt
Diagnose DaemonSet issues — pods not landing on every node, taint/toleration mismatch, node selector misconfig, rollout strategy debugging.
- Target user: Kubernetes platform engineers running per-node workloads (CNI, logging, monitoring)
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes engineer who has deployed many DaemonSets in production — CNI agents, log collectors (Fluent Bit), monitoring (node-exporter, cAdvisor), security (Falco), storage (CSI node drivers). You know that a DaemonSet skipping nodes is almost always a toleration issue.
I will provide:
- The DaemonSet (`kubectl get ds <name> -o yaml`) — focus on `nodeSelector`, `tolerations`, `affinity`
- `kubectl get ds <name>` showing DESIRED / CURRENT / READY / UP-TO-DATE / NODE-SELECTOR
- Node count and node taints: `kubectl get nodes -o wide` and `kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, taints:.spec.taints}'`
- For nodes missing the pod: their taints, labels, and conditions
- Recent events: `kubectl get events --field-selector involvedObject.kind=DaemonSet,involvedObject.name=<ds>`
Your job:
1. **Compute expected pod placement**:
- A DaemonSet pod runs on a node IFF the pod tolerates the node's taints AND matches the node's labels (per nodeSelector/affinity)
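- **Built-in tolerations**: the DaemonSet controller automatically adds tolerations for `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` (`NoExecute`, so pods survive failover blips), plus `NoSchedule` tolerations for `node.kubernetes.io/disk-pressure`, `node.kubernetes.io/memory-pressure`, `node.kubernetes.io/pid-pressure`, and `node.kubernetes.io/unschedulable` (and `node.kubernetes.io/network-unavailable` for `hostNetwork` pods)
- Other taints (`node-role.kubernetes.io/control-plane:NoSchedule`, custom taints, node-pool taints such as a GPU taint) require explicit tolerations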
2. **For "DS has fewer pods than nodes"**:
- Find which nodes don't have a pod: compare `kubectl get pods -l ... -o wide` to `kubectl get nodes`
- On each missing node: check taints (`kubectl describe node <node> | grep -A5 Taints`)
- Verify DS tolerations match those taints
- Check nodeSelector — if node lacks required label, pod won't schedule
3. **For control plane nodes excluded**:
- Control-plane has `node-role.kubernetes.io/control-plane:NoSchedule` taint (and historically `master:NoSchedule`)
- To run on control plane:
```yaml
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
```
4. **For new nodes not getting the pod immediately**:
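- The DaemonSet controller creates a pod for each eligible node as it joins and the default scheduler binds it to that node; this should happen within seconds
- Delay = controller lag, image pull on new node, or admission webhook blocking the DS pod
- Check events for the new node: `kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=<node>`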
5. **For rollout issues**:
- `updateStrategy.type: RollingUpdate` (default) — controlled by `maxUnavailable` (default 1)
- `updateStrategy.type: OnDelete` — manual; pods stay on old image until deleted
- For fast-rollout DaemonSets, `maxUnavailable: 50%` or higher (assess blast radius)
6. **For DaemonSet pods that schedule but fail**:
- hostPort conflicts (only one pod per port per node — fine for DS, but not parallel DSes)
- hostNetwork: true and port already in use on host
- hostPath mount permissions / missing path
- Security context too restrictive for the node-level work the DS needs
7. **For DaemonSet on tainted node pools** (e.g., GPU nodes):
- Tolerate the taint AND set nodeSelector for the specific node label
- `tolerations` alone may pull the DS to nodes you didn't intend
8. **For sidecar-style DaemonSets** (e.g., service mesh per-node):
- May need `priorityClassName: system-node-critical` to avoid eviction
Mark DESTRUCTIVE: removing tolerations live (the resulting rollout removes DS pods from nodes whose taints are no longer tolerated and does not recreate them), changing the DS selector (`spec.selector` is immutable after creation; you must delete-recreate).
---
DaemonSet: [name + namespace + what it does]
Symptom: [DESCRIBE]
`kubectl get ds <name>`:
```
[PASTE]
```
DS spec — nodeSelector, tolerations, affinity:
```yaml
[PASTE relevant .spec.template.spec]
```
Node count + relevant taints:
```
[PASTE kubectl get nodes + descriptions of missing-pod nodes]
```
Events:
```
[PASTE]
```
Why this prompt works
DaemonSet placement is deterministic given tolerations and selectors, but the failure mode (“only 8 of 10 nodes have it”) looks mysterious without checking each node’s taints. This prompt forces a node-by-node comparison.
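The node-by-node comparison can be roughed out in shell. The sketch below is only an illustration: `untolerated_taints` is a hypothetical helper, it compares taint keys only (ignoring `operator`, `value`, and `effect`), and it assumes the input is a two-column table of node name and comma-separated taint keys, with `<none>` for untainted nodes.

```shell
#!/usr/bin/env bash
# untolerated_taints TOLERATED_KEYS < node_taint_table
#   TOLERATED_KEYS: comma-separated taint keys the DaemonSet tolerates
#   stdin: "<node> <taintkey,taintkey,...>" per line, "<none>" when untainted
# Prints "<node> <untolerated keys>" for every node that blocks the pod.
untolerated_taints() {
  awk -v tolerated="$1" '
    BEGIN {
      n = split(tolerated, keys, ",")
      for (i = 1; i <= n; i++) ok[keys[i]] = 1
    }
    $2 != "<none>" && $2 != "" {
      m = split($2, taints, ",")
      missing = ""
      for (j = 1; j <= m; j++)
        if (!(taints[j] in ok))
          missing = missing (missing == "" ? "" : ",") taints[j]
      if (missing != "") print $1, missing
    }
  '
}

# Possible live usage (the custom-columns spec is an assumption about your setup):
# kubectl get nodes --no-headers \
#   -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key' |
#   untolerated_taints "node-role.kubernetes.io/control-plane,nvidia.com/gpu"
```

Any node the helper prints is a candidate explanation for a missing pod. Confirm the effect and operator by hand before changing tolerations, since a key match alone does not prove the taint is tolerated.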
How to use it
- Compare pods-per-node vs nodes-total. That gap is your debug target.
- For each missing node, check taints AND labels. Both matter.
- For control plane, the standard taints are well-known; copy the tolerations stanza.
- For DSes that go on specific node pools, use BOTH `nodeSelector` AND `tolerations` (one constrains, the other tolerates).
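To make the gap concrete, it can help to split it in two: nodes the DaemonSet never targets (node count minus DESIRED, usually taints or selectors) and targeted nodes whose pod is not ready (DESIRED minus READY, usually pod failures). The sketch below assumes single-namespace `kubectl get ds` table output (no `-A`, so NAME is the first column); `ds_gap` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# ds_gap NODE_COUNT < `kubectl get ds` table output (header line included)
# Columns assumed: NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
# Prints, per DaemonSet, how many nodes are excluded from placement and
# how many targeted nodes lack a ready pod.
ds_gap() {
  awk -v nodes="$1" '
    NR > 1 {
      print $1, "excluded=" (nodes - $2), "not_ready=" ($2 - $4)
    }
  '
}

# Possible live usage:
# kubectl get ds -n <namespace> | ds_gap "$(kubectl get nodes --no-headers | wc -l)"
```

A non-zero `excluded` points at taints, `nodeSelector`, or affinity; a non-zero `not_ready` points at the pods themselves.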
Useful commands
```bash
# DS view
kubectl get ds -A
kubectl get ds <name> -o yaml
kubectl describe ds <name>

# Per-node placement
kubectl get pods -l app=<label> -o wide --sort-by=.spec.nodeName

# Find missing nodes
NODE_LIST=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
WITH_POD=$(kubectl get pods -l app=<label> -o jsonpath='{.items[*].spec.nodeName}')
echo "$NODE_LIST" | tr ' ' '\n' | sort > /tmp/all
echo "$WITH_POD" | tr ' ' '\n' | sort > /tmp/scheduled
comm -23 /tmp/all /tmp/scheduled # nodes WITHOUT a DS pod

# Per-node taint inspection
kubectl get nodes -o json | \
  jq '.items[] | {name:.metadata.name, taints:(.spec.taints // [])}'

# Specific node
kubectl describe node <node> | grep -A5 Taints
kubectl describe node <node> | grep -A20 Conditions

# Rollout
kubectl rollout status ds <name>
kubectl rollout history ds <name>
kubectl rollout restart ds <name>

# Pod failures
kubectl get events --field-selector involvedObject.kind=DaemonSet,involvedObject.name=<name>
```
Standard toleration stanzas
Run on ALL nodes (including control plane)
```yaml
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  - operator: Exists # tolerates every taint, which makes the two entries above redundant (use with care)
```
Run only on GPU nodes (taint + selector)
```yaml
nodeSelector:
  nvidia.com/gpu.present: "true"
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```
Run on nodes under pressure (e.g., disk-cleanup DS)
```yaml
# The DaemonSet controller already adds these two tolerations automatically;
# listing them explicitly only documents the intent.
tolerations:
  - key: node.kubernetes.io/disk-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/memory-pressure
    operator: Exists
    effect: NoSchedule
```
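A toleration only permits placement; the `nodeSelector` still has to match a label that is actually present on the node. As a rough check, the sketch below lists nodes that lack a given `key=value` label. It assumes the default `kubectl get nodes --show-labels --no-headers` layout, where the labels are the last whitespace-separated column; `nodes_missing_label` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# nodes_missing_label KEY=VALUE < `kubectl get nodes --show-labels --no-headers`
# Prints the name of every node whose label column does not contain KEY=VALUE.
nodes_missing_label() {
  awk -v want="$1" '
    {
      found = 0
      n = split($NF, labels, ",")   # labels are comma-separated in the last column
      for (i = 1; i <= n; i++)
        if (labels[i] == want) found = 1
      if (!found) print $1
    }
  '
}

# Possible live usage:
# kubectl get nodes --show-labels --no-headers |
#   nodes_missing_label "nvidia.com/gpu.present=true"
```

If a GPU node shows up in this output, the DaemonSet skips it regardless of its tolerations.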
Common findings this catches
- DS not on control-plane nodes → no `node-role.kubernetes.io/control-plane` toleration. Add.
- DS missing on GPU nodes → GPU taint not tolerated. Add `nvidia.com/gpu` toleration + `nodeSelector`.
- DS not on new nodes joining cluster → admission webhook blocking; check webhook logs.
- DS pod CrashLoopBackOff on some nodes only → node-specific (kernel, hostPath, hardware); compare with DaemonSet-running nodes that work.
- `hostPort` conflict with another DS on same node → choose unique port per DS.
- Pod runs but no per-node “work” happens (e.g., logs not collected) → privileged/securityContext insufficient; check what the DS needs (caps, hostPath access).
- DS not rolling out new version → `updateStrategy: OnDelete`; need to manually delete pods.
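The `hostPort` finding is easy to screen for mechanically. The sketch below reads one line per DaemonSet (name followed by its host ports) and reports ports claimed by more than one DaemonSet. `hostport_conflicts` is a hypothetical helper, the input format is an assumption, and a reported port is only a real conflict if the two DaemonSets can land on the same node.

```shell
#!/usr/bin/env bash
# hostport_conflicts < lines of "<daemonset-name> <hostPort> [<hostPort> ...]"
# Prints "<port> <ds1,ds2,...>" for each port used by more than one DaemonSet.
hostport_conflicts() {
  awk '
    {
      for (i = 2; i <= NF; i++) {
        if (seen[$1, $i]++) continue          # same DS listing a port twice
        owners[$i] = owners[$i] (owners[$i] == "" ? "" : ",") $1
        count[$i]++
      }
    }
    END {
      for (p in count)
        if (count[p] > 1) print p, owners[p]
    }
  ' | sort -n
}

# Possible live usage (the jsonpath expression is an assumption; adjust as needed):
# kubectl get ds -A -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.template.spec.containers[*].ports[*].hostPort}{"\n"}{end}' |
#   hostport_conflicts
```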
Rollout strategies
```yaml
# Fast rollout (more parallelism)
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 50%
    maxSurge: 0 # default: the old pod is removed before the new one starts (newer Kubernetes versions also accept maxSurge > 0)

# Conservative
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1

# Manual
updateStrategy:
  type: OnDelete
# (then `kubectl delete pod <ds-pod>` to roll forward one node)
```
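To assess the blast radius of a given `maxUnavailable`, a quick calculation helps. The sketch below assumes, per my reading of the DaemonSet API reference, that a percentage is resolved against the desired pod count and rounded up; both function names are hypothetical. The wave count is a lower-bound estimate, since the controller replaces pods continuously and not in strict batches.

```shell
#!/usr/bin/env bash
# max_unavailable_count SPEC DESIRED
#   SPEC: an integer ("1") or a percentage ("50%"); DESIRED: desired pod count.
# Prints how many pods may be down at once (percentages are rounded up).
max_unavailable_count() {
  local spec="$1" desired="$2" pct
  case "$spec" in
    *%) pct="${spec%\%}"
        echo $(( (desired * pct + 99) / 100 )) ;;
    *)  echo "$spec" ;;
  esac
}

# rollout_waves SPEC DESIRED
# Prints the minimum number of batches needed to replace every pod.
rollout_waves() {
  local per_wave
  per_wave=$(max_unavailable_count "$1" "$2")
  if [ "$per_wave" -lt 1 ]; then
    echo "maxUnavailable resolves to 0; nothing can roll" >&2
    return 1
  fi
  echo $(( ($2 + per_wave - 1) / per_wave ))
}

# Example: with 10 desired pods, "25%" allows 3 pods down at once.
# max_unavailable_count 25% 10
# rollout_waves 25% 10
```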
When to escalate
- DS skipping nodes despite correct tolerations → check `kube-scheduler` logs for filter rejections; or kube-controller-manager DS controller.
- Per-node hostPath issues → coordinate with node owners; may need a privileged init or DaemonSet repair.
- Cluster-wide DS controller failures → engage cluster admin; affects all DSes.