You are a senior Kubernetes engineer who has deployed many DaemonSets in production: CNI agents, log collectors (Fluent Bit), monitoring (node-exporter, cAdvisor), security (Falco), storage (CSI node drivers). You know that a DaemonSet skipping nodes is almost always a toleration issue.

I will provide:
- The DaemonSet (`kubectl get ds <name> -o yaml`), with focus on `nodeSelector`, `tolerations`, `affinity`
- `kubectl get ds <name>` showing DESIRED / CURRENT / READY / UP-TO-DATE / AVAILABLE / NODE SELECTOR
- Node count and node taints: `kubectl get nodes -o wide` and `kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, taints:.spec.taints}'`
- For nodes missing the pod: their taints, labels, and conditions
- Recent events: `kubectl get events --field-selector involvedObject.kind=DaemonSet,involvedObject.name=<ds>`

Your job:

1. **Compute expected pod placement**:
   - A DaemonSet pod runs on a node IFF the pod tolerates the node's taints AND matches the node's labels (per nodeSelector/affinity)
   - **Built-in tolerations**: the DaemonSet controller automatically adds tolerations for `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` (NoExecute, so pods survive failover blips), for `node.kubernetes.io/disk-pressure`, `node.kubernetes.io/memory-pressure`, `node.kubernetes.io/pid-pressure`, and `node.kubernetes.io/unschedulable` (NoSchedule), and for `node.kubernetes.io/network-unavailable` (NoSchedule, hostNetwork pods only)
   - Other taints (`node-role.kubernetes.io/control-plane:NoSchedule`, custom taints) require explicit tolerations

2. **For "DS has fewer pods than nodes"**:
   - Find which nodes don't have a pod: compare `kubectl get pods -l ... -o wide` to `kubectl get nodes`
   - On each missing node: check taints (`kubectl describe node <node> | grep -A5 Taints`)
   - Verify DS tolerations match those taints
   - Check nodeSelector: if the node lacks a required label, the pod won't schedule

3. **For control plane nodes excluded**:
   - Control-plane has the `node-role.kubernetes.io/control-plane:NoSchedule` taint (and historically `master:NoSchedule`)
   - To run on control plane:
     ```yaml
     tolerations:
     - key: node-role.kubernetes.io/control-plane
       operator: Exists
       effect: NoSchedule
     - key: node-role.kubernetes.io/master
       operator: Exists
       effect: NoSchedule
     ```

4. **For new nodes not getting the pod immediately**:
   - The DaemonSet controller creates a pod for each node as it joins; this should take seconds
   - Delay = controller lag, image pull on new node, or admission webhook blocking the DS pod
   - Check `kubectl get events` on the new node

5. **For rollout issues**:
   - `updateStrategy.type: RollingUpdate` (default): paced by `maxUnavailable` (default 1)
   - `updateStrategy.type: OnDelete`: manual; pods stay on the old image until deleted
   - For fast-rollout DaemonSets, `maxUnavailable: 50%` or higher (assess blast radius)

6. **For DaemonSet pods that schedule but fail**:
   - hostPort conflicts (only one pod per port per node: fine for a single DS, but two DSes cannot share a port on the same node)
   - hostNetwork: true and port already in use on host
   - hostPath mount permissions / missing path
   - Security context too restrictive for the node-level work the DS needs

7. **For DaemonSet on tainted node pools** (e.g., GPU nodes):
   - Tolerate the taint AND set nodeSelector for the specific node label
   - `tolerations` alone may pull the DS to nodes you didn't intend

8. **For sidecar-style DaemonSets** (e.g., service mesh per-node):
   - May need `priorityClassName: system-node-critical` to avoid eviction

Mark DESTRUCTIVE: removing tolerations live (under RollingUpdate, the resulting rollout removes pods from tainted nodes and does not recreate them), changing the DS selector (`spec.selector` is immutable after creation; you must delete-recreate).

---

DaemonSet: [name + namespace + what it does]
Symptom: [DESCRIBE]

`kubectl get ds <name>`:
```
[PASTE]
```

DS spec (nodeSelector, tolerations, affinity):
```yaml
[PASTE relevant .spec.template.spec]
```

Node count + relevant taints:
```
[PASTE kubectl get nodes + descriptions of missing-pod nodes]
```

Events:
```
[PASTE]
```

Why this prompt works

DaemonSet placement is deterministic given tolerations and selectors, but the failure mode (“only 8 of 10 nodes have it”) looks mysterious without checking each node’s taints. This prompt forces a node-by-node comparison.
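
To make "deterministic given tolerations and selectors" concrete, here is a minimal Python sketch of the placement check for one node. It is an illustration, not the scheduler's code: `placement_blockers` and `toleration_matches` are hypothetical helper names, affinity rules are ignored, and the caller is assumed to pass the pod's effective tolerations (explicit ones plus any the DaemonSet controller adds).

```python
from typing import Dict, List

# Taint effects that keep a new pod off a node.
BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


def toleration_matches(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Return True if one toleration tolerates one taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    # An empty key (only meaningful with operator Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint.get("key"):
        return False
    # operator defaults to Equal; Exists ignores the taint value.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def placement_blockers(node: dict, pod_spec: dict) -> List[str]:
    """List the reasons a pod with pod_spec would not be placed on node."""
    reasons = []
    tolerations = pod_spec.get("tolerations") or []
    for taint in node.get("spec", {}).get("taints") or []:
        if taint.get("effect") not in BLOCKING_EFFECTS:
            continue  # PreferNoSchedule is a soft preference, not a blocker
        if not any(toleration_matches(t, taint) for t in tolerations):
            reasons.append(f"untolerated taint {taint['key']}:{taint['effect']}")
    labels = node.get("metadata", {}).get("labels") or {}
    for key, want in (pod_spec.get("nodeSelector") or {}).items():
        if labels.get(key) != want:
            reasons.append(f"nodeSelector {key}={want} not satisfied")
    return reasons
```

An empty list means nothing in the taints or nodeSelector explains a missing pod on that node, which points toward affinity, admission webhooks, or the controller itself.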

How to use it

Compare pods-per-node vs nodes-total. That gap is your debug target.
For each missing node, check taints AND labels. Both matter.
For control plane, the standard taints are well-known; copy the tolerations stanza.
For DSes that go on specific node pools, use BOTH nodeSelector AND tolerations (one constrains, the other tolerates).

Useful commands

# DS view
kubectl get ds -A
kubectl get ds <name> -o yaml
kubectl describe ds <name>

# Per-node placement
kubectl get pods -l app=<label> -o wide --sort-by=.spec.nodeName

# Find missing nodes
NODE_LIST=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
WITH_POD=$(kubectl get pods -l app=<label> -o jsonpath='{.items[*].spec.nodeName}')
echo "$NODE_LIST" | tr ' ' '\n' | sort > /tmp/all
echo "$WITH_POD" | tr ' ' '\n' | sort > /tmp/scheduled
comm -23 /tmp/all /tmp/scheduled       # nodes WITHOUT a DS pod

# Per-node taint inspection
kubectl get nodes -o json | \
    jq '.items[] | {name:.metadata.name, taints:(.spec.taints // [])}'

# Specific node
kubectl describe node <node> | grep -A5 Taints
kubectl describe node <node> | grep -A20 Conditions

# Rollout
kubectl rollout status ds <name>
kubectl rollout history ds <name>
kubectl rollout restart ds <name>

# DaemonSet controller events (e.g. FailedCreate)
kubectl get events --field-selector involvedObject.kind=DaemonSet,involvedObject.name=<name>
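
If you prefer to work from saved JSON rather than temp files, the missing-node and taint checks can be combined. The sketch below assumes you have already parsed the output of `kubectl get nodes -o json` and `kubectl get pods -l <selector> -o json` into dicts; `nodes_missing_pod` is a hypothetical helper name, and the sample data at the bottom is made up for illustration.

```python
from typing import Dict, List


def nodes_missing_pod(nodes: dict, pods: dict) -> Dict[str, List[str]]:
    """Map each node without a DaemonSet pod to its taints as "key:effect" strings.

    nodes: parsed `kubectl get nodes -o json`
    pods:  parsed `kubectl get pods -l <selector> -o json`
    """
    covered = {
        pod["spec"]["nodeName"]
        for pod in pods.get("items", [])
        if pod.get("spec", {}).get("nodeName")  # Pending pods have no nodeName yet
    }
    missing = {}
    for node in nodes.get("items", []):
        name = node["metadata"]["name"]
        if name not in covered:
            taints = node.get("spec", {}).get("taints") or []
            missing[name] = [f"{t['key']}:{t['effect']}" for t in taints]
    return missing


# Made-up sample data: one worker with a pod, one control-plane node without.
sample_nodes = {"items": [
    {"metadata": {"name": "worker-1"}, "spec": {}},
    {"metadata": {"name": "cp-1"}, "spec": {"taints": [
        {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"}]}},
]}
sample_pods = {"items": [{"spec": {"nodeName": "worker-1"}}]}

for node_name, taints in nodes_missing_pod(sample_nodes, sample_pods).items():
    print(node_name, taints or "no taints; check labels and nodeSelector")
```

A missing node with an empty taint list is the cue to look at labels and `nodeSelector` instead of tolerations.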

Standard toleration stanzas

Run on ALL nodes (including control plane)

tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule
# Blanket toleration: matches every taint (including NoExecute) and makes the
# two entries above redundant. Use with care.
- operator: Exists

Run only on GPU nodes (taint + selector)

nodeSelector:
  nvidia.com/gpu.present: "true"
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

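Run on nodes under pressure (e.g., disk-cleanup DS)

# The DaemonSet controller already adds these two tolerations to every
# DaemonSet pod; listing them explicitly only documents the intent.
tolerations: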
- key: node.kubernetes.io/disk-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
  operator: Exists
  effect: NoSchedule
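
Before adding tolerations to a stanza, it helps to know which ones a DaemonSet pod already gets from the controller. The sketch below lists the automatically added tolerations and flags explicit entries that merely duplicate them. The helper names are hypothetical, and the comparison is an exact match on key, operator, effect, and value, which is a simplification.

```python
from typing import Dict, List, Tuple

# Tolerations the DaemonSet controller adds to every DaemonSet pod.
AUTO_TOLERATIONS = [
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists", "effect": "NoExecute"},
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists", "effect": "NoExecute"},
    {"key": "node.kubernetes.io/disk-pressure", "operator": "Exists", "effect": "NoSchedule"},
    {"key": "node.kubernetes.io/memory-pressure", "operator": "Exists", "effect": "NoSchedule"},
    {"key": "node.kubernetes.io/pid-pressure", "operator": "Exists", "effect": "NoSchedule"},
    {"key": "node.kubernetes.io/unschedulable", "operator": "Exists", "effect": "NoSchedule"},
]
# Added only when the pod template sets hostNetwork: true.
HOST_NETWORK_TOLERATION = {
    "key": "node.kubernetes.io/network-unavailable",
    "operator": "Exists",
    "effect": "NoSchedule",
}


def _identity(toleration: Dict[str, str]) -> Tuple[str, str, str, str]:
    """Normalize a toleration so two equivalent entries compare equal."""
    return (
        toleration.get("key", ""),
        toleration.get("operator", "Equal"),
        toleration.get("effect", ""),
        toleration.get("value", ""),
    )


def automatic_tolerations(host_network: bool = False) -> List[Dict[str, str]]:
    """Return the tolerations a DaemonSet pod receives without declaring them."""
    extra = [HOST_NETWORK_TOLERATION] if host_network else []
    return AUTO_TOLERATIONS + extra


def redundant_tolerations(explicit: List[Dict[str, str]],
                          host_network: bool = False) -> List[Dict[str, str]]:
    """Return the explicit tolerations that duplicate an automatic one."""
    auto = {_identity(t) for t in automatic_tolerations(host_network)}
    return [t for t in explicit if _identity(t) in auto]
```

Redundant entries are harmless; the point of the check is to stop chasing a pressure taint as the cause of a missing pod when the toleration is already present.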

Common findings this catches

DS not on control-plane nodes → no node-role.kubernetes.io/control-plane toleration. Add.
DS missing on GPU nodes → GPU taint not tolerated. Add nvidia.com/gpu toleration + nodeSelector.
DS not on new nodes joining cluster → often an admission webhook blocking the pod (check webhook logs); also rule out a slow image pull or a required label not yet applied to the new node.
DS pod CrashLoopBackOff on some nodes only → node-specific (kernel, hostPath, hardware); compare against nodes where the same pod runs cleanly.
hostPort conflict with another DS on same node → choose unique port per DS.
Pod runs but no per-node “work” happens (e.g., logs not collected) → privileged/securityContext insufficient; check what the DS needs (caps, hostPath access).
DS not rolling out new version → updateStrategy: OnDelete; need to manually delete pods.
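
The hostPort finding is easy to check mechanically. As a rough sketch, the function below scans the parsed items of `kubectl get ds -A -o json` and reports any host port claimed by more than one DaemonSet. `host_port_conflicts` is a hypothetical helper; it ignores `hostIP`, ignores `hostNetwork: true` pods (whose container ports bind on the host directly), and does not check whether the two DaemonSets actually land on the same nodes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def host_port_conflicts(daemonsets: List[dict]) -> Dict[Tuple[int, str], List[str]]:
    """Map each (hostPort, protocol) claimed by 2+ DaemonSets to their names."""
    claims = defaultdict(set)
    for ds in daemonsets:
        name = ds["metadata"]["name"]
        pod_spec = ds["spec"]["template"]["spec"]
        for container in pod_spec.get("containers", []):
            for port in container.get("ports") or []:
                host_port = port.get("hostPort")
                if host_port:
                    # protocol defaults to TCP when omitted
                    claims[(host_port, port.get("protocol", "TCP"))].add(name)
    return {key: sorted(names) for key, names in claims.items() if len(names) > 1}
```

Treat a hit as a candidate, then confirm that both DaemonSets target an overlapping set of nodes before changing ports.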

Rollout strategies

# Fast rollout (more parallelism)
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 50%
    maxSurge: 0                    # default; a non-zero maxSurge requires maxUnavailable: 0

# Conservative
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1

# Manual
updateStrategy:
  type: OnDelete
# (then `kubectl delete pod <ds-pod>` to roll forward one node)
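
To assess the blast radius of a percentage `maxUnavailable`, convert it to a pod count for your node total. The sketch below assumes the DaemonSet rule that percentages are rounded up; the function names are hypothetical, and the wave count is only a rough lower bound, since the controller keeps replacing pods continuously and does not work in strict batches.

```python
from typing import Union


def max_unavailable_pods(desired: int, max_unavailable: Union[int, str]) -> int:
    """Resolve maxUnavailable (int or "N%" string) to a pod count.

    Percentages are rounded up, as documented for DaemonSet rolling updates.
    """
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        return -(-desired * percent // 100)  # integer ceiling division
    return int(max_unavailable)


def min_rollout_waves(desired: int, max_unavailable: Union[int, str]) -> int:
    """Lower bound on how many batches are needed to replace every pod."""
    per_wave = max_unavailable_pods(desired, max_unavailable)
    if per_wave < 1:
        raise ValueError("maxUnavailable must resolve to at least 1 pod")
    return -(-desired // per_wave)


# Example: 10 nodes with the fast and conservative settings from above.
print(max_unavailable_pods(10, "50%"), min_rollout_waves(10, "50%"))
print(max_unavailable_pods(10, 1), min_rollout_waves(10, 1))
```

On a 10-node cluster, `50%` means up to 5 nodes can be without the agent at once, which may or may not be acceptable for a CNI or log-collection DaemonSet.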

When to escalate

DS skipping nodes despite correct tolerations → check kube-scheduler logs for filter rejections, or kube-controller-manager logs for DaemonSet controller errors.
Per-node hostPath issues → coordinate with node owners; may need a privileged init container or a separate repair DaemonSet.
Cluster-wide DS controller failures → engage the cluster admin; this affects all DSes.
