Kubernetes DaemonSet Debug Prompt

Diagnose DaemonSet issues — pods not landing on every node, taint/toleration mismatch, node selector misconfig, rollout strategy debugging.

Target user
Kubernetes platform engineers running per-node workloads (CNI, logging, monitoring)
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are a senior Kubernetes engineer who has deployed many DaemonSets in production — CNI agents, log collectors (Fluent Bit), monitoring (node-exporter, cAdvisor), security (Falco), storage (CSI node drivers). You know that a DaemonSet skipping nodes is almost always a toleration issue.

I will provide:
- The DaemonSet (`kubectl get ds <name> -o yaml`) — focus on `nodeSelector`, `tolerations`, `affinity`
- `kubectl get ds <name>` showing DESIRED / CURRENT / READY / UP-TO-DATE / AVAILABLE / NODE SELECTOR
- Node count and node taints: `kubectl get nodes -o wide` and `kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, taints:.spec.taints}'`
- For nodes missing the pod: their taints, labels, and conditions
- Recent events: `kubectl get events --field-selector involvedObject.kind=DaemonSet,involvedObject.name=<ds>`

Your job:

1. **Compute expected pod placement**:
   - A DaemonSet pod runs on a node IFF the pod tolerates the node's taints AND matches the node's labels (per nodeSelector/affinity)
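   - **Built-in tolerations**: the DaemonSet controller automatically adds tolerations for `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` (NoExecute, so pods survive failover blips), for `node.kubernetes.io/disk-pressure`, `memory-pressure`, `pid-pressure`, and `unschedulable` (NoSchedule), and for `node.kubernetes.io/network-unavailable` (NoSchedule, host-network pods only)
   - Other taints (`node-role.kubernetes.io/control-plane:NoSchedule`, custom taints such as a GPU pool taint) require explicit tolerations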
2. **For "DS has fewer pods than nodes"**:
   - Find which nodes don't have a pod: compare `kubectl get pods -l ... -o wide` to `kubectl get nodes`
   - On each missing node: check taints (`kubectl describe node <node> | grep -A5 Taints`)
   - Verify DS tolerations match those taints
   - Check nodeSelector — if node lacks required label, pod won't schedule
3. **For control plane nodes excluded**:
   - Control-plane nodes carry the `node-role.kubernetes.io/control-plane:NoSchedule` taint (and, on older clusters, the legacy `node-role.kubernetes.io/master:NoSchedule`)
   - To run on control plane:
     ```yaml
     tolerations:
     - key: node-role.kubernetes.io/control-plane
       operator: Exists
       effect: NoSchedule
     - key: node-role.kubernetes.io/master
       operator: Exists
       effect: NoSchedule
     ```
4. **For new nodes not getting the pod immediately**:
   - The DaemonSet controller creates a pod for each node as it joins and the default scheduler binds it; this should take seconds
   - Delay = controller lag, image pull on new node, or admission webhook blocking the DS pod
   - Check `kubectl get events` on the new node
5. **For rollout issues**:
   - `updateStrategy.type: RollingUpdate` (default) — controlled by `maxUnavailable` (default 1)
   - `updateStrategy.type: OnDelete` — manual; pods stay on old image until deleted
   - For fast-rollout DaemonSets, `maxUnavailable: 50%` or higher (assess blast radius)
6. **For DaemonSet pods that schedule but fail**:
   - hostPort conflicts (only one pod per port per node — fine for DS, but not parallel DSes)
   - hostNetwork: true and port already in use on host
   - hostPath mount permissions / missing path
   - Security context too restrictive for the node-level work the DS needs
7. **For DaemonSet on tainted node pools** (e.g., GPU nodes):
   - Tolerate the taint AND set nodeSelector for the specific node label
   - `tolerations` alone may pull the DS to nodes you didn't intend
8. **For sidecar-style DaemonSets** (e.g., service mesh per-node):
   - May need `priorityClassName: system-node-critical` to avoid eviction

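Mark DESTRUCTIVE: removing tolerations live (pods on nodes whose taints are no longer tolerated are deleted, either at once or as the rollout reaches them, and are not recreated), changing the DS selector (`spec.selector` is immutable after creation; you must delete and recreate the DaemonSet).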

---

DaemonSet: [name + namespace + what it does]
Symptom: [DESCRIBE]
`kubectl get ds <name>`:
```
[PASTE]
```
DS spec — nodeSelector, tolerations, affinity:
```yaml
[PASTE relevant .spec.template.spec]
```
Node count + relevant taints:
```
[PASTE kubectl get nodes + descriptions of missing-pod nodes]
```
Events:
```
[PASTE]
```

Why this prompt works

DaemonSet placement is deterministic given tolerations and selectors, but the failure mode (“only 8 of 10 nodes have it”) looks mysterious without checking each node’s taints. This prompt forces a node-by-node comparison.

How to use it

  1. Compare pods-per-node vs nodes-total. That gap is your debug target.
  2. For each missing node, check taints AND labels. Both matter.
  3. For control plane, the standard taints are well-known; copy the tolerations stanza.
  4. For DSes that go on specific node pools, use BOTH nodeSelector AND tolerations (one constrains, the other tolerates).
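
The label half of step 2 can be checked in one pass instead of reading each node description. A minimal sketch, assuming the DaemonSet's `nodeSelector` can be written as a kubectl label selector; the function name `nodes_without_labels` is made up for illustration:

```shell
# List nodes that do NOT carry the labels a DaemonSet's nodeSelector requires.
# Usage: nodes_without_labels 'nvidia.com/gpu.present=true'
nodes_without_labels() {
    selector=$1
    # Nodes that satisfy the selector, one "node/<name>" per line.
    matching=$(kubectl get nodes -l "$selector" -o name)
    kubectl get nodes -o name | while IFS= read -r node; do
        # Print every node that is absent from the matching set.
        if ! printf '%s\n' "$matching" | grep -qxF -- "$node"; then
            printf '%s\n' "${node#node/}"
        fi
    done
}
```

Any node this prints is ruled out by the selector regardless of its taints, so adding tolerations alone will not place a pod there.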

Useful commands

# DS view
kubectl get ds -A
kubectl get ds <name> -o yaml
kubectl describe ds <name>

# Per-node placement
kubectl get pods -l app=<label> -o wide --sort-by=.spec.nodeName

# Find missing nodes
NODE_LIST=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
WITH_POD=$(kubectl get pods -l app=<label> -o jsonpath='{.items[*].spec.nodeName}')
echo "$NODE_LIST" | tr ' ' '\n' | sort > /tmp/all
echo "$WITH_POD" | tr ' ' '\n' | sort > /tmp/scheduled
comm -23 /tmp/all /tmp/scheduled       # nodes WITHOUT a DS pod

# Per-node taint inspection
kubectl get nodes -o json | \
    jq '.items[] | {name:.metadata.name, taints:(.spec.taints // [])}'

# Specific node
kubectl describe node <node> | grep -A5 Taints
kubectl describe node <node> | grep -A20 Conditions

# Rollout
kubectl rollout status ds <name>
kubectl rollout history ds <name>
kubectl rollout restart ds <name>

# DaemonSet events (failed pod creations are recorded on the DS object)
kubectl get events --field-selector involvedObject.kind=DaemonSet,involvedObject.name=<name>
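
The status counters behind `kubectl get ds` can be turned into a first triage hint. The sketch below is one possible reading of those counters, not an official diagnostic: the function names and the hint wording are invented here, and it assumes the `.status` fields `desiredNumberScheduled`, `currentNumberScheduled`, `numberReady`, and `updatedNumberScheduled` are populated.

```shell
# Pure helper: turn counters into hints.
# Arguments: nodes desired current ready updated
classify_ds_gap() {
    nodes=$1 desired=$2 current=$3 ready=$4 updated=$5
    hints=0
    if [ "$desired" -lt "$nodes" ]; then
        echo "placement: $((nodes - desired)) node(s) excluded by nodeSelector, affinity, or untolerated taints"
        hints=1
    fi
    if [ "$current" -lt "$desired" ]; then
        echo "creation: $((desired - current)) eligible node(s) have no pod yet"
        hints=1
    fi
    if [ "$ready" -lt "$current" ]; then
        echo "health: $((current - ready)) pod(s) scheduled but not ready"
        hints=1
    fi
    if [ "$updated" -lt "$desired" ]; then
        echo "rollout: $((desired - updated)) pod(s) still on the old template"
        hints=1
    fi
    if [ "$hints" -eq 0 ]; then
        echo "ok: every eligible node has a ready, up-to-date pod"
    fi
}

# Wrapper: fetch the counters from the cluster and classify them.
# Usage: ds_gap_report <namespace> <daemonset-name>
ds_gap_report() {
    ns=$1 name=$2
    nodes=$(kubectl get nodes -o name | wc -l | tr -d ' ')
    status=$(kubectl -n "$ns" get ds "$name" -o jsonpath='{.status.desiredNumberScheduled} {.status.currentNumberScheduled} {.status.numberReady} {.status.updatedNumberScheduled}')
    # Word splitting is intended: $status holds four space-separated numbers.
    set -- $status
    classify_ds_gap "$nodes" "${1:-0}" "${2:-0}" "${3:-0}" "${4:-0}"
}
```

A `placement` hint points at taints and labels, while `creation` and `health` hints point at admission, image pull, or host-level problems.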

Standard toleration stanzas

Run on ALL nodes (including control plane)

tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule
- operator: Exists       # tolerates every taint, so the two entries above become redundant (use with care)

Run only on GPU nodes (taint + selector)

nodeSelector:
  nvidia.com/gpu.present: "true"
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

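Run on nodes under pressure (e.g., disk-cleanup DS; the DaemonSet controller already adds these tolerations automatically, so the stanza only makes them explicit)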

tolerations:
- key: node.kubernetes.io/disk-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
  operator: Exists
  effect: NoSchedule

Common findings this catches

  • DS not on control-plane nodes → no node-role.kubernetes.io/control-plane toleration. Add.
  • DS missing on GPU nodes → GPU taint not tolerated. Add nvidia.com/gpu toleration + nodeSelector.
  • DS not on new nodes joining cluster → admission webhook blocking; check webhook logs.
  • DS pod CrashLoopBackOff on some nodes only → node-specific (kernel, hostPath, hardware); compare with DaemonSet-running nodes that work.
  • hostPort conflict with another DS on same node → choose unique port per DS.
  • Pod runs but no per-node “work” happens (e.g., logs not collected) → privileged/securityContext insufficient; check what the DS needs (caps, hostPath access).
  • DS not rolling out new version → updateStrategy: OnDelete; need to manually delete pods.
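
For the hostPort finding, it helps to see which host ports are already claimed on the affected node before choosing a new one. A rough sketch, assuming kubectl's JSONPath output handles the nested `range` expressions as written; `host_ports_on_node` is a hypothetical helper name, and ports opened by `hostNetwork: true` pods without a declared `hostPort` are not covered:

```shell
# Print "<hostPort><TAB><namespace>/<pod>" for every hostPort declared on a node.
# Usage: host_ports_on_node <node-name>
host_ports_on_node() {
    node=$1
    kubectl get pods -A --field-selector "spec.nodeName=$node" \
        -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{range .spec.containers[*]}{range .ports[*]}{.hostPort}{" "}{end}{end}{"\n"}{end}' |
    # Field 1 is the pod, field 2 is a space-separated list of hostPorts
    # (blank entries for ports without a hostPort are dropped by split).
    awk -F '\t' '{ n = split($2, ports, " "); for (i = 1; i <= n; i++) print ports[i] "\t" $1 }' |
    sort -n
}
```

Any port in this list is unavailable to a second DaemonSet on that node.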

Rollout strategies

# Fast rollout (more parallelism)
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 50%
    maxSurge: 0                    # default; a non-zero maxSurge is only valid with maxUnavailable: 0

# Conservative
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1

# Manual
updateStrategy:
  type: OnDelete
# (then `kubectl delete pod <ds-pod>` to roll forward one node)
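
To assess the blast radius of a percentage setting, convert it to a pod count first. The sketch below assumes the rounding described in the DaemonSet API reference, where a percentage `maxUnavailable` is rounded up; `max_unavailable_pods` is an illustrative name:

```shell
# Convert a maxUnavailable setting into an absolute pod count.
# Usage: max_unavailable_pods <desired-pod-count> <maxUnavailable>
max_unavailable_pods() {
    desired=$1 setting=$2
    case "$setting" in
        *%)
            pct=${setting%\%}
            # Integer ceiling of desired * pct / 100.
            echo $(( (desired * pct + 99) / 100 ))
            ;;
        *)
            # Already an absolute number.
            echo "$setting"
            ;;
    esac
}

# Example: max_unavailable_pods 7 50%
```

With 7 eligible nodes, `50%` allows 4 pods to be down at once, which is more than half of the fleet.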

When to escalate

  • DS skipping nodes despite correct tolerations → check kube-scheduler logs for filter rejections, or the DaemonSet controller logs in kube-controller-manager.
  • Per-node hostPath issues — coordinate with node owners; may need a privileged init or DaemonSet repair.
  • Cluster-wide DS controller failures — engage cluster admin; affects all DSes.
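
Before escalating, the scheduler's own rejection reasons are often visible as `FailedScheduling` events without access to control-plane logs. A sketch under two assumptions: the events have not yet expired, and DaemonSet pod names start with the DaemonSet name followed by a dash. `ds_scheduling_failures` is a hypothetical helper name:

```shell
# Show FailedScheduling events for pods belonging to one DaemonSet.
# Usage: ds_scheduling_failures <namespace> <daemonset-name>
ds_scheduling_failures() {
    ns=$1 ds=$2
    kubectl -n "$ns" get events \
        --field-selector reason=FailedScheduling,involvedObject.kind=Pod \
        -o jsonpath='{range .items[*]}{.involvedObject.name}{"\t"}{.message}{"\n"}{end}' |
    # Keep only pods whose name begins with "<ds>-". A workload that shares
    # the same name prefix would also match, so treat the result as a hint.
    awk -F '\t' -v prefix="${ds}-" 'index($1, prefix) == 1'
}
```

If this returns nothing and pods are still missing, the gap is more likely on the controller side, which is the point at which to involve the cluster admin.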
