Kubernetes GPU & Device Plugin Debug Prompt
Diagnose GPU scheduling problems on Kubernetes: NVIDIA device plugin, MIG, image/driver mismatch, and pods stuck without a GPU.
- Target user: ML platform engineers running GPU workloads
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior ML platform engineer who has run GPU workloads in production: NVIDIA device plugin, MIG partitioning, time-slicing, driver/CUDA version management.

I will provide:
- The cluster GPU type (A100, H100, T4, etc.) and topology
- Symptom (pod stuck without GPU, can't access GPU, only sees fraction)
- Device plugin status

Your job:

1. **NVIDIA device plugin**:
   - DaemonSet on GPU nodes
   - Discovers GPUs and advertises them as a schedulable resource (`nvidia.com/gpu`)
   - Requires NVIDIA driver installed on node
   - Container Toolkit (nvidia-container-runtime) needed

2. **For pod stuck without GPU**:
   - Resource request `nvidia.com/gpu: 1` set?
   - Device plugin advertising resources?
   - Node has GPU and capacity?
   - Taint/toleration for GPU node?

3. **For "GPU not visible in container"**:
   - Container Toolkit installed
   - Runtime class set to `nvidia` if needed
   - Driver version compatible with CUDA in container

4. **For MIG (Multi-Instance GPU)** on A100/H100:
   - GPU partitioned into smaller instances
   - Device plugin reports `nvidia.com/mig-1g.10gb`, etc.
   - Pod requests specific profile

5. **For time-slicing**:
   - Share one GPU across multiple pods
   - Each gets a time slice
   - Lower per-pod throughput but higher utilization

6. **For NUMA**:
   - GPU + CPU should be on same NUMA node
   - TopologyManager helps

7. **For driver / CUDA version**:
   - Container's CUDA version must be ≤ the highest CUDA version the node driver supports
   - "Forward compatibility" via cuda-compat packages

8. **For GPU operator**:
   - Manages driver install, device plugin, toolkit, DCGM exporter
   - Simplifies setup

Mark DESTRUCTIVE: changing MIG config on busy node (evicts all pods), driver downgrade mid-day, removing device plugin (existing pods OK; new can't schedule).

---

GPU type + count: [DESCRIBE]
Symptom: [DESCRIBE]
Device plugin status:
```
[PASTE kubectl get pods -n gpu-operator]
```
Pod spec:
```yaml
[PASTE]
```
Why this prompt works
Running GPUs on Kubernetes requires the whole stack to line up: node driver, container runtime, device plugin, and the CUDA version in the image. This prompt walks through each layer in turn.
How to use it
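- Verify that the driver and the device plugin are running on every GPU node.
- For pod issues, check the tolerations and the `nvidia.com/gpu` resource request.
- For MIG, plan the partitioning before scheduling workloads.
- Pin an image CUDA version that the node driver supports.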
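The CUDA pinning rule can be checked mechanically. The sketch below is a conservative comparison, assuming GNU `sort -V` is available; the function name `cuda_fits_driver` is hypothetical. It treats an image as compatible only when its CUDA major.minor is no newer than the version the driver reports, so it ignores the relaxations that CUDA minor-version compatibility and cuda-compat packages can provide.

```shell
#!/usr/bin/env bash
# cuda_fits_driver IMAGE_CUDA DRIVER_CUDA   (hypothetical helper name)
# Returns 0 when the image's CUDA version is not newer than the highest
# CUDA version the node driver reports, 1 otherwise.
cuda_fits_driver() {
  local image_cuda driver_cuda lowest
  # Compare on major.minor only: image tags carry a patch level
  # (12.4.1) while nvidia-smi reports major.minor (12.4).
  image_cuda="$(printf '%s' "$1" | cut -d. -f1,2)"
  driver_cuda="$(printf '%s' "$2" | cut -d. -f1,2)"
  # sort -V orders version strings; the image fits if it sorts first
  # (or the two versions are equal).
  lowest="$(printf '%s\n%s\n' "$image_cuda" "$driver_cuda" | sort -V | head -n1)"
  [ "$lowest" = "$image_cuda" ]
}

# Usage on a GPU node (the nvidia-smi header prints "CUDA Version: X.Y"):
#   driver_cuda="$(nvidia-smi | grep -oE 'CUDA Version: [0-9.]+' | grep -oE '[0-9.]+$')"
#   cuda_fits_driver 12.4.1 "$driver_cuda" && echo "image OK" || echo "image too new"
```

A failing check is a prompt to look closer, not proof of breakage: some images newer than the driver still run within the same CUDA major version.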
Useful commands
# GPU operator pods
kubectl get pods -n gpu-operator
# Device plugin (kube-system for a standalone install, gpu-operator when the operator manages it)
kubectl get ds -A | grep -i nvidia
# Node GPU capacity
kubectl describe node <gpu-node> | grep -A5 Capacity
kubectl describe node <gpu-node> | grep -A20 Allocatable
# GPU resource availability
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}: gpu={.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
# Test GPU pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
kubectl logs gpu-test
# On GPU node
ssh <gpu-node>
nvidia-smi
sudo systemctl status nvidia-persistenced
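The jsonpath listing in the commands above prints every node, including those with an empty `gpu=` value. A small filter makes the GPU nodes stand out. This is a sketch: the function name `gpu_nodes` is hypothetical, and it assumes the input lines have exactly the `name: gpu=N` shape that command produces.

```shell
#!/usr/bin/env bash
# gpu_nodes (hypothetical helper): reads "node-name: gpu=N" lines on
# stdin and prints "node-name N" for nodes with at least one
# allocatable GPU. Empty or zero values are dropped.
gpu_nodes() {
  awk -F': gpu=' '$2 + 0 > 0 { print $1, $2 }'
}

# Usage:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}: gpu={.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' | gpu_nodes
```

An empty result while GPU nodes exist points at the device plugin not advertising resources, rather than at the pod spec.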
Patterns
Basic GPU pod
apiVersion: v1
kind: Pod
metadata:
  name: train
spec:
  containers:
  - name: app
    image: my-ml-app:cuda12.4
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    accelerator: nvidia-h100
MIG partition (A100/H100)
# Node configured with MIG: 7 × 1g.10gb instances
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
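Before requesting a MIG profile, it helps to confirm which `nvidia.com/*` resource names the node actually advertises, since a pod asking for `nvidia.com/gpu` will not fit on a node that only exposes MIG profiles (and the reverse). The sketch below extracts those names from the node's allocatable map. The function name `advertised_gpu_resources` is hypothetical, and it assumes kubectl prints the map as JSON, which recent versions do.

```shell
#!/usr/bin/env bash
# advertised_gpu_resources (hypothetical helper): reads a node's
# allocatable map as JSON on stdin and prints one "resource:count"
# line per nvidia.com/* extended resource.
advertised_gpu_resources() {
  # Match "nvidia.com/<name>":"<count>", tolerating a space after the
  # colon, then strip the quotes and spaces.
  grep -oE '"nvidia\.com/[^"]+": *"[0-9]+"' | tr -d '" '
}

# Usage:
#   kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}' | advertised_gpu_resources
```

If the output lists only `nvidia.com/mig-*` entries, the pod spec has to request one of those exact names.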
Multiple GPUs
resources:
  limits:
    nvidia.com/gpu: 4 # whole GPUs
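Time-slicing, covered in the prompt, is configured on the device plugin rather than in the pod spec. The sketch below prints a ConfigMap holding a time-slicing config in the device plugin's `sharing.timeSlicing` format. The function name, the ConfigMap name `time-slicing-config`, the `gpu-operator` namespace, and the `any` data key are all assumptions for illustration; with the GPU operator, the ClusterPolicy also has to reference the ConfigMap through its device plugin config setting before anything changes.

```shell
#!/usr/bin/env bash
# render_time_slicing_config [REPLICAS]   (hypothetical helper)
# Prints a ConfigMap that asks the device plugin to advertise each
# physical GPU as REPLICAS schedulable nvidia.com/gpu units.
# ConfigMap name, namespace, and the "any" key are assumed values.
render_time_slicing_config() {
  local replicas="${1:-4}"
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: ${replicas}
EOF
}

# Usage (applies to the cluster):
#   render_time_slicing_config 4 | kubectl apply -f -
```

Pods keep requesting `nvidia.com/gpu: 1`; each request then maps to a share of a GPU with no memory or fault isolation between the pods sharing it.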
Common findings this catches
- Pod FailedScheduling for GPU → no node has capacity or toleration missing.
- `nvidia-smi` fails in container → Container Toolkit missing or wrong runtime.
- CUDA error mismatch → image CUDA > driver version.
- MIG capacity but pod requests whole GPU → resource type mismatch.
- Device plugin pod OOMKilled → raise the memory limit on the device plugin DaemonSet.
- GPU not visible despite scheduling → MIG/time-slicing partition; verify node config.
- Pod runs but slow → NUMA mismatch; enable TopologyManager.
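For the wrong-runtime finding, a RuntimeClass named `nvidia` is what a pod's `runtimeClassName: nvidia` resolves to. The sketch below prints one. The function name is hypothetical, and the `handler` value is an assumption: it must match the runtime name configured in containerd or CRI-O on the node. Clusters running the GPU operator typically already have this object, so check with `kubectl get runtimeclass` before applying.

```shell
#!/usr/bin/env bash
# render_nvidia_runtimeclass (hypothetical helper): prints a
# RuntimeClass that maps the name "nvidia" to a container runtime
# handler of the same name. The handler value is an assumption and
# must match the node's container runtime configuration.
render_nvidia_runtimeclass() {
  cat <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
}

# Usage (applies to the cluster):
#   render_nvidia_runtimeclass | kubectl apply -f -
# Then set spec.runtimeClassName: nvidia in the pod spec.
```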
When to escalate
- Driver upgrade across fleet — coordinate.
- Mixed GPU types causing scheduling issues — node pool design.
- ML training scaling — engage with Volcano / Kueue.
Related prompts
- Kubernetes DaemonSet Debug Prompt: Diagnose DaemonSet issues, including pods not landing on every node, taint/toleration mismatch, node selector misconfig, and rollout strategy debugging.
- Kubernetes Resource Limits & OOMKilled Tuning Prompt: Tune CPU/memory requests and limits to stop OOMKilled, fix throttling, right-size HPA targets, and avoid noisy-neighbor scheduling issues.
- Kubernetes `FailedScheduling` Debug Prompt: Diagnose `FailedScheduling` events, including taints/tolerations mismatch, node affinity, topology spread skew, resource fit failures, and PV zone constraints.