Kubernetes GPU & Device Plugin Debug Prompt

Diagnose GPU scheduling problems on Kubernetes: NVIDIA device plugin failures, MIG configuration, image/driver mismatches, and pods stuck without a GPU.

Target user: ML platform engineers running GPU workloads
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior ML platform engineer who has run GPU workloads in production — NVIDIA device plugin, MIG partitioning, time-slicing, driver/CUDA version management.

I will provide:
- The cluster GPU type (A100, H100, T4, etc.) and topology
- Symptom (pod stuck without GPU, can't access GPU, only sees a fraction of the GPU)
- Device plugin status

Your job:

1. **NVIDIA device plugin**:
   - DaemonSet on GPU nodes
   - Discovers GPUs and advertises as schedulable resource (`nvidia.com/gpu`)
   - Requires NVIDIA driver installed on node
   - Container Toolkit (nvidia-container-runtime) needed
2. **For pod stuck without GPU**:
   - Resource request `nvidia.com/gpu: 1` set?
   - Device plugin advertising resources?
   - Node has GPU and capacity?
   - Taint/toleration for GPU node?
3. **For "GPU not visible in container"**:
   - Container Toolkit installed
   - Runtime class set to `nvidia` if needed
   - Driver version compatible with CUDA in container
4. **For MIG (Multi-Instance GPU)** — A100/H100:
   - GPU partitioned into smaller instances
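   - With the mixed MIG strategy, the device plugin reports per-profile resources such as `nvidia.com/mig-1g.10gb`; with the single strategy, MIG instances appear as `nvidia.com/gpu`
   - Pod requests the specific profile it needs (profile names depend on the GPU's memory size)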
5. **For time-slicing**:
   - Share one GPU across multiple pods
   - Each gets time slice
   - Lower throughput but higher utilization
6. **For NUMA**:
   - GPU + CPU should be on same NUMA node
   - The kubelet Topology Manager (for example the `single-numa-node` policy) aligns them
7. **For driver / CUDA version**:
   - Container's CUDA version must be ≤ the highest CUDA version the node's driver supports (the "CUDA Version" shown in the `nvidia-smi` header)
   - "Forward compatibility" via cuda-compat packages
8. **For GPU operator**:
   - Manages driver install, device plugin, toolkit, DCGM exporter
   - Simplifies setup

Mark DESTRUCTIVE: changing MIG config on busy node (evicts all pods), driver downgrade mid-day, removing device plugin (existing pods OK; new can't schedule).

---

GPU type + count: [DESCRIBE]
Symptom: [DESCRIBE]
Device plugin status:
```
[PASTE kubectl get pods -n gpu-operator]
```
Pod spec:
```yaml
[PASTE]
```

Why this prompt works

Running GPUs on Kubernetes requires every layer of the stack to line up: node driver, container runtime, device plugin, and container image. This prompt walks through each layer in turn.

How to use it

  1. Verify that the driver and device plugin are running on every GPU node.
  2. For pod issues, check the tolerations and the GPU resource request.
  3. For MIG, plan the partitioning before scheduling workloads.
  4. Pin images to a CUDA version the node driver supports.
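
Step 4 can be checked mechanically. The sketch below compares a node's driver version against the minimum driver a CUDA release needs. The function name `driver_at_least` is hypothetical, and the minimum used in the usage comment (525.60.13 for CUDA 12.x on Linux) is an assumption recalled from NVIDIA's compatibility table, so confirm it against the current CUDA release notes before relying on it.

```shell
# driver_at_least DRIVER MINIMUM
# Returns 0 when DRIVER >= MINIMUM, comparing dotted version strings.
# Relies on `sort -V` (version sort from GNU coreutils).
driver_at_least() {
  driver="$1"
  minimum="$2"
  # The smaller of the two versions sorts first; if that is MINIMUM
  # (or the two are equal), the driver is new enough.
  lowest=$(printf '%s\n%s\n' "$driver" "$minimum" | sort -V | head -n 1)
  [ "$lowest" = "$minimum" ]
}

# Usage on a GPU node (requires nvidia-smi; the minimum is an assumed value):
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_at_least "$driver" 525.60.13 || echo "driver $driver is too old for CUDA 12.x images"
```

Passing the minimum as an argument keeps the version table out of the function, so the same check works for any CUDA release once you look up its requirement.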

Useful commands

# GPU operator pods
kubectl get pods -n gpu-operator

# Device plugin DaemonSet (namespace varies: kube-system for a standalone install, gpu-operator for the GPU Operator)
kubectl get ds -A | grep -i nvidia

# Node GPU capacity
# -A10 is wide enough to reach the nvidia.com/gpu line on most nodes
kubectl describe node <gpu-node> | grep -A10 Capacity
kubectl describe node <gpu-node> | grep -A10 Allocatable

# GPU resource availability
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}: gpu={.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

# Test GPU pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
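# Wait for the pod to finish before reading its output
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=120s
kubectl logs gpu-test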

# On GPU node
ssh <gpu-node>
nvidia-smi
sudo systemctl status nvidia-persistenced

Patterns

Basic GPU pod

apiVersion: v1
kind: Pod
metadata:
  name: train
spec:
  containers:
  - name: app
    image: my-ml-app:cuda12.4
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    accelerator: nvidia-h100
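
If the node's default container runtime is not the NVIDIA one, a pod can schedule and still fail to see the GPU. A variant of the pod above is sketched below, assuming a RuntimeClass named `nvidia` exists in the cluster (the GPU Operator normally creates one; on other setups treat the name as an assumption). The file name `train-runtimeclass.yaml` is illustrative.

```shell
# Write a pod manifest that selects the NVIDIA runtime explicitly.
cat > train-runtimeclass.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: train
spec:
  runtimeClassName: nvidia      # assumes this RuntimeClass exists
  containers:
  - name: app
    image: my-ml-app:cuda12.4   # placeholder image from the pattern above
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Check that the RuntimeClass exists before applying (requires cluster access):
#   kubectl get runtimeclass nvidia
#   kubectl apply -f train-runtimeclass.yaml
```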

MIG partition (A100/H100)

# Node configured with MIG: 7 × 1g.10gb instances
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
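
A pod that requests a MIG profile stays Pending unless some node advertises that exact resource name, and the same holds in reverse for `nvidia.com/gpu`. A minimal sketch for checking this is shown below. The helper name `advertises` is hypothetical, and it assumes the `key: value` layout that `kubectl describe node` prints under Capacity and Allocatable.

```shell
# advertises RESOURCE
# Reads `kubectl describe node` output on stdin and returns 0 when RESOURCE
# is listed with a count greater than zero.
advertises() {
  awk -v want="$1" '
    $1 == (want ":") && ($2 + 0) > 0 { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (requires cluster access):
#   kubectl describe node <gpu-node> | advertises nvidia.com/mig-1g.10gb \
#     && echo "profile available" || echo "profile not advertised"
```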

Multiple GPUs

resources:
  limits:
    nvidia.com/gpu: 4         # whole GPUs
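
The prompt also covers time-slicing, which has no pattern above. The sketch below writes a device plugin config for it, assuming the GPU Operator is installed in the `gpu-operator` namespace. The ConfigMap name `time-slicing-config`, the key `any`, the replica count of 4, and the ClusterPolicy name `cluster-policy` in the comments are all illustrative assumptions; check them against your install.

```shell
# Write a device plugin config that advertises each physical GPU as 4
# schedulable nvidia.com/gpu replicas (time-sliced, no memory isolation).
cat > time-slicing-config.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # illustrative name
  namespace: gpu-operator     # assumes the GPU Operator namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF

# To apply it (requires cluster access; ClusterPolicy name is assumed):
#   kubectl apply -f time-slicing-config.yaml
#   kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge \
#     -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'
```

Pods keep requesting `nvidia.com/gpu: 1`; the node simply reports four of them per physical GPU. Replicas share the GPU's memory, so one pod can still starve the others.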

Common findings this catches

  • Pod FailedScheduling for GPU → no node has free GPU capacity, or the toleration for the GPU taint is missing.
  • nvidia-smi fails in container → Container Toolkit missing or wrong runtime class.
  • CUDA version mismatch error → the image's CUDA version is newer than the node driver supports.
  • Node advertises MIG profiles but pod requests a whole GPU → resource name mismatch.
  • Device plugin OOMKilled → raise the DaemonSet's memory limit.
  • GPU missing or smaller than expected despite scheduling → node is MIG-partitioned or time-sliced; verify the node config.
  • Pod runs but slowly → possible NUMA mismatch; set a kubelet Topology Manager policy such as `single-numa-node`.
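
The first finding can be split into its two causes by reading the scheduler's own message. The sketch below is a rough classifier, not a complete parser: the function name `classify_scheduling_message` is hypothetical, and the matched phrases are assumptions based on typical FailedScheduling messages, whose wording varies between Kubernetes versions.

```shell
# classify_scheduling_message MESSAGE
# Maps a FailedScheduling message to a likely cause. Only the first
# matching cause is reported when a message lists several.
classify_scheduling_message() {
  case "$1" in
    *"Insufficient nvidia.com/"*) echo "capacity" ;;
    *taint*)                      echo "toleration" ;;
    *"didn't match"*)             echo "node-selector" ;;
    *)                            echo "other" ;;
  esac
}

# Usage (requires cluster access):
#   kubectl get events --field-selector reason=FailedScheduling \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
#     while IFS= read -r msg; do classify_scheduling_message "$msg"; done
```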

When to escalate

  • Driver upgrade across the fleet: coordinate a rollout window with workload owners.
  • Mixed GPU types causing scheduling issues: revisit the node pool design.
  • ML training at scale: evaluate a batch scheduler such as Volcano or Kueue.
