You are a senior ML platform engineer who has run GPU workloads in production: NVIDIA device plugin, MIG partitioning, time-slicing, driver/CUDA version management.

I will provide:
- The cluster GPU type (A100, H100, T4, etc.) and topology
- Symptom (pod stuck without GPU, can't access GPU, only sees fraction)
- Device plugin status

Your job:

1. **NVIDIA device plugin**:
   - Runs as a DaemonSet on GPU nodes
   - Discovers GPUs and advertises them as a schedulable resource (`nvidia.com/gpu`)
   - Requires the NVIDIA driver installed on the node
   - Requires the NVIDIA Container Toolkit (nvidia-container-runtime)

2. **For pod stuck without GPU**:
   - Resource request `nvidia.com/gpu: 1` set?
   - Device plugin advertising resources?
   - Node has a GPU and free capacity?
   - Taint/toleration for the GPU node?

3. **For "GPU not visible in container"**:
   - Container Toolkit installed
   - Runtime class set to `nvidia` if needed
   - Driver version compatible with the CUDA version in the container

4. **For MIG (Multi-Instance GPU)** on A100/H100:
   - GPU is partitioned into smaller instances
   - Device plugin reports `nvidia.com/mig-1g.10gb`, etc. (per-profile resources under the mixed MIG strategy)
   - Pod requests a specific profile

5. **For time-slicing**:
   - Shares one GPU across multiple pods
   - Each pod gets a time slice, with no memory or fault isolation between pods
   - Lower per-pod throughput but higher utilization

6. **For NUMA**:
   - GPU and CPU should be on the same NUMA node
   - The kubelet Topology Manager helps

7. **For driver / CUDA version**:
   - The container's CUDA version must be no newer than the highest CUDA version the node driver supports (the "CUDA Version" that `nvidia-smi` reports)
   - "Forward compatibility" is available via cuda-compat packages

8. **For GPU operator**:
   - Manages driver install, device plugin, toolkit, DCGM exporter
   - Simplifies setup

Mark DESTRUCTIVE: changing MIG config on a busy node (evicts all pods), driver downgrade mid-day, removing the device plugin (existing pods OK; new pods can't schedule).

---

GPU type + count: [DESCRIBE]

Symptom: [DESCRIBE]

Device plugin status:
```
[PASTE kubectl get pods -n gpu-operator]
```

Pod spec:
```yaml
[PASTE]
```

Why this prompt works

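GPU workloads on Kubernetes only run when four layers line up: the node driver, the container runtime (NVIDIA Container Toolkit), the device plugin, and the CUDA version in the image. This prompt walks through each layer in order.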

How to use it

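Verify that the driver and the device plugin are running on the GPU node.
For pod issues, check the GPU resource request and the tolerations first.
For MIG, plan the partition layout before scheduling workloads; changing it on a busy node evicts running pods.
Pin an image CUDA version that the node driver supports.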
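
The first two steps can be sketched as one shell function that stops at the first layer that fails. This is a minimal sketch, not a definitive check: `gpu_triage` is a hypothetical helper name, and it assumes the device plugin pod name contains `device-plugin`, which holds for the common installs but may not hold for yours.

```shell
# Hypothetical helper: check plugin, then advertised capacity, for one node.
gpu_triage() {
  local node="$1"
  local gpus

  # 1. Is a device plugin pod Running on this node? (pod name is an assumption)
  if ! kubectl get pods -A -o wide --field-selector "spec.nodeName=${node}" \
      | grep -Eqi 'device-plugin.*Running'; then
    echo "FAIL: no running device plugin pod on ${node}"
    return 1
  fi

  # 2. Does the node advertise nvidia.com/gpu as allocatable?
  gpus="$(kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')"
  if [ -z "$gpus" ] || [ "$gpus" = "0" ]; then
    echo "FAIL: ${node} advertises no nvidia.com/gpu"
    return 2
  fi

  echo "OK: ${node} advertises ${gpus} GPU(s)"
}
```

Call it as `gpu_triage <gpu-node>`. A failure at step 1 points at the plugin or the driver beneath it; a failure at step 2 with a healthy plugin pod usually means the plugin could not see the GPUs.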

Useful commands

# GPU operator pods
kubectl get pods -n gpu-operator

# Device plugin DaemonSet (kube-system for a standalone install, gpu-operator for the operator)
kubectl get ds -A | grep -i nvidia

# Node GPU capacity (nvidia.com/gpu is listed after memory, so a 5-line window can miss it)
kubectl describe node <gpu-node> | grep -A10 Capacity
kubectl describe node <gpu-node> | grep -A20 Allocatable

# GPU resource availability
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}: gpu={.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

# Test GPU pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
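# Wait for the pod to finish before reading logs (jsonpath wait needs kubectl 1.23+)
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=120s
kubectl logs gpu-test
kubectl delete pod gpu-test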

# On GPU node
ssh <gpu-node>
nvidia-smi
sudo systemctl status nvidia-persistenced
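
On a larger cluster, the "GPU resource availability" query above prints one line per node, and the nodes that matter are the ones advertising nothing. A small filter for that output might look like the following; `nodes_without_gpu` is a hypothetical name, and it assumes the exact `name: gpu=N` line format produced by that jsonpath query.

```shell
# Hypothetical filter: reads "node: gpu=N" lines on stdin and prints the
# names of nodes that advertise no GPUs (empty value or 0).
nodes_without_gpu() {
  awk -F': gpu=' '$2 == "" || $2 == "0" { print $1 }'
}
```

Usage: pipe the jsonpath command into it, for example `kubectl get nodes -o jsonpath='...' | nodes_without_gpu`. GPU nodes that show up in the result are the ones where the device plugin is not advertising resources.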

Patterns

Basic GPU pod

apiVersion: v1
kind: Pod
metadata:
  name: train
spec:
  containers:
  - name: app
    image: my-ml-app:cuda12.4
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    accelerator: nvidia-h100   # custom label; GPU Feature Discovery also sets nvidia.com/gpu.product

MIG partition (A100/H100)

# Node configured with MIG: 7 × 1g.10gb instances (mixed strategy exposes per-profile resources)
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
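
Getting a node into that layout is the DESTRUCTIVE part. With the GPU operator, the MIG manager watches the `nvidia.com/mig.config` node label, so a repartition can be sketched as drain, then relabel. This assumes the GPU operator is installed and that a profile named `all-1g.10gb` exists in its MIG configuration; `apply_mig_profile` is a hypothetical helper name.

```shell
# Sketch: repartition one node via the GPU operator's MIG manager.
# DESTRUCTIVE: drains the node, so every pod on it is evicted.
apply_mig_profile() {
  local node="$1"
  local profile="${2:-all-1g.10gb}"   # profile name is an assumption

  # Drain first so no workload is holding the GPU during repartitioning.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data &&
    kubectl label node "$node" "nvidia.com/mig.config=${profile}" --overwrite
}
```

After the label is applied, watch the node's `nvidia.com/mig.config.state` label until it reports success, then run `kubectl uncordon <gpu-node>`.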

Multiple GPUs

resources:
  limits:
    nvidia.com/gpu: 4         # whole GPUs
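
Time-slicing

Time-slicing has no pod-side pattern: pods still request `nvidia.com/gpu: 1`, and the sharing is configured on the device plugin. The sketch below renders a device plugin config with a replica count. The ConfigMap name, the `any` key, and the `gpu-operator` namespace are assumptions for illustration; `render_time_slicing_config` is a hypothetical helper.

```shell
# Sketch: print a time-slicing ConfigMap for the NVIDIA device plugin.
# Name, key ("any"), and namespace are illustrative assumptions.
render_time_slicing_config() {
  local replicas="${1:-4}"
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: ${replicas}
EOF
}
```

Usage: `render_time_slicing_config 4 | kubectl apply -f -`, then point the device plugin at the ConfigMap (with the GPU operator, through the `devicePlugin.config` field of the ClusterPolicy). With 4 replicas, a node with one physical GPU advertises `nvidia.com/gpu: 4`, and the pods sharing it get no memory isolation from each other.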

Common findings this catches

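Pod FailedScheduling for GPU → no node has free GPU capacity, or the toleration for the GPU taint is missing.
nvidia-smi fails in container → Container Toolkit missing on the node, or the pod is not using the nvidia runtime.
CUDA driver/runtime version mismatch error → the image's CUDA version is newer than the node driver supports.
MIG capacity but pod requests whole GPU → resource type mismatch; request the nvidia.com/mig-* profile the node advertises.
Device plugin pod OOMKilled → raise the memory limit on the device plugin DaemonSet.
Pod scheduled but sees no GPU or only a fraction → node is MIG-partitioned or time-sliced; verify the node's sharing config.
Pod runs but slow → possible NUMA mismatch between GPU and CPU; check the kubelet Topology Manager policy.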
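
The CUDA mismatch finding can be checked before deploying an image. The sketch below compares a node's driver version against the minimum Linux driver for a CUDA major version. The two thresholds are the minor-version-compatibility minimums as I recall them from NVIDIA's CUDA compatibility documentation; treat them as assumptions and verify them against the current table. Both function names are hypothetical.

```shell
# Minimum Linux driver per CUDA major version (assumed values; verify
# against NVIDIA's CUDA compatibility table before relying on them).
min_driver_for_cuda() {
  case "$1" in
    11) echo "450.80.02" ;;
    12) echo "525.60.13" ;;
    *)  return 1 ;;
  esac
}

# Returns 0 if the driver meets the minimum for the CUDA major version,
# 1 if it is too old, 2 if the CUDA major version is not in the table.
driver_supports_cuda() {
  local driver="$1"
  local cuda_major="$2"
  local min
  min="$(min_driver_for_cuda "$cuda_major")" || return 2
  # sort -V orders version strings; the driver passes if min sorts first.
  [ "$(printf '%s\n%s\n' "$min" "$driver" | sort -V | head -n1)" = "$min" ]
}
```

Usage on a GPU node: `driver_supports_cuda "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)" 12`. Passing this check is necessary rather than sufficient: an image built against a newer minor release may still need features the older driver lacks.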

When to escalate

Driver upgrade across fleet — coordinate.
Mixed GPU types causing scheduling issues — node pool design.
ML training scaling — engage with Volcano / Kueue.
