Kubernetes PV / PVC / CSI Storage Troubleshooting Prompt

Diagnose stuck PVCs, failed pod mounts, StorageClass provisioning errors, CSI driver crashes, and orphaned volume cleanups.

Target user
Kubernetes platform engineers handling persistent storage
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior Kubernetes storage engineer with deep experience operating CSI drivers (EBS, Persistent Disk, Ceph RBD, Longhorn, Portworx, NFS-CSI) in production.

I will provide:
- The symptom: PVC stuck in `Pending`, pod stuck in `ContainerCreating` waiting on volume, `Multi-Attach error`, `failed to provision`, slow I/O, or `volume in use cannot delete`
- PVC + PV YAML (`kubectl get pvc <p> -o yaml`, `kubectl get pv <pv> -o yaml`)
- StorageClass: `kubectl get sc <sc> -o yaml`
- Recent events on the PVC/pod
- CSI driver pods (logs from the affected node's `csi-node-*` and the cluster-wide `csi-provisioner` / `csi-attacher` / `csi-resizer`)
- Pod spec with the volume mount
- The CSI driver name + version

Your job:

1. **Walk the PVC lifecycle** to find the failing stage:
   - `Pending` waiting on provisioner → check `csi-provisioner` logs; usually quota, IAM, or invalid params
   - `Bound` but pod stuck `ContainerCreating` → attach phase (`csi-attacher`) or mount phase (`csi-node` on the affected node)
   - `Multi-Attach error for volume...` → `ReadWriteOnce` PV is being attached to a second node before the first node releases (often when pod moves)
   - `Bound` and mounted, but slow → usually the backend rather than Kubernetes; measure from inside the pod before blaming either layer
2. **Match access modes correctly**:
   - `RWO` → one node at a time; when a pod moves between nodes the volume must detach from the old node first (force-detach only if that node is unresponsive)
   - `RWX` → multiple nodes; CSI driver must support it (NFS yes, EBS no, Ceph CephFS yes)
   - `ROX` → rare; multiple readers only
   - `RWOP` (beta in 1.27, GA in 1.29) → ReadWriteOncePod: one pod in the whole cluster, stricter than RWO
3. **Decode StorageClass parameters**:
   - `volumeBindingMode: WaitForFirstConsumer` vs `Immediate`
     - Immediate provisions at PVC create (may bind to wrong AZ on cloud)
     - WaitForFirstConsumer waits for pod scheduling to know zone; preferred for zonal disks
   - `reclaimPolicy: Delete` vs `Retain` — `Delete` removes the backend volume on PVC delete; `Retain` keeps it (orphaned)
   - Provisioner-specific params: `fsType`, `iops`, `throughput`, encryption, tags
4. **For stuck deletes**: check the finalizer list on PVC and PV. `kubernetes.io/pvc-protection` and `kubernetes.io/pv-protection` block deletion until referencers go away.
5. **For Multi-Attach errors**:
   - Old node went down hard; volume still "attached" in cloud API
   - Force-detach is risky — write-cache loss possible
   - Newer K8s + CSI handle this with `node.kubernetes.io/out-of-service` taint on the dead node
6. **For CSI driver crashes**: check pod logs, restart counts, RBAC for the csi service account, volume attachment limits per node (`maxVolumesPerNode`).
7. **For slow I/O**: determine whether the bottleneck is in the Kubernetes layer or in the backend. Running `fio` or `dd` inside the pod via `kubectl exec` measures the throughput the workload actually sees; compare it with the backend's provisioned limits.
8. Mark every DESTRUCTIVE action clearly: editing a PV's `persistentVolumeReclaimPolicy` from Delete to Retain mid-flight (good idea, but timing matters), force-removing finalizers (orphans backend), deleting a Bound PVC.

---

CSI driver + version: [e.g., ebs.csi.aws.com v1.30]
Cluster context: [cloud provider / on-prem / k3s, etc.]
Symptom: [DESCRIBE]
PVC YAML:
```yaml
[PASTE]
```
PV YAML (if bound):
```yaml
[PASTE]
```
StorageClass YAML:
```yaml
[PASTE]
```
Events on PVC + pod:
```
[PASTE kubectl describe pvc + kubectl describe pod]
```
CSI logs (controller + node on affected node):
```
[PASTE]
```

Why this prompt works

Storage failures in Kubernetes cross at least three components: the PVC controller (kube-controller-manager), the CSI driver (cluster-wide + per-node), and the backend storage system. The visible state (Pending, ContainerCreating) doesn’t tell you which one failed. This prompt forces a stage-aware diagnosis and flags the destructive recovery actions.

How to use it

  1. Always include the StorageClass YAML. A wrong volumeBindingMode for a zonal disk is one of the most common causes of “stuck Pending” PVCs.
  2. For multi-node clusters with zonal disks, mention the AZ of the pod’s target node. If the disk is in a different zone, no amount of waiting fixes it.
  3. Include both controller-side and node-side CSI logs — the failure is often on the node, but the user only checks the controller.
  4. For “Multi-Attach error”, mention what happened to the previous pod (node went down? rolling deploy?).
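The inputs listed above can be gathered in one pass. The following is a minimal sketch, not a definitive tool: `collect_storage_diag` is a hypothetical helper name, and it assumes `kubectl` is already pointed at the affected cluster. It leaves out the CSI logs because the namespace and pod labels of the driver vary by installation.

```shell
# collect_storage_diag NS PVC POD  (hypothetical helper, sketch only)
# Prints the PVC, PV, StorageClass, events, and target node as one text bundle.
collect_storage_diag() {
  ns="$1"; pvc="$2"; pod="$3"

  # The bound PV name and the StorageClass are read from the PVC itself.
  pv=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.volumeName}')
  sc=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}')

  echo "### PVC YAML"
  kubectl get pvc "$pvc" -n "$ns" -o yaml

  # An unbound PVC has no PV yet, so skip that section rather than fail.
  if [ -n "$pv" ]; then
    echo "### PV YAML"
    kubectl get pv "$pv" -o yaml
  fi

  if [ -n "$sc" ]; then
    echo "### StorageClass YAML"
    kubectl get sc "$sc" -o yaml
  fi

  echo "### Events on PVC"
  kubectl describe pvc "$pvc" -n "$ns"
  echo "### Events on pod"
  kubectl describe pod "$pod" -n "$ns"

  # The node name tells you which csi-node pod's logs to collect next.
  echo "### Target node"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}'
  echo
}
```

A call such as `collect_storage_diag <ns> <pvc> <pod> > diag.txt` produces a file that can be pasted into the template, with the controller-side and node-side CSI logs appended by hand.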

Useful commands

# PVC + PV state
kubectl get pvc -A
kubectl describe pvc <pvc> -n <ns>
kubectl get pv <pv> -o yaml
kubectl describe pv <pv>

# StorageClass
kubectl get sc
kubectl describe sc <sc>

# CSI driver pods
kubectl get pods -n kube-system -l app.kubernetes.io/name=<csi-driver>
kubectl logs -n kube-system <csi-controller-pod> -c csi-provisioner --tail=200
kubectl logs -n kube-system <csi-controller-pod> -c csi-attacher --tail=200
kubectl logs -n kube-system <csi-node-pod-on-affected-node> --tail=200

# Volume attachments
kubectl get volumeattachment
kubectl get volumeattachment <va> -o yaml

# Pod-level mount errors
kubectl describe pod <pod> -n <ns>          # look for Events section
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>

# In-pod I/O test
kubectl exec -n <ns> <pod> -- df -h /data
kubectl exec -n <ns> <pod> -- dd if=/dev/zero of=/data/test bs=1M count=100 oflag=direct
kubectl exec -n <ns> <pod> -- rm /data/test   # remove the 100 MiB test file afterwards

# CSI driver capabilities
kubectl get csidriver
kubectl get csidriver <name> -o yaml

# Snapshots (if VolumeSnapshot enabled)
kubectl get volumesnapshot -A
kubectl get volumesnapshotclass

# Stuck delete — see what's holding it
kubectl get pvc <pvc> -n <ns> -o yaml | grep -A5 finalizers
kubectl get pv <pv> -o yaml | grep -A5 finalizers

Decision matrix

| Symptom | Where to look first |
| --- | --- |
| PVC Pending immediately | csi-provisioner logs: IAM, quota, or parameters |
| PVC Pending until a pod uses it | volumeBindingMode: WaitForFirstConsumer; normal, resolves once a consuming pod is scheduled |
| Pod ContainerCreating for >2m | csi-attacher (controller) and csi-node on the target node |
| Multi-Attach error | Previous attachment not released; check old pod’s node |
| failed to provision volume | CSI provisioner; check StorageClass params + cloud quota |
| Slow I/O inside pod | Backend, not K8s; test from another mount of same backend |
| volume in use, cannot delete | Finalizers; check `kubectl get pvc/pv -o yaml` |
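For scripted triage, the matrix can be encoded as a lookup on the event or error text. This is an illustrative sketch only: `where_to_look` is a hypothetical function name, and the substring patterns are assumptions about how the symptom is worded. The more specific messages are matched first because a Multi-Attach error usually shows up on a pod that is also stuck in ContainerCreating.

```shell
# where_to_look SYMPTOM  (hypothetical helper, sketch only)
# Maps a symptom or event message to the first place to inspect.
where_to_look() {
  case "$1" in
    *"Multi-Attach"*)
      echo "Previous attachment not released; check the old pod's node" ;;
    *"failed to provision"*)
      echo "csi-provisioner logs; check StorageClass params and cloud quota" ;;
    *"volume in use"*)
      echo "Finalizers on the PVC and PV" ;;
    *ContainerCreating*)
      echo "csi-attacher (controller) and csi-node on the target node" ;;
    *Pending*)
      echo "csi-provisioner logs and the StorageClass volumeBindingMode" ;;
    *[Ss]low*)
      echo "Backend; test from another mount of the same backend" ;;
    *)
      echo "Unrecognized symptom; start with kubectl describe on the PVC and pod" ;;
  esac
}
```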

Common findings this catches

  • PVC Pending with no events on cloud cluster: `volumeBindingMode: Immediate` on a zonal StorageClass, but no node is available in the StorageClass’s zone. Switch to WaitForFirstConsumer.
  • Pod ContainerCreating after node failure: the RWO volume is still attached to the dead node, so the cloud API reports it as in use. After confirming the node is really powered off, taint it with node.kubernetes.io/out-of-service:NoExecute to trigger CSI cleanup (K8s 1.26+).
  • PVC stuck Terminating: a pod is still using it. kubectl describe pvc <pvc> -n <ns> lists the holder under "Used By" ("Mounted By" in older kubectl); once that pod is removed, the finalizer releases.
  • PV stuck Released: `reclaimPolicy: Retain` left it in place; an admin must run kubectl patch pv <pv> -p '{"spec":{"claimRef": null}}' to make it bindable again, or delete the PV to clean up (the backend volume then has to be removed separately).
  • CSI controller missing IAM permissions: csi-provisioner logs show AccessDenied. Common after IAM role changes.
  • maxVolumesPerNode reached: on AWS, EBS has per-instance attach limits. New pods stay stuck in ContainerCreating even with PVCs already bound.
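The attach-limit finding can be checked by comparing the limit a driver reports on the CSINode object with the number of VolumeAttachment objects for that node. The sketch below is hedged: `attach_headroom` is a hypothetical helper name, and it assumes the driver publishes `allocatable.count` (not all drivers do) and that each VolumeAttachment corresponds to one used slot, which is an approximation.

```shell
# attach_headroom NODE DRIVER  (hypothetical helper, sketch only)
# Example driver name: ebs.csi.aws.com
attach_headroom() {
  node="$1"; driver="$2"

  # Limit the driver reported for this node; empty when none is reported.
  limit=$(kubectl get csinode "$node" \
    -o jsonpath="{.spec.drivers[?(@.name==\"$driver\")].allocatable.count}")

  # Count VolumeAttachment objects that match both the node and the driver.
  # -x matches whole lines, -F treats the dots in the driver name literally.
  used=$(kubectl get volumeattachment \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{" "}{.spec.attacher}{"\n"}{end}' \
    | grep -cxF -- "$node $driver" || true)

  if [ -z "$limit" ]; then
    echo "$node: $used attached, no limit reported by $driver"
  else
    echo "$node: $used of $limit attachable volumes in use"
  fi
}
```

If the two numbers are equal, new attachments to that node are expected to fail or wait until a volume is detached.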

Recovery patterns

Recover from stuck Terminating PVC after pod gone

# 1. Confirm no pod uses it (prints namespace/pod for each pod that mounts the claim)
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{range .spec.volumes[*]}{" "}{.persistentVolumeClaim.claimName}{end}{"\n"}{end}' | grep -w <pvc>

# 2. If truly orphaned, finalizer should auto-clear once pod is gone.
#    If not (PVC stuck in Terminating after pod delete):
kubectl get pvc <pvc> -n <ns> -o yaml | grep -A2 finalizers   # confirm "kubernetes.io/pvc-protection"
# Manual removal (DESTRUCTIVE if backend not actually free):
kubectl patch pvc <pvc> -n <ns> -p '{"metadata":{"finalizers":null}}'
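Because the manual removal above is destructive, it may be worth wrapping it in a guard that refuses to run while any pod still mounts the claim. This is a sketch under stated assumptions: `safe_clear_pvc_finalizer` is a hypothetical helper name, and the check covers only pods in the PVC's namespace (pods can only reference claims in their own namespace). It does not verify that the backend volume is free.

```shell
# safe_clear_pvc_finalizer NS PVC  (hypothetical helper, sketch only)
safe_clear_pvc_finalizer() {
  ns="$1"; pvc="$2"

  # One line per pod: the pod name followed by the claims it mounts.
  # awk prints the pod name when any claim equals the PVC name exactly.
  users=$(kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{range .spec.volumes[*]}{" "}{.persistentVolumeClaim.claimName}{end}{"\n"}{end}' \
    | awk -v c="$pvc" '{for (i = 2; i <= NF; i++) if ($i == c) print $1}')

  if [ -n "$users" ]; then
    echo "refusing: PVC $pvc is still used by: $users" >&2
    return 1
  fi

  # DESTRUCTIVE: removes all finalizers so the pending delete completes.
  kubectl patch pvc "$pvc" -n "$ns" --type=merge -p '{"metadata":{"finalizers":null}}'
}
```

The exact-match comparison avoids the false positive a plain `grep` would give when one claim name is a prefix of another.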

Switch reclaimPolicy on existing PVs

kubectl get pv -o jsonpath='{range .items[?(@.spec.storageClassName=="<class-name>")]}{.metadata.name}{"\n"}{end}' \
  | xargs -I{} kubectl patch pv {} -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

Expand a PVC

# Requires `allowVolumeExpansion: true` in StorageClass
kubectl patch pvc <pvc> -n <ns> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# Expansion may require pod restart; watch the FileSystemResizePending condition
kubectl describe pvc <pvc> -n <ns>
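To follow an expansion without reading the full describe output, the PVC's requested size, reported capacity, and condition types can be compared directly. The sketch below assumes the standard `Resizing` and `FileSystemResizePending` condition types; `pvc_resize_status` is a hypothetical helper name and the messages are illustrative.

```shell
# pvc_resize_status NS PVC  (hypothetical helper, sketch only)
pvc_resize_status() {
  ns="$1"; pvc="$2"

  requested=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.resources.requests.storage}')
  actual=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.status.capacity.storage}')
  # Condition types come back space-separated, for example "Resizing".
  conditions=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.status.conditions[*].type}')

  case " $conditions " in
    *" FileSystemResizePending "*)
      echo "backend resized; filesystem resize pending on the node (a pod restart may be needed)" ;;
    *" Resizing "*)
      echo "controller-side resize in progress: $actual -> $requested" ;;
    *)
      if [ "$requested" = "$actual" ]; then
        echo "done: capacity is $actual"
      else
        echo "no resize condition yet: capacity $actual, requested $requested"
      fi ;;
  esac
}
```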

When to escalate

  • CSI driver pods crashing repeatedly — engage the CSI driver maintainer’s support (cloud provider or vendor); usually a quota or RBAC issue, but can be a driver bug.
  • Backend storage system in degraded state — fix the backend before retrying K8s operations.
  • Data integrity concerns after a forced detach — restore from snapshot rather than trusting fsck on a volume with potential cache loss.
