Kubernetes PV / PVC / CSI Storage Troubleshooting Prompt
Diagnose stuck PVCs, failed pod mounts, StorageClass provisioning errors, CSI driver crashes, and orphaned volume cleanups.
- Target user: Kubernetes platform engineers handling persistent storage
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes storage engineer with deep experience operating CSI drivers (EBS, Persistent Disk, Ceph RBD, Longhorn, Portworx, NFS-CSI) in production.
I will provide:
- The symptom: PVC stuck in `Pending`, pod stuck in `ContainerCreating` waiting on volume, `Multi-Attach error`, `failed to provision`, slow I/O, or `volume in use cannot delete`
- PVC + PV YAML (`kubectl get pvc <p> -o yaml`, `kubectl get pv <pv> -o yaml`)
- StorageClass: `kubectl get sc <sc> -o yaml`
- Recent events on the PVC/pod
- CSI driver pods (logs from the affected node's `csi-node-*` and the cluster-wide `csi-provisioner` / `csi-attacher` / `csi-resizer`)
- Pod spec with the volume mount
- The CSI driver name + version
Your job:
1. **Walk the PVC lifecycle** to find the failing stage:
- `Pending` waiting on provisioner → check `csi-provisioner` logs; usually quota, IAM, or invalid params
- `Bound` but pod stuck `ContainerCreating` → attach phase (`csi-attacher`) or mount phase (`csi-node` on the affected node)
- `Multi-Attach error for volume...` → `ReadWriteOnce` PV is being attached to a second node before the first node releases (often when pod moves)
- `Bound` and mounted, but slow → usually the backend rather than Kubernetes; confirm with an in-pod I/O test
2. **Match access modes correctly**:
- `RWO` → one node at a time; when a pod moves to another node, the volume must detach from the old node before it can attach to the new one
- `RWX` → multiple nodes; CSI driver must support it (NFS yes, EBS no, Ceph CephFS yes)
- `ROX` → rare; multiple readers only
- `RWOP` (beta in 1.27, GA in 1.29) → ReadWriteOncePod: one POD, even tighter than RWO
3. **Decode StorageClass parameters**:
- `volumeBindingMode: WaitForFirstConsumer` vs `Immediate`
- Immediate provisions at PVC create (may bind to wrong AZ on cloud)
- WaitForFirstConsumer waits for pod scheduling to know zone; preferred for zonal disks
- `reclaimPolicy: Delete` vs `Retain` — `Delete` removes the backend volume on PVC delete; `Retain` keeps it (orphaned)
- Provisioner-specific params: `fsType`, `iops`, `throughput`, encryption, tags
4. **For stuck deletes**: check the finalizer list on PVC and PV. `kubernetes.io/pvc-protection` and `kubernetes.io/pv-protection` block deletion until referencers go away.
5. **For Multi-Attach errors**:
- Old node went down hard; volume still "attached" in cloud API
- Force-detach is risky — write-cache loss possible
- Newer K8s + CSI handle this with `node.kubernetes.io/out-of-service` taint on the dead node
6. **For CSI driver crashes**: check pod logs, restart counts, RBAC for the CSI service account, and per-node volume attachment limits (`maxVolumesPerNode`; the limit a driver reports appears as `allocatable.count` in the node's `CSINode` object).
7. **For slow I/O**: determine whether the bottleneck is in the Kubernetes layer or in the backend. Running `fio` or `dd` inside the pod via `kubectl exec` shows the bandwidth the workload actually gets.
8. Mark every DESTRUCTIVE action clearly: editing PV `persistentVolumeReclaimPolicy` from Delete to Retain mid-flight (good idea, but timing matters), force-removing finalizers (orphans backend), deleting a Bound PVC.
---
CSI driver + version: [e.g., ebs.csi.aws.com v1.30]
Cluster context: [cloud provider / on-prem / k3s, etc.]
Symptom: [DESCRIBE]
PVC YAML:
```yaml
[PASTE]
```
PV YAML (if bound):
```yaml
[PASTE]
```
StorageClass YAML:
```yaml
[PASTE]
```
Events on PVC + pod:
```
[PASTE kubectl describe pvc + kubectl describe pod]
```
CSI logs (controller + node on affected node):
```
[PASTE]
```
Why this prompt works
Storage failures in Kubernetes cross at least three components: the PVC controller (kube-controller-manager), the CSI driver (cluster-wide + per-node), and the backend storage system. The visible state (Pending, ContainerCreating) doesn’t tell you which one failed. This prompt forces a stage-aware diagnosis and flags the destructive recovery actions.
How to use it
- Always include the StorageClass YAML. A common cause of “stuck Pending” PVCs is the wrong `volumeBindingMode` for a zonal disk.
- For multi-node clusters with zonal disks, mention the AZ of the pod’s target node. If the disk is in a different zone, no amount of waiting fixes it.
- Include both controller-side and node-side CSI logs — the failure is often on the node, but the user only checks the controller.
- For “Multi-Attach error”, mention what happened to the previous pod (node went down? rolling deploy?).
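The inputs listed above can be gathered in one pass. The following is a minimal sketch rather than a definitive tool: the function name `collect_storage_evidence` and the `KUBECTL` override variable are hypothetical, introduced only for illustration. CSI driver logs are left out because their namespace and pod names depend on how the driver is deployed.

```bash
# Sketch: print the PVC, PV, StorageClass and events for one PVC/pod pair.
# KUBECTL is a hypothetical override, e.g. KUBECTL="kubectl --context staging".
KUBECTL="${KUBECTL:-kubectl}"

collect_storage_evidence() {
  # Usage: collect_storage_evidence <namespace> <pvc> <pod>
  local ns pvc pod pv sc
  ns="$1"; pvc="$2"; pod="$3"

  echo "### PVC"
  $KUBECTL get pvc "$pvc" -n "$ns" -o yaml

  # The PV and StorageClass names are read from the PVC itself.
  # volumeName is empty while the PVC is still unbound.
  pv="$($KUBECTL get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.volumeName}')"
  sc="$($KUBECTL get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}')"

  if [ -n "$pv" ]; then
    echo "### PV"
    $KUBECTL get pv "$pv" -o yaml
  fi
  if [ -n "$sc" ]; then
    echo "### StorageClass"
    $KUBECTL get sc "$sc" -o yaml
  fi

  echo "### Events"
  $KUBECTL describe pvc "$pvc" -n "$ns"
  $KUBECTL describe pod "$pod" -n "$ns"
}
```

Redirecting the output to a file (for example `collect_storage_evidence prod data web-0 > evidence.txt`) gives a single artifact to paste into the template.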
Useful commands
```bash
# PVC + PV state
kubectl get pvc -A
kubectl describe pvc <pvc> -n <ns>
kubectl get pv <pv> -o yaml
kubectl describe pv <pv>
# StorageClass
kubectl get sc
kubectl describe sc <sc>
# CSI driver pods (namespace and label vary by driver; kube-system is common)
kubectl get pods -n kube-system -l app.kubernetes.io/name=<csi-driver>
kubectl logs -n kube-system <csi-controller-pod> -c csi-provisioner --tail=200
kubectl logs -n kube-system <csi-controller-pod> -c csi-attacher --tail=200
kubectl logs -n kube-system <csi-node-pod-on-affected-node> --all-containers --tail=200
# Volume attachments
kubectl get volumeattachment
kubectl get volumeattachment <va> -o yaml
# Pod-level mount errors
kubectl describe pod <pod> -n <ns> # look for Events section
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>
# In-pod I/O test (assumes the volume is mounted at /data; remove the test file afterwards)
kubectl exec -n <ns> <pod> -- df -h /data
kubectl exec -n <ns> <pod> -- dd if=/dev/zero of=/data/test bs=1M count=100 oflag=direct
kubectl exec -n <ns> <pod> -- rm /data/test
# CSI driver capabilities
kubectl get csidriver
kubectl get csidriver <name> -o yaml
# Snapshots (if VolumeSnapshot enabled)
kubectl get volumesnapshot -A
kubectl get volumesnapshotclass
# Stuck delete: see what's holding it
kubectl get pvc <pvc> -n <ns> -o yaml | grep -A5 finalizers
kubectl get pv <pv> -o yaml | grep -A5 finalizers
```
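When chasing a Multi-Attach error, it helps to see which node a PV is attached to according to the VolumeAttachment objects. The sketch below assumes a four-column listing (name, node, PV, attached) produced with `custom-columns`; the helper name `attachments_for_pv` is hypothetical.

```bash
# Sketch: filter a VolumeAttachment listing down to one PV.
# Expects whitespace-separated columns on stdin: NAME NODE PV ATTACHED
attachments_for_pv() {
  awk -v pv="$1" '$3 == pv { printf "%s attached=%s on node %s\n", $1, $4, $2 }'
}

# Example invocation (requires cluster access):
# kubectl get volumeattachment --no-headers \
#   -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached \
#   | attachments_for_pv <pv>
```

More than one line for the same PV, or a line that names a node that is no longer healthy, points at an attachment that was never released.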
Decision matrix
| Symptom | Where to look first |
|---|---|
| PVC `Pending` immediately | `csi-provisioner` logs: IAM/quota/params |
| PVC `Pending` until a pod is scheduled | `volumeBindingMode: WaitForFirstConsumer`: normal, resolves once a pod that uses the PVC is scheduled |
| Pod `ContainerCreating` for >2m | `csi-attacher` (controller) and `csi-node` on the target node |
| `Multi-Attach error` | Previous attachment not released; check old pod’s node |
| `failed to provision volume` | CSI provisioner; check StorageClass params + cloud quota |
| Slow I/O inside pod | Backend, not K8s; test from another mount of same backend |
| `volume in use, cannot delete` | Finalizers; check `kubectl get pvc/pv -o yaml` |
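The matrix can be applied mechanically to event text. This is a rough sketch, assuming the events contain the message fragments quoted in this document (`Multi-Attach error`, `failed to provision volume`) plus the `waiting for first consumer` message and the `FailedAttachVolume` / `FailedMount` event reasons; the function name `first_look` is hypothetical, and exact wording may differ between Kubernetes versions.

```bash
# Sketch: read event text on stdin and print where to look first.
# More specific patterns are checked before the generic event reasons.
first_look() {
  local events
  events="$(cat)"
  case "$events" in
    *"Multi-Attach error"*)
      echo "previous attachment not released: check the old pod's node and VolumeAttachments" ;;
    *"waiting for first consumer"*)
      echo "WaitForFirstConsumer: normal until a pod using the PVC is scheduled" ;;
    *"failed to provision volume"*)
      echo "csi-provisioner logs: StorageClass params, IAM, cloud quota" ;;
    *FailedAttachVolume*)
      echo "csi-attacher logs on the controller" ;;
    *FailedMount*)
      echo "csi-node logs on the target node" ;;
    *)
      echo "no known pattern: walk the PVC lifecycle manually" ;;
  esac
}

# Example invocation (requires cluster access):
# kubectl describe pod <pod> -n <ns> | first_look
```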
Common findings this catches
- PVC `Pending` with no events on cloud cluster: `volumeBindingMode: Immediate` + zonal SC, but no node available in the SC’s zone. Switch to `WaitForFirstConsumer`.
- Pod `ContainerCreating` after node failure: RWO volume still attached to dead node; cloud API thinks it’s busy. Once the node is confirmed to be powered off, taint it with `node.kubernetes.io/out-of-service:NoExecute` to trigger CSI cleanup (K8s 1.26+).
- PVC stuck Terminating: pod still using it. `kubectl describe pvc <pvc> -n <ns>` lists the holder under `Used By`; once that pod is removed, the finalizer releases.
- PV stuck Released: `persistentVolumeReclaimPolicy: Retain` left it; admin must `kubectl patch pv <pv> -p '{"spec":{"claimRef": null}}'` to reuse, or delete to clean up.
- CSI controller missing IAM permissions: `csi-provisioner` logs show `AccessDenied`. Common after IAM role changes.
- `maxVolumesPerNode` reached: on AWS, EBS has per-instance attach limits. New pods stuck `ContainerCreating` even with PVCs already bound.
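For the node-failure case, a guarded version of the out-of-service taint might look like the sketch below. The function name `taint_out_of_service` and the `KUBECTL` override are hypothetical, and the taint value `nodeshutdown` is an assumption (the value itself is arbitrary). A `Ready` status other than `True` does not prove the node is powered off, so the check only prevents the obvious mistake of tainting a healthy node.

```bash
# Sketch: refuse to taint a node that still reports Ready=True.
# DESTRUCTIVE: only run after confirming out-of-band that the node is shut down.
KUBECTL="${KUBECTL:-kubectl}"

taint_out_of_service() {
  # Usage: taint_out_of_service <node>
  local node ready
  node="$1"
  ready="$($KUBECTL get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  if [ "$ready" = "True" ]; then
    echo "refusing: node $node reports Ready=True" >&2
    return 1
  fi
  $KUBECTL taint nodes "$node" node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
}
```

Remove the taint again once the node has been recovered or replaced.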
Recovery patterns
Recover from stuck Terminating PVC after pod gone
```bash
# 1. Confirm no pod uses it (no output means no pod references the claim)
kubectl get pods -A -o jsonpath='{range .items[*]}{range .spec.volumes[*]}{.persistentVolumeClaim.claimName}{"\n"}{end}{end}' | grep -x '<pvc>'
# 2. If truly orphaned, the finalizer should auto-clear once the pod is gone.
# If not (PVC stuck in Terminating after pod delete):
kubectl get pvc <pvc> -n <ns> -o yaml | grep -A2 finalizers # confirm "kubernetes.io/pvc-protection"
# Manual removal (DESTRUCTIVE if backend not actually free):
kubectl patch pvc <pvc> -n <ns> -p '{"metadata":{"finalizers":null}}'
```
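The check in step 1 prints only claim names across all namespaces, so it cannot say which pod holds the claim. A variant that keeps the pod identity is sketched here; the helper name `pods_using_pvc` is hypothetical, and it assumes input lines of the form `namespace/pod<TAB>claim1 claim2 ...`.

```bash
# Sketch: print namespace/pod for every pod in <ns> that mounts <pvc>.
# stdin format: "<namespace>/<pod>\t<space-separated claim names>"
pods_using_pvc() {
  awk -F'\t' -v ns="$1" -v pvc="$2" '
    {
      split($1, id, "/")              # id[1] = namespace, id[2] = pod name
      if (id[1] != ns) next           # PVCs are namespaced: skip other namespaces
      n = split($2, claims, " ")
      for (i = 1; i <= n; i++)
        if (claims[i] == pvc) { print $1; break }
    }'
}

# Example invocation (requires cluster access):
# kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}' \
#   | pods_using_pvc <ns> <pvc>
```

An empty result is the precondition for removing the finalizer by hand.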
Switch reclaimPolicy on existing PVs
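```bash
# The awk pattern matches <class-name> anywhere on the line (including claim names); review the list before patching
kubectl get pv --no-headers | awk '/<class-name>/ {print $1}' | xargs -I{} kubectl patch pv {} \
-p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```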
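A substring match on the class name can catch unrelated PVs (for example, a class called `fast` also matches `fast-ssd`). One way to select by exact StorageClass is sketched below; the function name `retain_pvs_of_class` and the `KUBECTL` override are hypothetical.

```bash
# Sketch: set persistentVolumeReclaimPolicy=Retain on every PV whose
# storageClassName equals the given class exactly.
KUBECTL="${KUBECTL:-kubectl}"

retain_pvs_of_class() {
  # Usage: retain_pvs_of_class <storage-class>
  local sc
  sc="$1"
  $KUBECTL get pv --no-headers \
    -o custom-columns=NAME:.metadata.name,SC:.spec.storageClassName \
  | awk -v sc="$sc" '$2 == sc { print $1 }' \
  | while read -r pv; do
      # </dev/null keeps the patch command from consuming the PV list on stdin
      $KUBECTL patch pv "$pv" \
        -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}' </dev/null
    done
}
```

Running the first two pipeline stages alone gives a dry-run list of the PVs that would be patched.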
Expand a PVC
```bash
# Requires `allowVolumeExpansion: true` in StorageClass
kubectl patch pvc <pvc> -n <ns> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# Expansion may require pod restart; watch the FileSystemResizePending condition
kubectl describe pvc <pvc> -n <ns>
```
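The resize progress can also be read from the PVC's status conditions instead of scanning `describe` output. A small sketch, assuming the condition types `Resizing` and `FileSystemResizePending`; the helper name `resize_state` is hypothetical.

```bash
# Sketch: interpret the space-separated condition types of a PVC.
resize_state() {
  local conds
  conds="$(cat)"
  # Pad with spaces so only whole condition names match.
  case " $conds " in
    *" FileSystemResizePending "*) echo "filesystem resize pending on the node" ;;
    *" Resizing "*)                echo "controller resize in progress" ;;
    *)                             echo "no resize condition set" ;;
  esac
}

# Example invocation (requires cluster access):
# kubectl get pvc <pvc> -n <ns> -o jsonpath='{.status.conditions[*].type}' | resize_state
```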
When to escalate
- CSI driver pods crashing repeatedly — engage the CSI driver maintainer’s support (cloud provider or vendor); usually a quota or RBAC issue, but can be a driver bug.
- Backend storage system in degraded state — fix the backend before retrying K8s operations.
- Data integrity concerns after a forced detach — restore from snapshot rather than trusting fsck on a volume with potential cache loss.