Kubernetes Deployment Rollout Debug Prompt
Diagnose stuck Deployment rollouts — `ProgressDeadlineExceeded`, replica set churn, maxSurge/maxUnavailable misconfig, image pull pacing, and stuck-mid-rollout recovery.
- Target user: Kubernetes platform engineers and SREs
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes engineer who has watched hundreds of rollouts succeed and fail. You know that `kubectl rollout status` hanging for 10 minutes usually means the new ReplicaSet's pods aren't passing readiness, not that the cluster is slow.

I will provide:
- The symptom (rollout stuck, partial rollout, old + new pods both alive forever, `ProgressDeadlineExceeded`, rollback didn't restore)
- `kubectl get deploy <name> -o yaml` (especially `status` and `strategy`)
- `kubectl get rs -l app=<label>` (multiple ReplicaSets indicate rollout history)
- `kubectl describe deploy <name>` (Events at bottom)
- Pods of the new ReplicaSet: status, restart counts, readiness probe configs
- Recent change (image bump, env var, configmap reference)

Your job:

1. **Identify the rollout stage**:
   - **New RS at 0 replicas** → controller hasn't started the new RS (rare; check Deployment controller logs)
   - **New RS scaling up, old RS scaling down at the same time** → normal rolling update mid-flight
   - **New RS stuck at N < desired** → new pods not passing readiness (most common)
   - **Both RSes at full replica count** → a large `maxSurge` (e.g. `100%`) with new pods that never become Ready, so the old RS is never scaled down; or an admission webhook blocking deletion
   - **`ProgressDeadlineExceeded`** → no progress within `progressDeadlineSeconds` (default 600s)

2. **Decode strategy parameters**:
   - **`maxSurge`**: extra pods allowed above desired (default `25%`; percentages round up). `maxSurge=0` requires terminating old pods before creating new ones.
   - **`maxUnavailable`**: pods allowed below desired (default `25%`; percentages round down). `maxUnavailable=0` means "always at full capacity" and is only valid when `maxSurge > 0`.
   - **Common bad combo**: both `0` → the rollout could never make a move, so the API server rejects the spec at validation. If both percentages resolve to 0 at a low replica count, the controller treats `maxUnavailable` as 1.

3. **For new pods not becoming Ready**:
   - `kubectl describe pod <new-pod>` Events: image pull, scheduling, readiness probe failures
   - Readiness probe wrong: path, port, or scheme mismatch, or the app only returns 200 after `progressDeadlineSeconds` has already elapsed
   - readinessGate from a controller (e.g., service-mesh) blocking
   - Pod stuck in `ContainerCreating` → volume, image pull, sidecar init

4. **For "rolled out" status but old pods linger**:
   - Pod Disruption Budget blocking eviction during a concurrent node drain (PDBs gate the eviction API; they do not limit the Deployment controller's own scale-down)
   - Finalizer on the old pod
   - Manual `kubectl edit` left a stray field

5. **For rollback gone wrong**:
   - `kubectl rollout undo deploy <name>` returns to the previous RS, but the previous RS's pod template might also be broken
   - `kubectl rollout history deploy <name>` shows revisions; add `--revision=N` for a specific one
   - Each revision is a separate RS; check pod-template-hash to verify which RS is "current"

6. **For replica-set churn** (many old RSes accumulating):
   - `revisionHistoryLimit` (default 10) caps how many old RSes are kept
   - High churn = many image pushes / config changes per day; consider lowering retention

7. **For mid-rollout stuck and no obvious cause**:
   - Pause and inspect: `kubectl rollout pause deploy <name>`
   - Resume after fix: `kubectl rollout resume deploy <name>`

Mark DESTRUCTIVE: `kubectl rollout undo` to a revision whose pod template is broken (rolls back to non-working state), changing `maxSurge`/`maxUnavailable` mid-rollout, deleting the new RS by hand (controller will recreate).

---

Deployment + namespace: [DESCRIBE]

Symptom: [DESCRIBE]

`kubectl describe deploy <name>` (esp. Events):
```
[PASTE]
```

`kubectl get rs -l <selector>`:
```
[PASTE]
```

New RS pod sample `kubectl describe pod <pod>`:
```
[PASTE]
```

Strategy from `kubectl get deploy <name> -o yaml`:
```yaml
[PASTE .spec.strategy]
```
Why this prompt works
“Rollout stuck” can mean an image pull failure, a readiness probe failure, a PDB stalling a concurrent node drain, an admission webhook, or a strategy misconfig, but `kubectl rollout status` doesn’t say which. This prompt forces a Deployment → ReplicaSet → Pod walk to find the actual blocker.
How to use it
- Always check the new RS’s pods first. If they’re not Ready, that’s the rollout block.
- `kubectl describe deploy` shows Events at the bottom, which is usually the most informative part.
- List ReplicaSets: `kubectl get rs -l <selector>` shows old + new with desired/current/ready.
- For “ProgressDeadlineExceeded”, look at why no progress happened: pods are stuck somewhere upstream.
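The ReplicaSet check in the list above can be scripted. This is a minimal sketch, assuming the default table output of `kubectl get rs` (columns NAME, DESIRED, CURRENT, READY, AGE); the function name `unready_replicasets` is hypothetical and not part of kubectl.

```shell
# Print every ReplicaSet whose READY count is below its DESIRED count.
# Input on stdin: the default table from `kubectl get rs`
# (assumed columns: NAME DESIRED CURRENT READY AGE).
# ReplicaSets scaled to 0 (old revisions kept for history) are skipped.
unready_replicasets() {
  awk 'NR > 1 && ($2 + 0) > 0 && ($4 + 0) < ($2 + 0) {
    printf "%s ready=%s desired=%s\n", $1, $4, $2
  }'
}

# Usage against a live cluster (the label is a placeholder):
#   kubectl get rs -l app=<label> | unready_replicasets
```

A ReplicaSet that stays in this output for several minutes is the one whose pods deserve a `kubectl describe pod`.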
Useful commands
```shell
# Rollout status
kubectl rollout status deploy <name> --timeout=2m
kubectl rollout history deploy <name>
kubectl rollout history deploy <name> --revision=N
kubectl rollout pause deploy <name>
kubectl rollout resume deploy <name>
kubectl rollout undo deploy <name>
kubectl rollout undo deploy <name> --to-revision=N

# Deployment state
kubectl describe deploy <name>
kubectl get deploy <name> -o yaml | yq '.status'
kubectl get deploy <name> -o yaml | yq '.spec.strategy'

# ReplicaSet view (multiple = mid-rollout or churn)
kubectl get rs -l app=<label> --show-labels
kubectl describe rs <new-rs>

# New pod investigation
kubectl get pods -l pod-template-hash=<hash>
kubectl describe pod <new-pod>
kubectl logs <new-pod> --previous

# Readiness probe verification
kubectl get pod <pod> -o yaml | yq '.spec.containers[].readinessProbe'

# Pod disruption budgets blocking
kubectl get pdb -A
kubectl describe pdb <pdb>

# Force rollout (touch pod template to trigger new RS)
kubectl rollout restart deploy <name>
```
Common findings this catches
- New pods stuck in `ImagePullBackOff` → see imagepull-debugging.
- Readiness probe path wrong → app `/healthz` returns 404; probe fails; new pods never Ready.
- `maxUnavailable: 0` + `maxSurge: 0` → impossible rollout; the API server rejects this combination at validation. Change to `25%`/`25%`.
- PDB with `minAvailable: 100%` → node drains can’t evict old pods (PDBs gate the eviction API, not the Deployment controller’s own scale-down). Lower PDB or raise replicas first.
- `progressDeadlineSeconds: 60` on a slow-start app → falsely flagged as failed. Raise.
- Stuck after a config map change without `kubectl rollout restart` → env vars and `subPath` mounts keep the old values until pods restart; full volume mounts refresh eventually, but the app may not reload them. Add a rollout trigger.
- Multiple old RSes accumulating → `revisionHistoryLimit` left at the default of 10 or set higher; cluster bloat.
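The strategy finding above comes down to rounding arithmetic. The sketch below resolves `maxSurge` and `maxUnavailable` to absolute pod counts. It assumes the rounding described in the Kubernetes Deployment documentation (surge percentages round up, unavailable percentages round down) and the controller behavior of treating `maxUnavailable` as 1 when both values resolve to 0. The function names `resolve_bound` and `rollout_window` are hypothetical helpers, not kubectl features.

```shell
# Resolve a maxSurge/maxUnavailable value to an absolute pod count.
# Usage: resolve_bound <value> <replicas> <up|down>
# <value> is an integer ("1") or a percentage ("25%").
resolve_bound() {
  value=$1 replicas=$2 mode=$3
  case $value in
    *%)
      pct=${value%\%}
      if [ "$mode" = up ]; then
        echo $(( (pct * replicas + 99) / 100 ))   # ceiling division
      else
        echo $(( pct * replicas / 100 ))          # floor division
      fi
      ;;
    *) echo "$value" ;;
  esac
}

# Print the pod-count window a rolling update may use.
# Usage: rollout_window <replicas> <maxSurge> <maxUnavailable>
rollout_window() {
  replicas=$1
  surge=$(resolve_bound "$2" "$replicas" up)
  unavail=$(resolve_bound "$3" "$replicas" down)
  # Assumption: when both resolve to 0 the controller uses maxUnavailable=1
  # so that the rollout can still make progress.
  if [ "$surge" -eq 0 ] && [ "$unavail" -eq 0 ]; then
    unavail=1
  fi
  echo "max_total=$((replicas + surge)) min_available=$((replicas - unavail))"
}

# rollout_window 10 25% 25%   prints: max_total=13 min_available=8
# rollout_window 4 1 0        prints: max_total=5 min_available=4
```

With 10 replicas and the defaults, the rollout may run up to 13 pods at once and must keep at least 8 available, which is why slow-starting pods can hold a rollout at a partial count for a long time.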
Verify a rollout safely
```shell
# 1. Pre-flight: lint the diff
diff <(kubectl get deploy <name> -o yaml | yq '.spec.template') new-template.yaml

# 2. Apply with monitoring
kubectl apply -f deploy.yaml
kubectl rollout status deploy <name> --timeout=10m

# 3. If concerns, pause early
kubectl rollout pause deploy <name>
# Inspect, then:
kubectl rollout resume deploy <name>

# 4. Emergency rollback
kubectl rollout undo deploy <name>
```
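Steps 2 and 4 can be combined into a wrapper that waits for the rollout and reports why it stalled, without rolling back on its own, since the previous revision may be broken too. This is a sketch under assumptions: `watch_rollout` is a hypothetical function name, and the Deployment is assumed to live in the current namespace.

```shell
# Wait for a rollout; if it does not complete in time, print the reason from
# the Deployment's Progressing condition instead of undoing automatically.
# Usage: watch_rollout <deployment> [timeout]
watch_rollout() {
  deploy=$1
  timeout=${2:-10m}
  if kubectl rollout status "deploy/$deploy" --timeout="$timeout"; then
    echo "rollout of $deploy complete"
    return 0
  fi
  reason=$(kubectl get deploy "$deploy" \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}')
  echo "rollout of $deploy not complete: Progressing reason=${reason:-unknown}" >&2
  return 1
}

# Usage (the name is a placeholder):
#   watch_rollout <name> 10m || kubectl describe deploy <name>
```

Leaving `kubectl rollout undo` as a manual step keeps the decision with a person who has checked `kubectl rollout history` first.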
When to escalate
- Rollout stuck across many Deployments — likely cluster-wide issue (admission webhook, kube-controller-manager problem).
- Pods passing readiness in isolation but failing as a group → service-mesh routing or readinessGate from another controller; coordinate with mesh owner.
- ReplicaSet controller errors in `kube-controller-manager` logs: engage cluster admin.
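To judge whether a stall is cluster-wide before escalating, it can help to list every Deployment that has hit its progress deadline. This is a rough sketch, assuming read access to Deployments in all namespaces; `stuck_deployments` is a hypothetical function name.

```shell
# Print namespace/name for every Deployment whose Progressing condition
# reports ProgressDeadlineExceeded.
stuck_deployments() {
  kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Progressing")].reason}{"\n"}{end}' |
    awk -F '\t' '$2 == "ProgressDeadlineExceeded" { print $1 }'
}

# Usage:
#   stuck_deployments            # list them
#   stuck_deployments | wc -l    # count them
```

A single entry points at that workload; many entries across unrelated namespaces fit the cluster-wide pattern described above.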
Related prompts
- CrashLoopBackOff Debugging Prompt: Drill into a specific CrashLoopBackOff failure (application crash, missing config, init container failure, or probe-driven kill) and find the actual cause.
- Kubernetes Pod Troubleshooting Prompt: Diagnose any misbehaving pod (pending, evicted, networking-broken, storage-stuck, or just plain slow) with a structured AI walkthrough.
- Kubernetes Resource Limits & OOMKilled Tuning Prompt: Tune CPU/memory requests and limits to stop OOMKilled, fix throttling, right-size HPA targets, and avoid noisy-neighbor scheduling issues.