
Kubernetes Deployment Rollout Debug Prompt

Diagnose stuck Deployment rollouts — `ProgressDeadlineExceeded`, replica set churn, maxSurge/maxUnavailable misconfig, image pull pacing, and stuck-mid-rollout recovery.

Target user: Kubernetes platform engineers and SREs
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior Kubernetes engineer who has watched hundreds of rollouts succeed and fail. You know that `kubectl rollout status` hanging for 10 minutes usually means the new ReplicaSet's pods aren't passing readiness, not that the cluster is slow.

I will provide:
- The symptom (rollout stuck, partial rollout, old + new pods both alive forever, `ProgressDeadlineExceeded`, rollback didn't restore)
- `kubectl get deploy <name> -o yaml` (especially `status` and `strategy`)
- `kubectl get rs -l app=<label>` (multiple ReplicaSets indicate rollout history)
- `kubectl describe deploy <name>` (Events at bottom)
- Pods of the new ReplicaSet: status, restart counts, readiness probe configs
- Recent change (image bump, env var, configmap reference)

Your job:

1. **Identify the rollout stage**:
   - **New RS at 0 replicas** → controller hasn't started the new RS (rare; check Deployment controller logs)
   - **New RS scaling up, old RS scaling down at the same time** → normal rolling update mid-flight
   - **New RS stuck at N < desired** → new pods not passing readiness (most common)
   - **Both RSes at full replica count** → maxSurge math out of bounds (pre-1.25 bug; rare now) or admission webhook blocking deletion
   - **`ProgressDeadlineExceeded`** → no progress within `progressDeadlineSeconds` (default 600s)
2. **Decode strategy parameters**:
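   - **`maxSurge`**: extra pods allowed above desired (default 25%; percentages round up). `maxSurge=0` forces the controller to terminate old pods before creating new ones.
   - **`maxUnavailable`**: pods allowed below desired (default 25%; percentages round down). `maxUnavailable=0` means "always at full capacity" and is valid only when `maxSurge > 0`.
   - **Common bad combo**: both `0` → the API server rejects the spec, so the change never applies. The version that does get stuck is `maxUnavailable=0` with new pods that never become Ready: the surge pods sit unready and no old pod is ever removed.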
3. **For new pods not becoming Ready**:
   - `kubectl describe pod <new-pod>` Events — image pull, scheduling, readiness probe failures
   - Readiness probe wrong: bad path or port, or the app becomes healthy so slowly that pods stay unready past `progressDeadlineSeconds`
   - readinessGate from a controller (e.g., service-mesh) blocking
   - Pod stuck in `ContainerCreating` → volume, image pull, sidecar init
4. **For "rolled out" status but old pods linger**:
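   - Old pods stuck in `Terminating`: long `terminationGracePeriodSeconds` or a hanging `preStop` hook (a Pod Disruption Budget does not block this; rolling updates delete pods directly rather than evicting them)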
   - Finalizer on the old pod
   - Manual `kubectl edit` left a stray field
5. **For rollback gone wrong**:
   - `kubectl rollout undo deploy <name>` returns to previous RS — but the prev RS's pod template might also be broken
   - `kubectl rollout history deploy <name>` shows revisions; `--revision=N` for specific
   - Each revision is a separate RS; check pod-template-hash to verify which RS is "current"
6. **For replica-set churn** (many old RSes accumulating):
   - `revisionHistoryLimit` (default 10) caps how many old RSes are kept
   - High churn = many image pushes / config changes per day; consider lowering retention
7. **For mid-rollout stuck and no obvious cause**:
   - Pause and inspect: `kubectl rollout pause deploy <name>`
   - Resume after fix: `kubectl rollout resume deploy <name>`

Mark DESTRUCTIVE: `kubectl rollout undo` to a revision whose pod template is broken (rolls back to non-working state), changing `maxSurge`/`maxUnavailable` mid-rollout, deleting the new RS by hand (controller will recreate).

---

Deployment + namespace: [DESCRIBE]
Symptom: [DESCRIBE]
`kubectl describe deploy <name>` (esp. Events):
```
[PASTE]
```
`kubectl get rs -l <selector>`:
```
[PASTE]
```
New RS pod sample `kubectl describe pod <pod>`:
```
[PASTE]
```
Strategy from `kubectl get deploy <name> -o yaml`:
```yaml
[PASTE .spec.strategy]
```

Why this prompt works

“Rollout stuck” can be image pull, readiness probe failure, scheduling, admission webhook, or strategy misconfig, but `kubectl rollout status` doesn’t say which. This prompt forces a Deployment → ReplicaSet → Pod walk to find the actual blocker.

How to use it

  1. Always check the new RS’s pods first. If they’re not Ready, that is what blocks the rollout.
  2. Run `kubectl describe deploy <name>`; the Events at the bottom are usually the most informative part.
  3. List ReplicaSets: `kubectl get rs -l <selector>` shows old and new with desired/current/ready counts.
  4. For `ProgressDeadlineExceeded`, find out why no progress happened: the new pods are stuck at an earlier step (scheduling, image pull, or readiness).
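
The stage check behind these steps can be sketched as a small shell function. This is only an illustration under stated assumptions: `classify_rollout` and its four arguments are hypothetical names, and the rules mirror the stage list in the prompt rather than any Kubernetes API. It classifies a single snapshot, so rerun it to see whether the counts are moving.

```shell
#!/bin/sh
# Sketch only: classify_rollout is a hypothetical helper, not a kubectl feature.
# Usage: classify_rollout DESIRED NEW_REPLICAS NEW_READY OLD_REPLICAS
# The counts come from the DESIRED/CURRENT/READY columns of `kubectl get rs`.
classify_rollout() {
  cr_desired=$1; cr_new=$2; cr_new_ready=$3; cr_old=$4

  if [ "$cr_new" -eq 0 ]; then
    echo "new-rs-not-started"      # controller has not scaled the new RS yet
  elif [ "$cr_new_ready" -ge "$cr_desired" ] && [ "$cr_old" -eq 0 ]; then
    echo "complete"                # new RS fully ready, old RS drained
  elif [ "$cr_new_ready" -lt "$cr_new" ]; then
    echo "new-pods-not-ready"      # most common blocker: inspect the new pods
  elif [ "$cr_new" -ge "$cr_desired" ] && [ "$cr_old" -ge "$cr_desired" ]; then
    echo "both-rs-full"            # old pods are not being removed
  else
    echo "in-progress"             # normal rolling update mid-flight
  fi
}
```

For example, `classify_rollout 10 3 0 8` prints `new-pods-not-ready`, which points at `kubectl describe pod` on the new ReplicaSet's pods as the next step.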

Useful commands

# Rollout status
kubectl rollout status deploy <name> --timeout=2m
kubectl rollout history deploy <name>
kubectl rollout history deploy <name> --revision=N
kubectl rollout pause deploy <name>
kubectl rollout resume deploy <name>
kubectl rollout undo deploy <name>
kubectl rollout undo deploy <name> --to-revision=N

# Deployment state
kubectl describe deploy <name>
kubectl get deploy <name> -o yaml | yq '.status'
kubectl get deploy <name> -o yaml | yq '.spec.strategy'

# ReplicaSet view (multiple = mid-rollout or churn)
kubectl get rs -l app=<label> --show-labels
kubectl describe rs <new-rs>

# New pod investigation
kubectl get pods -l pod-template-hash=<hash>
kubectl describe pod <new-pod>
kubectl logs <new-pod> --previous

# Readiness probe verification
kubectl get pod <pod> -o yaml | yq '.spec.containers[].readinessProbe'

# Pod disruption budgets blocking
kubectl get pdb -A
kubectl describe pdb <pdb>

# Force rollout (touch pod template to trigger new RS)
kubectl rollout restart deploy <name>
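
The Deployment's `Progressing` condition carries a machine-readable reason that is often quicker to check than the full `describe` output. The sketch below reads it with kubectl's JSONPath output and maps the documented reasons to a short verdict. The helper names `progressing_reason` and `explain_reason` are hypothetical, and the `default` namespace fallback is an assumption.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of kubectl.

# progressing_reason NAME [NAMESPACE]
# Prints the reason of the Deployment's Progressing condition.
progressing_reason() {
  kubectl get deploy "$1" -n "${2:-default}" \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'
}

# explain_reason REASON
# Maps a Progressing reason to a short verdict.
explain_reason() {
  case "$1" in
    NewReplicaSetAvailable)
      echo "rollout complete" ;;
    NewReplicaSetCreated|FoundNewReplicaSet|ReplicaSetUpdated)
      echo "rollout in progress" ;;
    ProgressDeadlineExceeded)
      echo "no progress within progressDeadlineSeconds" ;;
    *)
      echo "unrecognized reason: $1" ;;
  esac
}
```

A typical call would be `explain_reason "$(progressing_reason <name> <namespace>)"`.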

Common findings this catches

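  • New pods stuck in `ImagePullBackOff` → see imagepull-debugging.
  • Readiness probe path wrong → app `/healthz` returns 404; probe fails; new pods never Ready.
  • `maxUnavailable: 0` + `maxSurge: 0` → rejected by API validation, so the change never applies. Use the defaults (25%/25%) or set at least one of them above zero.
  • PDB with `minAvailable: 100%` → blocks node drains, not the rollout itself (rolling updates delete pods directly and bypass PDBs). Lower the PDB or raise replicas before draining.
  • `progressDeadlineSeconds: 60` on a slow-start app → falsely flagged as failed. Raise it above the app's worst-case startup time.
  • ConfigMap changed without `kubectl rollout restart` → no rollout starts; pods keep the old env values (read only at container start), and mounted files update only after a delay (never for `subPath` mounts). Add a rollout trigger.
  • Multiple old RSes accumulating → `revisionHistoryLimit` set too high (the default is 10); cluster bloat.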
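
The `maxSurge`/`maxUnavailable` percentages are easier to reason about as absolute pod counts. The sketch below does that arithmetic in plain shell, assuming percentage inputs only. `resolve_fenceposts` is a hypothetical name. The rounding directions (surge up, unavailable down) follow the Deployment documentation. The fallback to 1 when both resolve to zero reflects my understanding of the upstream controller, so verify it against your cluster version.

```shell
#!/bin/sh
# Sketch only: resolve_fenceposts is a hypothetical helper.
# Usage: resolve_fenceposts DESIRED SURGE_PCT UNAVAILABLE_PCT
# Prints "<surge> <unavailable>" as absolute pod counts.
resolve_fenceposts() {
  rf_desired=$1; rf_surge_pct=$2; rf_unavail_pct=$3

  # maxSurge percentages round up; maxUnavailable percentages round down.
  rf_surge=$(( (rf_desired * rf_surge_pct + 99) / 100 ))
  rf_unavail=$(( rf_desired * rf_unavail_pct / 100 ))

  # Assumption: when rounding leaves both at 0, the controller treats
  # maxUnavailable as 1 so the rollout can still move.
  if [ "$rf_surge" -eq 0 ] && [ "$rf_unavail" -eq 0 ]; then
    rf_unavail=1
  fi

  echo "$rf_surge $rf_unavail"
}
```

With 10 replicas and the 25%/25% defaults, `resolve_fenceposts 10 25 25` prints `3 2`: up to 13 pods can exist during the rollout, and at least 8 stay available.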

Verify a rollout safely

# 1. Pre-flight: review the diff against the live object
kubectl diff -f deploy.yaml

# 2. Apply with monitoring
kubectl apply -f deploy.yaml
kubectl rollout status deploy <name> --timeout=10m

# 3. If concerns, pause early
kubectl rollout pause deploy <name>
# Inspect, then:
kubectl rollout resume deploy <name>

# 4. Emergency rollback
kubectl rollout undo deploy <name>

When to escalate

  • Rollout stuck across many Deployments — likely cluster-wide issue (admission webhook, kube-controller-manager problem).
  • Pods passing readiness in isolation but failing as a group → service-mesh routing or readinessGate from another controller; coordinate with mesh owner.
  • ReplicaSet controller errors in kube-controller-manager logs — engage cluster admin.
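
To judge whether a stall is cluster-wide, it can help to list every Deployment whose `Progressing` condition reports `ProgressDeadlineExceeded`. The sketch below is one way to do that. `list_progress_reasons` and `stuck_deployments` are hypothetical helper names, and the tab-separated format is a choice made here, not a kubectl convention.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Prints "<namespace>/<name><TAB><Progressing reason>" for every Deployment.
list_progress_reasons() {
  kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Progressing")].reason}{"\n"}{end}'
}

# Reads that format on stdin and keeps only Deployments past their deadline.
stuck_deployments() {
  awk -F '\t' '$2 == "ProgressDeadlineExceeded" { print $1 }'
}

# Usage (requires cluster access):
#   list_progress_reasons | stuck_deployments
```

Many hits across unrelated namespaces suggest a shared cause such as an admission webhook or the controller manager. A single hit points back at that Deployment's own pods.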
