
Helm Release Rollback & Stuck Release Debug Prompt

Recover from a Helm release stuck in `pending-install` / `pending-upgrade` / `failed`, roll back safely, and avoid Helm-secret bloat that breaks future operations.

Target user
Kubernetes platform engineers using Helm in production
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are a senior Kubernetes platform engineer with deep experience operating Helm releases in production. You know that "just delete the secret" is sometimes the right answer and sometimes catastrophic — and you can tell which is which.

I will provide:
- The release name and namespace, and what `helm status <release> -n <ns>` shows
- `helm history <release> -n <ns>`
- `helm get values <release> -n <ns>`
- The chart version + repo + chart name
- What the user was trying to do when the release got stuck (install, upgrade, rollback, uninstall)
- The error from `helm install/upgrade` if any
- Whether `--atomic`, `--wait`, `--timeout`, or `--cleanup-on-fail` was used
- `kubectl get secrets -n <ns> -l owner=helm` (Helm stores release state as secrets here)

Your job:

1. **Decode the release state** from `helm status`:
   - `deployed`: successful current revision
   - `failed`: last operation failed; chart is partially applied
   - `pending-install`: install started but didn't finish (e.g., timed out, user Ctrl-C'd)
   - `pending-upgrade`: same for upgrade
   - `pending-rollback`: rollback started, didn't finish
   - `superseded`: an older revision that's been replaced — OK as history
   - `uninstalling`: uninstall in progress
   - `uninstalled`: removed (may still exist if `--keep-history`)
2. **For `pending-*` states**, identify the cause:
   - Helm hung waiting on resources (`--wait` with a slow rollout that didn't finish)
   - Helm was interrupted (Ctrl-C, CI timeout, network blip)
   - A webhook timed out
   - `helm upgrade` was run while a previous operation was still pending → "another operation in progress" lock
3. **For `failed` state**:
   - What did Helm install/modify before failing? `kubectl get all -n <ns> -l app.kubernetes.io/instance=<release>` (works only if the chart sets the standard `app.kubernetes.io/instance` label)
   - Is the cluster in a partial state (some new things created, some old things deleted)?
   - Can the chart be re-run safely (idempotent)?
4. **Recommend the recovery path in safest-first order**:
   - **`helm rollback <release> <revision>`** → safest if a known-good revision exists
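   - **`helm upgrade <release> <chart> --force`** → re-applies the chart, replacing resources that cannot be patched in place; useful for a failed upgrade, but replacement can restart workloads
   - **`helm upgrade <release> <chart> --reuse-values`** → re-runs the upgrade with the last release's values (do not add `--reset-values`; when both are given, Helm ignores `--reuse-values`)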
   - **`helm uninstall <release> --no-hooks`** → removes everything; useful when chart is unrecoverable
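   - **DELETE the release secret directly** (`kubectl delete secret -n <ns> sh.helm.release.v1.<release>.v<revision>`) → DESTRUCTIVE: Helm "forgets" that revision (and the whole release, if it was the only revision); its resources remain orphaned in the cluster
   - **EDIT the release secret status** (advanced) → change `pending-upgrade` to `deployed` to unblock; the status is stored in the encoded release payload as well as in the `status` label, so changing the label alone may not be enough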
5. **For "another operation in progress" lock**: identify whether a real operation is actually in flight (rare) vs. a stale lock (common). Then unstick.
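6. **For Helm-secret bloat** (for example, many revisions of a large, critical chart kept as Secrets):
   - Helm stores each revision as a separate Secret in the namespace
   - Kubernetes caps a Secret at 1 MiB; a revision whose compressed payload exceeds that cannot be saved, and many large revisions add load on etcd
   - Mitigation: `helm upgrade --history-max 10` (10 is the default) prunes older revisions on each upgrade; `helm history --max` only limits how many rows are displayed and deletes nothing
   - For very large charts: lower `--history-max` on `helm upgrade` (including `helm upgrade --install`) or set the `HELM_MAX_HISTORY` environment variable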
7. **For `--atomic` operations that failed mid-flight**:
   - Helm tries to rollback automatically
   - If that rollback fails (e.g., the chart in the previous revision is now incompatible), you can end up in `pending-rollback`
   - Recovery typically: identify whether the partial state is closer to the "old" or "new" version, then converge with `helm upgrade --force`
8. Mark every DESTRUCTIVE action explicitly.

---

Release name + namespace: [DESCRIBE]
Chart + version: [e.g., bitnami/postgresql 14.3.0]
`helm status <release>`:
```
[PASTE]
```
`helm history <release>`:
```
[PASTE]
```
What the user was doing when it got stuck: [DESCRIBE]
Error from helm install/upgrade:
```
[PASTE]
```
Live resources matching the release:
```
[PASTE kubectl get all -l app.kubernetes.io/instance=<release> -n <ns>]
```

Why this prompt works

Helm errors are confusing because Helm stores state in Kubernetes (release secrets) AND modifies cluster resources, and those two views can diverge. “Stuck pending-upgrade” doesn’t mean Helm is busy — it usually means Helm thinks it’s busy because nobody told it the previous operation gave up. This prompt forces an inventory: what does Helm think, what’s actually in the cluster, and what’s the safest reconciliation?
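
The inventory described above can be sketched as a comparison of two lists of `kind/name` entries, one taken from what Helm recorded and one from the live cluster. This is a minimal sketch that assumes both lists have already been saved to files; `diff_views` and the file names are hypothetical helpers, not part of Helm or kubectl.

```shell
#!/bin/sh
# diff_views HELM_LIST LIVE_LIST   (hypothetical helper)
# Each file holds one "kind/name" entry per line, for example the output of
# `kubectl get ... -o name`. Prints entries that appear in only one view.
diff_views() {
  helm_sorted=$(mktemp) || return 1
  live_sorted=$(mktemp) || return 1
  sort -u "$1" > "$helm_sorted"
  sort -u "$2" > "$live_sorted"
  # Recorded by Helm but absent from the cluster.
  comm -23 "$helm_sorted" "$live_sorted" | sed 's/^/missing-from-cluster /'
  # Present in the cluster but absent from Helm's record.
  comm -13 "$helm_sorted" "$live_sorted" | sed 's/^/unknown-to-helm /'
  rm -f "$helm_sorted" "$live_sorted"
}

# Possible usage for the live side (assumes the standard instance label):
#   kubectl get all,pvc,configmap,secret -n <ns> \
#     -l app.kubernetes.io/instance=<release> -o name > live.txt
#   diff_views helm.txt live.txt
```

How the Helm-side list is produced is left open here; it could be derived from `helm get manifest`. An empty result suggests the two views agree, at least for the resource kinds that were listed.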

How to use it

  1. Always run helm get values <release> > backup.yaml before any destructive recovery. Values are often un-version-controlled state.
  2. Always run helm history <release> to see if there’s a known-good revision to roll back to.
  3. Check live cluster state separately: kubectl get all,pvc,configmap,secret -l app.kubernetes.io/instance=<release>. Confirm what’s actually there before letting Helm “reconcile.”
  4. Distinguish “stuck” from “slow”: a release pending-upgrade 10 seconds after an upgrade is normal; 10 minutes is stuck.
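
The "stuck" versus "slow" rule of thumb in step 4 can be written down as a small age check. This is a sketch only: `classify_pending` is a hypothetical helper, the timestamps are assumed to be epoch seconds, and the 600-second default is just the 10-minute figure from the step above, not a Helm setting.

```shell
#!/bin/sh
# classify_pending STARTED_EPOCH NOW_EPOCH [THRESHOLD_SECONDS]   (hypothetical helper)
# Prints "slow" while the pending operation is younger than the threshold
# and "stuck" once it has reached it.
classify_pending() {
  started=$1
  now=$2
  threshold=${3:-600}
  age=$((now - started))
  if [ "$age" -ge "$threshold" ]; then
    echo "stuck"
  else
    echo "slow"
  fi
}

# Possible usage: convert the LAST DEPLOYED time from `helm status` to epoch
# seconds (the conversion is not shown here), then:
#   classify_pending "$started_epoch" "$(date +%s)"
```

A threshold that matches the `--timeout` used for the operation is probably a better choice than a fixed number, since a release is expected to stay pending for up to that long.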

Useful commands

# Inventory
helm list -A
helm list -A --pending           # only stuck releases
helm status <release> -n <ns>
helm history <release> -n <ns>

# Get state
helm get values <release> -n <ns> > current-values.yaml
helm get values <release> -n <ns> --all > all-values.yaml   # including defaults
helm get manifest <release> -n <ns> > current-manifest.yaml
helm get notes <release> -n <ns>
helm get hooks <release> -n <ns>

# Helm release secrets (state storage)
kubectl get secrets -n <ns> -l owner=helm
# A release with 5 revisions has 5 secrets named sh.helm.release.v1.<release>.v1..v5
kubectl describe secret -n <ns> sh.helm.release.v1.<release>.v3 | head

# Live resources tied to release
kubectl get all,pvc,configmap,secret -n <ns> -l app.kubernetes.io/instance=<release>

# Recovery options (safe → less safe)
helm rollback <release> <revision> -n <ns>
helm rollback <release> -n <ns>                           # to previous

helm upgrade <release> <chart> -n <ns> --reuse-values --force
helm upgrade <release> <chart> -n <ns> --atomic --timeout 5m

# Last resort
helm uninstall <release> -n <ns>                          # destroys release
helm uninstall <release> -n <ns> --no-hooks               # skip pre/post hooks
kubectl delete secret -n <ns> sh.helm.release.v1.<release>.v<N>   # surgical "Helm forget"

# "Another operation in progress" recovery (stale lock)
# 1. Confirm no real operation is in flight (no Helm CLI running)
# 2. Find the pending release secret:
kubectl get secrets -n <ns> -l owner=helm,status=pending-upgrade -o json | \
  jq -r '.items[] | .metadata.name'
# 3. Either edit status field (advanced) or delete that one secret to release the lock
# WARNING: deleting loses the upgrade attempt's recorded values

# Cleanup old revisions (after stability)
helm history <release> -n <ns> --max 5                    # display only: limits the rows shown, prunes nothing
# Helm keeps 10 revisions by default and prunes on upgrade (--history-max N); to remove one by hand:
kubectl delete secret -n <ns> sh.helm.release.v1.<release>.v<old-revision>
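
Before deleting old revision Secrets by hand, it can help to preview which ones would go. The sketch below reads Secret names and prints all but the newest few; `old_revision_secrets` is a hypothetical helper, and it assumes names end in `.v<N>` as described above. It prints names only and deletes nothing.

```shell
#!/bin/sh
# old_revision_secrets KEEP   (hypothetical helper)
# Reads release Secret names on stdin (sh.helm.release.v1.<release>.v<N>, with
# or without a "secret/" prefix) and prints every name except the KEEP newest
# revisions, oldest first.
old_revision_secrets() {
  keep=$1
  # Prefix each name with its numeric revision so that v10 sorts after v9,
  # then print everything except the last KEEP entries.
  sed -n 's/^\(.*\.v\([0-9][0-9]*\)\)$/\2 \1/p' |
    sort -n |
    awk -v keep="$keep" '
      { names[NR] = $2 }
      END { for (i = 1; i <= NR - keep; i++) print names[i] }
    '
}

# Possible usage (assumes Helm's name=<release> label on its Secrets):
#   kubectl get secrets -n <ns> -l owner=helm,name=<release> -o name |
#     old_revision_secrets 5
```

Reviewing the printed list before piping it into `kubectl delete` keeps the destructive step explicit.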

Recovery decision tree

helm status <release> shows:

├── deployed      → no recovery needed; you have a working release

├── failed        → recoverable
│   ├── Cluster state matches old version: helm rollback <release>
│   ├── Cluster state mostly matches new version: helm upgrade --force
│   └── Both directions look ugly: helm uninstall + reinstall (data loss risk)

├── pending-install      → install never finished
│   ├── Real operation in flight: WAIT
│   ├── Stale lock: delete the pending release secret OR retry install
│   └── If install partially created resources: clean them OR include in next install

├── pending-upgrade     → upgrade never finished
│   ├── Real operation in flight: WAIT
│   ├── Stale lock: identify last good revision; rollback OR delete pending secret + retry

├── pending-rollback   → previous rollback didn't finish
│   ├── Identify intended target revision; resume manually if possible
│   └── Worst case: delete pending-rollback secret + helm rollback to a known good

└── uninstalling       → uninstall in progress; usually just slow
    └── If stuck > 30 min: check for hooks blocking; --no-hooks retry
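
The top level of the tree above can be sketched as a `case` statement that maps a status to the first thing to check. `next_step` and its messages are illustrative only; the `superseded` and `uninstalled` branch is an addition based on the status list earlier in the prompt, and the function deliberately stops short of running any recovery command.

```shell
#!/bin/sh
# next_step STATUS   (hypothetical helper)
# Prints the first check for a given release status; returns 1 for a status
# it does not recognize.
next_step() {
  case "$1" in
    deployed)
      echo "no recovery needed" ;;
    failed)
      echo "compare cluster state with old and new revisions, then rollback or upgrade --force" ;;
    pending-install|pending-upgrade|pending-rollback)
      echo "confirm no operation is in flight before touching the pending secret" ;;
    uninstalling)
      echo "wait; if stuck, check for blocking hooks" ;;
    superseded|uninstalled)
      echo "historical revision; nothing to recover" ;;
    *)
      echo "unrecognized status: $1"
      return 1 ;;
  esac
}

# Possible usage: read the status from `helm status <release> -n <ns>`, then:
#   next_step pending-upgrade
```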

Common findings this catches

  • “Another operation in progress” but nothing is actually running → stale pending-upgrade lock from a CI job that was killed. Delete that revision’s secret (and possibly that revision’s resources if partially applied).
  • helm rollback succeeds but resources don’t change → the chart’s helm.sh/resource-policy: keep annotation kept old resources around; kubectl delete them manually before rollback.
  • Release failed because a CRD wasn’t installed before its CR → install CRDs before the main chart, either separately or via the chart’s crds/ directory (Helm applies crds/ on first install only and never upgrades those CRDs).
  • PVC deleted on helm uninstall because chart didn’t have helm.sh/resource-policy: keep on the PVC template. Data loss. Restore from backup.
  • Helm release secret approaching the 1 MiB Secret size limit, so new revisions fail to save → chart renders too many large resources; consider splitting it into several smaller releases.
  • helm upgrade --atomic failed and auto-rollback also failed → ended in pending-rollback. Manual rebuild required.
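
The PVC finding above can be checked before an uninstall ever happens by scanning rendered manifests for PersistentVolumeClaims that lack the keep annotation. The sketch below is a line-based heuristic rather than a YAML parser: `pvcs_without_keep` is a hypothetical helper, and it assumes block-style YAML with two-space indentation such as `helm template` typically emits.

```shell
#!/bin/sh
# pvcs_without_keep   (hypothetical helper)
# Reads multi-document YAML on stdin and prints the name of each
# PersistentVolumeClaim that has no helm.sh/resource-policy: keep line.
pvcs_without_keep() {
  awk '
    function emit() {
      if (is_pvc && !has_keep) print (name != "" ? name : "<unnamed>")
      is_pvc = 0; has_keep = 0; name = ""; in_meta = 0
    }
    /^---/ { emit(); next }                      # document separator
    /^kind:[[:space:]]*PersistentVolumeClaim[[:space:]]*$/ { is_pvc = 1 }
    /^metadata:/ { in_meta = 1; next }
    /^[^[:space:]#]/ { in_meta = 0 }             # any other top-level key
    in_meta && name == "" && /^  name:/ { name = $2 }
    /helm\.sh\/resource-policy.*keep/ { has_keep = 1 }
    END { emit() }
  '
}

# Possible usage:
#   helm template <release> <chart> -f values.yaml | pvcs_without_keep
```

PVCs created from a StatefulSet's volumeClaimTemplates are not rendered as separate documents, so they would not appear in this scan.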

Helm release secret status values

pending-install, pending-upgrade, pending-rollback, deployed, failed, superseded, uninstalling, uninstalled.

To edit (advanced):

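SECRET='sh.helm.release.v1.<release>.v<N>'
kubectl get secret "$SECRET" -n <ns> -o json | \
  jq '.metadata.labels.status="deployed"' | \
  kubectl apply -f -
# This changes only the status label. Helm 3 also stores the status inside the
# encoded release payload (.data.release) and reads it from there when checking
# for a pending operation, so the label edit alone may not unblock the release.
# Plugins such as `helm-mapkubeapis` rewrite that payload; hand edits require care.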
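
To see the status recorded inside the release data itself, the payload can be decoded read-only. The sketch below assumes the usual Helm 3 layout (gzip-compressed JSON, base64-encoded by Helm and then again by Kubernetes); `decode_release` and `release_status` are hypothetical helpers, and the status extraction is a text match that assumes compact JSON rather than a full JSON parse.

```shell
#!/bin/sh
# decode_release   (hypothetical helper)
# Reads the base64 text of a release Secret's .data.release field on stdin
# and writes the decoded release JSON to stdout.
decode_release() {
  base64 -d | base64 -d | gzip -dc
}

# release_status   (hypothetical helper)
# Prints the value of the first "status" key found in release JSON on stdin.
release_status() {
  grep -o '"status":"[^"]*"' | head -n 1 | cut -d'"' -f4
}

# Possible usage:
#   kubectl get secret "$SECRET" -n <ns> -o jsonpath='{.data.release}' |
#     decode_release | release_status
```

Comparing this value with the Secret's `status` label shows whether the two have drifted apart. Writing a modified payload back would need the reverse pipeline and is considerably riskier than reading it.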

Preventive practices

  • Pin chart versions in CI (--version); never helm upgrade to “latest” implicitly.
  • Use --atomic --timeout 10m in CI to get clean failure modes.
  • Set --history-max (for example 20) on helm upgrade --install; Helm then prunes older revisions automatically on each upgrade.
  • Use helm.sh/resource-policy: keep annotation on PVCs and irreplaceable resources.
  • Commit values.yaml for every environment to git; never rely solely on --set flags in operator memory.
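
Several of these practices can be combined in one CI wrapper. The sketch below only prints the command it would run, so it can be reviewed or logged first; `upgrade_cmd` is a hypothetical helper, and the 10m timeout and history limit of 20 are the example figures from the list above, not recommendations for every chart.

```shell
#!/bin/sh
# upgrade_cmd RELEASE CHART VERSION NAMESPACE VALUES_FILE   (hypothetical helper)
# Prints, without running it, a helm command with a pinned chart version,
# a committed values file, --atomic, a timeout, and a history limit.
# Returns 1 when no chart version is given.
upgrade_cmd() {
  release=$1 chart=$2 version=$3 namespace=$4 values=$5
  if [ -z "$version" ]; then
    echo "refusing to upgrade without a pinned chart version" >&2
    return 1
  fi
  printf 'helm upgrade --install %s %s --version %s -n %s -f %s --atomic --timeout 10m --history-max 20\n' \
    "$release" "$chart" "$version" "$namespace" "$values"
}

# Possible usage in CI (release, chart, and file names are examples):
#   upgrade_cmd web bitnami/postgresql 14.3.0 prod values-prod.yaml
```

Using `helm upgrade --install` is a choice made for this sketch so that the same command covers first install and later upgrades.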

When to escalate

  • Production release stuck and rollback target’s resources are gone (deleted out-of-band) — engage chart owner; manual reconstruction.
  • A Helm chart whose --cleanup-on-fail left orphan PVCs with production data — escalate; data recovery is the priority over cleanup.
  • Multi-chart releases where one chart’s resources depend on another’s — coordinate the recovery; piecemeal fixes can compound.
