Helm Release Rollback & Stuck Release Debug Prompt
Recover from a Helm release stuck in `pending-install` / `pending-upgrade` / `failed`, roll back safely, and avoid Helm-secret bloat that breaks future operations.
- Target user: Kubernetes platform engineers using Helm in production
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes platform engineer with deep experience operating Helm releases in production. You know that "just delete the secret" is sometimes the right answer and sometimes catastrophic, and you can tell which is which.

I will provide:
- The release name and namespace, and what `helm status <release> -n <ns>` shows
- `helm history <release> -n <ns>`
- `helm get values <release> -n <ns>`
- The chart version + repo + chart name
- What the user was trying to do when the release got stuck (install, upgrade, rollback, uninstall)
- The error from `helm install/upgrade` if any
- Whether `--atomic`, `--wait`, `--timeout`, or `--cleanup-on-fail` was used
- `kubectl get secrets -n <ns> -l owner=helm` (Helm stores release state as secrets here)

Your job:

1. **Decode the release state** from `helm status`:
   - `deployed`: successful current revision
   - `failed`: last operation failed; chart is partially applied
   - `pending-install`: install started but didn't finish (e.g., timed out, user Ctrl-C'd)
   - `pending-upgrade`: same for upgrade
   - `pending-rollback`: rollback started, didn't finish
   - `superseded`: an older revision that's been replaced; OK as history
   - `uninstalling`: uninstall in progress
   - `uninstalled`: removed (the record may still exist if `--keep-history` was used)
2. **For `pending-*` states**, identify the cause:
   - Helm hung waiting on resources (`--wait` with a slow rollout that didn't finish)
   - Helm was interrupted (Ctrl-C, CI timeout, network blip)
   - A webhook timed out
   - User has been running `helm upgrade` while a previous one was still pending → "another operation in progress" lock
3. **For `failed` state**:
   - What did Helm install/modify before failing? `kubectl get all -l app.kubernetes.io/instance=<release>`
   - Is the cluster in a partial state (some new things created, some old things deleted)?
   - Can the chart be re-run safely (idempotent)?
4. **Recommend the recovery path in safest-first order**:
   - **`helm rollback <release> <revision>`** → safest if a known-good revision exists
   - **`helm upgrade --force <release> <chart>`** → re-applies; useful for failed upgrade
   - **`helm upgrade --reuse-values <release> <chart>`** → re-runs the upgrade with the previous release's values (do not combine with `--reset-values`; Helm ignores `--reuse-values` when both are given)
   - **`helm uninstall <release> --no-hooks`** → removes everything; useful when chart is unrecoverable
   - **DELETE the release secret directly** (`kubectl delete secret -n <ns> sh.helm.release.v1.<release>.v<revision>`) → DESTRUCTIVE: Helm "forgets" that revision; resources remain orphaned in the cluster
   - **EDIT the release secret status** (advanced) → change `pending-upgrade` to `deployed` to unblock; note that the status lives both in the secret's label and inside the encoded release data
5. **For "another operation in progress" lock**: identify whether a real operation is actually in flight (rare) vs. a stale lock (common). Then unstick.
6. **For Helm-secret bloat** (for example, a critical chart sitting at the default maximum of 10 stored revisions):
   - Helm stores each revision as a separate Secret in the namespace
   - Release secrets approaching the ~1MB Kubernetes Secret size limit fail to store and put pressure on etcd
   - Mitigation: `helm upgrade --history-max 10` + occasional cleanup of older revisions (`helm history --max` only limits what is displayed; it does not prune)
   - For very large charts: set a lower `--history-max` from the first `helm upgrade --install`
7. **For `--atomic` operations that failed mid-flight**:
   - Helm tries to rollback automatically
   - If that rollback fails (e.g., the chart in the previous revision is now incompatible), you can end up in `pending-rollback`
   - Recovery typically: identify whether the partial state is closer to the "old" or "new" version, then converge with `helm upgrade --force`
8. Mark every DESTRUCTIVE action explicitly.

---

Release name + namespace: [DESCRIBE]
Chart + version: [e.g., bitnami/postgresql 14.3.0]

`helm status <release>`:
```
[PASTE]
```

`helm history <release>`:
```
[PASTE]
```

What the user was doing when it got stuck: [DESCRIBE]

Error from helm install/upgrade:
```
[PASTE]
```

Live resources matching the release:
```
[PASTE kubectl get all -l app.kubernetes.io/instance=<release> -n <ns>]
```
Why this prompt works
Helm errors are confusing because Helm stores state in Kubernetes (release secrets) AND modifies cluster resources, and those two views can diverge. “Stuck pending-upgrade” doesn’t mean Helm is busy — it usually means Helm thinks it’s busy because nobody told it the previous operation gave up. This prompt forces an inventory: what does Helm think, what’s actually in the cluster, and what’s the safest reconciliation?
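The inventory described above can be made concrete by comparing two lists of resource names. The following is a minimal sketch, not a definitive tool: the function name `compare_inventory` and the two input files are hypothetical, and it assumes you have already written one resource name per line (for example `deployment.apps/web`) into each file.

```shell
# compare_inventory HELM_LIST LIVE_LIST
# HELM_LIST: names of resources Helm believes it manages (one per line).
# LIVE_LIST: names of resources actually present in the cluster (one per line).
# Both files are hypothetical inputs you prepare beforehand.
compare_inventory() {
  local helm_list="$1" live_list="$2"
  local a b
  a="$(mktemp)"
  b="$(mktemp)"
  # comm requires sorted input; LC_ALL=C keeps sort and comm in agreement.
  LC_ALL=C sort -u "$helm_list" > "$a"
  LC_ALL=C sort -u "$live_list" > "$b"
  # -23 keeps lines only in the first file, -13 keeps lines only in the second.
  LC_ALL=C comm -23 "$a" "$b" | sed 's/^/MISSING_FROM_CLUSTER /'
  LC_ALL=C comm -13 "$a" "$b" | sed 's/^/NOT_IN_HELM_MANIFEST /'
  rm -f "$a" "$b"
}
```

The live list can come from `kubectl get all,pvc,configmap,secret -n <ns> -l app.kubernetes.io/instance=<release> -o name`; how you extract names from `helm get manifest` is left open here. Any line in the output is a point where the two views have diverged.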
How to use it
- Always run `helm get values <release> > backup.yaml` before any destructive recovery. Values are often un-version-controlled state.
- Always run `helm history <release>` to see if there’s a known-good revision to roll back to.
- Check live cluster state separately: `kubectl get all,pvc,configmap,secret -l app.kubernetes.io/instance=<release>`. Confirm what’s actually there before letting Helm “reconcile.”
- Distinguish “stuck” from “slow”: a release in `pending-upgrade` 10 seconds after an upgrade is normal; 10 minutes is stuck.
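The backup steps above can be bundled into one helper so nothing is skipped under pressure. This is a sketch under assumptions: the function name `backup_release` and the output file names are hypothetical, and it relies on the `owner=helm` and `name=<release>` labels that Helm 3 puts on its release secrets.

```shell
# backup_release RELEASE NAMESPACE [OUTDIR]
# Saves values, manifest, history, and the raw release secrets before recovery.
# Prints the backup directory on success.
backup_release() {
  local release="$1" ns="$2"
  local outdir="${3:-helm-backup-${release}-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$outdir" || return 1
  # User-supplied values only, then values merged with chart defaults.
  helm get values "$release" -n "$ns" > "$outdir/values.yaml" || return 1
  helm get values "$release" -n "$ns" --all > "$outdir/values-all.yaml" || return 1
  helm get manifest "$release" -n "$ns" > "$outdir/manifest.yaml" || return 1
  helm history "$release" -n "$ns" > "$outdir/history.txt" || return 1
  # Raw release secrets, so a deleted or edited revision can be restored.
  kubectl get secrets -n "$ns" -l "owner=helm,name=$release" -o yaml \
    > "$outdir/release-secrets.yaml" || return 1
  echo "$outdir"
}
```

Usage: `backup_release <release> <ns>` before any rollback, uninstall, or secret edit. Keeping the raw secrets alongside the values means a mistaken `kubectl delete secret` can be undone with `kubectl apply -f release-secrets.yaml`.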
Useful commands
# Inventory
helm list -A
helm list -A --pending # only stuck releases
helm status <release> -n <ns>
helm history <release> -n <ns>
# Get state
helm get values <release> -n <ns> > current-values.yaml
helm get values <release> -n <ns> --all > all-values.yaml # including defaults
helm get manifest <release> -n <ns> > current-manifest.yaml
helm get notes <release> -n <ns>
helm get hooks <release> -n <ns>
# Helm release secrets (state storage)
kubectl get secrets -n <ns> -l owner=helm
# A release with 5 revisions has 5 secrets named sh.helm.release.v1.<release>.v1..v5
kubectl describe secret -n <ns> sh.helm.release.v1.<release>.v3 | head
# Live resources tied to release
kubectl get all,pvc,configmap,secret -n <ns> -l app.kubernetes.io/instance=<release>
# Recovery options (safe → less safe)
helm rollback <release> <revision> -n <ns>
helm rollback <release> -n <ns> # to previous
helm upgrade <release> <chart> -n <ns> --reuse-values --force
helm upgrade <release> <chart> -n <ns> --atomic --timeout 5m
# Last resort
helm uninstall <release> -n <ns> # destroys release
helm uninstall <release> -n <ns> --no-hooks # skip pre/post hooks
kubectl delete secret -n <ns> sh.helm.release.v1.<release>.v<N> # surgical "Helm forget"
# "Another operation in progress" recovery (stale lock)
# 1. Confirm no real operation is in flight (no Helm CLI running)
# 2. Find the pending release secret:
kubectl get secrets -n <ns> -l owner=helm,status=pending-upgrade -o json | \
jq -r '.items[] | .metadata.name'
# 3. Either edit status field (advanced) or delete that one secret to release the lock
# WARNING: deleting loses the upgrade attempt's recorded values
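# Cleanup old revisions (after stability)
# `helm history --max` only limits how many revisions are printed; it deletes nothing
helm history <release> -n <ns> --max 5
# Helm keeps 10 revisions by default and prunes on upgrade (tune with `helm upgrade --history-max <N>`);
# to remove a specific old revision by hand:
kubectl delete secret -n <ns> sh.helm.release.v1.<release>.v<old-revision>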
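For manual cleanup, it helps to compute which revision secrets fall outside the newest N before deleting anything. A minimal sketch, assuming secret names follow the `sh.helm.release.v1.<release>.v<N>` pattern shown above; the function name `old_revision_secrets` is hypothetical, and the function only prints candidates, it deletes nothing.

```shell
# old_revision_secrets KEEP
# Reads release secret names for ONE release on stdin (one per line) and
# prints the names of every revision older than the newest KEEP revisions.
old_revision_secrets() {
  local keep="$1" name
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    # The revision number is whatever follows the last ".v" in the name.
    printf '%s %s\n' "${name##*.v}" "$name"
  done | sort -rn -k1,1 | awk -v keep="$keep" 'NR > keep { print $2 }'
}
```

Usage: `kubectl get secrets -n <ns> -l owner=helm,name=<release> -o name | old_revision_secrets 5` lists deletion candidates. Before deleting, check with `helm history` that the revisions you keep include the one currently marked `deployed`, since this sketch looks only at revision numbers and not at status.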
Recovery decision tree
helm status <release> shows:
│
├── deployed → no recovery needed; you have a working release
│
├── failed → recoverable
│ ├── Cluster state matches old version: helm rollback <release>
│ ├── Cluster state mostly matches new version: helm upgrade --force
│ └── Both directions look ugly: helm uninstall + reinstall (data loss risk)
│
├── pending-install → install never finished
│ ├── Real operation in flight: WAIT
│ ├── Stale lock: delete the pending release secret OR retry install
│ └── If install partially created resources: clean them OR include in next install
│
├── pending-upgrade → upgrade never finished
│ ├── Real operation in flight: WAIT
│ └── Stale lock: identify last good revision; rollback OR delete pending secret + retry
│
├── pending-rollback → previous rollback didn't finish
│ ├── Identify intended target revision; resume manually if possible
│ └── Worst case: delete pending-rollback secret + helm rollback to a known good
│
└── uninstalling → uninstall in progress; usually just slow
└── If stuck > 30 min: check for hooks blocking; --no-hooks retry
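The top level of the tree above maps naturally onto a `case` statement, which can be handy in a runbook script. This is only a sketch of the first branch decision: the function name `first_step_for_status` and the wording of the messages are hypothetical, and the real choice within each branch still needs the human checks listed in the tree.

```shell
# first_step_for_status STATUS
# Prints the first recovery step suggested by the decision tree for a status.
# Returns non-zero for a status the tree does not cover.
first_step_for_status() {
  case "$1" in
    deployed)
      echo "no recovery needed" ;;
    failed)
      echo "compare cluster state with old and new revisions, then rollback or upgrade --force" ;;
    pending-install|pending-upgrade|pending-rollback)
      echo "confirm no operation is in flight; if the lock is stale, roll back or remove the pending revision secret" ;;
    uninstalling)
      echo "wait; if stuck longer than 30 minutes, check hooks and retry with --no-hooks" ;;
    superseded|uninstalled)
      echo "historical revision; nothing to recover" ;;
    *)
      echo "unknown status: $1"
      return 1 ;;
  esac
}
```

One possible way to feed it, assuming `jq` is installed: `first_step_for_status "$(helm status <release> -n <ns> -o json | jq -r .info.status)"`.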
Common findings this catches
- “Another operation in progress” but nothing is actually running → stale `pending-upgrade` lock from a CI job that was killed. Delete that revision’s secret (and possibly that revision’s resources if partially applied).
- `helm rollback` succeeds but resources don’t change → the chart’s `helm.sh/resource-policy: keep` annotation kept old resources around; `kubectl delete` them manually before rollback.
- Release `failed` because a CRD wasn’t installed before its CR → install CRDs separately (or use `--skip-crds=false` + `crds/` dir in chart) before the main chart.
- PVC deleted on `helm uninstall` because chart didn’t have `helm.sh/resource-policy: keep` on the PVC template. Data loss. Restore from backup.
- Helm release secret approaching the ~1MB Secret size limit, so Helm fails to store the release → chart has too many large resources; consider splitting into sub-charts.
- `helm upgrade --atomic` failed and auto-rollback also failed → ended in `pending-rollback`. Manual rebuild required.
Helm release secret status values
pending-install, pending-upgrade, pending-rollback, deployed, failed, superseded, uninstalling, uninstalled.
To edit (advanced):
SECRET=sh.helm.release.v1.<release>.v<N>
kubectl get secret $SECRET -n <ns> -o json | \
jq '.metadata.labels.status="deployed"' | \
kubectl apply -f -
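# The command above changes only the `status` label. Helm 3 also encodes the status
# inside the compressed release data and reads that copy when it checks for a pending
# operation, so the label edit alone may not clear the lock. Rewriting the encoded data,
# by hand or with tools such as the `helm-mapkubeapis` plugin, requires care.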
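Because the status is also stored inside the encoded release data, it is worth reading that copy before trusting the label. A minimal sketch, assuming the Helm 3 secret storage layout (gzip-compressed JSON, base64-encoded by Helm and then again by the Kubernetes API); the function name `decode_release` is hypothetical.

```shell
# decode_release
# Reads the value of a release Secret's `.data.release` field on stdin and
# prints the decoded release JSON on stdout.
decode_release() {
  # Outer base64 layer: added by the Kubernetes API for all Secret data.
  # Inner base64 layer: added by Helm.
  # Innermost bytes: gzip-compressed JSON describing the release.
  base64 -d | base64 -d | gzip -dc
}
```

Usage, assuming `jq` is available: `kubectl get secret "$SECRET" -n <ns> -o jsonpath='{.data.release}' | decode_release | jq -r '.info.status'`. If this prints `pending-upgrade` while the label says `deployed`, the two copies disagree and Helm may still treat the release as locked.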
Preventive practices
- Pin chart versions in CI (`--version`); never `helm upgrade` to “latest” implicitly.
- Use `--atomic --timeout 10m` in CI to get clean failure modes.
- Set `--history-max 20` on `helm upgrade` (including `helm upgrade --install`); clean up older revisions periodically.
- Use the `helm.sh/resource-policy: keep` annotation on PVCs and irreplaceable resources.
- Commit `values.yaml` for every environment to git; never rely solely on `--set` flags in operator memory.
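Several of these practices can be enforced by wrapping the CI upgrade in a single function. A sketch assuming Helm 3 flag names; the function name `ci_upgrade` and its argument order are hypothetical, and the timeout and history values simply mirror the numbers in the list above.

```shell
# ci_upgrade RELEASE CHART VERSION NAMESPACE VALUES_FILE
# Runs a pinned, atomic upgrade with a bounded timeout and history.
ci_upgrade() {
  local release="$1" chart="$2" version="$3" ns="$4" values="$5"
  # Refuse to run without an explicit chart version, so CI never
  # drifts to whatever the repo currently serves as latest.
  if [ -z "$version" ]; then
    echo "refusing to upgrade $release without a pinned chart version" >&2
    return 2
  fi
  helm upgrade --install "$release" "$chart" \
    --namespace "$ns" \
    --version "$version" \
    --values "$values" \
    --atomic \
    --timeout 10m \
    --history-max 20
}
```

Usage: `ci_upgrade <release> <chart> <version> <ns> values.yaml`. Passing the committed values file rather than `--set` flags keeps the deployed configuration reproducible from git.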
When to escalate
- Production release stuck and rollback target’s resources are gone (deleted out-of-band): engage chart owner; manual reconstruction.
- A Helm chart whose `--cleanup-on-fail` left orphan PVCs with production data: escalate; data recovery is the priority over cleanup.
- Multi-chart releases where one chart’s resources depend on another’s: coordinate the recovery; piecemeal fixes can compound.