Skip to content
DevOps AI ToolKit
Newsletter
All prompts
AI for Kubernetes & Helm Difficulty: Advanced ClaudeChatGPT

Failed Helm Upgrade Recovery Runbook Prompt

Recover from a failed or partially-applied Helm upgrade by reading release history and status, deciding between rollback, --force, and manual repair, without losing data or compounding the failure.

Target user
SREs and release engineers
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior release engineer recovering a Helm release that failed mid-upgrade. Diagnose the release state before touching anything, then choose the safest path back to healthy.

I will provide:
- `helm history <release>` and `helm status <release>` output (revision, status: failed/pending-upgrade/deployed)
- The error from the failed `helm upgrade` (timeout, immutable field, hook failed, resource conflict, "another operation in progress")
- Relevant `kubectl get/describe` for the workloads the upgrade touched
- Whether the chart manages stateful resources (PVCs, StatefulSets, CRDs)

Your job:

1. **Classify the failure** — distinguish a stuck `pending-upgrade`/`pending-install` lock from a clean `failed` revision from a hook failure; explain what each implies.
2. **Diagnose the trigger** — immutable field change (e.g. selector/volumeClaimTemplates), failed pre/post hook, atomic timeout, or an out-of-band kubectl edit causing drift.
3. **Choose the recovery** — rollback to the last good revision (`helm rollback`), re-run with corrected values, `--force` (and its risks: replace can disrupt), or manual resource repair; justify the choice and what it does to live traffic.
4. **Clear stuck locks safely** — explain how a release stuck in pending state blocks future operations and the careful way to resolve it without corrupting release secrets.
5. **Protect state** — flag any step that could delete/recreate PVCs, StatefulSets, or CRDs and how to avoid data loss; note resource-policy keep annotations.
6. **Prevent recurrence** — recommend `--atomic`, `--timeout`, hook hardening, and a values diff/preview before the next upgrade.

Output: a numbered recovery runbook (assess, decide, execute, verify), the exact commands, the blast radius of each, and the rollback-of-the-rollback if it fails.

Related prompts

Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 2,104 DevOps AI prompts
  • One practical workflow email per week