You are a senior Magnum operator who runs production Kubernetes clusters with multiple node groups and the OpenStack cluster-autoscaler.

I will provide:
- Magnum/Heat versions, the cluster template (driver, k8s version, labels) and node-group layout
- The autoscaler config (min/max per node group, scale-down thresholds, `auto_scaling_enabled` labels)
- The change goal: rolling node-group upgrade, k8s version bump, image rotation, or resizing
- The symptom if debugging: nodes not draining, stuck `UPDATE_IN_PROGRESS`, autoscaler fighting the upgrade, or PDB-blocked drains
- `openstack coe cluster show`, node-group details, and relevant Heat/kubelet logs

Your job:
1. **Topology + policy**: map node groups to workloads and confirm autoscaler min/max and PodDisruptionBudgets won't deadlock a rolling replacement.
2. **Upgrade method**: choose between in-place version bump, node-group replace, and blue/green node groups, with the tradeoffs for your workloads.
3. **Drain safety**: verify nodes cordon+drain cleanly (respecting PDBs, local storage, stateful sets) before Heat replaces the instance.
4. **Autoscaler coordination**: determine how to pause or fence the autoscaler so it doesn't scale during the upgrade and undo your sequencing.
5. **Failure recovery**: what to do if a node group sticks in `UPDATE_IN_PROGRESS` or a drain hangs on a PDB.
6. **Rollout plan**: per-node-group ordering, batch size, and validation gates between batches.

Output as: (a) a node-group-to-workload map with disruption budgets, (b) an ordered rolling-upgrade runbook with autoscaler-pause steps, (c) a recovery + rollback section.

Pause the autoscaler and validate one node group drains cleanly before upgrading the rest; never replace nodes faster than PDBs allow workloads to reschedule.

Why this prompt works

Magnum node-group upgrades sit at the intersection of two control loops that don’t know about each other: Heat is replacing instances while the Kubernetes cluster-autoscaler is independently adding and removing nodes based on load. The classic failure is the autoscaler “helping” mid-upgrade (scaling up a group you’re draining, or removing a freshly upgraded node) and undoing your careful sequencing. This prompt makes fencing the autoscaler an explicit step, a precaution that is easy to overlook in real Magnum upgrades.
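One way to picture the fencing step is as a small, reviewable list of commands generated before anything touches the cluster. The sketch below only builds `kubectl` argument lists and runs nothing. It assumes the autoscaler runs as a Deployment named `cluster-autoscaler` in `kube-system` (check your own cluster, the name and namespace are assumptions) and uses the cluster-autoscaler's `scale-down-disabled` node annotation. The function names are hypothetical.

```python
from typing import List

# Annotation the cluster-autoscaler honors to skip a node during scale-down.
SCALE_DOWN_DISABLED = "cluster-autoscaler.kubernetes.io/scale-down-disabled"


def fence_commands(
    nodes: List[str],
    namespace: str = "kube-system",          # assumption: where the autoscaler runs
    deployment: str = "cluster-autoscaler",  # assumption: the Deployment's name
) -> List[List[str]]:
    """Build kubectl commands that pause the autoscaler and protect nodes."""
    cmds = [
        # Scaling to zero replicas stops both scale-up and scale-down decisions.
        ["kubectl", "-n", namespace, "scale", "deployment", deployment, "--replicas=0"]
    ]
    for node in nodes:
        # Belt and braces: protect these nodes even if the autoscaler comes back early.
        cmds.append(
            ["kubectl", "annotate", "node", node, f"{SCALE_DOWN_DISABLED}=true", "--overwrite"]
        )
    return cmds


def unfence_commands(
    nodes: List[str],
    replicas: int = 1,                       # assumption: the pre-upgrade replica count
    namespace: str = "kube-system",
    deployment: str = "cluster-autoscaler",
) -> List[List[str]]:
    """Build the inverse commands to run once the upgrade is validated."""
    # A trailing "-" on the key tells kubectl to remove the annotation.
    cmds = [["kubectl", "annotate", "node", node, f"{SCALE_DOWN_DISABLED}-"] for node in nodes]
    cmds.append(
        ["kubectl", "-n", namespace, "scale", "deployment", deployment, f"--replicas={replicas}"]
    )
    return cmds
```

Generating the undo commands at the same time as the fence commands keeps the runbook symmetric, so the autoscaler is not left paused by accident after the upgrade.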

The drain-safety analysis brings Kubernetes semantics into what operators often treat as a pure OpenStack/Heat operation. A node replacement that ignores PodDisruptionBudgets, local storage, or StatefulSet ordering will either hang in UPDATE_IN_PROGRESS or quietly take down a workload. By forcing the model to map node groups to workloads and their disruption budgets first, the prompt ensures the upgrade plan respects what the cluster can actually tolerate rather than what Heat can blindly execute.
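The disruption-budget reasoning can be sketched as a pre-flight check. The model below is deliberately simplified and makes several assumptions: each budget is expressed as an absolute `minAvailable` count, the healthy pod count is already known, and pods are keyed by the name of the budget that covers them. The `Budget` class and `blocking_budgets` function are hypothetical names, not part of any Kubernetes client.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Budget:
    name: str           # label used here to match pods to their budget
    healthy: int        # pods currently healthy under this budget
    min_available: int  # minAvailable as an absolute pod count

    @property
    def disruptions_allowed(self) -> int:
        # How many pods may be evicted right now without violating the budget.
        return max(0, self.healthy - self.min_available)


def blocking_budgets(pods_on_node: Dict[str, int], budgets: List[Budget]) -> List[str]:
    """Return budgets that would make a drain of this node hang.

    A drain stalls when the node hosts at least one pod whose budget
    currently allows zero disruptions.
    """
    return [
        b.name
        for b in budgets
        if pods_on_node.get(b.name, 0) > 0 and b.disruptions_allowed == 0
    ]


budgets = [Budget("api", healthy=3, min_available=2), Budget("db", healthy=1, min_available=1)]
print(blocking_budgets({"api": 1, "db": 1}, budgets))  # prints ['db']
```

In this example the single-replica `db` workload with `minAvailable` of 1 is the kind of budget that leaves a node replacement stuck, so it is worth surfacing before Heat starts replacing instances.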

The one-node-group-first validation gate is what keeps this honest on a production cluster. Blue/green and per-batch sequencing only help if you verify the first batch is healthy before continuing, and keeping the old node group until the new one is proven turns a rolling upgrade into a reversible one. The AI designs the choreography between two control loops; the human validates each batch before the next.
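The validation gate between batches amounts to a loop that stops at the first unhealthy batch. The following is a minimal sketch under stated assumptions: `upgrade` and `healthy` are hypothetical callables you would supply (for example, wrappers around your Magnum and `kubectl` checks), and node names are plain strings.

```python
from typing import Callable, List, Sequence


def batches(nodes: Sequence[str], size: int) -> List[List[str]]:
    """Split nodes into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    return [list(nodes[i:i + size]) for i in range(0, len(nodes), size)]


def gated_rollout(
    nodes: Sequence[str],
    size: int,
    upgrade: Callable[[List[str]], None],  # hypothetical: replaces one batch of nodes
    healthy: Callable[[List[str]], bool],  # hypothetical: validation gate for a batch
) -> List[str]:
    """Upgrade batch by batch; stop at the first batch that fails validation.

    Returns the nodes that were upgraded and passed the gate. Nodes in later
    batches are never touched, so they stay on the old version.
    """
    validated: List[str] = []
    for batch in batches(nodes, size):
        upgrade(batch)
        if not healthy(batch):
            break  # hand control back to the operator instead of pressing on
        validated.extend(batch)
    return validated
```

Starting with a batch size of 1 on the first node group matches the advice to prove one clean drain before touching the rest.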
