Building Production-Ready Magnum Cluster Templates in OpenStack

Magnum makes spinning up a Kubernetes cluster on OpenStack a single command, which is exactly why so many teams end up with clusters they can’t upgrade. The default cluster template gets you a demo; a production template needs deliberate choices about node groups, the autoscaler, networking, and — most importantly — how you’ll roll the cluster forward without taking down workloads. I’ve inherited “temporary” Magnum clusters that became load-bearing and then couldn’t be patched because nobody designed for the upgrade. Here’s how I build cluster templates that survive contact with production, and where AI helps plan the riskiest operation: the rolling node-group upgrade.

Cluster Templates Are Immutable, So Get Them Right

A Magnum cluster template captures the cluster’s DNA: the driver, Kubernetes version, network driver, volume driver, and a pile of labels. Crucially, templates are effectively immutable once clusters use them — you create new templates for new versions rather than mutating live ones. List and inspect:

openstack coe cluster template list
openstack coe cluster template show <template>

Because the template is your versioning unit, name it for the k8s version it deploys (k8s-v1.29-prod) so an upgrade is a new template, not an edit. The openstack category has the broader container-on-OpenStack playbooks.
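One way to keep the naming convention consistent is to derive the template name from the kube_tag instead of typing it by hand. This is a minimal sketch; the `template_name` helper and the `k8s-<minor>-<env>` pattern are my own illustration of the convention above, not anything Magnum provides.

```shell
# Hypothetical helper: build a template name from a kube_tag and an environment.
# The patch component is dropped so the name tracks the minor release.
template_name() {
  local kube_tag="$1" env="$2"
  # Strip the last dot-separated component: v1.29.4 becomes v1.29
  local minor="${kube_tag%.*}"
  printf 'k8s-%s-%s\n' "$minor" "$env"
}
```

Calling `template_name v1.29.4 prod` prints `k8s-v1.29-prod`, which you can pass straight to the template create command.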

The Labels That Actually Matter for Production

Magnum’s behavior is largely driven by labels. The ones I always set deliberately for production:

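# Template: --coe is required; the master count is set per cluster, not here.
openstack coe cluster template create k8s-v1.29-prod \
  --coe kubernetes \
  --image fedora-coreos-prod --keypair mykey \
  --external-network public --network-driver calico \
  --master-flavor m1.large --flavor m1.xlarge \
  --labels auto_scaling_enabled=true,min_node_count=3,max_node_count=10,\
auto_healing_enabled=true,kube_tag=v1.29.4,availability_zone=az1

# Cluster: three masters for control-plane HA.
openstack coe cluster create <cluster> \
  --cluster-template k8s-v1.29-prod \
  --master-count 3 --node-count 3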

Three masters for control-plane HA, auto_healing_enabled so a sick node gets replaced, and explicit kube_tag so you know exactly what version you’re running. The autoscaler labels (min/max_node_count) bound the cluster’s growth. Each of these is a decision you want made on purpose, not inherited from an example.
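To make "on purpose" enforceable, a small guard can refuse to proceed when a label you care about is missing from the string you are about to pass to --labels. The sketch below is plain shell with no Magnum calls; `require_labels` is a hypothetical helper name, and the list of required keys is whatever your team decides.

```shell
# Hypothetical guard: fail if any required key is absent from a
# comma-separated key=value label string.
# Usage: require_labels "k1=v1,k2=v2" k1 k2 ...
require_labels() {
  local labels="$1"; shift
  local missing=0 key
  for key in "$@"; do
    # Wrap in commas so a key only matches at the start of a label entry.
    case ",$labels," in
      *",$key="*) ;;
      *) echo "missing label: $key" >&2; missing=1 ;;
    esac
  done
  return "$missing"
}
```

For example, `require_labels "$LABELS" kube_tag min_node_count max_node_count auto_healing_enabled` returns non-zero and names the missing key if someone trimmed the label list.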

Node Groups: Don’t Run One Pool for Everything

A single node group forces every workload onto identical hardware and makes upgrades all-or-nothing. Production clusters use multiple node groups — say, a general pool and a memory-heavy pool — so you can size and upgrade them independently:

openstack coe nodegroup create <cluster> mem-pool \
  --flavor m1.2xlarge --node-count 2 --min-nodes 2 --max-nodes 6 \
  --role worker

Separate node groups also let you upgrade one pool at a time, which is the whole game for safe rolling upgrades.

The Hard Part: Rolling Upgrades Without Dropping Workloads

Here’s where most Magnum clusters get stuck. Upgrading means replacing nodes, and replacing a node means draining its pods — but two control loops are running at once: Heat is replacing instances while the cluster-autoscaler is independently adding and removing nodes based on load. The classic failure is the autoscaler “helping” mid-upgrade and undoing your sequencing.

Prompt: “Here’s my Magnum cluster: 2 node groups (general min3/max10, mem-pool min2/max6), autoscaler enabled, and these PodDisruptionBudgets. I want to upgrade from kube_tag v1.29.4 to v1.30.x. Produce a node-group-by-node-group rolling upgrade runbook that (a) fences the autoscaler before starting, (b) respects the PDBs during drains, and (c) validates one node group is healthy before touching the next. Flag any PDB that could deadlock a drain. Don’t tell me to run the upgrade command yet — I want to review the sequence first.”

Output: A runbook that first pinned node counts by setting each node group’s min and max to the same value, upgraded the mem-pool first (smaller blast radius), drained nodes one at a time while respecting a minAvailable: 2 PDB on a StatefulSet, validated pod rescheduling, and only then proceeded to the general pool. It also flagged a maxUnavailable: 0 PDB that would have hung the drain indefinitely.

That maxUnavailable: 0 catch is exactly the kind of deadlock that turns a routine upgrade into a 2 a.m. incident. The AI is excellent at sequencing against stated constraints and spotting the PDB that would wedge a drain. But I verify the autoscaler-fencing step works on my actual deployment before trusting it, because “pin min=max” behaves differently across autoscaler versions.
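The same deadlock check can be run directly against the cluster before any drain starts. This sketch lists every PodDisruptionBudget whose status currently allows zero disruptions. It assumes kubectl is pointed at the target cluster; `blocking_pdbs` is a hypothetical function name. Note that it catches more than maxUnavailable: 0, since a PDB whose pods are already unhealthy also reports zero allowed disruptions.

```shell
# Print namespace/name for every PDB that currently allows no evictions.
# A drain will block on any pod covered by one of these.
blocking_pdbs() {
  kubectl get pdb --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' |
    awk '$3 == 0 { print $1 "/" $2 }'
}
```

An empty result means no PDB is blocking evictions right now. Anything it prints is worth reading before the first node is cordoned.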

Fencing the Autoscaler

The concrete fencing move is to remove the autoscaler’s freedom during the upgrade by setting min and max node counts equal, then restore them after:

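# Pin the pool at its current size (5 nodes in this example) so the autoscaler has no room to act.
openstack coe nodegroup update <cluster> general \
  replace /min_node_count=5 /max_node_count=5
# ... do the upgrade ...
# Restore the original bounds once the pool is validated.
openstack coe nodegroup update <cluster> general \
  replace /min_node_count=3 /max_node_count=10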

Pro Tip: Always confirm the autoscaler is actually fenced before you replace a single node. An autoscaler that scales down a node you just upgraded, or scales up an old-version node mid-roll, will quietly undo your work and leave you with a mixed-version pool you didn’t plan for.
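A rough way to make that confirmation mechanical is to read the node group's bounds back from Magnum and refuse to start the upgrade unless they are equal. The function names below are hypothetical, and the sketch assumes your driver supports `openstack coe cluster upgrade` with `--nodegroup`. It only checks what Magnum has recorded; whether the running autoscaler has picked up the new bounds still needs a look at the autoscaler's own logs on your deployment.

```shell
# Print a single field of a node group, e.g. min_node_count.
nodegroup_field() {
  openstack coe nodegroup show "$1" "$2" -f value -c "$3"
}

# Succeed only if the node group's min and max node counts are equal.
is_fenced() {
  local min max
  min="$(nodegroup_field "$1" "$2" min_node_count)"
  max="$(nodegroup_field "$1" "$2" max_node_count)"
  [ -n "$min" ] && [ "$min" = "$max" ]
}

# upgrade_nodegroup CLUSTER NODEGROUP NEW_TEMPLATE
# Refuses to run against a pool whose bounds are not pinned.
upgrade_nodegroup() {
  if ! is_fenced "$1" "$2"; then
    echo "refusing to upgrade $2: autoscaler bounds are not pinned" >&2
    return 1
  fi
  # One node per batch keeps the roll as gentle as possible.
  openstack coe cluster upgrade "$1" "$3" --nodegroup "$2" --max-batch-size 1
}
```

With this in place, `upgrade_nodegroup <cluster> mem-pool k8s-v1.30-prod` does nothing but print a refusal if someone forgot the fencing step.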

Recovering From a Stuck Upgrade

When an upgrade sticks in UPDATE_IN_PROGRESS, it’s usually a drain hung on a PDB or a node that won’t come up healthy. Don’t force-delete pods to “unstick” it — that defeats the PDB’s purpose and can take down a stateful workload. Investigate the hung drain:

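# Pods that are not Running (Completed jobs will also show up here)
kubectl get pods -A -o wide | grep -v Running
# PDBs showing 0 under ALLOWED DISRUPTIONS are the usual drain blockers
kubectl get pdb -A
kubectl describe node <draining-node> | grep -A5 Taints
openstack coe cluster show <cluster> | grep status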

When I’m untangling a stuck rolling upgrade, I’ll hand the cluster status, the pending pods, and the PDBs to Claude and ask it to identify the specific PDB or unschedulable pod blocking the drain. That diagnosis is fast and usually right; I confirm by reading the PDB and the pod events myself before acting. Reusable Magnum prompts live in the prompt workspace.
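Before handing anything to a model, it helps to narrow the evidence to the nodes that are actually mid-drain. The sketch below finds cordoned nodes and lists whatever pods are still scheduled on them. The function names are hypothetical, and it assumes kubectl's default column layout for `get nodes` (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Names of nodes that are cordoned (STATUS contains SchedulingDisabled).
cordoned_nodes() {
  kubectl get nodes --no-headers | awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# For each cordoned node, list the pods still running on it.
# DaemonSet pods are expected to remain; anything else is a drain suspect.
pods_on_cordoned_nodes() {
  local node
  for node in $(cordoned_nodes); do
    echo "== $node"
    kubectl get pods --all-namespaces -o wide --no-headers \
      --field-selector "spec.nodeName=$node"
  done
}
```

The non-DaemonSet pods in that output are the ones whose PDBs and events I read first.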

Validate One Node Group, Keep the Old One

The discipline that makes upgrades reversible: upgrade the smallest node group first, validate workloads are healthy on the new nodes, and keep the old node group until the new one is proven. Blue/green at the node-group level turns a scary in-place roll into something you can back out of.
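"Validate" can be made concrete with a check that every node in the upgraded pool is Ready and reports the expected kubelet version. This is a sketch under assumptions: the function names are hypothetical, the label selector that identifies a pool varies by deployment (check `kubectl get nodes --show-labels` for the one yours uses), and it relies on kubectl's default NAME, STATUS, ROLES, AGE, VERSION columns.

```shell
# stale_nodes SELECTOR EXPECTED_VERSION
# Print nodes matching SELECTOR that are not plain Ready or not on EXPECTED_VERSION.
# A node that is still cordoned (Ready,SchedulingDisabled) counts as stale.
stale_nodes() {
  local selector="$1" expected="$2"
  kubectl get nodes -l "$selector" --no-headers |
    awk -v want="$expected" '$2 != "Ready" || $5 != want { print $1 }'
}

# Succeed only if no stale nodes remain in the pool.
pool_is_healthy() {
  [ -z "$(stale_nodes "$1" "$2")" ]
}
```

One caveat: a selector that matches no nodes also yields an empty list, so confirm the selector returns the pool's nodes before trusting a passing result. Node health is only half the validation; workload health on the new nodes still needs its own check.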

Conclusion

Production-ready Magnum is about designing for the upgrade you’ll have to do six months from now: immutable version-named templates, multiple node groups, deliberate labels, and a rolling-upgrade plan that fences the autoscaler and respects PodDisruptionBudgets. AI is genuinely strong at the sequencing and the deadlock-spotting — fencing steps, PDB conflicts, per-group ordering — and that’s where I lean on it. But every plan it drafts gets verified against my actual autoscaler version and PDBs, and I keep the old node group until the new one is proven. The model choreographs the upgrade; you validate each batch. More Magnum prompts are in the prompts library.
