Running Kubernetes on OpenStack with Magnum

Magnum is OpenStack’s container-orchestration service: run openstack coe cluster create, and a few minutes later you have a working Kubernetes cluster on your private cloud. The pitch is great. The reality is that Magnum is a thin, opinionated layer that orchestrates Heat, which orchestrates Nova, Neutron, Cinder, Octavia, and Barbican, and then bootstraps Kubernetes on top. A Magnum failure can therefore live in any of those six services, or in the bootstrap itself. After years of running Magnum-managed clusters, I’ve learned to debug it as what it is: a Heat stack with a Kubernetes bootstrap stapled on. Here’s the approach.

What actually happens on `cluster create`

When you create a cluster, Magnum renders a cluster template into a Heat stack. Heat builds the master and worker nodes (Nova), the network and load balancer for the API (Neutron + Octavia), and the volumes (Cinder), then runs cloud-init scripts that install and configure Kubernetes: kubelet, the API server, etcd, and the Cloud Provider OpenStack that lets Kubernetes create Cinder volumes and Octavia load balancers on demand. The cluster goes CREATE_COMPLETE only after the nodes signal that bootstrap finished. With the Heat-based drivers that signal lands on a Heat wait condition, and Magnum then picks up the resulting stack status from Heat.

That chain is why “my cluster is stuck in CREATE_IN_PROGRESS” is usually not a Magnum bug — it’s a Heat resource that failed or a node that booted but couldn’t finish its cloud-init.
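
One way to keep that chain straight while reading Heat output is to map each failed resource type to the service that owns it. The sketch below is illustrative only: owning_service is a hypothetical helper name, and the mapping assumes the standard OS::<Service>:: prefixes that Heat uses for resource types.

```shell
# Hypothetical helper: map a Heat resource type to the service to debug next.
# Assumes the standard OS::<Service>::<Type> naming of Heat resource types.
owning_service() {
  case "$1" in
    OS::Heat::WaitCondition*) echo "bootstrap" ;;  # a node never signalled back
    OS::Nova::*)              echo "nova" ;;
    OS::Neutron::*)           echo "neutron" ;;
    OS::Cinder::*)            echo "cinder" ;;
    OS::Octavia::*)           echo "octavia" ;;
    OS::Barbican::*)          echo "barbican" ;;
    OS::Heat::*)              echo "heat" ;;
    *)                        echo "unknown" ;;    # e.g. a nested template file
  esac
}
```

For example, owning_service OS::Octavia::LoadBalancer prints octavia, which tells you to go read the Octavia side rather than Magnum's.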

Step 1: Read the cluster and its Heat stack

Start with Magnum, then immediately drop to Heat:

openstack coe cluster list
openstack coe cluster show <cluster-name>
# Magnum stores the stack id — go straight to it:
openstack stack resource list <stack-id> --nested-depth 3 | grep -i fail
openstack stack event list <stack-id> --nested-depth 3 | grep -i fail

If a Heat resource failed, you’ve localized the problem to Nova capacity, a Neutron quota, an Octavia amphora that wouldn’t boot, and so on, and you debug that service. If every infrastructure resource shows CREATE_COMPLETE and only the wait conditions that the nodes signal at the end of bootstrap are still in progress or timed out, the infrastructure built fine and the Kubernetes bootstrap is what’s hanging.
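
The two-step lookup above can be wrapped into one function so you don't copy stack ids by hand. This is a minimal sketch: failed_resources is a hypothetical name, and it assumes your client supports the usual -f value -c <column> output options on these commands.

```shell
# Hypothetical helper: list failed Heat resources for a Magnum cluster.
# Exit status is grep's: 0 if something failed, 1 if nothing matched.
failed_resources() {
  local cluster="$1" stack_id
  # -f value -c stack_id prints only the stack id field.
  stack_id="$(openstack coe cluster show "$cluster" -f value -c stack_id)" || return 1
  if [ -z "$stack_id" ]; then
    echo "cluster $cluster has no stack id yet" >&2
    return 1
  fi
  openstack stack resource list "$stack_id" --nested-depth 3 -f value \
    -c resource_name -c resource_type -c resource_status | grep -i 'failed'
}
```

Run it as failed_resources <cluster-name>. No output means Heat has not recorded a failed resource, which points you toward the bootstrap instead.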

Step 2: When the nodes boot but Kubernetes never comes up

This is the classic “stuck for 20 minutes then CREATE_FAILED” case. The VMs exist; cloud-init didn’t finish. SSH into a master (Magnum injects your keypair) and read the bootstrap logs:

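ssh -i mykey.pem core@<master-ip>   # login user depends on the image: core, fedora, or ubuntu
sudo journalctl -u cloud-final -f            # cloud-init based images
sudo journalctl -u heat-container-agent -f   # assumption: Fedora CoreOS images run the bootstrap under this unit
sudo journalctl -u kubelet -f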

The usual culprits: the nodes can’t reach the container registry to pull the Kubernetes images (egress/proxy problem), the master can’t reach the OpenStack APIs to run Cloud Provider OpenStack (wrong cloud_provider_tag or auth), or the cluster template references an image/Kubernetes version combination that doesn’t exist. A node that can’t pull images will sit in bootstrap until the create timeout expires, so check egress first.
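
A quick way to check egress from the node is to probe the endpoints it needs and see whether any HTTP response comes back at all. The sketch below is an assumption-laden example: check_egress is a hypothetical helper, and the URLs you pass are yours to choose (the registry your images actually come from, and your Keystone endpoint).

```shell
# Hypothetical helper: report which URLs are reachable from this node.
# Any HTTP status (even 401 or 404) counts as reachable; curl reports 000
# when it could not connect at all.
check_egress() {
  local url code rc=0
  for url in "$@"; do
    code="$(curl -sS -o /dev/null --max-time 10 -w '%{http_code}' "$url" 2>/dev/null)" || true
    if [ -n "$code" ] && [ "$code" != "000" ]; then
      echo "OK   $url (HTTP $code)"
    else
      echo "FAIL $url"
      rc=1
    fi
  done
  return "$rc"
}
```

On the master you might run check_egress https://registry.k8s.io/v2/ <keystone-url>, substituting your own registry if the template pulls from a mirror. A FAIL line on the registry explains a hung image pull; a FAIL on Keystone explains a cloud provider that can't authenticate.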

Step 3: Get the cluster template right

Most Magnum pain is a wrong template, and templates are unforgiving. The labels carry critical glue:

openstack coe cluster template show <template>
openstack coe cluster template create k8s-prod \
  --image fedora-coreos-XX \
  --keypair mykey \
  --external-network public \
  --master-flavor m1.large \
  --flavor m1.large \
  --coe kubernetes \
  --network-driver calico \
  --volume-driver cinder \
  --docker-storage-driver overlay2 \
  --labels kube_tag=v1.28.4,cloud_provider_tag=v1.28.0,container_runtime=containerd

The kube_tag and cloud_provider_tag must be compatible, the image must be a supported Fedora CoreOS image that Magnum knows how to bootstrap, and the external-network must actually be your floating-IP network. A version skew between kube_tag and cloud_provider_tag is the most common silent failure: the cluster builds, but Cloud Provider OpenStack crash-loops and PersistentVolumeClaims never bind.
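
Skew between the two tags is easy to check before you ever create a cluster. The following is a rough sketch, not a compatibility matrix: it assumes "compatible" means both tags share the same Kubernetes major.minor, and label_value, minor_of, and check_tag_skew are hypothetical helper names.

```shell
# Print the value of one key from a comma-separated labels string.
label_value() {
  printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/^$2=//p" | head -n 1
}

# Reduce a tag such as v1.28.4 to its major.minor part (1.28).
minor_of() {
  printf '%s\n' "${1#v}" | cut -d. -f1,2
}

# Return 0 when both tags share a major.minor, 1 on skew, 2 if a tag is missing.
# Assumption: "compatible" means the same Kubernetes minor version.
check_tag_skew() {
  local kube cpo
  kube="$(label_value "$1" kube_tag)"
  cpo="$(label_value "$1" cloud_provider_tag)"
  if [ -z "$kube" ] || [ -z "$cpo" ]; then
    echo "missing kube_tag or cloud_provider_tag" >&2
    return 2
  fi
  if [ "$(minor_of "$kube")" = "$(minor_of "$cpo")" ]; then
    echo "ok: kube_tag=$kube cloud_provider_tag=$cpo"
  else
    echo "skew: kube_tag=$kube cloud_provider_tag=$cpo"
    return 1
  fi
}
```

Pass it the same string you give to --labels, for example check_tag_skew "kube_tag=v1.28.4,cloud_provider_tag=v1.28.0,container_runtime=containerd". Treat a passing check as necessary rather than sufficient, and confirm the pairing against the release notes for your Magnum version.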

Step 4: Cloud Provider OpenStack — the integration that earns its keep

The whole reason to run Magnum instead of plain VMs is the integration: a LoadBalancer Service in Kubernetes provisions a real Octavia LB, a PVC provisions a real Cinder volume. When that integration is broken, the cluster looks healthy but apps can’t get storage or ingress. Verify from inside the cluster:

kubectl -n kube-system get pods | grep -E 'openstack|cinder'
kubectl -n kube-system logs <openstack-cloud-controller-pod>

A crash-looping cloud controller almost always means bad credentials in its secret or an unreachable Keystone endpoint from the cluster network. Fix the secret or the endpoint, and PVCs and LoadBalancers start working.
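
To make the pod check less of an eyeball exercise, you can filter for integration pods that are not in a healthy state. This is a small sketch under assumptions: unhealthy_integration_pods is a hypothetical name, and it assumes the pod names contain openstack or cinder, as in the grep above, and that kubectl prints its default NAME, READY, STATUS columns.

```shell
# Hypothetical helper: print integration pods whose STATUS is not healthy.
# Column 1 is the pod name and column 3 the status in default kubectl output.
unhealthy_integration_pods() {
  kubectl -n kube-system get pods --no-headers \
    | awk '$1 ~ /openstack|cinder/ && $3 != "Running" && $3 != "Completed" {print $1, $3}'
}
```

Empty output means every matching pod is Running or Completed. Anything it prints, typically a CrashLoopBackOff, is the pod whose logs you read next.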

Step 5: Scaling and upgrades

Magnum scales by updating the Heat stack:

openstack coe cluster resize <cluster-name> 5
openstack coe cluster upgrade <cluster-name> <new-template>

resize adjusts the worker ResourceGroup — watch the Heat stack, because a resize that fails on capacity leaves the stack in UPDATE_FAILED and the cluster in a degraded state until you resolve the Nova/quota issue. Rolling upgrades replace nodes one at a time using the new template’s kube_tag; always upgrade a non-prod cluster first because a bad template can take out the whole fleet.
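
Since a failed resize leaves the cluster degraded, it helps to block until the operation reaches a terminal state rather than assume it worked. Here is a minimal polling sketch: wait_for_cluster is a hypothetical helper, and it assumes the cluster status always ends in _COMPLETE or _FAILED once the operation is over.

```shell
# Hypothetical helper: poll a cluster until its status is terminal.
# Returns 0 on *_COMPLETE, 1 on *_FAILED, 2 if the status lookup itself fails.
wait_for_cluster() {
  local cluster="$1" interval="${2:-30}" status
  while :; do
    status="$(openstack coe cluster show "$cluster" -f value -c status)" || return 2
    case "$status" in
      *_COMPLETE) echo "$status"; return 0 ;;
      *_FAILED)   echo "$status"; return 1 ;;
    esac
    sleep "$interval"
  done
}
```

A typical use is to run the resize and then wait_for_cluster <cluster-name> in the same shell. If it prints UPDATE_FAILED, go back to the Heat stack and find the failed resource before retrying.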

Where AI helps

Magnum failures are the worst kind of cross-service puzzle — the symptom is “cluster won’t come up” and the cause could be Heat, Nova capacity, an Octavia amphora, a registry egress rule, or a version-skewed label. I’ll paste the cluster show, the failed Heat events, and the cloud-init/kubelet log tail into a model and ask it to walk the chain and tell me the first thing that broke — infra vs. bootstrap vs. integration. It’s good at catching a kube_tag/cloud_provider_tag skew or an image-pull hang that I’d otherwise dig for.
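
Gathering that context by hand gets tedious, so I find it worth scripting. The sketch below only covers the OpenStack side and is an illustration, not a finished tool: triage_bundle is a hypothetical name, and the node logs still have to be collected over SSH separately.

```shell
# Hypothetical helper: print cluster details and failed Heat events together.
# Review the output for credentials or internal hostnames before sharing it.
triage_bundle() {
  local cluster="$1" stack_id
  stack_id="$(openstack coe cluster show "$cluster" -f value -c stack_id)" || return 1
  echo "== cluster show =="
  openstack coe cluster show "$cluster"
  echo "== failed heat events =="
  openstack stack event list "$stack_id" --nested-depth 3 | grep -i 'fail' \
    || echo "(no failed events found)"
}
```

Redirect it to a file with triage_bundle <cluster-name> > triage.txt, append the cloud-init and kubelet log tails from the node, and you have the whole chain in one place.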

Keep a saved Magnum triage prompt and lean on our other OpenStack guides — Heat, Octavia, and the networking guide especially, because a Magnum cluster is only as healthy as the services underneath it. The model reads the chain; you run every command yourself after you’ve understood what it does.

Generated commands and templates are assistive, not authoritative. Always verify against your own deployment before running anything in production.