Running Kubernetes on OpenStack with Magnum
Cluster templates, stuck CREATE_IN_PROGRESS, and the Cloud Provider OpenStack glue. Here's how to run Magnum-managed Kubernetes in production.
- #openstack
- #magnum
- #kubernetes
- #containers
- #heat
- #cloud-provider
Magnum is OpenStack’s container-orchestration service: run openstack coe cluster create, and a few minutes later you have a working Kubernetes cluster on your private cloud. The pitch is great. The reality is that Magnum is a thin, opinionated layer that orchestrates Heat, which orchestrates Nova, Neutron, Cinder, Octavia, and Barbican, and then bootstraps Kubernetes on top, so a Magnum failure can live in any of six services. After years of running Magnum-managed clusters, I’ve learned to debug it as what it is: a Heat stack with a Kubernetes bootstrap stapled on. Here’s the approach.
What actually happens on cluster create
When you create a cluster, Magnum renders a cluster template into a Heat stack. Heat builds the master and worker nodes (Nova), the network and load balancer for the API (Neutron + Octavia), and the volumes (Cinder), then runs cloud-init scripts that install and configure Kubernetes: kubelet, the API server, etcd, and the Cloud Provider OpenStack integration that lets Kubernetes create Cinder volumes and Octavia load balancers on demand. The cluster goes CREATE_COMPLETE only when the nodes signal back that bootstrap finished.
That chain is why “my cluster is stuck in CREATE_IN_PROGRESS” is usually not a Magnum bug — it’s a Heat resource that failed or a node that booted but couldn’t finish its cloud-init.
Step 1: Read the cluster and its Heat stack
Start with Magnum, then immediately drop to Heat:
openstack coe cluster list
openstack coe cluster show <cluster-name>
# Magnum stores the stack id — go straight to it:
openstack stack resource list <stack-id> --nested-depth 3 | grep -i fail
openstack stack event list <stack-id> --nested-depth 3 | grep -i fail
If a Heat resource failed, you’ve localized the problem to Nova capacity, a Neutron quota, an Octavia amphora that wouldn’t boot, and so on, and you debug that service. If the infrastructure resources (servers, networks, load balancer) all show CREATE_COMPLETE but the cluster still says CREATE_IN_PROGRESS, the infrastructure built fine and the Kubernetes bootstrap is what’s hanging.
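To make that split mechanical, here is a minimal sketch that reads the resource rows and names the service to chase for each failure. It assumes you feed it three whitespace-separated columns (resource name, resource type, status), for example from openstack stack resource list <stack-id> --nested-depth 3 -f value -c resource_name -c resource_type -c resource_status. The function names and the type-to-service mapping are my own illustration, not part of Magnum or Heat, and the mapping is deliberately coarse.

```shell
#!/bin/sh
# Map a Heat resource type to the place to look next.
# The buckets are an assumption for illustration; extend them for your cloud.
service_for_type() {
    case "$1" in
        OS::Nova::*)    echo nova ;;
        OS::Neutron::*) echo neutron ;;
        OS::Octavia::*) echo octavia ;;
        OS::Cinder::*)  echo cinder ;;
        # Deployment and wait-condition resources carry the node bootstrap.
        OS::Heat::SoftwareDeployment*|OS::Heat::WaitCondition*) echo bootstrap ;;
        # Nested stacks, ResourceGroups, and anything else.
        *)              echo heat-or-nested ;;
    esac
}

# Read "name type status" rows on stdin and print "name bucket"
# for every row whose status ends in FAILED.
failed_by_service() {
    while read -r name type status; do
        case "$status" in
            *FAILED) printf '%s %s\n' "$name" "$(service_for_type "$type")" ;;
        esac
    done
}
```

A FAILED row on a nested stack or ResourceGroup lands in the catch-all bucket; the leaf resource underneath it is usually the one that tells you which service to open.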
Step 2: When the nodes boot but Kubernetes never comes up
This is the classic “stuck for 20 minutes then CREATE_FAILED” case. The VMs exist; cloud-init didn’t finish. SSH into a master (Magnum injects your keypair) and read the bootstrap logs:
ssh -i mykey.pem core@<master-ip> # or 'fedora'/'ubuntu' per image
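sudo journalctl -u cloud-final -f
sudo journalctl -u heat-container-agent -f # Fedora CoreOS images: the bootstrap scripts run through this agent
sudo journalctl -u kubelet -f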
The usual culprits: the nodes can’t reach the container registry to pull the Kubernetes images (egress/proxy problem), the master can’t reach the OpenStack APIs to run Cloud Provider OpenStack (wrong cloud_provider_tag or auth), or the cluster template references an image/Kubernetes version combination that doesn’t exist. A node that can’t pull images will sit in bootstrap until the create times out, so check egress first.
Step 3: Get the cluster template right
Most Magnum pain is a wrong template, and templates are unforgiving. The labels carry critical glue:
openstack coe cluster template show <template>
openstack coe cluster template create k8s-prod \
--image fedora-coreos-XX \
--keypair mykey \
--external-network public \
--master-flavor m1.large \
--flavor m1.large \
--coe kubernetes \
--network-driver calico \
--volume-driver cinder \
--docker-storage-driver overlay2 \
--labels kube_tag=v1.28.4,cloud_provider_tag=v1.28.0,container_runtime=containerd
The kube_tag and cloud_provider_tag must be compatible, the image must be a supported Fedora CoreOS image that Magnum knows how to bootstrap, and the external-network must actually be your floating-IP network. A version skew between kube_tag and cloud_provider_tag is the most common silent failure: the cluster builds, but Cloud Provider OpenStack crash-loops and PersistentVolumeClaims never bind.
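Because the skew is silent, I like to check the labels string before the template ever reaches Magnum. The sketch below is an illustration under one loud assumption: that "compatible" means the two tags share the same major.minor version. That matches how Cloud Provider OpenStack releases are numbered, but the upstream compatibility notes for your release are the authority. The function names are hypothetical helpers, not Magnum commands.

```shell
#!/bin/sh
# Print the value of one key from a labels string such as "k1=v1,k2=v2".
label_value() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/^$2=//p"
}

# Reduce a tag such as "v1.28.4" to its major.minor part, "1.28".
major_minor() {
    printf '%s\n' "$1" | sed -E 's/^v?([0-9]+\.[0-9]+).*$/\1/'
}

# Return 0 when both tags share major.minor, 1 on skew,
# and 2 when either label is missing from the string.
# ASSUMPTION: same major.minor is treated as "compatible".
check_tag_skew() {
    kube=$(label_value "$1" kube_tag)
    ccm=$(label_value "$1" cloud_provider_tag)
    if [ -z "$kube" ] || [ -z "$ccm" ]; then
        echo "missing kube_tag or cloud_provider_tag"
        return 2
    fi
    if [ "$(major_minor "$kube")" = "$(major_minor "$ccm")" ]; then
        echo "ok: kube_tag=$kube cloud_provider_tag=$ccm"
    else
        echo "skew: kube_tag=$kube cloud_provider_tag=$ccm"
        return 1
    fi
}
```

Run it against the exact string you pass to --labels, for example check_tag_skew 'kube_tag=v1.28.4,cloud_provider_tag=v1.28.0,container_runtime=containerd', and refuse to create the template on a non-zero exit.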
Step 4: Cloud Provider OpenStack — the integration that earns its keep
The whole reason to run Magnum instead of plain VMs is the integration: a LoadBalancer Service in Kubernetes provisions a real Octavia LB, a PVC provisions a real Cinder volume. When that integration is broken, the cluster looks healthy but apps can’t get storage or ingress. Verify from inside the cluster:
kubectl -n kube-system get pods | grep -E 'openstack|cinder'
kubectl -n kube-system logs <openstack-cloud-controller-pod>
A crash-looping cloud controller almost always means bad credentials in its secret or a Keystone endpoint that is unreachable from the cluster network. Fix the secret or the endpoint, and PVCs and LoadBalancers start working.
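If you check this often, a small filter saves squinting at the pod list. This is a sketch, assuming the input is the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...) and reusing the same openstack|cinder name match as the grep above. The function name is my own, and the pod names in any sample input are illustrative.

```shell
#!/bin/sh
# Read `kubectl get pods --no-headers` rows on stdin and print
# "name status" for integration pods whose STATUS column is
# neither Running nor Completed (for example CrashLoopBackOff).
# Note: a pod that is Running but not Ready (0/1) is not flagged here.
unhealthy_integration_pods() {
    awk '$1 ~ /openstack|cinder/ && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Usage would look like kubectl -n kube-system get pods --no-headers | unhealthy_integration_pods; empty output means nothing in the integration is visibly crash-looping, and you still want the logs of anything it does print.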
Step 5: Scaling and upgrades
Magnum scales by updating the Heat stack:
openstack coe cluster resize <cluster-name> 5
openstack coe cluster upgrade <cluster-name> <new-template>
resize adjusts the worker ResourceGroup — watch the Heat stack, because a resize that fails on capacity leaves the stack in UPDATE_FAILED and the cluster in a degraded state until you resolve the Nova/quota issue. Rolling upgrades replace nodes one at a time using the new template’s kube_tag; always upgrade a non-prod cluster first because a bad template can take out the whole fleet.
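Since a capacity failure is the usual way a resize goes wrong, a quick headroom check before resizing is cheap insurance. The sketch below is a hypothetical helper, not a Magnum feature. It assumes you pass in the project's instance limit and current usage yourself (for example the maxTotalInstances and totalInstancesUsed values from openstack limits show --absolute) along with the current and target worker counts. It only checks the instance-count quota, so cores, RAM, and actual hypervisor capacity can still block a resize it approves.

```shell
#!/bin/sh
# can_resize MAX_INSTANCES USED_INSTANCES CURRENT_NODES TARGET_NODES
# Returns 0 when the extra workers fit in the instance quota, 1 otherwise.
can_resize() {
    max=$1; used=$2; current=$3; target=$4
    needed=$(( target - current ))
    if [ "$needed" -le 0 ]; then
        # Shrinking or staying the same size needs no new instances.
        echo "no new instances needed"
        return 0
    fi
    if [ "$max" -lt 0 ]; then
        # Nova reports -1 for an unlimited quota.
        echo "fits: quota is unlimited"
        return 0
    fi
    free=$(( max - used ))
    if [ "$needed" -le "$free" ]; then
        echo "fits: need $needed, free $free"
        return 0
    fi
    echo "blocked: need $needed, free $free"
    return 1
}
```

For example, can_resize 20 19 3 5 reports that two more workers do not fit in the one remaining instance slot, which is the point to raise the quota rather than discover it as UPDATE_FAILED.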
Where AI helps
Magnum failures are the worst kind of cross-service puzzle — the symptom is “cluster won’t come up” and the cause could be Heat, Nova capacity, an Octavia amphora, a registry egress rule, or a version-skewed label. I’ll paste the cluster show, the failed Heat events, and the cloud-init/kubelet log tail into a model and ask it to walk the chain and tell me the first thing that broke — infra vs. bootstrap vs. integration. It’s good at catching a kube_tag/cloud_provider_tag skew or an image-pull hang that I’d otherwise dig for.
Keep a saved Magnum triage prompt and lean on our other OpenStack guides — Heat, Octavia, and the networking guide especially, because a Magnum cluster is only as healthy as the services underneath it. The model reads the chain; you run every command yourself after you’ve understood what it does.
Generated commands and templates are assistive, not authoritative. Always verify against your own deployment before running anything in production.