Magnum Kubernetes Cluster Debug Prompt
Diagnose Magnum K8s cluster creation/scale failures — cluster template, COE driver, Heat stack interaction, node not joining, certificate issues.
- Target user: OpenStack operators running Magnum for K8s clusters
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior OpenStack engineer who has run Magnum-provisioned Kubernetes clusters in production. You can debug failures along the Magnum → Heat → Nova → Neutron → cloud-init chain.

I will provide:
- The symptom (cluster creation failed, scaling failed, master/worker not joining, kubectl not working)
- `openstack coe cluster show <id>`
- The cluster template (`openstack coe cluster template show`)
- Heat stack status (`openstack stack show` of the underlying stack)
- Magnum / Heat logs

Your job:

1. **Understand the stack**:
   - **Magnum** receives the cluster create request → spawns a Heat stack
   - **Heat stack** creates Nova instances (masters + workers)
   - Instances boot from a Glance image and run **cloud-init** to install / configure K8s
   - Magnum monitors the cluster and updates the `openstack coe cluster` status
2. **For "cluster create failed"**:
   - Check Heat stack status: `openstack stack show <stack-id>`
   - Resource failures cascade; find the first one
   - Common: Nova quota, Neutron port allocation, image issue
3. **For "nodes booting but not joining the cluster"**:
   - Cloud-init failure on the node
   - SSH or console into a worker, check `/var/log/cloud-init.log` and `/var/log/cloud-init-output.log`
   - Common: image missing K8s components, kubeadm join token expired, master API not reachable
4. **For cluster template selection**:
   - `coe = kubernetes`
   - `image_id`: must have K8s components AND cloud-init
   - `network_driver`: flannel, calico, cilium
   - `volume_driver`: cinder for CSI
   - `master_lb_enabled = true` for HA masters
5. **For certificate / kubectl access**:
   - `openstack coe cluster config <id>` downloads the kubeconfig
   - Certificate authority is generated at cluster create
   - Cert rotation: regenerate via Magnum (`openstack coe ca rotate <id>`) or manually
6. **For scaling**:
   - `openstack coe cluster resize <id> <node-count>`
   - Triggers a Heat stack update; new nodes are provisioned and joined
   - Failure usually has the same root causes as create
7. **For Magnum + Octavia (master LB)**:
   - Master LB enables multi-master HA
   - Failure of the LB blocks the cluster
8. **For upgrade**:
   - `openstack coe cluster upgrade <id> <new-template>`
   - Rolling replace of masters, then workers

Mark DESTRUCTIVE: deleting a cluster (drops all workloads), force-deleting a failed cluster (orphans the Heat stack), modifying a cluster template that is referenced by running clusters (they do not auto-update).

---

Symptom: [DESCRIBE]

Cluster state:
```
[PASTE `openstack coe cluster show <id>`]
```

Cluster template:
```
[PASTE]
```

Heat stack:
```
[PASTE `openstack stack show <stack-id>`]
```

Cloud-init logs from a failing node:
```
[PASTE]
```
Why this prompt works
Magnum delegates a lot to Heat which delegates to Nova. A failure can be 4 layers deep. This prompt walks the chain.
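To make the chain concrete, the type of the first failed Heat resource can be mapped to the service most likely to own the failure. The sketch below is a hypothetical helper (`layer_for_resource_type` is not part of any OpenStack tool), and the mapping is an assumption for triage, not something Magnum or Heat reports.

```shell
# Hypothetical triage helper: map a Heat resource type to the layer to inspect next.
# The mapping is an illustrative assumption; adjust it to your driver's templates.
layer_for_resource_type() {
  case "$1" in
    OS::Nova::*)     echo "nova" ;;
    OS::Neutron::*)  echo "neutron" ;;
    OS::Octavia::*)  echo "octavia" ;;
    OS::Cinder::*)   echo "cinder" ;;
    # Software deployments and wait conditions usually fail because of something in the guest.
    OS::Heat::SoftwareDeployment*|OS::Heat::WaitCondition*)
                     echo "guest" ;;
    # Nested stacks only wrap the real failure; look one level deeper.
    OS::Heat::ResourceGroup|OS::Heat::Stack|*.yaml)
                     echo "nested-stack" ;;
    *)               echo "unknown" ;;
  esac
}
```

For example, `layer_for_resource_type OS::Nova::Server` prints `nova`, which points at quota or scheduler checks before anything Kubernetes-specific.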
How to use it
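- Always check the Heat stack first: Magnum reports a generic failure, while Heat shows the specific resource and reason.
- For cloud-init issues, get onto the failing node over SSH or its console and read the cloud-init logs there.
- Check the image: it must contain all the components the cluster template expects.
- For scaling, treat the failure as a create failure for the new nodes.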
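The inputs the prompt asks for can be captured to files and stitched into the paste section in one step. This is a minimal sketch: the function names, argument order and file names are hypothetical, and it only formats text that has already been collected.

```shell
# Hypothetical helpers: names and argument order are illustrative only.

# Print one labelled, fenced section: prompt_section <label> <file>
prompt_section() {
  fence=$(printf '\140\140\140')     # three backticks, written as octal escapes
  printf '%s:\n%s\n' "$1" "$fence"
  printf '%s\n' "$(cat -- "$2")"     # normalise the trailing newline
  printf '%s\n' "$fence"
}

# build_prompt_inputs <symptom> <cluster-show> <template-show> <stack-show> <cloud-init-log>
build_prompt_inputs() {
  printf 'Symptom: %s\n' "$1"
  prompt_section "Cluster state" "$2"
  prompt_section "Cluster template" "$3"
  prompt_section "Heat stack" "$4"
  prompt_section "Cloud-init logs from a failing node" "$5"
}

# Example (file names are placeholders for output you saved earlier):
#   openstack coe cluster show <id> > cluster.txt
#   build_prompt_inputs "workers not joining" cluster.txt template.txt stack.txt cloud-init.txt
```

Review the result for credentials or tokens before pasting it into a hosted model.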
Useful commands
# Cluster
openstack coe cluster list
openstack coe cluster show <id>
openstack coe cluster template show <template>
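# Get kubeconfig (the command writes a file named "config" into --dir and prints an export line)
mkdir -p "$PWD/kubeconfig-dir"
openstack coe cluster config <id> --dir "$PWD/kubeconfig-dir"
export KUBECONFIG="$PWD/kubeconfig-dir/config"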
kubectl get nodes
# Heat stack underlying
STACK_ID=$(openstack coe cluster show <cluster-id> -f value -c stack_id)
openstack stack show "$STACK_ID"
openstack stack event list "$STACK_ID" --nested-depth 5
# Scaling (the node count is a positional argument)
openstack coe cluster resize <id> 5
# Upgrade (the new template is a positional argument)
openstack coe cluster upgrade <id> <new-template-id>
# Logs (Magnum); unit names vary by deployment, so adjust them if the services run in containers
sudo journalctl -u magnum-api -n 100 --no-pager
sudo journalctl -u magnum-conductor -n 100 --no-pager
# Cloud-init on a node (SSH to node)
sudo less /var/log/cloud-init.log
sudo less /var/log/cloud-init-output.log
# K8s side (after node joins); the login user depends on the image, e.g. core on Fedora CoreOS
ssh ubuntu@<master-ip>
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 100
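Because Heat failures cascade, the earliest failed event is usually the one worth reading. The sketch below assumes the event list has been reduced to three whitespace-separated columns (ISO timestamp, resource name, status); the function name is hypothetical and the column selection shown in the comment should be checked against your client's output.

```shell
# Hypothetical helper: print the earliest *_FAILED event.
# Input: lines of "<iso-timestamp> <resource-name> <status>" on stdin.
first_failed_event() {
  # ISO 8601 timestamps sort correctly as plain strings.
  sort -k1,1 | awk '$3 ~ /_FAILED$/ { print; exit }'
}

# Example (column names are an assumption about the client's output):
#   openstack stack event list "$STACK_ID" --nested-depth 5 \
#     -f value -c event_time -c resource_name -c resource_status | first_failed_event
```

The resource it names is the place to start; later failures in parent stacks generally just repeat the same reason.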
Common findings this catches
- Heat stack failed at OS::Nova::Server → Nova quota or scheduler issue.
- All workers boot but don’t join → join token expired (kubeadm tokens default to a 24-hour TTL unless the driver sets a shorter one); image issue.
- Master LB not creating → Octavia not configured or quota.
- Cluster ACTIVE but kubectl fails → API not reachable from outside (security group, FIP).
- Cluster template changes ignored → existing clusters not updated; upgrade required.
- Cinder CSI not working → cluster template lacks `volume_driver=cinder`, or Cinder is unhealthy.
- Scaling down stuck → drain step failed; pods can’t reschedule.
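For the "cluster ACTIVE but kubectl fails" case, it helps to know exactly which address kubectl is trying to reach before checking security groups and floating IPs. The sketch below pulls the host and port out of the first `server:` entry of a kubeconfig; the function name is hypothetical, and it assumes the URL carries an explicit port.

```shell
# Hypothetical helper: print "host:port" from the first "server:" URL in a kubeconfig.
kubeconfig_api_endpoint() {
  awk '$1 == "server:" { print $2; exit }' "$1" |
    sed -E 's#^https?://##; s#/.*$##'   # drop the scheme and any path
}

# Example: test plain TCP reachability from the machine where kubectl fails.
#   ep=$(kubeconfig_api_endpoint "$KUBECONFIG")
#   nc -z -w 5 "${ep%:*}" "${ep##*:}" && echo reachable || echo blocked
```

If the port is unreachable from outside but open from inside the cluster network, the finding above (security group or floating IP) is the likely cause.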
When to escalate
- The cluster type you need is not supported by Magnum in your release: engage upstream.
- Custom Magnum image building: coordinate with the platform team.
- Multi-tenant K8s sharing: review isolation carefully; Magnum provides only basic isolation.
Related prompts
- Heat Stack Failure Diagnosis Prompt: Diagnose Heat orchestration stack create/update/delete failures (template errors, dependency cycles, partial rollback states, resource-level errors).
- Kubernetes Node NotReady Diagnosis Prompt: Diagnose why a Kubernetes Node is `NotReady` (kubelet failures, container runtime crashes, disk/PID pressure, network plugin down, certificate expiry).
- OpenStack VM Troubleshooting Prompt: Diagnose Nova VM boot failures, networking issues, and stuck instances using nova/openstack CLI output.