Heat Stack Failure Diagnosis Prompt
Diagnose Heat orchestration stack create/update/delete failures — template errors, dependency cycles, partial rollback states, resource-level errors.
- Target user: OpenStack platform engineers and tenants deploying via Heat / HOT templates
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior OpenStack orchestration engineer with deep experience writing and debugging Heat (HOT) templates and recovering stuck stacks in production.

I will provide:
- A symptom: stack stuck in `CREATE_IN_PROGRESS` / `CREATE_FAILED` / `UPDATE_FAILED` / `DELETE_FAILED`
- Output of `openstack stack show <stack>` and `openstack stack event list <stack>`
- The HOT template (or the failing resource excerpt)
- Heat-engine log excerpts referencing this stack
- Backend service errors (Nova, Neutron, Cinder) that the stack triggered

Your job:
1. **Walk the resource graph**: list each resource in the template and its dependencies (explicit `depends_on` and implicit via `get_resource` / `get_attr`).
2. **Identify the failing resource**: the first one in topological order to error.
3. **Distinguish stack-level from backend-level errors**:
   - Heat-internal (template parse error, circular dependency, intrinsic function misuse, missing parameter)
   - Backend service rejection (Nova quota exceeded, Neutron port allocation failure, Cinder backend full)
   - Heat-to-backend transport (Keystone token expired mid-stack, RabbitMQ timeout)
4. **For a `DELETE_FAILED` stack**: identify which resource refuses to delete and why (FIPs in use, ports attached to live servers, volume snapshots blocking).
5. **For a stuck stack** (`*_IN_PROGRESS` for more than 1h): identify whether the underlying service call completed (resource exists in Nova/Neutron) but Heat lost track, or whether the call genuinely never returned.
6. **Recommend recovery**:
   - Safe: `openstack stack update --existing` with a fixed template
   - Less safe: `openstack stack abandon` (orphans backend resources but unsticks Heat)
   - Last resort: manual cleanup of backend resources, then `openstack stack delete --yes`

Flag every DANGEROUS recovery step explicitly.

---

Symptom: [DESCRIBE: stack status + how long]

OpenStack release: [yoga / zed / antelope / bobcat / caracal / dalmatian / epoxy]

`openstack stack show`:
```
[PASTE]
```

`openstack stack event list`:
```
[PASTE]
```

HOT template (full or failing resource):
```yaml
[PASTE]
```

Heat-engine log excerpt:
```
[PASTE]
```

Backend errors (if any):
```
[PASTE]
```
Why this prompt works
Heat failures look opaque because the user-visible status (CREATE_FAILED) is almost never the actual problem — the actual problem is one resource deep in the dependency graph that called Nova/Neutron/Cinder and got rejected. Templates with 50+ resources are common in production.
This prompt forces the model to topologically walk the resource graph and identify the first failure, rather than fixating on the user-facing status.
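The graph walk the prompt asks for can be sketched in a few lines of Python. This is a minimal illustration, not Heat's own resolver: it assumes the template has already been parsed from YAML into a dict, it only looks at `depends_on`, `get_resource`, and `get_attr`, and the function names (`find_refs`, `dependency_graph`, `first_failure`) and the sample templates are invented for the example.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def find_refs(node):
    """Yield resource names referenced through get_resource / get_attr."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "get_resource" and isinstance(value, str):
                yield value
            elif key == "get_attr" and isinstance(value, list) and value:
                if isinstance(value[0], str):
                    yield value[0]  # first element names the resource
                yield from find_refs(value[1:])
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def dependency_graph(template):
    """Map each resource name to the set of resources it depends on."""
    resources = template.get("resources", {})
    graph = {}
    for name, body in resources.items():
        explicit = body.get("depends_on", [])
        if isinstance(explicit, str):
            explicit = [explicit]
        deps = set(explicit) | set(find_refs(body.get("properties", {})))
        # Ignore references to names that are not resources in this template.
        graph[name] = {d for d in deps if d in resources}
    return graph


def first_failure(template, failed):
    """Return the failed resource that comes first in dependency order."""
    order = TopologicalSorter(dependency_graph(template)).static_order()
    return next((name for name in order if name in failed), None)


# Hypothetical template, already parsed from YAML into a dict.
template = {
    "resources": {
        "server": {"type": "OS::Nova::Server",
                   "properties": {"networks": [{"port": {"get_resource": "port"}}]}},
        "port": {"type": "OS::Neutron::Port",
                 "depends_on": "subnet",
                 "properties": {"network": {"get_resource": "net"}}},
        "subnet": {"type": "OS::Neutron::Subnet",
                   "properties": {"network": {"get_resource": "net"}}},
        "net": {"type": "OS::Neutron::Net"},
    }
}
print(first_failure(template, failed={"server", "port"}))  # port

# Two hypothetical ports that read each other's attributes form a cycle.
cyclic = {"resources": {
    "port_a": {"properties": {"x": {"get_attr": ["port_b", "fixed_ips"]}}},
    "port_b": {"properties": {"x": {"get_attr": ["port_a", "fixed_ips"]}}},
}}
try:
    first_failure(cyclic, failed=set())
except CycleError as exc:
    print("cycle:", exc.args[1])
```

When both `server` and `port` report a failure, the sketch returns `port`, because the server failure is usually a downstream consequence of the port failure.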
How to use it
- Get the event list first: `openstack stack event list <stack> --nested-depth 5` is the most information-dense single command. It shows each resource's transitions in order.
- Filter Heat-engine logs by stack ID, not name: stacks can share names across projects.
- For DELETE failures, run `openstack stack resource list <stack>`: resources in `DELETE_FAILED` state are the culprits; others may already be gone.
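If the event list is exported as JSON (for example with the CLI's `-f json` formatter), picking out the first failure can be scripted. The sketch below is illustrative only: the key names `event_time`, `resource_name`, `resource_status`, and `resource_status_reason` are assumptions about the JSON shape, the sample events are invented, and timestamps are assumed to be ISO 8601 strings in one consistent format so that they sort lexicographically.

```python
import json


def first_failed_event(events):
    """Return the earliest event whose status ends in _FAILED, or None."""
    failed = [e for e in events
              if e.get("resource_status", "").endswith("_FAILED")]
    return min(failed, key=lambda e: e["event_time"], default=None)


# Invented sample, shaped like the JSON the CLI is assumed to emit.
raw = """[
  {"event_time": "2024-05-01T10:00:01Z", "resource_name": "net",
   "resource_status": "CREATE_COMPLETE", "resource_status_reason": "state changed"},
  {"event_time": "2024-05-01T10:00:09Z", "resource_name": "my_stack",
   "resource_status": "CREATE_FAILED", "resource_status_reason": "Resource CREATE failed"},
  {"event_time": "2024-05-01T10:00:07Z", "resource_name": "port",
   "resource_status": "CREATE_FAILED", "resource_status_reason": "No more IP addresses available"}
]"""

event = first_failed_event(json.loads(raw))
print(event["resource_name"], "-", event["resource_status_reason"])
```

The stack's own `CREATE_FAILED` event arrives after the resource that caused it, which is why the sketch picks the earliest failure rather than the last one listed.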
Useful commands
# Stack & event view
openstack stack show <stack-id>
openstack stack event list <stack-id> --nested-depth 5
openstack stack resource list <stack-id>
openstack stack resource show <stack-id> <resource-name>
# Template validation (catch errors before submitting)
openstack orchestration template validate --template <template.yaml>
# Recovery (safe → less safe)
openstack stack update --existing --template <fixed.yaml> <stack-id>
openstack stack cancel <stack-id> # cancel a stuck IN_PROGRESS
openstack stack abandon <stack-id> # un-track from Heat (resources stay)
openstack stack delete --yes --wait <stack-id> # last resort, after manual backend cleanup
# Heat-engine logs (controller-side)
sudo journalctl -u heat-engine --since "30 min ago" | grep <stack-uuid>
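Once the log lines for a stack are isolated, the request IDs in them are what link a Heat failure to the backend service logs. The following sketch pulls them out, assuming request IDs follow the usual `req-<uuid>` form; the function name `request_ids_for_stack` and the sample log lines are invented for illustration and do not reproduce real heat-engine output.

```python
import re

# OpenStack request IDs are assumed to look like "req-" followed by a UUID.
REQ_ID = re.compile(r"req-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def request_ids_for_stack(lines, stack_id):
    """Return unique request IDs, in first-seen order, from lines naming the stack."""
    seen = {}
    for line in lines:
        if stack_id in line:
            for req_id in REQ_ID.findall(line):
                seen.setdefault(req_id, None)
    return list(seen)


# Invented log lines; only the stack UUID and request IDs matter here.
stack_id = "3f2a1b4c-1111-2222-3333-444455556666"
lines = [
    "INFO [req-aaaaaaaa-0000-0000-0000-000000000001] creating stack 3f2a1b4c-1111-2222-3333-444455556666",
    "INFO [req-bbbbbbbb-0000-0000-0000-000000000002] unrelated stack 99999999-1111-2222-3333-444455556666",
    "ERROR [req-aaaaaaaa-0000-0000-0000-000000000001] stack 3f2a1b4c-1111-2222-3333-444455556666 CREATE_FAILED",
    "ERROR [req-cccccccc-0000-0000-0000-000000000003] resource error in 3f2a1b4c-1111-2222-3333-444455556666",
]
print(request_ids_for_stack(lines, stack_id))
```

Each ID returned can then be searched for in the Nova, Neutron, or Cinder logs to find the backend-side rejection.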
Common findings this catches
- `OS::Nova::Server` fails with a cryptic Heat error → underlying Nova quota exceeded; visible only by tracing the request-id into nova-api.log.
- Circular dependency between two `OS::Neutron::Port` resources → both reference each other via `get_attr`. Template-validate passes; runtime hangs.
- `OS::Heat::WaitCondition` stuck → the in-instance script never POSTed to the WaitConditionHandle URL. Likely a metadata-service connectivity or cloud-init failure inside the VM.
- `DELETE_FAILED` on a Network → still has ports attached (servers that Heat did not create but a user attached). Solution: detach those manually, then retry stack delete.
- `get_attr` returns None → referenced resource not yet created, but dependency wasn't explicit. Add `depends_on:`.
- Stack stuck in `UPDATE_IN_PROGRESS` → heat-engine crashed mid-update; stack lock not released. Look for `stack_lock` table entries older than the engine restart.
Heat template anti-patterns this catches
- No `depends_on` when one resource only reads a runtime attribute that the other publishes. Works most of the time, races sometimes.
- `OS::Heat::SoftwareDeployment` without a corresponding `SoftwareConfig`: fails silently.
- `get_param` references to a parameter not declared in the `parameters:` block: fails late.
- `str_replace` with multi-line replacement values that themselves contain one of the `params:` keys: produces malformed output.
- Overly nested stacks (3+ levels): debugging the cross-stack `outputs` ↔ `parameters` wiring becomes painful; consider flattening.
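Some of these anti-patterns can be caught with a small lint pass before the template is submitted. As one example, the sketch below flags `get_param` references that have no matching entry in `parameters:`. It assumes the template is already parsed into a dict and that the first element of a list-form `get_param` is the parameter name; the function names and the sample template are hypothetical.

```python
# Pseudo parameters that HOT provides without a declaration.
PSEUDO_PARAMS = {"OS::stack_name", "OS::stack_id", "OS::project_id"}


def collect_get_param(node):
    """Yield every parameter name referenced through get_param."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "get_param":
                if isinstance(value, str):
                    yield value
                elif isinstance(value, list) and value and isinstance(value[0], str):
                    yield value[0]  # list form: [name, key_or_index, ...]
            else:
                yield from collect_get_param(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_get_param(item)


def undeclared_params(template):
    """Return referenced parameter names missing from the parameters block."""
    declared = set(template.get("parameters") or {})
    used = set(collect_get_param(template.get("resources") or {}))
    used |= set(collect_get_param(template.get("outputs") or {}))
    return sorted(used - declared - PSEUDO_PARAMS)


# Hypothetical template: "flavor" is used but never declared.
template = {
    "parameters": {"image": {"type": "string"}},
    "resources": {
        "server": {
            "type": "OS::Nova::Server",
            "properties": {
                "name": {"get_param": "OS::stack_name"},
                "image": {"get_param": "image"},
                "flavor": {"get_param": "flavor"},
            },
        }
    },
}
print(undeclared_params(template))  # ['flavor']
```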
When to escalate
- Any stack stuck for >2h with no progress in the event list: restart `heat-engine` only after confirming with the team and checking the `stack_lock` table.
- `stack abandon` recommendations: confirm someone owns the orphan cleanup checklist. Abandoned stacks are a slow-growing source of cloud waste.
- DB-level interventions (deleting rows in `stack`, `resource`, or `stack_lock`): almost always the wrong answer; pull in Heat / DB on-call instead.
Related prompts
- OpenStack Request-ID Log Trace Prompt: Correlate a single API request across services (nova-api → conductor → scheduler → compute → neutron → cinder) using OpenStack request IDs.
- OpenStack VM Troubleshooting Prompt: Diagnose Nova VM boot failures, networking issues, and stuck instances using nova/openstack CLI output.
- Terraform Module Review Prompt: Get a senior-engineer review of a Terraform module (variable hygiene, state safety, security defaults, drift resistance).