You are a senior OpenStack orchestration engineer with deep experience writing and debugging Heat (HOT) templates and recovering stuck stacks in production.

I will provide:
- A symptom: stack stuck in `CREATE_IN_PROGRESS` / `CREATE_FAILED` / `UPDATE_FAILED` / `DELETE_FAILED`
- Output of `openstack stack show <stack>` and `openstack stack event list <stack>`
- The HOT template (or the failing resource excerpt)
- Heat-engine log excerpts referencing this stack
- Backend service errors (Nova, Neutron, Cinder) that the stack triggered

Your job:
1. **Walk the resource graph**: list each resource in the template and its dependencies (explicit `depends_on` and implicit via `get_resource` / `get_attr`).
2. **Identify the failing resource**: the first one in topological order to error.
3. **Distinguish stack-level from backend-level errors**:
   - Heat-internal (template parse error, circular dependency, intrinsic function misuse, missing parameter)
   - Backend service rejection (Nova quota exceeded, Neutron port allocation failure, Cinder backend full)
   - Heat-to-backend transport (Keystone token expired mid-stack, RabbitMQ timeout)
4. **For a `DELETE_FAILED` stack**: identify which resource refuses to delete and why (FIPs in use, ports attached to live servers, volume snapshots blocking).
5. **For a stuck stack** (`*_IN_PROGRESS` for more than 1h): identify whether the underlying service call completed (resource exists in Nova/Neutron) but Heat lost track, or whether the call genuinely never returned.
6. **Recommend recovery**:
   - Safe: `openstack stack update --existing` with a fixed template
   - Less safe: `openstack stack abandon` (orphans backend resources but unsticks Heat)
   - Last resort: manual cleanup of backend resources, then retry `openstack stack delete`

Flag every DANGEROUS recovery step explicitly.

---

Symptom: [DESCRIBE: stack status + how long]

OpenStack release: [yoga / zed / antelope / bobcat / caracal / dalmatian / epoxy]

`openstack stack show`:
```
[PASTE]
```

`openstack stack event list`:
```
[PASTE]
```

HOT template (full or failing resource):
```yaml
[PASTE]
```

Heat-engine log excerpt:
```
[PASTE]
```

Backend errors (if any):
```
[PASTE]
```

Why this prompt works

Heat failures look opaque because the user-visible status (CREATE_FAILED) is almost never the actual problem. The actual problem is usually one resource deep in the dependency graph that called Nova, Neutron, or Cinder and got rejected, and with templates of 50+ resources common in production, that resource is easy to miss.

This prompt forces the model to topologically walk the resource graph and identify the first failure, rather than fixating on the user-facing status.
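
The topological walk described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the template has already been parsed into a plain dict (for example by a YAML loader). The helper names `find_refs`, `dependency_graph`, and `first_failed` are hypothetical and not part of Heat. It only follows `depends_on`, `get_resource`, and `get_attr`, so it may miss dependencies that Heat derives in other ways.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def find_refs(node):
    """Yield resource names referenced via get_resource / get_attr inside node."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "get_resource" and isinstance(value, str):
                yield value
            elif key == "get_attr" and isinstance(value, list) and value:
                if isinstance(value[0], str):
                    yield value[0]
                # Remaining arguments may themselves contain nested functions.
                yield from find_refs(value[1:])
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def dependency_graph(template):
    """Map each resource name to the set of resources it depends on."""
    resources = template.get("resources") or {}
    graph = {}
    for name, body in resources.items():
        explicit = body.get("depends_on") or []
        if isinstance(explicit, str):
            explicit = [explicit]
        implicit = find_refs(body.get("properties") or {})
        graph[name] = {
            dep for dep in [*explicit, *implicit]
            if dep in resources and dep != name
        }
    return graph


def first_failed(template, failed_names):
    """Return the earliest failed resource in dependency order, or None.

    Raises graphlib.CycleError if the template has a circular dependency.
    """
    order = TopologicalSorter(dependency_graph(template)).static_order()
    return next((name for name in order if name in failed_names), None)


# Hypothetical four-resource template, already parsed into a dict.
template = {
    "resources": {
        "net": {"type": "OS::Neutron::Net"},
        "subnet": {
            "type": "OS::Neutron::Subnet",
            "properties": {"network": {"get_resource": "net"}, "cidr": "10.0.0.0/24"},
        },
        "port": {
            "type": "OS::Neutron::Port",
            "depends_on": "subnet",
            "properties": {"network": {"get_resource": "net"}},
        },
        "server": {
            "type": "OS::Nova::Server",
            "properties": {"networks": [{"port": {"get_resource": "port"}}]},
        },
    }
}

print(first_failed(template, {"server", "port"}))  # port
```

Both `server` and `port` are reported as failed here, but `port` comes first in dependency order, so it is the one to investigate.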

How to use it

Get the event list first — openstack stack event list <stack> --nested-depth 5 is the most information-dense single command. It shows each resource’s transitions in order.
Filter Heat-engine logs by stack ID, not name — stacks can share names across projects.
For DELETE failures, run openstack stack resource list <stack> — resources in DELETE_FAILED state are the culprits; others may already be gone.
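
Picking the first failure out of a long event list can also be scripted. The sketch below assumes the events were exported with `-f json` and that each entry carries `resource_name`, `resource_status`, `resource_status_reason`, and `event_time` keys. Check those key names against your client version. The sample events are invented for illustration, and `first_failure` is a hypothetical helper.

```python
import json


def first_failure(events_json):
    """Return the oldest event whose status ends in _FAILED, or None."""
    events = json.loads(events_json)
    failed = [
        event for event in events
        if event.get("resource_status", "").endswith("_FAILED")
    ]
    # Timestamps in one consistent ISO 8601 format sort correctly as strings.
    return min(failed, key=lambda event: event["event_time"], default=None)


# Invented sample data: the stack-level failure is logged after the resource failure.
sample = json.dumps([
    {"resource_name": "net", "resource_status": "CREATE_COMPLETE",
     "resource_status_reason": "state changed",
     "event_time": "2024-01-01T10:00:05Z"},
    {"resource_name": "server", "resource_status": "CREATE_FAILED",
     "resource_status_reason": "hypothetical backend error",
     "event_time": "2024-01-01T10:00:20Z"},
    {"resource_name": "demo-stack", "resource_status": "CREATE_FAILED",
     "resource_status_reason": "Resource CREATE failed",
     "event_time": "2024-01-01T10:00:21Z"},
])

hit = first_failure(sample)
print(hit["resource_name"], "->", hit["resource_status_reason"])
```

The stack's own CREATE_FAILED event appears last, which is why sorting by time and taking the earliest failure points at the resource rather than the stack.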

Useful commands

# Stack & event view
openstack stack show <stack-id>
openstack stack event list <stack-id> --nested-depth 5
openstack stack resource list <stack-id>
openstack stack resource show <stack-id> <resource-name>

# Template validation (catch errors before submitting)
openstack orchestration template validate --template <template.yaml>

# Recovery (safe → less safe)
openstack stack update --existing --template <fixed.yaml> <stack-id>
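openstack stack cancel <stack-id>         # cancel a stuck IN_PROGRESS create or update
openstack stack abandon <stack-id>        # DANGEROUS: un-track from Heat (resources stay); needs enable_stack_abandon in heat.conf
openstack stack delete --yes <stack-id>   # last resort: retry after manual backend cleanup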

# Heat-engine logs (controller-side)
sudo journalctl -u heat-engine --since "30 min ago" | grep <stack-uuid>

Common findings this catches

OS::Nova::Server fails with cryptic Heat error → underlying Nova quota exceeded; visible only by tracing the request-id into nova-api.log.
Circular dependency between two OS::Neutron::Port resources → both reference each other via get_attr. Template-validate passes; runtime hangs.
OS::Heat::WaitCondition stuck — the in-instance script never POSTed to the WaitConditionHandle URL. Likely a metadata-service connectivity or cloud-init failure inside the VM.
DELETE_FAILED on a Network → still has ports attached (servers that Heat did not create but a user attached). Solution: detach those manually, then retry stack delete.
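get_attr returns None → the attribute is not populated yet at the time it is read. get_attr already implies a dependency on the resource itself, but not on whatever fills the attribute in later; add depends_on: on that other resource.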
Stack stuck in UPDATE_IN_PROGRESS → heat-engine crashed mid-update; stack lock not released. Look for stack_lock table entries older than the engine restart.

Heat template anti-patterns this catches

No depends_on when one resource only reads a runtime attribute that the other publishes. Works most of the time, races sometimes.
OS::Heat::SoftwareDeployment without a corresponding SoftwareConfig — silent fail.
get_param references to a parameter not declared in the parameters: block — fails late.
str_replace whose multi-line replacement values themselves contain one of the params: placeholder keys; the output is malformed.
Overly nested stacks (3+ levels) — debugging cross-stack outputs ↔ parameters becomes painful; consider flattening.
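
Some of these anti-patterns can be caught before the stack is submitted. As one example, the sketch below looks for get_param references to undeclared parameters. It assumes the template is already parsed into a dict and only scans the resources and outputs sections. `param_refs` and `undeclared_params` are hypothetical helper names, not Heat APIs.

```python
# Pseudo parameters that HOT provides without a declaration.
PSEUDO_PARAMS = {"OS::stack_name", "OS::stack_id", "OS::project_id"}


def param_refs(node):
    """Yield every parameter name referenced through get_param inside node."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "get_param":
                # get_param takes a name, or a list whose first item is the name.
                if isinstance(value, list) and value:
                    name, rest = value[0], value[1:]
                else:
                    name, rest = value, []
                if isinstance(name, str):
                    yield name
                # Path arguments after the name may hold nested functions.
                yield from param_refs(rest)
            else:
                yield from param_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from param_refs(item)


def undeclared_params(template):
    """Return a sorted list of parameters that are used but never declared."""
    declared = set(template.get("parameters") or {}) | PSEUDO_PARAMS
    used = set(param_refs(template.get("resources") or {}))
    used |= set(param_refs(template.get("outputs") or {}))
    return sorted(used - declared)


# Hypothetical template: "image" is referenced but not declared.
template = {
    "parameters": {"flavor": {"type": "string"}},
    "resources": {
        "server": {
            "type": "OS::Nova::Server",
            "properties": {
                "name": {"get_param": "OS::stack_name"},
                "flavor": {"get_param": "flavor"},
                "image": {"get_param": "image"},
            },
        }
    },
}

print(undeclared_params(template))  # ['image']
```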

When to escalate

Any stack stuck for >2h with no progress in the event list: restart heat-engine only after confirming with the team and checking the stack_lock table.
stack abandon recommendations — confirm someone owns the orphan cleanup checklist. Abandoned stacks are a slow-growing source of cloud waste.
DB-level interventions (deleting rows in stack, resource, or stack_lock) — almost always the wrong answer; pull in Heat / DB on-call instead.
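
The ">2h with no progress" rule can be checked mechanically from the stack status and the event timestamps. This is a rough sketch, assuming ISO 8601 timestamps, with a missing offset read as UTC. `is_stalled` and `parse_time` are hypothetical helpers, and the two-hour default simply mirrors the threshold above.

```python
from datetime import datetime, timedelta, timezone


def parse_time(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' or missing offset means UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def is_stalled(stack_status, event_times, now, threshold=timedelta(hours=2)):
    """True if the stack is IN_PROGRESS and its newest event is older than threshold.

    Returns False when there are no events at all, since there is nothing
    to measure progress against.
    """
    if not stack_status.endswith("_IN_PROGRESS") or not event_times:
        return False
    latest = max(parse_time(value) for value in event_times)
    return now - latest > threshold


# Invented timestamps: the last event is 2.5 hours before "now".
now = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
times = ["2024-01-01T10:00:05Z", "2024-01-01T10:30:00Z"]

print(is_stalled("UPDATE_IN_PROGRESS", times, now))  # True
```

A True result is only a prompt to start the escalation checks above, not a reason to restart anything on its own.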

Heat Stack Failure Diagnosis Prompt

Related prompts

OpenStack Request-ID Log Trace Prompt

OpenStack VM Troubleshooting Prompt

Terraform Module Review Prompt
