
Heat Stack Failure Diagnosis Prompt

Diagnose Heat orchestration stack create/update/delete failures — template errors, dependency cycles, partial rollback states, resource-level errors.

Target user: OpenStack platform engineers and tenants deploying via Heat / HOT templates
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack orchestration engineer with deep experience writing and debugging Heat (HOT) templates and recovering stuck stacks in production.

I will provide:
- A symptom: stack stuck in `CREATE_IN_PROGRESS` / `CREATE_FAILED` / `UPDATE_FAILED` / `DELETE_FAILED`
- Output of `openstack stack show <stack>` and `openstack stack event list <stack>`
- The HOT template (or the failing resource excerpt)
- Heat-engine log excerpts referencing this stack
- Backend service errors (Nova, Neutron, Cinder) that the stack triggered

Your job:

1. **Walk the resource graph**: list each resource in the template and its dependencies (explicit `depends_on` and implicit via `get_resource` / `get_attr`).
2. **Identify the failing resource** — the first one in topological order to error.
3. **Distinguish stack-level from backend-level errors**:
   - Heat-internal (template parse error, circular dependency, intrinsic function misuse, missing parameter)
   - Backend service rejection (Nova quota exceeded, Neutron port allocation fail, Cinder backend full)
   - Heat-to-backend transport (Keystone token expired mid-stack, RabbitMQ timeout)
4. **For a `DELETE_FAILED` stack**: identify which resource refuses to delete and why (FIPs in use, ports attached to live servers, volume snapshots blocking).
5. **For a stuck stack** (in_progress > 1h): identify whether the underlying service call completed (resource exists in Nova/Neutron) but Heat lost track, or whether the call genuinely never returned.
6. **Recommend recovery**:
   - Safe: `openstack stack update --existing` with a fixed template
   - Less safe: `openstack stack abandon` (orphans backend resources but unsticks Heat)
   - Last resort: manual cleanup of backend resources, then retry `openstack stack delete`

Flag every DANGEROUS recovery step explicitly.

---

Symptom: [DESCRIBE — stack status + how long]
OpenStack release: [yoga / zed / antelope / bobcat / caracal / dalmatian / epoxy]

`openstack stack show`:
```
[PASTE]
```

`openstack stack event list`:
```
[PASTE]
```

HOT template (full or failing resource):
```yaml
[PASTE]
```

Heat-engine log excerpt:
```
[PASTE]
```

Backend errors (if any):
```
[PASTE]
```

Why this prompt works

Heat failures look opaque because the user-visible status (CREATE_FAILED) is almost never the actual problem — the actual problem is one resource deep in the dependency graph that called Nova/Neutron/Cinder and got rejected. Templates with 50+ resources are common in production.

This prompt forces the model to topologically walk the resource graph and identify the first failure, rather than fixating on the user-facing status.
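
The topological walk described above can be sketched in Python. This is a minimal illustration, not Heat's own implementation: it assumes the template's `resources` section has already been loaded into a plain dict (for example with a YAML parser), it only follows `depends_on`, `get_resource`, and `get_attr`, and the function names `find_refs`, `dependency_graph`, and `first_failure` are hypothetical. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter


def find_refs(node):
    """Yield resource names referenced via get_resource / get_attr inside node."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "get_resource" and isinstance(value, str):
                yield value
            elif key == "get_attr" and isinstance(value, list) and value:
                if isinstance(value[0], str):
                    yield value[0]
                # Arguments after the resource name may hold nested functions.
                yield from find_refs(value[1:])
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def dependency_graph(resources):
    """Map each resource name to the set of resources it depends on."""
    graph = {}
    for name, body in resources.items():
        explicit = body.get("depends_on", [])
        if isinstance(explicit, str):
            explicit = [explicit]
        implicit = find_refs(body.get("properties", {}))
        graph[name] = {d for d in (*explicit, *implicit)
                       if d in resources and d != name}
    return graph


def first_failure(resources, failed):
    """Return the first failed resource in dependency order, or None."""
    order = TopologicalSorter(dependency_graph(resources)).static_order()
    return next((name for name in order if name in failed), None)


# Hypothetical four-resource template, reduced to the fields the walk reads.
resources = {
    "net": {"type": "OS::Neutron::Net"},
    "subnet": {"type": "OS::Neutron::Subnet",
               "properties": {"network": {"get_resource": "net"}}},
    "port": {"type": "OS::Neutron::Port",
             "properties": {"network": {"get_resource": "net"},
                            "fixed_ips": [{"subnet": {"get_resource": "subnet"}}]}},
    "server": {"type": "OS::Nova::Server",
               "depends_on": "subnet",
               "properties": {"networks": [{"port": {"get_resource": "port"}}]}},
}
print(first_failure(resources, failed={"server", "port"}))
```

Here both `server` and `port` are marked failed, but `port` comes first in dependency order, so it is the one worth investigating; the `server` failure is likely a consequence.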

How to use it

  1. Get the event list first: openstack stack event list <stack> --nested-depth 5 is the most information-dense single command. It shows each resource’s transitions in order.
  2. Filter Heat-engine logs by stack ID, not name — stacks can share names across projects.
  3. For DELETE failures, run openstack stack resource list <stack> — resources in DELETE_FAILED state are the culprits; others may already be gone.
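
Once the event list is exported as JSON (for example by adding `-f json` to the event list command), picking out the earliest failure can be scripted. The sketch below is hedged: the key names `resource_name`, `resource_status`, `resource_status_reason`, and `event_time` are assumptions about the JSON shape and may differ between client releases, and the sample data is invented for illustration.

```python
import json


def first_failed_event(events):
    """Return the earliest event whose status ends in _FAILED, or None."""
    failed = [e for e in events
              if str(e.get("resource_status", "")).endswith("_FAILED")]
    # ISO-8601 timestamps in one consistent format sort correctly as strings.
    return min(failed, key=lambda e: e["event_time"], default=None)


# Hypothetical sample; real output would come from the CLI, not a literal.
raw = """[
  {"resource_name": "net", "resource_status": "CREATE_COMPLETE",
   "event_time": "2024-05-01T10:00:01Z", "resource_status_reason": "state changed"},
  {"resource_name": "port", "resource_status": "CREATE_FAILED",
   "event_time": "2024-05-01T10:00:05Z", "resource_status_reason": "placeholder reason"},
  {"resource_name": "my_stack", "resource_status": "CREATE_FAILED",
   "event_time": "2024-05-01T10:00:06Z", "resource_status_reason": "Resource CREATE failed"}
]"""

root = first_failed_event(json.loads(raw))
print(root["resource_name"], "-", root["resource_status_reason"])
```

The stack-level `CREATE_FAILED` event arrives after the resource-level one, which is why sorting by time surfaces the resource rather than the stack itself.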

Useful commands

# Stack & event view
openstack stack show <stack-id>
openstack stack event list <stack-id> --nested-depth 5
openstack stack resource list <stack-id>
openstack stack resource show <stack-id> <resource-name>

# Template validation (catch errors before submitting)
openstack orchestration template validate --template <template.yaml>

# Recovery (safe → less safe)
openstack stack update --existing --template <fixed.yaml> <stack-id>
openstack stack cancel <stack-id>         # cancel a stuck IN_PROGRESS
openstack stack abandon <stack-id>        # un-track from Heat (resources stay)
openstack stack delete --yes --wait <stack-id> # last resort, after manual backend cleanup

# Heat-engine logs (controller-side)
sudo journalctl -u heat-engine --since "30 min ago" | grep <stack-uuid>

Common findings this catches

  • OS::Nova::Server fails with cryptic Heat error → underlying Nova quota exceeded; visible only by tracing the request-id into nova-api.log.
  • Circular dependency between two OS::Neutron::Port resources → both reference each other via get_attr. Template-validate passes; runtime hangs.
  • OS::Heat::WaitCondition stuck — the in-instance script never POSTed to the WaitConditionHandle URL. Likely a metadata-service connectivity or cloud-init failure inside the VM.
  • DELETE_FAILED on a Network → still has ports attached (servers that Heat did not create but a user attached). Solution: detach those manually, then retry stack delete.
  • get_attr returns None → referenced resource not yet created, but dependency wasn’t explicit. Add depends_on:.
  • Stack stuck in UPDATE_IN_PROGRESS → heat-engine crashed mid-update; stack lock not released. Look for stack_lock table entries older than the engine restart.
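
The circular-dependency finding can be checked offline before a template is submitted. The following is a minimal sketch, assuming the dependency graph has already been written down as a dict that maps each resource name to the names it depends on; `find_cycle` is a hypothetical helper, and the resource names are illustrative. It relies on the standard-library `graphlib` module (Python 3.9+), not on anything Heat provides.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(graph):
    """Return one dependency cycle as a list of names, or None if acyclic."""
    sorter = TopologicalSorter(graph)
    try:
        sorter.prepare()
    except CycleError as exc:
        # The second argument lists the cycle; first and last node are equal.
        return exc.args[1]
    return None


# Two ports that reference each other, plus an unrelated network.
graph = {
    "port_a": {"port_b"},
    "port_b": {"port_a"},
    "net": set(),
}
cycle = find_cycle(graph)
print(" -> ".join(cycle) if cycle else "no cycle")
```

Breaking the cycle usually means removing one of the two references, for instance by passing the shared value in as a parameter instead of reading it from the other resource.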

Heat template anti-patterns this catches

  • No depends_on when one resource only reads a runtime attribute that the other publishes. Works most of the time, races sometimes.
  • OS::Heat::SoftwareDeployment without a corresponding SoftwareConfig — silent fail.
  • get_param references to a parameter not declared in the parameters: block — fails late.
  • str_replace where a multi-line replacement value itself contains one of the params: keys, which produces malformed output.
  • Overly nested stacks (3+ levels): debugging cross-stack outputs and parameters becomes painful; consider flattening.
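
The undeclared-parameter anti-pattern is easy to lint for before submitting a stack. This is a rough sketch under stated assumptions: the template is already loaded into a dict, only the `resources` and `outputs` sections are scanned, and `param_refs` and `undeclared_parameters` are hypothetical names. The pseudo-parameters `OS::stack_name`, `OS::stack_id`, and `OS::project_id` are treated as always declared.

```python
PSEUDO_PARAMETERS = {"OS::stack_name", "OS::stack_id", "OS::project_id"}


def param_refs(node):
    """Yield every parameter name used in a get_param call inside node."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "get_param":
                # get_param takes a name, or a list of [name, key, ...].
                name = value[0] if isinstance(value, list) and value else value
                if isinstance(name, str):
                    yield name
                if isinstance(value, list):
                    yield from param_refs(value[1:])
            else:
                yield from param_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from param_refs(item)


def undeclared_parameters(template):
    """Return sorted parameter names that are used but never declared."""
    declared = set(template.get("parameters") or {}) | PSEUDO_PARAMETERS
    used = set(param_refs(template.get("resources") or {}))
    used |= set(param_refs(template.get("outputs") or {}))
    return sorted(used - declared)


# Hypothetical template: "flavor" and "net_map" are used but not declared.
template = {
    "parameters": {"image": {"type": "string"}},
    "resources": {
        "server": {
            "type": "OS::Nova::Server",
            "properties": {
                "name": {"get_param": "OS::stack_name"},
                "image": {"get_param": "image"},
                "flavor": {"get_param": "flavor"},
                "networks": [{"network": {"get_param": ["net_map", "private"]}}],
            },
        }
    },
}
print(undeclared_parameters(template))
```

A check like this complements template validation rather than replacing it, since it knows nothing about nested stacks or environment files.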

When to escalate

  • Any stack stuck for >2h with no progress in the event list: restart heat-engine only after confirming with the team and checking the stack_lock table.
  • stack abandon recommendations — confirm someone owns the orphan cleanup checklist. Abandoned stacks are a slow-growing source of cloud waste.
  • DB-level interventions (deleting rows in stack, resource, or stack_lock) — almost always the wrong answer; pull in Heat / DB on-call instead.
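
The ">2h with no progress" rule can be turned into a small check that runs against exported event timestamps. The sketch below is only illustrative: `is_stalled` and `parse_time` are hypothetical helpers, timestamps are assumed to be ISO-8601 strings (UTC when no offset is given), and the two-hour threshold simply mirrors the guideline above.

```python
from datetime import datetime, timedelta, timezone


def parse_time(value):
    """Parse an ISO-8601 timestamp, treating naive values as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def is_stalled(stack_status, event_times, now, threshold=timedelta(hours=2)):
    """True if the stack is IN_PROGRESS and its newest event is too old."""
    if not stack_status.endswith("_IN_PROGRESS") or not event_times:
        # No events at all cannot be judged here; treat it as not stalled.
        return False
    latest = max(parse_time(t) for t in event_times)
    return now - latest > threshold


now = datetime(2024, 5, 1, 13, 0, tzinfo=timezone.utc)
times = ["2024-05-01T10:00:00Z", "2024-05-01T10:30:00Z"]
print(is_stalled("UPDATE_IN_PROGRESS", times, now))
```

A positive result is a reason to start the escalation conversation, not a trigger for an automatic heat-engine restart.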
