Debugging Heat Orchestration Stacks in OpenStack

Heat is OpenStack’s orchestration engine — the thing that turns a YAML template into a coordinated graph of Nova instances, Neutron networks, Cinder volumes, and everything else, created in dependency order. When it works, it’s infrastructure-as-code for your private cloud. When it doesn’t, you get a stack stuck in CREATE_FAILED with a stack trace that points at a resource three levels deep in a nested template. After years of running Heat, I’ve learned the debugging discipline that turns those traces from terrifying into routine.

Understand the resource graph before you touch anything

A Heat stack is a directed graph. Each resource declares dependencies — implicitly through get_resource/get_attr references, or explicitly with depends_on. Heat walks the graph, creating resources when their dependencies are ready. Most failures are either a resource that genuinely failed to create, or a dependency that Heat couldn’t resolve in the order you expected.
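To make the graph idea concrete, here is a minimal sketch of dependency ordering using Python's standard graphlib module. The resource names are hypothetical, and this is only a model: the real engine works on the parsed template and creates independent resources in parallel.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical stack: each key is a resource, each value is the set of
# resources it depends on (via get_resource/get_attr or depends_on).
deps = {
    "network": set(),
    "subnet": {"network"},
    "port": {"subnet"},
    "volume": set(),
    "server": {"port", "volume"},
}


def creation_order(dependencies):
    """Return one valid creation order, with dependencies first."""
    return list(TopologicalSorter(dependencies).static_order())


def has_cycle(dependencies):
    """Return True if a circular dependency makes ordering impossible."""
    try:
        creation_order(dependencies)
    except CycleError:
        return True
    return False


order = creation_order(deps)
print(order)
```

If a resource appears later in the order than you expected, a missing reference or depends_on is usually the reason; if has_cycle returns True, two resources depend on each other.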

Start by reading the stack’s state, not the template:

openstack stack list --nested
openstack stack show <stack-name>
openstack stack resource list <stack-name> --nested-depth 5

The --nested-depth flag is the single most useful thing here. A top-level stack often shows CREATE_FAILED while the real failure is in a nested stack. Walking the nesting tells you exactly which leaf resource broke.
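As a rough sketch of "walking the nesting", the function below filters a nested resource listing down to the failed resources that have no failed resources beneath them. It rests on two assumptions you should verify against your own deployment: that the rows come from the resource list command with -f json and carry resource_name, resource_status, and stack_name keys, and that nested stacks are named parent-stack, resource name, and a random suffix joined by hyphens. The sample data is made up.

```python
def leaf_failures(rows):
    """Return failed resources with no failed resources nested beneath them."""
    failed = [r for r in rows if r["resource_status"].endswith("_FAILED")]
    leaves = []
    for res in failed:
        # Assumed naming convention for the child stack of this resource.
        child_prefix = f'{res["stack_name"]}-{res["resource_name"]}-'
        has_failed_child = any(
            other["stack_name"].startswith(child_prefix) for other in failed
        )
        if not has_failed_child:
            leaves.append(res)
    return leaves


# Hypothetical rows in the assumed shape of the JSON output.
sample = [
    {"resource_name": "net", "resource_status": "CREATE_COMPLETE",
     "stack_name": "prod"},
    {"resource_name": "web_tier", "resource_status": "CREATE_FAILED",
     "stack_name": "prod"},
    {"resource_name": "0", "resource_status": "CREATE_COMPLETE",
     "stack_name": "prod-web_tier-abc123"},
    {"resource_name": "1", "resource_status": "CREATE_FAILED",
     "stack_name": "prod-web_tier-abc123"},
    {"resource_name": "server", "resource_status": "CREATE_FAILED",
     "stack_name": "prod-web_tier-abc123-1-def456"},
]

for res in leaf_failures(sample):
    print(res["stack_name"], res["resource_name"], res["resource_status"])
```

The two wrapper failures (web_tier and member 1) drop out, leaving only the server resource that actually broke.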

Step 1: Find the actual failed resource

Once you’ve listed resources, drill into the one in *_FAILED:

openstack stack resource show <stack-name> <resource-name>
openstack stack event list <stack-name> --nested-depth 5 \
  --format value | grep -i fail

The resource_status_reason field is gold — it carries the underlying error from Nova, Neutron, or wherever. “No valid host was found” means it’s a scheduler/capacity problem, not a Heat problem. “Quota exceeded” means you hit a tenant limit. “Property error” means the template is wrong. Read that reason before forming any theory.
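The triage above is simple enough to sketch as a lookup. This is a deliberately naive illustration: the category names are my own, the matching is plain substring search, and real reason strings vary by service and release.

```python
def classify_reason(reason):
    """Map a resource_status_reason string to a rough problem category."""
    text = reason.lower()
    if "no valid host" in text:
        return "capacity"   # scheduler could not place the instance
    if "quota" in text:
        return "quota"      # tenant limit reached
    if "property error" in text:
        return "template"   # the template itself is wrong
    return "unknown"        # read the full reason by hand


# Hypothetical reason strings for illustration.
print(classify_reason("ResourceInError: No valid host was found."))
print(classify_reason("Quota exceeded for instances"))
print(classify_reason("Property error: resources.server.properties: Unknown Property flavr"))
```

Anything that lands in "unknown" still needs a human read; the point is only to separate template problems from environment problems quickly.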

Step 2: The rollback trap

Whether a failed create gets rolled back depends on how the stack was launched. With the openstack client, rollback on failure is off unless you pass --enable-rollback, but deployment tooling and pipelines often turn it on, and then Heat deletes everything on failure, which destroys the evidence you needed to debug. For development and incident work, make sure rollback stays off so the broken resources stay put:

# No --enable-rollback here, so failed resources are kept for inspection:
openstack stack create -t template.yaml my-stack

Now when it fails, the half-built resources remain and you can inspect them directly. Just remember to clean them up. For an existing stack stuck mid-rollback, openstack stack show will report ROLLBACK_IN_PROGRESS or ROLLBACK_FAILED. A ROLLBACK_FAILED usually means a resource won’t delete (a volume still attached, a port still in use), and you’ll need to resolve that underlying resource first.

Step 3: Recovering a wedged stack

Stacks get stuck in *_IN_PROGRESS when the heat-engine handling them dies mid-operation. First confirm no engine is still working it:

openstack stack show <stack-name> -c stack_status
# The unit name varies by distribution (for example openstack-heat-engine):
journalctl -u heat-engine --since "30 min ago" | grep <stack-id>

If nothing is actively processing it, you can mark the broken resource unhealthy and let Heat replace it on the next update:

# Tell Heat a resource is bad so the next update rebuilds it:
openstack stack resource mark unhealthy <stack-name> <resource-name>
openstack stack update -t template.yaml <stack-name>

mark unhealthy followed by stack update is the cleanest recovery for a single broken resource, and far better than deleting the whole stack. For a truly hung update, openstack stack cancel <stack-name> interrupts the in-progress operation and, by default, rolls the stack back; add --no-rollback to cancel without rolling back.
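Before reaching for any of these, it helps to have a consistent test for "wedged". The sketch below is a heuristic of my own, not something Heat provides: a stack counts as suspect if its status ends in _IN_PROGRESS and its most recent event is older than a threshold. The 30-minute default simply mirrors the journalctl window above and is an assumption; large stacks can legitimately take longer.

```python
from datetime import datetime, timedelta, timezone


def looks_wedged(stack_status, last_event_time, now,
                 max_age=timedelta(minutes=30)):
    """Heuristic: in progress, but no activity for longer than max_age."""
    if not stack_status.endswith("_IN_PROGRESS"):
        return False
    return now - last_event_time > max_age


# Hypothetical timestamps for illustration.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(looks_wedged("UPDATE_IN_PROGRESS", now - timedelta(hours=2), now))
print(looks_wedged("UPDATE_IN_PROGRESS", now - timedelta(minutes=5), now))
```

A True result is a prompt to check the engine logs, not proof that the operation is dead.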

Step 4: Validate templates before you deploy

Many template-level failures can be caught before launch with validation and a dry run:

openstack orchestration template validate -t template.yaml
openstack stack create -t template.yaml --dry-run my-stack

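The --dry-run shows you the resources Heat would create without creating them. Combine that with a review of the parameter_defaults in your environment file and of every get_param reference, and you’ll catch the common template errors (wrong property names, undefined parameters, circular depends_on) at zero cost. Capacity and quota problems only show up at create time, so validation won’t catch those.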
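One of those checks, undefined parameters, is easy to sketch locally. The function below walks an already-parsed template (a plain dict, loaded with whatever YAML parser you use) and reports get_param names that are not declared under parameters. It is a simplified illustration, not a replacement for Heat's own validation, and the sample template is hypothetical.

```python
def referenced_params(node):
    """Yield every parameter name referenced through get_param."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "get_param":
                # get_param takes a name, or a list whose first item is the name.
                name = value[0] if isinstance(value, list) and value else value
                if isinstance(name, str):
                    yield name
            yield from referenced_params(value)
    elif isinstance(node, list):
        for item in node:
            yield from referenced_params(item)


def undefined_params(template):
    """Return get_param names that the template never declares."""
    declared = set(template.get("parameters") or {})
    # Pseudo parameters that HOT defines for every stack.
    declared |= {"OS::stack_name", "OS::stack_id", "OS::project_id"}
    used = set(referenced_params(template.get("resources") or {}))
    used |= set(referenced_params(template.get("outputs") or {}))
    return sorted(used - declared)


# Hypothetical parsed template with one undeclared parameter.
template = {
    "heat_template_version": "2018-08-31",
    "parameters": {"flavor": {"type": "string"}},
    "resources": {
        "server": {
            "type": "OS::Nova::Server",
            "properties": {
                "flavor": {"get_param": "flavor"},
                "image": {"get_param": "image_id"},
                "name": {"get_param": "OS::stack_name"},
            },
        }
    },
}

print(undefined_params(template))
```

Here image_id is referenced but never declared, which is exactly the kind of slip that otherwise surfaces as a failed create.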

Step 5: Tame nested-template complexity

Big environments end up with deeply nested templates and OS::Heat::ResourceGroup loops. Two rules keep them debuggable: keep nesting shallow (two or three levels max; anything deeper is a sign you should refactor), and use parameters and outputs deliberately, so parent stacks pass clean values down through parameters and read results back through outputs instead of reaching into child internals. When a ResourceGroup of 20 instances fails, --nested-depth tells you it was instance 14 that hit a capacity wall, not all 20.
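A small sketch of that last point: given the nested resource listing as parsed JSON rows, pull out which members of one group failed. It assumes the rows carry resource_name, resource_status, and stack_name keys and that group members are named by their numeric index; the stack name and data below are hypothetical.

```python
def failed_members(rows, group_stack_name):
    """Return the indexes of failed members in one ResourceGroup's stack."""
    return sorted(
        int(r["resource_name"])
        for r in rows
        if r["stack_name"] == group_stack_name
        and r["resource_status"].endswith("_FAILED")
        and r["resource_name"].isdigit()
    )


# Hypothetical group of 20 members where only member 14 failed.
rows = [
    {
        "resource_name": str(i),
        "resource_status": "CREATE_FAILED" if i == 14 else "CREATE_COMPLETE",
        "stack_name": "prod-web_tier-abc123",
    }
    for i in range(20)
]

print(failed_members(rows, "prod-web_tier-abc123"))
```

One failed index out of 20 points at a per-instance problem such as capacity; all 20 failing points back at the template or its parameters.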

Where AI earns its keep

Heat stack traces are verbose and the real cause is often buried under generic “Resource CREATE failed” wrappers. I’ll paste the stack event list output and the failing resource show into a model and ask it to identify the root failed resource, translate the resource_status_reason into plain language, and tell me whether the fix is in the template or the environment. It’s reliably good at spotting a circular dependency or a typo’d property name that I’d skim past.

Keep a reusable Heat debugging prompt handy, and pair it with the rest of our OpenStack guides — because a failed Heat resource is usually really a Nova, Neutron, or quota problem wearing an orchestration costume. The model reads the trace; you run the recovery commands after you’ve understood them.
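A reusable prompt can be as simple as a template with two slots. The wording below is only a starting point of my own, not a tested or recommended prompt; adjust it to your environment and strip anything sensitive from the pasted output first.

```python
PROMPT_TEMPLATE = """You are helping debug an OpenStack Heat stack.

1. Identify the root failed resource (the deepest one, not a wrapper).
2. Translate its resource_status_reason into plain language.
3. Say whether the fix belongs in the template or the environment.

--- stack event list output ---
{events}

--- failing resource show output ---
{resource}
"""


def build_debug_prompt(events, resource):
    """Fill the template with pasted CLI output."""
    return PROMPT_TEMPLATE.format(events=events.strip(),
                                  resource=resource.strip())


# Hypothetical pasted output for illustration.
prompt = build_debug_prompt(
    "  server CREATE_FAILED \n",
    "resource_status_reason: No valid host was found",
)
print(prompt)
```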

Generated commands and templates are assistive, not authoritative. Always validate against your own deployment before applying anything in production.