Heat Software Deployment & Config Agent Debug Prompt

Diagnose stuck Heat SoftwareDeployment resources — os-collect-config, os-refresh-config, and os-apply-config agents, signaling, and the in-instance config hooks that silently hang stack creation.

Target user: Operators debugging Heat in-instance software configuration
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack orchestration engineer who has debugged hundreds of Heat stacks that hang on in-instance software configuration.

I will provide:
- The HOT template (SoftwareConfig + SoftwareDeployment / SoftwareDeploymentGroup resources)
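- `openstack stack resource list` and `openstack stack resource show` output for the stuck resource, including its state (CREATE_IN_PROGRESS or CREATE_FAILED)
- The signaling transport in use (`signal_transport`: CFN_SIGNAL, HEAT_SIGNAL, ZAQAR_SIGNAL, TEMP_URL_SIGNAL, or NO_SIGNAL)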
- Instance access (os-collect-config logs, journald) if available
- Symptoms (stack stuck IN_PROGRESS until timeout, deployment never signals, wrong outputs)

Your job:

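1. **Signal-path walkthrough** — trace how a SoftwareDeployment flows: Heat publishes the config in the server's metadata → os-collect-config on the instance polls that metadata and, when it changes, runs os-refresh-config → os-refresh-config runs its scripts in order, including os-apply-config (renders templated files) and the heat-config step that hands each deployment to its hook → the hook's result is **signaled back** to Heat. Identify where my stack stalls.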

2. **Agent health** — give checks that `os-collect-config`, `os-refresh-config`, `os-apply-config`, and the hook matching the config `group` (script/ansible/puppet) are installed in the image. Confirm the agents are running and polling the correct metadata URL.

3. **Transport correctness** — verify the `signal_transport` (CFN_SIGNAL, which is the default, or HEAT_SIGNAL/ZAQAR_SIGNAL/TEMP_URL_SIGNAL; NO_SIGNAL completes without waiting for any signal) matches what the image and Keystone support; cover credential/endpoint reachability from the instance.

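4. **Input/output wiring** — validate `input_values` and `outputs`, and that the deployment script writes each declared output to the file `$heat_outputs_path.<output_name>` (the script hook reads outputs from these files, not from a file descriptor) so Heat captures them.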

5. **Image readiness** — confirm the image actually bundles the heat agents (diskimage-builder elements); a stock cloud image never signals, so the deployment sits IN_PROGRESS until the stack times out.

6. **Timeouts & ordering** — check `timeout`, `depends_on`, and SoftwareDeploymentGroup rolling behavior that masks one failing node.

7. **Recovery** — how to read the failure reason, re-run a single deployment, and safely retry vs roll back the stack.
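
Reference for item 4: the following is a minimal sketch of a `group: script` deployment body, assuming the stock script hook, which passes declared inputs as environment variables and exports `heat_outputs_path`. The input name `greeting` and the output name `result` are hypothetical and would have to match the `inputs` and `outputs` declared on the SoftwareConfig.

```python
#!/usr/bin/env python3
"""Sketch of a script-hook deployment body that publishes one output."""
import os
import socket
import sys


def write_output(name: str, value: str) -> str:
    """Write one named output to the file the script hook reads back."""
    # The hook exports heat_outputs_path; each output lives in "<path>.<name>".
    base = os.environ["heat_outputs_path"]
    path = f"{base}.{name}"
    with open(path, "w") as fh:
        fh.write(value)
    return path


def main() -> int:
    # Declared inputs arrive as environment variables.
    # "greeting" and "result" are hypothetical names for this example.
    greeting = os.environ.get("greeting", "hello")
    write_output("result", f"{greeting} from {socket.gethostname()}")
    # The exit status is reported back to Heat as deploy_status_code.
    return 0


if __name__ == "__main__":
    if "heat_outputs_path" in os.environ:
        sys.exit(main())
    # Not launched by the hook: say so instead of failing with a KeyError.
    print("heat_outputs_path is not set; not running under the script hook",
          file=sys.stderr)
```

If the output file is missing or named differently from the declared output, the deployment can still complete while the attribute comes back empty, which matches the "wrong outputs" symptom.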

Output as: (a) ranked root-cause hypotheses with the confirming command/log line for each, (b) in-instance agent debug cheat sheet, (c) corrected template snippet, (d) image-build checklist, (e) safe retry/rollback steps.

Bias toward: distinguishing "the signal never reached Heat" from "the script ran and failed", and never blindly bumping timeouts to hide a broken agent.
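
The distinction the prompt asks for (signal never arrived versus script failed) can be sketched as a small triage helper. This is an illustration only: it assumes the deployment record has been parsed from something like `openstack software deployment show --long -f json`, and that it carries `status` and `output_values` fields with `deploy_status_code` inside the latter. The function name and the returned labels are hypothetical.

```python
from typing import Any, Mapping


def classify_deployment(dep: Mapping[str, Any]) -> str:
    """Return a coarse label for a parsed software deployment record."""
    status = dep.get("status", "")
    outputs = dep.get("output_values") or {}
    code = outputs.get("deploy_status_code")

    # Still IN_PROGRESS with no output values: nothing was ever signaled,
    # so look at the agents, the image, and the transport first.
    if status == "IN_PROGRESS" and not outputs:
        return "no-signal"
    # The hook signaled a non-zero exit code: the script ran and failed.
    if status == "FAILED" and code not in (None, 0, "0"):
        return "script-failed"
    # FAILED without a non-zero exit code: read status_reason for details.
    if status == "FAILED":
        return "failed-other"
    if status == "COMPLETE":
        return "complete"
    return "unknown"
```

A "no-signal" result points at items 2, 3, and 5 of the prompt; a "script-failed" result points at the script itself, where `deploy_stderr` is usually the next thing to read.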
