Ironic Bare Metal Deployment Debug Prompt
Diagnose Ironic deploys — node stuck in cleaning, deploy fails, PXE/iPXE issues, BMC unreachable, RAID config not applied.
- Target user
- OpenStack engineers running Ironic for bare-metal provisioning
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a senior OpenStack engineer who has run Ironic — bare-metal provisioning at scale. You know that "deploy fails" can be PXE, iPXE, IPMI, neutron, image, or BMC firmware. I will provide: - The symptom (node stuck cleaning, deploy fails, BMC unreachable, RAID/BIOS not applied) - `openstack baremetal node show <id>` and `node validate` - Ironic logs (conductor, API, inspector) - The deploy interface (direct, agent, ansible) - Hardware vendor Your job: 1. **Verify node state machine**: - `enroll → manageable → available → active` - `cleaning` happens between active → available (drive wipe) - **Stuck cleaning** = drive wipe failed or BMC not responding 2. **For BMC connectivity**: - `node validate` checks BMC reachability - Test directly: `ipmitool -I lanplus -H <bmc-ip> -U user -P pass power status` - Common: wrong cipher, encrypted password, wrong BMC interface 3. **For PXE / iPXE**: - Bare metal node PXE-boots from neutron DHCP - iPXE chain to TFTP server - Common: DHCP option 67 (filename) wrong, TFTP server unreachable 4. **For deploy image**: - Ironic Python Agent (IPA) ramdisk runs on the node during deploy - Built with diskimage-builder - Must match the hardware (drivers, firmware) 5. **For RAID config**: - Set `raid_config` on node - `manage` state required to apply - Vendor-specific driver actions 6. **For BIOS / firmware updates**: - Built-in clean steps - Vendor driver support (Dell, HPE, Cisco UCS) - Failed firmware update can brick BMC — don't enable without testing 7. **For inspection**: - `inspect` action discovers hardware - Populates node properties (CPU, RAM, NICs) - Uses IPA ramdisk Mark DESTRUCTIVE: `node delete` while in active state, force BIOS update on production firmware, RAID config changes on existing data drives. --- Symptom: [DESCRIBE] Node state: ``` [PASTE `openstack baremetal node show <id>`] ``` `node validate` output: ``` [PASTE] ``` Deploy interface: [direct / agent / ansible] Hardware vendor: [Dell / HPE / Supermicro / etc.] Ironic conductor logs: ``` [PASTE] ```
Why this prompt works
Ironic spans many physical layers. This prompt walks them.
How to use it
- State the deploy interface and vendor.
- Test BMC directly with ipmitool.
- For PXE issues, check TFTP/DHCP separately.
- For RAID/BIOS, vendor docs essential.
Useful commands
# Node management
openstack baremetal node list
openstack baremetal node show <node-id>
openstack baremetal node validate <node-id>
openstack baremetal node set <node-id> --target-provision-state manage
openstack baremetal node manage <node-id>
openstack baremetal node provide <node-id>
openstack baremetal node clean <node-id> --clean-steps '...'
# BMC direct test
ipmitool -I lanplus -H <bmc-ip> -U admin -P pass power status
ipmitool -I lanplus -H <bmc-ip> -U admin -P pass sensor list
# Deploy
openstack server create --flavor <baremetal-flavor> --image <image> --network <net> myvm
# Logs
sudo journalctl -u ironic-conductor -n 200 --no-pager
sudo journalctl -u ironic-api -n 100 --no-pager
sudo journalctl -u ironic-inspector -n 100 --no-pager
# IPA agent logs (during deploy/clean — on the node, captured to conductor)
# Look in ironic conductor logs for agent messages
# PXE / TFTP
sudo systemctl status xinetd # or tftpd
sudo tail /var/log/syslog | grep tftp
# Test PXE boot manually
# Reboot node; watch console for PXE messages
Common findings this catches
- BMC unreachable → wrong creds, wrong cipher, network issue.
- Cleaning stuck > 1h → drive failure during wipe; or IPA crashed.
- Deploy fails at PXE → DHCP/TFTP issue; verify with
tcpdump. - RAID not applied → driver doesn’t support; or node wrong state.
- Inspection misses NICs → IPA missing drivers for that hardware.
node validateshows interfaces=False → BMC test fails; check ipmitool.- Deploy succeeds but node not in Nova → conductor failed to notify Nova.
When to escalate
- Vendor firmware bugs — vendor support.
- Mass cleaning failures across same hardware model — IPA needs rebuild.
- Cluster-wide BMC creds rotation — coordinated change.
Related prompts
-
Linux Boot Failure & Rescue Prompt
Recover an unbootable Linux server — GRUB failures, broken initramfs, fstab errors, missing root, kernel panics — with a deliberate rescue sequence.
-
OpenStack Request-ID Log Trace Prompt
Correlate a single API request across services (nova-api → conductor → scheduler → compute → neutron → cinder) using OpenStack request IDs.
-
OpenStack VM Troubleshooting Prompt
Diagnose Nova VM boot failures, networking issues, and stuck instances using nova/openstack CLI output.