OpenStack Error Guide: 'Node stuck in clean failed' Ironic provisioning failure
Ironic bare-metal node stuck in clean failed or clean wait? Diagnose failed cleaning steps, ramdisk boot issues, and maintenance recovery step by step.
- #openstack
- #troubleshooting
- #errors
- #ironic
Exact Error Message
$ openstack baremetal node show node-07 -c provision_state -c last_error
+-----------------+--------------------------------------------------------------+
| Field           | Value                                                        |
+-----------------+--------------------------------------------------------------+
| provision_state | clean failed                                                 |
| last_error      | Failed to clean node node-07: Timeout reached while cleaning |
|                 | the node. Please check if the ramdisk responded.             |
+-----------------+--------------------------------------------------------------+
In the ironic-conductor log:
2026-06-27 20:18:33.451 14 ERROR ironic.conductor.utils [req-5c1a9d44-7b10-4e02-9a31-8c2b1d3e4400 - - - - -] Node a3f1... moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
2026-06-27 20:18:33.451 14 ERROR ironic.drivers.modules.agent_base Timeout reached while cleaning the node. Cleaning step erase_devices failed.
What the Error Means
When a bare-metal node is unprovisioned (deleted or moved toward available), Ironic runs cleaning — booting the node into the IPA (Ironic Python Agent) ramdisk and executing cleaning steps such as erase_devices to wipe disks. clean failed means a cleaning step did not complete: typically the node never booted the ramdisk, the agent never called back, or a step like disk erase timed out or errored.
A node in clean failed is held out of the available pool — it cannot be scheduled until cleaning succeeds or an operator intervenes. The error is almost always in the deploy/cleaning path infrastructure: the node could not PXE/iPXE boot the IPA ramdisk, networking to the agent failed, the agent could not reach the conductor’s callback URL, or a long-running step (full disk erase on large drives) exceeded the cleaning timeout. The disks may be partially wiped, so recovery must be deliberate.
Common Causes
- Ramdisk failed to boot: PXE/iPXE could not load the IPA kernel/ramdisk (DHCP, TFTP/HTTP, or boot-order issue).
- Agent could not call back: the IPA ramdisk booted but cannot reach the conductor’s callback URL (provisioning network/firewall).
- Cleaning step timeout: erase_devices on large disks exceeded the configured cleaning timeout.
- BMC / power management flaky: IPMI/Redfish could not power-cycle the node reliably.
- Wrong or missing deploy images: the IPA kernel/ramdisk references are missing or incorrect in node driver_info.
- Hardware fault: a failing disk causes the erase step to error out.
- Provisioning network misconfigured: the node booted but onto the wrong VLAN/network with no route to the agent endpoint.
How to Reproduce the Error
Force a callback/boot failure during cleaning:
- Enroll a node and move it toward available so automated cleaning triggers (or run manual cleaning).
- Block the node’s path to the conductor callback URL (or point driver_info at a missing ramdisk).
- Trigger cleaning.
The node enters clean wait, the agent never reports back (or never boots), the cleaning timeout fires, and the node lands in clean failed. A disk-erase timeout on a very large drive reproduces the step-timeout variant.
Diagnostic Commands
Read-only. Establish where in the cleaning flow the node stalled.
# Node state, error, and power
openstack baremetal node show node-07 -c provision_state -c last_error -c power_state -c maintenance
openstack baremetal node show node-07 -c driver_info -c instance_info
# Is the node powered and reachable via its BMC?
openstack baremetal node show node-07 -c power_state -c provision_state
+-----------------+--------------+
| Field           | Value        |
+-----------------+--------------+
| power_state     | power on     |
| provision_state | clean failed |
+-----------------+--------------+
Read the conductor logs and check the provisioning ports:
# Kolla-Ansible
docker logs ironic_conductor 2>&1 | grep -i "clean failed\|clean wait\|Timeout reached" | tail
docker logs ironic_conductor 2>&1 | grep -i "$(openstack baremetal node show node-07 -f value -c uuid)" | tail
# Traditional packages
journalctl -u openstack-ironic-conductor | grep -i "clean failed\|Timeout reached" | tail
# Provisioning network port state
openstack baremetal port list --node node-07 --fields address pxe_enabled
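Checking driver_info by eye is easy to get wrong on a long field. The helper below is a minimal sketch, assuming the node's driver_info is exported with -f json and that the deploy images are referenced by the deploy_kernel and deploy_ramdisk keys. The function name missing_deploy_images is hypothetical. It only checks that the keys are present, not that the images they point to exist, and some deployments set these images globally in the conductor configuration instead, in which case a missing key is not necessarily a fault.

```shell
# Hypothetical helper: reads node JSON on stdin and prints every expected
# deploy-image key that does not appear in it. Prints nothing if both exist.
missing_deploy_images() {
    input=$(cat)
    for key in deploy_kernel deploy_ramdisk; do
        # In JSON output a key appears as a quoted name followed by a colon.
        if ! printf '%s' "$input" | grep -q "\"$key\"[[:space:]]*:"; then
            echo "$key"
        fi
    done
}
```

Possible usage: openstack baremetal node show node-07 -c driver_info -f json | missing_deploy_images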
Step-by-Step Resolution
- Read last_error to classify the failure. “Timeout reached … check if the ramdisk responded” means a boot or callback problem; a named step like erase_devices failed means the step itself errored (often hardware or timeout).
- Confirm the node can PXE/iPXE the IPA ramdisk. Check that the deploy kernel/ramdisk in driver_info exist and that DHCP/TFTP/HTTP on the provisioning network are serving them. A node that never boots IPA will always time out.
- Verify agent-to-conductor reachability. The IPA ramdisk must reach the conductor’s callback URL over the provisioning network. Confirm the node’s PXE port is on the right network and no firewall blocks the callback endpoint.
- Raise the cleaning timeout for large disks if erase_devices is timing out on multi-terabyte drives (the clean_callback_timeout option in the [conductor] section of ironic.conf), then retry. A full secure-erase on big disks legitimately takes a long time.
- Clear the failure and retry cleaning. Put the node into maintenance to investigate if needed. Cleaning cannot be started directly from clean failed, so once the cause is fixed, clear maintenance, move the node to manageable, and then re-run cleaning:
  openstack baremetal node maintenance set node-07 --reason "investigating clean failure"
  # After the fix: clear maintenance and move clean failed -> manageable
  openstack baremetal node maintenance unset node-07
  openstack baremetal node manage node-07
  # Optional manual clean of the failed step (node returns to manageable)
  openstack baremetal node clean node-07 --clean-steps '[{"interface":"deploy","step":"erase_devices"}]'
  # Automated cleaning runs on the way to available
  openstack baremetal node provide node-07
- Verify the node reaches available:
  openstack baremetal node show node-07 -c provision_state
  A clean run moves it through cleaning → clean wait → available.
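When the failure is a disk-erase timeout, it helps to size the new timeout from the disk rather than guess. The function below is a rough sketch, assuming a full overwrite at a sustained throughput you have measured on your own hardware. The function name estimate_clean_timeout, the default 50% safety margin, and any throughput figure you feed it are assumptions, and hardware-assisted secure erase can take a very different amount of time. The result is a candidate value in seconds for clean_callback_timeout in the [conductor] section of ironic.conf.

```shell
# Hypothetical helper: estimate a cleaning timeout in seconds.
#   $1 = largest disk size in GB (decimal, 1 GB = 1000 MB)
#   $2 = sustained write throughput in MB/s (measure this; do not guess)
#   $3 = safety margin in percent (optional, defaults to 50)
estimate_clean_timeout() {
    size_gb=$1
    mbps=$2
    margin_pct=${3:-50}
    if [ "$mbps" -le 0 ]; then
        echo "throughput must be positive" >&2
        return 1
    fi
    # seconds = size in MB / throughput, scaled up by the safety margin
    echo $(( size_gb * 1000 * (100 + margin_pct) / (mbps * 100) ))
}
```

For example, estimate_clean_timeout 8000 150 50 models an 8 TB drive written at an illustrative 150 MB/s with a 50% margin and prints 80000, a little over 22 hours.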
Prevention and Best Practices
- Validate the provisioning network end to end (DHCP, TFTP/HTTP, callback URL reachability) before enrolling nodes, since most clean failures are boot/callback issues.
- Set cleaning timeouts appropriate to your largest disks so full erase_devices runs do not falsely time out.
- Keep deploy kernel/ramdisk images current and confirm each node’s driver_info references valid, present images.
- Verify BMC power control (IPMI/Redfish) works reliably per node before relying on automated cleaning, which depends on clean power cycles.
- Monitor openstack baremetal node list for nodes in clean failed / clean wait and alert, so stuck nodes are recovered before the available pool shrinks.
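The monitoring point above can be sketched as a small filter over the node list. This is an illustrative helper, not part of Ironic: the name stuck_clean_nodes is hypothetical, and it assumes input in the form produced by -f value with the Name and Provisioning State columns, that is, one node per line with the name first and the state after it.

```shell
# Hypothetical helper: reads "<name> <provisioning state>" lines on stdin and
# prints only the nodes whose state is "clean failed" or "clean wait".
stuck_clean_nodes() {
    awk '{
        name = $1
        # The state may contain a space, so rebuild it from the remaining fields.
        state = $2
        for (i = 3; i <= NF; i++) state = state " " $i
        if (state == "clean failed" || state == "clean wait") print name ": " state
    }'
}
```

Possible usage: openstack baremetal node list -f value -c Name -c "Provisioning State" | stuck_clean_nodes

A node in clean wait is normal while cleaning is in progress, so an alert built on this should fire for clean wait only when the node stays there longer than your cleaning timeout.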
Related Errors
- deploy failed: the deploy (not clean) phase of the same agent flow failed; same boot/callback root causes apply.
- clean wait that never advances: the node is mid-cleaning and the agent has not yet reported; becomes clean failed at timeout.
- Timeout reached while waiting for callback from deploy ramdisk: the deploy-side analogue, covered in our /categories/openstack/ bare-metal guides.
- No valid host was found for a bare-metal flavor: a scheduling failure once cleaning has stranded nodes out of the pool.
Frequently Asked Questions
Is the disk fully wiped when cleaning fails?
Possibly partially. If erase_devices started before failing, the disks may be in an indeterminate state. Treat the node’s data as gone and re-run cleaning to a clean completion before reuse.
Why is the node held out of the available pool?
Ironic will not offer an uncleaned node for new deployments — that would risk leaking a previous tenant’s data. The node stays in clean failed until cleaning succeeds or an operator explicitly recovers it.
How do I know if it is a boot problem or a step problem?
Read last_error. “Check if the ramdisk responded” points to boot/callback. A named step (erase_devices failed) points to that step timing out or erroring, often on hardware.
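That rule of thumb can be written down as a small classifier. This is a sketch only: the function name classify_last_error and its output labels are hypothetical, and the patterns are based solely on the two message shapes quoted in this guide, so other Ironic releases or drivers may word their errors differently.

```shell
# Hypothetical helper: classify a last_error string passed as the first argument.
# A named cleaning step is checked first because it is the more specific signal.
classify_last_error() {
    case "$1" in
        *"Cleaning step "*" failed"*) echo "step-failure" ;;
        *"ramdisk responded"*)        echo "boot-or-callback" ;;
        *)                            echo "unknown" ;;
    esac
}
```

Possible usage: classify_last_error "$(openstack baremetal node show node-07 -f value -c last_error)"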
Can I skip cleaning to recover faster?
You can configure cleaning behavior, but skipping disk erase has security implications between tenants. Prefer fixing the boot/network/timeout cause and letting cleaning complete.
Does Kolla-Ansible change the recovery steps?
No. The Ironic workflow is identical; only log access differs — docker logs ironic_conductor versus journalctl -u openstack-ironic-conductor.