AI for OpenStack · By James Joyner IV · 10 min read

OpenStack Error Guide: 'Node stuck in clean failed' Ironic provisioning failure

Ironic bare-metal node stuck in clean failed or clean wait? Diagnose failed cleaning steps, ramdisk boot issues, and maintenance recovery step by step.

  • #openstack
  • #troubleshooting
  • #errors
  • #ironic

Exact Error Message

$ openstack baremetal node show node-07 -c provision_state -c last_error
+-----------------+--------------------------------------------------------------+
| Field           | Value                                                        |
+-----------------+--------------------------------------------------------------+
| provision_state | clean failed                                                 |
| last_error      | Failed to clean node node-07: Timeout reached while cleaning |
|                 | the node. Please check if the ramdisk responded.             |
+-----------------+--------------------------------------------------------------+

In the ironic-conductor log:

2026-06-27 20:18:33.451 14 ERROR ironic.conductor.utils
[req-5c1a9d44-7b10-4e02-9a31-8c2b1d3e4400 - - - - -] Node a3f1... moved to provision state
"clean failed" from state "clean wait"; target provision state is "available"
2026-06-27 20:18:33.451 14 ERROR ironic.drivers.modules.agent_base
Timeout reached while cleaning the node. Cleaning step erase_devices failed.

What the Error Means

When a bare-metal node is unprovisioned (deleted or moved toward available), Ironic runs cleaning — booting the node into the IPA (Ironic Python Agent) ramdisk and executing cleaning steps such as erase_devices to wipe disks. clean failed means a cleaning step did not complete: typically the node never booted the ramdisk, the agent never called back, or a step like disk erase timed out or errored.

A node in clean failed is held out of the available pool and cannot be scheduled until cleaning succeeds or an operator intervenes. The cause almost always lies in the infrastructure that cleaning depends on: the node could not PXE/iPXE boot the IPA ramdisk, networking to the agent failed, the agent could not reach the conductor’s callback URL, or a long-running step (full disk erase on large drives) exceeded the cleaning timeout. The disks may be partially wiped, so recovery must be deliberate.

Common Causes

  • Ramdisk failed to boot — PXE/iPXE could not load the IPA kernel/ramdisk (DHCP, TFTP/HTTP, or boot-order issue).
  • Agent could not call back — the IPA ramdisk booted but cannot reach the conductor’s callback URL (provisioning network/firewall).
  • Cleaning step timeout: erase_devices on large disks exceeded the configured cleaning timeout.
  • BMC / power management flaky — IPMI/Redfish could not power-cycle the node reliably.
  • Wrong or missing deploy images — the IPA kernel/ramdisk references are missing or incorrect in node driver_info.
  • Hardware fault — a failing disk causes the erase step to error out.
  • Provisioning network misconfigured — the node booted but onto the wrong VLAN/network with no route to the agent endpoint.

How to Reproduce the Error

Force a callback/boot failure during cleaning:

  1. Enroll a node and move it toward available so automated cleaning triggers (or run manual cleaning).
  2. Block the node’s path to the conductor callback URL (or point driver_info at a missing ramdisk).
  3. Trigger cleaning.

The node enters clean wait, the agent never reports back (or never boots), the cleaning timeout fires, and the node lands in clean failed. A disk-erase timeout on a very large drive reproduces the step-timeout variant.
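
How long the reproduction takes depends on the configured cleaning timeout. The following is a minimal sketch for reading it, assuming the relevant option is clean_callback_timeout in the [conductor] section and that the config file is /etc/ironic/ironic.conf (both are assumptions to verify for your deployment; Kolla-Ansible keeps the file inside the ironic_conductor container). ini_get is a hypothetical helper written for this example, not an Ironic tool.

```shell
# ini_get FILE SECTION KEY
# Hypothetical helper: prints the value of KEY inside [SECTION] of an
# INI-style file, or nothing if the key is not set explicitly.
ini_get() {
  awk -v section="$2" -v key="$3" '
    /^[ \t]*[#;]/ { next }                  # skip comment lines
    /^[ \t]*\[/ {                           # section header such as [conductor]
      gsub(/^[ \t]*\[|\][ \t]*$/, "")
      in_section = ($0 == section)
      next
    }
    in_section {
      split_at = index($0, "=")
      if (split_at == 0) next
      k = substr($0, 1, split_at - 1)
      v = substr($0, split_at + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)        # trim whitespace around the key
      gsub(/^[ \t]+|[ \t]+$/, "", v)        # trim whitespace around the value
      if (k == key) { print v; exit }
    }
  ' "$1"
}

# Usage (path and option name are assumptions, see above):
#   ini_get /etc/ironic/ironic.conf conductor clean_callback_timeout
```

An empty result only means the option is not set explicitly in that file, in which case Ironic falls back to its built-in default.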

Diagnostic Commands

Read-only. Establish where in the cleaning flow the node stalled.

# Node state, error, and power
openstack baremetal node show node-07 -c provision_state -c last_error -c power_state -c maintenance
openstack baremetal node show node-07 -c driver_info -c instance_info

# Is the node powered on (as last reported by its BMC), and what state does Ironic report?
openstack baremetal node show node-07 -c power_state -c provision_state
+-----------------+--------------+
| Field           | Value        |
+-----------------+--------------+
| power_state     | power on     |
| provision_state | clean failed |
+-----------------+--------------+

Read the conductor logs and check the provisioning ports:

# Kolla-Ansible
docker logs ironic_conductor 2>&1 | grep -i "clean failed\|clean wait\|Timeout reached" | tail
docker logs ironic_conductor 2>&1 | grep -i "$(openstack baremetal node show node-07 -f value -c uuid)" | tail
# Traditional packages
journalctl -u openstack-ironic-conductor | grep -i "clean failed\|Timeout reached" | tail

# Provisioning network port state
openstack baremetal port list --node node-07 --fields address pxe_enabled

Step-by-Step Resolution

  1. Read last_error to classify the failure. “Timeout reached … check if the ramdisk responded” means a boot or callback problem; a named step like erase_devices failed means the step itself errored (often hardware or timeout).

  2. Confirm the node can PXE/iPXE the IPA ramdisk. Check that the deploy kernel/ramdisk in driver_info exist and that DHCP/TFTP/HTTP on the provisioning network are serving them. A node that never boots IPA will always time out.

  3. Verify agent-to-conductor reachability. The IPA ramdisk must reach the conductor’s callback URL over the provisioning network. Confirm the node’s PXE port is on the right network and no firewall blocks the callback endpoint.

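  4. Raise the cleaning timeout for large disks if erase_devices is timing out on multi-terabyte drives, then retry. The callback timeout is set by clean_callback_timeout in the [conductor] section of ironic.conf, and a full secure erase on big disks legitimately takes a long time.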

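  5. Clear the failure and retry cleaning. Ironic may have put the node into maintenance when the clean step failed. Once the cause is fixed, clear that flag if it is set, move the node from clean failed to manageable, and trigger automated cleaning again with provide:

    # Clear maintenance after fixing the root cause
    openstack baremetal node maintenance unset node-07
    # clean failed -> manageable
    openstack baremetal node manage node-07
    # Optional: re-run a single step while the node is manageable
    # openstack baremetal node clean node-07 --clean-steps '[{"interface":"deploy","step":"erase_devices"}]'
    # manageable -> cleaning -> clean wait -> available (runs automated cleaning)
    openstack baremetal node provide node-07

  6. Verify the node reaches available: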

    openstack baremetal node show node-07 -c provision_state

    A clean run moves it through cleaning → clean wait → available.
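
The verification in step 6 can be automated with a small polling loop. This is a sketch under assumptions: wait_for_available is a hypothetical helper name, the 30-second interval and 120 attempts are arbitrary defaults, and the only CLI call it relies on is openstack baremetal node show with -f value -c provision_state.

```shell
# wait_for_available NODE [INTERVAL_SECONDS] [MAX_TRIES]
# Hypothetical helper: polls provision_state until the node is available,
# has failed cleaning again, or the number of tries is exhausted.
# Return codes: 0 = available, 1 = clean failed, 2 = gave up waiting.
wait_for_available() {
  local node="$1"
  local interval="${2:-30}"
  local tries="${3:-120}"
  local state=""
  local i=0
  while [ "$i" -lt "$tries" ]; do
    state="$(openstack baremetal node show "$node" -f value -c provision_state)"
    case "$state" in
      available)      echo "available";    return 0 ;;
      "clean failed") echo "clean failed"; return 1 ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timeout (last state: $state)"
  return 2
}

# Usage:
#   wait_for_available node-07 30 120 || openstack baremetal node show node-07 -c last_error
```

Returning a distinct code for a repeated clean failed lets a wrapper script print last_error immediately instead of waiting out the full polling window.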

Prevention and Best Practices

  • Validate the provisioning network end to end (DHCP, TFTP/HTTP, callback URL reachability) before enrolling nodes, since most clean failures are boot/callback issues.
  • Set cleaning timeouts appropriate to your largest disks so full erase_devices runs do not falsely time out.
  • Keep deploy kernel/ramdisk images current and confirm each node’s driver_info references valid, present images.
  • Verify BMC power control (IPMI/Redfish) works reliably per node before relying on automated cleaning, which depends on clean power cycles.
  • Monitor openstack baremetal node list for nodes in clean failed/clean wait and alert, so stuck nodes are recovered before the available pool shrinks.

Related Errors

  • deploy failed: the deploy (not clean) phase of the same agent flow failed; the same boot/callback root causes apply.
  • clean wait that never advances: the node is mid-cleaning and the agent has not yet reported; it becomes clean failed at timeout.
  • Timeout reached while waiting for callback from deploy ramdisk: the deploy-side analogue, covered in our bare-metal guides under /categories/openstack/.
  • No valid host was found for a bare-metal flavor: a scheduling failure once cleaning has stranded nodes out of the pool.
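
To catch nodes sitting in clean failed or clean wait before they cause scheduling failures, the node list can be filtered on a schedule. A minimal sketch, assuming the list is produced with -f value -c Name -c "Provisioning State" so each line is a node name followed by its state; stuck_nodes is a hypothetical filter name, and wiring its output into an alerting system is left out.

```shell
# stuck_nodes
# Hypothetical filter: reads "NAME STATE..." lines on stdin (the state may
# contain a space, e.g. "clean failed") and prints only the stuck nodes.
stuck_nodes() {
  awk '{
    name = $1
    $1 = ""                 # drop the name; the rest of the line is the state
    sub(/^ /, "")           # remove the separator left behind
    if ($0 == "clean failed" || $0 == "clean wait") print name ": " $0
  }'
}

# Usage against a live cloud (requires the openstack CLI and credentials):
#   openstack baremetal node list -f value -c Name -c "Provisioning State" | stuck_nodes
```

Nodes in clean wait are reported too, since a node that stays there across several runs is usually on its way to clean failed.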

Frequently Asked Questions

Is the disk fully wiped when cleaning fails? Possibly partially. If erase_devices started before failing, the disks may be in an indeterminate state. Treat the node’s data as gone and re-run cleaning to a clean completion before reuse.

Why is the node held out of the available pool? Ironic will not offer an uncleaned node for new deployments — that would risk leaking a previous tenant’s data. The node stays in clean failed until cleaning succeeds or an operator explicitly recovers it.

How do I know if it is a boot problem or a step problem? Read last_error. “Check if the ramdisk responded” points to boot/callback. A named step (erase_devices failed) points to that step timing out or erroring, often on hardware.
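
That triage rule can be sketched as a small shell function. It is a rough heuristic under assumptions: classify_clean_error is a hypothetical helper, and the patterns are taken only from the two message shapes quoted in this guide, so other Ironic error strings fall through to unknown.

```shell
# classify_clean_error "LAST_ERROR TEXT"
# Hypothetical helper: prints "step", "boot-or-callback", or "unknown".
# A named step is checked first, because a step failure message can also
# contain the generic "Timeout reached" wording.
classify_clean_error() {
  local err="$1"
  case "$err" in
    *"step "*" failed"*)                       echo "step" ;;
    *"ramdisk responded"*|*"Timeout reached"*) echo "boot-or-callback" ;;
    *)                                         echo "unknown" ;;
  esac
}

# Usage against a live cloud (requires the openstack CLI and credentials):
#   classify_clean_error "$(openstack baremetal node show node-07 -f value -c last_error)"
```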

Can I skip cleaning to recover faster? You can configure cleaning behavior, but skipping disk erase has security implications between tenants. Prefer fixing the boot/network/timeout cause and letting cleaning complete.

Does Kolla-Ansible change the recovery steps? No. The Ironic workflow is identical; only log access differs — docker logs ironic_conductor versus journalctl -u openstack-ironic-conductor.
