Troubleshooting Cinder Block Storage in OpenStack

Cinder is mostly invisible until a volume gets stuck. Then you have an instance that won’t boot because its root volume is stuck in attaching, a volume you can’t delete because it’s in error_deleting, or a snapshot wedged in creating for an hour. After years of running OpenStack storage, I’ve learned that Cinder problems are almost always a mismatch between what the Cinder database believes and what the storage backend actually did.

This guide is the recovery playbook I reach for.

Read the state before you touch anything

Every Cinder recovery starts with the truth of the current state:

openstack volume show <volume-uuid> -f value -c status -c attachments
openstack volume list --all-projects | grep -iE 'error|attaching|detaching|creating'

Cinder’s status field is a state machine, and the stuck states tell you exactly what was happening when it failed:

attaching / detaching — a connection operation hung partway.
error_deleting — backend rejected the delete or the volume is still referenced.
creating — backend provisioning never completed.
in-use with no live attachment — a phantom attachment, the most common database/backend mismatch.
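A small filter can make that triage repeatable. The sketch below is one possible helper, not part of any OpenStack tooling: filter_stuck_volumes is a hypothetical name, and it assumes its input is two whitespace-separated columns (volume ID, then status). Unlike a plain grep, it matches only the status column, so a volume that merely has "error" in its name is not flagged. It cannot detect the phantom in-use case, which needs the attachment checks described later.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<volume-id> <status>" lines on stdin and prints
# only the rows whose status is a transitional state or any error_* state.
filter_stuck_volumes() {
    awk '$2 ~ /^(attaching|detaching|creating)$/ || $2 ~ /^error/ { print $1, $2 }'
}
```

One way to feed it would be: openstack volume list --all-projects -f value -c ID -c Status | filter_stuck_volumes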

Step 1: Read the volume’s host and the right log

A volume is owned by a specific cinder-volume host. Find it:

cinder show <volume-uuid> | grep os-vol-host-attr

Then read that host’s log, grepping by UUID:

grep <volume-uuid> /var/log/cinder/cinder-volume.log
journalctl -u cinder-volume --since "1 hour ago" | grep <volume-uuid>

The driver layer is where the real error hides — an LVM command that failed, a Ceph rbd permission error, a backend that returned an iSCSI target the host couldn’t reach.
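The os-vol-host-attr:host value normally has the form host@backend#pool, and only the part before the @ identifies the service host whose log you want. The following sketch splits that string; split_volume_host is a hypothetical helper name, and the example value in the usage note is made up. In deployments where the configured host is a logical or clustered name, the first field may not be a literal machine name, so treat the result as a pointer rather than proof.

```shell
#!/usr/bin/env bash
# Hypothetical helper: splits a Cinder host string (host@backend#pool)
# into its three parts. Missing parts are printed as empty values.
split_volume_host() {
    local full="$1" host backend="" pool=""
    host="${full%%@*}"          # everything before the first @
    host="${host%%\#*}"         # guard against a string with no backend part
    case "$full" in
        *@*) backend="${full#*@}"; backend="${backend%%\#*}" ;;
    esac
    case "$full" in
        *'#'*) pool="${full##*\#}" ;;
    esac
    printf 'host=%s\nbackend=%s\npool=%s\n' "$host" "$backend" "$pool"
}
```

For example, split_volume_host 'ctl01@lvm-1#LVM_pool' prints host=ctl01, backend=lvm-1 and pool=LVM_pool on separate lines. The input string can be read with openstack volume show <volume-uuid> -f value -c os-vol-host-attr:host.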

Step 2: Distinguish control-plane stuck from data-plane stuck

If the API accepted the request but it never completed, you have two possibilities, and they recover differently.

Control-plane stuck — a cinder-volume or cinder-scheduler service died mid-operation. Check:

openstack volume service list

A down cinder-volume host leaves every in-flight operation hung. Restart the service and many volumes self-heal as the periodic task reconciles.
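To turn the service list into a quick check, something like the following could work. It is a sketch under assumptions: down_volume_services is a hypothetical name, and the input is assumed to be three whitespace-separated columns (binary, host, state), which is what selecting the Binary, Host and State columns in value format should produce.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<binary> <host> <state>" lines on stdin and
# prints the binary and host of every service reported as down.
down_volume_services() {
    awk '$3 == "down" { print $1, $2 }'
}
```

A possible invocation: openstack volume service list -f value -c Binary -c Host -c State | down_volume_services. Empty output means no service is reported down, which points you toward the data-plane checks instead.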

Data-plane stuck — the backend itself failed. For LVM, check the volume group; for Ceph, check the pool; for iSCSI, check the session:

vgs && lvs                              # LVM backend
rbd -p volumes ls | grep <uuid>         # Ceph backend
iscsiadm -m session                     # iSCSI connections

Step 3: Recover stuck states with reset-state — carefully

The hammer is cinder reset-state. It rewrites the database status field without touching the backend. That makes it powerful and dangerous: you are lying to Cinder about reality, so you must first confirm reality.

For a volume stuck detaching that is genuinely no longer attached:

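# On the compute host, confirm the disk is gone from the libvirt domain.
# <instance-name> is the libvirt domain name, as listed by: virsh list --all
virsh domblklist <instance-name>
# Then, only if it's truly detached, reset both the status and the attach status:
cinder reset-state --state available <volume-uuid>
cinder reset-state --attach-status detached <volume-uuid>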

The golden rule: never reset-state to available while a data-plane attachment still exists, or two instances can mount the same volume and corrupt it. Verify on the hypervisor with virsh domblklist before you reset.
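The verify-then-reset rule can be encoded as a guard that refuses to produce the reset commands while the disk is still present. The sketch below is an illustration, not a vetted tool. plan_detach_reset is a hypothetical name. It assumes Nova's usual behaviour of writing the volume ID as the disk serial in the libvirt domain XML, so it reads virsh dumpxml output rather than domblklist, whose source paths do not contain the volume UUID on every backend. It checks one domain only, so a volume attached to a different instance would not be caught, and it prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical guard: reads libvirt domain XML on stdin and prints (never runs)
# the reset-state commands only if the volume is absent from that domain.
# Assumption: the volume ID appears as <serial>UUID</serial> on attached disks.
plan_detach_reset() {
    local uuid="$1" xml
    xml="$(cat)"
    case "$xml" in
        *"<domain"*) ;;   # looks like real domain XML, continue
        *) echo "REFUSE: no libvirt domain XML on stdin" >&2; return 2 ;;
    esac
    case "$xml" in
        *"<serial>${uuid}</serial>"*)
            echo "REFUSE: volume ${uuid} is still attached in this domain" >&2
            return 1 ;;
    esac
    echo "cinder reset-state --state available ${uuid}"
    echo "cinder reset-state --attach-status detached ${uuid}"
}
```

On the compute host this might be used as: virsh dumpxml <instance-name> | plan_detach_reset <volume-uuid>. Refusing on empty input matters: a failed virsh call would otherwise look exactly like a detached volume.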

Step 4: Clean up phantom attachments

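The in-use volume with no real attachment is the classic. List attachments at the API level (the attachment commands need volume API microversion 3.27 or later):

openstack --os-volume-api-version 3.27 volume attachment list --volume-id <volume-uuid>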

If the API shows an attachment but virsh domblklist on the hypervisor does not show the disk, the attachment record is orphaned. Delete the stale attachment record, then reset the volume:

openstack volume attachment delete <attachment-id>
cinder reset-state --state available <volume-uuid>
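When a volume has several attachment records, it helps to compute the orphaned ones explicitly instead of eyeballing two listings. The sketch below assumes two input files that you prepare yourself: one with "attachment-id server-uuid" pairs taken from the API, and one listing the server UUIDs on which you confirmed the disk is really present. orphaned_attachments and both file formats are illustrative assumptions, and the function only prints IDs so that deletion remains a deliberate manual step.

```shell
#!/usr/bin/env bash
# Hypothetical helper.
#   $1: file of "<attachment-id> <server-uuid>" lines (from the API)
#   $2: file of server UUIDs verified on the hypervisor as really attached
# Prints attachment IDs whose server is not in the verified list.
orphaned_attachments() {
    awk 'FILENAME == ARGV[1] { live[$1] = 1; next }
         !($2 in live) { print $1 }' "$2" "$1"
}
```

An empty verified file means every API attachment is reported as orphaned, so make sure that file reflects checks you actually ran rather than checks you skipped.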

Using AI to plan recovery without breaking things

Cinder recovery is high-stakes because the wrong reset-state corrupts data. This is a perfect place to use an LLM as a planning partner that never executes. I give it the volume status, the attachment list, the virsh domblklist output, and the driver log lines, then ask:

“Here is a volume’s Cinder status, its API attachment records, the hypervisor’s actual attached disks, and the cinder-volume driver log. Tell me whether the volume is truly attached, what the safe recovery sequence is, and explicitly flag any reset-state command that could risk data corruption. Do not assume — only conclude from the evidence I gave you.”

Forcing it to reason from the evidence — not from “usually you do X” — is what makes it useful here. I keep these recovery-planning prompts in my prompt library so the dangerous commands always get a second set of eyes first.
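Collecting the evidence into one clearly labelled block keeps the prompt honest about what was and was not observed. This is a minimal sketch of how that might be scripted; build_evidence_bundle is a hypothetical name, and the labels and file names in the usage note are placeholders for output you saved yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: takes alternating "<label> <file>" arguments and prints
# each file under a header line, producing one pasteable evidence block.
build_evidence_bundle() {
    local label file
    while [ "$#" -ge 2 ]; do
        label="$1"; file="$2"; shift 2
        printf '===== %s =====\n' "$label"
        cat -- "$file"
        printf '\n'
    done
}
```

For instance: build_evidence_bundle "Cinder status" status.txt "API attachments" attachments.txt "Hypervisor disks" domblklist.txt "Driver log" driver.log > evidence.txt. Review the result for credentials or tenant data before pasting it anywhere.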

Prevention beats recovery

A few habits dramatically reduce stuck-volume tickets:

Keep cinder-volume services healthy and monitored. Most stuck states trace back to a service that flapped during an operation.
Match your backend timeouts to reality. A slow Ceph cluster with default Cinder timeouts produces error states that aren’t really errors.
Audit periodically. Comparing Cinder’s own volume list against backend-side listings (lvs, rbd ls) catches orphaned LVs and RBD images before they pile up.
Treat reset-state as a scalpel, not a default. If you’re reaching for it weekly, fix the underlying backend or service stability.
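The periodic audit can be sketched as a set difference between what Cinder knows and what the backend holds. The helper below is an assumption-laden illustration: orphaned_backend_volumes is a hypothetical name, and it assumes the default naming template in which backend objects are called volume-<uuid>. It only reports candidates. A volume that is mid-delete or mid-migration can look orphaned for a while, so confirm before removing anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper.
#   $1: file of volume UUIDs known to Cinder, one per line
#   $2: file of backend object names (LV names or RBD image names)
# Prints backend names of the form volume-<uuid> that Cinder does not know.
# Assumption: the backend uses the default "volume-<uuid>" naming.
orphaned_backend_volumes() {
    awk 'FILENAME == ARGV[1] { known["volume-" $1] = 1; next }
         $1 ~ /^volume-/ && !($1 in known) { print $1 }' "$1" "$2"
}
```

The inputs might come from openstack volume list --all-projects -f value -c ID > cinder_ids.txt and rbd -p volumes ls > backend_names.txt (or the equivalent lvs listing), followed by orphaned_backend_volumes cinder_ids.txt backend_names.txt.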

For more storage and recovery prompts tuned to OpenStack, see our OpenStack guides. The pattern that keeps your data safe is always the same: confirm what the backend actually did before you tell the database anything.

AI recovery plans are assistive, not authoritative. Verify the true attachment state on the hypervisor before running any reset-state command.