Recovering Stuck Cinder Volumes and Snapshots with AI Help

Every OpenStack operator eventually develops a sixth sense for the words deleting and attaching lingering in a status column for too long. A volume that has been “creating” for forty minutes is not creating anything. It is stuck, and the database and the storage backend have quietly disagreed about reality. I have spent more late nights than I would like reconciling those two, and somewhere along the way I started using an AI assistant to help me think through the state machine faster. It is good at that. It is also the last thing on earth I would let run cinder reset-state against production. Let me explain why, and how I actually recover these.

Stuck states are a lie about agreement

Cinder tracks volume state in its database. The actual bytes live in a backend — LVM, Ceph RBD, a vendor array. A “stuck” volume means an operation started, updated the DB to a transient state like creating or deleting, and then something — the backend, the message queue, a cinder-volume restart mid-flight — broke before the state could resolve. The DB now says deleting forever; the backend may or may not have actually deleted anything.

openstack volume list --long
openstack volume show <volume-id>
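
A minimal sketch of how the listing can be narrowed to candidates, assuming it has been exported as ID and status pairs. The function name filter_transient and the exact set of states are illustrative choices, not something Cinder ships. A transient status alone does not prove a volume is stuck, so the age of that state still has to be checked by hand.

```shell
# Hypothetical helper: print only volumes whose status is an in-flight state.
# Input on stdin: one "<volume-id> <status>" pair per line, for example from
#   openstack volume list --all-projects -f value -c ID -c Status
filter_transient() {
  awk '$2 ~ /^(creating|deleting|attaching|detaching|extending|downloading|uploading)$/ { print $1, $2 }'
}
```

Anything this prints is only a shortlist. Whether forty minutes in creating is abnormal depends on the backend and the volume size.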

The first thing I do is figure out which side is out of sync. I will describe the situation to Claude — “volume in deleting for 30 minutes, cinder-volume restarted during the operation” — and ask it to reason about whether the backend object likely still exists. It is genuinely helpful as a fast junior engineer talking through the state machine. But it cannot see my Ceph cluster, so its conclusion is a hypothesis I then verify against the backend directly. The AI reasons; I confirm.

reset-state is a scalpel, not a cure

The command everyone reaches for is the state reset. It rewrites the database field. That is all it does. It does not touch the backend, does not finish the failed operation, does not clean anything up.

# legacy client:
cinder reset-state --state available <volume-id>

# openstack client:
openstack volume set --state available <volume-id>

Here is the part people skip: if you reset a volume stuck in deleting back to available, but the backend already deleted the actual disk, you now have a phantom — a DB record pointing at nothing. The next attach will fail in a far more confusing way. reset-state masks the symptom; it does not fix the cause. I say this to every junior I train, and I say it to the AI in my prompt context, because models love to suggest reset-state as a one-line fix. It is not a fix. It is a manual override that says “I, the human, have verified the real state and am forcing the DB to match it.”

Pro Tip: Before any reset-state, write down what you believe the true backend state is and why. If you cannot articulate it, you are not ready to reset — you are guessing, and guessing with reset-state corrupts your inventory.
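
One way to make that habit hard to skip is a small wrapper that refuses to produce a reset command without a written justification. This is a sketch only: guarded_reset_cmd is a hypothetical name, and it deliberately prints the command instead of running it, so a human still has to paste it.

```shell
# Hypothetical guard: emit a reset command only when the operator has
# written down what the backend state is. It never executes anything.
guarded_reset_cmd() {
  local volume_id
  local target_state
  local justification
  volume_id="${1:-}"
  target_state="${2:-}"
  justification="${3:-}"

  if [ -z "$volume_id" ] || [ -z "$target_state" ]; then
    echo "usage: guarded_reset_cmd <volume-id> <state> <justification>" >&2
    return 2
  fi

  # Treat an empty or whitespace-only justification as "still guessing".
  if [ -z "$(printf '%s' "$justification" | tr -d '[:space:]')" ]; then
    echo "refusing: state the verified backend state first" >&2
    return 1
  fi

  printf '# verified backend state: %s\n' "$justification"
  printf 'openstack volume set --state %s %s\n' "$target_state" "$volume_id"
}
```

Called as guarded_reset_cmd <volume-id> available "rbd info shows the image intact", it prints the justification as a comment followed by the reset command. With a blank third argument it prints nothing to stdout and returns non-zero.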

Orphaned attachments: the detach that never finished

Volumes stuck in attaching or detaching usually have an orphaned attachment record. The Nova/Cinder handshake failed partway, leaving an attachment row that points at an instance the volume is not really connected to.

openstack volume attachment list --volume-id <volume-id>
openstack volume attachment show <attachment-id>

If you find an attachment to an instance that no longer exists, or a detaching volume whose instance was deleted, that orphan is blocking every future operation. The clean path is to complete or delete the attachment record once you have confirmed the instance side is truly gone:

openstack volume attachment delete <attachment-id>

I use AI to correlate the attachment list against openstack server list output — feed it both tables, ask which attachments reference dead instances. That diffing is tedious and the model nails it in seconds. The decision to delete an attachment, though, is mine, made after I have confirmed on the compute host that no actual block device is still mapped. Deleting an attachment that is still live detaches a disk out from under a running workload. The AI does not get to make that call. I keep the correlation prompt itself in a prompt pack so the framing stays consistent across operators.
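
The same correlation can also be sketched without a model, assuming both listings have been saved to files first. The function name orphaned_attachments and the two-file layout are hypothetical: one file of "<attachment-id> <server-id>" pairs and one file of live server IDs (for instance from openstack server list --all-projects -f value -c ID).

```shell
# Hypothetical helper: list attachments whose server ID is not among the
# live servers. Read-only; it deletes nothing.
#   $1: file with "<attachment-id> <server-id>" per line
#   $2: file with one live server ID per line
orphaned_attachments() {
  local attachments_file
  local servers_file
  attachments_file="$1"
  servers_file="$2"
  # Compare on FILENAME rather than NR == FNR so that an empty server
  # list does not cause the attachment file to be mistaken for it.
  awk 'FILENAME == ARGV[1] { live[$1] = 1; next }
       !($2 in live)       { print $1, $2 }' "$servers_file" "$attachments_file"
}
```

The output is a list of suspects, not a delete list. Each one still needs the compute-host check described above before its attachment record is removed.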

DB versus backend: go look at the actual storage

When the DB says one thing, I go ask the backend what it really has. For LVM:

sudo lvs | grep <volume-id>

For Ceph RBD:

sudo rbd -p volumes ls | grep <volume-id>
sudo rbd -p volumes info volume-<volume-id>

Now I have ground truth. If the RBD image exists but Cinder says error_deleting, the delete failed and I can retry it cleanly. If the image is gone but Cinder still lists the volume, that is a case where resetting the state to error and then deleting the volume (or force-deleting it) is legitimate, because I have personally confirmed the backend is already clean. Resetting to deleting would not help, since Cinder will not accept a delete for a volume already in that state. The cinder-volume logs tell the story of why it broke:

sudo journalctl -u cinder-volume -n 200
sudo tail -f /var/log/cinder/cinder-volume.log
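
The DB-versus-backend comparison in this section boils down to a very small decision table. The sketch below encodes only the two delete cases discussed here and sends everything else to a human. recovery_hint and its output labels are hypothetical names, and the backend_exists argument is whatever the operator personally observed with lvs or rbd.

```shell
# Hypothetical decision table: map (DB status, backend object present?)
# to a suggested next step. It prints a label and runs nothing.
#   $1: Cinder status, e.g. deleting or error_deleting
#   $2: "yes" or "no", as verified directly on the backend
recovery_hint() {
  local db_status
  local backend_exists
  db_status="${1:-}"
  backend_exists="${2:-}"
  case "${db_status}:${backend_exists}" in
    deleting:yes|error_deleting:yes) echo "retry-delete" ;;
    deleting:no|error_deleting:no)   echo "reset-then-delete" ;;
    *)                               echo "investigate" ;;
  esac
}
```

Anything outside those two rows deliberately falls through to investigate, because the right move for a half-created or half-attached volume depends on details no lookup table captures.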

I will paste a backend traceback into an AI and ask it to explain the failure mode — a Ceph permission error, an LVM lock, a timeout. It is a fast, well-read assistant for decoding stack traces, and it routinely saves me a trip through the source. I verify its read against the actual log, always, because a confidently wrong explanation of a storage error can send you down an hour-long dead end. For repeatable scenarios I lean on a small set of diagnostic prompts tuned for storage backends.

Snapshots stuck in error_deleting

Snapshots have their own wedged state, error_deleting, and they are nastier because a snapshot can hold a dependency on its parent volume. You cannot always just force it.

openstack volume snapshot list --long
openstack volume snapshot show <snapshot-id>
# legacy client; rewrites the DB field only, so it comes last:
cinder snapshot-reset-state --state available <snapshot-id>

The order matters: confirm the backend snapshot object (for RBD, rbd snap ls volumes/volume-<volume-id>, using the parent volume's ID), check whether the snapshot or its parent volume has a dependent clone, then decide. Reset a snapshot to available and delete it while the backend still holds a child reference, and you can orphan storage that nothing in Cinder tracks anymore: silent capacity loss you will rediscover months later when the pool fills. This is precisely the kind of multi-step dependency reasoning where I describe the chain to an AI to make sure I have not missed an edge, while keeping every actual command in my own hands.
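
The dependency check can be turned into a gate that blocks the later steps. This sketch assumes an RBD backend and that the list of dependent clones has already been gathered, for example with rbd children on the snapshot. That command is not part of the workflow described above, and snapshot_has_no_children is a hypothetical name.

```shell
# Hypothetical gate: succeed only if no dependent clones were listed.
# Input on stdin: one clone per line (for example the output of
# "rbd children <pool>/<image>@<snap>"); blank lines are ignored.
snapshot_has_no_children() {
  local children
  children=$(grep -v '^[[:space:]]*$' || true)
  if [ -n "$children" ]; then
    echo "blocked: dependent clones exist:" >&2
    printf '%s\n' "$children" >&2
    return 1
  fi
  echo "no dependent clones listed"
}
```

A zero exit status only means the input was empty. It says nothing about whether the right snapshot was queried, which is why the confirmation of the backend object comes first.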

Confirm the service is even running

Before blaming a volume, confirm cinder-volume is up for its backend. A down service makes every volume on that backend look stuck when the real problem is the service.

openstack volume service list

Look for the backend host showing down. Restart the service, give it a moment to reconcile, and a surprising number of “stuck” volumes resolve themselves without any reset. The AI is good at reminding me to check this first — the fast junior catching the obvious thing I skipped at hour three of an incident. When a real incident is in play, I run it through our incident response process with a human approval gate, and I have the changes reviewed via code review before anything touches an automation script. The model advises inside those guardrails; it never holds the credentials.
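
For a quick read of the service list, a filter along these lines can help. It assumes the listing was exported with only the Binary, Host and State columns, in that order (for example with -f value -c Binary -c Host -c State). down_volume_services is a hypothetical name.

```shell
# Hypothetical filter: print the host of every cinder-volume service
# reported as down. Input on stdin: "<binary> <host> <state>" per line.
down_volume_services() {
  awk '$1 == "cinder-volume" && $3 == "down" { print $2 }'
}
```

An empty result means the listing showed no cinder-volume service as down. It does not prove the backend behind that service is healthy.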

Conclusion

Stuck Cinder states are a disagreement between the database and the backend, and the only safe recovery is to establish ground truth on the storage side before you override anything. reset-state is a manual confirmation, not a repair. Use it after you know the truth, never to make a status look nice. Let AI accelerate the correlation, the log decoding, and the dependency reasoning, but keep the destructive commands, the production cloud, and the clouds.yaml firmly out of its reach. It is a fast junior engineer, and you are still the one accountable for the data.