Cinder Volume Backups and Disaster Recovery in OpenStack

The most expensive lesson in OpenStack storage is learning, during an incident, that a snapshot is not a backup. A Cinder snapshot lives on the same backend as the volume — if the backend dies, the snapshot dies with it. Real disaster recovery means Cinder backups going to a separate store (Swift, NFS, or a remote Ceph pool) so you can rebuild after losing the primary backend entirely. I have rebuilt volumes from backups during a backend failure and I have also discovered, too late, that nobody had ever tested a restore. This guide is how I do both parts: the backups and the restores.

Snapshot versus backup, concretely

Both have a place, but they solve different problems. A snapshot is fast, same-backend, good for “oops I deleted a file.” A backup is slower, off-backend, and the only thing that survives losing the storage system.

openstack volume snapshot list --all-projects
openstack volume backup list --all-projects
openstack volume service list --service cinder-backup  # cinder-backup agents

If that last command shows no cinder-backup agent that is both enabled and up, you have no DR capability at all, regardless of how many snapshots exist. That is the first thing I confirm when I inherit a cloud, and it is wrong more often than you would hope.
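That check is easy to turn into a monitoring probe. The following is a minimal sketch, assuming the usual Host, Status, and State columns of the volume service listing; the function names are hypothetical.

```shell
# Count cinder-backup agents that are both enabled and up.
# Reads "Host Status State" rows on stdin (column layout is an assumption
# based on the usual `openstack volume service list` output).
count_up_backup_agents() {
  awk '$2 == "enabled" && $3 == "up" { n++ } END { print n + 0 }'
}

# Hypothetical probe: exit non-zero when no agent can take backups.
check_backup_agents() {
  local up
  up=$(openstack volume service list --service cinder-backup \
         -f value -c Host -c Status -c State | count_up_backup_agents)
  if [ "$up" -eq 0 ]; then
    echo "CRITICAL: no cinder-backup agent is enabled and up"
    return 2
  fi
  echo "OK: $up cinder-backup agent(s) up"
}
```

Calling check_backup_agents from an existing monitoring job gives a pass/fail signal without parsing tables by eye.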

Take backups that actually leave the backend

A backup copies the volume's data through the configured backup driver into the backup store. The first backup is full; subsequent ones can be incremental, which is what makes a daily schedule affordable.

openstack volume backup create --name db-full <volume-uuid>
openstack volume backup create --incremental --name db-incr <volume-uuid>
openstack volume backup show db-full -f value -c status -c size

A backup stuck in creating usually means the cinder-backup agent cannot reach the backup store, or the source volume is attached and the driver cannot get a consistent read. For attached volumes, either use --force knowingly or snapshot first and back up the snapshot for consistency.
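One way to make "stuck in creating" concrete is to bound the wait. This is a sketch rather than a definitive implementation: wait_for_backup is a hypothetical helper, and the default timeout and polling interval are assumptions to tune for your volume sizes.

```shell
# Poll a backup until it is available, errors out, or the timeout expires.
# Returns 0 on available, 1 on error, 2 on timeout.
wait_for_backup() {
  local backup="$1" timeout="${2:-3600}" interval="${3:-30}"
  local waited=0 status=""
  while [ "$waited" -le "$timeout" ]; do
    status=$(openstack volume backup show "$backup" -f value -c status)
    case "$status" in
      available) return 0 ;;
      error)     echo "backup $backup entered error state" >&2; return 1 ;;
    esac
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "backup $backup still '$status' after ${timeout}s" >&2
  return 2
}
```

A timeout result is the cue to look at agent connectivity to the backup store before retrying.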

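Pro Tip: Incrementals chain off a parent, and restoring one needs every backup back to the full. Cinder's API refuses to delete a backup that still has dependent incrementals, so let it manage the chain: delete newest-first through Cinder, and never remove backup objects directly from the store. Deleting behind Cinder's back is how every incremental in a chain becomes unrestorable.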
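A cleanup script can check for dependents itself before it attempts a delete. A minimal sketch, assuming your client version exposes the has_dependent_backups field on backup show; safe_backup_delete is a hypothetical helper name.

```shell
# Delete a backup only when Cinder reports no dependent incrementals.
# Any answer other than an explicit False (including an empty one) is
# treated as unsafe.
safe_backup_delete() {
  local backup="$1" deps
  deps=$(openstack volume backup show "$backup" -f value -c has_dependent_backups)
  case "$deps" in
    False|false)
      openstack volume backup delete "$backup" ;;
    *)
      echo "refusing to delete $backup: has_dependent_backups=$deps" >&2
      return 1 ;;
  esac
}
```

Failing closed on an unexpected answer keeps a pruning job from guessing about the chain.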

Test the restore before you need it

An untested backup is a hope, not a plan. The restore path has its own failure modes, so rehearse it on a throwaway volume on a normal Tuesday.

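# The target is a positional argument. Recent clients create db-restored when no
# volume by that name exists; older clients need an existing target volume.
openstack volume backup restore db-full db-restored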
openstack volume show db-restored -f value -c status -c size
# Attach to a scratch instance and verify the filesystem mounts and data is intact

The restore creates a new volume by default, which is exactly what you want during DR — you do not overwrite the (possibly recoverable) original. Mount it, check the data, and only then decide to cut over. The teams that survive backend failures are the ones who have done this drill at least once.
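"Check the data" is easier to trust when it is mechanical. One option, sketched here under assumptions, is to write a checksum manifest inside the guest before the backup and verify it on the scratch instance after mounting the restored volume. The MANIFEST.sha256 file name and both function names are hypothetical, and the sketch relies on GNU find, sort, xargs, and sha256sum.

```shell
# Run inside the guest before the backup: record checksums for every file
# under the given directory (point it at the data that matters).
write_manifest() {
  local root="$1"
  ( cd "$root" && find . -type f ! -name MANIFEST.sha256 -print0 \
      | sort -z | xargs -0 -r sha256sum > MANIFEST.sha256 )
}

# Run on the scratch instance after mounting the restored volume:
# exits non-zero if any file is missing or its checksum differs.
verify_manifest() {
  local root="$1"
  ( cd "$root" && sha256sum --quiet -c MANIFEST.sha256 )
}
```

Files that legitimately changed between the manifest and the backup will be reported as failures, so write the manifest as close to the backup as possible or limit it to static data.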

Plan for losing the whole backend

True DR is restoring when the primary backend is gone. That means your backup store must be independent, and you must have the metadata to recreate volumes. Export the backup records so you can rebuild even if the Cinder database is lost.

openstack volume backup record export <backup-uuid>
# Save this record off-site; import it on a fresh cloud to recover the backup:
# import takes two values, both printed by the export: the backup service and the backup URL (metadata)
openstack volume backup record import <backup-service> <backup-url>

The record export is the part everyone forgets. With it, you can import a backup into a freshly built Cinder, even a different deployment, and restore. Without it, your data may be sitting safely in Swift with no way to reference it.
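Exporting records by hand does not scale past a few backups. A minimal sketch of a bulk export, assuming admin credentials are already loaded in the environment; export_backup_records and the destination directory are hypothetical, and the JSON is saved exactly as the client prints it.

```shell
# Save one export record per backup into the given directory.
# The raw JSON output is stored untouched so nothing here depends on
# the record's field names.
export_backup_records() {
  local dest="$1" id
  mkdir -p "$dest"
  openstack volume backup list --all-projects -f value -c ID |
  while read -r id; do
    [ -n "$id" ] || continue
    openstack volume backup record export "$id" -f json \
      </dev/null > "$dest/$id.json" \
      || echo "export failed for $id" >&2
  done
}
```

Copy the resulting directory somewhere that does not share fate with the cloud, and treat it as sensitive, since the records describe where the backup data lives.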

Let AI build the runbook and the schedule

DR is procedure-heavy, and drafting procedures is exactly where an AI assistant works well, in the role of a fast junior engineer. I describe my environment (backup driver, RPO target, which volumes are critical) and have the model draft the backup schedule script and a step-by-step restore runbook. It produces a solid first version that I then correct against my actual volume types and naming.

I never give it production credentials or let it execute anything. It writes the script; I read every line, test the restore on a scratch volume myself, and only then schedule it. A wrong --incremental parent or an out-of-order delete in a generated script can silently break a chain, so the human owns execution. The prompt library has runbook-generation prompts, the prompt workspace is where I iterate on the DR runbook, and the storage and DR prompt pack bundles the Cinder backup prompts I rely on.

openstack volume backup list -f value -c ID -c Name -c Status \
  | grep -v available   # the kind of state summary I hand the model

Gemma running locally is a nice fit here when I want to keep even sanitized infrastructure details fully on-prem while drafting runbooks.

Consistency: the backup that restores to garbage

The subtlest backup failure is one that succeeds and then restores to a corrupt filesystem. Backing up an attached, actively written volume captures an inconsistent point in time, like copying a database mid-transaction. For anything stateful, quiesce the application or freeze the filesystem before the backup; --force only lets the operation proceed on an in-use volume and does nothing for consistency.

# Quiesce inside the guest first (example: fsfreeze --freeze on the data mount),
# take the snapshot, thaw the guest, then back up the snapshot:
openstack volume snapshot create --volume <volume-uuid> --force snap-pre-backup
openstack volume backup create --snapshot snap-pre-backup --name db-consistent <volume-uuid>

Backing up from a snapshot rather than the live volume gives you a crash-consistent image, and quiescing the application first gives you an application-consistent one. The difference matters most exactly when you need the backup — during a real recovery, a database that was backed up mid-write may refuse to start, and you will not discover that until the worst possible moment. This is precisely why the restore drill is non-negotiable: it is the only thing that proves your backups are consistent, not just present.
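The freeze, snapshot, thaw, backup sequence can be sketched as one function. Everything environment-specific here is an assumption: SSH access to the guest with passwordless sudo, fsfreeze on a dedicated data mount (never the root filesystem), a snapshot that becomes available within about a minute, and the hypothetical name consistent_backup.

```shell
# Usage: consistent_backup <guest> <mountpoint> <volume> <backup-name>
consistent_backup() {
  local guest="$1" mountpoint="$2" volume="$3" name="$4"
  local snap="snap-$name" tries=0 rc=0

  ssh "$guest" sudo fsfreeze --freeze "$mountpoint" || return 1

  # Snapshot while writes are frozen, then wait (bounded) until it is ready.
  if openstack volume snapshot create --volume "$volume" --force "$snap" >/dev/null; then
    while [ "$tries" -lt 30 ] &&
          [ "$(openstack volume snapshot show "$snap" -f value -c status)" != available ]; do
      sleep 2
      tries=$(( tries + 1 ))
    done
    [ "$tries" -lt 30 ] || rc=1
  else
    rc=1
  fi

  # Always thaw, even when the snapshot failed or timed out.
  ssh "$guest" sudo fsfreeze --unfreeze "$mountpoint" || rc=1
  [ "$rc" -eq 0 ] || return 1

  # Back up the frozen point in time, not the live volume.
  openstack volume backup create --snapshot "$snap" --name "$name" "$volume"
}
```

Waiting for the snapshot before thawing is the conservative choice; it keeps writes blocked for longer, so it suits a maintenance window better than peak hours.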

Know your RPO and RTO before you design

A DR strategy is meaningless without numbers. RPO (how much data you can afford to lose) drives backup frequency; RTO (how long recovery may take) drives how the backups are stored and how the restore is automated. Designing the schedule without these is how teams end up with hourly backups they can never restore in time, or daily backups when the business needed minutes.

# Audit actual backup recency against your RPO target:
openstack volume backup list --all-projects -f value -c 'Name' -c 'Created At' \
  | sort -k2 | tail

If your RPO is one hour but the newest backup of a critical volume is twelve hours old, your strategy has already failed silently. I put that recency audit into the same monitoring that watches the backup agent, because the gap between “we take backups” and “we meet our RPO” is where most DR plans quietly die. Numbers first, then schedule, then the restore rehearsal that proves both.
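The recency audit can be reduced to a filter that prints only the volumes that miss the RPO. A sketch under assumptions: input rows are a creation timestamp followed by a volume name or ID, timestamps are UTC in the ISO format Cinder usually prints, GNU date is available, and stale_volumes is a hypothetical name.

```shell
# Reads "<created_at> <volume>" rows on stdin and prints every volume whose
# newest backup is older than the RPO. The current time is passed in as
# epoch seconds so a run is repeatable.
stale_volumes() {
  local rpo_seconds="$1" now="$2" created volume ts age
  awk '{ ts = $1; $1 = ""; sub(/^ /, "")
         if (!($0 in latest) || ts > latest[$0]) latest[$0] = ts }
       END { for (v in latest) print latest[v], v }' |
  sort -k2 |
  while read -r created volume; do
    # Drop fractional seconds, then convert the UTC timestamp to epoch.
    ts=$(date -u -d "${created%%.*}" +%s) || continue
    age=$(( now - ts ))
    if [ "$age" -gt "$rpo_seconds" ]; then
      echo "STALE $volume: newest backup is $(( age / 60 )) min old"
    fi
  done
}
```

A possible invocation for a one-hour RPO, assuming the long listing includes Created At and Volume columns in that order in your client version:

```shell
openstack volume backup list --all-projects --long -f value \
  -c 'Created At' -c Volume | stale_volumes 3600 "$(date +%s)"
```

Volumes that have never been backed up do not appear in the backup list at all, so compare the result against the list of critical volumes as well.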

Automate and monitor the schedule

A backup strategy nobody watches rots. Drive backups from a scheduler and alert when one fails, because a silently failing nightly backup is worse than no backup — it gives false confidence.

# In a cron-driven script, once the backup has had time to finish (create returns immediately):
openstack volume backup show "$NAME" -f value -c status | grep -q available \
  || echo "BACKUP FAILED: $NAME" | logger -t cinder-backup

Wire that failure signal into your alerting so a stuck cinder-backup agent pages you within a day, not during the disaster.
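The scheduler side can stay small. A minimal sketch, assuming a policy of one full backup on Sunday and incrementals on the other days; that policy, the naming scheme, and both function names are illustrative choices, not Cinder defaults.

```shell
# Decide the backup mode from the ISO day of week (1 = Monday .. 7 = Sunday).
backup_mode() {
  if [ "$1" -eq 7 ]; then echo full; else echo incremental; fi
}

# Create today's backup for one volume. The day of week can be overridden
# as the second argument, which makes dry runs and testing easier.
run_scheduled_backup() {
  local volume="$1" dow="${2:-$(date +%u)}" stamp name
  stamp=$(date +%Y%m%d)
  if [ "$(backup_mode "$dow")" = full ]; then
    name="$volume-full-$stamp"
    openstack volume backup create --name "$name" "$volume"
  else
    name="$volume-incr-$stamp"
    openstack volume backup create --incremental --name "$name" "$volume"
  fi
}
```

An incremental needs an existing full backup of the same volume, so seed each volume with a full before the first scheduled run, and follow every create with the status check so failures reach the log.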

Conclusion

Snapshots are convenience; backups are survival. Build Cinder backups that leave the primary backend, export the backup records off-site, and — most importantly — rehearse the restore before you need it. An AI assistant is a strong fast junior for drafting the backup scripts and the restore runbook, but it never touches production: keep credentials out, verify the chain logic, and run and test every restore yourself. More Cinder and storage guides live under the OpenStack category.