Cinder Volume Replication & DR Failover Design Prompt
Design Cinder Cheesecake-style (host-level) volume replication and host failover/failback so that block storage survives a backend or site outage, backed by a tested, ordered recovery runbook.
- Target user: Storage architects building DR for OpenStack block storage
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior storage architect who has built disaster recovery for Cinder across replicated SAN, Ceph RBD mirroring, and vendor array backends.

I will provide:
- Cinder backend config (`cinder.conf` driver, `replication_device`, volume types)
- Backend capabilities (sync vs async replication, RPO targets)
- Topology (primary/secondary arrays or Ceph clusters, network between sites)
- `openstack volume type list --long` output showing the replication extra-specs
- DR objectives (RPO/RTO, which volume types must replicate)

Your job:
1. **Replication model**: explain Cinder's host-level replication (failover-host) versus per-volume replication, and map my backend's capability to the right model. Define the `replication_enabled='<is> True'` volume type extra-spec.
2. **Backend wiring**: produce the `replication_device` stanza for my driver, including the remote backend ID, credentials handling, and how secondary endpoints are declared.
3. **RPO/RTO reality check**: given sync vs async replication, state the achievable RPO and the expected failover time, and flag any objective my backend cannot meet.
4. **Failover procedure**: an ordered `cinder failover-host <host> --backend_id <secondary>` runbook covering pre-checks, quiescing, execution, and re-attaching volumes to instances on the DR side.
5. **Failback**: the reverse path, resync direction, split-brain avoidance, and how to confirm replication is healthy before failing back.
6. **Consistency**: address application vs crash consistency, multi-volume groups, and whether attached-instance state survives.
7. **Validation drill**: a non-destructive test plan (test volume type, scheduled DR drill) and the metrics to capture (actual RPO, failover duration, data integrity check).

Output as: (a) replication architecture summary with RPO/RTO table, (b) cinder.conf + volume-type diffs, (c) failover runbook, (d) failback runbook, (e) DR-drill checklist, (f) monitoring/alerting on replication lag.
Bias toward: tested, ordered procedures over ad-hoc CLI commands; explicit split-brain guards; and honest RPO claims.
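
The drill metrics that step 7 of the prompt asks for (actual RPO, failover duration) can be checked against the stated objectives with a few lines of Python. This is a minimal sketch and not part of the prompt itself: the `DrillResult` fields, the function names, and the sample timestamps and targets are all hypothetical, and it assumes the three timestamps were recorded by hand during the drill rather than pulled from any Cinder API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps recorded manually during one DR drill (hypothetical schema)."""
    last_replicated_write: datetime  # newest write found on the secondary after failover
    outage_declared: datetime        # moment the primary backend was declared down
    service_restored: datetime       # moment volumes were usable again on the DR side


def measure(drill: DrillResult) -> tuple[timedelta, timedelta]:
    """Return (actual RPO, actual RTO) for one drill."""
    # RPO: how much recent data was missing on the secondary, never negative.
    rpo = max(drill.outage_declared - drill.last_replicated_write, timedelta(0))
    # RTO: how long it took to restore service after the outage was declared.
    rto = drill.service_restored - drill.outage_declared
    return rpo, rto


def unmet_objectives(drill: DrillResult,
                     rpo_target: timedelta,
                     rto_target: timedelta) -> list[str]:
    """List every objective the drill failed to meet (empty list means all met)."""
    rpo, rto = measure(drill)
    problems = []
    if rpo > rpo_target:
        problems.append(f"RPO {rpo} exceeds target {rpo_target}")
    if rto > rto_target:
        problems.append(f"RTO {rto} exceeds target {rto_target}")
    return problems


if __name__ == "__main__":
    # Sample values only: an async backend that was 90 s behind at failover.
    drill = DrillResult(
        last_replicated_write=datetime(2024, 1, 1, 9, 58, 30),
        outage_declared=datetime(2024, 1, 1, 10, 0, 0),
        service_restored=datetime(2024, 1, 1, 10, 12, 0),
    )
    for line in unmet_objectives(drill, timedelta(seconds=60), timedelta(minutes=15)):
        print(line)  # prints "RPO 0:01:30 exceeds target 0:01:00"
```

Recording the measured values next to the targets after every drill keeps the RPO claim honest: an asynchronous backend that misses its RPO target in a drill will also miss it in a real outage.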