
Cinder Volume Replication & DR Failover Design Prompt

Design Cinder Cheesecake-style (whole-backend) volume replication and host failover/failback so that block storage survives a backend or site outage, backed by a tested, ordered recovery runbook.

Target user: Storage architects building DR for OpenStack block storage
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior storage architect who has built disaster recovery for Cinder across replicated SAN, Ceph RBD mirroring, and vendor array backends.

I will provide:
- Cinder backend config (`cinder.conf` driver, `replication_device`, volume types)
- Backend capabilities (sync vs async replication, achievable replication interval)
- Topology (primary/secondary arrays or Ceph clusters, network between sites)
- `openstack volume type list --long` output, showing the replication extra-specs
- DR objectives (RPO/RTO, which volume types must replicate)

Your job:

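1. **Replication model** — explain Cinder's host-level (whole-backend) replication, driven by `failover-host`, versus finer-grained per-volume or per-group replication, and map my backend's capability to the right model. Explain what the `replication_enabled='<is> True'` volume type extra-spec does and which volume types need it.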

2. **Backend wiring** — produce the `replication_device` stanza for my driver, including remote backend ID, credentials handling, and how secondary endpoints are declared.

3. **RPO/RTO reality check** — given sync vs async, state the achievable RPO and the failover time, and flag any objective my backend cannot meet.

4. **Failover procedure** — an ordered runbook built around `cinder failover-host <host> --backend_id <secondary>`, where `<host>` is the cinder-volume service host in `host@backend` form: pre-checks, quiescing, execution, and re-attaching volumes to instances on the DR side.

5. **Failback** — the reverse path (in Cinder, `cinder failover-host <host> --backend_id default`), resync direction, split-brain avoidance, and how to confirm replication is healthy before failing back.

6. **Consistency** — address application/crash consistency, multi-volume groups, and whether attached-instance state survives.

7. **Validation drill** — a non-destructive test plan (test volume type, scheduled DR drill) and the metrics to capture (actual RPO, failover duration, data integrity check).
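
To make the RPO reality check in item 3 concrete, here is a minimal Python sketch of the arithmetic a reviewer might apply to the model's answer. It assumes a deliberately simplified model: synchronous replication loses no acknowledged writes, and asynchronous replication can lose up to one replication interval plus the time needed to ship a delta. The names `ReplicationProfile`, `worst_case_rpo_s`, and `unmet_objectives`, and the volume type names in the usage note, are hypothetical and are not part of Cinder.

```python
from dataclasses import dataclass


@dataclass
class ReplicationProfile:
    """Hypothetical description of how one volume type is replicated."""
    mode: str                 # "sync" or "async"
    interval_s: float = 0.0   # async only: seconds between replication cycles
    transfer_s: float = 0.0   # async only: worst-case seconds to ship one delta


def worst_case_rpo_s(profile: ReplicationProfile) -> float:
    """Upper bound on data loss in seconds under the simplified model above."""
    if profile.mode == "sync":
        return 0.0
    if profile.mode == "async":
        # The newest complete copy on the secondary may be one full cycle
        # old, plus the time the in-flight delta needed to arrive.
        return profile.interval_s + profile.transfer_s
    raise ValueError(f"unknown replication mode: {profile.mode!r}")


def unmet_objectives(profiles, rpo_targets_s):
    """Return {volume_type: (target_s, achievable_s)} for every missed target."""
    unmet = {}
    for vtype, target in rpo_targets_s.items():
        achievable = worst_case_rpo_s(profiles[vtype])
        if achievable > target:
            unmet[vtype] = (target, achievable)
    return unmet
```

For example, an async profile with a 300-second interval and a 60-second transfer time has a worst-case RPO of 360 seconds under this model, so a 120-second objective for that volume type would be reported as unmet. Real backends may behave better or worse than this bound, which is why the drill in item 7 measures the actual RPO.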

Output as: (a) replication architecture summary with RPO/RTO table, (b) cinder.conf + volume-type diffs, (c) failover runbook, (d) failback runbook, (e) DR-drill checklist, (f) monitoring/alerting on replication lag.
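
Output item (f), monitoring and alerting on replication lag, can be sketched as a small classifier. This is an illustrative Python sketch only: it assumes some collector already supplies the timestamp of the last completed replication cycle, which is backend-specific and not shown here. The function name `classify_lag` and the 80% warning threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def classify_lag(last_sync: datetime, now: datetime,
                 rpo_target: timedelta, warn_fraction: float = 0.8) -> str:
    """Classify replication lag against an RPO target.

    Returns "ok", "warning" (lag has reached warn_fraction of the target),
    or "critical" (lag exceeds the target, so the RPO is currently broken).
    """
    lag = now - last_sync
    if lag < timedelta(0):
        # A last-sync time in the future points at clock skew between sites.
        raise ValueError("last_sync is later than now; check clock sync")
    if lag > rpo_target:
        return "critical"
    if lag >= rpo_target * warn_fraction:
        return "warning"
    return "ok"
```

With a 5-minute RPO target, a lag of 2 minutes is classified as "ok", 4.5 minutes as "warning", and 6 minutes as "critical". This only makes sense for asynchronous replication; with a zero RPO target any measurable lag is reported as critical.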

Bias toward: tested ordered procedures over ad-hoc CLI, explicit split-brain guards, and honest RPO claims.
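
One way to picture "tested ordered procedures" with an explicit split-brain guard is a pre-check runner that stops at the first failed step, so the failover command is never reached while the primary might still be accepting writes. The Python sketch below is a hypothetical illustration: `run_prechecks` and the check names are invented here, and the lambdas are stubs standing in for real probes of your arrays, Ceph clusters, or Cinder services.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (description, probe) pair; the probe returns True on success.
Check = Tuple[str, Callable[[], bool]]


def run_prechecks(checks: List[Check]) -> Tuple[bool, List[str], Optional[str]]:
    """Run checks strictly in order and stop at the first failure.

    Returns (all_passed, names_of_passed_checks, name_of_failed_check).
    Later checks are not evaluated once one fails.
    """
    passed: List[str] = []
    for name, probe in checks:
        if not probe():
            return False, passed, name
        passed.append(name)
    return True, passed, None


# Stub probes for illustration only; replace each lambda with a real test.
EXAMPLE_CHECKS: List[Check] = [
    ("secondary backend reachable", lambda: True),
    ("replication lag within RPO", lambda: True),
    ("primary fenced or confirmed down", lambda: False),  # split-brain guard
    ("application writers quiesced", lambda: True),
]
```

Running `run_prechecks(EXAMPLE_CHECKS)` with these stubs stops at the split-brain guard and reports it as the failed step, leaving the quiesce check unevaluated. The ordering itself is a design choice for this sketch, not something Cinder prescribes.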
