
Swift Erasure Coding Storage Policy Design Prompt

Design Swift erasure-coding storage policies — picking EC scheme, fragment/parity counts, and region layout to cut raw-capacity cost while keeping durability and read latency acceptable.

Target user: Object-storage operators scaling Swift capacity efficiently
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Swift operator who has rolled out erasure-coding storage policies at petabyte scale and knows where EC saves money and where replication is still the right call.

I will provide:
- Cluster topology: regions, zones, nodes, disks, network between zones
- Current policies (replication factor) and capacity/cost pressure
- Workload profile: object size distribution, read/write ratio, latency SLA
- PyECLib/liberasurecode backends available (e.g., `liberasurecode_rs_vand`, or an ISA-L backed type such as `isa_l_rs_cauchy`)
- Durability target and failure-domain requirements

Your job:

1. **EC vs replication** — explain the real trade-off: EC cuts raw-capacity overhead but raises CPU cost, fans every write and read out to more nodes, and handles small objects inefficiently. State clearly when to keep 3x replication (small objects, latency-critical) vs EC (large objects, cold/warm capacity).

2. **Scheme selection** — choose `ec_num_data_fragments` / `ec_num_parity_fragments` and `ec_type`, and compute the resulting overhead and durability (how many disk/node/zone failures it survives). Show the math, not a vibe.

3. **Failure-domain placement** — map fragments across zones/regions so the policy actually survives the failure domain it claims; warn about schemes that need more zones than the cluster has.

4. **Performance** — implications of `ec_object_segment_size`, reconstruction cost on read with missing fragments, and the proxy/CPU load (favor ISA-L). Identify the small-object penalty and a size threshold to route below.

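5. **Policy rollout** — add the new policy to `swift.conf` identically on every node (a mismatched policy index, name, or EC parameter makes nodes disagree about where and how data is stored and can leave objects unreadable), build the EC object ring (`object-N.ring.gz` for policy index N), and weigh default-policy considerations. Existing data does NOT move: a container's policy is fixed at creation, so only new containers use the new policy.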

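6. **Migration** — how to move existing data into the EC policy without downtime. A container's policy cannot be changed in place, so this means copying objects into a new container created with the EC policy (server-side copy or migration tooling) and then cutting clients over.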

7. **Validation** — write/read at target object sizes, kill a zone, confirm reconstruction works and latency stays within SLA.
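As a reference point for steps 2 and 5, here is a minimal Python sketch of what an EC policy stanza in `swift.conf` might look like, plus one way to spot nodes whose copy differs. The policy index `2`, the name `ec-10-4`, the 10+4 scheme, the node names, and the helpers `render_ec_policy`, `config_digest`, and `mismatched_nodes` are illustrative assumptions, not values from a real cluster. How the per-node file contents get collected is left out on purpose.

```python
import configparser
import hashlib
import io
from collections import Counter


def render_ec_policy(index, name, ec_type, data_frags, parity_frags,
                     segment_size=1048576):
    """Render a [storage-policy:N] stanza for swift.conf as text."""
    cfg = configparser.ConfigParser()
    cfg[f"storage-policy:{index}"] = {
        "name": name,
        "policy_type": "erasure_coding",
        "ec_type": ec_type,
        "ec_num_data_fragments": str(data_frags),
        "ec_num_parity_fragments": str(parity_frags),
        "ec_object_segment_size": str(segment_size),
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


def config_digest(text):
    """SHA-256 of a swift.conf body, used only to compare copies."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def mismatched_nodes(digests_by_node):
    """Return the nodes whose digest differs from the most common one."""
    majority, _count = Counter(digests_by_node.values()).most_common(1)[0]
    return sorted(n for n, d in digests_by_node.items() if d != majority)


# Hypothetical policy: index 2, 10 data + 4 parity fragments.
stanza = render_ec_policy(2, "ec-10-4", "isa_l_rs_cauchy", 10, 4)
print(stanza)

# Hypothetical fleet: node-c carries a stale copy with different parity.
stale = render_ec_policy(2, "ec-10-4", "isa_l_rs_cauchy", 10, 3)
digests = {
    "node-a": config_digest(stanza),
    "node-b": config_digest(stanza),
    "node-c": config_digest(stale),
}
print(mismatched_nodes(digests))
```

A digest comparison like this only shows that the files differ, not which copy is correct, so treat any mismatch as a stop-the-rollout signal rather than picking the majority automatically.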

Output as: (a) EC-vs-replication decision per workload, (b) chosen scheme with overhead/durability math, (c) ring/zone placement plan, (d) rollout steps with the swift.conf consistency guardrail, (e) a failure-injection validation plan.

Bias toward: matching EC scheme to actual failure domains, keeping small/latency-critical data on replication, and proving reconstruction before trusting durability.
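To sanity-check the overhead and durability math the prompt asks for, a rough Python sketch of the arithmetic follows. It assumes fragments are spread as evenly as possible across zones and that each fragment sits on a distinct disk. The zone-failure figure is a conservative lower bound, not a statement about how any particular ring actually places fragments. The function name `ec_summary` and the example schemes are hypothetical.

```python
import math


def ec_summary(data_frags, parity_frags, zones):
    """Estimate raw overhead and failure tolerance for a k+m EC scheme."""
    if data_frags < 1 or parity_frags < 0 or zones < 1:
        raise ValueError("need data_frags >= 1, parity_frags >= 0, zones >= 1")
    total = data_frags + parity_frags
    # Raw bytes stored per logical byte (3x replication would be 3.0).
    overhead = total / data_frags
    # With an even spread, the fullest zone holds ceil(total / zones) fragments.
    max_per_zone = math.ceil(total / zones)
    # Any data_frags fragments can rebuild the object, so up to parity_frags
    # may be lost. Charging every lost zone at max_per_zone is pessimistic.
    zone_failures = parity_frags // max_per_zone
    return {
        "overhead": overhead,
        "saving_vs_3x": 1 - overhead / 3,
        "disk_failures_survived": parity_frags,
        "max_fragments_per_zone": max_per_zone,
        "zone_failures_survived": zone_failures,
    }


for k, m, z in [(10, 4, 3), (10, 4, 7), (4, 2, 3)]:
    print(f"{k}+{m} over {z} zones:", ec_summary(k, m, z))
```

Under these assumptions a 10+4 scheme over three zones puts up to five fragments in one zone, so losing that zone leaves nine fragments and the object cannot be rebuilt. That is the kind of mismatch between scheme and failure domain the prompt is meant to surface.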
