
Swift Account Reaper & Replication Lag Debug Prompt

Diagnose Swift consistency problems — lagging replication, stuck account-reaper deletions, and dispersion gaps — so deleted accounts actually free space and object durability stays intact.

Target user: Object storage operators troubleshooting Swift consistency
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Swift operator who has run multi-region object clusters and chased down replication and reaper stalls.

I will provide:
- `swift-recon` output (replication, async pendings, quarantine) and `swift-dispersion-report` output
- Ring details (part power, replica count, regions/zones, weights)
- Background daemon config (account-reaper, replicators, auditors intervals/concurrency)
- Symptoms (deleted accounts not freeing space, rising async pendings, stale object copies, dispersion < 100%)
- Recent events (disk failures, ring rebalance, node additions)
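
The inputs above can be collected in one pass before pasting. The following is a minimal sketch, assuming the stock `swift-recon`, `swift-dispersion-report`, and `swift-ring-builder` CLIs are installed on an admin node; the builder path `/etc/swift/object.builder` and the helper names `COMMANDS`, `run`, and `collect` are illustrative assumptions, not part of Swift.

```python
import shlex
import subprocess

# Assumed commands for a typical admin node; adjust paths and flags to your deployment.
COMMANDS = [
    ("Replication", "swift-recon object -r"),
    ("Async pendings", "swift-recon object -a"),
    ("Quarantine", "swift-recon object -q"),
    ("Dispersion", "swift-dispersion-report"),
    ("Object ring", "swift-ring-builder /etc/swift/object.builder"),
]


def run(command):
    """Run one command and return its output, or a note if it could not run."""
    try:
        result = subprocess.run(
            shlex.split(command), capture_output=True, text=True, timeout=120
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"(could not run: {exc})"
    return result.stdout + result.stderr


def collect(commands=COMMANDS, runner=run):
    """Return one text block with a labelled section per command."""
    sections = []
    for title, command in commands:
        output = runner(command).strip()
        sections.append(f"### {title} ($ {command})\n{output}")
    return "\n\n".join(sections)
```

Calling `print(collect())` on an admin node prints one labelled section per command. Daemon config, symptoms, and recent events still need to be added by hand.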

Your job:

1. **Consistency model recap** — briefly frame Swift's eventual consistency: replicators, updaters (async pendings), auditors, and reapers, and which daemon my symptom points to.

2. **Replication lag triage** — interpret `swift-recon -r` (replication times and failures) and `swift-recon -a` (async pending counts); distinguish a transient rebalance backlog from a stuck replicator (rsync errors, locked partitions, slow disks).

3. **Account-reaper analysis** — explain how a DELETEd account is reaped asynchronously by the account-reaper on the account's primary node, why reaping stalls (failed direct requests to container or object servers, failed nodes, a long `delay_reaping`), and how to confirm space will actually be reclaimed.

4. **Dispersion & durability** — read dispersion report gaps, identify partitions with fewer than `replicas` copies, and prioritize handoff replication.

5. **Ring impact** — assess whether a recent rebalance moved too many partitions at once (part power, `min_part_hours`, overload) and is saturating the network.

6. **Tuning** — recommend replicator/reaper `concurrency`, `interval` (`run_pause` on older releases), and rsync limits to drain backlog without starving the cluster.

7. **Validation** — commands and metrics to confirm async pendings trend to zero, dispersion returns to 100%, and reaped accounts free disk.
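
The ring-impact question in step 5 can be put into rough numbers. The sketch below assumes an idealized rebalance that moves only the share of partition replicas needed to fill newly added weight, and that the backlog drains at a steady rate; `rebalance_estimate` and all of its inputs are hypothetical, and real rings can move more data than this.

```python
def rebalance_estimate(part_power, replicas, old_weight, added_weight,
                       stored_bytes, drain_bytes_per_sec):
    """Rough estimate of data movement after adding weight to a ring.

    stored_bytes is the total on-disk usage including all replicas.
    Assumes ideal, minimal movement; treat the result as a lower bound.
    """
    partitions = 2 ** part_power                 # fixed when the ring is created
    part_replicas = partitions * replicas        # total partition-replica assignments
    moved_fraction = added_weight / (old_weight + added_weight)
    moved_part_replicas = round(part_replicas * moved_fraction)
    moved_bytes = stored_bytes * moved_fraction
    drain_hours = moved_bytes / drain_bytes_per_sec / 3600
    return {
        "partitions": partitions,
        "part_replicas": part_replicas,
        "moved_part_replicas": moved_part_replicas,
        "moved_bytes": moved_bytes,
        "drain_hours": drain_hours,
    }


# Example: part power 14, 3 replicas, 10% more weight added,
# 300 TB stored, backlog draining at an assumed 500 MB/s.
estimate = rebalance_estimate(
    part_power=14, replicas=3, old_weight=9000, added_weight=1000,
    stored_bytes=300e12, drain_bytes_per_sec=500e6,
)
print(estimate)
```

With these example inputs, about a tenth of the 49,152 partition replicas move, and the backlog takes roughly 17 hours to drain at the assumed rate. If observed lag is far beyond such an estimate, a stuck daemon is more likely than a normal rebalance backlog.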

Output as: (a) ranked root-cause hypotheses with the confirming recon/log evidence, (b) per-daemon tuning recommendations, (c) reaper-stall recovery steps, (d) dispersion-recovery priority plan, (e) monitoring/alert thresholds for ongoing health.

Bias toward: distinguishing "still catching up" from "stuck", and never forcing ring rebalances that worsen lag.
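
The "still catching up" versus "stuck" distinction can be illustrated with a simple trend check over async pending counts sampled over time. This is a sketch only: `classify_backlog` and its 5% drop threshold are assumptions chosen for illustration, not Swift defaults or `swift-recon` features.

```python
def classify_backlog(samples, min_drop_fraction=0.05):
    """Classify a backlog from (timestamp_seconds, count) samples, oldest first.

    Returns (status, eta_hours), where status is 'drained', 'catching up',
    or 'stuck'. eta_hours is None when no drain rate can be estimated.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, first), (t1, last) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be ordered by increasing timestamp")
    if last == 0:
        return ("drained", 0.0)
    drop = first - last
    # Flat or rising counts, or a drop too small to matter, are treated as stuck.
    if first == 0 or drop < min_drop_fraction * first:
        return ("stuck", None)
    rate_per_sec = drop / (t1 - t0)
    eta_hours = last / rate_per_sec / 3600
    return ("catching up", eta_hours)


# Hourly async pending totals after a rebalance (made-up numbers).
print(classify_backlog([(0, 120000), (3600, 90000), (7200, 60000)]))
print(classify_backlog([(0, 50000), (3600, 50500), (7200, 51000)]))
```

The first series is falling fast enough to estimate a time to drain. The second is flat to rising and warrants a look at updater and replicator logs before any tuning changes.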
