Troubleshooting Swift Object Storage Replication and 503s

I have a soft spot for Swift because it almost never pages me, and then when it does, the cause is rarely where I look first. A flood of 503 Service Unavailable responses usually means one node is sick, but the symptom shows up cluster-wide because the proxy fans every request out to multiple storage nodes. After enough 2 a.m. incidents, I stopped guessing and started reading the ring, the recon data, and the async pendings in a fixed order. This is that order.

Start with the proxy, not the storage nodes

When users report uploads failing, the proxy logs tell you which backend is timing out. The proxy is the only component that talks to clients, so its view is authoritative for “what did the user actually experience.”

# On a proxy node
tail -f /var/log/swift/proxy.log | grep -E ' 503 | 507 | ERROR '
swift-recon --md5

A 507 Insufficient Storage from a backend means a disk is full or unmounted. A 503 accompanied by “ERROR with Object server” in the proxy log usually means a storage node is unreachable or overloaded. The swift-recon --md5 check confirms that every node agrees on the ring file’s checksum. A mismatch here is the single most common cause of mysterious, intermittent failures, because some nodes are still routing to devices that no longer exist in the current ring.
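To see which backend the proxy complains about most, a rough tally like this can help. It is only a sketch: it assumes the proxy's error lines look like "ERROR with Object server <ip>:<port>/<device> re: ...", and count_backend_errors is a name I made up, so adjust the pattern to your own log format.

```shell
# Tally proxy-side backend errors per storage device.
# Assumes lines shaped like:
#   ... ERROR with Object server 10.0.0.3:6200/sdf1 re: Trying to PUT ...
# count_backend_errors is an illustrative helper, not a Swift tool.
count_backend_errors() {
  grep -oE 'ERROR with Object server [^ ]+' |
    awk '{ hits[$NF]++ } END { for (dev in hits) print hits[dev], dev }' |
    sort -rn
}

# Usage: count_backend_errors < /var/log/swift/proxy.log
```

The device at the top of the list is where I look next.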

Read the ring before you trust anything

The ring is the source of truth for data placement. If it is unbalanced or out of sync, every other diagnosis is built on sand.

swift-ring-builder /etc/swift/object.builder
swift-ring-builder /etc/swift/object.builder search --device sdb1

Look at the balance percentage and the dispersion. A balance above a few percent after a rebalance means weights are wrong or you have too few partitions for your device count. I once spent an hour chasing replication lag that was really a ring that had never finished rebalancing after a node was added.
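If you want that balance check as a gate rather than an eyeball test, something like the following may work. It assumes the builder summary contains text such as "0.78 balance, 5.20 dispersion"; balance_exceeds is a hypothetical helper and the limit of 5 in the usage line is just an example.

```shell
# Succeeds (exit 0) when the ring's reported balance exceeds the given limit.
# Assumes the builder summary contains text like "0.78 balance, 5.20 dispersion".
# balance_exceeds is an illustrative helper, not part of Swift.
balance_exceeds() {
  local limit="$1"
  grep -oE '[0-9]+(\.[0-9]+)? balance' |
    awk -v limit="$limit" '
      { found = 1; over = ($1 + 0 > limit + 0) }
      END { exit !(found && over) }'
}

# Usage:
#   swift-ring-builder /etc/swift/object.builder | balance_exceeds 5 && echo "rebalance needed"
```

If no balance figure is found in the input, the helper reports "not exceeded", so treat silence as "check the output by hand".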

Pro Tip: Never edit a ring builder file on one node and copy it manually. Build the ring in one place, then push the identical .ring.gz to every node with config management. A divergent ring is the Swift equivalent of split-brain.
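When swift-recon is not available, a cruder way to spot a divergent ring is to collect one checksum per host and print the odd ones out. This is a sketch with invented names: ring_md5_outliers is my own helper, and node1 through node3 are placeholder hostnames.

```shell
# Read "host checksum" pairs on stdin and print hosts whose checksum
# differs from the most common one. Illustrative helper, not a Swift tool.
ring_md5_outliers() {
  awk '
    { sum[$1] = $2; seen[$2]++ }
    END {
      for (s in seen) if (seen[s] > top) { top = seen[s]; majority = s }
      for (h in sum) if (sum[h] != majority) print h, sum[h]
    }' | sort
}

# Usage (node1..node3 are placeholder hostnames):
#   for h in node1 node2 node3; do
#     printf '%s %s\n' "$h" "$(ssh "$h" md5sum /etc/swift/object.ring.gz | cut -d' ' -f1)"
#   done | ring_md5_outliers
```

Empty output means every host reported the same checksum.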

Find where replication is actually stuck

Swift’s durability comes from background replication. When object replication falls behind, you get stale reads; when container updates fall behind, async pendings climb. Recon aggregates both across the cluster.

swift-recon object --replication
swift-recon --async
swift-dispersion-report   # needs a prior swift-dispersion-populate run

The async pending count is your early warning system. A steadily climbing number means object updates to the container databases are backing up — often a single slow container server or a disk doing constant rebuild work. The dispersion report tells you what percentage of partitions have all their replicas, which is the number I actually quote to stakeholders during an incident.
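Because the trend matters more than any single reading, I find it useful to pull out just the total and sample it over time. The sketch below assumes the summary line looks like "[async_pending] low: 0, high: 12, avg: 3.0, total: 36, ..."; async_total is a hypothetical helper name.

```shell
# Extract the cluster-wide async pending total from swift-recon --async output.
# Assumes a summary line like:
#   [async_pending] low: 0, high: 12, avg: 3.0, total: 36, Failed: 0.0%, ...
# async_total is an illustrative helper, not a Swift tool.
async_total() {
  sed -n 's/.*\[async_pending\].*total: \([0-9][0-9]*\).*/\1/p'
}

# Usage: sample once a minute and watch the trend.
#   while sleep 60; do
#     printf '%s %s\n' "$(date +%T)" "$(swift-recon --async | async_total)"
#   done
```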

Pin down the bad disk

Most Swift incidents trace to one disk. The audit and replication daemons log loudly when a device misbehaves, but the fastest signal is mount state and I/O wait.

swift-recon --diskusage --verbose
# On the suspect storage node
mount | grep /srv/node
grep -i 'unmounted\|not mounted\|errno' /var/log/swift/object.log

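A disk whose filesystem went read-only after an I/O error will still serve reads but reject every write. XFS more often shuts the filesystem down outright, so all I/O fails even though the device still appears mounted. Either way you get exactly this intermittent 503 pattern. Unmount the device cleanly so the object server answers 507 for it, and the proxy will fall back to handoff nodes while replication rebuilds the missing replicas on healthy devices.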
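A quick way to check mount state is to filter the mount table for Swift devices carrying the ro option. This sketch assumes devices live under /srv/node as in the commands above; readonly_swift_mounts is a made-up helper. A filesystem that has shut down may still be listed as rw, so an empty result is not a clean bill of health.

```shell
# Print Swift device mount points that are mounted read-only.
# Expects /proc/mounts-style input: device mountpoint fstype options ...
# Assumes devices live under /srv/node; illustrative helper, not a Swift tool.
readonly_swift_mounts() {
  awk '$2 ~ /^\/srv\/node\// && $4 ~ /(^|,)ro(,|$)/ { print $2 }'
}

# Usage: readonly_swift_mounts < /proc/mounts
```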

Where an AI assistant earns its keep

Swift error logs are dense and repetitive, and that is precisely the kind of text an LLM is good at compressing. I treat the assistant like a fast junior engineer: I paste a few hundred lines of object.log and proxy.log and ask it to cluster the errors by node and by error class, then propose which device is the common factor. It is genuinely good at spotting that every failing request touched node3/sdf1 when my eyes had glazed over.

What it does not get is my cluster’s topology, my ring weights, or production credentials. I give it sanitized logs and ring output, never a working swift-ring-builder file with real device serials, and never an admin token. The model proposes; I verify against swift-recon and run the destructive commands myself. If you want a head start, the prompt library has log-triage prompts, and the object-storage triage pack bundles the Swift-specific ones I actually use on call.

# The kind of sanitized snippet I hand the model
swift-recon object --replication | sed 's/10\.0\.[0-9]*\.[0-9]*/REDACTED/g'
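Blanket redaction hides which node is which, and that is the very thing I want the model to cluster on. A variant that may serve better swaps each distinct address for a stable alias. It is a sketch: pseudonymize_ips and the hostN aliases are my own invention, and it only handles IPv4.

```shell
# Replace every distinct IPv4 address with a stable alias (host1, host2, ...)
# so errors can still be grouped by node without exposing real addresses.
# pseudonymize_ips is an illustrative helper; IPv4 only.
pseudonymize_ips() {
  awk '{
    out = ""; line = $0
    while (match(line, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)) {
      ip = substr(line, RSTART, RLENGTH)
      if (!(ip in alias)) alias[ip] = "host" (++n)
      out = out substr(line, 1, RSTART - 1) alias[ip]
      line = substr(line, RSTART + RLENGTH)
    }
    print out line
  }'
}

# Usage: pseudonymize_ips < /var/log/swift/object.log > object.sanitized.log
```

Aliases are assigned in order of first appearance, so sanitize all the logs for one incident in a single pass to keep the mapping consistent.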

A tool like Claude is reliable for “summarize these 500 log lines and rank the suspect nodes,” but I never let it conclude “delete this partition.” Deletion in Swift is a ring-and-replication decision, and the AI cannot see the replica math.

Containers, accounts, and the listing layer

Object failures get the headlines, but a surprising share of Swift pain lives in the container and account layers — the databases that track what objects exist. When listings are slow or inconsistent, you are looking at SQLite databases on the container servers, not at object disks.

swift-recon container --replication
swift-recon account --replication
swift-get-nodes /etc/swift/container.ring.gz <account> <container>

swift-get-nodes is the tool I reach for most during a “this container lists wrong on different requests” incident — it tells you exactly which nodes and partitions hold the container database, so you can compare them and find the replica that drifted. A container database that fell behind replication shows different object counts depending on which replica answered, which is maddening until you realize it is a database-replication problem and not an object problem at all. Once you know the lagging node, you let container replication catch it up the same way you do for objects.
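To compare replicas, I read the object count from each container server's HEAD response. The helper below is a sketch: container_object_count is a name I made up, and it assumes the count arrives in the X-Container-Object-Count header.

```shell
# Pull the object count out of the response headers of a container HEAD.
# Header names are matched case-insensitively; trailing CRs are stripped.
# container_object_count is an illustrative helper, not a Swift tool.
container_object_count() {
  awk -F': ' '
    { sub(/\r$/, "") }
    tolower($1) == "x-container-object-count" { print $2 }'
}

# Usage: run the HEAD curl commands that swift-get-nodes prints, one per
# replica, and pipe each through the helper:
#   curl -s -g -I "http://<ip>:<port>/<device>/<partition>/<account>/<container>" | container_object_count
```

A replica whose count stays lower than its peers is the one to watch while container replication catches up.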

Capacity and the 95% cliff

Swift behaves well until disks approach full, and then it falls off a cliff because the system reserves headroom for replication and rebuilds. A node that crosses its fullness threshold starts returning 507, which the proxy translates into the 503s users see. Watching disk usage is therefore not housekeeping — it is outage prevention.

swift-recon --diskusage
swift-recon --diskusage | grep -E 'high|full|9[0-9]\.'

When any device passes roughly 90%, plan to add capacity or rebalance weight off it, because the last few percent are not usable headroom — they are the safety margin Swift needs to heal. I treat 85% as the “order more disk” line and 90% as the “page someone” line, and that discipline has kept me out of the 507 spiral more than once.
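Those two lines are easy to encode. The sketch below reads df -P output and labels each mount against the 85% and 90% thresholds; classify_disk_usage and the ORDER and PAGE labels are my own naming.

```shell
# Label each filesystem in `df -P` output against the 85% / 90% lines.
# Prints "PAGE <mount> <pct>%" at 90% or more, "ORDER <mount> <pct>%" at 85% or more.
# classify_disk_usage is an illustrative helper, not a Swift tool.
classify_disk_usage() {
  awk 'NR > 1 {
    pct = $5; sub(/%$/, "", pct); pct += 0
    if (pct >= 90)      print "PAGE", $6, pct "%"
    else if (pct >= 85) print "ORDER", $6, pct "%"
  }'
}

# Usage: df -P /srv/node/* | classify_disk_usage
```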

Recover without making it worse

Once you have the culprit, recovery is deliberate. The temptation is to force replication; the discipline is to let Swift heal at its own rate so you do not saturate the network.

swift-init object-replicator restart
swift-recon object --replication   # watch the lag drain

If a device is genuinely dead and its partitions are missing replicas, drop the device weight to zero, rebalance, and distribute the new ring so Swift rebuilds elsewhere rather than waiting on a dead disk. Resist the urge to bump replicator concurrency mid-incident unless you have spare network headroom; I have turned a recoverable slowdown into a full outage by flooding the replication network.
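Since I run the destructive commands myself, I like to print the plan before executing anything. The sketch below only echoes the two swift-ring-builder calls. drain_device_plan is a hypothetical helper, and d12 in the usage line is a made-up device id; use whatever search value matches your dead device.

```shell
# Print (do not run) the ring commands that drain a device.
# $1 = builder file, $2 = swift-ring-builder search value for the device.
# drain_device_plan is an illustrative helper, not a Swift tool.
drain_device_plan() {
  local builder="$1" search="$2"
  printf 'swift-ring-builder %s set_weight %s 0\n' "$builder" "$search"
  printf 'swift-ring-builder %s rebalance\n' "$builder"
}

# Usage (d12 is a made-up device id):
#   drain_device_plan /etc/swift/object.builder d12
```

Review the output, run the commands on the node where you build rings, and push the resulting .ring.gz everywhere. A rebalance may move fewer partitions than expected if min_part_hours has not yet elapsed since the last one.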

Conclusion

Swift fails quietly and recovers slowly, which makes a disciplined order of operations more valuable than raw speed. Read the proxy, trust the ring, measure replication, isolate the disk, then heal. An AI assistant is a superb log compressor and pattern-spotter in that loop — a fast junior who never tires of reading object.log — but the ring math and the destructive commands stay with you. Keep prod creds out of the prompt, verify every suggestion against swift-recon, and Swift will keep being the boring service it is supposed to be. More OpenStack playbooks live under the OpenStack category.