Scaling the Swift Proxy Tier With Memcache and AI

The 503 storm that taught me the most started with a single tenant uploading millions of small objects into one container. The proxy tier looked busy, the obvious fix was to relax ratelimit, and that “fix” promptly moved the overload onto the container servers and turned a slow cluster into a down one. Swift’s proxy tier is a careful balance of caching, rate limiting, and backend headroom, and the failure modes love to disguise themselves as each other. Here’s how I tune it now, and where AI genuinely speeds up the diagnosis without being allowed to drive.

The Proxy Pipeline Is the Whole Game

A Swift proxy isn’t a monolith; it’s a WSGI pipeline of middlewares, and the order of that pipeline changes behavior. Pull yours and read it top to bottom:

grep -A2 '^\[pipeline:main\]' /etc/swift/proxy-server.conf

You’ll see something like catch_errors gatekeeper healthcheck cache ratelimit authtoken keystoneauth ... proxy-server. Where ratelimit sits relative to authtoken decides whether you rate-limit anonymous probes or only authenticated requests. Where cache sits decides what gets memoized. A pipeline that looks fine but has these in the wrong order produces symptoms that mimic a sizing problem, which is how operators end up buying hardware to fix a config bug. The openstack category has the related Swift playbooks.
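To make the ordering easier to eyeball, here is a minimal sketch that prints where the middlewares discussed above sit in a pipeline string. The function names are hypothetical, and it assumes the pipeline value has been pulled out as one space-separated line. The second helper checks one ordering I am confident about: ratelimit keeps its counters in memcache, so cache has to come before it.

```shell
# Print the 1-based position of the middlewares discussed above.
# $1: the pipeline value as one space-separated string (assumed input shape).
pipeline_positions() {
  printf '%s\n' "$1" | awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^(cache|ratelimit|authtoken|keystoneauth|proxy-server)$/)
        print i ": " $i
  }'
}

# Succeed only if both cache and ratelimit are present and cache comes first.
cache_precedes_ratelimit() {
  printf '%s\n' "$1" | awk '{
    for (i = 1; i <= NF; i++) pos[$i] = i
    exit !(("cache" in pos) && ("ratelimit" in pos) && pos["cache"] < pos["ratelimit"])
  }'
}

# Example (read-only; adjust the path to your deployment):
#   pipeline_positions "$(sed -n 's/^pipeline *= *//p' /etc/swift/proxy-server.conf)"
```

This only reports positions; whether a given order is right for your auth setup is still a judgment you make by reading the output.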

Memcache Is Load-Bearing, Not Optional

This is the part newcomers underestimate: Swift’s proxy uses memcache to store container and account existence info, token validation results, and ratelimit counters. Every cache miss for container info sends the proxy back to the container layer to ask “does this container exist, and what is its metadata?” On a hot container, that’s a stampede.

Check whether memcache is actually keeping up:

printf 'stats\nquit\n' | nc <memcache-host> 11211 | grep -E 'evictions|get_hits|get_misses|curr_connections'

High evictions or a poor hit ratio means your container/account info is being pushed out and the backend is paying for it on every request. The fixes are more memcache RAM, more memcache nodes, or a larger connection pool in memcache.conf — but you have to confirm evictions are the problem first.
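The hit ratio is just get_hits / (get_hits + get_misses), so it is easy to compute yourself before asking a model to do it. The sketch below reads memcached `stats` output on stdin; the function name is hypothetical, and what counts as a "poor" ratio is left to you because it depends on your workload.

```shell
# Compute the get hit ratio and report evictions from memcached "stats" output.
memcache_hit_ratio() {
  awk '
    { sub(/\r$/, "") }                        # memcached ends lines with CRLF
    $1 == "STAT" && $2 == "get_hits"   { hits = $3 }
    $1 == "STAT" && $2 == "get_misses" { misses = $3 }
    $1 == "STAT" && $2 == "evictions"  { evictions = $3 }
    END {
      total = hits + misses
      if (total == 0) { print "hit_ratio=n/a evictions=" (evictions + 0); exit }
      printf "hit_ratio=%.1f%% evictions=%d\n", 100 * hits / total, evictions
    }
  '
}

# Example (read-only; run once per memcache node and compare):
#   printf 'stats\nquit\n' | nc <memcache-host> 11211 | memcache_hit_ratio
```

These counters are cumulative since the memcached process started, so two samples taken a few minutes apart tell you more about current behavior than a single reading.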

Prompt: “Here are memcached stats from my 3 proxy-local memcache instances and my proxy-server.conf cache/ratelimit sections. Tell me whether evictions or low hit ratio are likely forcing container-info lookups to the backend, compute the hit ratio per node, and list the specific config values I’d change. Flag if any value would just hide a saturated backend. Do not give me commands to run against the live cluster.”

Output: It computed a 71% hit ratio on one node versus 96% on the others, traced the outlier to a much smaller connect_timeout and connection pool, and recommended pool and RAM changes, while explicitly warning that raising ratelimit thresholds at the same time could mask a backend that was already near saturation.

That eviction-and-hit-ratio cross-read is exactly the kind of fast junior-engineer work AI is great at. The warning it surfaced — don’t loosen ratelimit to hide a saturated backend — is the lesson my original 503 storm taught me the hard way. I still verify the backend headroom claim with swift-recon before acting.

Ratelimit: A Safety Valve, Not a Performance Knob

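Swift’s ratelimit middleware throttles requests per account and per container to protect the backend. It first delays requests, and once the required delay exceeds max_sleep_time_seconds it rejects them outright (with a 498 by default, so check which status codes your logs and clients actually show). The temptation during an incident is to raise the limits so the 503s stop. Sometimes that’s right, if the backend has headroom. Often it’s catastrophic, because the 503s were the ratelimit doing its job, and removing it lets the hot container crush the container servers.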

Before touching a single threshold, attribute the 503s:

grep ' 503 ' /var/log/swift/proxy.log | tail -50
swift-recon --async    # async pendings = backend falling behind
swift-recon -r -d      # replication + disk

If async pendings are climbing, your container/object servers are already behind, and loosening ratelimit will make it worse. If async pendings are flat and the 503s are pure ratelimit responses on one account, then raising that account’s limit is reasonable.
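A tail of raw log lines gets hard to read quickly, so a per-account tally helps with the attribution. This is a rough sketch with a hypothetical function name, and it rests on an assumption about the log layout: each line contains a request path starting with /v1/, followed by the protocol and then the status code. Check that against your own proxy log format before trusting the numbers. It counts 498 alongside 503 because 498 is the ratelimit middleware’s default rejection status.

```shell
# Tally 498 and 503 responses per account from proxy log lines on stdin.
# Assumed layout: "... <path starting with /v1/> <protocol> <status> ..."
tally_errors_by_account() {
  awk '
    {
      for (i = 1; i <= NF - 2; i++) {
        if ($i ~ /^\/v1\//) {
          status = $(i + 2)                  # path, protocol, then status
          if (status == "498" || status == "503") {
            split($i, parts, "/")            # "", "v1", account, container, ...
            count[status " " parts[3]]++
          }
          break
        }
      }
    }
    END { for (k in count) print k, count[k] }
  ' | sort
}

# Example (read-only):
#   tally_errors_by_account < /var/log/swift/proxy.log
```

If the errors pile up on one account while async pendings stay flat, that supports the “ratelimit doing its job” reading; errors spread across many accounts point back at the backend.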

Pro Tip: Make the AI separate “503s from ratelimit (intended)” from “503s from backend overload (not intended)” before it recommends anything. A model that treats all 503s as a single number will cheerfully tell you to raise limits and hand you an outage.

Sizing the Proxy Tier

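Proxies are mostly CPU-bound (TLS, erasure-code encoding and decoding, hashing) and network-bound. The honest way to size is to load one proxy until it saturates, find the bottleneck, and scale horizontally behind your load balancer. Track the proxy’s own CPU and network with your usual host metrics, and watch the storage nodes behind it under real traffic, since swift-recon reports on the storage nodes in the ring rather than on the proxy itself: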

swift-recon --diskusage --loadstats

When I’m capacity-planning, I’ll hand the per-node load and request-rate data to Claude and ask it to project how many proxies I need to hold a target request rate with headroom for an EC-read spike. That projection is a useful starting estimate — and it is only an estimate. I validate it with an actual load test before committing hardware, because the model doesn’t know my object-size distribution or TLS overhead unless I measure and tell it. Reusable capacity prompts live in the prompt workspace.
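The arithmetic behind that projection is simple enough to sanity-check by hand: divide the target rate, padded by a headroom percentage, by the rate one proxy sustained in your load test, and round up. The function below is a sketch of that calculation only; its name and the numbers in the example are hypothetical, not measurements.

```shell
# Estimate proxy count: ceil(target * (1 + headroom%) / per-proxy rate).
# $1: target requests/sec, $2: measured requests/sec one proxy sustains,
# $3: headroom as a whole-number percent. Integers only.
proxies_needed() {
  echo $(( ($1 * (100 + $3) + $2 * 100 - 1) / ($2 * 100) ))
}

# Example with made-up numbers: 12000 req/s target, 1500 req/s per proxy,
# 30 percent headroom.
#   proxies_needed 12000 1500 30
```

Consider adding one node on top of the result so the tier still holds the target while a single proxy is out of the pool for a rolling change.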

Rolling Changes Safely

The cardinal rule: change one proxy node at a time. Apply your memcache and pipeline tuning to a single node, leave it in the pool, and compare its 503 rate and latency against the unchanged peers under the same traffic. If it’s better, roll it; if it’s worse, you learned that on one node instead of the whole tier.

# after changing one node, compare its share of 503s to peers
grep ' 503 ' /var/log/swift/proxy.log | grep -cF '<tuned-node-ip>'

This A/B-against-your-own-tier approach is what makes tuning defensible. The AI can summarize the comparison across nodes for you, but the decision to roll is yours, made against real numbers.
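A raw 503 count only means something relative to how many requests the node served, so for the comparison it helps to turn counts into a rate. The sketch below assumes a combined log on stdin in which each line carries some string that identifies the proxy node (its IP or hostname, for example); the function name is hypothetical, and how your lines identify the node depends on your logging setup.

```shell
# Report the 503 rate for log lines that mention a given node identifier.
# $1: identifying string, matched as a plain substring (so "10.0.0.1" would
# also match "10.0.0.11"; pick an unambiguous identifier).
node_503_rate() {
  awk -v node="$1" '
    index($0, node) { total++; if ($0 ~ / 503 /) errors++ }
    END {
      if (total == 0) { print node, "no requests"; exit }
      printf "%s 503s=%d total=%d rate=%.2f%%\n", node, errors, total, 100 * errors / total
    }
  '
}

# Example (read-only; run for the tuned node and for one or more peers):
#   node_503_rate '<tuned-node-ip>' < /var/log/swift/proxy.log
```

Compare rates over the same time window, since the tuned node and its peers need to have seen the same traffic mix for the difference to mean anything.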

Conclusion

Scaling a Swift proxy tier is less about adding nodes and more about understanding the loop between memcache hit ratios, ratelimit, and backend headroom. The disasters come from grabbing the ratelimit knob during an incident without attributing the 503s first. AI is genuinely fast at the reading: cross-tabulating memcache stats, separating intended from unintended 503s, projecting proxy counts. Every one of those is a summary you verify with swift-recon and a single-node A/B before you act. Keep the model reading and your changes incremental, and the proxy tier stays boring — which, for object storage, is exactly what you want. More Swift prompts live in the prompts library.