Cache Stampede and Thundering-Herd Mitigation Prompt
Diagnose a live incident where a cache miss, flush, or restart is hammering the origin with a thundering herd, and pick the fastest safe mitigation to protect the backend without dropping all traffic.
- Target user: On-call engineers facing origin overload after a cache failure
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a seasoned SRE who recognizes a thundering herd on sight: a cache layer just lost its warm state (eviction storm, flush, node restart, or mass TTL expiry), and every request is now stampeding straight to an origin that can't possibly serve them all.
I will paste the symptoms: origin latency/error spike, cache hit-rate drop, what changed at the cache (deploy, flush, failover, config), request rate, and the topology (CDN, edge cache, app cache, database).
Your job:
1. **Confirm the herd**: from hit-rate, origin load, and timing, confirm this is a stampede (correlated mass-miss) rather than a genuine traffic surge or an origin fault, since the mitigations differ.
2. **Locate the cold layer**: identify which cache tier lost its warmth and why (synchronized TTL expiry, flush, cold restart, sharding change), because that determines the fix.
3. **Mitigation menu**: propose ordered options to protect the origin while re-warming: request coalescing / single-flight, serving stale-while-revalidate, jittering TTLs, rate-limiting origin fan-out, temporarily raising TTL, or shedding low-value traffic. For each: how fast it takes effect, the user-visible tradeoff, and the rollback.
4. **Re-warm plan**: how to repopulate the cache in a controlled way (gradual, by priority key) so we don't simply re-trigger the herd when protection is lifted.
5. **Pick the first move**: recommend the single highest-leverage mitigation to apply now, with the exact config or flag, and mark confidence.
6. **Recovery signal**: the metric (hit-rate recovered, origin latency normal, error rate clear) and threshold that means it's safe to remove the emergency protections, and the order to remove them in.
Output as: (a) herd confirmation and the cold layer, (b) the ordered mitigation menu with tradeoffs, (c) the recommended first move with config, (d) the controlled re-warm and recovery signal.
Propose; the human applies the config.
Don't recommend flushing more cache or restarting the cache layer as a fix — that usually deepens a stampede. If you're unsure whether it's a herd or a real surge, say so and give the read-only check first.
Why this prompt works
A cache stampede is one of those incidents where the intuitive fixes (flush the cache, restart the node, “clear it and let it rebuild”) are precisely the actions that prolong the outage. The origin is already overloaded because the cache went cold; anything that makes more requests miss adds to that load. This prompt encodes that hard-won lesson by explicitly telling the AI not to recommend cache flushes or restarts as a fix and steering it toward protections that shield the origin while the cache re-warms.
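As one illustration of a protection that shields the origin, here is a minimal single-flight (request coalescing) sketch in Python. The `SingleFlight` class and its `do` method are hypothetical names chosen for this example, not an API from any particular cache library; the sketch assumes a threaded application where many requests miss on the same key at once.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """Tracks one in-flight origin load shared by all waiters on a key."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    """Hypothetical helper: coalesce concurrent loads of one key into one origin call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, load: Callable[[], Any]) -> Any:
        # Decide under the lock whether this caller leads the load or follows it.
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call

        if leader:
            try:
                call.result = load()  # the only origin request for this key
            except BaseException as exc:
                # Capture every failure so followers are always released.
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()  # followers block until the leader finishes

        if call.error is not None:
            raise call.error
        return call.result
```

With this shape, a burst of misses on one hot key produces one origin request instead of one per caller. A failed load is reported to every waiter, and the next caller after that starts a fresh load rather than reusing the failure.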
The diagnostic discipline matters because a thundering herd and a genuine traffic surge look similar on a latency graph but demand opposite responses: a surge typically calls for more origin capacity, while a herd calls for shielding the origin until the cache is warm again. By forcing confirmation from hit-rate and timing before recommending a mitigation, the prompt keeps on-call from scaling out the origin when the real problem is a synchronized TTL expiry that jitter would have prevented.
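The herd-versus-surge check can be sketched as a small read-only heuristic. Everything here is an assumption made for illustration: the `Snapshot` fields, the `classify` function, and the threshold values (`surge_ratio`, `hit_drop`) are hypothetical and would need tuning against a real baseline.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    request_rate: float  # requests/s arriving at the cache tier
    hit_rate: float      # fraction of requests served from cache, 0..1


def classify(baseline: Snapshot, now: Snapshot,
             surge_ratio: float = 1.5, hit_drop: float = 0.2) -> str:
    """Rough label for an incident; thresholds are illustrative assumptions."""
    traffic_up = now.request_rate >= baseline.request_rate * surge_ratio
    hits_collapsed = (baseline.hit_rate - now.hit_rate) >= hit_drop

    if hits_collapsed and not traffic_up:
        return "stampede"  # same traffic, far more misses: cache went cold
    if traffic_up and not hits_collapsed:
        return "surge"     # more traffic, cache still absorbing its share
    if traffic_up and hits_collapsed:
        return "mixed"     # both at once: protect the origin, then re-check
    return "unclear"       # neither signal: look for an origin fault instead
```

For example, a baseline of 1,000 requests/s at a 95% hit rate against a current 1,050 requests/s at 30% is labeled a stampede: inbound traffic is nearly flat while the share of misses has jumped. Timing still matters; the label is more convincing when the hit-rate drop lines up with a known cache change.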
The controlled re-warm and staged recovery steps are what make the mitigation actually stick. Lifting single-flight or stale-serving protection all at once just re-triggers the stampede, so the prompt insists on a gradual, metric-gated removal. Throughout, the AI proposes config and the human applies it — keeping the fast, mechanical analysis automated while the production changes stay under human control.
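A metric-gated, one-at-a-time removal might look like the sketch below. The function names, the `(hit_rate, origin_p99_ms)` sample shape, and the default thresholds are all hypothetical; the code only proposes the next protection to lift and leaves the change itself to the human.

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (hit_rate 0..1, origin p99 latency in ms)


def healthy_streak(samples: Sequence[Sample],
                   min_hit_rate: float, max_p99_ms: float) -> int:
    """Count consecutive healthy samples, starting from the most recent."""
    streak = 0
    for hit_rate, p99_ms in reversed(samples):
        if hit_rate >= min_hit_rate and p99_ms <= max_p99_ms:
            streak += 1
        else:
            break
    return streak


def next_protection_to_remove(active: List[str],
                              samples: Sequence[Sample],
                              *,
                              min_hit_rate: float = 0.9,
                              max_p99_ms: float = 300.0,
                              required_streak: int = 5) -> Optional[str]:
    """Propose at most one protection to lift; None means keep everything.

    `active` is assumed to be ordered by intended removal order, first item
    removed first. Defaults are illustrative, not recommended values.
    """
    if not active:
        return None
    if healthy_streak(samples, min_hit_rate, max_p99_ms) < required_streak:
        return None
    return active[0]
```

Collecting a fresh set of samples after each removal means every protection has to earn its own healthy streak, which is what keeps the rollback gradual instead of all at once.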
Related prompts
- Graceful Degradation and Degraded-Mode Playbook Prompt
  Design degraded-mode playbooks that keep core functionality alive when a dependency fails: feature flags to shed, fallbacks to serve, and explicit triggers for entering and exiting reduced service.
- Noisy-Neighbor and Resource Contention Diagnosis Prompt
  Diagnose incidents where a service degrades not from its own bug but from resource contention: a noisy neighbor, CPU/IO/connection-pool exhaustion, or a shared-tenancy hotspot starving everyone else on the node or cluster.