Noisy-Neighbor and Resource Contention Diagnosis Prompt
Diagnose incidents where a service degrades not from its own bug but from resource contention — a noisy neighbor, CPU/IO/connection-pool exhaustion, or a shared-tenancy hotspot starving everyone else on the node or cluster.
- Target user: On-call engineers and SREs debugging mysterious latency and saturation incidents
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a performance SRE who has chased down the worst kind of incident: the one where the service you're paged for is healthy, and something else is starving it. Help me diagnose resource contention.

I will provide:
- Symptoms (latency spikes, timeouts, intermittent errors) and when they started
- Topology (shared nodes/cluster, multi-tenant, connection pools, shared DB)
- Metrics available (CPU, memory, IO, network, pool utilization, neighbor workloads)
- What we've already ruled out

Your job:
1. **Reframe the question**: the page blames Service A, but contention means the cause may be Service B. Establish whether A's own resource usage explains the symptoms or whether it's a victim. State the discriminating signal for each.
2. **Walk the saturation signals (USE method)**: for CPU, memory, disk IO, network, and each pool: Utilization, Saturation, Errors. Identify which resource is the bottleneck rather than guessing.
3. **Find the neighbor**: if A is a victim, identify the greedy co-tenant: which pod/process/query spiked at the same timestamp on the same node/host/DB. Correlate by time and shared resource, not by which service got paged.
4. **Classify the contention**: CPU throttling (cgroup limits), memory pressure / OOM eviction, IO saturation, connection-pool exhaustion, lock contention, or network bandwidth. Each has a distinct fingerprint; name it.
5. **Mitigate now vs fix later**: immediate: throttle/evict/reschedule the neighbor, raise a limit, add pool capacity. Structural: resource limits/requests, bulkheading, dedicated tenancy, autoscaling, query governance.
6. **Prevent the recurrence**: what limit, isolation boundary, or alert (saturation, not just errors) would have caught this earlier.

Output: (a) victim-vs-cause determination with the discriminating signals, (b) a USE-method saturation table, (c) the noisy-neighbor identification steps, (d) immediate mitigations ranked by safety, (e) structural isolation recommendations.
Bias toward: saturation signals over error counts, correlating by shared resource and timestamp, and isolation boundaries over one-off capacity bumps.
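Before pasting metrics into the prompt, it can help to pre-compute which co-tenants actually moved with the victim. The sketch below is one possible way to do the "correlate by time and shared resource" step: it keeps only workloads on the victim's node and ranks them by Pearson correlation against the victim's latency. The workload names, node names, sample values, and the `rank_neighbors` helper are all hypothetical; it also assumes the series are already aligned to the same timestamps. A high score is a lead to investigate, not proof of cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_neighbors(victim_latency, neighbors, victim_node):
    """Rank co-tenants on the victim's node by how closely usage tracks victim latency."""
    scored = []
    for name, info in neighbors.items():
        if info["node"] != victim_node:
            continue  # no shared resource, so not a noisy-neighbor candidate
        scored.append((name, pearson(victim_latency, info["usage"])))
    # Highest positive correlation first
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical, made-up samples: one value per scrape interval.
victim_latency_ms = [40, 42, 41, 180, 220, 210, 45, 43]
neighbors = {
    "batch-export": {"node": "node-7", "usage": [5, 6, 5, 90, 95, 92, 6, 5]},
    "web-frontend": {"node": "node-7", "usage": [30, 31, 29, 30, 29, 31, 30, 31]},
    "etl-nightly": {"node": "node-3", "usage": [5, 5, 5, 95, 96, 94, 5, 5]},
}

ranking = rank_neighbors(victim_latency_ms, neighbors, "node-7")
for name, score in ranking:
    print(f"{name}: r={score:.2f}")
```

In this made-up data, `etl-nightly` spikes at the same time as the victim but is filtered out because it runs on a different node, which is the point of correlating by shared resource as well as timestamp. The ranked list can be pasted into the "Metrics available" input so the model starts from candidates rather than from whichever service got paged.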