
Noisy-Neighbor and Resource Contention Diagnosis Prompt

Diagnose incidents where a service degrades not from its own bug but from resource contention — a noisy neighbor, CPU/IO/connection-pool exhaustion, or a shared-tenancy hotspot starving everyone else on the node or cluster.

Target user: On-call engineers and SREs debugging mysterious latency and saturation incidents
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a performance SRE who has chased down the worst kind of incident: the one where the service you're paged for is healthy, and something else is starving it. Help me diagnose resource contention.

I will provide:
- Symptoms (latency spikes, timeouts, intermittent errors) and when they started
- Topology (shared nodes/cluster, multi-tenant, connection pools, shared DB)
- Metrics available (CPU, memory, IO, network, pool utilization, neighbor workloads)
- What we've already ruled out

Your job:

1. **Reframe the question** — the page blames Service A, but contention means the cause may be Service B. Establish whether A's own resource usage explains the symptoms or whether it's a victim. State the discriminating signal for each.

2. **Walk the saturation signals (USE method)** — for CPU, memory, disk IO, network, and each pool: Utilization, Saturation, Errors. Identify which resource is the bottleneck rather than guessing.

3. **Find the neighbor** — if A is a victim, identify the greedy co-tenant: which pod/process/query spiked at the same timestamp on the same node/host/DB. Correlate by time and shared resource, not by which service got paged.

4. **Classify the contention** as CPU throttling (cgroup limits), memory pressure (OOM kills or node-pressure eviction), IO saturation, connection-pool exhaustion, lock contention, or network bandwidth saturation. Each has a distinct fingerprint; name it.

5. **Mitigate now vs fix later** — immediate: throttle/evict/reschedule the neighbor, raise a limit, add pool capacity. Structural: resource limits/requests, bulkheading, dedicated tenancy, autoscaling, query governance.

6. **Prevent the recurrence** — what limit, isolation boundary, or alert (saturation, not just errors) would have caught this earlier.

Output: (a) victim-vs-cause determination with the discriminating signals, (b) a USE-method saturation table, (c) the noisy-neighbor identification steps, (d) immediate mitigations ranked by safety, (e) structural isolation recommendations.

Bias toward: saturation signals over error counts, correlating by shared resource and timestamp, and isolation boundaries over one-off capacity bumps.
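
To make the "correlate by time and shared resource" step concrete, here is a minimal Python sketch that ranks co-tenants on one node by how closely their resource usage tracks the victim's latency. Everything in it is an assumption for illustration: the workload names, the sample values, and the helpers `pearson` and `rank_neighbors` are hypothetical, and the series are assumed to be pre-aligned to the same timestamps. A high correlation only nominates a candidate; confirm that the neighbor actually shares the saturated resource before acting.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series cannot explain a spike
    return cov / sqrt(var_x * var_y)


def rank_neighbors(victim_latency, neighbor_usage):
    """Return (name, correlation) pairs, strongest positive correlation first."""
    scores = {
        name: pearson(victim_latency, series)
        for name, series in neighbor_usage.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-minute samples from a single shared node.
victim_p99_ms = [40, 42, 41, 180, 220, 210, 45, 43]
neighbor_disk_mbps = {
    "batch-export": [5, 6, 5, 190, 240, 230, 8, 6],
    "web-frontend": [20, 22, 21, 19, 20, 21, 22, 20],
    "cache-warmer": [50, 48, 52, 30, 20, 25, 49, 51],
}

ranking = rank_neighbors(victim_p99_ms, neighbor_disk_mbps)
for name, r in ranking:
    print(f"{name}: r={r:+.2f}")
```

In this made-up data the batch job's disk throughput rises and falls with the victim's p99 latency, so it sorts to the top. Pasting a ranking like this into the prompt alongside the raw metrics gives the model a shared-resource, shared-timestamp signal to reason from instead of the name of the paged service.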
