AI for Incident Response

GameDay Chaos Scenario Design Prompt

Design a safe, hypothesis-driven GameDay or chaos-engineering exercise grounded in your real incident history — with steady-state metrics, fault injections, blast-radius limits, abort criteria, and learning goals.

Target user: Reliability engineers planning chaos experiments and GameDays
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a chaos-engineering practitioner who runs GameDays that teach the team something real without ever causing an actual outage.

I will provide:
- The target system architecture and its critical dependencies
- Recent incidents or known weak points worth probing
- The environment available (prod, staging, isolated cell) and traffic profile
- Team experience level with chaos work

Design a complete GameDay plan:

1. **Pick the hypotheses**: derive 2-4 testable hypotheses from real incidents and suspected weaknesses ("if the primary cache fails, requests degrade gracefully within the latency SLO"). Each must have a clear expected outcome.

2. **Define steady state**: the metrics that prove the system is healthy before, during, and after (latency, error rate, saturation, business KPI). These are your safety gauges.

3. **Scope the blast radius**: start small (one instance, one AZ, shadow traffic, low-traffic window). Specify exactly what is and is not in scope, and prefer the smallest experiment that can falsify the hypothesis.

4. **Design fault injections**: for each hypothesis, specify the fault (latency, error, instance kill, dependency timeout, resource exhaustion, network partition), how to inject it, and its magnitude. Order the injections from least to most disruptive.

5. **Set abort criteria**: explicit thresholds (e.g., error rate > X%, p99 latency > Y) that trigger an immediate, pre-tested rollback. Define who calls the abort and how the injection is reversed.

6. **Run-of-show**: roles (facilitator, operator, observer, scribe), timeline, comms plan, and a stakeholder heads-up so no one mistakes the GameDay for a real incident.

7. **Learning capture**: what to record, the surprises to watch for (alerts that did not fire, runbooks that were wrong), and how findings convert into tracked action items.

Output: the hypothesis list, the safety gauges, the scoped experiments in escalation order, abort criteria, the run-of-show, and a results template. Be conservative on safety and ambitious on learning. Never propose an experiment without a tested abort.
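
Outside the prompt itself, it can help to see how the abort criteria from step 5 might be checked against the safety gauges from step 2 while an experiment runs. The following Python is a minimal sketch under stated assumptions, not part of the prompt: the metric names (`error_rate_pct`, `latency_p99_ms`), the threshold values, and the helpers `breached` and `should_abort` are all hypothetical placeholders. The prompt deliberately leaves X and Y for you to fill in from your own SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortThreshold:
    """One abort criterion: trip when the metric exceeds `limit`."""
    metric: str   # hypothetical metric name, e.g. "error_rate_pct"
    limit: float  # placeholder value; derive real limits from your SLOs


def breached(thresholds, sample):
    """Return the names of metrics in `sample` that breach their limit.

    Assumption: a metric missing from `sample` counts as breached,
    because an unreadable safety gauge is treated as a reason to abort.
    """
    tripped = []
    for threshold in thresholds:
        value = sample.get(threshold.metric)
        if value is None or value > threshold.limit:
            tripped.append(threshold.metric)
    return tripped


def should_abort(thresholds, sample):
    """True if any abort criterion is tripped for this metrics sample."""
    return bool(breached(thresholds, sample))


# Illustrative thresholds only; these numbers are not recommendations.
THRESHOLDS = [
    AbortThreshold("error_rate_pct", 2.0),
    AbortThreshold("latency_p99_ms", 800.0),
]

healthy = {"error_rate_pct": 0.4, "latency_p99_ms": 310.0}
degraded = {"error_rate_pct": 0.4, "latency_p99_ms": 950.0}
blind = {"error_rate_pct": 0.4}  # latency gauge unavailable

if __name__ == "__main__":
    for name, sample in [("healthy", healthy), ("degraded", degraded), ("blind", blind)]:
        print(name, should_abort(THRESHOLDS, sample), breached(THRESHOLDS, sample))
```

Treating a missing metric as a breach is one conservative design choice, in keeping with the prompt's instruction to be conservative on safety. A real GameDay would feed `sample` from your monitoring system and wire `should_abort` to the pre-tested rollback, with the named abort caller still holding final authority.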
