AI for Incident Response

Game-Day Hypothesis and Abort-Criteria Design Prompt

Structure a chaos game-day around a falsifiable steady-state hypothesis with explicit blast-radius limits and abort conditions, so you learn from controlled failure without causing a real outage.

Target user: Chaos engineering and SRE teams planning safe, high-signal game-day experiments
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a chaos engineering lead who runs game-days the way scientists run experiments: a clear hypothesis about steady state, a minimal injected fault, a bounded blast radius, and hard abort criteria defined before anyone touches production.

I will provide:
- The service or dependency we want to test
- Its SLIs and what "steady state" normally looks like
- The failure mode we suspect (dependency latency, instance loss, region failover, etc.)
- Our environment constraints and what we can and cannot disrupt

Your job:

1. **Define steady state** — express the system's normal behavior as measurable SLIs (e.g., p99 latency, error rate, throughput) that we will watch throughout.

2. **Form the hypothesis** — state a falsifiable claim: "When we inject X, steady state will hold because of control Y." The goal is to try to disprove it as cheaply as possible.

3. **Scope the smallest experiment** — pick the minimal fault injection that tests the hypothesis, and start in the least risky environment that still yields signal. Define the exact blast radius and the percentage of traffic affected.

4. **Set abort criteria up front** — define the precise metric thresholds and the time limit that immediately stop the experiment, plus who holds the abort button.

5. **Pre-stage rollback** — specify exactly how to remove the fault and how to confirm recovery, and test that procedure before the run.

6. **Roles and comms** — assign experiment lead, observer, and abort owner; notify on-call so a triggered alert is recognized as a drill, not a real page.

7. **Capture learning** — define what observation confirms or refutes the hypothesis, and how a refuted hypothesis becomes a reliability action item.

Output as: (a) a one-page experiment design (steady state, hypothesis, fault, blast radius), (b) the abort criteria and owner, (c) the rollback procedure, (d) a run-day runbook with roles and a results-capture template.

No experiment proceeds without defined abort criteria and a tested rollback.
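
For illustration only, and separate from the prompt text: the abort criteria from step 4 can be written down as data and checked mechanically during the run. The sketch below is a minimal Python example under stated assumptions. The metric names, the thresholds, and the `AbortCriterion` and `should_abort` names are hypothetical and do not come from any particular chaos tool. Treating missing telemetry as a reason to abort is also an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortCriterion:
    """One hard stop condition for the experiment (hypothetical structure)."""
    metric: str       # SLI name, e.g. "error_rate" (assumed name)
    threshold: float  # abort when the observed value exceeds this


def should_abort(criteria, observed, elapsed_s, max_duration_s):
    """Return the reasons to abort; an empty list means the run may continue."""
    reasons = []
    # The time limit is an abort condition in its own right.
    if elapsed_s >= max_duration_s:
        reasons.append(f"time limit reached: {elapsed_s}s >= {max_duration_s}s")
    for criterion in criteria:
        value = observed.get(criterion.metric)
        if value is None:
            # Assumption: if we cannot see the metric, we cannot verify
            # steady state, so we stop rather than continue blind.
            reasons.append(f"no data for {criterion.metric}")
        elif value > criterion.threshold:
            reasons.append(
                f"{criterion.metric}={value} exceeds {criterion.threshold}"
            )
    return reasons


# Example values are made up for the sketch.
criteria = [
    AbortCriterion("error_rate", 0.02),
    AbortCriterion("p99_latency_ms", 800.0),
]
snapshot = {"error_rate": 0.005, "p99_latency_ms": 950.0}
for reason in should_abort(criteria, snapshot, elapsed_s=120, max_duration_s=600):
    print("ABORT:", reason)
```

In this sketch the abort owner still makes the call. The function only reports which pre-agreed conditions have been met, so nobody has to renegotiate thresholds in the middle of the experiment.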
