Running Gamedays and Chaos Experiments Without Breaking Production
Gamedays and chaos engineering find weaknesses before customers do. A veteran SRE's guide to safe experiments, blast-radius control, and AI-assisted planning.
- #incident-response
- #gameday
- #chaos-engineering
- #sre
- #resilience
- #on-call
The best incident is the one you’ve already rehearsed. Gamedays and chaos experiments are how you find your system’s weak points on a Tuesday afternoon, with everyone awake and prepared, instead of at 3 AM with real customer impact. After 25 years on call, I trust a team that runs gamedays far more than one that’s “never had a problem.” The second team has problems; they just haven’t met them yet.
Here’s how to run these exercises so they teach you something without becoming the very outage you were trying to prevent.
Gameday vs. chaos engineering
They’re related but distinct:
- A gameday is a planned exercise where the team responds to a simulated or injected failure, often to test the human response: do the runbooks work, does paging fire, can the IC coordinate?
- Chaos engineering injects controlled failures (kill a node, add latency, drop a dependency) to test the system’s resilience, often on a recurring or continuous schedule rather than as a one-off event.
You want both. Gamedays sharpen people; chaos experiments sharpen systems. Start with gamedays — they’re lower-risk and immediately reveal whether your incident process even works.
Always start with a hypothesis
Chaos without a hypothesis is just breaking things. Every experiment should state what you expect:
“If we kill one of the three checkout pods, we expect zero customer impact because the load balancer reroutes within 10 seconds.”
Now the experiment is a test. If the hypothesis holds, you’ve gained confidence. If it doesn’t, you’ve found a weakness cheaply. “We just want to see what happens” is how you cause a real incident with no learning to show for it.
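One way to keep yourself honest is to write the hypothesis down as data before anything is injected. The sketch below is a minimal illustration, not a standard format: the `Hypothesis` class, its field names, and the `checkout_error_rate` metric name are all hypothetical, chosen to mirror the checkout example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A falsifiable statement recorded before the experiment starts."""
    action: str              # what we will break
    expectation: str         # what we expect to observe, in plain words
    metric: str              # the metric that decides the outcome
    threshold: float         # maximum acceptable value of that metric
    recovery_seconds: float  # how long the system gets to recover

    def evaluate(self, observed: float, recovered_in: float) -> str:
        """Compare what happened to what was predicted."""
        if observed <= self.threshold and recovered_in <= self.recovery_seconds:
            return "held: confidence gained"
        return "failed: weakness found, file an action item"


# The checkout example, with an assumed metric name.
checkout = Hypothesis(
    action="kill one of the three checkout pods",
    expectation="zero customer impact",
    metric="checkout_error_rate",  # hypothetical metric name
    threshold=0.0,
    recovery_seconds=10.0,
)
```

Calling `checkout.evaluate(observed, recovered_in)` after the run forces a held-or-failed answer. An experiment that cannot be filled into a structure like this is probably still at the “see what happens” stage.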
Control the blast radius — this is the whole game
The discipline that separates chaos engineering from recklessness is blast-radius control:
- Start in staging. Validate the experiment and your safety mechanisms before production is ever in scope.
- Smallest possible scope first. One pod, not the whole service. One availability zone, not the region.
- Have an abort button. Before you inject anything, know exactly how to stop and roll back, and confirm it works. The stop button is more important than the experiment.
- Define steady state up front. Know your normal metrics so you can tell instantly when the experiment is causing real harm and abort.
- Pick low-traffic windows for early production runs, with the team watching live.
A chaos experiment that you can’t immediately stop is not an experiment — it’s an outage you scheduled.
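The steady-state and abort rules above can be sketched as a small guard loop. This is an illustration under assumptions, not a real chaos tool: `inject`, `abort`, and `read_error_rate` are hypothetical callables you would wire to your own tooling, and a single error-rate metric stands in for whatever steady-state signals you actually watch.

```python
import time
from typing import Callable


def run_experiment(
    inject: Callable[[], None],           # hypothetical: starts the failure
    abort: Callable[[], None],            # hypothetical: stops it and rolls back
    read_error_rate: Callable[[], float], # hypothetical: current steady-state metric
    baseline: float,
    tolerance: float,
    duration_s: float,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Inject a failure, watch steady state, and always roll back."""
    limit = baseline + tolerance
    # Refuse to start if the system is already outside steady state.
    if read_error_rate() > limit:
        return "not started: system not in steady state"
    outcome = "completed"
    inject()
    try:
        elapsed = 0.0
        while elapsed < duration_s:
            if read_error_rate() > limit:
                outcome = "aborted: steady state violated"
                break
            sleep(poll_s)
            elapsed += poll_s
    finally:
        # Rollback runs on normal completion, on abort, and on exceptions.
        abort()
    return outcome
```

The design choice worth copying is the `try`/`finally`: the rollback is not a separate step someone has to remember, it is the only way out of the function. Passing `sleep` as a parameter lets you dry-run the loop with fake readings in staging before it goes anywhere near production.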
Run the gameday like a real incident
For a response-focused gameday, treat it as real: page through the normal channel, declare an IC, use the runbooks, post status updates to a test channel. The point is to exercise the whole machine. You’ll discover the alert that doesn’t fire, the runbook with a stale command, the escalation path that dead-ends — and you’ll discover them safely.
Assign someone to observe and take notes. The findings are the deliverable, not the chaos itself.
Turn findings into fixes
A gameday that produces no action items either tested nothing or wasn’t honest about what broke. After every exercise, run a mini-postmortem: what surprised us, what was slower than expected, which runbook failed, what we’ll change. Track those action items like you would for a real incident — owned, scheduled, closed. The gameday’s value is entirely in the fixes it drives.
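“Owned, scheduled, closed” is easy to check mechanically. The sketch below assumes a hypothetical `ActionItem` record and Python 3.9 or later; in practice the same check would more likely be a saved query in whatever ticket tracker you already use.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    finding: str
    owner: Optional[str] = None
    due: Optional[date] = None
    closed: bool = False


def needs_attention(items: list[ActionItem], today: date) -> list[str]:
    """Return one line per open item that is unowned, unscheduled, or overdue."""
    problems = []
    for item in items:
        if item.closed:
            continue
        if item.owner is None:
            problems.append(f"no owner: {item.finding}")
        elif item.due is None:
            problems.append(f"no due date: {item.finding}")
        elif item.due < today:
            problems.append(f"overdue ({item.owner}): {item.finding}")
    return problems
```

An empty list a few weeks after the gameday is a reasonable sign the exercise paid off. A list that never shrinks suggests the findings were recorded but not acted on.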
Where AI helps plan and debrief
AI is genuinely useful around gamedays, mostly in planning and analysis, never as the thing pulling the trigger on production.
Designing experiments. Describe your architecture and ask the model to suggest failure scenarios you might not have considered and the blast-radius controls each one needs:
“Here’s our service architecture. Propose five chaos experiments ordered from lowest to highest risk. For each, state the hypothesis, the smallest viable blast radius, the steady-state metrics to watch, and the abort condition.”
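If you ask the model to return its proposals in a structured form, a small filter can drop any that skip one of the four controls before a human spends time reviewing them. The field names below are hypothetical and simply mirror the prompt above. The check only confirms that each field is present, not that its content is sensible, so it does not replace the human review.

```python
REQUIRED_FIELDS = (
    "hypothesis",
    "blast_radius",
    "steady_state_metrics",
    "abort_condition",
)


def is_blank(value) -> bool:
    """Treat None, empty or whitespace-only strings, and empty lists as blank."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    return len(value) == 0


def missing_fields(experiment: dict) -> list[str]:
    """Return the required fields that are absent or blank in one proposal."""
    return [f for f in REQUIRED_FIELDS if is_blank(experiment.get(f))]


def ready_for_human_review(experiments: list[dict]) -> list[dict]:
    """Keep only proposals that state all four controls. Humans still decide."""
    return [e for e in experiments if not missing_fields(e)]
```

A proposal with no abort condition is not worth discussing, so it is cheaper to reject it mechanically and ask the model to try again.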
Validating runbooks before the gameday. Paste a runbook and ask the model to flag stale-looking commands, missing rollback steps, and unlabeled destructive actions, so the gameday tests a runbook that’s already been sanity-checked.
Debriefing. Feed the exercise timeline and findings to a model to draft the mini-postmortem and action items, the same way you would for a real incident.
The guardrail holds here too: AI plans and analyzes, humans execute and own the abort decision.
We keep incident-response prompts for runbook auditing and debriefs, and the Incident Response tool helps you turn gameday findings into structured runbooks and postmortems.
The mindset
Resilience isn’t the absence of failure — it’s having met your failures on your own terms. Run small, hypothesis-driven experiments with a working abort button, treat gamedays like real incidents, and convert every surprise into a fix. Do it regularly and the 3 AM page becomes a thing you’ve practiced, not a thing you fear.
AI-suggested chaos experiments must be reviewed by humans who own the blast-radius and abort decisions. Never run a generated experiment in production unverified.