Generating Game-Day Chaos Scenarios With AI Your Team Hasn't Seen

The problem with running game days for a few years is that you run out of imagination. The scenarios start to repeat — the database is down, the region is gone, the deploy went bad — and the team gets good at exactly those and nothing else. Then a real incident arrives shaped like none of them, and all that practice does not transfer. I hit this wall last year and started using AI to generate game-day scenarios I would never have thought of, and the quality of our exercises jumped. The key is that AI authors the scenario; it never touches the actual system.

Why scenario variety matters more than fidelity

A game day is only as valuable as the thinking it forces. If your team can solve the scenario from muscle memory, they learn nothing. The goal is to present situations that are realistic enough to take seriously but novel enough to require genuine reasoning. Generating a steady stream of fresh, plausible scenarios is hard for a human facilitator, because we all have blind spots and favorite failure modes.

AI is unusually good at this. Describe your architecture and ask for failure scenarios, and it will combine failure modes in ways that are entirely plausible but outside your habitual set — a partial cache poisoning that interacts with a feature flag, a slow memory leak that only manifests after a traffic shift, a third-party certificate expiry that cascades into auth failures.

Generating scenarios tuned to your stack

Generic scenarios are boring and unrealistic. I feed a tool like Claude a description of our actual services, dependencies, and recent near-misses, and ask for game-day scenarios specific to that architecture. The output is grounded in real components, which makes it land — the team recognizes the systems involved and engages seriously instead of dismissing it as a textbook example.

I ask for a spread of difficulty and a spread of incident types: pure technical failures, but also scenarios that stress communication, escalation, and decision-making under ambiguity. The communication failures are often the most valuable, because that is where real incidents actually go wrong.

Pro Tip: Ask the model to include “complications” that reveal themselves partway through the exercise — the on-call who is unreachable, the runbook that turns out to be outdated, the second unrelated alert. Real incidents are messy, and scenarios that stay clean train the wrong reflexes.
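
As a rough illustration of the inputs described above, here is a minimal Python sketch of a prompt builder. Everything in it is an assumption for the sake of the example: the ScenarioRequest class, its field names, the wording of the prompt, and the sample architecture are all hypothetical, not the exact prompt I use. It only assembles a string; it does not call a model or touch any system.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioRequest:
    """Inputs for one generation prompt. All names here are illustrative."""
    architecture: str  # plain-text description of services and dependencies
    near_misses: list[str] = field(default_factory=list)
    incident_types: list[str] = field(
        default_factory=lambda: ["technical failure", "communication", "escalation"]
    )
    difficulties: list[str] = field(
        default_factory=lambda: ["easy", "medium", "hard"]
    )
    complications_per_scenario: int = 2


def build_prompt(req: ScenarioRequest) -> str:
    """Build the prompt text. This only returns a string; it calls no model and no system."""
    near_misses = "\n".join(f"- {m}" for m in req.near_misses) or "- (none provided)"
    return "\n".join([
        "Draft tabletop game-day scenarios for the architecture below.",
        "Write scenarios on paper only; do not propose running anything against real systems.",
        "",
        "Architecture:",
        req.architecture.strip(),
        "",
        "Recent near-misses:",
        near_misses,
        "",
        f"Cover these incident types: {', '.join(req.incident_types)}.",
        f"Cover these difficulty levels: {', '.join(req.difficulties)}.",
        f"Give each scenario {req.complications_per_scenario} complications that "
        "surface partway through (for example an unreachable on-call, an outdated "
        "runbook, or a second unrelated alert).",
    ])


# Hypothetical example input, not a real system.
example = ScenarioRequest(
    architecture="checkout-api -> payments-service -> Postgres; Redis cache in front of catalog",
    near_misses=["cache stampede after a deploy"],
)
prompt = build_prompt(example)
```

Keeping the spread of difficulty, incident types, and complications as explicit fields makes it harder to forget one of them when the prompt is reused for a different service.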

Building the facilitator’s script

A good game day needs more than a starting prompt. I have the model draft the full facilitator script: the opening situation, the injects to reveal at timed intervals, the expected responses, and the signals the team should be looking for. This turns a vague idea into something a facilitator can actually run, and it lets someone other than the resident chaos expert lead the exercise.
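
One way to keep such a script consistent is to store it as structured data. The sketch below is only an assumption about what that structure might look like: the Inject and FacilitatorScript names, their fields, and the sample content are hypothetical. It captures the four parts named above (opening situation, timed injects, expected responses, signals) and gives the facilitator a helper for finding which injects are due.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Inject:
    minute: int             # minutes after the exercise starts
    reveal: str             # what the facilitator tells the team
    expected_response: str  # what a reasonable response looks like


@dataclass
class FacilitatorScript:
    title: str
    opening: str
    injects: list[Inject] = field(default_factory=list)
    signals: list[str] = field(default_factory=list)

    def injects_due(self, last_checked: int, now: int) -> list[Inject]:
        """Return injects scheduled after last_checked and up to now, in time order."""
        due = [i for i in self.injects if last_checked < i.minute <= now]
        return sorted(due, key=lambda i: i.minute)


# Hypothetical script contents for illustration.
script = FacilitatorScript(
    title="Certificate expiry cascades into auth failures",
    opening="Login error rate is climbing; no deploys in the last hour.",
    injects=[
        Inject(20, "A second, unrelated disk alert fires.", "Triage and defer it."),
        Inject(10, "The primary on-call is unreachable.", "Escalate to the secondary."),
    ],
    signals=["Someone checks certificate expiry dates"],
)
first_window = script.injects_due(last_checked=0, now=10)
```

The opening situation covers minute zero, so injects_due treats last_checked as exclusive and now as inclusive. A facilitator can call it every few minutes with the previous check time and never reveal the same inject twice.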

I keep these scripts and the generation prompts in my prompt workspace so the format is consistent and reusable. Over time we have built a library of AI-drafted, human-refined scenarios we can pull from and remix.

The human owns realism and safety

Here is the critical review step. Every AI-generated scenario gets vetted by a human before it runs. The model occasionally produces something technically impossible in our architecture, or a scenario whose “right answer” is subtly wrong, or one that would accidentally teach a bad practice. The facilitator’s job is to catch that and correct it. AI drafts the scenario; a human owns whether it is realistic and safe to use.

This review is also where domain knowledge gets injected. I know that a particular failure mode is more likely than the model assumes, or that a scenario should emphasize a weakness we are specifically trying to address this quarter. The model gives me raw material; I shape it into a training tool.
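
The review step can be made hard to skip by recording it in the scenario library itself. The following is a minimal sketch under my own assumptions: ScenarioDraft, approve, and runnable are hypothetical names, and the rule "no reviewer name, no exercise" is one possible policy rather than a standard one.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScenarioDraft:
    title: str
    body: str
    approved_by: Optional[str] = None  # name of the human reviewer, if approved
    review_notes: list[str] = field(default_factory=list)


def approve(draft: ScenarioDraft, reviewer: str, notes: list[str]) -> ScenarioDraft:
    """Record a human sign-off. A blank reviewer name is rejected."""
    if not reviewer.strip():
        raise ValueError("a named human reviewer is required")
    draft.approved_by = reviewer.strip()
    draft.review_notes.extend(notes)
    return draft


def runnable(drafts: list[ScenarioDraft]) -> list[ScenarioDraft]:
    """Only drafts with a recorded reviewer may be used in an exercise."""
    return [d for d in drafts if d.approved_by]


# Hypothetical drafts for illustration.
vetted = approve(
    ScenarioDraft("Cache poisoning meets feature flag", "..."),
    reviewer="facilitator on duty",
    notes=["Corrected the expected response for step 3."],
)
unvetted = ScenarioDraft("Slow memory leak after traffic shift", "...")
ready = runnable([vetted, unvetted])
```

The review notes field is where corrections and injected domain knowledge get written down, so the next facilitator can see what was changed from the model's draft and why.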

AI writes the scenario, never injects the fault

This is the absolute line, and it matters even more for chaos work than elsewhere. AI generates the scenario on paper; humans decide whether and how to inject anything into a real system. The model never gets connected to fault-injection tooling, never triggers a real failure, never touches production or even staging. It is a creative writing partner for the exercise, not an actor in it.

The reason is simple: fault injection is an operational action with real blast radius, even in staging. An LLM with the power to inject faults is an LLM that can cause an outage based on a misjudgment. We run the actual chaos experiments with deliberate, human-controlled tooling and human-defined blast-radius limits. The AI’s contribution ends at the script. The free AI Incident Response Assistant is built around this same advisory-only philosophy.

Turning game-day results into improvements

After each exercise, I feed the timeline and the team’s actions back to the model and ask it to identify where the response broke down and what gaps the scenario exposed. This synthesis is fast and surprisingly insightful: it notices that the team spent eight minutes on diagnosis that a runbook should have made instant, or that the escalation path failed silently. Those findings become real action items, owned and prioritized by humans.
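
Before handing a timeline to the model, it can help to compute the basic arithmetic yourself so the numbers in the summary are checkable. This is a small sketch, assuming the timeline is recorded as ISO timestamps paired with the phase that began at that moment. The function names, the five-minute threshold, and the sample timeline are all made up for illustration.

```python
from datetime import datetime


def phase_durations(timeline: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Given (ISO timestamp, phase name) pairs marking when each phase began,
    return (phase name, minutes spent) for every phase except the last."""
    events = sorted((datetime.fromisoformat(ts), name) for ts, name in timeline)
    return [
        (name, (next_start - start).total_seconds() / 60)
        for (start, name), (next_start, _) in zip(events, events[1:])
    ]


def slow_phases(
    durations: list[tuple[str, float]], threshold_minutes: float
) -> list[tuple[str, float]]:
    """Return the phases that took longer than the threshold."""
    return [(name, minutes) for name, minutes in durations if minutes > threshold_minutes]


# Made-up timeline for illustration only.
timeline = [
    ("2024-01-01T10:00:00", "detection"),
    ("2024-01-01T10:02:00", "diagnosis"),
    ("2024-01-01T10:10:00", "mitigation"),
    ("2024-01-01T10:13:00", "resolved"),
]
durations = phase_durations(timeline)
flagged = slow_phases(durations, threshold_minutes=5)
```

With this sample data the diagnosis phase runs from 10:02 to 10:10, so it is the only phase flagged. The flagged list is a starting point for the conversation with the model and the team, not a verdict.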

This closes the loop: AI helps generate the scenario, and AI helps analyze the result, but humans run the exercise and own every improvement that comes out of it.

Conclusion

Game days lose their value when the scenarios go stale, and human facilitators run out of fresh, plausible failure modes. AI helps fix that by generating varied, architecture-specific scenarios with realistic complications, drafting facilitator scripts, and synthesizing results afterward. Keep a human firmly in control of scenario realism and, above all, of any actual fault injection. The model is a writing partner for your exercises, never a participant in them. Explore more resilience practices in the incident-response category, or adapt scenario templates from our prompt packs.
