Recovery Smoke-Test Suite Generator Prompt

Generate a fast, scriptable smoke-test suite that proves a service is genuinely healthy after a mitigation or restart — covering critical user journeys, data integrity, and downstream dependencies — before you declare an incident resolved.

Target user: SREs and on-call engineers verifying recovery before closing an incident
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior SRE who refuses to call an incident resolved until automated checks prove the system actually works end-to-end, not just that the error rate dropped.

I will provide:
- The service architecture (entry points, critical paths, datastores, downstream dependencies)
- The nature of the incident and the mitigation applied
- Available tooling (curl, k6, synthetic monitors, CLI clients, test frameworks)
- SLOs and the key user journeys that matter most

Your job:

1. **Define "recovered"**: turn a vague "it's back" into measurable exit criteria, namely which journeys must succeed, at what latency, within what error budget, and over what observation window.

2. **Tier the tests**: group the checks into four tiers.
   - Tier 0: liveness/readiness and dependency reachability.
   - Tier 1: golden-path user journeys (login, read, write, checkout, or whatever is critical).
   - Tier 2: data-integrity checks (no partial writes, no stale caches, counters reconcile).
   - Tier 3: downstream and blast-radius checks (queues drained, replicas caught up, no thundering herd on recovery).

3. **Write runnable tests**: produce concrete scripts (bash + curl, k6, or the framework I named) for each tier, with explicit assertions, expected status codes, and pass/fail thresholds. No pseudocode.

4. **Catch silent failures**: add checks for what the obvious health endpoint would miss, such as cache poisoning, queue backlog, replication lag, partial feature degradation, and retried-but-duplicated writes.

5. **Observation window**: specify how long to keep watching after the suite goes green before declaring the incident resolved, and which dashboards and metrics must show a sustained return to baseline rather than a single healthy reading.

6. **Rollback trigger**: define the conditions under which a failing smoke suite should automatically revert the mitigation or re-escalate the incident.

7. **Make it reusable**: package the suite so it can run on a schedule as ongoing synthetic monitoring, not just once.

Output as: (a) machine-checkable recovery exit criteria, (b) tiered test scripts, (c) the silent-failure checklist, (d) the observation-window and dashboard checklist, (e) a rollback and re-escalation trigger spec, (f) a packaging plan for running the suite on a schedule.

Bias toward: real user journeys over health endpoints, explicit thresholds over "looks fine", reusable over throwaway.
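
To give a sense of what "machine-checkable recovery exit criteria" could look like in practice, here is a minimal Python sketch of an evaluator that a response to this prompt might include. Everything in it is a hypothetical illustration: the `ExitCriteria` and `Probe` types, the `evaluate` function, the three verdict strings, and the example thresholds are assumptions made for this sketch, not part of the prompt or of any specific tool. It covers only the decision logic; collecting probe results (with curl, k6, or a synthetic monitor) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExitCriteria:
    """Hypothetical recovery exit criteria for one user journey."""
    journey: str
    max_p95_latency_ms: float  # p95 latency ceiling over the window
    max_error_rate: float      # allowed fraction of failed probes, 0.0 to 1.0
    min_samples: int           # probes required before any verdict is given


@dataclass(frozen=True)
class Probe:
    """One synthetic request: whether it passed and how long it took."""
    ok: bool
    latency_ms: float


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[max(rank, 1) - 1]


def evaluate(criteria, probes):
    """Return 'recovered', 'keep-watching', or 're-escalate'.

    Latency is measured over all probes, failed ones included, so a
    slow failure counts against both thresholds.
    """
    if not probes or len(probes) < criteria.min_samples:
        return "keep-watching"  # too little evidence to decide either way
    error_rate = sum(1 for p in probes if not p.ok) / len(probes)
    latency = p95([p.latency_ms for p in probes])
    if error_rate > criteria.max_error_rate:
        return "re-escalate"
    if latency > criteria.max_p95_latency_ms:
        return "re-escalate"
    return "recovered"


if __name__ == "__main__":
    # Example thresholds are placeholders; real values come from your SLOs.
    checkout = ExitCriteria("checkout", max_p95_latency_ms=800.0,
                            max_error_rate=0.01, min_samples=20)
    healthy = [Probe(True, 120.0 + i) for i in range(20)]
    degraded = healthy[:18] + [Probe(False, 5000.0), Probe(False, 5000.0)]
    print(evaluate(checkout, healthy[:5]))
    print(evaluate(checkout, healthy))
    print(evaluate(checkout, degraded))
```

Keeping "not enough data yet" as its own verdict, separate from pass and fail, is one way to stop a suite from declaring recovery on the first few green probes. Whether a failing verdict should revert the mitigation automatically or page a human is a policy choice this sketch deliberately leaves open.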
