Automated Deployment Rollback Design Prompt
Design safe automated rollback for deployments — health signals, bake windows, rollback triggers, and the database-migration problem — so a bad release reverts fast without making things worse.
- Target user: Release and platform engineers building automated rollback
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a release engineering expert who has built automated rollback for services where a bad deploy costs real money per minute, and who knows that automated rollback can make an incident worse when migrations are involved.

I will provide:
- The deploy mechanism (rolling, blue-green, canary; Argo Rollouts, Flagger, Spinnaker, custom)
- The service's health signals (error rate, latency, saturation, business KPIs)
- Whether deploys include schema/data migrations
- Current rollback process and how long it takes
- Tolerance for false-positive rollbacks

Your job:
1. **Rollback triggers**: choose the signals that reliably indicate a bad release (error-rate delta vs baseline, latency P99, key business metric), with a comparison window. Avoid single-spike triggers that fire on noise.
2. **Bake / analysis window**: how long to observe each canary step before promoting or rolling back, and how to weight signals so one flaky metric doesn't auto-revert a healthy release.
3. **The migration problem**: when a release includes a non-backward-compatible DB migration, automated code rollback is unsafe. Require expand/contract (backward-compatible) migrations so code can roll back independently of schema. Block auto-rollback when an incompatible migration shipped, and escalate to a human instead.
4. **Rollback mechanics**: exactly what "rollback" does (shift traffic, revert image, scale down new RS) and how to make it idempotent and fast.
5. **Stuck-state handling**: what if rollback itself fails or the old version is also unhealthy? Define the halt-and-page state; never thrash between versions.
6. **Guardrails**: a cap on automatic rollbacks per window before forcing human involvement, and a manual override/freeze switch.
7. **Validation**: game-day a deliberately bad canary in staging to prove the trigger fires, the rollback completes, and the migration guard blocks correctly.

Output as: (a) the trigger/signal table with thresholds and windows, (b) the canary + rollback state machine, (c) the migration-safety policy (expand/contract + auto-rollback block), (d) the stuck-state and override design, (e) a game-day test plan. Bias toward fast rollback for stateless code, and explicit human escalation whenever data or schema is involved.
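
As a rough illustration of the kind of decision logic this prompt asks for, the Python sketch below combines a sustained-breach trigger, the migration guard, the rollback cap, and the freeze switch into one function. Everything in it is hypothetical: the names (`ReleaseContext`, `decide`, `breached`), the thresholds (a 2-point error-rate delta, 3 consecutive samples, 2 automatic rollbacks per window, a 10-sample bake), and the choice to treat a freeze as halt-and-page are assumptions made for the example, not values taken from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"            # keep observing the current canary step
    PROMOTE = "promote"              # bake window passed without a sustained breach
    ROLLBACK = "rollback"            # safe to revert automatically
    HALT_AND_PAGE = "halt_and_page"  # stop automation and escalate to a human


@dataclass
class ReleaseContext:
    has_incompatible_migration: bool  # True if the schema change is not expand/contract
    rollbacks_in_window: int          # automatic rollbacks already done in the guard window
    frozen: bool = False              # manual override / freeze switch


def breached(canary_error_rates, baseline_error_rate, max_delta, min_consecutive):
    """True only if the last `min_consecutive` samples all exceed baseline + max_delta.

    Requiring consecutive breaches keeps a single noisy spike from firing the trigger.
    """
    if len(canary_error_rates) < min_consecutive:
        return False
    recent = canary_error_rates[-min_consecutive:]
    return all(rate - baseline_error_rate > max_delta for rate in recent)


def decide(ctx, canary_error_rates, baseline_error_rate,
           max_delta=0.02, min_consecutive=3,
           max_auto_rollbacks=2, bake_samples=10):
    """Pick the next action for one canary step (all thresholds are illustrative)."""
    if ctx.frozen:
        # Assumption: a freeze means automation takes no action on its own.
        return Decision.HALT_AND_PAGE
    if breached(canary_error_rates, baseline_error_rate, max_delta, min_consecutive):
        # Data or schema involved, or too many recent reverts: escalate, do not thrash.
        if ctx.has_incompatible_migration:
            return Decision.HALT_AND_PAGE
        if ctx.rollbacks_in_window >= max_auto_rollbacks:
            return Decision.HALT_AND_PAGE
        return Decision.ROLLBACK
    if len(canary_error_rates) >= bake_samples:
        return Decision.PROMOTE
    return Decision.CONTINUE


if __name__ == "__main__":
    stateless = ReleaseContext(has_incompatible_migration=False, rollbacks_in_window=0)
    with_migration = ReleaseContext(has_incompatible_migration=True, rollbacks_in_window=0)
    baseline = 0.01
    print(decide(stateless, [0.01, 0.09, 0.01], baseline))       # one spike only
    print(decide(stateless, [0.05, 0.06, 0.07], baseline))       # sustained breach
    print(decide(with_migration, [0.05, 0.06, 0.07], baseline))  # guard blocks auto-rollback
```

A real design would feed this from several weighted signals (latency P99, a business KPI) rather than error rate alone, and would persist `rollbacks_in_window` somewhere the controller can read across deploys; the sketch only shows how the guards compose.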