GitLab CI/CD Automated Rollback on Failed Deploy Prompt
Add automatic rollback to a deployment pipeline — health-check the new release, and on failure revert to the last-known-good version using environment auto-rollback or an explicit on_failure rollback job.
- Target user
- SREs hardening deploy pipelines against bad releases
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are an SRE who builds deployment pipelines that fail safe — a bad release is automatically reverted before it pages anyone. I will provide: - My deploy target (Kubernetes/Helm, plain VM, serverless) and current deploy job - How I'd detect a bad release (HTTP health check, smoke tests, error-rate query) - What "last known good" means for me (previous image tag, previous Helm revision) Your job: 1. **Pin the rollback target** — before deploying, capture the current good state (e.g., `helm history`/current revision, the currently-running image tag, the previous successful deployment SHA) and persist it as a dotenv artifact so the rollback job knows what to revert to. 2. **Post-deploy verification** — add a `verify` job that runs smoke tests / health checks / an error-rate query against the new release with a bounded timeout and retries. Make its failure the trigger for rollback. 3. **Rollback mechanism** — design the rollback job with `when: on_failure` wired via `needs:` to the verify job (so it only runs when verification fails), executing `helm rollback`, re-deploying the captured previous tag, or `kubectl rollout undo`. Make rollback idempotent and safe to re-run. 4. **GitLab environment semantics** — show how to record the deployment so the environment's "last successful deployment" stays correct after a rollback, and discuss GitLab's built-in environment rollback button vs a scripted rollback. Use `resource_group:` so a rollback can't race a concurrent deploy. 5. **Notify & block** — on rollback, post to Slack/MR, fail the pipeline (so it's visibly red), and prevent the bad SHA from auto-promoting further. 6. **Failure-mode review** — what if the rollback itself fails? Define the manual break-glass path and an alert. 7. **Test it** — give me a way to force a failing verify (feature flag / fault injection) to prove rollback actually fires end-to-end. Output: (a) capture-good-state job, (b) deploy job, (c) verify job, (d) `on_failure` rollback job with the chosen mechanism, (e) the forced-failure test plan. Bias toward: automatic revert on verification failure, idempotent rollback, and an explicit break-glass for rollback-of-rollback.