GitLab CD: Blue/Green, Canary & Rolling Deployment Patterns Prompt
Design GitLab CD pipelines implementing blue/green, canary, and rolling deployment strategies for Kubernetes, VM, and serverless targets.
- Target user: DevOps engineers designing CD workflows
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior DevOps engineer who has built blue/green, canary, and rolling deployment pipelines in GitLab CI/CD for production workloads on Kubernetes, VMs, and serverless platforms. You know the trade-offs and the pipeline shapes for each strategy.

I will provide:
- The deployment target (Kubernetes / EC2 ASG / serverless / VM fleet)
- The current rollout strategy (or "none" if direct-replace)
- The risk tolerance / rollback requirement
- The application architecture (stateless / stateful, has DB migrations, uses cache)
- The goal: design a new strategy / debug an existing canary / pick between strategies

Your job:

1. **Match strategy to need**:
   - **Rolling**: replaces pods/instances gradually; cheapest; built into K8s/ASG; default for most cases
   - **Blue/Green**: keeps old version (blue) while deploying new (green); instant rollback; doubles infra cost during switch
   - **Canary**: routes small % of traffic to new version; observe metrics; promote or rollback; requires traffic split mechanism
2. **Design the pipeline stages** per strategy:
   - **Rolling**: build → deploy (kubectl set image / helm upgrade) → smoke test → done
   - **Blue/Green**: build → deploy-green (alongside blue) → smoke test green → switch-traffic → keep-blue-for-rollback → cleanup-blue
   - **Canary**: build → deploy-canary (10% traffic) → observe metrics 10 min → if pass: promote-to-100% → if fail: rollback
3. **For Kubernetes targets**:
   - Rolling is native to Deployment (`maxSurge`/`maxUnavailable`)
   - Blue/green: two Deployments + Service selector switch (Istio VirtualService, Argo Rollouts)
   - Canary: Argo Rollouts, Flagger, or Istio VirtualService with weighted routing
4. **For VM / ASG**:
   - Rolling: ASG `MinHealthyPercentage` controls
   - Blue/green: two target groups; flip LB
   - Canary: weighted target groups (AWS ALB), or DNS-based (Route 53 weighted records)
5. **For serverless**:
   - AWS Lambda: alias with traffic shifting (canary/linear/all-at-once)
   - GitLab CI deploys alias with new version + traffic config
6. **Critical considerations**:
   - **DB migrations**: never deploy a new schema in a strategy that keeps the old version running unless migration is backward-compatible (additive only). Otherwise: deploy migration first → deploy new code → drop old fields LATER.
   - **Stateful workloads**: blue/green is hard; data syncing during switch
   - **Cache invalidation**: new version with stale cache may misbehave
   - **Long connections (WebSocket, gRPC streams)**: drain time during blue/green switch
7. **Rollback strategy per type**:
   - Rolling: `kubectl rollout undo` / `helm rollback`
   - Blue/Green: switch traffic back to blue (fast)
   - Canary: revert traffic split to 100% old (fast)
8. **For monitoring + automated rollback** (canary):
   - Define SLO thresholds: error rate < 0.5%, p99 latency < 500ms
   - Use Prometheus query in pipeline to gate promote
   - Tools: Flagger (K8s), Spinnaker, GitLab's auto-rollout (limited)

Mark DESTRUCTIVE: traffic switch without smoke test (production exposure), removing blue infra immediately after green deploy (no rollback), DB migration that breaks old version while old code is still serving.

---

Target platform: [K8s / ASG / serverless / VM fleet]
Current strategy: [direct-replace / rolling / blue-green / canary / none]
Risk tolerance: [low / medium / high]
Schema migration?: [yes / no]
Stateful workload?: [yes / no]
Goal: [design new / debug existing / choose between]
Why this prompt works
Choosing a deployment strategy is half design, half pipeline implementation. Each strategy has a specific pipeline shape and rollback mechanism. This prompt forces a strategy-first design rather than copying YAML from elsewhere.
How to use it
- Match strategy to actual requirements — not all workloads need canary.
- Account for DB migrations separately from code deploys.
- For canary, require metric gating; don't rely on a timer alone.
- Test rollback in non-prod; it’s the path you’ll need under pressure.
Pipeline shapes
Rolling (Kubernetes Deployment, default)
```yaml
stages: [build, deploy, verify]

build:
  stage: build
  script:
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

deploy:
  stage: deploy
  script:
    - kubectl set image deploy/web web="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    - kubectl rollout status deploy/web --timeout=10m
  environment:
    name: production
    deployment_tier: production
  rules:
    - if: $CI_COMMIT_TAG

verify:
  stage: verify
  needs: [deploy]
  script:
    - ./smoke-tests.sh
  rules:
    - if: $CI_COMMIT_TAG
```
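The rolling pipeline above only triggers the rollout; the pacing comes from the Deployment's own strategy settings. A minimal sketch of what that Deployment might look like follows. The replica count, the `app: web` label, the placeholder image, port 8080, and the `/healthz` readiness path are assumptions for illustration, not values taken from a real cluster.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4                      # assumed size
  progressDeadlineSeconds: 600     # lines up with the 10m timeout on `kubectl rollout status`
  minReadySeconds: 10              # a new pod must stay ready this long before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # at most one extra pod during the rollout
      maxUnavailable: 0            # never drop below the desired replica count
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web                # must match the container name used in `kubectl set image`
          image: registry.example.com/group/web:initial   # placeholder; the pipeline overwrites it
          readinessProbe:
            httpGet:
              path: /healthz       # assumed health endpoint
              port: 8080           # assumed container port
            periodSeconds: 5
```

With `maxUnavailable: 0`, the rollout can only progress by surging, so `maxSurge` has to be at least 1; Kubernetes does not accept both set to zero.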
Blue/Green (Kubernetes via Argo Rollouts)
```yaml
stages: [build, deploy-green, switch-traffic, cleanup]

build:
  stage: build
  script: ./build.sh

deploy-green:
  stage: deploy-green
  script:
    - kubectl argo rollouts set image web web="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    # Wait for green pods to be ready. Assumes autoPromotionEnabled: false,
    # so the Rollout reports the Paused condition once green is up.
    - kubectl wait rollout/web --for=condition=Paused --timeout=10m
  environment: { name: production-green }

smoke-test-green:
  stage: deploy-green
  needs: [deploy-green]
  script:
    - curl -v https://green.example.com/healthz
    - ./smoke-tests.sh https://green.example.com

switch-traffic:
  stage: switch-traffic
  needs: [smoke-test-green]
  script:
    - kubectl argo rollouts promote web  # switches traffic from blue to green
  environment: { name: production }
  when: manual  # or after smoke-test passes

cleanup-blue:
  stage: cleanup
  needs: [switch-traffic]
  script:
    # Argo Rollouts scales blue down on its own after
    # spec.strategy.blueGreen.scaleDownDelaySeconds (set it to 1800 for a
    # 30 min rollback window). This job only confirms the end state.
    - kubectl argo rollouts status web --timeout 5m
    - kubectl argo rollouts get rollout web
  when: manual
```
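The blue/green jobs above assume a Rollout object that already uses the `blueGreen` strategy. A minimal sketch of such a Rollout is shown below. The Service names `web-active` and `web-preview`, the replica count, the label, the port, and the placeholder image are hypothetical; the two Services must exist separately and select the same pods.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  replicas: 3                        # assumed size
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/group/web:initial   # placeholder; `set image` overwrites it
          ports:
            - containerPort: 8080    # assumed port
  strategy:
    blueGreen:
      activeService: web-active      # hypothetical Service carrying production traffic (blue, then green)
      previewService: web-preview    # hypothetical Service exposed as https://green.example.com
      autoPromotionEnabled: false    # hold at green until `kubectl argo rollouts promote web`
      scaleDownDelaySeconds: 1800    # keep the old ReplicaSet for 30 minutes after the switch
```

While the old ReplicaSet is still scaled up, `kubectl argo rollouts undo web` is one way to point the active Service back at it.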
Canary (Kubernetes via Argo Rollouts + Prometheus)
```yaml
stages: [build, canary, observe, promote-or-rollback]

deploy-canary:
  stage: canary
  script:
    - kubectl argo rollouts set image web web="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    # Rollout spec defines steps: 10% → wait → 25% → wait → 50% → wait → 100%
  environment: { name: production }

observe:
  stage: observe
  needs: [deploy-canary]
  script:
    - sleep 600  # 10 min observation
    - ./check-slo.sh
    # check-slo.sh queries Prometheus for error_rate < 0.5% and p99_latency < 500ms
    # exits non-zero on threshold breach

promote:
  stage: promote-or-rollback
  needs: [observe]
  script:
    - kubectl argo rollouts promote web --full
  when: on_success
  environment: { name: production }

rollback:
  stage: promote-or-rollback
  needs: [observe]
  script:
    # abort sends all traffic back to the stable version; the Rollout stays
    # Degraded until the image is reverted (for example with `kubectl argo rollouts undo web`)
    - kubectl argo rollouts abort web
  when: on_failure
```
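The canary pipeline above refers to steps defined in the Rollout spec. One possible shape for that spec is sketched here, assuming the pipeline (not a timer) decides whether to move past the first 10% step. The replica count, label, and placeholder image are assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  replicas: 10                     # assumed; without a traffic router, weights are approximated by pod counts
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/group/web:initial   # placeholder; `set image` overwrites it
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {}                # indefinite: waits for `promote` or `abort` from the pipeline
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        # after the last step the Rollout moves to 100% on its own
```

Note that `kubectl argo rollouts promote web --full`, as used in the `promote` job, skips the remaining steps and goes straight to 100%. Plain `promote` would only release the current pause and let the 25% and 50% steps run.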
Canary with AWS Lambda
```yaml
stages: [deploy, promote]

deploy-lambda-canary:
  stage: deploy
  script:
    # Publish new version
    - NEW_VERSION=$(aws lambda publish-version --function-name myfunc --query Version --output text)
    # Update alias with 10/90 split: the alias keeps the old version as primary
    # and sends 10% of invocations to the new one
    - |
      aws lambda update-alias --function-name myfunc --name prod \
        --routing-config "AdditionalVersionWeights={$NEW_VERSION=0.1}"
    # Pass the version number to later jobs via a dotenv report
    - echo "NEW_VERSION=$NEW_VERSION" > deploy.env
  artifacts:
    reports:
      dotenv: deploy.env
  environment: { name: production }

promote-lambda:
  stage: promote
  needs: [deploy-lambda-canary]
  script:
    # After observation, route 100% to new
    - |
      aws lambda update-alias --function-name myfunc --name prod \
        --function-version "$NEW_VERSION" \
        --routing-config "AdditionalVersionWeights={}"
  when: manual
```
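The Lambda example covers promotion but not the rollback path. A hedged sketch of a matching rollback job follows; `rollback-lambda` is a hypothetical job name, and the sketch assumes the alias's primary version still points at the previous version while the canary is running, so clearing the extra weights returns 100% of traffic to it.

```yaml
rollback-lambda:
  stage: promote
  needs: [deploy-lambda-canary]
  script:
    # Clear the weighted routing; the alias's primary (old) version serves all traffic again
    - aws lambda update-alias --function-name myfunc --name prod --routing-config "AdditionalVersionWeights={}"
  when: manual
  environment: { name: production }
```

The published canary version is left in place, so it can be inspected afterwards or retried with a new weight.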
DB migration pattern (safe for any strategy)
```yaml
stages: [migrate, deploy, cleanup-migrations]

# Phase 1: Additive migration BEFORE code deploy (backward-compatible)
migrate:
  stage: migrate
  script:
    - alembic upgrade head  # adds new columns, keeps old
  rules:
    - if: $CI_COMMIT_TAG

# Phase 2: Deploy new code (reads/writes both old and new schema)
deploy:
  stage: deploy
  needs: [migrate]
  script: ./deploy.sh
  rules:
    - if: $CI_COMMIT_TAG  # same rule as migrate, so `needs` always resolves

# Phase 3: Cleanup (drop old columns, after all old code is gone): SEPARATE MR
# This runs in a future pipeline, not the same one
```
Comparison
| Aspect | Rolling | Blue/Green | Canary |
|---|---|---|---|
| Rollback speed | Slow (re-deploy) | Fast (flip back) | Fast (flip back) |
| Infra cost | 1× | 2× during switch | 1.1× during canary |
| Risk | Medium (some users hit new immediately) | Low (atomic switch) | Lowest (small % first) |
| Complexity | Low | Medium | High |
| Best for | Most stateless workloads | Releases that need instant rollback | High-risk changes |
| Requires | Pod replacement support | Two-target infra | Traffic split (LB/mesh) |
Common findings this catches
- Canary rolled back on a metric blip → threshold too tight or noisy; tune SLO query.
- Blue/green with shared DB and breaking migration → green crashes; flipping back doesn’t help.
- Rolling deploy stuck because `maxUnavailable: 0` + `maxSurge: 0` → impossible math.
- Blue/green flip leaves blue running indefinitely → cleanup not running; verify the cleanup job ran.
- Canary observation manual-only → engineer-dependent; automate with metric gate.
- Lambda canary on alias used by sync clients without retries → 10% see errors; rollback fast.
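For the manual-observation finding in the list above, one way to automate the gate is to query Prometheus directly from the pipeline. The job below is a sketch under stated assumptions: the Prometheus URL, the `http_requests_total` metric, and its `job` and `code` labels are hypothetical and need to match your own instrumentation. It checks only the error-rate threshold; a p99 latency check would follow the same pattern with a second query.

```yaml
observe:
  stage: observe
  image: alpine:3.20
  variables:
    PROMETHEUS_URL: "http://prometheus.example.com:9090"   # assumed address
    MAX_ERROR_RATE: "0.005"                                # 0.5% error-rate threshold
    # Hypothetical metric and labels: share of 5xx responses over the last 5 minutes
    ERROR_RATE_QUERY: >-
      sum(rate(http_requests_total{job="web",code=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="web"}[5m]))
  before_script:
    - apk add --no-cache curl jq
  script:
    - |
      # Instant query against the Prometheus HTTP API; value[1] holds the sample as a string
      rate=$(curl -fsS --get --data-urlencode "query=${ERROR_RATE_QUERY}" \
        "${PROMETHEUS_URL}/api/v1/query" | jq -r '.data.result[0].value[1] // "NaN"')
      echo "5xx error rate over 5m: ${rate}"
      # Fail closed: an empty, missing, or NaN sample blocks promotion
      awk -v r="$rate" -v max="$MAX_ERROR_RATE" \
        'BEGIN { if (r == "" || r == "NaN" || r + 0 > max + 0) exit 1 }'
```

Failing closed is a deliberate choice here: if Prometheus is unreachable or the canary received no traffic, the job exits non-zero and any `on_failure` rollback job runs instead of the promotion.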
When to escalate
- Strategy choice doesn’t fit infra capabilities — coordinate with platform team; may need LB / mesh changes.
- DB migration ordering issues — engage DBA team; backward-compat may require multi-deploy plan.
- Metric-gated rollback false-positives — SRE team for SLO tuning.
Related prompts
- GitLab Environments & Deployments Debug Prompt. Diagnose GitLab environments: stuck deployments, environment scope, `auto_stop_in` cleanup, protected environments, deployment tier confusion.
- GitLab CI/CD → Kubernetes Deploy Patterns Prompt. Design GitLab CI/CD pipelines that deploy to Kubernetes: kubectl vs Helm vs Kustomize, secrets handling, multi-environment promotion, GitOps comparison.
- Kubernetes Deployment Rollout Debug Prompt. Diagnose stuck Deployment rollouts: `ProgressDeadlineExceeded`, replica set churn, maxSurge/maxUnavailable misconfig, image pull pacing, and stuck-mid-rollout recovery.