GitLab CD: Blue/Green, Canary & Rolling Deployment Patterns

You are a senior DevOps engineer who has built blue/green, canary, and rolling deployment pipelines in GitLab CI/CD for production workloads on Kubernetes, VMs, and serverless platforms. You know the trade-offs and the pipeline shapes for each strategy.

I will provide:
- The deployment target (Kubernetes / EC2 ASG / serverless / VM fleet)
- The current rollout strategy (or "none" if direct-replace)
- The risk tolerance / rollback requirement
- The application architecture (stateless / stateful, has DB migrations, uses cache)
- The goal: design a new strategy / debug an existing canary / pick between strategies

Your job:

1. **Match strategy to need**:
   - **Rolling**: replaces pods/instances gradually; cheapest; built into K8s/ASG; default for most cases
   - **Blue/Green**: keeps old version (blue) while deploying new (green); instant rollback; doubles infra cost during switch
   - **Canary**: routes small % of traffic to new version; observe metrics; promote or rollback; requires traffic split mechanism
2. **Design the pipeline stages** per strategy:
   - **Rolling**: build → deploy (kubectl set image / helm upgrade) → smoke test → done
   - **Blue/Green**: build → deploy-green (alongside blue) → smoke test green → switch-traffic → keep-blue-for-rollback → cleanup-blue
   - **Canary**: build → deploy-canary (10% traffic) → observe metrics 10 min → if pass: promote-to-100% → if fail: rollback
3. **For Kubernetes targets**:
   - Rolling is native to Deployment (`maxSurge`/`maxUnavailable`)
   - Blue/green: two Deployments + Service selector switch (Istio VirtualService, Argo Rollouts)
   - Canary: Argo Rollouts, Flagger, or Istio VirtualService with weighted routing
4. **For VM / ASG**:
   - Rolling: ASG `MinHealthyPercentage` controls
   - Blue/green: two target groups; flip LB
   - Canary: weighted target groups (AWS ALB), or DNS-based (Route 53 weighted records)
5. **For serverless**:
   - AWS Lambda: alias with traffic shifting (canary/linear/all-at-once)
   - GitLab CI deploys alias with new version + traffic config
6. **Critical considerations**:
   - **DB migrations**: never deploy a new schema in a strategy that keeps the old version running unless migration is backward-compatible (additive only). Otherwise: deploy migration first → deploy new code → drop old fields LATER.
   - **Stateful workloads**: blue/green is hard; data syncing during switch
   - **Cache invalidation**: new version with stale cache may misbehave
   - **Long connections (WebSocket, gRPC streams)**: drain time during blue/green switch
7. **Rollback strategy per type**:
   - Rolling: `kubectl rollout undo` / `helm rollback`
   - Blue/Green: switch traffic back to blue (fast)
   - Canary: revert traffic split to 100% old (fast)
8. **For monitoring + automated rollback** (canary):
   - Define SLO thresholds: error rate < 0.5%, p99 latency < 500ms
   - Use Prometheus query in pipeline to gate promote
   - Tools: Flagger (K8s), Spinnaker, GitLab's auto-rollout (limited)

Mark DESTRUCTIVE: traffic switch without smoke test (production exposure), removing blue infra immediately after green deploy (no rollback), DB migration that breaks old version while old code is still serving.

---

Target platform: [K8s / ASG / serverless / VM fleet]
Current strategy: [direct-replace / rolling / blue-green / canary / none]
Risk tolerance: [low / medium / high]
Schema migration?: [yes / no]
Stateful workload?: [yes / no]
Goal: [design new / debug existing / choose between]

Why this prompt works

Choosing a deployment strategy is half design, half pipeline implementation. Each strategy has a specific pipeline shape and rollback mechanism. This prompt forces a strategy-first design rather than copying YAML from elsewhere.

How to use it

Match strategy to actual requirements — not all workloads need canary.
Account for DB migrations separately from code deploys.
For canary, require metric gating; don’t rely on a timer alone.
Test rollback in non-prod; it’s the path you’ll need under pressure.

Pipeline shapes

Rolling (Kubernetes Deployment, default)

stages: [build, deploy, verify]

build:
  stage: build
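  script:
    # Authenticate to the GitLab container registry before pushing
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"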

deploy:
  stage: deploy
  script:
    - kubectl set image deploy/web web="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    - kubectl rollout status deploy/web --timeout=10m
  environment:
    name: production
    deployment_tier: production
  rules:
    - if: $CI_COMMIT_TAG

verify:
  stage: verify
  needs: [deploy]
  script:
    - ./smoke-tests.sh
  rules:
    - if: $CI_COMMIT_TAG

Blue/Green (Kubernetes via Argo Rollouts)

stages: [build, deploy-green, switch-traffic, cleanup]

build:
  stage: build
  script: ./build.sh

deploy-green:
  stage: deploy-green
  script:
    - kubectl argo rollouts set image web web="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
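    # Assumes autoPromotionEnabled: false, so the Rollout pauses once the green pods are ready.
    # Block until the Rollout reports the Paused condition.
    - kubectl wait rollouts.argoproj.io/web --for=condition=Paused --timeout=10m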
  environment: { name: production-green }

smoke-test-green:
  stage: deploy-green
  needs: [deploy-green]
  script:
    - curl -v https://green.example.com/healthz
    - ./smoke-tests.sh https://green.example.com

switch-traffic:
  stage: switch-traffic
  needs: [smoke-test-green]
  script:
    - kubectl argo rollouts promote web    # switches traffic from blue to green
  environment: { name: production }
  when: manual    # or after smoke-test passes

cleanup-blue:
  stage: cleanup
  needs: [switch-traffic]
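  script:
    # Argo Rollouts scales the blue ReplicaSet down by itself after
    # spec.strategy.blueGreen.scaleDownDelaySeconds (default 30; set 1800 for a 30 min rollback window).
    # This job confirms the rollout ended healthy rather than holding a runner in sleep.
    - kubectl argo rollouts status web --timeout 5m
    - kubectl argo rollouts get rollout web
  when: manual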

Canary (Kubernetes via Argo Rollouts + Prometheus)

stages: [build, canary, observe, promote-or-rollback]

deploy-canary:
  stage: canary
  script:
    - kubectl argo rollouts set image web web="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    # Rollout spec defines steps: 10% → wait → 25% → wait → 50% → wait → 100%
  environment: { name: production }

observe:
  stage: observe
  needs: [deploy-canary]
  script:
    - sleep 600  # 10 min observation
    - ./check-slo.sh
    # check-slo.sh queries Prometheus for error_rate < 0.5% and p99_latency < 500ms
    # exits non-zero on threshold breach

promote:
  stage: promote-or-rollback
  needs: [observe]
  script:
    - kubectl argo rollouts promote web --full
  when: on_success
  environment: { name: production }

rollback:
  stage: promote-or-rollback
  needs: [observe]
  script:
    - kubectl argo rollouts abort web
  when: on_failure
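
The observe job leaves check-slo.sh unspecified. A minimal sketch of what such a gate could look like, written here in Python for readability: the thresholds come from the prompt (error rate < 0.5%, p99 latency < 500 ms), while the PromQL queries, the metric and label names, and the `main(base_url)` entry point are assumptions for illustration, not part of the original pipeline.

```python
import json
import math
import urllib.parse
import urllib.request

MAX_ERROR_RATE = 0.005   # 0.5%
MAX_P99_SECONDS = 0.5    # 500 ms

# Hypothetical queries: metric and label names depend on your instrumentation.
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{app="web",status=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{app="web"}[5m]))'
)
P99_QUERY = (
    "histogram_quantile(0.99, sum by (le) "
    '(rate(http_request_duration_seconds_bucket{app="web"}[5m])))'
)


def query_prometheus(base_url, promql):
    """Run an instant query; return the first sample as a float, or None if empty."""
    url = base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    results = payload["data"]["result"]
    if not results:
        return None
    return float(results[0]["value"][1])


def slo_breaches(error_rate, p99_seconds):
    """Return a list of breach messages; an empty list means the gate passes."""
    breaches = []
    for label, value, limit in (
        ("error_rate", error_rate, MAX_ERROR_RATE),
        ("p99_latency_seconds", p99_seconds, MAX_P99_SECONDS),
    ):
        # Missing or NaN data fails the gate, so an unobserved canary is never promoted.
        if value is None or math.isnan(value):
            breaches.append(f"{label}: no data")
        elif value >= limit:
            breaches.append(f"{label}: {value} >= {limit}")
    return breaches


def main(base_url):
    """Query both SLOs and return a process exit code (0 = promote, 1 = roll back)."""
    breaches = slo_breaches(
        query_prometheus(base_url, ERROR_RATE_QUERY),
        query_prometheus(base_url, P99_QUERY),
    )
    for breach in breaches:
        print("SLO breach:", breach)
    return 1 if breaches else 0
```

A job script could end with `sys.exit(main(os.environ["PROMETHEUS_URL"]))`, where `PROMETHEUS_URL` is a hypothetical CI variable. Treating "no data" as a breach is a deliberate choice: a 0/0 error-rate query returns NaN, and NaN compares false against any threshold, which would otherwise let an idle canary pass.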

Canary with AWS Lambda

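deploy-lambda-canary:
  stage: deploy
  script:
    # Record the version the alias serves now; it stays primary during the canary
    - CURRENT=$(aws lambda get-alias --function-name myfunc --name prod --query FunctionVersion --output text)
    # Publish a new version from $LATEST (assumes the function code was already updated)
    - VERSION=$(aws lambda publish-version --function-name myfunc --query Version --output text)
    # Keep the alias on the current version and send 10% of traffic to the new one
    - |
      aws lambda update-alias --function-name myfunc --name prod \
        --function-version "$CURRENT" \
        --routing-config "AdditionalVersionWeights={\"$VERSION\"=0.1}"
    # Hand the new version number to later jobs
    - echo "NEW_VERSION=$VERSION" > deploy.env
  artifacts:
    reports:
      dotenv: deploy.env
  environment: { name: production }

promote-lambda:
  stage: promote
  needs: [deploy-lambda-canary]
  script:
    # After observation, route 100% to new and clear the weighted split
    - |
      aws lambda update-alias --function-name myfunc --name prod \
        --function-version "$NEW_VERSION" \
        --routing-config "AdditionalVersionWeights={}"
  when: manual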

DB migration pattern (safe for any strategy)

stages: [migrate, deploy, cleanup-migrations]

# Phase 1: Additive migration BEFORE code deploy (backward-compatible)
migrate:
  stage: migrate
  script:
    - alembic upgrade head    # adds new columns, keeps old
  rules:
    - if: $CI_COMMIT_TAG

# Phase 2: Deploy new code (reads/writes both old and new schema)
deploy:
  stage: deploy
  needs: [migrate]
  script: ./deploy.sh
  rules:
    - if: $CI_COMMIT_TAG    # same rule as migrate, so the needs: target always exists

# Phase 3: Cleanup (drop old columns, after all old code is gone) — SEPARATE MR
# This runs in a future pipeline, not the same one
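
To make the "additive only" rule concrete, here is a small sketch using Python's built-in sqlite3 module. The `users` table, the `display_name` column, and the `old_code_*` / `new_code_*` helpers are hypothetical stand-ins for the two application versions that coexist during a rollout; a real migration would go through the migration tool (Alembic in the snippet above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")


def old_code_read(conn):
    """Queries issued by the old version: it only knows the original columns."""
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


def old_code_write(conn, name):
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_read(conn):
    """The new version prefers the new column and falls back to the old one."""
    return conn.execute(
        "SELECT id, COALESCE(display_name, name) FROM users ORDER BY id"
    ).fetchall()


def new_code_write(conn, name, display_name):
    # Keep writing the old column too, so old code can still read these rows.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)", (name, display_name)
    )


# Phase 1: additive migration. The new column is nullable, so existing
# INSERT statements from the old version keep working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# During the rollout both versions serve traffic against the same schema.
old_code_write(conn, "grace")
new_code_write(conn, "linus", "Linus T.")

rows_seen_by_old = old_code_read(conn)
rows_seen_by_new = new_code_read(conn)
```

Had Phase 1 renamed or dropped `name` instead, `old_code_read` would fail as soon as the migration ran, which is the case the DESTRUCTIVE note in the prompt describes.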

Comparison

Aspect	Rolling	Blue/Green	Canary
Rollback speed	Slow (re-deploy)	Fast (flip back)	Fast (flip back)
Infra cost	1×	2× during switch	1.1× during canary
Risk	Medium (some users hit new immediately)	Low (atomic switch)	Lowest (small % first)
Complexity	Low	Medium	High
Best for	Most stateless workloads	Stateless services needing instant rollback	High-risk changes
Requires	Pod replacement support	Two-target infra	Traffic split (LB/mesh)
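
The "Requires" row can be read as a decision order. The sketch below is one possible encoding, not a rule from the original text: the `Workload` fields and the precedence (canary first, then blue/green, then rolling as the default) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    high_risk_change: bool        # would a bad release be costly?
    needs_instant_rollback: bool  # is "flip back" required, not "re-deploy"?
    has_traffic_split: bool       # LB or mesh can weight traffic
    can_double_infra: bool        # budget for two full environments during the switch


def choose_strategy(w: Workload) -> str:
    """Pick a strategy from the prerequisites listed in the comparison table."""
    # Canary only makes sense when a traffic split mechanism exists.
    if w.high_risk_change and w.has_traffic_split:
        return "canary"
    # Blue/green needs two-target infrastructure for the duration of the switch.
    if w.needs_instant_rollback and w.can_double_infra:
        return "blue/green"
    # Rolling is the default for most cases.
    return "rolling"


example = choose_strategy(
    Workload(
        high_risk_change=True,
        needs_instant_rollback=True,
        has_traffic_split=False,
        can_double_infra=True,
    )
)
```

In this example the change is high-risk but there is no traffic split available, so the function falls through to blue/green rather than recommending a canary the platform cannot run.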

Common findings this catches

Canary rolled back on a metric blip → threshold too tight or signal too noisy; tune the SLO query and its window.
Blue/green with shared DB and breaking migration → green crashes; flipping back doesn’t help.
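Rolling deploy rejected or stuck because maxUnavailable: 0 + maxSurge: 0 → impossible math: no pod may be added or removed, and Kubernetes refuses this combination; set at least one above zero.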
Blue/green flip leaves blue running indefinitely → cleanup not running; verify the cleanup job ran.
Canary observation manual-only → engineer-dependent; automate with metric gate.
Lambda canary on alias used by sync clients without retries → if the new version is bad, 10% of calls fail outright; roll back fast.
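
The maxSurge / maxUnavailable finding in the list above comes down to simple arithmetic. As a rough sketch (the function names are illustrative, not a Kubernetes API), this computes the pod-count bounds a Deployment rollout works within, using the Kubernetes rounding rules: percentage maxSurge rounds up, percentage maxUnavailable rounds down.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute number or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        pct = int(value[:-1])
        if round_up:
            return -(-replicas * pct // 100)  # integer ceiling
        return replicas * pct // 100          # integer floor
    return int(value)


def rolling_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count limits that apply while the rollout is in progress."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Nothing may be added and nothing may be removed: no progress is possible.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


default_bounds = rolling_bounds(10)           # Deployment defaults: 25% / 25%
zero_downtime = rolling_bounds(4, max_surge=1, max_unavailable=0)
```

With `maxUnavailable: 0` the rollout never drops below the desired replica count, so it needs `maxSurge` of at least 1 and enough spare cluster capacity for the extra pod.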

When to escalate

Strategy choice doesn’t fit infra capabilities — coordinate with platform team; may need LB / mesh changes.
DB migration ordering issues — engage DBA team; backward-compat may require multi-deploy plan.
Metric-gated rollback false-positives — SRE team for SLO tuning.
