Progressive Delivery in GitLab CI: Canary and Blue-Green Deploys
Big-bang deploys are how you get paged. Here is how I build canary and blue-green rollouts in GitLab CI, with AI drafting the weight-shifting logic safely.
- #gitlab
- #ci-cd
- #deployments
- #kubernetes
The deploy that taught me about progressive delivery rolled out a new release to 100% of traffic at once, hit a memory leak that only showed under real load, and took down the whole service in four minutes. If I’d shifted 10% of traffic first and watched it, the blast radius would have been a tenth the size. GitLab CI doesn’t have a magic “canary” button, but it gives you everything you need to build canary and blue-green flows by hand — and the YAML, which is the tedious part, is exactly where AI earns its keep. Here’s how I structure both.
Canary vs. blue-green, briefly
Canary sends a small slice of traffic (5–10%) to the new version, holds, watches metrics, then ramps up. Blue-green stands up a full second environment (green), tests it, then flips all traffic from the old (blue) to the new in one switch — with the old kept warm for instant rollback. Canary minimizes blast radius; blue-green minimizes rollback time. I pick based on whether I fear bad releases more than I fear slow rollbacks.
A canary pipeline shape
The pattern is: deploy canary, pause for observation, then either promote or abort. In GitLab terms that’s three jobs, an automatic deploy followed by two manual gates:
stages: [build, canary, promote, cleanup]

# Every job below runs helm, so set the image once for all of them
default:
  image:
    name: alpine/helm:3.15
    entrypoint: [""]   # the image's entrypoint is helm; clear it so the runner gets a shell

deploy-canary:
  stage: canary
  script:
    - helm upgrade --install api-canary ./charts/api
      --set image.tag="$CI_COMMIT_SHA"
      --set replicaCount=1
      --set canary.weight=10
      --namespace production
  environment:
    name: production/canary
    url: "https://app.example.com"
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH

promote-canary:
  stage: promote
  needs: ["deploy-canary"]
  script:
    - helm upgrade --install api ./charts/api
      --set image.tag="$CI_COMMIT_SHA"
      --set canary.weight=100
      --namespace production
  environment:
    name: production
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual

abort-canary:
  stage: promote
  needs: ["deploy-canary"]
  script:
    - helm rollback api-canary --namespace production || true
    - helm uninstall api-canary --namespace production
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual
deploy-canary ships 10% weight automatically. Then a human (or a metrics check, below) decides between promote-canary and abort-canary. Both are manual so the pipeline waits at the decision point instead of barreling forward.
Making the canary decision data-driven
A human staring at a Grafana tab is fine, but you can do better by querying your metrics backend and failing the gate automatically if error rate spikes:
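canary-analysis:
  stage: promote
  needs: ["deploy-canary"]
  image: alpine:3.20
  before_script:
    - apk add --no-cache curl jq   # curlimages/curl does not include jq
  script:
    - |
      # Share of responses that were 5xx over the last 5 minutes (a ratio, not requests/sec)
      QUERY='sum(rate(http_requests_total{job="api-canary",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="api-canary"}[5m]))'
      ERRORS=$(curl -sf "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
        | jq -r '.data.result[0].value[1] // "0"')
      echo "Canary 5xx ratio: $ERRORS"
      # awk exits 0 when the ratio exceeds 0.05 (5%), which fails the job
      if awk "BEGIN{exit !($ERRORS > 0.05)}"; then
        echo "Error rate too high, failing canary"; exit 1
      fi
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH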
Make promote-canary depend on canary-analysis passing by adding canary-analysis to its needs list. Now the promotion can’t happen if the canary is unhealthy. $PROM_URL comes from a CI/CD variable rather than being hardcoded in the YAML.
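A minimal sketch of that wiring, reusing the job names from the pipeline above. The only change from the earlier promote-canary is the needs list; because both jobs sit in the promote stage, this assumes a GitLab version that allows needs between jobs in the same stage.

```yaml
promote-canary:
  stage: promote
  # Gate on the analysis job: promotion cannot run until it has succeeded
  needs: ["deploy-canary", "canary-analysis"]
  script:
    - helm upgrade --install api ./charts/api
      --set image.tag="$CI_COMMIT_SHA"
      --set canary.weight=100
      --namespace production
  environment:
    name: production
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual   # still a human click, but only after a green analysis
```

Keeping when: manual here is deliberate: the analysis can veto a promotion, but it never triggers one on its own.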
Pro Tip: Always give your canary a real hold period. A sleep 300 before analysis, or simply pausing on the manual gate, lets enough traffic flow to make the metrics meaningful. A five-minute rate window queried seconds after the deploy is mostly empty, and a gate that passes on empty data is worse than no gate.
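One alternative to sleep 300 is GitLab's delayed jobs, which hold the job for a fixed period without tying up a runner. The sketch below shows only the scheduling part; the five-minute value is an assumption to tune against your traffic volume, and the script line is a placeholder for the Prometheus check.

```yaml
canary-analysis:
  stage: promote
  needs: ["deploy-canary"]
  script:
    - echo "run the Prometheus error-rate check here"   # placeholder
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: delayed
      start_in: 5 minutes   # hold period before the analysis starts (assumed value)
```

The hold period should be at least as long as the rate window in the query, so the window is filled with canary traffic by the time it is evaluated.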
Blue-green with an environment swap
Blue-green stands up the inactive color, smoke-tests it, then repoints the service selector:
stages: [build, deploy-green, switch, decommission]

deploy-green:
  stage: deploy-green
  script:
    - helm upgrade --install api-green ./charts/api
      --set image.tag="$CI_COMMIT_SHA"
      --set color=green
      --namespace production
    - ./smoke-test.sh https://green.internal.example.com
  environment:
    name: production/green
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH

switch-traffic:
  stage: switch
  needs: ["deploy-green"]
  script:
    - kubectl patch service api -p '{"spec":{"selector":{"color":"green"}}}' -n production
  environment:
    name: production
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual

decommission-blue:
  stage: decommission
  needs: ["switch-traffic"]
  script:
    - helm uninstall api-blue --namespace production || true
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual
The traffic switch is a single kubectl patch of the service selector — instant cutover. Crucially, decommission-blue is a separate, manual job. Keep blue alive until you’re confident; rolling back is just patching the selector back. Don’t combine switch and decommission, or you lose your safety net.
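The pipeline above hardcodes green as the new color, which only holds for every other release. A sketch of one way to find the inactive color at run time: read the live Service's selector and pass the opposite color to later jobs through a dotenv report. The job name detect-color, the variable TARGET_COLOR, and the file color.env are hypothetical, and the job assumes a runner image with kubectl and cluster access.

```yaml
detect-color:
  stage: build
  script:
    - |
      # Which color does the live Service select right now?
      ACTIVE=$(kubectl get service api -n production -o jsonpath='{.spec.selector.color}')
      # Deploy to the other one
      if [ "$ACTIVE" = "green" ]; then TARGET=blue; else TARGET=green; fi
      echo "Active: $ACTIVE, deploying to: $TARGET"
      echo "TARGET_COLOR=$TARGET" > color.env
  artifacts:
    reports:
      dotenv: color.env   # exposes TARGET_COLOR to jobs that depend on this one
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```

Later jobs could then use $TARGET_COLOR in place of the literal green in the Helm release name, the color value, and the selector patch.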
Rollback is the feature
The entire point of progressive delivery is that rollback is fast and boring. For canary: run abort-canary. For blue-green: patch the selector back to blue. I write both rollback paths before I write the forward path, because the rollback is what saves me at 3am. A deploy strategy without a tested rollback isn’t a strategy, it’s a hope.
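For blue-green, that rollback path can be written down as its own job rather than left as a command to remember. A minimal sketch, assuming the same Service and color labels as the pipeline above; the job name rollback-to-blue is hypothetical.

```yaml
rollback-to-blue:
  stage: switch
  needs: ["switch-traffic"]
  script:
    # Point the Service back at the old color; blue must still be running
    - kubectl patch service api -p '{"spec":{"selector":{"color":"blue"}}}' -n production
  environment:
    name: production
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual
      allow_failure: true   # optional job, so an unplayed rollback does not block the pipeline
```

This only works while blue is still deployed, which is one more reason to keep decommission-blue a separate manual step.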
Stopping a stale canary with environment actions
A subtle failure mode: you ship a canary, get distracted, and a newer pipeline ships its own canary on top. Now two canary versions are live and your metrics are meaningless. GitLab’s environment on_stop setting helps by naming a teardown job that GitLab runs whenever the environment is stopped, whether someone stops it from the UI or an automatic stop fires:
deploy-canary:
  stage: canary
  script:
    - helm upgrade --install api-canary ./charts/api --set canary.weight=10 -n production
  environment:
    name: production/canary
    on_stop: stop-canary
  rules:
    # Match the stop job's condition so both jobs exist in the same pipelines
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH

stop-canary:
  stage: canary
  variables:
    GIT_STRATEGY: none
  script:
    - helm uninstall api-canary --namespace production || true
  environment:
    name: production/canary
    action: stop
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual
The action: stop job tears the canary down cleanly, and pairing it with on_stop means the environment has a defined “off” state instead of lingering forever. The GIT_STRATEGY: none matters — a stop job doesn’t need a checkout and shouldn’t fail if the branch is gone. I treat every progressive-delivery environment as something that must be stoppable, not just startable; a deploy you can’t cleanly tear down is a deploy you can’t safely iterate on.
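To make the teardown happen without anyone remembering to click, the environment can also carry an expiry. A sketch using auto_stop_in, which stops the environment after the given period and runs the on_stop job; the one-hour value is an assumption and should be longer than your hold and analysis window.

```yaml
deploy-canary:
  stage: canary
  script:
    - helm upgrade --install api-canary ./charts/api --set canary.weight=10 -n production
  environment:
    name: production/canary
    on_stop: stop-canary
    auto_stop_in: 1 hour   # assumed value; a forgotten canary is torn down after this
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```

Each new deployment to the environment resets the timer, so an active rollout is not cut short by an earlier pipeline's deadline.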
Where AI fits — and where it doesn’t
This YAML is verbose and repetitive across the two colors or the promote/abort pair, which makes it perfect for AI drafting. It reliably generates the job skeletons, the Helm flags, and the Prometheus query syntax. It’s a fast junior engineer: quick, broadly correct, occasionally confidently wrong.
Where it’s wrong: it will sometimes wire promote to run automatically after the canary instead of gating it, quietly defeating the whole design. It also tends to forget that the decommission step should be manual. And it doesn’t know your real error-rate threshold — 5% might be wildly wrong for your service. So I read every gate and every threshold before merge.
The hard rule stays hard: never paste your kubeconfig, Prometheus credentials, or registry creds into a chat to debug a rollout. Share the job YAML and the logs. When a canary goes sideways mid-rollout, lean on your runbook (the incident response dashboard is built for that scramble) and confirm health afterward via the monitoring alerts dashboard.
My reusable prompt: “Draft a GitLab CI canary deploy: ship 10% weight automatically, then a metrics-analysis job that queries Prometheus for 5xx rate and fails if above a threshold, then a manual promote and a manual abort. Keep promote gated behind the analysis. Flag every threshold I need to tune.” Starter versions are in my prompt library and the deployment prompt packs.
Conclusion
GitLab CI gives you the primitives — environments, manual gates, needs, and scripts — to build real canary and blue-green deploys without a fancy add-on. Canary shrinks blast radius; blue-green shrinks rollback time; both demand a tested rollback path written first. Let AI draft the verbose YAML, then verify your gates stay gated, your thresholds are real, and your secrets never leave your settings. More in the GitLab CI/CD category.