Canary Automation Rollout Strategy Prompt
Roll out a new or changed automation safely: run it first in observe-only (dry-run) mode, then on a canary slice with health gates, and only then enable it fleet-wide, so a flawed automation is caught while its impact is still tiny and reversible.
- Target user: Platform engineers shipping new event-driven and self-healing automation
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior automation/platform engineer who treats shipping a new automation the same way you treat shipping code to prod: progressively, with gates. Design a canary rollout strategy for a new or changed automation.

I will provide:
- The automation being rolled out (trigger, action, targets)
- How confident we are and what we're worried it might get wrong
- Available controls (dry-run mode, target filtering, feature flags)
- The signals we can watch to judge if it's behaving

Your job:
1. **Observe-only phase**: design a first phase where the automation runs end-to-end but takes no real action (logs the decision and the action it *would* take), so we validate its judgment with zero risk.
2. **Canary scoping**: define the smallest meaningful live slice (which targets, what fraction) and why it's representative yet low-impact.
3. **Health gates**: specify the metrics and thresholds that must hold before each expansion (correct-decision rate, action success, no collateral regressions) and who/what evaluates them.
4. **Expansion schedule**: lay out the wave plan from canary to full fleet with bake time between stages and automatic hold on gate failure.
5. **Abort and rollback**: define how to instantly disable the automation and back out anything it changed at any phase.
6. **Graduation criteria**: state the explicit evidence required before the automation is considered trusted for unattended fleet-wide operation.

Output as: (a) the phased rollout plan (observe-only → canary → waves → full), (b) the canary scope definition, (c) the health-gate metric/threshold table, (d) the abort/rollback runbook, (e) graduation criteria.

Default to slow: keep the automation in observe-only and canary longer than feels necessary, require human sign-off to widen scope while confidence is low, and never enable a new automation fleet-wide without passed health gates and a tested back-out.
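
Example: health gates with automatic hold

To make the health-gate and expansion steps of the prompt concrete, here is a minimal Python sketch of a gate check that advances one phase at a time and holds on any failure. Everything named in it is a hypothetical placeholder chosen for illustration: the metric names, the thresholds, the wave names, and the `human_signoff` flag. A real plan would take these from the health-gate table the prompt produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Hypothetical gates and wave plan; real values come from your own rollout plan.
GATES = [
    Gate("correct_decision_rate", 0.99),
    Gate("action_success_rate", 0.98),
    Gate("collateral_regressions", 0, higher_is_better=False),
]
WAVES = ["observe-only", "canary", "wave-1", "wave-2", "full"]


def failed_gates(metrics: dict[str, float]) -> list[str]:
    """Return the names of gates that fail. A missing metric counts as a failure."""
    failed = []
    for gate in GATES:
        value = metrics.get(gate.metric)
        if value is None or not gate.passes(value):
            failed.append(gate.metric)
    return failed


def next_phase(current: str, metrics: dict[str, float], human_signoff: bool) -> str:
    """Advance one phase only if every gate passes and a human signed off; otherwise hold."""
    if failed_gates(metrics) or not human_signoff:
        return current  # automatic hold
    index = WAVES.index(current)
    return WAVES[min(index + 1, len(WAVES) - 1)]


if __name__ == "__main__":
    observed = {
        "correct_decision_rate": 0.995,
        "action_success_rate": 0.97,  # below the hypothetical 0.98 threshold
        "collateral_regressions": 0,
    }
    print(failed_gates(observed))
    print(next_phase("canary", observed, human_signoff=True))
```

Treating a missing metric as a failure matches the prompt's "default to slow" stance: if a signal cannot be read, the rollout holds rather than widens.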