Long-Running Workflow Versioning and Safe Migration Design Prompt
Design a versioning and migration strategy for long-running orchestration workflows (Temporal, Step Functions, Cadence, Airflow) so that deploying new workflow code does not break in-flight executions started under the old definition, using version gates, drain windows, or parallel definitions.
- Target user: Platform engineers evolving durable orchestration workflows
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior orchestration engineer who has corrupted thousands of in-flight workflow executions by deploying incompatible code that no longer matched the recorded event history on replay.

I will provide:
- The orchestration engine and how it persists/replays workflow state
- The workflow change being made (new step, reordered logic, changed signal/activity)
- How many executions are typically in-flight and how long they run
- Deployment mechanism and rollback capability

Your job:
1. **Compatibility classification**: determine whether the change is replay-safe (additive) or breaking (reordered or removed steps, newly introduced non-determinism) for the engine's replay model.
2. **Versioning mechanism**: choose the engine-native approach (version gates/patching, workflow ID versioning, or parallel task queues) and show exactly where the version branch goes in the code.
3. **In-flight handling**: define whether old executions drain on the old definition, are migrated, or are gated, and how new executions pick up the new version.
4. **Drain and cutover plan**: specify a drain window for long-running executions and the order of deploy steps so old and new can coexist safely.
5. **Determinism guardrails**: list the changes that must never be made in place (non-deterministic calls, reordering) and how to detect non-determinism errors in canary.
6. **Rollback**: describe how to revert the deploy without orphaning executions started under the new version.
7. **Validation**: define a canary on a small task queue or execution subset with replay tests against real histories before fleet rollout.

Output as: a change-compatibility verdict, a versioning-code outline, a drain/cutover runbook, and a rollback + canary plan.

Treat any breaking change to a definition with active executions as high-risk: require version gating or a full drain, a canary against replayed histories, and a documented back-out before deploying to the production task queue.
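
As a reference point for the "versioning-code outline" and the replay tests the prompt asks for, here is a minimal, engine-agnostic sketch in Python. It is a toy model, not the API of Temporal, Cadence, or any other engine: `WorkflowContext`, `get_version`, `replay`, and the `order_workflow_*` functions are all hypothetical names invented for illustration. It assumes a replay model in which re-executed code must issue the same commands, in the same order, as the recorded history.

```python
from dataclasses import dataclass, field


class NonDeterminismError(Exception):
    """Replayed code issued a command that differs from the recorded history."""


@dataclass
class WorkflowContext:
    """Toy replay context (hypothetical; not a real engine's API)."""
    history: list = field(default_factory=list)  # events recorded by earlier runs
    cursor: int = 0                              # next history position to replay

    def activity(self, name):
        event = ("activity", name)
        if self.cursor < len(self.history):
            # Replaying: the command must match what was recorded.
            expected = self.history[self.cursor]
            if expected != event:
                raise NonDeterminismError(
                    f"history has {expected!r}, code produced {event!r}")
        else:
            # Past the end of history: this is new progress, so record it.
            self.history.append(event)
        self.cursor += 1

    def get_version(self, change_id, default_version, max_version):
        """Version gate: reuse a pinned version, or pin max_version on first visit."""
        if self.cursor < len(self.history):
            event = self.history[self.cursor]
            if event[0] == "version" and event[1] == change_id:
                self.cursor += 1
                return event[2]
            # This point was passed before the gate existed: keep the old path.
            return default_version
        self.history.append(("version", change_id, max_version))
        self.cursor += 1
        return max_version


def order_workflow_unsafe(ctx):
    # Breaking change made in place: a step inserted between recorded steps.
    ctx.activity("charge_card")
    ctx.activity("check_fraud")
    ctx.activity("ship_order")


def order_workflow_gated(ctx):
    ctx.activity("charge_card")
    # The version branch goes exactly where the new behavior starts.
    if ctx.get_version("add-fraud-check", default_version=0, max_version=1) >= 1:
        ctx.activity("check_fraud")
    ctx.activity("ship_order")


def replay(workflow_fn, recorded_history):
    """Replay test: run the code against a copy of a recorded history."""
    ctx = WorkflowContext(history=list(recorded_history))
    workflow_fn(ctx)
    return ctx.history


if __name__ == "__main__":
    # History of an execution that completed under the original definition.
    old_history = [("activity", "charge_card"), ("activity", "ship_order")]
    print("gated replay:", replay(order_workflow_gated, old_history))
    try:
        replay(order_workflow_unsafe, old_history)
    except NonDeterminismError as err:
        print("unsafe replay failed:", err)
    print("new execution:", replay(order_workflow_gated, []))
```

In this sketch, the gated definition replays an old history unchanged, while the in-place edit fails at the first diverging command, which is the kind of signal a canary replay run is meant to surface. A new execution records the version marker in its own history, so it stays on the new path on every later replay. Real engines differ in how markers are stored and matched, so the engine's own versioning and replay-testing facilities should be used in practice.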