DevOps AI ToolKit
AI for Automation | Difficulty: Advanced | Claude, ChatGPT

Multi-Step Ops Workflow Checkpoint Orchestration Prompt

Orchestrate a long, multi-step operational workflow (migration, rollout, recovery) so it is restartable from durable checkpoints, compensates partial progress on failure, and never leaves the system in an unknown half-applied state when a step crashes mid-flight.

Target user
Platform engineers orchestrating long-running ops workflows
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a platform engineer who orchestrates long ops workflows that span many systems and minutes-to-hours of wall time. Your obsession is the question: "if this dies at step 6 of 12, what state are we in and how do we recover?" — and you design so the answer is always knowable.

I will provide:
- The workflow's ordered steps and which systems each touches
- Which steps are idempotent, which have side effects, and which are irreversible
- Expected duration, failure modes, and concurrency (can two runs execute against the same target at once?)
- The orchestrator available (Temporal, Argo, Step Functions, custom)

Your tasks:

1. **Step contract** — for each step, define inputs, the durable state it records on success, and whether re-running it is safe (idempotency key or natural no-op).

2. **Checkpointing** — specify the durable checkpoint after each step so a crashed run resumes from the last completed step rather than restarting from zero or double-applying.

3. **Compensation** — for steps with side effects, define the compensating action to unwind them, and the order to apply compensations if the workflow aborts partway (saga-style).

4. **Failure policy per step** — retry-with-backoff, escalate-to-human, or compensate-and-abort; justify each choice by reversibility and blast radius.

5. **Concurrency and locking** — prevent two runs touching the same target; specify the lock, its TTL, and what happens to an orphaned lock.

6. **Observability** — emit a per-step audit event (start, outcome, checkpoint, compensation) so an operator can see exactly where a run is and what it has changed.

Output as: (a) the step table with idempotency and side-effect tags, (b) the checkpoint and resume design, (c) the compensation/saga ordering, (d) the per-step failure policy, (e) the locking and audit design.

Reject any orchestration that restarts from zero on resume, that has no compensation for side-effecting steps, or that can run twice concurrently against the same target.
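
Example: what a checkpoint-and-compensate design can look like

To make tasks 2 and 3 concrete, here is a minimal sketch of the kind of design a good response might describe for a custom orchestrator. Everything in it is an assumption made for illustration: the names `run_workflow`, `compensate`, `load_checkpoint`, and `save_checkpoint` are hypothetical, and the checkpoint store is a local JSON file rather than a real durable store. Managed orchestrators such as Temporal, Argo, or Step Functions persist workflow state through their own mechanisms, so treat this only as a picture of the behavior to ask for, not as how any of them work internally.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical step shape: (name, action, compensating action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]
State = Dict[str, Any]


def load_checkpoint(path: str) -> State:
    """Return the recorded state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"completed": [], "aborting": False}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path: str, state: State) -> None:
    """Write to a temp file, then rename, so a crash never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replace on the same filesystem


def compensate(steps: List[Step], state: State, path: str) -> str:
    """Undo completed steps in reverse order, checkpointing after each undo."""
    undo = {name: comp for name, _, comp in steps}
    state["aborting"] = True
    save_checkpoint(path, state)
    while state["completed"]:
        undo[state["completed"][-1]]()
        state["completed"].pop()
        save_checkpoint(path, state)
    return "compensated"


def run_workflow(steps: List[Step], path: str) -> str:
    """Run steps in order, skipping any that the checkpoint marks as done."""
    state = load_checkpoint(path)
    if state["aborting"]:
        # A previous run died while unwinding: finish unwinding, never go forward.
        return compensate(steps, state, path)
    for name, action, _ in steps:
        if name in state["completed"]:
            continue  # resume from the last completed step, not from zero
        try:
            action()
        except Exception:
            return compensate(steps, state, path)
        state["completed"].append(name)
        save_checkpoint(path, state)
    return "done"
```

Calling `run_workflow(steps, "run-42.json")` a second time after a crash (the file name is a placeholder) skips every step already recorded as completed. Note the gap the sketch leaves open: a process that dies after a step's side effect but before its checkpoint is written will re-run that step on resume, and the same applies to a compensation that dies before its checkpoint. That gap is why the step contract in task 1 asks for an idempotency key or a natural no-op on every step.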
