Argo Workflows for Ops Pipelines: Robust DAGs With AI Help

The first Argo Workflow I shipped to production worked beautifully in the demo and orphaned a cloud volume the first time it failed in anger. Step two created a temporary EBS volume. Step three processed it. Step four — the cleanup — deleted it. When step three failed, the workflow stopped, step four never ran, and the volume sat there accruing charges until a cost alert found it three weeks later. The bug was not in any single step. It was in my mental model: I had treated the DAG’s last node as cleanup, when cleanup needs to run on failure, which is the one path the last node never sees.

Argo Workflows is a superb engine for Kubernetes-native ops pipelines — batch jobs, data processing, multi-step deployments — and most of its sharp edges are exactly this kind: things that work on the happy path and bite on the failure path. AI is genuinely useful for drafting Argo manifests, because the YAML is verbose and the patterns are well-known. But the engine’s failure semantics are where you have to stay in the driver’s seat.

Model the DAG, Then Model Its Failures

A dag template expresses tasks and their dependencies, and Argo automatically runs tasks in parallel once their dependencies are satisfied. The drafting is mechanical, which is why handing it to a model works well:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: provision-and-process-
spec:
  entrypoint: main
  onExit: cleanup            # runs on success AND failure — this is the fix
  podGC:
    strategy: OnWorkflowCompletion
  templates:
    - name: main
      dag:
        tasks:
          - name: provision
            template: provision-volume
          - name: process
            template: process-data
            dependencies: [provision]
          - name: publish
            template: publish-results
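            dependencies: [process]
    # The provision-volume, process-data, publish-results, and cleanup
    # templates belong in this same list; they are omitted here.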

The load-bearing line is onExit: cleanup. An exit handler runs whether the workflow succeeds or fails, which is the only place cleanup belongs. Putting teardown in a final DAG task — the mistake I made — means it silently never runs when an earlier step fails, which is precisely when an orphaned resource is created. When you ask a model to draft an Argo pipeline, this is the first thing to check in its output, because the naive draft almost always appends cleanup as a node rather than wiring it as an exit handler.
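The manifest above names a cleanup template without showing it. A minimal sketch of what that exit handler could look like follows. It assumes the provisioning step tags the volume with the workflow name under a hypothetical tag key, argo-workflow. It also assumes the public amazon/aws-cli image, with AWS credentials and region supplied to the pod some other way. None of these details come from the original pipeline.

```yaml
    # Sketch of the exit-handler template referenced by `onExit: cleanup`.
    # Assumptions: volumes are tagged argo-workflow=<workflow name> at creation
    # (hypothetical tag key), and the pod already has AWS credentials and region.
    - name: cleanup
      container:
        image: amazon/aws-cli              # assumed image; pin a specific tag in practice
        command: [sh, -c]
        env:
          - name: WF_NAME
            value: "{{workflow.name}}"     # Argo global variable
          - name: WF_STATUS
            value: "{{workflow.status}}"   # populated inside exit handlers
        args:
          - |
            echo "workflow $WF_NAME finished with status $WF_STATUS"
            # Look the volume up by tag, so cleanup does not depend on any
            # step having produced an output before the failure.
            for vol in $(aws ec2 describe-volumes \
                --filters "Name=tag:argo-workflow,Values=$WF_NAME" \
                --query 'Volumes[].VolumeId' --output text); do
              aws ec2 delete-volume --volume-id "$vol"
            done
```

Looking the resource up by tag, rather than passing a volume ID from the provisioning step, is a deliberate choice in this sketch: the handler still works when the workflow dies before any output parameter exists.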

Bound Every Retry and Deadline

Each template should carry its own retry and timeout policy. The danger of leaving these off is a step that retries forever, pinning pods and consuming the cluster while looking, from a distance, like it’s “still working.”

    - name: process-data
      retryStrategy:
        limit: "3"
        retryPolicy: "OnError"      # retry Errors (e.g. a lost node), not a Failed main container
        backoff:
          duration: "30s"
          factor: "2"
      activeDeadlineSeconds: 600
      container:
        image: myorg/processor:1.4.2
        command: [process]

Two distinctions matter here that a model will not make for you unless told. First, retryPolicy: OnError retries steps that end in an Error, typically an infrastructure problem such as a node dying, but not a step whose main container exited non-zero, which Argo records as Failed. You usually do not want to retry a step that deliberately failed a validation. The trade-off is that an unintended crash with a non-zero exit code is not retried either, so choose the policy per step. Second, activeDeadlineSeconds is the backstop for a step that hangs without erroring at all, which no retry count would ever catch. Ask the model to set both on every template and to justify the retry policy per step; then sanity-check that a step which should fail loudly is not quietly being retried into eventual success.

Prompt: “Here is my Argo Workflow with five steps; step two creates a temp resource and step five was meant to clean it up. Convert cleanup into an onExit handler, add per-template retryStrategy and activeDeadlineSeconds, and produce a failure-mode table mapping each step to its retry and cleanup behavior. Flag any step whose retry could re-apply a non-idempotent side effect.”

What it returns: a corrected manifest with onExit cleanup, retry/deadline blocks per template, and a table — and, when prompted well, a callout that the provisioning step’s retry would create a second resource unless keyed on a stable name. That callout is the whole point.

Idempotency Is Not Optional Under Retries

Because Argo re-runs a failed step according to its retry strategy, every step with a side effect must be idempotent. A provisioning step that retries must not create a second resource; key it on a stable, deterministic name derived from the workflow inputs, so the second attempt finds the resource the first attempt made and proceeds. This is the same reasoning behind idempotency keys in API and webhook automation: a retry has to be safe by construction, not by luck. If a step genuinely cannot be made idempotent, it should not be inside a retried template; gate it behind a suspend step and a human, or move it out of the automatic-retry path entirely.
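One way to make the provisioning step safe under retries is to look the resource up before creating it. The sketch below assumes the workflow name is an acceptable stable key; a key derived from input parameters works the same way. The tag key argo-workflow, the availability zone, the size, and the image are all placeholders, not values from the original pipeline.

```yaml
    # Sketch of an idempotent provisioning template: look up, then create.
    # The tag key, zone, size, and image are illustrative assumptions.
    - name: provision-volume
      retryStrategy:
        limit: "2"
      container:
        image: amazon/aws-cli              # assumed image; pin a specific tag in practice
        command: [sh, -c]
        env:
          - name: WF_NAME
            value: "{{workflow.name}}"     # stable across retries of this step
        args:
          - |
            set -e
            # The AWS CLI prints "None" in text mode when the query matches nothing.
            existing=$(aws ec2 describe-volumes \
              --filters "Name=tag:argo-workflow,Values=$WF_NAME" \
              --query 'Volumes[0].VolumeId' --output text)
            if [ -n "$existing" ] && [ "$existing" != "None" ]; then
              echo "reusing volume $existing from an earlier attempt"
              exit 0
            fi
            # Tag at creation time, so a crash right after this call still
            # leaves a volume that the next attempt can find.
            aws ec2 create-volume \
              --availability-zone us-east-1a --size 10 \
              --tag-specifications "ResourceType=volume,Tags=[{Key=argo-workflow,Value=$WF_NAME}]"
```

The same tag doubles as the lookup key for teardown, so provisioning and cleanup agree on which resource belongs to which run.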

For irreversible actions, add a manual approval gate. Argo’s suspend template pauses the workflow until a human resumes it, which is the right primitive for “do not delete the production database without someone confirming.” Pair this with the broader patterns in Temporal saga compensation when a multi-step pipeline needs to unwind partial work rather than just stop.
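A minimal sketch of such a gate inside a DAG is shown here. The template names gated-delete, wait-for-approval, and delete-database are hypothetical, and the destructive template itself is left out.

```yaml
    # Sketch: a manual approval gate ahead of an irreversible task.
    # All template names here are hypothetical.
    - name: gated-delete
      dag:
        tasks:
          - name: approve
            template: wait-for-approval
          - name: delete
            template: delete-database      # destructive template, not shown
            dependencies: [approve]

    - name: wait-for-approval
      suspend: {}                          # pauses here until the workflow is resumed
```

A paused workflow can be released with argo resume followed by the workflow name. Keeping retryStrategy off the destructive template means a human approval covers exactly one attempt.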

Verify by Failing on Purpose

The discipline that would have saved my orphaned volume is simple. Before promoting a workflow, catch manifest errors first with argo lint or a --dry-run submit, then run it end-to-end in a non-production namespace with synthetic inputs, and deliberately fail a middle step. Watch the exit handler fire. Confirm the temporary resource is gone. Confirm a retried step did not duplicate its side effect. None of this shows up when every step succeeds, which is the only scenario a casual test exercises.
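One low-effort way to fail a middle step on purpose is to swap in a stand-in template for the test run. The sketch below replaces process-data with a container that always exits non-zero; the alpine image is an assumption, and any small image with a shell would do.

```yaml
    # Test-only stand-in for process-data: always exits non-zero.
    # Use it in the non-production copy of the workflow, never in the real one.
    - name: process-data
      retryStrategy:
        limit: "3"
        retryPolicy: "OnError"     # a non-zero exit is a Failure, so no retry is expected
      activeDeadlineSeconds: 600
      container:
        image: alpine:3.19         # assumed image with a shell
        command: [sh, -c]
        args: ["echo 'simulated failure for exit-handler test'; exit 1"]
```

With this in place, the process task should end as Failed without retries, publish should never start, and the onExit handler should still run. If any of those three does not hold, the manifest is not ready to promote.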

The division of labor is clear. The model drafts the DAG, the retry blocks, and the failure-mode table faster and more completely than you would by hand. You own the three things it cannot infer: that cleanup belongs in onExit, that retries must be idempotent, and that the only proof of either is a deliberately failed run. For the design-side checklist, see the Argo Workflows DAG pipeline prompt and the rest of the AI for Automation library.