Argo Workflows DAG Pipeline Design Prompt
Design a production Argo Workflows DAG — templated steps, artifact passing, retries and exit handlers, resource limits, and pod cleanup — for batch and CI-style pipelines on Kubernetes.
- Target user: Platform and data engineers building Kubernetes-native pipelines
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a workflow engineer who has built reliable Argo Workflows DAGs that run thousands of times a day without leaking pods or losing artifacts.

I will provide:
- The pipeline I want to model (steps, dependencies, fan-out/fan-in)
- Inputs/outputs between steps (parameters, artifacts, sizes)
- Artifact backend (S3/GCS/MinIO) and any secrets needed
- Scale (concurrency, frequency) and the failure handling I need

Your job:
1. **DAG vs steps**: choose `dag` over `steps` and justify it for this pipeline. Map the dependency graph with `depends` (boolean expressions) rather than just `dependencies`, and show fan-out via `withItems`/`withParam`.
2. **Templating**: structure reusable `templates` (container, script, resource) and a `WorkflowTemplate`/`ClusterWorkflowTemplate` so the DAG references shared steps. Parameterize inputs so the same DAG runs across environments.
3. **Artifact passing**: wire `outputs.artifacts` → `inputs.artifacts` between steps using the configured repository, with explicit paths and compression (`archive`) settings. Note size limits and when to pass a reference instead of the blob.
4. **Retries + idempotency**: set `retryStrategy` (limit, backoff, retryPolicy) per template, and explain why each step must be idempotent because retries re-run it.
5. **Exit handlers + lifecycle**: add an `onExit` template for cleanup/notifications that runs whether the workflow succeeds or fails, and use `templateDefaults` for shared retry/timeout settings.
6. **Resource hygiene**: set `activeDeadlineSeconds`, `podGC` strategy (e.g., `OnWorkflowSuccess`), `ttlStrategy` for completed workflows, and per-step resource requests/limits so a runaway DAG can't starve the cluster.
7. **Concurrency control**: use `synchronization` (mutex/semaphore) to cap parallel runs and prevent a thundering herd against downstream systems.
8. **Observability**: surface step status, artifact links, and failures; define alerts for stuck or failed workflows.

Output as: (a) the DAG WorkflowTemplate YAML, (b) artifact wiring example, (c) retry/exit-handler config, (d) GC + TTL + deadline settings, (e) a concurrency + observability plan. Bias toward idempotent steps and aggressive cleanup over convenience.
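
For orientation, here is a minimal sketch of the kind of `WorkflowTemplate` the prompt is asking for. It is not part of the prompt and not a reference implementation. Every name, image, parameter, and value below (`example-batch-dag`, `alpine:3.20`, the `env` and `chunk` parameters, the timeouts) is a hypothetical placeholder, and the sketch assumes a default artifact repository is already configured for the namespace. It leaves out `templateDefaults`, semaphores, and alerting.

```yaml
# Illustrative sketch only: all names, images, and values are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: example-batch-dag
spec:
  entrypoint: main
  onExit: notify                  # exit handler: runs on success, failure, or error
  activeDeadlineSeconds: 3600     # hard cap on total workflow runtime
  podGC:
    strategy: OnWorkflowSuccess   # pods of failed runs are kept for debugging
  ttlStrategy:
    secondsAfterSuccess: 3600     # delete the Workflow object 1h after success
    secondsAfterFailure: 86400    # keep failed workflows around for 24h
  synchronization:
    mutex:                        # one run at a time; newer Argo releases also offer a `mutexes` list
      name: example-batch-dag
  arguments:
    parameters:
      - name: env
        value: staging
  templates:
    - name: main
      dag:
        tasks:
          - name: extract
            template: extract
          - name: transform
            template: transform
            depends: extract.Succeeded
            withParam: "{{tasks.extract.outputs.parameters.chunks}}"  # fan-out
            arguments:
              parameters:
                - name: chunk
                  value: "{{item}}"
              artifacts:
                - name: raw
                  from: "{{tasks.extract.outputs.artifacts.raw}}"
          - name: summarize
            template: summarize
            depends: transform.Succeeded                              # fan-in
    - name: extract
      retryStrategy: {limit: "3", retryPolicy: OnFailure, backoff: {duration: 30s, factor: "2", maxDuration: 10m}}
      container:
        image: alpine:3.20
        command: [sh, -c]
        args:
          - |
            echo "raw data for {{workflow.parameters.env}}" > /tmp/raw.txt
            echo '["a","b","c"]' > /tmp/chunks.json
        resources:
          requests: {cpu: 100m, memory: 64Mi}
          limits: {cpu: 500m, memory: 128Mi}
      outputs:
        parameters:
          - name: chunks              # JSON list consumed by withParam above
            valueFrom:
              path: /tmp/chunks.json
        artifacts:
          - name: raw
            path: /tmp/raw.txt
            archive:
              tar:
                compressionLevel: 6   # gzip level for the uploaded artifact
    - name: transform
      retryStrategy: {limit: "3", retryPolicy: OnFailure, backoff: {duration: 30s, factor: "2", maxDuration: 10m}}
      inputs:
        parameters:
          - name: chunk
        artifacts:
          - name: raw
            path: /tmp/raw.txt
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["echo processing chunk {{inputs.parameters.chunk}}; wc -c /tmp/raw.txt"]
        resources:
          requests: {cpu: 100m, memory: 64Mi}
          limits: {cpu: 500m, memory: 128Mi}
    - name: summarize
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["echo all chunks processed"]
    - name: notify
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["echo {{workflow.name}} finished with status {{workflow.status}}"]
```

The retry policy sits on the leaf templates rather than on the DAG template, so a failed chunk is retried on its own without re-running the whole graph. This only works safely if `extract` and `transform` are idempotent, which is the property the prompt asks the model to justify.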