GitLab CI/CD retry:when Exit-Code Retry Policy Matrix Prompt
Design a precise retry policy using retry:when and retry:exit_codes so transient infrastructure failures auto-retry while genuine test or compile failures fail fast — instead of blanket retry: 2 that masks real bugs and burns minutes.
- Target user
- Engineers who want resilient pipelines without hiding real failures
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a GitLab CI reliability engineer who tunes retry behavior so pipelines recover from transient infrastructure faults but never silently re-run a job that failed for a real reason. I will provide: - The job(s) and what they do (test, build, deploy, integration) - The failure classes I actually see (runner system failure, image pull timeout, network blip, OOM, real test failure, compile error) - My current retry config (often a blanket `retry: 2`) Your job: 1. **Classify failures** — map each failure I describe to a GitLab `retry:when` value: `runner_system_failure`, `stuck_or_timeout_failure`, `api_failure`, `scheduler_failure`, `data_integrity_failure`, `unknown_failure`, `script_failure`, `archived_failure`. Be explicit that `script_failure` is the dangerous one (it covers real test/compile failures) and should rarely be retried. 2. **Build the retry block** — write a job-level `retry:` with `max:` and a `when:` list that retries ONLY transient classes. Show the YAML. 3. **Exit-code targeting** — where a tool returns a deterministic transient exit code (e.g., [TOOL] returns [CODE] on a flaky dependency fetch), add `retry:exit_codes:` so retries are scoped to that signal rather than all script failures. 4. **Backoff reality** — explain that GitLab retries immediately with no built-in backoff; if I need backoff for a flaky external API, show a `before_script`/wrapper pattern that sleeps and re-checks before the job itself fails. 5. **Anti-masking guardrail** — for test jobs, recommend retrying at most transient classes and surfacing a warning so flaky tests get quarantined, not hidden. Tie this to a follow-up step. 6. **Verify** — give me a way to confirm the policy: trigger a forced transient failure and a forced real failure, and state the expected retry count for each. Output: (a) the failure→`when` mapping table, (b) the per-job `retry:` blocks with `max`, `when`, and `exit_codes` where relevant, (c) the backoff wrapper if needed, (d) the verify procedure. Bias toward: never retrying `script_failure` blindly; scoping retries to transient classes and specific exit codes.
Why this prompt works
Most teams reach for retry: 2 the first time a pipeline fails for a reason that wasn’t their fault — a runner died, an image pull timed out, a registry hiccuped. The problem is that a bare retry: count retries every failure class, including script_failure, which is exactly the class that means “your tests or your build actually failed.” That turns retry into a bug-masking machine: a real failure that happens to pass on a flaky re-run sails into main, and your compute bill quietly doubles. This prompt forces the model to start from failure classification rather than a blanket count, which is the only way to get retry behavior that helps reliability without hurting correctness.
The prompt is built around GitLab’s actual retry:when vocabulary and the newer retry:exit_codes feature, so the output maps cleanly onto real config instead of vague advice. By making the model produce a failure→class table first, you get a paper trail explaining why each retry condition is allowed, which is invaluable when a teammate later asks why a deploy job retries on api_failure but a unit-test job does not. The exit-code targeting step is the precision tool: when you know a specific external dependency returns a deterministic transient code, you can retry that and nothing else.
It also closes the two gaps that trip people up in practice. First, GitLab has no native retry backoff, so naively retrying a job against a still-degraded dependency just burns three immediate attempts — the prompt makes the model address that explicitly with a wrapper. Second, retrying test jobs can hide flakiness forever, so the guardrail pushes flaky tests toward quarantine instead of permanent silent retries. The result is a retry policy you can defend in review, not a reflex you’ll regret during the next incident.