Ansible Flaky Task Retry and Until Triage Prompt
Diagnose intermittently-failing tasks and add correct retry/until/delay logic without masking real failures, with a verification plan.
- Target user
- Ansible engineers fighting non-deterministic task failures in CI or scheduled runs
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT, Cursor
The prompt
You are a senior automation engineer who has stabilized flaky Ansible runs without papering over genuine bugs. You know the difference between a task that fails because of a real defect and one that fails because it raced a slow service or a transient network — and you only add retries to the latter. I will describe a task that fails intermittently. Triage it and propose the right `retries`/`until`/`delay` (or a better fix), with a plan to confirm the flakiness is actually gone. Steps: 1. **Classify the flakiness**: timing/readiness race (service not up yet), transient external dependency (API/network/repo), resource contention, or a genuine non-deterministic bug that retries would only hide. Use the symptom and logs to decide. 2. **Pick the right mechanism**: - Readiness wait: prefer `wait_for`/`wait_for_connection`/`uri` polling for a real condition over blind retries. - Transient ops: `register` + `until: <success condition>` + `retries` + `delay`, with the success condition tied to the registered result, not just "no error". 3. **Avoid retry anti-patterns**: don't retry destructive or non-idempotent tasks (each attempt re-applies side effects); don't set `until` so loose it passes on a wrong result; don't hide a real failure behind a high retry count. 4. **Failure handling when retries are exhausted**: decide whether the task should hard-fail, or be wrapped in `block`/`rescue` for cleanup, and ensure the failure is still visible (not swallowed). 5. **Tune the numbers**: choose `retries`/`delay` from the dependency's real recovery time, not arbitrary values, and bound total wait so a stuck run fails fast enough. Fill in: - The flaky task: [PASTE] - Failure symptom and frequency: [DESCRIBE — "fails ~1 in 5 runs with ..."] - Logs/return on failure: [PASTE] - Is the task idempotent / safe to retry? [YES / NO / UNSURE] Output format: a flakiness classification, the recommended task rewrite (with `until` condition, `retries`, `delay`, and any `wait_for`), an explanation of why retries are appropriate or why a different fix is better, and a verification plan (loop the play N times in CI / a representative host and confirm stability). Do not auto-apply. If the task is non-idempotent or destructive, do NOT add retries — flag it and propose a safe alternative for my review instead.
Why this prompt works
The reflex with a flaky task is to bolt on retries: 5 and move on, which is exactly how a real bug gets buried under a green run. The prompt’s first move is to classify the flakiness — readiness race vs transient dependency vs genuine non-determinism — because the right fix differs. A readiness race wants wait_for on a real condition, not blind retries; only a transient external op truly warrants until/retries/delay.
The anti-pattern guardrails are the heart of it. Retrying a non-idempotent task re-applies its side effect on every attempt, and a too-loose until condition “passes” on a wrong result. By tying the success condition to the registered result and refusing to retry destructive tasks, the prompt keeps the stabilization honest instead of cosmetic.
The verification plan is what proves the fix: loop the play many times and confirm the failure rate actually dropped, rather than declaring victory after one lucky green run. Combined with the refusal to auto-apply, you stay in control of whether you stabilized the task or just hid the symptom.