Ansible Error Guide: 'FAILED - RETRYING' Until Loop Retries Exhausted
Fix Ansible's 'FAILED - RETRYING ... (retries left)' loop that ends in failure: diagnose until/retries conditions, slow services, wrong success checks, and timeouts.
- #ansible
- #troubleshooting
- #errors
- #retries
Exact Error Message
FAILED - RETRYING: Wait for API to become healthy (5 retries left).
FAILED - RETRYING: Wait for API to become healthy (4 retries left).
FAILED - RETRYING: Wait for API to become healthy (3 retries left).
FAILED - RETRYING: Wait for API to become healthy (2 retries left).
FAILED - RETRYING: Wait for API to become healthy (1 retries left).
fatal: [app-01]: FAILED! => {
    "attempts": 5,
    "changed": false,
    "msg": "Status code was 503 and not [200]: HTTP Error 503: Service Unavailable"
}
The FAILED - RETRYING lines are not themselves the failure; they are Ansible counting down a retries/until loop. The real failure is the final fatal: line, printed once the attempts are exhausted.
What the Error Means
When a task has until, retries, and delay, Ansible runs it repeatedly until the until condition is true or the retry count is exhausted. Each unsuccessful pass prints FAILED - RETRYING ... (N retries left). If the until condition never becomes true within the allotted attempts, the task fails for real with attempts: N in the result.
So this error means: “I retried as instructed, and the success condition was still not met after the last attempt.” The cause is whatever the loop was waiting for never reaching the desired state in time.
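As a minimal sketch of how the three keywords fit together, the task below polls a systemd unit until it reports active. The unit name myapi matches the service used in the diagnostic commands in this guide; the register name svc and the retries/delay values are illustrative assumptions, not recommendations.

```yaml
- name: Wait for myapi to report active
  ansible.builtin.command: systemctl is-active myapi
  register: svc              # svc is a hypothetical variable name
  changed_when: false        # a read-only probe never changes the host
  until: svc.rc == 0         # success condition, re-evaluated after every attempt
  retries: 10                # retry budget before the task fails for real
  delay: 6                   # seconds to sleep between attempts
```

While svc.rc is non-zero, each pass prints a FAILED - RETRYING line; the final fatal: result appears only if the condition is still unmet when the budget runs out.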
Common Causes
- The thing being waited on (a service, port, file, or API) genuinely never came up in the retry window.
- retries x delay is too short for a slow-starting service.
- The until condition is wrong or too strict (checking for an exact body, status, or value the service never returns).
- A registered variable is tested incorrectly (e.g. result.status vs result.status_code).
- The probed endpoint returns a transient code (503, 502) the whole time because a dependency upstream is broken.
- Success is misjudged because failed_when/changed_when interact unexpectedly with until.
How to Reproduce the Error
Poll an endpoint that stays unhealthy:
- name: Wait for API to become healthy
  ansible.builtin.uri:
    url: http://app-01:8080/health
    status_code: 200
  register: health
  until: health.status == 200
  retries: 5
  delay: 3
ansible-playbook deploy.yml -i inventory.ini -vvv
FAILED - RETRYING: Wait for API to become healthy (5 retries left).
...
fatal: [app-01]: FAILED! => {"attempts": 5, "msg": "Status code was 503 and not [200]: HTTP Error 503: Service Unavailable"}
Diagnostic Commands
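Run the play verbosely (-vv or higher) so you see the full result of each attempt. Leave out --check here: uri does not support check mode, so the polling task would be skipped instead of executed.
ansible-playbook deploy.yml -i inventory.ini -vvv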
Probe the same endpoint manually to see what it actually returns and how fast:
ansible app-01 -i inventory.ini -m uri -a "url=http://app-01:8080/health status_code=200,503 return_content=true"
Check the service’s real status on the host:
ansible app-01 -i inventory.ini -m command -a "systemctl status myapi --no-pager"
Inspect timing on the connection itself if it is a network/port wait:
ssh -v deploy@app-01 "curl -so /dev/null -w '%{http_code} %{time_total}\n' http://localhost:8080/health"
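The same probing can be done from inside a playbook. The sketch below is one possible way to dump the registered result once, so the field names an until expression can test are visible. It assumes the endpoint from the reproduction above; accepting 502 and 503 is a deliberate choice for diagnosis only, so that an unhealthy response is recorded instead of failing the task.

```yaml
- name: Probe the health endpoint once without failing on 502/503
  ansible.builtin.uri:
    url: http://app-01:8080/health
    status_code: [200, 502, 503]   # diagnostic only: record error codes instead of failing
    return_content: true
  register: health

- name: Dump the registered result to confirm field names
  ansible.builtin.debug:
    var: health                    # look for status, elapsed, and content in the output
```

Remove the extra status codes again before reusing the task as a real health gate.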
Step-by-Step Resolution
- Look past the retry lines to the final fatal: result. The msg there (status 503, timeout, no route) is the actual cause.
- Verify the success condition matches the data. For uri, the field on the registered result is status, not status_code; confirm with -vvv output. Fix the until:
until: health.status == 200
- Right-size the wait window. If the service needs 60s to start, retries: 5, delay: 3 (15s) is far too short. Increase appropriately:
retries: 30
delay: 5
- Decide if the dependency is the real problem. If the endpoint returns 503 for the entire window, the service or its upstream is broken; fix that before tuning retries. Check logs:
ansible app-01 -i inventory.ini -m command -a "journalctl -u myapi --no-pager -n 50"
- Loosen brittle checks. If you require an exact body that varies, match on status code or a substring instead.
- Re-run and confirm the loop now succeeds within the window:
ansible-playbook deploy.yml -i inventory.ini
ok: [app-01]
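Put together, the resolution steps might produce a task along these lines. This is a sketch under stated assumptions: the substring 'ok' is a hypothetical marker in the health response, and the timeout, retries, and delay values are placeholders to be sized against the service's measured startup time.

```yaml
- name: Wait for API to become healthy
  ansible.builtin.uri:
    url: http://app-01:8080/health
    status_code: 200
    return_content: true       # needed so the body is available to the until check
    timeout: 5                 # per-request timeout in seconds (assumed value)
  register: health
  # Match on status plus a substring rather than an exact body.
  # 'ok' is a hypothetical marker; replace it with text your service really returns.
  until: health.status == 200 and 'ok' in (health.content | default(''))
  retries: 30                  # 30 x 5s is roughly 150s of waiting, plus request time
  delay: 5
```

The default('') filter keeps the expression valid on attempts where no body came back, for example when the connection is refused.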
Prevention and Best Practices
- Size retries x delay to the service’s realistic worst-case startup time, with headroom.
- Prefer purpose-built waiters: wait_for for ports/files, wait_for_connection for reboots, rather than hand-rolled until loops where possible.
- Make success conditions match the actual registered field. Run once with -vvv to confirm field names before trusting them.
- Treat persistent 503/502 during the whole window as a dependency failure to fix, not a retry count to raise.
- Add a clear name: so FAILED - RETRYING lines tell you exactly what is being waited on.
- Surface the failing dependency in alerts; a wait loop exhausting retries usually means something upstream is down.
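For the purpose-built waiters mentioned in the list above, a minimal sketch could look like the following. The port number is taken from the health URL used in this guide; the delay and timeout values are assumptions to adjust per service.

```yaml
- name: Wait for the API port to accept connections
  ansible.builtin.wait_for:
    host: 127.0.0.1
    port: 8080
    state: started
    delay: 2          # seconds to wait before the first check
    timeout: 120      # give up after this many seconds

- name: Wait for the host to come back after a reboot
  ansible.builtin.wait_for_connection:
    delay: 10         # seconds to wait before the first connection attempt
    timeout: 300      # give up after this many seconds
```

Both modules take a single timeout instead of a retries/delay pair, so the wait window is stated directly rather than computed.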
Related Errors
- The conditional check '...' failed: the until expression itself is invalid, not just unmet.
- Timeout when waiting for ... from wait_for: a port/file waiter timed out rather than an until loop.
- Status code was 503 and not [200]: the inner uri failure that drives the retries.
- UNREACHABLE!: the host itself dropped, which can also exhaust connection retries.
Frequently Asked Questions
Is FAILED - RETRYING an error? Not by itself. It is normal progress output for a retry loop. Only the final fatal: after the attempts run out is the error.
My until never succeeds even though the service is up. The condition probably tests the wrong field. For uri, use result.status; for command, use result.rc. Confirm field names with -vvv.
How long does my loop actually wait? Roughly retries * delay seconds plus task execution time. With retries: 5, delay: 3 that is about 15 seconds — often too short for service startup.
Should I just raise retries until it passes? Only if the service is genuinely slow to start. If the endpoint stays broken the whole time, fix the dependency instead.
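The wait-window arithmetic from the FAQ can be made explicit with a throwaway task. This is only an illustration: health_retries and health_delay are hypothetical variable names, and the result ignores the time each attempt itself takes.

```yaml
- name: Show the approximate wait window
  vars:
    health_retries: 5     # hypothetical variable mirroring retries: 5
    health_delay: 3       # hypothetical variable mirroring delay: 3
  ansible.builtin.debug:
    msg: "Approximate wait window: {{ health_retries * health_delay }}s"
```

Keeping the two numbers in variables and reusing them for retries and delay on the real task is one way to keep the stated window and the actual loop in step.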