AI for Ansible · By James Joyner IV · 9 min read

Ansible Error Guide: 'FAILED - RETRYING' Until Loop Retries Exhausted

Fix Ansible's 'FAILED - RETRYING ... (retries left)' loop that ends in failure: diagnose until/retries conditions, slow services, wrong success checks, and timeouts.

  • #ansible
  • #troubleshooting
  • #errors
  • #retries

Exact Error Message

FAILED - RETRYING: Wait for API to become healthy (5 retries left).
FAILED - RETRYING: Wait for API to become healthy (4 retries left).
FAILED - RETRYING: Wait for API to become healthy (3 retries left).
FAILED - RETRYING: Wait for API to become healthy (2 retries left).
FAILED - RETRYING: Wait for API to become healthy (1 retries left).
fatal: [app-01]: FAILED! => {
    "attempts": 5,
    "changed": false,
    "msg": "Status code was 503 and not [200]: HTTP Error 503: Service Unavailable"
}

The FAILED - RETRYING lines are not themselves the failure. They are Ansible counting down an until/retries loop. The real failure is the final fatal: line, printed once the attempts are exhausted.

What the Error Means

When a task has until, retries, and delay, Ansible runs it repeatedly until the until condition is true or the retry count is exhausted. Each unsuccessful pass prints FAILED - RETRYING ... (N retries left). If the until condition never becomes true within the allotted attempts, the task fails for real with attempts: N in the result.

So this error means: “I retried as instructed, and the success condition was still not met after the last attempt.” The cause is whatever the loop was waiting for never reaching the desired state in time.
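As a rough illustration of how the three keywords fit together, the sketch below annotates each one. The systemctl command and the svc variable name are assumptions chosen for illustration; myapi is the service name used in this guide's other examples.

```yaml
# Hypothetical wait task: the probe command and variable name are placeholders.
- name: Wait for myapi to report active
  ansible.builtin.command: systemctl is-active myapi
  register: svc
  until: svc.rc == 0      # success condition, evaluated after each attempt
  retries: 5              # how many times to retry before giving up
  delay: 3                # seconds to sleep between attempts
  changed_when: false     # a read-only probe should not report "changed"

# Runs only if the wait above succeeded; the registered result of a task
# that uses until carries an "attempts" counter.
- name: Show how many attempts the wait needed
  ansible.builtin.debug:
    msg: "myapi became active after {{ svc.attempts }} attempt(s)"
```

If the condition is never met, the first task ends with the fatal: result described above and the debug task does not run for that host.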

Common Causes

  • The thing being waited on (a service, port, file, or API) genuinely never came up in the retry window.
  • retries x delay is too short for a slow-starting service.
  • The until condition is wrong or too strict (checking for an exact body, status, or value the service never returns).
  • A registered variable is tested incorrectly (e.g. result.status vs result.status_code).
  • The probed endpoint returns a transient code (503, 502) the whole time because a dependency upstream is broken.
  • The task's success is misjudged because failed_when or changed_when interacts unexpectedly with until.
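The "too strict" cause in the list above is easiest to see side by side. This is a minimal sketch that assumes the health endpoint returns a body containing the word ok when healthy. That body format is hypothetical, so adjust the substring to whatever your service actually returns.

```yaml
- name: Wait for API to report ready
  ansible.builtin.uri:
    url: http://app-01:8080/health
    status_code: 200
    return_content: true
  register: health
  # Too strict: breaks as soon as the body gains a timestamp, a version
  # field, or different whitespace.
  # until: health.content == '{"status":"ok"}'
  # Looser: require the status code plus a substring of the body.
  # List items are combined with AND, as with "when".
  until:
    - health.status == 200
    - "'ok' in (health.content | default(''))"
  retries: 5
  delay: 3
```

The default('') guard is there so the substring test still evaluates cleanly if an attempt returns no content key.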

How to Reproduce the Error

Poll an endpoint that stays unhealthy:

- name: Wait for API to become healthy
  ansible.builtin.uri:
    url: http://app-01:8080/health
    status_code: 200
  register: health
  until: health.status == 200
  retries: 5
  delay: 3
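
Run the playbook without --check. The uri module does not support check mode and is skipped there, so the loop would never actually poll the endpoint:

ansible-playbook deploy.yml -i inventory.ini -vvv

The run ends with:

FAILED - RETRYING: Wait for API to become healthy (5 retries left).
...
fatal: [app-01]: FAILED! => {"attempts": 5, "msg": "Status code was 503 and not [200]: HTTP Error 503: Service Unavailable"}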

Diagnostic Commands

Run the play verbosely so you see the full result of each attempt:

ansible-playbook deploy.yml -i inventory.ini -vvv

Probe the same endpoint manually to see what it actually returns and how fast:

ansible app-01 -i inventory.ini -m uri -a "url=http://app-01:8080/health status_code=200,503 return_content=true"

Check the service’s real status on the host:

ansible app-01 -i inventory.ini -m command -a "systemctl status myapi --no-pager"

Inspect timing on the connection itself if it is a network/port wait:

ssh -v deploy@app-01 "curl -so /dev/null -w '%{http_code} %{time_total}\n' http://localhost:8080/health"
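The manual probe can also be written as a small play, which makes it easy to see exactly which fields an until condition would be testing. This is a sketch only: the probe variable name is an assumption, and the content key is guarded because it may be absent on some results.

```yaml
- name: Probe the health endpoint once and show what came back
  hosts: app-01
  gather_facts: false
  tasks:
    - name: Request the health endpoint, accepting healthy and unhealthy codes
      ansible.builtin.uri:
        url: http://app-01:8080/health
        status_code: [200, 503]   # do not fail on 503, so the result can be inspected
        return_content: true
      register: probe

    - name: Print the fields an until condition would test
      ansible.builtin.debug:
        msg:
          status: "{{ probe.status }}"
          elapsed_seconds: "{{ probe.elapsed }}"
          body: "{{ probe.content | default('') }}"
```

Comparing elapsed_seconds with the delay value gives a quick sense of whether slow responses, rather than a short retry count, are eating the wait window.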

Step-by-Step Resolution

  1. Look past the retry lines to the final fatal: result. The msg there (status 503, timeout, no route) is the actual cause.

  2. Verify the success condition matches the data. For uri, the field on the registered result is status, not status_code; confirm with -vvv output. Fix the until:

  until: health.status == 200

  3. Right-size the wait window. If the service needs 60s to start, retries: 5 with delay: 3 (about 15s) is far too short. Increase appropriately:

  retries: 30
  delay: 5

  4. Decide whether the dependency is the real problem. If the endpoint returns 503 for the entire window, the service or its upstream is broken; fix that before tuning retries. Check logs:

ansible app-01 -i inventory.ini -m command -a "journalctl -u myapi --no-pager -n 50"

  5. Loosen brittle checks. If you require an exact body that varies, match on status code or a substring instead.

  6. Re-run (without --check, which skips uri) and confirm the loop now succeeds within the window:

ansible-playbook deploy.yml -i inventory.ini

ok: [app-01]
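Several of the steps above can be combined into one task group. The sketch below uses the right-sized window from this guide and adds a rescue section that gathers the service log when the wait still fails. The timeout value, the api_logs variable name, and the wording of the failure message are assumptions.

```yaml
- name: Wait for the API and collect logs if it never becomes healthy
  block:
    - name: Wait for API to become healthy
      ansible.builtin.uri:
        url: http://app-01:8080/health
        status_code: 200
        timeout: 5            # per-request timeout in seconds (assumed value)
      register: health
      until: health.status == 200
      retries: 30             # 30 retries x 5 s delay = roughly 150 s of waiting
      delay: 5
  rescue:
    - name: Collect recent service logs
      ansible.builtin.command: journalctl -u myapi --no-pager -n 50
      register: api_logs
      changed_when: false

    - name: Fail with the last health result and the log tail
      ansible.builtin.fail:
        msg: >-
          API still unhealthy after {{ health.attempts | default('unknown') }} attempts
          (last status {{ health.status | default('n/a') }}).
          Recent logs: {{ api_logs.stdout_lines | join(' | ') }}
```

Depending on how the journal is configured on the host, the log task may need privilege escalation to read the unit's entries.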

Prevention and Best Practices

  • Size retries x delay to the service’s realistic worst-case startup time, with headroom.
  • Prefer purpose-built waiters: wait_for for ports/files, wait_for_connection for reboots, rather than hand-rolled until loops where possible.
  • Make success conditions match the actual registered field. Run once with -vvv to confirm field names before trusting them.
  • Treat persistent 503/502 during the whole window as a dependency failure to fix, not a retry count to raise.
  • Add a clear name: so FAILED - RETRYING lines tell you exactly what is being waited on.
  • Surface the failing dependency in alerts; a wait loop exhausting retries usually means something upstream is down.

Related Errors

  • The conditional check '...' failed: the until expression itself is invalid, not just unmet.
  • Timeout when waiting for ... from wait_for: a port/file waiter timed out rather than an until loop.
  • Status code was 503 and not [200]: the inner uri failure that drives the retries.
  • UNREACHABLE!: the host itself dropped, which can also exhaust connection retries.
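For reference, the purpose-built waiters recommended under Prevention and Best Practices look roughly like this. When they give up, they report a timeout error of their own rather than exhausting retries. The delay and timeout values are assumptions; port 8080 is the one used in this guide's examples.

```yaml
# wait_for runs on the managed host and, with no "host" given,
# checks the port on that host itself.
- name: Wait for the API port to accept connections
  ansible.builtin.wait_for:
    port: 8080
    state: started
    delay: 2            # seconds to wait before the first check (assumed value)
    timeout: 150        # give up after this many seconds (assumed value)

# Useful after a reboot, when the connection itself is what must come back.
- name: Wait for the host to accept Ansible connections again
  ansible.builtin.wait_for_connection:
    delay: 10           # seconds to wait before the first check (assumed value)
    timeout: 300        # give up after this many seconds (assumed value)
```

A port that accepts connections is a weaker signal than a healthy HTTP response, so an until loop on uri may still be the right tool when the application needs time to warm up after binding its port.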

Frequently Asked Questions

Is FAILED - RETRYING an error? Not by itself. It is normal progress output for a retry loop. Only the final fatal: after the attempts run out is the error.

My until never succeeds even though the service is up. The condition probably tests the wrong field. For uri, use result.status; for command, use result.rc. Confirm field names with -vvv.

How long does my loop actually wait? Roughly retries * delay seconds plus task execution time. With retries: 5, delay: 3 that is about 15 seconds — often too short for service startup.
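The same arithmetic can be run in reverse to derive retries from an expected startup time. This is an untested sketch: the three variable names and their values are hypothetical, and it relies on retries and delay accepting templated values.

```yaml
- name: Wait for API with a window derived from the expected startup time
  vars:
    startup_seconds: 60       # assumed worst-case startup time
    headroom_factor: 1.5      # assumed safety margin
    poll_delay: 5             # seconds between attempts
  ansible.builtin.uri:
    url: http://app-01:8080/health
    status_code: 200
  register: health
  until: health.status == 200
  # ceil(60 * 1.5 / 5) = 18 retries, so roughly 18 x 5 = 90 s of waiting
  retries: "{{ (startup_seconds * headroom_factor / poll_delay) | round(0, 'ceil') | int }}"
  delay: "{{ poll_delay }}"
```

Keeping the startup estimate in a variable makes the intent visible in review: a reader sees the assumed 60 seconds rather than an unexplained retry count.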

Should I just raise retries until it passes? Only if the service is genuinely slow to start. If the endpoint stays broken the whole time, fix the dependency instead.
