Making Flaky Ansible Tasks Reliable With AI: retries, until, and wait_for

Last Tuesday a playbook that provisions OpenStack control-plane nodes failed on step 14 of 30 because a service hadn’t finished binding to its port yet. The fix someone had committed weeks earlier was a retries: 5 slapped on the task. It “worked” in the sense that the red went away. It also meant that when the service was genuinely misconfigured and would never come up, the play sat there retrying for two and a half minutes before failing with a useless message. That is the trap with flaky tasks: the difference between a task that needs a moment to settle and a task that is actually broken is a condition you have to think about, and a blind retry erases that distinction.

This is exactly the kind of work where I let AI do the typing and keep the judgment for myself. AI is genuinely good at remembering the exact syntax for until/retries/delay, at knowing that wait_for_connection exists and wait_for is a different module, and at drafting a sensible polling block. What it cannot do is know whether your success condition is correct. That part stays with you.

Why blind retries are a smell, not a fix

A bare retries with no until does nothing useful in most cases. Depending on your ansible-core version it is either ignored outright or repeats the task until it stops erroring, and “stopped erroring” is rarely the same as “done.” Retries only make sense paired with a register and an until that describes what “done” actually looks like. When I see a retry without a condition, I assume the author was treating retries as a delay mechanism, and that’s almost always wrong.

The deeper problem is that retries hide signal. A service that takes 8 seconds to start on a cold node but 45 seconds under load is telling you something about resource contention. If you bury that behind retries: 30, delay: 5, you never see the degradation until the day it blows past 150 seconds. Reliable does not mean silent. Reliable means the play waits for a specific, observable condition and fails loudly when that condition genuinely can’t be met.

So my rule is: every retry must answer two questions. What am I waiting for, and how do I know when to stop? If you can’t write the until expression, you don’t understand the failure yet, and AI guessing it for you won’t change that.
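As a rough illustration of those two questions, here is a sketch contrasting a blind retry with one that names its condition. The service name myservice and the retry numbers are assumptions for the example, not values from any real playbook:

```yaml
# Blind: nothing here says what "done" means or when to stop caring.
- name: Start the service (blind retry, avoid)
  ansible.builtin.systemd:
    name: myservice          # hypothetical service name
    state: started
  retries: 5

# Conditioned: "what am I waiting for" is the until expression,
# "when do I stop" is retries * delay.
- name: Wait until myservice reports active
  ansible.builtin.command: systemctl is-active myservice
  register: svc_state
  changed_when: false        # a status check never changes the host
  until: svc_state.stdout == "active"
  retries: 6
  delay: 5
```

If the unit never reaches active, the second task fails with the last output of systemctl is-active attached, which is usually more informative than a timeout.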

retries + until + delay: waiting on a real condition

Here’s the pattern I actually want. Poll an HTTP health endpoint until it returns 200, with a bounded number of attempts and a sane delay between them:

- name: Wait for the API to report healthy
  ansible.builtin.uri:
    url: "https://{{ inventory_hostname }}:9292/healthcheck"
    method: GET
    status_code: 200
    validate_certs: false
  register: api_health
  until: api_health.status == 200
  retries: 12
  delay: 5
  # 12 * 5s = up to 60s, then fail loudly

Read that until carefully, because it’s the whole point. We’re not retrying until the task stops raising an error; we’re retrying until a thing we can observe is true. If after 60 seconds the endpoint still isn’t 200, the play fails with the last registered result attached, so you can see what it was returning. That’s the loud failure I want.
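When a task uses until and registers its result, the registered variable also carries an attempts count. One way to keep the slow-start signal visible is to report it after the wait. This is a minimal sketch; slow_start_attempts is a hypothetical threshold variable you would tune yourself:

```yaml
- name: Report how many polls the API needed
  ansible.builtin.debug:
    msg: "API healthy after {{ api_health.attempts }} attempt(s)"

- name: Warn when startup is drifting toward the retry budget
  ansible.builtin.debug:
    msg: "API needed {{ api_health.attempts }} of 12 attempts; investigate before this becomes a failure"
  # slow_start_attempts is an assumed variable; 6 is an arbitrary default
  when: api_health.attempts > (slow_start_attempts | default(6))
```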

A subtle gotcha AI gets right more often than humans do: until and the module’s own failure conditions are checked separately. While until is false and attempts remain, a failed attempt is simply retried, so transient 503s or refused connections do not end the loop early. On the final attempt the module’s failure conditions still apply: with uri and status_code: 200, a non-200 on the last try is a real failure, which is what you want. Reach for failed_when when you need to change what counts as a failure, not to keep the task polling.
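The same mechanics can be used to fail fast on responses that will never heal, instead of sitting through the full retry budget. The sketch below assumes that 401 and 404 are permanent for this endpoint; that list is an illustration, and deciding which statuses are truly unrecoverable is your call:

```yaml
- name: Wait for the API, but give up early on responses that will not heal
  ansible.builtin.uri:
    url: "https://{{ inventory_hostname }}:9292/healthcheck"
    method: GET
    status_code: 200
    validate_certs: false
  register: api_health
  # Stop polling on success OR on a status assumed to be permanent.
  # On 401/404 the loop ends and the task fails at once, because
  # status_code: 200 still marks the result as failed.
  until: api_health.status in [200, 401, 404]
  retries: 12
  delay: 5
```

Anything else, including 503s and refused connections, keeps polling until the budget runs out.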

This is a great spot to lean on AI, and a perfect example of debugging Ansible failures faster with AI: paste the failing task and the error, and ask for the minimal correct retry. A prompt I keep around:

Here’s an Ansible task that intermittently fails because the service isn’t ready yet. Add until/retries/delay so it polls a real readiness condition. Do not just wrap it in retries — tell me explicitly what the success condition is and why, and warn me if the module already fails on the final attempt. Assume I will verify the condition myself.

That last sentence matters. I’m telling the model the division of labor up front: you draft, I verify. The output I get back proposes the condition and explains it, which is exactly the artifact I can audit in ten seconds.

wait_for: the port is open, but is the service ready?

wait_for is the right tool when you’re waiting on something at the socket or filesystem level rather than an application response. Classic use: a service forks into the background, and you need to know that its listening port is actually accepting connections.

- name: Wait for the database to accept connections
  ansible.builtin.wait_for:
    host: "{{ db_host }}"
    port: 5432
    state: started
    timeout: 90
    delay: 2

Here’s the honest caveat I always raise with my team: an open port is not the same as a ready service. PostgreSQL can be listening on 5432 while still replaying WAL and rejecting queries. wait_for on a port answers “is something bound here,” not “can I do useful work.” For true readiness you often need a wait_for on a port plus a real query afterward, or just the uri/command-with-until pattern from the previous section. AI will happily generate the wait_for and call it done; you’re the one who has to know that the port being open isn’t the condition you actually care about.
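One way to get a step closer to real readiness is to follow the port check with PostgreSQL’s own pg_isready, which exits non-zero while the server is still rejecting connections during startup. This sketch assumes pg_isready is installed on the host running the task; the retry numbers are placeholders:

```yaml
- name: Wait until PostgreSQL stops rejecting connections
  ansible.builtin.command: pg_isready -h {{ db_host }} -p 5432
  register: pg_ready
  changed_when: false        # a readiness probe never changes the host
  # rc 0 = accepting connections; 1 = rejecting (e.g. startup); 2 = no response
  until: pg_ready.rc == 0
  retries: 18
  delay: 5
```

If you need proof that queries succeed, a credentialed SELECT 1 under the same until pattern is stricter still.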

wait_for can also watch a file or a string in a log, which is occasionally the cleanest signal:

- name: Wait for the bootstrap marker to appear
  ansible.builtin.wait_for:
    path: /var/lib/myservice/.bootstrapped
    state: present
    timeout: 120
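For the log-string case, wait_for takes a search_regex alongside path. The log path and the message below are hypothetical; substitute whatever your service actually writes:

```yaml
- name: Wait for the service to log that startup finished
  ansible.builtin.wait_for:
    path: /var/log/myservice/service.log   # hypothetical log path
    search_regex: "Startup complete"       # hypothetical log message
    timeout: 120
```

Note that the whole file is searched, so a matching line left over from a previous run will satisfy the check immediately. Rotate or truncate the log first if that matters.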

wait_for_connection: surviving reboots

Reboots are where flakiness gets expensive, because SSH dies, comes back, and your play has to ride through the gap. The reboot module handles most of this, but when you reboot out-of-band or need finer control, wait_for_connection is the tool. It waits until Ansible’s own connection plugin can reach the host again — not just until a port is open, but until it can actually run a command.

- name: Reboot the node to apply the new kernel
  ansible.builtin.reboot:
    reboot_timeout: 600
    post_reboot_delay: 15

- name: Confirm the host is truly back before continuing
  ansible.builtin.wait_for_connection:
    delay: 10
    timeout: 300

The delay before the first check is not optional cargo-culting, especially when the reboot was triggered out-of-band rather than by the reboot module, which does its own waiting. Some systems keep SSH up for a few seconds into the shutdown, so if you start polling immediately you’ll get a false “it’s up!” right before the host actually goes down. Waiting 10–15 seconds before the first attempt avoids that race. This is precisely the kind of timing detail I want AI to remind me of, and precisely the kind of value I’d never trust it to set blindly for my hardware: a slow BMC or a node that runs fsck on boot needs a much longer timeout, and only I know that.
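For the out-of-band case, a common pattern is to fire the reboot command asynchronously so the task returns before SSH drops, then let wait_for_connection ride through the gap. This is a sketch with placeholder timings, not values tuned for any particular hardware:

```yaml
- name: Trigger the reboot without waiting on the dying connection
  ansible.builtin.shell: sleep 2 && shutdown -r now
  async: 1
  poll: 0                    # fire and forget; SSH is about to go away
  become: true

- name: Wait for the host to go down and come back
  ansible.builtin.wait_for_connection:
    delay: 15                # skip the window where SSH is still up pre-shutdown
    timeout: 600
```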

poll, async, and genuinely long operations

For operations that legitimately take minutes — a large image conversion, a slow package transaction — don’t hold the connection open and don’t fake it with a giant retry count. Fire the task async and poll:

- name: Run the long migration in the background
  ansible.builtin.command: /usr/local/bin/migrate-cluster.sh
  async: 1800        # allow up to 30 minutes
  poll: 0            # fire and don't block
  register: migration

- name: Wait for the migration to finish
  ansible.builtin.async_status:
    jid: "{{ migration.ansible_job_id }}"
  register: migration_result
  until: migration_result.finished
  retries: 60
  delay: 30

poll: 0 starts the job and moves on; async_status with an until polls for completion. This is structurally the same “wait for an observable condition” idea, just for work that’s too long to block on. The timeout math should be deliberate: retries * delay here is 30 minutes, matching the async budget. If they disagree, you get confusing failures, so I always make AI show me the arithmetic.
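One way to keep the two budgets from drifting apart is to derive both from a single pair of variables. In this sketch the play name, the controllers host group, and the variables migration_budget and migration_poll_interval are all hypothetical:

```yaml
- name: Migrate with one shared time budget
  hosts: controllers                  # hypothetical inventory group
  vars:
    migration_budget: 1800            # seconds allowed for the job
    migration_poll_interval: 30       # seconds between status checks
  tasks:
    - name: Run the long migration in the background
      ansible.builtin.command: /usr/local/bin/migrate-cluster.sh
      async: "{{ migration_budget }}"
      poll: 0
      register: migration

    - name: Wait for the migration to finish
      ansible.builtin.async_status:
        jid: "{{ migration.ansible_job_id }}"
      register: migration_result
      until: migration_result.finished
      # 1800 // 30 = 60 checks, so polling covers the same 30 minutes as async
      retries: "{{ migration_budget // migration_poll_interval }}"
      delay: "{{ migration_poll_interval }}"
```

Changing the budget in one place then changes both the async limit and the polling window.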

Where the human stays in the loop

Every snippet above hinges on one human-owned decision: is the condition correct, and is the timeout honest for this environment? AI will draft until: result.rc == 0 when what you actually needed was until: 'ready' in result.stdout. It’ll suggest timeout: 60 when your storage array sometimes takes three minutes. The syntax it nails; the semantics are yours.
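A small sketch of the stdout-based condition, assuming a hypothetical status command that prints the word ready once the cluster is usable. The until value is wrapped in double quotes because a YAML value that begins with a single-quoted string and then continues will not parse:

```yaml
- name: Wait until the cluster says it is ready, not merely that the CLI exited 0
  ansible.builtin.command: /usr/local/bin/cluster-status   # hypothetical command
  register: result
  changed_when: false
  until: "'ready' in result.stdout"
  retries: 10
  delay: 6
```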

My workflow is boring on purpose. I describe the flaky behavior, ask for a retry that names its success condition, then I read the until expression out loud and ask “is this the thing I actually care about?” If the answer is no, the draft was still useful — it framed the question. I keep a few of these retry-hardening prompts in my prompt library so the framing is consistent across the team, and the broader Ansible playbook patterns we’ve collected lean on the same principle: automate the typing, never the judgment.

Flaky tasks don’t get reliable because you retried them. They get reliable because you figured out what “done” means and made the play wait for exactly that — no more, no less. AI is a fast, knowledgeable pair for getting there. It is not the one who decides the condition is right. That’s still the job.
