Rolling Deploys With Ansible: delegate_to, serial, and run_once

I once shipped a “routine” config change to a fleet of twelve web nodes and watched the whole service flatline for ninety seconds. The playbook was correct. The tasks were idempotent. The handlers fired cleanly. The problem was that Ansible, by default, runs every host in a play at the same time, so all twelve nodes restarted their app servers in the same heartbeat. There was no serial. There was no drain. There was no canary. Just a synchronized swan dive into a 503 page, broadcast to every user at once.

That outage taught me the difference between configuring servers and orchestrating a deploy. Ansible is excellent at the first by default and completely indifferent to the second unless you tell it otherwise. This post is the orchestration playbook I wish I’d had: serial for batching, delegate_to for talking to the load balancer, run_once for the one-time stuff, and tight health-check loops so a bad batch stops itself.

Why the default fan-out is a footgun

By default Ansible’s “linear” strategy runs each task across all hosts in the play, in parallel up to the forks limit (5 unless you raise it), before moving to the next task. That’s great for apt install. It is catastrophic for systemctl restart app, because every node goes unhealthy within moments of the others. Zero downtime requires the opposite: take a small slice of the fleet out of rotation, change it, prove it’s healthy, then move on.

serial is the knob that turns a fleet-wide blast into a rolling wave.

- name: "Rolling web deploy"
  hosts: web
  become: true
  serial: 1
  tasks:
    - name: "Deploy the new release"
      ansible.builtin.copy:
        src: "build/app.tar.gz"
        dest: "/opt/app/releases/app.tar.gz"

With serial: 1, Ansible runs the entire play against one host, finishes it, then starts the next. That’s your canary. But one-at-a-time across 200 nodes is painfully slow, so in practice you ramp.

Ramp up with a serial batch list

serial accepts a single number, a percentage, or a list that describes a ramp. This is the single most important pattern in safe rolling deploys: prove the change on one box, then a small batch, then accelerate.

- name: "Ramped rolling deploy"
  hosts: web
  become: true
  serial:
    - 1
    - "10%"
    - "25%"
    - "50%"
  max_fail_percentage: 0
  tasks:
    - name: "Place the new release directory"
      ansible.builtin.unarchive:
        src: "build/app.tar.gz"
        dest: "/opt/app/releases/{{ release_id }}/"
        remote_src: false

That serial list reads: first deploy to exactly one node (the canary), then 10% of the fleet, then 25%, then 50% per batch until everyone is done. max_fail_percentage: 0 means any host failure in a batch aborts the whole play immediately. On a deploy I would rather stop after one bad node than discover a broken release on a quarter of production.

Pro Tip: percentages in serial are computed against the total number of hosts in the play, not the count still remaining, and fractional results round down with a floor of one host. On a 12-host fleet "10%" is 1 host and "25%" is 3, but on a 200-host fleet "10%" is 20. When you want a true single canary regardless of fleet size, lead the list with a literal 1 rather than trusting a percentage to give you one.
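
To make the batch arithmetic concrete, here is a small worked sketch for a hypothetical 12-host web group. It assumes percentages are applied to the play's total host count and rounded down (minimum 1), and that the last list entry repeats until no hosts remain. The debug task is only there so you can watch the batches form against a test inventory.

```yaml
# Worked example: how the ramp splits a hypothetical 12-host "web" group.
# Assumption: percentages use the total host count (12) and round down.
- name: "Ramped rolling deploy (batch arithmetic demo)"
  hosts: web
  gather_facts: false
  serial:
    - 1        # batch 1: 1 host (canary), 11 left
    - "10%"    # batch 2: 10% of 12 = 1.2 -> 1 host, 10 left
    - "25%"    # batch 3: 25% of 12 = 3 hosts, 7 left
    - "50%"    # batch 4: 50% of 12 = 6 hosts, 1 left
               # batch 5: "50%" is reused, but only 1 host remains
  tasks:
    - name: "Show which hosts share the current batch"
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} is in a batch with {{ ansible_play_batch | join(', ') }}"
```

The trailing single-host batch is easy to miss: the final list entry sets the steady-state batch size, and whatever is left over at the end forms a smaller last batch.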

Drain the node from the load balancer with delegate_to

The whole point of a rolling deploy is that traffic never hits a node mid-change. That means pulling each node out of the load balancer before you touch it and putting it back after it’s healthy. The node can’t do that to itself; the load balancer has to. delegate_to runs a task on a different host while keeping the current host’s variables in scope.

  tasks:
    - name: "Drain {{ inventory_hostname }} from the load balancer"
      ansible.builtin.uri:
        url: "https://{{ lb_admin_host }}/pool/web/members/{{ inventory_hostname }}"
        method: PATCH
        body_format: json
        body:
          state: "disabled"
        headers:
          Authorization: "Bearer {{ lb_api_token }}"
      delegate_to: localhost
      run_once: false

    - name: "Wait for in-flight connections to drain"
      ansible.builtin.wait_for:
        timeout: 20
      delegate_to: localhost

Here delegate_to: localhost runs the API call from the control node, but {{ inventory_hostname }} still resolves to the web node currently being deployed. That’s the magic: the task acts on behalf of the node without running on it. After the file swap and restart, you reverse it to re-enable the member.

    - name: "Re-enable {{ inventory_hostname }} in the load balancer"
      ansible.builtin.uri:
        url: "https://{{ lb_admin_host }}/pool/web/members/{{ inventory_hostname }}"
        method: PATCH
        body_format: json
        body:
          state: "enabled"
        headers:
          Authorization: "Bearer {{ lb_api_token }}"
      delegate_to: localhost

If your LB is an HAProxy box rather than an API, delegate_to: "{{ groups['haproxy'][0] }}" plus a socket command works exactly the same way.
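
For the HAProxy case, one option is the community.general.haproxy module, which talks to the HAProxy admin socket for you. The following is a sketch under assumptions, not a drop-in: the backend name "web", the socket path, and the haproxy inventory group are placeholders, and the host value has to match the server name used in your HAProxy backend definition.

```yaml
# Sketch only: backend name, socket path, and the "haproxy" group are
# assumptions; adjust them to match your HAProxy configuration.
- name: "Disable {{ inventory_hostname }} in HAProxy"
  community.general.haproxy:
    state: disabled
    host: "{{ inventory_hostname }}"   # must match the server name in the backend
    backend: "web"
    socket: "/var/run/haproxy.sock"
    wait: true                         # wait until HAProxy reports the new state
  delegate_to: "{{ groups['haproxy'][0] }}"

- name: "Re-enable {{ inventory_hostname }} in HAProxy"
  community.general.haproxy:
    state: enabled
    host: "{{ inventory_hostname }}"
    backend: "web"
    socket: "/var/run/haproxy.sock"
    wait: true
  delegate_to: "{{ groups['haproxy'][0] }}"
```

If you run more than one HAProxy box, the same tasks can loop over the whole group instead of delegating to the first member only.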

Prove health before re-enabling: until / retries / delay

Re-enabling a node the instant the service starts is how you ship a node that’s “up” but not actually serving. Always poll a real health endpoint and only continue once it’s green.

    - name: "Restart the app on {{ inventory_hostname }}"
      ansible.builtin.systemd:
        name: "app"
        state: "restarted"

    - name: "Wait for {{ inventory_hostname }} to report healthy"
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/healthz"
        status_code: 200
        return_content: true
      register: health
      until: health.status == 200 and 'ok' in health.content
      retries: 30
      delay: 5
      delegate_to: localhost

retries: 30 with delay: 5 gives the node roughly 150 seconds to come good (30 waits of 5 seconds, plus however long each request takes). The until condition checks both the status code and the body, because plenty of apps return 200 from a load balancer probe while their dependencies are still warming up. If the loop exhausts its retries, the task fails, and with max_fail_percentage: 0 the entire rollout stops right there with the offending node still drained. That is exactly the behavior you want.
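
Since the snippets so far appear piecemeal, here is one way they might be assembled into a single play, in the order each host goes through them. Treat it as a sketch: it reuses the variables from the examples above (lb_admin_host, lb_api_token, release_id are assumed to come from inventory or vault), and the load balancer API shape is this post's illustrative one rather than any specific product's. The become: false lines are my addition, on the assumption that you don't want play-level become applied to tasks that run on the control node.

```yaml
# Assembled sketch of the per-host sequence: drain, change, verify, re-enable.
- name: "Rolling web deploy"
  hosts: web
  become: true
  serial: [1, "10%", "25%", "50%"]
  max_fail_percentage: 0
  tasks:
    - name: "1. Drain {{ inventory_hostname }} from the load balancer"
      ansible.builtin.uri:
        url: "https://{{ lb_admin_host }}/pool/web/members/{{ inventory_hostname }}"
        method: PATCH
        body_format: json
        body: { state: "disabled" }
        headers:
          Authorization: "Bearer {{ lb_api_token }}"
      delegate_to: localhost
      become: false   # assumption: no privilege escalation on the control node

    - name: "2. Wait for in-flight connections to drain"
      ansible.builtin.wait_for:
        timeout: 20
      delegate_to: localhost
      become: false

    - name: "3. Place the new release directory"
      ansible.builtin.unarchive:
        src: "build/app.tar.gz"
        dest: "/opt/app/releases/{{ release_id }}/"

    - name: "4. Restart the app"
      ansible.builtin.systemd:
        name: "app"
        state: "restarted"

    - name: "5. Wait for the node to report healthy"
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/healthz"
        return_content: true
      register: health
      until: health.status == 200 and 'ok' in health.content
      retries: 30
      delay: 5
      delegate_to: localhost
      become: false

    - name: "6. Re-enable {{ inventory_hostname }} in the load balancer"
      ansible.builtin.uri:
        url: "https://{{ lb_admin_host }}/pool/web/members/{{ inventory_hostname }}"
        method: PATCH
        body_format: json
        body: { state: "enabled" }
        headers:
          Authorization: "Bearer {{ lb_api_token }}"
      delegate_to: localhost
      become: false
```

The ordering is the point: step 6 is only reached if step 5 succeeds, so a node that never turns healthy is never put back into rotation.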

Run-once work: migrations and announcements

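Some steps must happen exactly once per deploy, not once per host. Database migrations are the classic example. Running them on every node in a serial batch is a race at best and a corruption at worst. run_once: true runs the task on a single host (the first in the current batch) and applies the result to the other hosts in that batch. Mind the scope: in a play that uses serial, run_once fires once per batch, not once per deploy, which is why the one-time work below sits in its own play with no serial.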

- name: "Pre-deploy one-time tasks"
  hosts: web
  become: true
  tasks:
    - name: "Apply database migrations"
      ansible.builtin.command:
        cmd: "/opt/app/bin/migrate up"
      run_once: true
      delegate_to: "{{ groups['migration_runner'][0] }}"

    - name: "Announce deploy start in chat"
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: "Starting rolling deploy of {{ release_id }}"
      run_once: true
      delegate_to: localhost

Pair run_once with delegate_to when the one-time task should run somewhere specific, like a dedicated migration host that has the right network path to the database.
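
When a one-time step can't be moved out of the rolling play, a guard on the play's first host is one way to keep it from repeating in every batch. This is a minimal sketch reusing the chat announcement from above; the serial values are arbitrary, and ansible_play_hosts_all is the magic variable listing every host the play targets.

```yaml
# Sketch: a one-time step inside a serial play. The task runs only when the
# current host is the first host of the whole play, so later batches skip it.
- name: "Rolling web deploy"
  hosts: web
  serial: [1, "25%"]
  tasks:
    - name: "Announce deploy start in chat (once per deploy, not per batch)"
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: "Starting rolling deploy of {{ release_id }}"
      delegate_to: localhost
      when: inventory_hostname == ansible_play_hosts_all[0]
```

I would still keep migrations in a separate play as shown earlier; the guard is better suited to low-stakes steps like notifications.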

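Sharing facts and throttling expensive steps

Two smaller knobs that matter at scale. delegate_facts: true controls where gathered facts get stored when you delegate a setup task. By default the facts are assigned to the current host (inventory_hostname) even though the task ran elsewhere; with delegate_facts: true they are assigned to the delegated host instead, which is what you need when you want facts about the load balancer rather than about the web node you’re delegating from.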

    - name: "Gather facts about the load balancer"
      ansible.builtin.setup:
      delegate_to: "{{ lb_admin_host }}"
      delegate_facts: true
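
Once the facts are stored under the load balancer's name, later tasks can read them through hostvars. A minimal sketch, assuming lb_admin_host names a host that exists in inventory; the distribution fact is used only because it is one the setup module normally returns.

```yaml
# Assumes the delegated setup task with delegate_facts: true has already run,
# and that lb_admin_host is an inventory hostname (an assumption of this sketch).
- name: "Show a fact gathered from the load balancer"
  ansible.builtin.debug:
    msg: "LB runs {{ hostvars[lb_admin_host]['ansible_facts']['distribution'] }}"
```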

And throttle caps how many hosts run a single task in parallel even inside a larger serial batch. Use it for tasks that hammer a shared resource, like pulling a multi-gigabyte artifact from one registry.

    - name: "Pull the container image"
      community.docker.docker_image:
        name: "registry.internal/app:{{ release_id }}"
        source: "pull"
      throttle: 2

A serial: "25%" batch might be ten hosts, but throttle: 2 means only two of them hit the registry at a time. Batch size controls blast radius; throttle controls stampede.

Where AI fits, and where it absolutely does not

I draft a lot of this orchestration with an AI assistant now, and it’s genuinely good at it. Treat it like a fast, eager junior engineer: it will produce a plausible serial ramp, wire up the delegate_to drain/enable pair, and remember the until health loop faster than I can type it. Tools like Claude or Cursor are great for getting from blank file to first draft, and a sharpened prompt from a prompt pack gets you a better skeleton than “write me an ansible playbook.” If you do this regularly, build a reusable brief in the prompt workspace and keep your house patterns in the prompts library.

But a draft is not a deploy. Three rules I never break:

1. A human reviews every change. AI cheerfully suggested serial: "50%" as a first batch once. On a six-node fleet that’s three nodes drained at once. Run it through code review and read every line.
2. Always --check first. ansible-playbook deploy.yml --check --diff dry-runs the playbook and shows you the diff without touching production. If the AI got a state or a path wrong, this is where it surfaces, not in a live drain. Keep in mind that tasks whose modules don’t support check mode, such as the command-based migration, are skipped rather than simulated, so a clean dry run doesn’t prove those steps.
3. Always canary with serial: 1. Even after review and dry-run, the first real batch is one node. Watch it, confirm health, then let the ramp continue.

Pro Tip: never hand the AI your vault keys. The assistant drafts the playbook structure; it does not need ansible-vault passwords, the LB API token, or production SSH access to do that job. Keep {{ lb_api_token }} and friends in your vault and out of every prompt. A leaked drafting context is still a leaked credential.

Conclusion

The outage that taught me all this came down to one missing line: serial. Everything else, the delegate_to drain, the run_once migration, the until health loop, the max_fail_percentage circuit breaker, is just making sure that wave moves through the fleet without ever letting traffic hit a node that isn’t ready. AI makes drafting that orchestration dramatically faster, and it’s a genuine force multiplier when you treat it as a junior who never gets to touch the vault. Review every change, dry-run it, canary it, and the same fleet that flatlined on me will roll a deploy with nobody noticing.
