Ansible block/rescue/always: AI-Assisted Error Handling That Recovers

I once watched a “simple” rolling upgrade playbook die on task 14 of 22. It had drained the load balancer, stopped the app, swapped the binary, and then choked on a database migration that timed out. Ansible did exactly what Ansible does by default: it stopped. The host was now out of the pool, the service was down, and nothing on earth was going to put it back without a human SSHing in at 2 a.m. to manually re-enable the LB and restart the old version. The playbook didn’t fail safe. It failed open, mid-surgery, with the patient on the table.

That night taught me the lesson that this whole post is about: a playbook without recovery logic isn’t automation, it’s a very fast way to break things in a half-completed state. Ansible gives you block, rescue, and always — try/catch/finally for infrastructure — and most playbooks never use them. This is exactly the kind of mechanical, well-understood refactor that AI is genuinely good at. So let me show you how I use it, and where I keep my hands firmly on the wheel.

block/rescue/always is try/catch for your infra

The core construct is three keywords. block is the happy path. rescue runs only if a task in the block fails. always runs no matter what. If you’ve written try/catch/finally in any language, you already understand the control flow.

- name: "Upgrade application with safe recovery"
  block:
    - name: "Drain node from load balancer"
      community.general.haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"
        backend: app_backend

    - name: "Deploy new release"
      ansible.builtin.unarchive:
        src: "releases/{{ app_version }}.tar.gz"
        dest: /opt/app/current

  rescue:
    - name: "Roll back to previous release"
      ansible.builtin.command: /opt/app/bin/rollback.sh
      register: rollback

  always:
    - name: "Re-enable node in load balancer"
      community.general.haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        backend: app_backend

Notice the shape of the fix. The dangerous operation — pulling the node out of the pool — has a guaranteed counterpart in always. Whether the deploy succeeds, fails, or the rescue itself stumbles, that node goes back into rotation. That is the entire game: every irreversible-looking action gets a guaranteed undo or completion.
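If you want to see the ordering for yourself before trusting it with a load balancer, a throwaway playbook against localhost is enough. This is a minimal sketch, not part of the upgrade above: the play name and messages are made up for illustration, and the failure is simulated with ansible.builtin.fail.

```yaml
# Hypothetical demo playbook; it touches nothing and is safe to run locally.
- name: "Demonstrate block/rescue/always ordering"
  hosts: localhost
  gather_facts: false
  tasks:
    - name: "Guarded section"
      block:
        - name: "Simulate a failing step"
          ansible.builtin.fail:
            msg: "simulated deploy failure"

        - name: "Skipped, because the previous task failed"
          ansible.builtin.debug:
            msg: "block continued"

      rescue:
        - name: "Runs only because a task in the block failed"
          ansible.builtin.debug:
            msg: "rescue ran"

      always:
        - name: "Runs whether the block failed or not"
          ansible.builtin.debug:
            msg: "always ran"
```

Running it should show the second block task never executing, then the rescue and always tasks in that order, and the play recap should list the host as rescued rather than failed.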

This is the first thing I ask AI to do. I paste a flat playbook into Claude and say: “Identify every task that leaves infrastructure in a partial state on failure, and wrap them in block/rescue/always so the host is never left drained, stopped, or half-migrated.” It’s fast, it’s thorough, and it never gets bored reading 22 tasks. That’s the junior-engineer pitch: speed and stamina, not judgment.

ignore_errors is not error handling

The most common anti-pattern AI will find, and the one it will try to add if you’re not careful, is ignore_errors: true. Swallowing a failure and handling it are not the same thing.

# WRONG: the failure is swallowed, the playbook marches on blind
- name: "Run migration"
  ansible.builtin.command: /opt/app/bin/migrate.sh
  ignore_errors: true

# RIGHT: the failure is caught, and you actually do something about it
- name: "Migrate with recovery"
  block:
    - name: "Run migration"
      ansible.builtin.command: /opt/app/bin/migrate.sh
  rescue:
    - name: "Restore database from pre-migration snapshot"
      ansible.builtin.command: /opt/app/bin/restore-snapshot.sh
    - name: "Fail loudly so the pipeline stops"
      ansible.builtin.fail:
        msg: "Migration failed; database restored from snapshot. Investigate before retrying."

ignore_errors says “I don’t care if this breaks.” rescue says “I care, and here’s the plan.” When I review AI output, the single most common correction I make is replacing a lazy ignore_errors with a real rescue. The AI reaches for it because it makes the playbook go green, and a fast junior engineer optimizes for green. You are the one who has to know that green-because-ignored is a future incident.

Pro Tip: A rescue block that ends in ansible.builtin.fail is not redundant. It lets you clean up safely AND still surface the failure to your CI pipeline. Recovery and honesty are not mutually exclusive.

failed_when: define what “broken” actually means

Sometimes a non-zero exit code isn’t a failure, and sometimes a zero exit code is. failed_when lets you teach Ansible your real definition so your rescue fires for the right reasons.

- name: "Check cluster health"
  ansible.builtin.command: /opt/app/bin/healthcheck --json
  register: health
  changed_when: false
  failed_when: >
    health.rc != 0 or
    'degraded' in health.stdout or
    'quorum_lost' in health.stdout

This pairs beautifully with blocks: a precise failed_when decides whether you enter rescue, and the rescue decides what to do about it. AI is good at drafting these expressions from a description of the tool’s output, but verify the logic by hand. A single wrong operator here (an and where you meant or, a missing not) means your rescue never triggers, or triggers constantly.
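As a sketch of that pairing, the same health check can sit inside a block so that a "degraded" or "quorum_lost" result routes into rescue just like a non-zero exit code would. The rescue tasks and their messages here are illustrative assumptions, not taken from a real playbook.

```yaml
- name: "Health-gated step with recovery"
  block:
    - name: "Check cluster health"
      ansible.builtin.command: /opt/app/bin/healthcheck --json
      register: health
      changed_when: false
      # Any one of these conditions marks the task failed and enters rescue.
      failed_when: >
        health.rc != 0 or
        'degraded' in health.stdout or
        'quorum_lost' in health.stdout

  rescue:
    - name: "Show what the health check reported"
      ansible.builtin.debug:
        msg: "healthcheck rc={{ health.rc }} output={{ health.stdout }}"

    - name: "Stop the play for this host"
      ansible.builtin.fail:
        msg: "Cluster unhealthy on {{ inventory_hostname }}; refusing to continue"
```

Because health is registered even when the task fails, the rescue can print exactly what the tool said, which makes it easier to spot a failed_when expression that is firing for the wrong reason.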

Scope failures across the fleet, not just one host

Block-level recovery handles a single host gracefully. But what about the play as a whole? If 3 of your 50 web servers fail to upgrade, do you want the remaining 47 to keep going, or do you want to stop everything immediately? Ansible gives you two dials.

- name: "Rolling upgrade across the fleet"
  hosts: webservers
  serial: 5
  max_fail_percentage: 20
  tasks:
    - name: "Upgrade with per-host recovery"
      block:
        - ansible.builtin.include_tasks: upgrade.yml
      rescue:
        - ansible.builtin.include_tasks: rollback.yml

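max_fail_percentage: 20 says: if more than 20% of the hosts in a batch fail even after their rescues, abort the play rather than chewing through the whole fleet. The threshold has to be exceeded, not just reached: with serial: 5, one failed host is exactly 20% and the run continues, while a second failure in the same batch aborts it. The stricter sibling is any_errors_fatal: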

- name: "Coordinated schema migration — all or nothing"
  hosts: db_primaries
  any_errors_fatal: true
  tasks:
    - name: "Apply schema change"
      ansible.builtin.command: /opt/app/bin/apply-schema.sh

With any_errors_fatal: true, a single failed host stops the play for every host: Ansible lets the failing task finish on the other hosts in the current batch, then aborts. Use it when partial application is worse than no application: schema changes, coordinated config rollouts, anything where hosts must stay in lockstep. This is a judgment call about your system’s blast radius, and it’s a great example of where AI proposes but a human decides. The model doesn’t know that your primaries can’t tolerate drift; you do.
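One detail worth checking against your own setup: as far as I understand Ansible’s documented behavior, a host whose rescue completes cleanly is treated as recovered, so it neither counts toward max_fail_percentage nor triggers any_errors_fatal. If you want a rolled-back host to still register as a failure, the rescue has to end by failing. The post never shows rollback.yml, so the contents below are a hypothetical sketch that reuses the rollback.sh script from the first example.

```yaml
# rollback.yml (hypothetical contents, included from the rescue section)
- name: "Roll back to previous release"
  ansible.builtin.command: /opt/app/bin/rollback.sh

# Without this final task the host would be reported as rescued,
# and the fleet-level failure threshold would never see it.
- name: "Mark this host as failed after a successful rollback"
  ansible.builtin.fail:
    msg: "Upgrade failed on {{ inventory_hostname }}; rolled back to previous release"
```

Whether a clean rollback should count as a failure is itself a design choice. Counting it stops a bad release from rolling quietly through every batch; not counting it keeps the run going when failures are isolated and recoverable.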

Nested blocks and notifying on rescue

Real recovery is rarely one level deep. You can nest blocks, and you should notify humans when a rescue actually fires — a silent rollback at 2 a.m. is how you find out about systemic problems three weeks too late.

- name: "Deploy with layered recovery and alerting"
  block:
    - name: "Inner block — risky migration"
      block:
        - ansible.builtin.command: /opt/app/bin/migrate.sh
      rescue:
        - name: "Restore snapshot"
          ansible.builtin.command: /opt/app/bin/restore-snapshot.sh
        - name: "Re-raise to outer rescue"
          ansible.builtin.fail:
            msg: "Migration failed after snapshot restore"

    - name: "Restart application"
      ansible.builtin.systemd:
        name: app
        state: restarted

  rescue:
    - name: "Page on-call via webhook"
      ansible.builtin.uri:
        url: "{{ alert_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Deploy rescue fired on {{ inventory_hostname }} — investigate"

  always:
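    - name: "Record deploy outcome"
      ansible.builtin.lineinfile:
        path: /var/log/deploys.log
        # Create the log on first use; lineinfile fails if the file is missing.
        create: true
        # ansible_date_time comes from gathered facts, so fact gathering must be on.
        line: "{{ ansible_date_time.iso8601 }} {{ app_version }} {{ ansible_failed_task.name | default('ok') }}"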

The ansible_failed_task and ansible_failed_result variables are available inside rescue, so your alert can name the exact task that blew up. Wire that webhook into the same place your incident response workflow and monitoring alerts already live, so a fired rescue shows up where humans are actually looking.
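The webhook body in the example above only names the host. Here is a sketch of how the same alert could carry the failing task and its error, assuming your webhook accepts a JSON payload with a text field as in that example. It uses a single-level block for brevity, and the wording of the message is made up.

```yaml
- name: "Deploy with a descriptive alert"
  block:
    - name: "Run migration"
      ansible.builtin.command: /opt/app/bin/migrate.sh

  rescue:
    - name: "Page on-call with the failing task and its error"
      ansible.builtin.uri:
        url: "{{ alert_webhook_url }}"
        method: POST
        body_format: json
        body:
          # ansible_failed_task and ansible_failed_result are set by Ansible
          # for the task that sent the block into rescue.
          text: >-
            Deploy rescue fired on {{ inventory_hostname }}:
            task '{{ ansible_failed_task.name }}' failed with
            '{{ ansible_failed_result.msg | default("no message") }}'
```

The default filter is there because not every module returns a msg field on failure; for command tasks, ansible_failed_result.stderr is often the more useful field to include.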

The review loop that keeps AI honest

Here’s the workflow I actually run. AI drafts the recovery blocks. A human reads every line. And then — before anything touches a real host — you run check mode.

ansible-playbook upgrade.yml --check --diff

--check is a dry run: Ansible reports what would change without changing it. It’s not perfect (command and shell tasks are skipped in check mode by default, because Ansible can’t predict what an arbitrary script would do), but it catches an enormous class of mistakes, and it costs you thirty seconds. I treat AI-generated playbooks as guilty until a --check run proves them innocent. If you want a structured second pass on the diff before it merges, put it through code review so the recovery logic gets scrutinized like any other change.
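One way to narrow that gap is to let genuinely read-only commands run for real during a dry run, so later tasks that depend on their output still get exercised. This is a sketch under the assumption that the healthcheck script only reads state and changes nothing; verify that yourself before opting any command out of check mode.

```yaml
# Assumes healthcheck is read-only and safe to execute during a dry run.
- name: "Check cluster health, even during --check"
  ansible.builtin.command: /opt/app/bin/healthcheck --json
  register: health
  changed_when: false
  check_mode: false   # run this task normally even when the play is in check mode
```

Anything that mutates state (migrate.sh, rollback.sh, restore-snapshot.sh) should stay skipped in check mode, which means a dry run tells you nothing about whether those scripts work.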

Pro Tip: Run --check --diff against a staging host, then again against ONE production host with --limit, before you ever let the playbook loose on the whole fleet. AI writes the blocks; the gradual rollout is how you survive the blocks being subtly wrong.

And the rule I will repeat until it’s a reflex: never hand AI your vault keys. The model can write block/rescue/always around a task that references {{ vault_db_password }} without ever seeing the secret. Keep ansible-vault decryption out of the loop entirely — paste structure, not secrets. A junior engineer doesn’t get the production vault password on day one, and neither does your AI.
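As a sketch of what "structure, not secrets" can look like: the task below could be pasted into a prompt because the secret appears only as a variable reference. DB_PASSWORD as the environment variable name is an assumption for illustration; how migrate.sh really reads its credentials is not something the original playbook shows.

```yaml
- name: "Migrate with recovery, credentials by reference only"
  block:
    - name: "Run migration"
      ansible.builtin.command: /opt/app/bin/migrate.sh
      environment:
        # Hypothetical variable name; the value is resolved from the vault at run time.
        DB_PASSWORD: "{{ vault_db_password }}"
      no_log: true   # keep this task's arguments and output out of logs

  rescue:
    - name: "Restore database from pre-migration snapshot"
      ansible.builtin.command: /opt/app/bin/restore-snapshot.sh

    - name: "Fail loudly so the pipeline stops"
      ansible.builtin.fail:
        msg: "Migration failed; database restored from snapshot. Investigate before retrying."
```

The trade-off with no_log: true is that the failure details for that task are hidden too, so the rescue message has to carry enough context on its own.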

If you want repeatable prompts for this kind of refactor, I keep a few in my prompt library and the larger prompt packs, and the broader IaC category collects the rest of the Ansible and infra automation work.

Conclusion

That broken 2 a.m. upgrade would have been a non-event with twelve lines of rescue and always. The node would have rolled back, re-entered the pool, and paged me with a clear message instead of a dead service. AI can add that safety net across a hundred playbooks faster than I can read one — but it adds the net I tell it to, in the spots I point at, and never ships without a human read and a --check run. Treat it as a fast junior engineer: brilliant at the mechanical work, dangerous when left unsupervised near production. You bring the judgment. It brings the speed. The recovery blocks are where those two things meet.
