Pre-Flight Checks in Ansible With assert and fail

A few years ago I ran a playbook that restarted a service and flushed a cache. It was meant for staging. The env variable was supposed to be staging, but I’d refactored a group_vars file the night before and left env undefined. My playbook had a fallback: env | default('production'). You can already see where this is going. The playbook ran cleanly, exited green, and took production down for ninety seconds during business hours. No error. No warning. Just a silent default doing exactly what I told it to do, against exactly the host I didn’t mean to touch.

Ansible will happily run against the wrong host with missing variables, because Ansible has no idea what “wrong” means. You have to tell it. The tools for telling it — ansible.builtin.assert, the fail module, failed_when, and fact-based gating — have been in the box the whole time. Most playbooks just never use them. This post is about building pre-flight guards that make a playbook refuse to run when the world isn’t what it expects.

Start with a preflight file that always runs

The single best structural change you can make is to put your guards in a dedicated tasks file and import it first, tagged always, so no --tags invocation can skip it.

# roles/app/tasks/main.yml
- name: Run pre-flight validation before anything else
  ansible.builtin.import_tasks: preflight.yml
  tags: ["always"]

- name: Deploy the application
  ansible.builtin.import_tasks: deploy.yml

The always tag matters. Without it, someone running --tags deploy sails right past your checks. With it, the guards run regardless. Everything below lives in preflight.yml.

Assert that required variables are actually defined

The assert module is your primary tool. It takes a that: list of conditions, all of which must be true, and lets you write a human fail_msg.

- name: Required variables are defined
  ansible.builtin.assert:
    that:
      - app_version is defined
      - target_env is defined
      - db_host is defined
    fail_msg: "Missing required vars. Set app_version, target_env and db_host in group_vars."
    success_msg: "All required variables present."
    quiet: true

quiet: true suppresses the per-assertion output so a passing check doesn’t spam your log. The key discipline here: never use | default() to paper over a variable that determines which environment you touch. A default value for a tuning knob is fine. A default for target_env is how you get my production outage. If the variable matters, assert it exists and let the play stop.

Validate the value, not just the presence

is defined only proves the variable exists. It can still hold garbage. Assert the shape of the value too.

- name: target_env is one we recognise
  ansible.builtin.assert:
    that:
      - target_env in ["staging", "production", "dev"]
    fail_msg: "target_env='{{ target_env }}' is not a known environment."
    quiet: true

- name: Production deploys require an explicit confirmation flag
  ansible.builtin.fail:
    msg: "Refusing to deploy to production without confirm_prod=true."
  when:
    - target_env == "production"
    - not (confirm_prod | default(false) | bool)

That second guard uses the fail module instead of assert because the logic is conditional on context — it only trips for production. The fail module is the right call whenever your guard is “in this situation, stop,” rather than “this must always be true.” Both end the play; they just read differently, and readability is the whole point of a guard.

Gate on the facts: OS family and distribution

A playbook written for RedHat-family hosts that lands on a Debian box will fail in confusing, halfway-applied ways. Catch it up front using gathered facts.

- name: This role only supports the RedHat OS family
  ansible.builtin.assert:
    that:
      - ansible_facts["os_family"] == "RedHat"
    fail_msg: >-
      Unsupported OS family '{{ ansible_facts["os_family"] }}'.
      This role targets RHEL/Rocky/Alma only.
    quiet: true

- name: Distribution is on the supported list
  ansible.builtin.assert:
    that:
      - ansible_facts["distribution"] in ["RedHat", "Rocky", "AlmaLinux"]
    fail_msg: "Distribution '{{ ansible_facts[\"distribution\"] }}' is not supported."
    quiet: true

Note the quoting once you’re indexing ansible_facts["distribution"] inside a double-quoted YAML scalar — escape the inner quotes or the parser chokes. This is exactly the kind of fiddly detail where I lean on AI to get the YAML right the first time.

Compare versions properly with the version test

String comparison on version numbers is a classic trap: "2.10" < "2.9" is true as a string and false as a version. Ansible’s version test does it correctly.

- name: Ansible is new enough for this role
  ansible.builtin.assert:
    that:
      - ansible_version.full is version("2.14", ">=")
    fail_msg: "Needs ansible-core >= 2.14, found {{ ansible_version.full }}."
    quiet: true

- name: Kernel meets the minimum
  ansible.builtin.assert:
    that:
      - ansible_facts["kernel"] is version("5.4", ">=", version_type="loose")
    fail_msg: "Kernel {{ ansible_facts['kernel'] }} is older than 5.4."
    quiet: true

Gate on inventory group membership

If a play should only ever touch web servers, say so. A host that wandered into the wrong group is a real failure mode, and group_names makes it trivial to catch.

- name: This host belongs to the web group
  ansible.builtin.assert:
    that:
      - "'web' in group_names"
    fail_msg: >-
      {{ inventory_hostname }} is not in the 'web' group
      (groups: {{ group_names | join(", ") }}). Refusing to run.
    quiet: true

Pro Tip: Run your whole playbook in --check mode against an intentionally wrong target — a Debian box, a host in the db group, an inventory with target_env unset — and confirm each guard trips with the message you expect. A guard you’ve never seen fail is a guard you don’t actually know works. I treat “watch it fail correctly” as part of writing the guard, not an optional extra.

Check connectivity and headroom before you change anything

Some failures only show up halfway through a run, when you’ve already mutated half the box. Validate the preconditions for the real work first. Use failed_when to turn a normally-soft command result into a hard stop.

- name: Confirm we can reach the database host
  ansible.builtin.wait_for:
    host: "{{ db_host }}"
    port: 5432
    timeout: 5
  register: db_reachable

- name: Capture free space on the deploy target
  ansible.builtin.command: df --output=avail -BG /opt
  register: df_out
  changed_when: false
  failed_when: df_out.rc != 0

- name: Require at least 5 GB free before deploying
  ansible.builtin.assert:
    that:
      - (df_out.stdout_lines[1] | regex_replace("[^0-9]", "") | int) >= 5
    fail_msg: "Less than 5 GB free on /opt. Free space before deploying."
    quiet: true

changed_when: false on the df command keeps a read-only check from reporting a phantom change, and failed_when makes the play stop if the command itself errors. This is the difference between “the disk filled up mid-copy and left a corrupt artifact” and “the play refused to start.”

Let AI draft the guards, but keep the keys

Writing these guards by hand is tedious, and tedium is where corners get cut. This is genuinely good work for an LLM. Treat it like a fast junior engineer: hand it your deploy.yml, your group_vars, and a sentence — “generate a preflight.yml that asserts every required variable is defined, restricts this to the RedHat family and the web group, and checks for 5 GB free on /opt.” You’ll get a solid first draft in seconds, with the fiddly version tests and fact lookups already correct. Tools like Claude, Cursor, or GitHub Copilot are all strong at this shape of mechanical, pattern-heavy YAML, and a good prompt or a focused prompt pack gets you a reusable template.

But a guard you didn’t read is worse than no guard, because it lulls you into trusting a check that may assert the wrong thing. I review every generated line, and I never hand the model my vault. AI drafts the assertions; it does not get the secrets those assertions protect, and it does not get to decide what counts as “production.” Run the draft through check mode, watch it fail against a wrong target, then commit it. The model writes the guard; you remain the one who decides what a wrong host even is.

Pro Tip: Ask the AI to also generate the deliberately-broken invocation that should trip each guard. A model that can articulate how its own check fails has usually written a check that actually fails — and you get your --check test cases for free.

Conclusion

My ninety-second outage came from a single missing variable and a default that hid it. Every guard in this post exists so that a playbook stops loudly instead of proceeding quietly. Put a preflight.yml tagged always at the front of your roles, assert your variables and your facts, gate on group membership, check connectivity and headroom, and let AI draft the boilerplate while you keep the vault and the final say. The goal isn’t a playbook that never fails. It’s a playbook that fails before it touches anything, on the wrong host, every time. For more on hardening your automation, the IaC category has the rest of the series.