Debugging Ansible Variable Precedence With AI: Why the Wrong Value Wins

At 2 a.m. on a Tuesday I watched a perfectly healthy deploy push our production API pool to the staging database endpoint. No error. No failed task. Every play went green. The only symptom was a slow trickle of writes landing in the wrong cluster until an on-call engineer noticed the staging disk filling up. The culprit was a single variable, db_host, that had the wrong value. It was defined correctly in host_vars. It was also defined — months earlier, by someone long gone — in group_vars/all.yml. And a stray set_fact three roles deep quietly stomped both of them.

That night taught me a brutal lesson: in Ansible, the bug is rarely the value. The bug is which definition of the value wins. Ansible resolves variables through a precedence ladder with 22 levels, and unless you have it memorized, you are guessing. This post is about how I stopped guessing — and how I now use AI as a fast junior engineer to map every place a variable is defined across a repo and explain, in seconds, who wins.

The 22-Level Ladder Nobody Memorizes

Ansible’s documented precedence runs from weakest to strongest. Roughly, from the bottom: role defaults (defaults/main.yml) are the weakest real variables and are meant to be overridden. (The documented list puts command-line values such as -u my_user below them, but those are not variables.) Above role defaults come inventory file/script group vars, then group_vars/all, then group_vars/*, then host vars and host_vars/*, then host facts and cached set_facts, then play vars, vars_prompt, vars_files, role vars (vars/main.yml), block vars, task vars, include_vars, set_fact and registered vars, role and include params, and finally extra vars passed with -e / --extra-vars, which always win.

The two traps that bite people:

Role defaults are the weakest thing in the entire system. Anything — anything — overrides them. That is by design.
Role vars (vars/main.yml) sit high on the ladder. They beat host_vars, group_vars, and play vars. Put a value there “to be safe” and nothing in inventory can override it per host; only stronger sources such as task vars, include_vars, set_fact, role params, or -e still can.

So my 2 a.m. story makes sense: set_fact outranks host_vars, which is why the staging value won despite the host being explicitly configured.
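To watch the ladder in action without touching a real host, here is a minimal sketch of a throwaway playbook. The file name precedence_demo.yml and the placeholder values are my own inventions for illustration; only the ordering comes from the documented precedence list.

```yaml
# precedence_demo.yml (hypothetical file name, runs against the implicit localhost)
- name: Show which definition of db_host wins
  hosts: localhost
  gather_facts: false
  vars:
    # Play vars: stronger than inventory, weaker than set_fact
    db_host: "from-play-vars"
  tasks:
    - name: "Before set_fact, the play-level value is in effect"
      ansible.builtin.debug:
        var: db_host

    - name: "A stray set_fact, like the one three roles deep"
      ansible.builtin.set_fact:
        db_host: "from-set-fact"

    - name: "After set_fact, the play-level value has lost"
      ansible.builtin.debug:
        var: db_host
```

Run it with ansible-playbook precedence_demo.yml and the first debug should report the play-level value while the second reports the set_fact value. Add -e "db_host=from-extra-vars" and both tasks report the extra-vars value, because set_fact cannot override extra vars.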

See What the Host Actually Resolves To

Before touching a single line, dump what Ansible thinks the merged variable set is. ansible-inventory does this without running a playbook:

ansible-inventory --host web-prod-01.example.com --yaml

That shows you the inventory-sourced vars (group_vars, host_vars, inventory file) merged for one host. It does not show play vars, role vars, or set_fact — those only exist at runtime. For the runtime truth, add a debug task right where it matters:

- name: "Show the resolved db_host at point of use"
  ansible.builtin.debug:
    var: db_host

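Or the one-off, no-playbook version straight from the CLI. An ad hoc run loads no play and no role, so like ansible-inventory it reflects only inventory-sourced values: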

ansible web-prod-01.example.com -m debug -a "var=db_host"

The gap between what ansible-inventory reports and what debug prints at task time is exactly where a set_fact or role vars entry is hiding. That delta is your bug.
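Once you can see the resolved value, you can also make a wrong one fail loudly instead of passing green. What follows is a sketch rather than a drop-in guard: it assumes production hosts belong to a group named prod and that a bad endpoint contains the string staging, both of which you would adapt to your own naming.

```yaml
# Place this just before the first task that consumes db_host.
- name: "Refuse to continue if a prod host resolves to a staging database"
  ansible.builtin.assert:
    that:
      - db_host is defined
      # Assumption: staging endpoints contain the substring 'staging'
      - "'staging' not in db_host"
    fail_msg: "db_host resolved to {{ db_host | default('UNDEFINED') }} on {{ inventory_hostname }}"
    success_msg: "db_host resolved to {{ db_host | default('UNDEFINED') }}"
  # Assumption: production hosts are members of a group called 'prod'
  when: "'prod' in group_names"
```

Because the assert runs at the point of use, it sees the same winner the real task would see, including any set_fact or role vars entry that ansible-inventory cannot show.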

Pro Tip: Run your debug task in check mode — ansible-playbook site.yml --check --limit web-prod-01 — so you observe the resolved value without changing anything on the host. Dry-run first, always.

Where AI Earns Its Keep: Mapping Every Definition

Here is the part that used to eat an afternoon. A variable like db_host might be defined in nine places across group_vars/, host_vars/, three roles’ defaults/, one role’s vars/, a vars_files include, and a set_fact. grep -rn db_host . finds the strings, but it does not rank them by precedence — and it misses indirect definitions like db_host: "{{ database_endpoint }}".

This is ideal work for an AI assistant. I paste the grep output (or point a repo-aware tool at the directory) and ask it to build a precedence table. A prompt I reuse:

“Here is every occurrence of db_host in this Ansible repo with file paths. For each, classify the precedence level (role defaults, group_vars/all, group_vars/group, host_vars, play vars, vars_files, role vars, include_vars, set_fact, extra-vars). Then tell me which one wins for host web-prod-01 in group prod, and flag any that look like accidental overrides.”

The model returns a ranked table in seconds and points straight at the set_fact that nobody remembered. Treat that output the way you would treat a junior engineer’s first pass: it is a fast lead, not a verdict. I confirm the winner myself with the debug task above before changing anything. The AI narrows the search space; check mode and a human confirm the truth.
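The prompt is only as good as the context it carries, so it helps to capture the host's group membership and the runtime value alongside the grep output. A small sketch of a task that gathers this in one place; the keys inside msg are names I made up, while inventory_hostname, group_names, and ansible_play_role_names are Ansible's own special variables.

```yaml
- name: "Collect context to paste next to the grep output"
  ansible.builtin.debug:
    msg:
      host: "{{ inventory_hostname }}"
      # Every group the host belongs to, which decides which group_vars apply
      member_of: "{{ group_names }}"
      # The value as resolved at this exact point in the play
      db_host_resolved: "{{ db_host | default('UNDEFINED') }}"
      # Roles in the current play, each of which may carry defaults/ or vars/
      roles_in_play: "{{ ansible_play_role_names | default([]) }}"
```

Only do this for variables that are not secrets, since the output lands in your terminal and then in a prompt.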

If you want a repeatable version of that prompt, the prompt library has Ansible-debugging starters, and the prompt packs bundle the precedence-mapping and dry-run-review prompts together. For interactive repo-wide tracing I lean on Claude or Cursor, which can read the whole group_vars/ tree at once instead of one grep line at a time.

defaults vs vars: The Override Direction Most People Get Backwards

A concrete example. Say a role ships this:

# roles/app/defaults/main.yml
app_port: 8080
db_host: "db.internal.example.com"

You want web-prod-01 to talk to a dedicated database, so you set:

# host_vars/web-prod-01.example.com.yml
db_host: "prod-db-primary.example.com"

This works, because host_vars beats role defaults. Good. Now imagine a well-meaning teammate “hardens” the role:

# roles/app/vars/main.yml
db_host: "db.internal.example.com"

Suddenly your host_vars override is dead, and silently so. Role vars outrank host_vars, so every host snaps back to the shared internal DB and no task fails. Nothing in inventory can win now; the quickest way to force the value back is extra vars:

ansible-playbook site.yml -e "db_host=prod-db-primary.example.com"

…which is a terrible place to keep a permanent per-host value. The fix is almost always to move that value out of vars/main.yml and back into defaults/main.yml. When I ask AI to review a role diff, “did this move a variable from defaults to vars and break overridability?” is one of my standard questions — it catches the regression before it ships.
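For reference, this is roughly what the repaired role looks like. It is a sketch of one common convention, not something this role's author prescribed: overridable knobs stay in defaults, and vars/main.yml keeps only internal constants. The __app_ prefix and the __app_config_dir variable are hypothetical names chosen for illustration.

```yaml
# roles/app/defaults/main.yml
# Overridable knobs: inventory (group_vars, host_vars) can replace any of these.
app_port: 8080
db_host: "db.internal.example.com"

# roles/app/vars/main.yml
# Internal constants only: values no host should ever need to change.
# Hypothetical name; the prefix signals "not a tuning knob".
__app_config_dir: "/etc/app"
```

With db_host back in defaults, the host_vars entry for web-prod-01 wins again and the -e workaround can go away.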

group_vars/all Collisions and the hash_behaviour Trap

The other silent killer is group_vars/all. It is global, it is easy to forget, and it loses to every more-specific group and host. So a value set in group_vars/all.yml looks authoritative in a PR but gets quietly beaten by group_vars/prod.yml. When two definitions of the same key exist at different scopes, the more specific one wins — which is usually what you want, but only if you know both exist.
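A two-file sketch makes the scalar case concrete. The prod hostname below is a placeholder I picked for illustration.

```yaml
# group_vars/all.yml
# Applies to every host unless something more specific defines the same key.
db_host: "db.internal.example.com"

# group_vars/prod.yml
# Applies only to hosts in the prod group, and beats group_vars/all for them.
db_host: "prod-db.example.com"   # hypothetical value
```

A host in prod resolves db_host from group_vars/prod.yml; a host outside prod falls back to group_vars/all.yml. A reviewer who only sees the all.yml diff has no hint that prod ignores it.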

Now the dictionary trap. By default Ansible replaces entire dictionaries rather than deep-merging them. Given:

# group_vars/all.yml
app_config:
  timeout: 30
  retries: 5
  region: "us-east-1"

# group_vars/prod.yml
app_config:
  region: "us-west-2"

With the default hash_behaviour = replace, prod hosts end up with app_config containing only region: "us-west-2". The timeout and retries keys vanish. People reach for hash_behaviour = merge in ansible.cfg, but that is a global, repo-wide footgun — it changes how every dict merges and surprises future readers. The modern, scoped fix is the combine filter:

app_config: "{{ base_app_config | combine(prod_overrides, recursive=True) }}"
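To try that line in isolation, here is a small self-contained playbook sketch. It reuses the base_app_config and prod_overrides names from the one-liner and defines them as play vars purely so the example runs on its own; in a real repo they would more likely live in group_vars.

```yaml
- name: Merge a dict with the combine filter instead of hash_behaviour
  hosts: localhost
  gather_facts: false
  vars:
    base_app_config:
      timeout: 30
      retries: 5
      region: "us-east-1"
    prod_overrides:
      region: "us-west-2"
    # recursive=True merges nested dicts key by key instead of replacing them
    app_config: "{{ base_app_config | combine(prod_overrides, recursive=True) }}"
  tasks:
    - name: "Show the merged result"
      ansible.builtin.debug:
        var: app_config
```

The merged app_config keeps timeout: 30 and retries: 5 and takes region: "us-west-2" from the overrides. If prod_overrides is only defined for some groups, writing combine(prod_overrides | default({}), recursive=True) keeps other hosts from failing on an undefined variable.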

When I’m staring at a dict that lost half its keys, I’ll ask an assistant: “Is this a hash_behaviour replace problem? Show me the combine-filter version.” It is exactly the kind of well-documented pattern a model nails — and exactly the kind of config-wide change I never let it apply blind. Run a code review pass on any ansible.cfg change before it merges; flipping hash_behaviour globally deserves a human signature.

Guardrails: AI Is the Junior, You Are the Reviewer

A few rules I never break when AI is in the loop on infrastructure:

Human reviews every change. The model maps and proposes; you decide. A wrong precedence guess that ships is your incident, not the model’s.
Always dry-run. ansible-playbook --check --diff before any apply. If a change touches variable resolution, run it limited to one host first.
Never hand AI your vault keys. Don’t paste ansible-vault passwords or decrypted secrets into a prompt. Share variable names and structure, never the secret values themselves or the keys that open them.
Verify the winner at the point of use, not just from ansible-inventory. Runtime facts can override everything you saw statically.
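One layout that makes the vault rule easier to follow is an indirection between plain and vaulted variables, so the file you share contains names and structure but no secrets. A sketch under assumed names: the group_vars/prod/ directory layout, db_user, db_password, and vault_db_password are all illustrative.

```yaml
# group_vars/prod/vars.yml
# Plain text: safe to grep and safe to show an assistant.
db_user: "app"
db_password: "{{ vault_db_password }}"

# group_vars/prod/vault.yml
# Encrypted with ansible-vault; never paste this file or its password.
# Once decrypted it would define:
#   vault_db_password: <secret>
```

The assistant can still see that db_password exists and where it is defined, which is all the precedence question needs.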

Used inside those rails, AI is genuinely fast at the tedious part — reading every corner of a sprawling inventory and explaining the precedence ladder for a specific host.

Conclusion

Ansible variable bugs are almost never about the value and almost always about which of a dozen definitions wins. The 22-level ladder is the rulebook; ansible-inventory --host, debug: var=, and -m debug are how you observe reality; --check is how you stay safe. AI’s job is to map the maze fast and explain the winner — a sharp junior engineer who never gets bored reading group_vars/. You stay the senior who confirms it, dry-runs it, and signs off.

Start with the IaC guides, grab the ready-made prompt packs, and keep the human in the loop on every apply.
