Debugging Ansible Variable Precedence With AI: Why the Wrong Value Wins
Untangle Ansible's 22-level variable precedence with AI. Map where a var is defined, see which value wins, and fix silent group_vars and role override bugs fast.
- #iac
- #ansible
- #debugging
- #ai-assisted
At 2 a.m. on a Tuesday I watched a perfectly healthy deploy push our production API pool to the staging database endpoint. No error. No failed task. Every play went green. The only symptom was a slow trickle of writes landing in the wrong cluster until an on-call engineer noticed the staging disk filling up. The culprit was a single variable, db_host, that had the wrong value. It was defined correctly in host_vars. It was also defined — months earlier, by someone long gone — in group_vars/all.yml. And a stray set_fact three roles deep quietly stomped both of them.
That night taught me a brutal lesson: in Ansible, the bug is rarely the value. The bug is which definition of the value wins. Ansible resolves variables through a precedence ladder with 22 levels, and unless you have it memorized, you are guessing. This post is about how I stopped guessing — and how I now use AI as a fast junior engineer to map every place a variable is defined across a repo and explain, in seconds, who wins.
The 22-Level Ladder Nobody Memorizes
Ansible’s documented precedence runs from weakest to strongest. Condensed, from the bottom: role defaults (defaults/main.yml) are the weakest variable source and are meant to be overridden. Above them come inventory file/script group vars, then group_vars/all, then group_vars/*, then inventory host vars and host_vars/*, then host facts and cached set_fact values, then play vars, vars_prompt, vars_files, role vars (vars/main.yml), block vars, task vars, include_vars, set_fact and registered vars, role and include parameters, and finally extra vars passed with -e / --extra-vars, which always win.
The two traps that bite people:
- Role defaults are the weakest thing in the entire system. Anything, without exception, overrides them. That is by design.
- Role vars (vars/main.yml) are far stronger than most people expect. They beat host_vars, group_vars, and play vars. Put a value there “to be safe” and you have just made it impossible to override from inventory; only later sources such as include_vars, set_fact, or -e can still change it.
So my 2 a.m. story makes sense: set_fact outranks host_vars, which is why the staging value won despite the host being explicitly configured.
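To make the ladder concrete, here is a minimal Python sketch that models a condensed version of the documented order and picks the winner among competing definitions. The level names are labels chosen for this sketch, not Ansible identifiers, and the staging hostname is a hypothetical stand-in for the value in the story above.

```python
# Condensed model of Ansible's variable precedence, weakest first.
# The labels are illustrative names for this sketch only.
PRECEDENCE = [
    "role defaults",
    "inventory group vars",
    "group_vars/all",
    "group_vars/*",
    "host_vars",
    "host facts",
    "play vars",
    "vars_files",
    "role vars",
    "block vars",
    "task vars",
    "include_vars",
    "set_fact / registered vars",
    "role params",
    "include params",
    "extra vars",
]
RANK = {name: i for i, name in enumerate(PRECEDENCE)}


def winner(definitions):
    """Return the (level, value) pair with the highest precedence."""
    return max(definitions, key=lambda d: RANK[d[0]])


# The three definitions from the 2 a.m. incident (hostnames are hypothetical).
defs = [
    ("group_vars/all", "staging-db.example.com"),
    ("host_vars", "prod-db-primary.example.com"),
    ("set_fact / registered vars", "staging-db.example.com"),
]
print(winner(defs))  # ('set_fact / registered vars', 'staging-db.example.com')
```

The model ignores the inventory-versus-playbook split for group_vars and host_vars, so treat it as a memory aid rather than a replacement for the official list.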
See What the Host Actually Resolves To
Before touching a single line, dump what Ansible thinks the merged variable set is. ansible-inventory does this without running a playbook:
ansible-inventory --host web-prod-01.example.com --yaml
That shows you the inventory-sourced vars (group_vars, host_vars, inventory file) merged for one host. It does not show play vars, role vars, or set_fact — those only exist at runtime. For the runtime truth, add a debug task right where it matters:
- name: "Show the resolved db_host at point of use"
  ansible.builtin.debug:
    var: db_host
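Or the one-off, no-playbook version straight from the CLI. Note that an ad-hoc run only sees inventory-sourced variables, so it will not reveal play vars, role vars, or a set_fact: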
ansible web-prod-01.example.com -m debug -a "var=db_host"
The gap between what ansible-inventory reports and what debug prints at task time is exactly where a set_fact or role vars entry is hiding. That delta is your bug.
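If you capture both views, finding that delta is a dictionary comparison. The sketch below assumes you have already loaded the static view and the runtime view into plain Python dicts; the function name and sample values are hypothetical.

```python
MISSING = "<not in inventory>"


def changed_at_runtime(static_vars, runtime_vars):
    """Return {name: (static_value, runtime_value)} for every variable
    whose runtime value differs from the inventory-sourced one."""
    return {
        name: (static_vars.get(name, MISSING), value)
        for name, value in runtime_vars.items()
        if name not in static_vars or static_vars[name] != value
    }


# Stand-ins for what ansible-inventory reported and what debug printed.
static_view = {"db_host": "prod-db-primary.example.com", "app_port": 8080}
runtime_view = {"db_host": "staging-db.example.com", "app_port": 8080}

for name, (before, after) in changed_at_runtime(static_view, runtime_view).items():
    print(f"{name}: inventory={before!r} runtime={after!r}")
```

Every name it prints was changed by something that only exists at runtime, which is where to look next.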
Pro Tip: Run your debug task in check mode (ansible-playbook site.yml --check --limit web-prod-01.example.com) so you observe the resolved value without changing anything on the host. Dry-run first, always.
Where AI Earns Its Keep: Mapping Every Definition
Here is the part that used to eat an afternoon. A variable like db_host might be defined in nine places across group_vars/, host_vars/, three roles’ defaults/, one role’s vars/, a vars_files include, and a set_fact. grep -rn db_host . finds the strings, but it does not rank them by precedence, and it does not follow indirect definitions like db_host: "{{ database_endpoint }}", where the real value lives under another name.
This is ideal work for an AI assistant. I paste the grep output (or point a repo-aware tool at the directory) and ask it to build a precedence table. A prompt I reuse:
“Here is every occurrence of db_host in this Ansible repo with file paths. For each, classify the precedence level (role defaults, group_vars/all, group_vars/group, host_vars, play vars, role vars, set_fact, extra-vars). Then tell me which one wins for host web-prod-01 in group prod, and flag any that look like accidental overrides.”
The model returns a ranked table in seconds and points straight at the set_fact that nobody remembered. Treat that output the way you would treat a junior engineer’s first pass: it is a fast lead, not a verdict. I confirm the winner myself with the debug task above before changing anything. The AI narrows the search space; check mode and a human confirm the truth.
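A rough first pass at that table can also be scripted, which gives you something to check the model's answer against. This Python sketch classifies grep -rn hits purely by file path, assuming a conventional repo layout (roles/<name>/defaults, roles/<name>/vars, group_vars/, host_vars/). The sample paths and the cache_db name are hypothetical. Anything it cannot place by path is sorted to the top, because play vars, task vars, and set_fact can outrank every path-detectable level.

```python
from pathlib import PurePosixPath

UNKNOWN = "unclassified: play, task, or set_fact (inspect by hand)"


def classify(path):
    """Map a file path to (rank, label); a higher rank beats a lower one."""
    parts = PurePosixPath(path).parts
    if "roles" in parts and "defaults" in parts:
        return (1, "role defaults")
    if "group_vars" in parts:
        after = parts[parts.index("group_vars") + 1:]
        if after and after[0].split(".")[0] == "all":
            return (2, "group_vars/all")
        return (3, "group_vars/*")
    if "host_vars" in parts:
        return (4, "host_vars")
    if "roles" in parts and "vars" in parts:
        return (5, "role vars")
    # Path alone cannot tell play vars from task vars from set_fact.
    return (99, UNKNOWN)


def rank_hits(grep_lines):
    """Parse `grep -rn` lines (path:line:text) and sort strongest first."""
    hits = []
    for line in grep_lines:
        path, lineno, text = line.split(":", 2)
        rank, label = classify(path)
        hits.append((rank, label, path, int(lineno), text.strip()))
    return sorted(hits, reverse=True)


sample = [
    './roles/app/defaults/main.yml:3:db_host: "db.internal.example.com"',
    './group_vars/all.yml:12:db_host: "staging-db.example.com"',
    './host_vars/web-prod-01.example.com.yml:1:db_host: "prod-db-primary.example.com"',
    './roles/cache/tasks/main.yml:40:    db_host: "{{ cache_db }}"',
]
for rank, label, path, lineno, text in rank_hits(sample):
    print(f"{label:<55} {path}:{lineno}  {text}")
```

The heuristic does not distinguish inventory-adjacent from playbook-adjacent group_vars, and it knows nothing about which groups a host belongs to, so it narrows the list rather than naming the winner.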
If you want a repeatable version of that prompt, the prompt library has Ansible-debugging starters, and the prompt packs bundle the precedence-mapping and dry-run-review prompts together. For interactive repo-wide tracing I lean on Claude or Cursor, which can read the whole group_vars/ tree at once instead of one grep line at a time.
defaults vs vars: The Override Direction Most People Get Backwards
A concrete example. Say a role ships this:
# roles/app/defaults/main.yml
app_port: 8080
db_host: "db.internal.example.com"
You want web-prod-01 to talk to a dedicated database, so you set:
# host_vars/web-prod-01.example.com.yml
db_host: "prod-db-primary.example.com"
This works, because host_vars beats role defaults. Good. Now imagine a well-meaning teammate “hardens” the role:
# roles/app/vars/main.yml
db_host: "db.internal.example.com"
Suddenly your host_vars override is dead, and silently so. Role vars outrank host_vars, so every host snaps back to the shared internal DB and no task fails. Nothing in inventory can win anymore; the only override left from outside the playbook is extra-vars:
ansible-playbook site.yml -e "db_host=prod-db-primary.example.com"
…which is a terrible place to keep a permanent per-host value. The fix is almost always to move that value out of vars/main.yml and back into defaults/main.yml. When I ask AI to review a role diff, “did this move a variable from defaults to vars and break overridability?” is one of my standard questions — it catches the regression before it ships.
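That regression can also be caught mechanically. The sketch below is a deliberately naive Python check that flags any top-level key present in both a role's defaults/main.yml and its vars/main.yml. It uses a regex rather than a YAML parser, so it assumes simple files with top-level keys at column zero; the function names are hypothetical.

```python
import re

# Matches a top-level YAML key: an identifier at column 0 followed by a colon.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:", re.MULTILINE)


def top_level_keys(yaml_text):
    """Return the set of top-level variable names in a simple YAML file."""
    return set(TOP_LEVEL_KEY.findall(yaml_text))


def shadowed_defaults(defaults_text, vars_text):
    """Names defined in defaults that role vars silently override."""
    return sorted(top_level_keys(defaults_text) & top_level_keys(vars_text))


defaults_yml = """\
# roles/app/defaults/main.yml
app_port: 8080
db_host: "db.internal.example.com"
"""
vars_yml = """\
# roles/app/vars/main.yml
db_host: "db.internal.example.com"
"""
print(shadowed_defaults(defaults_yml, vars_yml))  # ['db_host']
```

A non-empty result means the default is no longer the effective value and inventory overrides for that name have stopped working.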
group_vars/all Collisions and the hash_behaviour Trap
The other silent killer is group_vars/all. It is global, it is easy to forget, and it loses to every more-specific group and host. So a value set in group_vars/all.yml looks authoritative in a PR but gets quietly beaten by group_vars/prod.yml. When two definitions of the same key exist at different scopes, the more specific one wins — which is usually what you want, but only if you know both exist.
Now the dictionary trap. By default Ansible replaces entire dictionaries rather than deep-merging them. Given:
# group_vars/all.yml
app_config:
  timeout: 30
  retries: 5
  region: "us-east-1"
# group_vars/prod.yml
app_config:
  region: "us-west-2"
With the default hash_behaviour = replace, prod hosts end up with app_config containing only region: "us-west-2". The timeout and retries keys vanish. People reach for hash_behaviour = merge in ansible.cfg, but that is a global, repo-wide footgun — it changes how every dict merges and surprises future readers. The modern, scoped fix is the combine filter:
app_config: "{{ base_app_config | combine(prod_overrides, recursive=True) }}"
When I’m staring at a dict that lost half its keys, I’ll ask an assistant: “Is this a hash_behaviour replace problem? Show me the combine-filter version.” It is exactly the kind of well-documented pattern a model nails — and exactly the kind of config-wide change I never let it apply blind. Run a code review pass on any ansible.cfg change before it merges; flipping hash_behaviour globally deserves a human signature.
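To see why keys vanish, it helps to model the two behaviours side by side. This Python sketch approximates whole-dict replacement and a recursive merge in the spirit of combine(..., recursive=True). It is an illustration of the semantics for nested dicts only, not Ansible's implementation, and it replaces lists and scalars wholesale.

```python
def replace(base, override):
    """Default behaviour: the higher-precedence dict wins wholesale."""
    return dict(override)


def combine_recursive(base, override):
    """Merge override into base, descending into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = combine_recursive(merged[key], value)
        else:
            merged[key] = value
    return merged


all_config = {"timeout": 30, "retries": 5, "region": "us-east-1"}
prod_config = {"region": "us-west-2"}

print(replace(all_config, prod_config))
# {'region': 'us-west-2'}
print(combine_recursive(all_config, prod_config))
# {'timeout': 30, 'retries': 5, 'region': 'us-west-2'}
```

The first result is what prod hosts get by default; the second is what most people assumed they were getting.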
Guardrails: AI Is the Junior, You Are the Reviewer
A few rules I never break when AI is in the loop on infrastructure:
- Human reviews every change. The model maps and proposes; you decide. A wrong precedence guess that ships is your incident, not the model’s.
- Always dry-run. ansible-playbook --check --diff before any apply. If a change touches variable resolution, run it limited to one host first.
- Never hand AI your vault keys. Don’t paste ansible-vault passwords or decrypted secrets into a prompt. Share variable names and structure, never the encrypted values or the keys that open them.
- Verify the winner at the point of use, not just from ansible-inventory. Runtime facts can override everything you saw statically.
Used inside those rails, AI is genuinely fast at the tedious part — reading every corner of a sprawling inventory and explaining the precedence ladder for a specific host.
Conclusion
Ansible variable bugs are almost never about the value and almost always about which of a dozen definitions wins. The 22-level ladder is the rulebook; ansible-inventory --host, debug: var=, and -m debug are how you observe reality; --check is how you stay safe. AI’s job is to map the maze fast and explain the winner — a sharp junior engineer who never gets bored reading group_vars/. You stay the senior who confirms it, dry-runs it, and signs off.
Start with the IaC guides, grab the ready-made prompt packs, and keep the human in the loop on every apply.