Designing group_vars and host_vars for Multi-Environment

Last quarter I inherited an Ansible repo where staging and prod had drifted so far apart that a “tested” playbook nuked the wrong NTP servers on a prod hypervisor. The root cause wasn’t the playbook. It was a group_vars layout that copy-pasted the same forty variables into dev.yml, staging.yml, and prod.yml, with one typo that only existed in prod. Nobody could see the difference because there was no single source of truth to diff against. That’s the failure mode I want to help you avoid, and it’s exactly the kind of structural work where AI earns its keep: it drafts the layout, decodes someone else’s mess, and reviews your precedence assumptions. You still verify every line, because precedence bugs are silent until they aren’t.

The layout AI should draft for you

When I onboard onto a new inventory or start a clean one, I ask the model to propose a directory tree first, before a single variable gets written. Here’s the kind of prompt that produces something usable:

You are an Ansible expert. Design a group_vars/host_vars directory layout for an inventory with three environments (dev, staging, prod), each containing web and db host groups. Show the full tree. Put shared defaults in all.yml, environment-wide values in per-environment files, and role-type values in group files. Keep all secrets out of plaintext and note where vault-encrypted files go. Explain the precedence order at the end.

A good answer comes back looking like this:

# inventory/prod/hosts.ini
[web]
web01.prod.internal
web02.prod.internal

[db]
db01.prod.internal

[prod:children]
web
db

inventory/
├── prod/
│   ├── hosts.ini
│   ├── group_vars/
│   │   ├── all.yml          # defaults for every host in prod
│   │   ├── web.yml          # web-tier values
│   │   ├── db.yml           # db-tier values
│   │   └── vault.yml        # encrypted secrets for prod
│   └── host_vars/
│       └── db01.prod.internal.yml
├── staging/
│   └── ... same shape ...
└── dev/
    └── ... same shape ...

I deliberately keep a separate inventory directory per environment rather than one giant inventory with dev_web, staging_web, prod_web groups. Separate directories mean you physically cannot run a prod playbook against staging without changing the -i flag, and a git diff inventory/staging inventory/prod shows real differences instead of grep noise. AI will happily generate the single-inventory version if you ask for it, which is why you state the constraint up front.

all.yml is for defaults, not for everything

The most common mistake I see, and the one AI will reproduce if you let it, is treating all.yml as a junk drawer. The rule I hold the line on: all.yml holds values that are genuinely the same everywhere or sensible defaults that lower-precedence layers can override. Everything environment-specific lives in the per-environment all.yml inside that inventory directory.

# inventory/prod/group_vars/all.yml
# Shared baseline for every prod host
ntp_servers:
  - 0.pool.ntp.org
  - 1.pool.ntp.org
timezone: "UTC"
common_packages:
  - vim
  - curl
  - chrony
log_retention_days: 30
environment_name: "prod"

Then the group files carry only what makes a web host a web host:

# inventory/prod/group_vars/web.yml
nginx_worker_processes: auto
nginx_keepalive_timeout: 65
app_listen_port: 8443
deploy_user: "appdeploy"
feature_flags:
  new_billing: true

The discipline here is subtractive. Before a variable goes into web.yml, ask whether it differs by tier. If it doesn’t, it belongs one level up. I lean on AI for this review pass because a human skims past duplication; the model is good at flagging “this key appears identically in web.yml and db.yml, hoist it to all.yml.”

Killing duplication across environments

This is where most teams give up and copy-paste. The trick is to push the shape of your config into group_vars/all.yml at the repo root (a true global, above any environment) and let each environment override only the leaf values. Ansible merges these if you enable hash merging, but I avoid hash_behaviour = merge because it’s invisible and bites you later. Instead I keep variables flat and override whole keys, which is explicit and diffable.

# inventory/prod/group_vars/all.yml
app_replicas: 6
db_max_connections: 500
log_retention_days: 90

# inventory/dev/group_vars/all.yml
app_replicas: 1
db_max_connections: 50
log_retention_days: 7

Same keys, different values, no shared file pretending to be clever. When someone asks “what’s different about prod,” the answer is a three-line diff. If you want to go deeper on why a value resolves the way it does, I walked through the debugging side of this in debugging Ansible variable precedence with AI — worth reading alongside this if precedence still feels like guesswork.

Precedence layering, decoded

You cannot design a good layout without internalizing the precedence order, and this is genuinely where AI shines as a decoder. Paste your variable sprawl and ask it to predict the resolved value, then verify with ansible-inventory. The layers that matter here, low to high:

inventory/group_vars/all.yml (role defaults are even lower)
inventory/group_vars/<group>.yml
inventory/host_vars/<host>.yml
-e extra vars (always wins)

A host_vars file beats every group file, which is exactly what you want for the one snowflake box:

# inventory/prod/host_vars/db01.prod.internal.yml
# This is the primary; it carries more connections than its peers
db_max_connections: 800
backup_role: "primary"

Never trust your mental model. Verify it:

# See the full group/host tree
ansible-inventory -i inventory/prod/hosts.ini --graph

# Dump the fully-resolved variables for one host
ansible-inventory -i inventory/prod/hosts.ini --host db01.prod.internal

That --host dump is the source of truth. When AI tells you a value will resolve a certain way, this command is how you catch it being wrong before prod does. I treat the model’s precedence explanation as a hypothesis and this command as the test.

Environment-specific overrides without forking logic

The goal is one set of roles and playbooks driven entirely by data. Your tasks reference app_replicas and db_max_connections; the inventory decides the numbers. That means an environment override is never a code branch — it’s a value in the right file. If you find yourself writing when: environment_name == "prod" in a task, that’s usually a smell that a variable should have absorbed the difference instead.

# A task that stays environment-agnostic
- name: Render app config
  ansible.builtin.template:
    src: app.conf.j2
    dest: /etc/app/app.conf
  vars:
    replicas: "{{ app_replicas }}"
    max_conn: "{{ db_max_connections }}"

When you do need a genuinely structural difference, scope it to the group file rather than littering conditionals across roles. AI is useful here for spotting the conditionals you’ve scattered and proposing the variable that collapses them — ask it to “find when: clauses that branch on environment and suggest variable-driven replacements.”

Secrets stay in vault, full stop

No secret ever lands in all.yml, a group file, or a host file in plaintext. Each environment gets its own vault.yml, encrypted with its own key, so a leaked dev key never decrypts prod. The pattern I use is a plaintext “pointer” variable that references a vaulted value, which keeps the encrypted blob isolated and greppable:

# inventory/prod/group_vars/all.yml (plaintext)
db_password: "{{ vault_db_password }}"
api_token: "{{ vault_api_token }}"

# inventory/prod/group_vars/vault.yml (ansible-vault encrypted)
vault_db_password: "actual-secret-here"
vault_api_token: "actual-token-here"

ansible-vault encrypt inventory/prod/group_vars/vault.yml
ansible-playbook -i inventory/prod/hosts.ini site.yml --ask-vault-pass

This indirection means your roles reference db_password and never touch the vault namespace directly, and you can audit every secret by grepping for vault_. Be careful what you paste into an AI tool: decrypt nothing into a prompt. Ask the model about structure — “is this vault layout sane, are secrets isolated per environment” — never the values. I keep my whole vault workflow, including key rotation and the editor dance, in managing Ansible vault secrets without losing your mind.

Where the human stays in control

AI drafts the tree, decodes the precedence, and flags the duplication faster than you would by hand. What it cannot do is know that db01 is your primary, that prod NTP must point at the internal pool, or that one feature flag is mid-rollout. Those are facts about your environment, and getting them wrong is how you nuke the right config in the wrong place.

So the loop is: let the model propose the layout, then run ansible-inventory --graph and --host on every environment and read the resolved output yourself. Diff the environments. Confirm no secret sits in plaintext. If you want a starting set of prompts for this kind of structural review, I keep a few in the prompt library, and the rest of my Ansible writeups live under the Ansible category. The layout the AI hands you is a draft. The inventory you ship is yours.

Designing group_vars and host_vars for Multi-Environment Inventories With AI