Taming ansible-lint With AI: From a Wall of Warnings to Clean Runs
Use AI to triage a noisy ansible-lint report, write a sane .ansible-lint config, fix rule violations, and wire it into CI, with human review and dry runs.
- #iac
- #ansible
- #ai
- #ansible-lint
- #ci
I inherited a repo last month with exactly one piece of documentation: a Slack message that said “don’t run the linter, it’ll ruin your day.” Naturally, I ran the linter. ansible-lint printed 400-some violations and then my terminal kept scrolling for what felt like a full minute. Nobody on the team wanted to touch it, because the playbooks worked, and “working but ugly” beats “broken but pretty” in production.
But 400 warnings is not a style problem. Buried in that wall were real footguns: shell pipes that silently swallow failures, tasks that always report changed, and modules called by short name that could break on the next Ansible upgrade. So I did what I now do with every legacy lint mess: I used AI as a fast junior engineer to triage and propose fixes, while I stayed the adult in the room who actually reviews and runs things. Here’s the workflow that got me to a clean run without breaking a single playbook.
First, See What You’re Actually Dealing With
Before fixing anything, get a count by rule. A raw scroll is useless; a histogram tells you where the bodies are buried.
ansible-lint 2>/dev/null | grep -oE '^[a-z0-9-]+(\[[a-z0-9-]+\])?:' | sed 's/:$//' | sort | uniq -c | sort -rn
On my repo this produced something like:
142 name[casing]
78 fqcn[action-core]
53 no-changed-when
41 risky-shell-pipe
29 var-naming[pattern]
18 yaml[line-length]
...
This is the single most useful artifact to hand an AI. I pasted the histogram plus a representative sample of each rule’s output and asked it to group the violations into three buckets: mechanical (safe to bulk-fix, like casing and FQCN), semantic (needs a judgment call, like no-changed-when), and dangerous (touches shell behavior or secrets). That triage map became my plan of attack. The AI is great at this part because it’s reading docs and patterns faster than I can, but I treat its bucketing as a proposal, not a verdict.
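If you want a first-pass bucketing before the AI even sees the histogram, a small shell helper can do it. This is a minimal sketch, not part of ansible-lint: bucket_for_rule and bucket_histogram are hypothetical names, and the rule-to-bucket mapping is my assumption for this repo, so adjust it to yours.

```shell
# Hypothetical helper: map a rule ID to a triage bucket.
# The mapping below is an assumption for illustration, not an ansible-lint feature.
bucket_for_rule() {
  case "$1" in
    name\[*\]|fqcn\[*\]|yaml\[*\]) echo "mechanical" ;;
    no-changed-when|var-naming\[*\]) echo "semantic" ;;
    risky-*|no-log-password) echo "dangerous" ;;
    *) echo "unsorted" ;;
  esac
}

# Read "COUNT RULE" lines (the uniq -c histogram) on stdin and print
# "BUCKET COUNT RULE", grouped by bucket, largest count first.
bucket_histogram() {
  while read -r count rule; do
    # Skip blank or malformed lines such as a trailing "...".
    [ -n "$rule" ] || continue
    printf '%s %s %s\n' "$(bucket_for_rule "$rule")" "$count" "$rule"
  done | sort -k1,1 -k2,2nr
}
```

Pipe the histogram command's output into bucket_histogram. Anything that lands in unsorted is what I would ask the AI about first, and the mapping itself still gets the same human review as the AI's proposal.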
The Violations That Actually Matter
Let’s look at real output instead of hand-waving. Here’s risky-shell-pipe, which is the one I care about most:
risky-shell-pipe: Shells that use pipes should set the pipefail option.
deploy.yml:34 Task/Handler: Fetch and load image
# before
- name: fetch and load image
  shell: curl -sSL https://registry.internal/image.tar | docker load
A pipeline like this returns the exit code of the last command only. If curl dies halfway through the download, Ansible sees whatever docker load returns and may march on. One more trap: curl exits 0 on an HTTP 404 unless you pass -f (--fail), so pipefail alone would not catch a missing image. The fix is real, not cosmetic:
# after
- name: Fetch and load image
  ansible.builtin.shell: |
    set -o pipefail
    curl -fsSL https://registry.internal/image.tar | docker load
  args:
    executable: /bin/bash
  changed_when: true
That single change quietly fixed three rules at once: risky-shell-pipe, name[casing] (capitalized the task name), and fqcn[action-core] (shell to ansible.builtin.shell).
no-changed-when is the sneaky one. AI loves to “fix” it by slapping changed_when: false on everything, which is wrong if the task actually changes state. I tell it explicitly: only set changed_when: false on read-only commands; for anything that mutates state, derive changed_when from the command’s output or rc.
no-changed-when: Commands should not change things if nothing needs doing.
restart.yml:12 Task/Handler: Restart application
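# after: derived, not blindly silenced
- name: Restart application
  ansible.builtin.command: systemctl restart myapp
  register: restart_result
  # A failed restart fails the task, so this reports changed on every
  # successful run, which is accurate for a restart.
  # ansible-lint may still flag systemctl here via command-instead-of-module.
  changed_when: restart_result.rc == 0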
Pro Tip: When an AI proposes changed_when: false, make it justify the choice in one sentence per task. “This only reads state” is acceptable; “this makes the warning go away” means you just lied to your idempotency reporting and future-you will pay for it during a --check run.
A Sane .ansible-lint Config
You will not fix 400 things in one sitting, and you shouldn’t pretend to. The trick is to fail CI on the rules that matter today while parking the rest as warnings you can burn down over time. This is where .ansible-lint earns its keep.
# .ansible-lint
---
profile: moderate
exclude_paths:
- .cache/
- .github/
- molecule/
- tests/fixtures/
- "group_vars/all/vault.yml"
# These FAIL the build. Keep this list honest and small at first.
# (no override needed — they fail by default under the profile)
# These only WARN. Burn them down, then promote to failures.
warn_list:
- yaml[line-length]
- var-naming[pattern]
- name[template]
# These we have consciously decided to ignore. Each one is a debt.
skip_list:
- schema[meta] # legacy roles, no galaxy publish planned
- role-name[path] # vendored roles we don't rename
A few opinions baked in here. exclude_paths lists the vault file explicitly, which matters when we get to autofix. warn_list is your runway: rules land there, you fix them, then you move them out so they fail. skip_list is a graveyard, and every entry should embarrass you slightly. I had the AI draft this file from my histogram, then I rewrote half of it, because the AI defaulted to skipping things I’d rather fix. Quoting matters too: "group_vars/all/vault.yml" would be valid unquoted, and I quote it only so the vault entry stands out in review. Quoting becomes mandatory when a value contains a colon followed by a space, or a space followed by #, because YAML would otherwise read it as a mapping or a comment. No tabs, ever.
Profiles: Aim Higher Than You Think
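ansible-lint ships escalating profiles: min, basic, moderate, safety, shared, production. They’re cumulative; each one turns on more rules. Most legacy repos can’t survive production on day one, so I start at moderate and ratchet up. A rule only fires once its profile, or a stricter one, is selected, so run ansible-lint --list-profiles to confirm which profile owns the rules you care about before relying on the build to catch them.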
# See exactly which rules a profile adds before you commit to it
ansible-lint --profile production --list-rules
The endgame is profile: production in your config. That profile demands FQCN everywhere, named tasks, no risky shells, and forbids latest package pins. It’s strict on purpose; if you can pass it, your playbooks are genuinely portable. I let the AI tell me the delta between moderate and production for my specific repo, which turned a scary “turn on production mode” into a concrete, finite checklist of maybe 30 tasks.
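One way to cross-check that delta yourself is to lint under both profiles, capture the output, and compare the rule IDs that fire. This is a rough sketch that assumes the default two-line output format shown earlier, with the rule ID at the start of the line. rule_ids and profile_delta are hypothetical helper names, not ansible-lint features.

```shell
# Extract the sorted, unique rule IDs from a file of captured ansible-lint
# output (assumes the default format: "rule-id[tag]: message").
rule_ids() {
  grep -oE '^[a-z0-9-]+(\[[a-z0-9-]+\])?:' "$1" | sed 's/:$//' | sort -u
}

# Print rule IDs that appear in the stricter run ($2) but not in the
# baseline run ($1).
profile_delta() {
  delta_base="$(mktemp)"
  delta_strict="$(mktemp)"
  rule_ids "$1" > "$delta_base"
  rule_ids "$2" > "$delta_strict"
  # comm -13 keeps only the lines unique to the second file.
  comm -13 "$delta_base" "$delta_strict"
  rm -f "$delta_base" "$delta_strict"
}

# Usage, run from the repo root:
#   ansible-lint --profile moderate   2>/dev/null > moderate.txt
#   ansible-lint --profile production 2>/dev/null > production.txt
#   profile_delta moderate.txt production.txt
```

The result is a list of rule IDs, not tasks, so it tells you which rule families are new under the stricter profile. Counting the individual violations behind each one is the histogram command's job.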
Autofix, Carefully
ansible-lint --fix (formerly --write) will rewrite files in place for the rules that support it: FQCN, casing, YAML formatting. It is genuinely useful and it is also a loaded gun.
# Scope it. Don't unleash --fix on the whole tree blindly.
# Quote the list so the shell doesn't treat [casing] as a glob pattern.
ansible-lint --fix='fqcn,name[casing]' roles/web/ playbooks/deploy.yml
My hard rules for autofix, learned the painful way:
- Run it on a clean git tree so git diff shows you exactly what changed. If you can’t read the diff, you can’t approve it.
- Never run --fix over vault-encrypted files or heavily templated Jinja. The formatter can mangle {{ }} blocks and it has no business rewriting ciphertext. That’s what the exclude_paths entry above is for.
- Fix one rule family per commit. “fqcn across the repo” is a reviewable commit. “fixed the linter” is not.
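The first two rules can be backed by a small guard in front of --fix. This is a sketch under assumptions: is_vault_file, fixable_paths, and tree_is_clean are hypothetical helper names, and the vault check relies only on the $ANSIBLE_VAULT; header that fully encrypted files start with. It will not notice inline !vault values inside an otherwise plain file.

```shell
# Return 0 if the file begins with the Ansible Vault header.
is_vault_file() {
  [ "$(head -c 15 "$1" 2>/dev/null)" = '$ANSIBLE_VAULT;' ]
}

# Print only the paths that are not vault-encrypted files;
# skipped paths are reported on stderr.
fixable_paths() {
  for candidate in "$@"; do
    if is_vault_file "$candidate"; then
      echo "skipping vault file: $candidate" >&2
    else
      printf '%s\n' "$candidate"
    fi
  done
}

# Return 0 when the `git status --porcelain` output passed as $1 is empty,
# meaning there are no uncommitted changes.
tree_is_clean() {
  [ -z "$1" ]
}

# Usage, before any --fix run:
#   tree_is_clean "$(git status --porcelain)" || { echo "commit or stash first" >&2; exit 1; }
#   fixable_paths group_vars/all/*.yml
```

I treat this as a belt on top of the exclude_paths suspenders, not a replacement for them.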
After every --fix batch I run the linter again and run check-mode against a staging inventory:
ansible-lint roles/web/ && \
ansible-playbook -i inventory/staging playbooks/deploy.yml --check --diff
If --check shows unexpected changed lines, a “formatting-only” fix wasn’t formatting-only. Catch that in staging, not at 2 a.m.
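To make "unexpected" less of a gut call, you can compare the changed totals from the PLAY RECAP before and after a --fix batch. A minimal sketch, assuming the check-mode output was saved to a file without color codes; recap_changed is a hypothetical helper name.

```shell
# Sum the changed=N counters from the PLAY RECAP section of a saved
# ansible-playbook log. Lines before the recap are ignored.
recap_changed() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^changed=[0-9]+$/) {
          split($i, kv, "=")
          total += kv[2]
        }
      }
    }
    END { print total + 0 }
  ' "$1"
}

# Usage:
#   ansible-playbook -i inventory/staging playbooks/deploy.yml --check --diff > before.log
#   (apply the --fix batch, then capture after.log the same way)
#   [ "$(recap_changed before.log)" = "$(recap_changed after.log)" ] || echo "changed count moved"
```

Matching totals do not prove the two runs are identical, so I still read the --diff output. A total that moves is a cheap early signal that a fix touched behavior.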
Pro Tip: AI and --fix are different tools. Use --fix for the mechanical rules it supports, and save the AI for the semantic ones (no-changed-when, risky-shell-pipe logic, var-naming) where a human-readable rationale matters. Don’t ask the AI to reproduce what a deterministic autofixer already does.
Wire It Into CI So It Stays Clean
A clean repo rots in a week without a gate. I add ansible-lint in two places: a pre-commit hook for fast local feedback, and a CI job that’s the actual source of truth.
# .pre-commit-config.yaml
---
repos:
  - repo: https://github.com/ansible/ansible-lint
    rev: v25.2.0
    hooks:
      - id: ansible-lint
# .github/workflows/lint.yml
---
name: ansible-lint
on:
  pull_request:
  push:
    branches: ["main"]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run ansible-lint
        uses: ansible/ansible-lint@v25
        with:
          args: ""
The CI job reads .ansible-lint, so your warn_list and skip_list apply automatically. Warnings show up in logs without failing the build, which gives you a visible, shrinking backlog. When you’ve cleared a rule out of warn_list, the build starts failing on regressions, and the cleanup sticks.
Where AI Helps and Where It Must Not
The throughline of this whole project: AI is a fast junior engineer. It triaged 400 violations into a sane plan in minutes, drafted my config, and wrote per-rule fixes with explanations I could check. That’s enormous leverage. If you want to push that further, a structured review workflow like the one behind our code review dashboard catches the same classes of issues on every PR, and a curated prompt pack saves you from re-explaining “derive changed_when, don’t silence it” for the hundredth time.
But the junior doesn’t get the keys to everything. Concretely: I never hand the AI the vault keys, I never let it (or --fix) touch encrypted files, and I review every single change before it lands. After each batch I re-run ansible-lint and --check mode myself, because the model is confident in ways that don’t always survive contact with a real inventory. Tools like Claude or GitHub Copilot are excellent at proposing the fix; they are not accountable for the 3 a.m. page when it’s wrong. You are.
Conclusion
A 400-violation ansible-lint report isn’t a reason to disable the linter; it’s a map. Get a histogram, triage with AI into mechanical, semantic, and dangerous buckets, write an honest .ansible-lint with a warn_list you actually intend to drain, autofix the mechanical rules on a clean tree, and gate everything in CI. Let AI move fast on the proposals; you stay slow and deliberate on the review, the dry runs, and anything that smells like a secret. Two weeks later my “don’t run the linter” repo passes the production profile, and the Slack channel is quiet.