AI-Assisted Ansible: Debugging Become and Connection Failures
Decode Ansible UNREACHABLE errors, sudo prompts, become_method, ProxyJump, and host key failures faster, with AI drafting fixes while you stay in control.
- #ansible
- #ai
- #ssh
- #privilege-escalation
- #debugging
It was a Tuesday rollout and the playbook had run clean against staging for a week. Then I pointed it at the new OpenStack compute rack and the whole thing fell over on the first task: UNREACHABLE! => SSH Error: Permission denied (publickey). Five minutes later, after I’d swapped the inventory user, it failed differently: Missing sudo password. Connection and privilege-escalation bugs are the two oldest dogs in the Ansible kennel, and they always bite at the worst time. What’s changed for me is that I no longer brute-force these alone. I paste the raw -vvvv output into an AI, let it triage the likely causes, and then I verify every suggestion against the actual host before I trust it. The model is fast at decoding cryptic output and drafting config; it is not the one with root on the box. That distinction matters.
UNREACHABLE Is a Connection Problem, Not a Become Problem
The first thing AI gets right that tired engineers get wrong: UNREACHABLE means Ansible never got a working shell. It has nothing to do with sudo. If you’re staring at a permission-denied UNREACHABLE, stop editing your become settings and look at SSH.
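The split described above can be sketched as a small filter. This is a minimal illustration, not part of Ansible: the function name classify_layer is hypothetical, and it only recognizes the error strings quoted in this article. Anything else is reported as unknown rather than guessed at.

```shell
# classify_layer: read Ansible output on stdin and name the layer that failed.
# Hypothetical helper; it matches only the strings quoted in this article.
classify_layer() {
  out=$(cat)
  case $out in
    # No working shell was established: look at SSH, not become.
    *'UNREACHABLE!'*) echo connection ;;
    # A shell exists but privilege escalation failed: look at become settings.
    *'Missing sudo password'*|*'Authentication failure'*) echo escalation ;;
    *) echo unknown ;;
  esac
}
```

One way to use it is to pipe a run straight in, for example ansible-playbook site.yml -l compute-07 -vvvv 2>&1 | classify_layer. UNREACHABLE is checked first on purpose, so a connection failure is never misfiled as an escalation problem.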
Run the task with full verbosity and grab the exact SSH command Ansible built:
ansible-playbook site.yml -l compute-07 -vvvv
The -vvvv output prints the literal ssh invocation. That’s gold for an AI prompt, because the model can read the args the same way you would. Here’s the kind of prompt I use:
I’m getting UNREACHABLE! Permission denied (publickey) from Ansible. Here is the exact SSH command from -vvvv: ssh -o ControlMaster=auto -o ControlPath=... -o User=deploy -o ConnectTimeout=10 compute-07 .... The host is reachable with ssh deploy@compute-07 from my shell. What’s the most likely difference between my manual SSH and Ansible’s, and how do I confirm it before changing anything?
A good model immediately points at the usual suspects: a different user, a missing -i identity file, an SSH agent that your interactive shell has but the Ansible process doesn’t, or a stale ControlPath socket. The verification step it should give you is to run that exact printed command yourself. If your manual ssh works but Ansible’s printed command fails, the delta is in the args, not the network.
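To see the delta in the args at a glance, it can help to list the -o options from the printed command one per line. The sketch below assumes the command has been copied out of the -vvvv output as a single line. The helper name ssh_opts is hypothetical, and it does not handle option values that contain spaces.

```shell
# ssh_opts: read an ssh command line on stdin and print each "-o Key=Value"
# option on its own line, sorted, so two invocations can be compared with diff.
# Hypothetical helper; values containing spaces are not handled.
ssh_opts() {
  grep -oE -- '-o +[^ ]+' | sed -E 's/^-o +//' | LC_ALL=C sort
}
```

Feeding it the command Ansible printed shows exactly which user, control socket, and timeout Ansible asked for. For the manual side, ssh -G deploy@compute-07 prints the effective client configuration, which is where an IdentityFile picked up from ~/.ssh/config tends to show itself.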
Nine times out of ten on a fresh fleet it’s the identity file. Pin it explicitly so there’s no ambiguity:
# ansible.cfg
[defaults]
inventory = ./inventory.ini
host_key_checking = True
remote_user = deploy
# private_key_file is read from [defaults]; under [ssh_connection] it is ignored
private_key_file = ~/.ssh/deploy_ed25519

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
pipelining = True
pipelining = True is worth calling out: it cuts the number of SSH operations per task and sidesteps a class of become quirks, but it needs requiretty disabled in /etc/sudoers on the target. AI will remind you of that prerequisite if you ask, but you still have to check the sudoers file yourself.
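A quick first check for that prerequisite is to look for an active requiretty default in the sudoers text. The sketch below is a rough filter under simplifying assumptions: the function name requiretty_enabled is hypothetical, and it does not model per-user Defaults or the rule that a later line overrides an earlier one, so treat a hit as a reason to read the file, not as a verdict.

```shell
# requiretty_enabled: read sudoers text on stdin; exit 0 if any uncommented
# "Defaults" line sets requiretty without negating it.
# Hypothetical helper; ordering and per-user Defaults are not modeled.
requiretty_enabled() {
  grep -E '^[[:space:]]*Defaults' | grep -Eq '[^!]requiretty'
}
```

On the target, something like sudo cat /etc/sudoers /etc/sudoers.d/* | requiretty_enabled && echo "requiretty is set" gives a fast answer before you turn pipelining on.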
Host Key Failures Deserve Honesty, Not host_key_checking = False
The lazy fix for Host key verification failed is to set host_key_checking = False, and AI will happily generate that line if you ask it to “make the error go away.” Don’t. That’s how you teach a fleet to accept man-in-the-middle keys silently. When I describe the situation honestly, a decent model pushes back and offers the right tool instead.
For freshly provisioned hosts where the key legitimately changed, scan and pin the new keys rather than disabling checking globally:
ssh-keygen -R compute-07
ssh-keyscan -t ed25519 compute-07 >> ~/.ssh/known_hosts
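Appending ssh-keyscan output directly still trusts whatever answered on the network. If you have the expected public key from somewhere else (the instance console log or your provisioning output, which is an assumption about your setup), you can filter the scan so only a matching key gets pinned. The helper name key_if_expected is hypothetical.

```shell
# key_if_expected EXPECTED_KEY
# stdin: ssh-keyscan output, one "host keytype base64key" entry per line.
# Prints only entries whose "keytype base64key" equals EXPECTED_KEY and
# exits non-zero when nothing matches. Hypothetical helper.
key_if_expected() {
  awk -v want="$1" '
    /^#/ { next }                                # skip comment lines
    ($2 " " $3) == want { print; found = 1 }     # keep exact matches only
    END { exit (found ? 0 : 1) }
  '
}
```

With the expected key in a variable, the pinning step becomes ssh-keyscan -t ed25519 compute-07 | key_if_expected "$EXPECTED" >> ~/.ssh/known_hosts, and a mismatch writes nothing.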
Better, in an immutable-infrastructure flow, capture host keys at provision time and distribute a managed known_hosts. If you genuinely must relax checking for a one-off bootstrap play, scope it to that play and nothing else:
- name: Bootstrap brand-new nodes
  hosts: freshly_imaged
  gather_facts: false
  vars:
    ansible_ssh_common_args: "-o StrictHostKeyChecking=accept-new"
  tasks:
    - name: Wait for SSH
      ansible.builtin.wait_for_connection:
        timeout: 120
accept-new trusts unknown hosts but still rejects a changed key, which is the honest middle ground. The broader pattern here is one I lean on constantly: AI drafts the convenient answer, and your job is to ask “what does this actually weaken?” If you want a starting library of prompts that bake in that skepticism, the prompt collection has a few aimed at security review of generated config.
ProxyJump and Bastions: Where Manual SSH and Ansible Diverge
Bastion hosts are the single biggest source of “works in my terminal, fails in Ansible” tickets I see, because your ~/.ssh/config ProxyJump lines are invisible to a CI runner that doesn’t read your personal config. The fix is to push the jump into Ansible’s own SSH args so it’s reproducible everywhere.
# inventory.ini
[compute]
compute-07 ansible_host=10.20.0.7
[compute:vars]
ansible_user=deploy
ansible_ssh_common_args='-o ProxyJump=jump@bastion.example.net:22 -o StrictHostKeyChecking=accept-new'
When this breaks, the error is often a misleading UNREACHABLE that’s really the bastion rejecting you, not the target rejecting Ansible. I’ll hand the AI both the -vvvv block and the topology — “connection goes through a single bastion, key auth on both hops” — and ask it to tell me which hop failed. The tell it should surface: if the SSH command shows the ProxyJump but dies before reaching the target’s auth, the bastion hop is the problem. Confirm by jumping manually with the exact same flag:
ssh -o ProxyJump=jump@bastion.example.net deploy@10.20.0.7 'hostname'
If that one-liner fails, no Ansible change will save you, and AI that’s worth its tokens will say so rather than inventing a become workaround.
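To tell which hop failed, it helps to probe each one separately. The sketch below only prints the two probe commands so you can read them before running anything. The function name hop_probes is hypothetical, and BatchMode=yes is there so a failing hop errors out instead of waiting at a password prompt.

```shell
# hop_probes BASTION TARGET
# Print one read-only probe per hop without executing either of them.
# Hypothetical helper; review the output, then run the lines yourself.
hop_probes() {
  bastion=$1
  target=$2
  common='-o BatchMode=yes -o ConnectTimeout=10'
  # Hop 1: can we authenticate to the bastion at all?
  echo "ssh $common $bastion true"
  # Hop 2: can we reach and authenticate to the target through the bastion?
  echo "ssh $common -o ProxyJump=$bastion $target hostname"
}
```

Calling hop_probes jump@bastion.example.net deploy@10.20.0.7 prints both commands. If the first one fails, the bastion is rejecting you and the target was never involved.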
Now the Become Half: Sudo Passwords and Methods
Once you have a working shell, privilege escalation is its own layer. Missing sudo password means your task asked for root via sudo and the account isn’t passwordless. The cleanest path on managed nodes is a NOPASSWD sudoers entry scoped to the deploy user (Ansible runs modules through a shell and Python, so the rule generally cannot be narrowed to individual commands), but if policy requires a password, supply it at runtime instead of hardcoding it:
ansible-playbook site.yml --ask-become-pass
For non-interactive CI, pull the secret from a vault rather than the inventory:
- name: Configure web tier
  hosts: web
  become: true
  become_method: sudo
  vars:
    ansible_become_pass: "{{ vault_become_password }}"
  tasks:
    - name: Ensure nginx is present
      ansible.builtin.package:
        name: nginx
        state: present
The subtler failures come from become_method. I’ve watched a playbook fail because a hardened host used doas or required su, and the default sudo method produced a vague “Authentication failure” that sent a teammate down a two-hour SSH rabbit hole. This is exactly the kind of misdirection where AI earns its keep — feed it the error plus become_method: sudo and the fact that the host has no sudo binary, and it’ll redirect you to the connection-vs-escalation split immediately:
Your error is at the escalation layer, not SSH: you already have a shell or you’d see UNREACHABLE. “Authentication failure” with become_method: sudo on a host without sudo usually means the method is wrong. Try become_method: su with become_user: root and ansible_become_pass, or doas if that’s what’s installed. Verify by running ansible host -m raw -a 'which sudo doas su' before changing the play.
That -m raw probe is the verification I always run, because raw doesn’t need Python or facts and tells you exactly which escalation tools exist on the box.
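The probe output can be turned into a become_method suggestion with a few lines of shell. This is a sketch under stated assumptions: the function name pick_become_method is hypothetical, and the preference order of sudo, then doas, then su is my own choice, not something Ansible defines. Carriage returns are stripped because raw output captured through a TTY can carry them.

```shell
# pick_become_method: stdin is the output of `which sudo doas su` from the
# target. Prints the first tool found, in an assumed preference order, and
# returns non-zero when none of them is present. Hypothetical helper.
pick_become_method() {
  found=$(tr -d '\r')
  for tool in sudo doas su; do
    # Match a line that is the tool name or a path ending in /<tool>.
    if printf '%s\n' "$found" | grep -Eq "(^|/)$tool\$"; then
      echo "$tool"
      return 0
    fi
  done
  return 1
}
```

Piping the -m raw result into it gives a value to try in become_method. It is still a hypothesis until a task actually escalates with it.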
Keep the Loop Tight, Keep Yourself in Control
My working rhythm for these bugs is the same one I described in more depth in debugging Ansible failures faster with AI: reproduce with -vvvv, paste the raw output and the topology into the model, let it split the problem into connection versus escalation, and then run the smallest possible probe — a manual ssh, an ssh-keyscan, an -m raw command — to confirm before I touch the playbook. The AI compresses the decode-and-draft phase from twenty minutes to two. It does not get to skip the probe, and neither do I.
The trap with both connection and become bugs is that the error messages lie about which layer failed, and a confident model will sometimes lie right along with them. So I treat every generated ansible.cfg line and every sudoers suggestion as a hypothesis, not a patch. Verify against the host, then commit. If you want more Ansible-specific workflows in this vein, the Ansible category collects them. Used this way, AI turns the two most frustrating classes of Ansible failure into a quick triage — and you stay the one holding root.