Rotating Ansible Vault Keys at Scale Without Downtime

A senior platform engineer left on a Friday. By the following Monday I was staring at a Slack thread that started with “hey, did anyone rotate the vault password after the offboarding?” and ended with my stomach somewhere around my shoes. The answer was no. That single password unlocked forty-something encrypted files spread across staging, prod, and a graveyard of half-decommissioned environments. The departing engineer had the password in a personal notes app, a laptop, who knows where else. We had no evidence of misuse, but “no evidence” is not “no breach.” We had to assume it was compromised and rekey everything, fast.

This is the post I wish I’d had that weekend. It is not the “what is Ansible Vault” post — that ground is already covered. This is specifically about rotating the encryption keys across many files and environments at scale, doing it without taking down deploys, and using AI to plan and script the work without ever handing it the actual secrets.

The rule that makes everything else safe

Before a single command, the operating principle: AI plans and scripts the rotation. Humans hold the keys.

I treat AI like a sharp junior engineer who just joined the team. Fast, tireless, great at writing the loop that finds every encrypted file and generating the bash to rekey them — and absolutely not someone I hand the production vault password to on day one. The AI sees the plan: the file inventory, the command shapes, the CI wiring, the rollback steps. It never sees the old password, the new password, or a single decrypted secret value.

Concretely, that means I never paste decrypted output into a chat window, never put a real password in a prompt, and never let an AI agent execute against prod with the vault password in its environment. The AI writes the script; a human runs it, locally, after reading every line. If you want a second set of eyes on the generated diff before it touches anything, that is exactly what a tool like the code review dashboard is for.

Find every encrypted file first

You cannot rotate what you cannot see. Ansible Vault files all start with the same header line, whether they are whole-file vaults or embedded encrypted strings, so a recursive grep is your inventory:

# Whole-file vaults: the first line is the magic header
grep -rl '\$ANSIBLE_VAULT' . --include='*.yml' --include='*.yaml'

# Inline encrypted vars (!vault tagged) live inside otherwise-plaintext files
grep -rln '!vault |' . --include='*.yml' --include='*.yaml'

The distinction matters enormously for rotation. A whole-file vault is encrypted end to end and rekeys cleanly with one command. An inline encrypted variable (my_secret: !vault | $ANSIBLE_VAULT...) lives inside a file full of plaintext, and ansible-vault rekey will refuse it. Those have to be decrypted and re-encrypted value by value with ansible-vault encrypt_string — which means a human and the password, never the AI.

This is a perfect AI task: “Here’s the output of that grep across the repo. Build me a manifest table — file path, environment (infer from path), whole-file vs inline, and which vault-id label it should use.” It produces the inventory; you sanity-check it. No secrets involved, just paths and structure.

Rekey whole-file vaults in a batch

For a single file, rekeying is one command — it decrypts with the old password and re-encrypts with the new one in place:

ansible-vault rekey \
  --vault-id @prompt \
  --new-vault-id @prompt \
  group_vars/prod/secrets.yml

At scale you do not want forty interactive prompts. Point the old and new passwords at files (mode 0600, never committed) and loop over the inventory:

# old.pass and new.pass exist only on the operator's machine, 0600, gitignored
while IFS= read -r f; do
  echo "Rekeying ${f}"
  ansible-vault rekey \
    --vault-id "old@./old.pass" \
    --new-vault-id "prod@./new.pass" \
    "${f}" || { echo "FAILED: ${f}"; break; }
done < whole_file_vaults.txt

That loop is exactly the kind of thing I have AI generate — the error handling, the break on first failure so you do not blow through forty files with a bad password, the logging. I review it, then I create old.pass and new.pass and run it. The AI authored the script; it never learned what is inside either file.

Pro Tip: Always run your inventory grep again after the loop, this time looking for files still readable with the OLD password. If ansible-vault view --vault-id old@./old.pass <file> still succeeds on anything, that file was missed. Zero successes is your proof of completion.

Use vault IDs so multiple keys can coexist

The reason you can rotate without downtime is vault IDs. A vault ID is a labelled password source (--vault-id label@source), and Ansible can hold several at once. During a rotation window you keep both the old and new key available, so playbooks keep decrypting whether a given file has been rekeyed yet or not:

ansible-playbook site.yml \
  --vault-id prod_old@~/.vault/prod_old.pass \
  --vault-id prod_new@~/.vault/prod_new.pass

Ansible tries each provided ID until one decrypts the file. That overlap is the trick: you rekey files in batches, deploys never break mid-rotation, and once everything is on prod_new you drop the prod_old ID entirely. It is a graceful cutover instead of a flag day.

When you encrypt with a label, that label gets baked into the file header, so future operators know which key any file expects:

ansible-vault encrypt --vault-id prod_new@~/.vault/prod_new.pass group_vars/prod/new_secrets.yml

If you run distinct keys per environment — and you should, so a staging leak never touches prod — your config makes the mapping explicit:

# ansible.cfg
[defaults]
vault_identity_list = staging@~/.vault/staging.pass, prod@~/.vault/prod.pass

Wire the password into CI without leaking it

CI cannot answer an interactive prompt, so the vault password reaches it one of two ways: a file, or an environment variable that a tiny script echoes back.

# vault-pass-client.sh — Ansible runs this and reads the password from stdout
#!/usr/bin/env bash
echo "${ANSIBLE_VAULT_PASSWORD}"

# GitHub Actions step
- name: Deploy
  env:
    ANSIBLE_VAULT_PASSWORD: ${{ secrets.PROD_VAULT_PASSWORD }}
  run: |
    chmod +x vault-pass-client.sh
    ansible-playbook site.yml --vault-password-file ./vault-pass-client.sh

During rotation, the only change here is swapping the value in your CI secret store from the old password to the new one — after every file is rekeyed. Two notes that have burned me: an executable --vault-password-file must print only the password and nothing else (no debug echo, no trailing banner), and a plain-text password file must be 0600 and absolutely outside the repo. The AI is great at writing the client script and the workflow YAML; it never gets the secret, because the secret lives in GitHub’s encrypted store and gets injected at runtime, long after the prompt is forgotten.

Pro Tip: Rotate the CI-stored password as its own discrete step, and trigger one throwaway deploy immediately after to confirm the new key decrypts in the pipeline. If that smoke deploy fails, you’ve caught it before a real release does — and you still have the old key around to roll back.

Dry-run everything, then audit

No rekey loop touches prod until it has run against a throwaway copy first. Snapshot the encrypted files, point the loop at the copy, and confirm both the rekey and a subsequent --check deploy succeed:

cp -r group_vars /tmp/vault-dryrun && cd /tmp/vault-dryrun
# run the rekey loop here against the copies
ansible-playbook site.yml --check --diff \
  --vault-id prod_new@~/.vault/prod_new.pass

--check runs the playbook in no-op mode; if the new key cannot decrypt something, it fails here, safely, on a copy. Only then do I run for real.

For the audit trail — the thing the security team will actually ask for — I capture what changed, never what the values are:

git log --oneline -- group_vars/ | head -20
git diff HEAD~1 -- group_vars/prod/secrets.yml | head -5   # ciphertext only

A vault diff shows changed ciphertext, which proves a file was re-encrypted without revealing a single plaintext byte. That is the perfect artifact: provable rotation, zero secret exposure. I have AI draft the rotation runbook and the post-incident summary from this metadata — file count, timestamps, environments, who ran each batch — because none of that is sensitive. The same boundary applies if you lean on a prompt workspace to standardize these runbooks: feed it the structure, never the secrets.

Where AI actually earns its keep

Across this whole rotation, AI did the tedious, error-prone authoring: the inventory grep, the batched rekey loop with proper failure handling, the CI client script, the dry-run harness, and the runbook. What it never did was hold a key or see a value. If you want patterns for scoping these requests safely, the prompts library and the curated prompt packs are a good starting point, and the broader IaC category collects the rest of this playbook. Whichever assistant you reach for — Claude is my default for this kind of scripting — the boundary is the same: it drafts, you decide, and the password never leaves your hands.

Conclusion

We finished that emergency rotation in an afternoon, not because we panicked faster but because the work decomposed cleanly: inventory, batch rekey with overlapping vault IDs, CI swap, dry-run, audit. AI compressed every one of those steps from hours of careful bash into minutes of review. The reason I trust it is precisely because it never touches the keys. AI sees the plan. You hold the secret. That split is not a limitation to work around — it is the entire design, and it is what lets you rotate at scale without ever losing sleep over what the model might have seen.