Pull-Based Config Management with ansible-pull

The page that woke me up at 2 a.m. wasn’t an outage. It was our control node quietly melting under its own ambition. We’d grown from forty long-lived VMs to a fleet of a few thousand autoscaling instances that spun up, did work, and vanished within the hour. Our nightly ansible-playbook -i inventory site.yml run — once a tidy six minutes — now took longer than some nodes existed. By the time the push reached a host, that host had already been recycled. We were configuring ghosts.

That night I stopped trying to make the push model faster and started asking a different question: what if every node configured itself?

Push vs Pull: A Quick Mental Model

Classic Ansible is push. A control node holds the inventory, opens SSH to each target, and shoves configuration outward. It’s wonderfully simple for a stable fleet you can enumerate. But it has a hard ceiling: the control node is a bottleneck, it needs network reachability to every host, and it has to know the host exists before it can configure it. Ephemeral and edge nodes break all three assumptions.

Pull inverts the flow. Each node periodically reaches out to a git repo, checks out the playbooks, and runs ansible-playbook against localhost. No central scheduler. No inbound SSH. No inventory that has to keep up with autoscaling. The node is responsible for its own state, and the only thing it needs to reach is a git remote — which scales far more gracefully than N SSH sessions.

That’s ansible-pull. It ships in the box with Ansible and does exactly this.

When Pull Actually Wins

Pull isn’t universally better — it’s better for specific shapes of infrastructure:

Large fleets where opening thousands of SSH sessions from one box is the bottleneck.
Autoscaling nodes that must be fully configured by the time they pass a health check, with no human or scheduler in the loop.
Edge and intermittently-connected hosts behind NAT or flaky links, where inbound SSH from a central controller isn’t realistic.
No-control-node mandates — environments where standing up and securing an AWX/Tower box is more operational weight than you want.

If you have a stable, enumerable fleet and you value the orchestration features of push (ordered rollouts, gated batches, serial), stay with push. Pull trades central coordination for autonomy.

The local.yml Convention

ansible-pull has one strong convention: by default it looks for a playbook named local.yml at the root of the checked-out repo. Honor it and the command line stays short.

---
- name: "Base configuration (runs on every node)"
  hosts: localhost
  connection: local
  become: true
  gather_facts: true

  vars:
    managed_packages:
      - "chrony"
      - "rsyslog"
      - "curl"

  tasks:
    - name: "Install baseline packages"
      ansible.builtin.package:
        name: "{{ managed_packages }}"
        state: present

    - name: "Ensure chrony is running"
      ansible.builtin.service:
        name: "chronyd"
        state: started
        enabled: true

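    - name: "Drop a managed marker file (written once, so later runs report no change)"
      ansible.builtin.copy:
        dest: "/etc/ansible-pull-managed"
        content: "Managed by ansible-pull since {{ ansible_date_time.iso8601 }}\n"
        # Without force: false the timestamp would rewrite the file on every run.
        force: false
        owner: "root"
        group: "root"
        mode: "0644"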

Note connection: local and hosts: localhost: the node is talking to itself. Tasks execute directly as local processes on the host, with no SSH and no network hop involved.

Host-Specific Playbooks

You rarely want every node to be identical. ansible-pull can key playbooks off the hostname: when no playbook is named on the command line, it tries <fqdn>.yml and <short-hostname>.yml before falling back to local.yml. That suits named servers, but autoscaled hostnames are effectively random, so the cleanest approach I’ve landed on is an import_playbook driven by a role name the node receives through its environment:

---
- name: "Apply baseline everywhere"
  import_playbook: "baseline.yml"

- name: "Apply role-specific config"
  import_playbook: "roles-{{ lookup('ansible.builtin.env', 'NODE_ROLE') | default('generic', true) }}.yml"

Here NODE_ROLE comes from cloud-init user-data (more on that below), so a web node pulls roles-web.yml and a worker node pulls roles-worker.yml from the same repo.
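
Because a typo in NODE_ROLE only surfaces when the import fails, it can help to resolve the role playbook name outside Ansible first. The following is a minimal shell sketch, not part of ansible-pull: the function names are hypothetical, and it assumes role playbooks live at the root of the checkout.

```shell
#!/bin/sh
# Mirror of the Jinja expression in local.yml: an unset OR empty role
# falls back to "generic", which is what default('generic', true) does.
role_playbook_name() {
    role="${1:-generic}"
    printf 'roles-%s.yml\n' "$role"
}

# Succeeds only if the role playbook is present in the checkout.
# repo_dir is whatever directory you pass to --directory.
role_playbook_exists() {
    repo_dir="$1"
    role="${2:-}"
    [ -f "$repo_dir/$(role_playbook_name "$role")" ]
}

# Example (not executed here):
#   role_playbook_exists /var/lib/ansible-pull/repo "$NODE_ROLE" \
#     || echo "no playbook for role '$NODE_ROLE'" >&2
```

A check like this fits naturally in CI for the config repo, run once per role name you hand out in user-data.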

The Git Checkout Flow and --only-if-changed

A bare invocation looks like this:

ansible-pull \
  --url "https://git.example.com/infra/fleet-config.git" \
  --checkout "main" \
  --directory "/var/lib/ansible-pull/repo" \
  --only-if-changed \
  local.yml

ansible-pull clones (or fast-forwards) the repo into --directory, then runs the playbook. The flag that matters most for fleet hygiene is --only-if-changed: it skips the playbook run entirely if the git checkout produced no new commits. Without it, every node re-runs the full playbook on every tick — usually harmless thanks to idempotency, but a needless load spike across thousands of hosts and a lot of log noise. With it, nodes stay quiet until you actually push a change.
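
To make the skip logic concrete, here is one tick written out by hand in shell. This is a conceptual model only: ansible-pull makes the decision internally, and the function names, the use of a plain SHA comparison, and the assumption that the ref is a branch on origin are all mine.

```shell
#!/bin/sh
# Succeeds (exit 0) when the checkout moved to a different commit.
repo_changed() {
    before_sha="$1"
    after_sha="$2"
    [ "$before_sha" != "$after_sha" ]
}

# One tick, assuming repo_dir is an existing clone and ref is a branch
# on origin. Defined for illustration; nothing calls it here.
pull_tick() {
    repo_dir="$1"
    ref="$2"
    before="$(git -C "$repo_dir" rev-parse HEAD)" || return 1
    git -C "$repo_dir" fetch --quiet origin "$ref" || return 1
    git -C "$repo_dir" merge --quiet --ff-only FETCH_HEAD || return 1
    after="$(git -C "$repo_dir" rev-parse HEAD)" || return 1
    if repo_changed "$before" "$after"; then
        (cd "$repo_dir" && ansible-playbook -i localhost, -c local local.yml)
    else
        echo "no new commits; skipping playbook run"
    fi
}
```

One consequence worth keeping in mind: a node whose checkout is unchanged does not re-run the playbook, so local drift between commits is not corrected by this flag alone.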

Pro Tip: pin --checkout to a branch or tag that you move deliberately, not a fast-moving main. Merge to main, soak on a canary branch, then move the prod tag. A pull fleet rolls out as fast as nodes tick, so the git ref is your release gate.
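
If the release gate is a git ref, promotion is just a push. The sketch below builds the refspec for that push under one assumption that differs from the tip above: prod is a branch rather than a tag, so a non-fast-forward promotion is rejected by default. The helper name is hypothetical.

```shell
#!/bin/sh
# Build the refspec that moves a release branch to a specific commit.
# Rejects anything that is not a 7-40 character lowercase hex SHA, so a
# branch name typed by mistake cannot be promoted.
promotion_refspec() {
    sha="$1"
    branch="${2:-prod}"
    case "$sha" in
        ""|*[!0-9a-f]*) return 1 ;;
    esac
    len=${#sha}
    [ "$len" -ge 7 ] && [ "$len" -le 40 ] || return 1
    printf '%s:refs/heads/%s\n' "$sha" "$branch"
}

# Usage, from a clone of the config repo after the canary has soaked:
#   git push origin "$(promotion_refspec "$(git rev-parse canary)")"
```

Requiring an explicit SHA means the thing you promote is exactly the commit the canary nodes ran, not whatever the canary branch points at by the time you push.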

Scheduling: cron vs systemd Timer

You need something to run ansible-pull on a schedule. Cron works, but on modern systems a systemd timer gives you jitter, logging through the journal, and clean status inspection.

/etc/systemd/system/ansible-pull.service:

[Unit]
Description=Run ansible-pull to converge local config
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
# Default role; overridden by role.env when that file exists.
Environment="NODE_ROLE=generic"
EnvironmentFile=-/etc/ansible-pull/role.env
ExecStart=/usr/bin/ansible-pull \
  --url "https://git.example.com/infra/fleet-config.git" \
  --checkout "prod" \
  --directory "/var/lib/ansible-pull/repo" \
  --only-if-changed \
  local.yml
TimeoutStartSec=600

/etc/systemd/system/ansible-pull.timer:

[Unit]
Description=Periodic ansible-pull convergence

[Timer]
OnBootSec=120
OnUnitActiveSec=900
RandomizedDelaySec=300
# Persistent= is omitted: it only affects OnCalendar= timers.

[Install]
WantedBy=timers.target

Enable it with:

sudo systemctl daemon-reload
sudo systemctl enable --now ansible-pull.timer
systemctl list-timers ansible-pull.timer

RandomizedDelaySec is doing quiet heavy lifting here: it spreads thousands of nodes across a five-minute window so they don’t all hammer the git remote at the same instant. OnBootSec guarantees a convergence shortly after boot, and OnUnitActiveSec keeps it ticking every fifteen minutes after that.
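
A back-of-envelope calculation shows what the jitter buys. The sketch assumes the delay is spread uniformly across the window and uses a hypothetical fleet of 3,000 nodes; both helper names are made up for illustration.

```shell
#!/bin/sh
# If `nodes` timers would otherwise fire at the same instant and each is
# delayed by a uniform random 0..window_sec, the git remote sees roughly
# nodes / window_sec fetches per second (rounded up here).
fetches_per_second() {
    nodes="$1"
    window_sec="$2"
    [ "$window_sec" -gt 0 ] || return 1
    echo $(( (nodes + window_sec - 1) / window_sec ))
}

# Rough upper bound on the gap between two runs on one node: the interval
# plus a full randomized delay. Ignores the run's own duration and any
# timer coalescing (AccuracySec).
max_gap_sec() {
    interval_sec="$1"
    window_sec="$2"
    echo $(( interval_sec + window_sec ))
}

# fetches_per_second 3000 300   -> 10
# max_gap_sec 900 300           -> 1200
```

The second number is a useful input for alerting: a staleness threshold tighter than about twenty minutes would fire on healthy nodes with these timer settings.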

Bootstrapping with cloud-init

The whole point is that a brand-new node configures itself with zero manual touch. cloud-init user-data is where you plant the seed: install Ansible, write the timer, and kick the first run.

#cloud-config
package_update: true
packages:
  - "ansible"
  - "git"

write_files:
  - path: "/etc/ansible-pull/role.env"
    permissions: "0644"
    content: |
      NODE_ROLE=web

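runcmd:
  # Load NODE_ROLE from role.env so the first run selects the right role playbook.
  - |
    set -a; . /etc/ansible-pull/role.env; set +a
    ansible-pull \
      --url "https://git.example.com/infra/fleet-config.git" \
      --checkout "prod" \
      --directory "/var/lib/ansible-pull/repo" \
      local.yml
  - ["systemctl", "enable", "--now", "ansible-pull.timer"]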

The first ansible-pull runs synchronously during boot (so the node is configured before it joins the load balancer pool), and the timer takes over for ongoing convergence. The local.yml itself can lay down the actual .service and .timer files, so cloud-init only has to bootstrap the very first pull.

Idempotency Is Non-Negotiable

With push, an awkward task that re-does work each run is a minor annoyance you’ll catch during a manual run. With pull, that same task runs unattended on every node every fifteen minutes, forever. A non-idempotent command that restarts a service “just in case” becomes a fleet-wide flap.

Treat idempotency as a hard requirement:

Prefer real modules (ansible.builtin.package, template, service) over command/shell.
When you must shell out, gate it with creates:, removes:, or a changed_when you actually reason about.
Run check-mode before every rollout: ansible-pull ... --check --diff local.yml previews what would change without touching the node.
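
One way to enforce the rule on a canary is to run the playbook twice and require that the second run changes nothing. The sketch below parses the PLAY RECAP line for localhost. It calls ansible-playbook directly rather than ansible-pull, assumes it is run as root from the root of the checkout, and both function names are hypothetical.

```shell
#!/bin/sh
# Read ansible-playbook output on stdin and print the changed= count
# from the last recap line for localhost.
recap_changed_count() {
    sed -n 's/^localhost[[:space:]]*:.*[[:space:]]changed=\([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Succeeds when a second consecutive run reports changed=0.
# The first run converges the node; the second must be a no-op.
second_run_is_clean() {
    playbook="${1:-local.yml}"
    ansible-playbook -i localhost, -c local "$playbook" >/dev/null || return 1
    changed="$(ansible-playbook -i localhost, -c local "$playbook" | recap_changed_count)"
    [ "$changed" = "0" ]
}

# Example (not executed here):
#   second_run_is_clean local.yml || echo "playbook is not idempotent" >&2
```

Any task that reports changed on the second pass is a candidate for a real module, a creates: guard, or a more carefully reasoned changed_when.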

Observability: Knowing It Actually Ran

Pull’s weakness is the flip side of its strength: there’s no central run that tells you the fleet converged. You have to make nodes report back. A few patterns that work:

Callback plugins that POST run results to a webhook or a metrics endpoint at the end of each play.
A final task that pushes a heartbeat (last-converged timestamp, git SHA, changed-task count) to a time-series database or object store.
The systemd journal plus systemctl is-failed ansible-pull.service, scraped by your node-level agent.

    - name: "Report convergence result"
      ansible.builtin.uri:
        url: "https://telemetry.example.com/ansible-pull/report"
        method: "POST"
        body_format: "json"
        body:
          host: "{{ ansible_hostname }}"
          git_sha: "{{ lookup('ansible.builtin.pipe', 'git -C /var/lib/ansible-pull/repo rev-parse HEAD') }}"
          status: "ok"
      delegate_to: localhost
      changed_when: false

Feed those reports into your alerting so a node that hasn’t converged in an hour shows up the same way a failed deploy would. If you want a worked example of turning fleet signals into actionable alerts, our monitoring alerts workspace walks through the deterministic rule scaffolding, and the incident response dashboard covers what to do when a convergence storm goes sideways.
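
The staleness rule itself is small enough to sketch. The example assumes your telemetry store can export lines of "host last-converged-epoch", which is a made-up format for illustration, and uses the one-hour threshold mentioned above as the default; the function names are hypothetical.

```shell
#!/bin/sh
# Succeeds (exit 0) when the last reported convergence is older than
# max_age_sec. Timestamps are Unix epoch seconds.
is_stale() {
    last_converged="$1"
    now="$2"
    max_age_sec="${3:-3600}"
    [ $(( now - last_converged )) -gt "$max_age_sec" ]
}

# Read "host epoch" pairs on stdin and print the hosts that are stale.
stale_hosts() {
    now="$1"
    max_age_sec="${2:-3600}"
    while read -r host last; do
        [ -n "$host" ] || continue
        if is_stale "$last" "$now" "$max_age_sec"; then
            printf '%s\n' "$host"
        fi
    done
}

# Example (not executed here):
#   stale_hosts "$(date +%s)" < heartbeats.txt
```

For ephemeral fleets, filter the input to instances that still exist first, otherwise every recycled node eventually shows up as stale.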

Where AI Fits — As a Fast Junior Engineer

The most tedious part of standing this up is the boilerplate: a correct local.yml, a systemd unit that won’t silently fail on network-online.target, and a cloud-init block with the quoting exactly right. This is precisely where an AI assistant earns its keep. Hand it your fleet’s package list and node roles and it will draft the playbook, the unit, and the timer in seconds — the scaffolding I used to copy-paste from old repos.

But treat it like a fast junior engineer, not an oracle:

A human reviews every change. Generated YAML can parse cleanly and still do the wrong thing: wrong become, a missing enabled: true, an over-eager shell task with no creates:.
Always run check-mode/dry-run. --check --diff on a canary before the ref moves to prod. The AI cannot tell you what its playbook does to your actual hosts; only a dry run can.
Never hand the AI your vault keys. Let it scaffold the structure of ansible-vault-encrypted vars, but the decryption password and any real secrets stay out of every prompt, log, and context window. Full stop.

A good loop is to draft in a prompt workspace, then push the result through a code review pass before it touches a node. If you’d rather start from vetted prompts, the IaC category and our prompt packs have ansible-pull scaffolding prompts ready to adapt.

Trade-offs vs Push (AWX/Tower)

Pull is not a free lunch. AWX/Tower give you a UI, RBAC, scheduled job templates, surveys, and centralized run history out of the box — things you have to assemble yourself in a pull world. Push also makes ordered, gated rollouts trivial (serial: 10%, batches, pauses for approval); pull rolls out as fast as nodes happen to tick, so your git ref discipline becomes your only safety rail. And debugging a misbehaving node is a little more involved when there’s no central run log to read.

The honest summary: choose pull when autonomy and scale matter more than central orchestration, and push when you need a human-in-the-loop conductor for every rollout. Plenty of mature shops run both — push for the stable control plane, pull for the ephemeral fleet.

Conclusion

That 2 a.m. page never came back, because there was no longer a single control node to overload. Each instance now wakes up, pulls its config, converges, reports home, and gets recycled without anyone watching. ansible-pull didn’t make Ansible faster — it changed who’s responsible for the work. Let AI sweep up the boilerplate, keep a human on every diff, dry-run before every rollout, and let your nodes configure themselves.
