Advanced Cloud-init Recipes for Production Server Bootstrapping

Everybody’s first cloud-init config installs nginx and writes a banner. That’s the tutorial. Production cloud-init is a different animal: ordering matters, modules run in specific boot stages, templating pulls in instance metadata, and when it breaks at scale you need to know exactly where to look. This is the stuff that comes after the hello-world — the recipes I actually use to bootstrap fleets.

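How cloud-init really runs: the five boot stages

You can’t debug cloud-init without knowing its stages, because when a module runs determines what’s available to it:

Detect (generator): decides whether cloud-init runs at all this boot and which datasource to use.
Local: finds local datasources and applies network configuration. The network is not up yet.
Network (init): the network is up; user-data is fetched and processed, and early modules such as bootcmd and write_files run.
Config: most configuration modules run here, and runcmd is turned into a script for later.
Final: runs late and executes package installs, the runcmd script, and user scripts.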

The practical consequence: bootcmd runs early, in the Network stage, on every boot, while runcmd runs once per instance in the Final stage. Put work that depends on installed packages, running services, or fully applied config in runcmd; put “this must be set before most other modules run” work in bootcmd. Mixing them up is the source of half the “why did my script run twice / run too early” confusion.
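Which stage a module belongs to is set by three lists in the image’s main config, normally /etc/cloud/cloud.cfg: cloud_init_modules (the Network stage), cloud_config_modules, and cloud_final_modules. A minimal sketch for printing one of those lists follows. The function name stage_modules is made up for this example, and module spellings vary by cloud-init version (write-files in older releases, write_files in newer ones), so treat the output as a description of your image rather than a universal order.

```shell
#!/bin/sh
# Print the modules listed under one stage key of a cloud.cfg-style file.
# usage: stage_modules <key> [file]
#   key: cloud_init_modules | cloud_config_modules | cloud_final_modules
stage_modules() {
    key=$1
    cfg=${2:-/etc/cloud/cloud.cfg}
    awk -v key="$key" '
        $0 ~ "^" key ":"     { in_list = 1; next }  # start of the requested list
        in_list && /^[^ #-]/ { in_list = 0 }        # next top-level key ends it
        in_list && /^[ ]*-/  {
            sub(/^[ ]*-[ ]*/, "")                   # drop the list dash
            sub(/[ ]*#.*$/, "")                     # drop trailing comments
            print
        }
    ' "$cfg"
}
```

On a booted instance, stage_modules cloud_init_modules | grep -n bootcmd shows where bootcmd sits relative to the other early modules.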

write_files: configuration before services start

write_files puts files on disk early in boot, before packages are installed and before runcmd starts anything. It handles permissions, ownership, base64, and gzip, which is far cleaner than echoing into files from runcmd.

#cloud-config
write_files:
  - path: /etc/app/config.yaml
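    # The app user must already exist when this runs; if this same user-data
    # creates it, add "defer: true" so the file is written in the final stage.
    owner: app:app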
    permissions: '0640'
    content: |
      server:
        port: 8080
        log_level: info
  - path: /etc/sysctl.d/99-tuning.conf
    permissions: '0644'
    content: |
      net.core.somaxconn = 65535
      vm.swappiness = 10
  - path: /opt/scripts/healthcheck.sh
    permissions: '0755'
    encoding: b64
    content: IyEvYmluL2Jhc2gKY3VybCAtZiBsb2NhbGhvc3Q6ODA4MC9oZWFsdGgK

Files exist before runcmd runs, so a service you start there finds its config already in place. This ordering guarantee is why write_files beats scripting file creation.
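The base64 string in the healthcheck entry has to come from somewhere. A small sketch of generating such an entry from a local file follows; emit_write_file is a hypothetical helper name, and it assumes a base64 command is available on the machine that builds the user-data.

```shell
#!/bin/sh
# Emit a write_files list entry (YAML) for a local file, base64-encoded.
# usage: emit_write_file <local-file> <path-on-instance> [mode]
emit_write_file() {
    src=$1
    dest=$2
    mode=${3:-0644}
    printf '  - path: %s\n' "$dest"
    printf "    permissions: '%s'\n" "$mode"
    printf '    encoding: b64\n'
    # Strip newlines so wrapped base64 output still fits on one YAML line.
    printf '    content: %s\n' "$(base64 < "$src" | tr -d '\n')"
}
```

Running emit_write_file healthcheck.sh /opt/scripts/healthcheck.sh 0755 and pasting the result under write_files: keeps the script itself in version control as a readable file.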

Multi-part user-data: mixing config and scripts

User-data can be a MIME multi-part archive combining a #cloud-config block, a shell script, and a boot hook. This is how you keep declarative config declarative while still running an arbitrary script:

Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-config

#cloud-config
packages:
  - docker.io
  - jq

--==BOUNDARY==
Content-Type: text/x-shellscript

#!/bin/bash
systemctl enable --now docker
docker run -d --restart=always myapp:latest
--==BOUNDARY==--

Build these with cloud-init devel make-mime rather than hand-crafting boundaries. The packages module handles the install declaratively; the script handles the imperative bits that don’t fit a module.
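As a rough sketch of that make-mime step, the wrapper below assembles two source files into one archive. The function name and the file names in the usage line are placeholders, and it assumes the cloud-init CLI is installed on the build machine.

```shell
#!/bin/sh
# Assemble multi-part user-data from a cloud-config file and a shell script.
# usage: build_user_data <cloud-config.yaml> <script.sh> <output-file>
build_user_data() {
    config=$1
    script=$2
    out=$3
    # Each -a argument is <file>:<content-type>, where the type is the part
    # after "text/" in the MIME header (cloud-config, x-shellscript, ...).
    cloud-init devel make-mime \
        -a "${config}:cloud-config" \
        -a "${script}:x-shellscript" \
        > "$out"
}
```

For example, build_user_data cloud-config.yaml setup.sh user-data.mime writes an archive equivalent to the hand-written one above, with boundaries generated for you.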

Jinja templating against instance metadata

Cloud-init can template user-data with Jinja, pulling in instance metadata so one config adapts per host. Make ## template: jinja the first line of the file, above #cloud-config:

## template: jinja
#cloud-config
hostname: web-{{ v1.instance_id }}
write_files:
  - path: /etc/app/node.conf
    content: |
      region={{ v1.region }}
      availability_zone={{ ds.meta_data.placement.availability_zone }}
      local_ipv4={{ ds.meta_data.local_ipv4 }}

Now the same user-data produces correctly-named, region-aware hosts across your fleet without a templating step in your provisioning tooling. Available variables come from v1 (cloud-agnostic) and ds (datasource-specific) — inspect them with cloud-init query --all.
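Templates are easiest to debug on a booted instance of the target image, where the real metadata is available. The sketch below first checks that the header is on line 1 and then renders the file with cloud-init devel render, which fills in variables from the instance’s own metadata. Both function names are invented for this example, and rendering may need root if the template touches sensitive keys.

```shell
#!/bin/sh
# Return 0 if the file's first line is the Jinja header cloud-init looks for.
has_jinja_header() {
    [ "$(head -n 1 "$1")" = "## template: jinja" ]
}

# Render a template against this instance's metadata.
# Must run on a booted instance; prints the rendered user-data to stdout.
preview_template() {
    if ! has_jinja_header "$1"; then
        echo "missing '## template: jinja' on line 1 of $1" >&2
        return 1
    fi
    cloud-init devel render "$1"
}
```

To look up a single value instead of rendering a whole file, cloud-init query v1.region prints just that key.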

Idempotency and the per-instance vs per-boot trap

Cloud-init tracks whether it has run for a given instance. runcmd runs once per instance; reboot the same instance and it won’t re-run. But if you re-image or clone, the instance ID changes and it runs again. For things that genuinely must run every boot, use bootcmd — and make those commands idempotent, because they will run repeatedly.

The classic bug: putting a one-time database seed in bootcmd. It re-seeds on every reboot. Put one-time work in runcmd, per-boot work in bootcmd, and guard anything risky with a check.
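One way to “guard anything risky with a check” is a marker file. The helper below is a minimal sketch of that idea, not something cloud-init ships; run_once and the paths in the usage line are hypothetical.

```shell
#!/bin/sh
# Run a command only if its marker file is absent.
# The marker is created only when the command succeeds, so a failed
# attempt is retried on the next call.
# usage: run_once <marker-file> <command> [args...]
run_once() {
    marker=$1
    shift
    if [ -e "$marker" ]; then
        return 0            # already done earlier; nothing to do
    fi
    "$@" && : > "$marker"   # run the command, then record success
}
```

A call such as run_once /var/lib/myapp/.seeded /opt/scripts/seed-db.sh is then safe even if it ends up in bootcmd. Cloud-init also ships a cloud-init-per helper that does similar bookkeeping with once, instance, and always frequencies, for example cloud-init-per instance seed-db /opt/scripts/seed-db.sh.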

Debugging without burning an afternoon

When a bootstrap fails, go straight to these:

# Did cloud-init finish, and how long did each stage take?
cloud-init analyze show

# Status and any fatal error
cloud-init status --long

# The logs — this is where the real error is
sudo cat /var/log/cloud-init.log
sudo cat /var/log/cloud-init-output.log   # stdout/stderr of your scripts

# Validate config syntax BEFORE you boot 50 instances
cloud-init schema --config-file user-data.yaml

cloud-init schema is the one most people skip and shouldn’t — it catches malformed YAML and unknown keys before you waste a fleet rollout. cloud-init-output.log is where your script’s actual error message lives; the main log is mostly module orchestration.
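For test instances and CI, the same commands can be wrapped so a failed bootstrap prints its own evidence. This is a sketch under two assumptions: the function name is invented, and any non-zero exit from cloud-init status --wait is treated as a failure. Some newer releases also return a non-zero code for recoverable warnings, so check how your version behaves before gating a pipeline on it.

```shell
#!/bin/sh
# Block until cloud-init finishes; on failure, print what is needed to debug.
# OUTPUT_LOG can be overridden; the default is the script output log.
OUTPUT_LOG=${OUTPUT_LOG:-/var/log/cloud-init-output.log}

wait_for_cloud_init() {
    # --wait blocks until cloud-init has finished running.
    if cloud-init status --wait >/dev/null 2>&1; then
        echo "cloud-init: done"
        return 0
    fi
    echo "cloud-init: did not finish cleanly" >&2
    cloud-init status --long >&2
    # The last lines of the output log usually hold the failing command.
    if [ -r "$OUTPUT_LOG" ]; then
        tail -n 40 "$OUTPUT_LOG" >&2
    fi
    return 1
}
```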

Where AI fits

Cloud-init’s module surface is large and the docs are scattered. I use an assistant to translate “bootstrap a Docker host with these sysctls, this config file, and a healthcheck” into a correct multi-part config, and to read a cloud-init.log traceback and tell me which module failed and why. Keep a few cloud-init prompts for generating write_files blocks and multi-part archives, then always run cloud-init schema on the result.

Production checklist

Validate with cloud-init schema in CI before any config ships.
Keep user-data small. Most clouds cap it (16 KB on AWS). Have a small script pull large payloads from object storage instead of inlining them.
Make bootcmd idempotent — it runs every boot.
Don’t put secrets in user-data. Any process on the instance can read it from the metadata service, and cloud-init keeps a copy on disk under /var/lib/cloud. Fetch secrets at boot from a secrets manager instead.
Test on the actual image. Cloud-init behavior varies by distro and datasource; a config that works on Ubuntu may not on Amazon Linux.
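The size limit is easy to enforce in the same CI job that runs cloud-init schema. A minimal sketch, assuming the 16 KB AWS figure from the checklist (16384 bytes, measured on the raw file) and a made-up function name:

```shell
#!/bin/sh
# Default limit: 16 KB, the AWS cap mentioned in the checklist.
MAX_USER_DATA_BYTES=16384

# Return 0 if the file is within the limit, 1 otherwise.
# usage: user_data_fits <file> [max-bytes]
user_data_fits() {
    # Arithmetic expansion strips the padding some wc versions print.
    size=$(( $(wc -c < "$1") ))
    limit=${2:-$MAX_USER_DATA_BYTES}
    if [ "$size" -gt "$limit" ]; then
        echo "$1 is $size bytes, limit is $limit" >&2
        return 1
    fi
    return 0
}
```

Calling user_data_fits user-data.yaml || exit 1 in the pipeline fails the build before an oversized config reaches a launch template. Pass a different second argument for clouds with other limits.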

Cloud-init is the most boring-looking and most leverage-dense tool in the bootstrap stack. Master the stages, write_files, multi-part user-data, and the debug log paths, and first-boot provisioning stops being a mystery. For the broader picture, see our Infrastructure as Code guides.

Generated cloud-init configs are assistive, not authoritative. Validate with cloud-init schema and boot a single test instance before any fleet rollout.
