Automating OpenStack with the Python SDK and CLI

The fastest way to tell how mature an OpenStack operation is: does the team automate against the API, or click through Horizon? Horizon is fine for inspection and one-offs, but every repeatable operation (provisioning fleets, auditing quotas, cleaning up orphans) belongs in code. OpenStack has a genuinely good automation story through the unified CLI and the Python openstacksdk, and learning it pays back fast.

I’ve automated OpenStack operations for years across multiple clouds. Here’s the toolkit I actually reach for and the patterns that keep the automation safe.

clouds.yaml: stop pasting credentials

The first thing to fix is authentication. Sourcing openrc files and exporting OS_* env vars everywhere is fragile: the secrets sit in every child process’s environment and, if you type the exports by hand, in shell history. Use clouds.yaml instead, a single file (in the current directory, ~/.config/openstack/, or /etc/openstack/) that defines named clouds:

# ~/.config/openstack/clouds.yaml
clouds:
  prod:
    auth:
      auth_url: https://keystone.prod.example.com/v3
      username: automation
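      password: "{{ from_secrets_manager }}"  # placeholder: clouds.yaml is not templated, so render the real value at deploy time or keep it in secure.yaml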
      project_name: ops
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
    interface: public
    identity_api_version: 3

Now every tool just references the cloud by name:

openstack --os-cloud prod server list

And in Python:

import openstack
conn = openstack.connect(cloud="prod")

One source of truth, no exported secrets, trivial to switch between clouds. This alone cleans up most automation messes I inherit.
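
For scripts that run against more than one cloud, I avoid hardcoding the name. Here is a minimal sketch of how a script might choose the clouds.yaml entry, assuming an explicit argument wins over the OS_CLOUD environment variable. The helper names resolve_cloud_name and connect are my own, not part of the SDK.

```python
import os


def resolve_cloud_name(cli_value=None, environ=None):
    """Pick the clouds.yaml entry to use: explicit argument first, then OS_CLOUD."""
    environ = os.environ if environ is None else environ
    name = cli_value or environ.get("OS_CLOUD")
    if not name:
        # Fail early with a clear message instead of a confusing auth error later.
        raise SystemExit("No cloud selected: pass a cloud name or set OS_CLOUD")
    return name


def connect(cli_value=None):
    """Open an SDK connection to the selected cloud."""
    # Imported lazily so resolve_cloud_name can be used without the SDK installed.
    import openstack

    return openstack.connect(cloud=resolve_cloud_name(cli_value))
```

The SDK can pick up OS_CLOUD on its own, so the helper is mostly about failing loudly when nothing was selected rather than falling through to whatever defaults happen to be present.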

The unified CLI for shell automation

The openstack command is consistent enough that shell scripting against it is pleasant — as long as you ask for machine-readable output. Never parse the default table format; use -f json or -f value:

# Get just the IDs of all ERROR-state instances
openstack --os-cloud prod server list --status ERROR -f value -c ID

# Loop over them safely
for id in $(openstack --os-cloud prod server list --status ERROR -f value -c ID); do
  echo "Would reset $id"   # dry run first, always
done

The -f value -c <column> combination is the workhorse: it prints one clean value per line, with no headers or borders to strip. For anything structured, use -f json and pipe the output to jq. Parsing the pretty table is the number-one fragile-automation mistake I see.
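
When a shell loop grows into a program but still shells out to the CLI, the same rule holds: consume the JSON, not the table. This is a rough sketch, assuming the ID and Status keys that server list -f json emits. The function names are hypothetical.

```python
import json
import subprocess


def server_ids_with_status(list_json, status):
    """Return IDs of servers in the given status from `server list -f json` text."""
    servers = json.loads(list_json)
    return [s["ID"] for s in servers if s.get("Status") == status]


def list_servers_json(cloud):
    """Run the CLI and return its JSON output as text (needs the openstack client)."""
    result = subprocess.run(
        ["openstack", "--os-cloud", cloud, "server", "list", "-f", "json"],
        check=True,           # raise if the CLI exits non-zero
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling server_ids_with_status(list_servers_json("prod"), "ERROR") would give the same IDs as the shell one-liner above. Keeping the parsing separate from the subprocess call makes the filter easy to test with canned JSON.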

The Python SDK for real programs

For anything beyond a loop, the openstacksdk is far better than shelling out. It’s idempotent-friendly, handles pagination, and returns real objects:

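import openstack

conn = openstack.connect(cloud="prod")

# Look up prerequisites by name; find_* returns None when nothing matches
image = conn.image.find_image("ubuntu-22.04")
flavor = conn.compute.find_flavor("m1.small")
network = conn.network.find_network("tenant-net")
if image is None or flavor is None or network is None:
    raise SystemExit("missing image, flavor, or network")

# Idempotent: create only if it doesn't exist
server = conn.compute.find_server("web-01")
if server is None:
    server = conn.compute.create_server(
        name="web-01",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    # wait_for_server returns the refreshed server, so keep its result
    server = conn.compute.wait_for_server(server)

# access_ipv4 is often empty; addresses holds the per-network IPs
print(server.access_ipv4 or server.addresses)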

The find_* methods returning None when nothing matches is what makes idempotent scripts clean — check, then create. The wait_for_server helper handles the polling you’d otherwise write by hand. The SDK also paginates transparently, so conn.compute.servers() iterates every server across pages without you tracking markers.
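
The two behaviours described above, find-then-create and transparent pagination, fit into small reusable functions. This is a sketch under the assumption that conn is an openstacksdk connection. The names ensure_network and count_servers_by_status are mine, not SDK calls.

```python
from collections import Counter


def ensure_network(conn, name):
    """Return the network called `name`, creating it only if none exists."""
    network = conn.network.find_network(name)
    if network is None:
        network = conn.network.create_network(name=name)
    return network


def count_servers_by_status(conn):
    """Tally servers by status; the servers() generator walks every page."""
    return Counter(server.status for server in conn.compute.servers())
```

Because both functions only touch conn through a couple of methods, they are easy to exercise against a fake connection before pointing them at a real cloud.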

Patterns that keep automation safe

A few hard-won rules:

Dry run by default. Any script that deletes or mutates should print what it would do unless --apply is passed. Orphan-cleanup scripts that “just delete” are how you lose a tenant’s volumes.
Tag what you create. Set metadata/properties on resources your automation makes (created_by: automation, a run ID). Then cleanup can scope to exactly what it owns and never touches a human’s resources.
Use a dedicated service account with a scoped role, not your admin credentials. Federation or an application credential is ideal:
```
openstack application credential create automation-job \
  --role member --restricted
```
Application credentials can be revoked independently and don’t carry your full privileges.
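Handle rate limits and retries. The SDK reads retry settings from clouds.yaml (connect_retries and status_code_retries, which can also be prefixed per service, as in compute_status_code_retries), and large clouds will throttle you. Bulk operations should back off, not hammer.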
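
Here is roughly how the dry-run, tagging, and backoff rules above combine in a cleanup script. Treat it as a sketch, not a finished tool. The created_by/automation tag, the helper names, and the retry-on-any-exception policy are all assumptions of mine. A real script would likely retry only on throttling errors, for example an openstack.exceptions.HttpException whose status_code is 429.

```python
import argparse
import time

OWNER_KEY = "created_by"    # hypothetical metadata key set by our automation
OWNER_VALUE = "automation"  # hypothetical value marking resources we own


def owned_by_automation(server):
    """True only for servers carrying our ownership tag in their metadata."""
    return (server.metadata or {}).get(OWNER_KEY) == OWNER_VALUE


def with_backoff(action, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call action(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults


def cleanup_error_servers(conn, apply=False, sleep=time.sleep):
    """Delete ERROR servers we own; without apply=True, only report them."""
    actions = []
    for server in conn.compute.servers(status="ERROR"):
        if not owned_by_automation(server):
            continue  # never touch resources we did not create
        if apply:
            with_backoff(lambda s=server: conn.compute.delete_server(s), sleep=sleep)
            actions.append(("deleted", server.id))
        else:
            actions.append(("would delete", server.id))
    return actions


def parse_args(argv=None):
    """Command-line options: dry run unless --apply is given."""
    parser = argparse.ArgumentParser(description="Clean up ERROR servers created by automation")
    parser.add_argument("--cloud", required=True, help="clouds.yaml entry to use")
    parser.add_argument("--apply", action="store_true", help="actually delete (default is a dry run)")
    return parser.parse_args(argv)
```

A caller would parse the arguments, open the connection with openstack.connect(cloud=args.cloud), and print whatever cleanup_error_servers(conn, apply=args.apply) returns. The destructive branch is unreachable unless someone types --apply.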

Where AI fits

This is exactly the kind of work where AI accelerates without touching production: I describe the operation in plain English and get a first-draft SDK script or a jq filter, then I review it and add the dry-run guard myself. The model is good at the boilerplate — pagination, find-then-create, argument parsing — and I keep the judgment about what’s safe to run. I keep a set of OpenStack automation prompts (audit orphaned floating IPs, find instances without backups, report per-project quota usage) in our prompt library so I’m not re-deriving them each time.

Ansible for declarative infra

For declarative provisioning rather than imperative scripts, the openstack.cloud Ansible collection uses the same clouds.yaml and gives you idempotent modules:

- name: Ensure web network exists
  openstack.cloud.network:
    cloud: prod
    name: web-net
    state: present

I reach for Ansible when I want desired-state convergence and for the SDK when I want imperative logic. Both read the same auth config, so they coexist cleanly.

Where to go next

Automating OpenStack well comes down to a few habits: centralize auth in clouds.yaml, never parse the pretty CLI output, prefer the SDK for real logic, tag everything you create, and dry-run by default. Add a scoped application credential and you have safe, repeatable operations instead of Horizon clicking. For the services you’ll be automating against — Nova, Cinder, Neutron, and the rest — see the OpenStack category.

Automation that mutates a cloud can mutate it at scale. Always dry-run destructive scripts and scope service credentials before running them against production.