Planning OpenStack Upgrades Safely Without Downtime

OpenStack upgrades have a reputation, and it’s earned. The release cadence is fast, the services are tightly coupled through a shared message bus and databases, and a single skipped migration can wedge the control plane. After upgrading production clouds across many releases, I’ve learned the failures are almost never exotic — they’re boring details executed in the wrong order.

Here’s the plan I use to upgrade without taking the cloud down.

The rule: never skip a release

OpenStack supports upgrading one release at a time, and since 2023.1 Antelope it also supports “skip-level” upgrades between designated SLURP (Skip Level Upgrade Release Process) releases. But unless you’re explicitly on a SLURP-to-SLURP path, upgrade sequentially. The RPC and object-version compatibility windows are designed for N to N+1. Caracal (2024.1) is itself a SLURP release, so Caracal to Epoxy (2025.1) is a supported skip. Jumping two releases from a non-SLURP starting point, say 2023.2 Bobcat straight to 2024.2 Dalmatian, is how you discover undocumented migration gaps the hard way.

Map your path before anything else: current release, target release, and every release in between.
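Here is a minimal sketch of that mapping in Python. The release table and the upgrade_path helper are my own illustration, not an official tool, and the SLURP flags should be checked against the official release schedule before you rely on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    name: str
    slurp: bool  # True if this is a designated SLURP release


# Illustrative table, ordered oldest to newest. Verify against the
# official release schedule; extend it with the releases you run.
RELEASES = [
    Release("2023.1 Antelope", True),
    Release("2023.2 Bobcat", False),
    Release("2024.1 Caracal", True),
    Release("2024.2 Dalmatian", False),
    Release("2025.1 Epoxy", True),
]


def upgrade_path(current, target, releases=RELEASES):
    """Return every release to land on, from current to target inclusive."""
    names = [r.name for r in releases]
    start, end = names.index(current), names.index(target)
    if end <= start:
        raise ValueError("target must be newer than current")
    path = [releases[start].name]
    i = start
    while i < end:
        # Skip a release only when going from one SLURP release to the next.
        can_skip = (
            releases[i].slurp and i + 2 <= end and releases[i + 2].slurp
        )
        i += 2 if can_skip else 1
        path.append(releases[i].name)
    return path


if __name__ == "__main__":
    print(upgrade_path("2023.2 Bobcat", "2025.1 Epoxy"))
```

Starting from a non-SLURP release, the helper steps one release at a time until it lands on a SLURP release, and only then takes the skip-level hop.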

Step 1: Read the release notes like they’re a contract

Every release has upgrade notes per service. They tell you about removed config options, required DB migrations, and deprecations that became removals. Skipping this step is the most expensive shortcut in OpenStack operations.

Build a per-service change list: Keystone first (everything depends on it), then Glance, Nova, Neutron, Cinder, and the rest. Note every config key that’s renamed or removed — those are silent breakers.
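One way to catch those silent breakers is to check each service's config file against the change list you just built. The sketch below is only an illustration: find_stale_options is a hypothetical helper, and the option names in the example are invented placeholders, not real OpenStack options. You would fill the removed and renamed tables from the actual release notes.

```python
import configparser


def find_stale_options(conf_text, removed, renamed):
    """List options in an INI-style config that the target release drops.

    removed: set of (section, option) pairs taken from the release notes.
    renamed: dict mapping (section, option) to (new_section, new_option).
    """
    parser = configparser.ConfigParser(
        default_section="__unused__",  # treat [DEFAULT] as an ordinary section
        interpolation=None,            # values may contain literal % signs
        strict=False,                  # tolerate repeated (multi-valued) keys
    )
    parser.read_string(conf_text)
    findings = []
    for section in parser.sections():
        for option in parser.options(section):
            key = (section, option)
            if key in removed:
                findings.append(f"[{section}] {option}: removed, delete it")
            elif key in renamed:
                new_section, new_option = renamed[key]
                findings.append(
                    f"[{section}] {option}: renamed to "
                    f"[{new_section}] {new_option}"
                )
    return findings


if __name__ == "__main__":
    # Placeholder option names, purely for illustration.
    sample = "[DEFAULT]\nexample_old_flag = true\n[example_group]\nexample_old_name = 5\n"
    removed = {("DEFAULT", "example_old_flag")}
    renamed = {("example_group", "example_old_name"): ("example_group", "example_new_name")}
    for line in find_stale_options(sample, removed, renamed):
        print(line)
```

Running it against every node's config before the upgrade window turns the change list into something you can check mechanically instead of by eye.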

Step 2: Understand the rolling-upgrade contract

Modern OpenStack services support rolling upgrades through three mechanisms you must respect:

DB schema is expand/contract. New code reads old and new schema. You run schema expand migrations while old services still run, deploy new code, then run contract later.
RPC version pinning. During the rollout, you pin the new services to speak the old RPC version so old and new agents coexist. For Nova:

[upgrade_levels]
compute = auto

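With auto, Nova derives the pin from the oldest nova-compute service version it finds recorded in the database. Or pin explicitly to the previous release name during the transition, then remove the pin once every node is upgraded.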

Online data migrations. After deploying new code, you run background migrations to move data to new formats:

nova-manage db online_data_migrations

This must reach zero remaining before you start the next upgrade. A common failure: starting the next release while migrations from the last one are unfinished.
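To make "reach zero" something a script can enforce, I wrap the command in a loop. This is a sketch under one assumption worth verifying against the nova-manage documentation for your release: that with --max-count the command exits 0 when no further updates are possible and 1 when a batch made progress and should be rerun. drain_migrations and run_batch are hypothetical names of mine.

```python
import subprocess


def run_batch(max_count=1000):
    """Run one batch of Nova online data migrations; return the exit code."""
    result = subprocess.run(
        ["nova-manage", "db", "online_data_migrations",
         "--max-count", str(max_count)]
    )
    return result.returncode


def drain_migrations(run=run_batch, max_rounds=100):
    """Rerun batches until nothing remains; return the number of rounds.

    Assumed exit codes: 0 = nothing left to migrate, 1 = progress was
    made and another batch is needed. Anything else is treated as a
    failure that needs a human.
    """
    for round_number in range(1, max_rounds + 1):
        code = run()
        if code == 0:
            return round_number
        if code != 1:
            raise RuntimeError(f"online migrations failed with exit code {code}")
    raise RuntimeError(f"migrations still pending after {max_rounds} rounds")


if __name__ == "__main__":
    print(f"migrations drained after {drain_migrations()} round(s)")
```

The runner is injected so the loop can be exercised in a rehearsal, or with a stub, without touching a real database.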

Step 3: Upgrade order within the control plane

The ordering that’s bitten me when ignored:

Back up every database. mysqldump all OpenStack schemas, verified restorable. Non-negotiable.
Keystone first. Tokens must keep validating throughout.
Sync schema (expand): keystone-manage db_sync --expand, then --migrate.
Glance, Nova, Neutron, Cinder, each in the same sequence: RPC pin in place (where the service has one), db_sync, new code, then online migrations.
Compute nodes last, rolling, so the control plane (already new, RPC-pinned to old) keeps serving the still-old agents.
Remove RPC pins, run contract migrations once every node is new.
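The ordering above can be sketched as a generated runbook, which makes it harder to shuffle steps by accident. build_runbook and its step strings are my own illustration. The steps are labels for a human checklist, not commands, and not every service has an RPC pin or a separate contract phase.

```python
CONTROL_PLANE = ["keystone", "glance", "nova", "neutron", "cinder"]


def build_runbook(compute_hosts, services=CONTROL_PLANE):
    """Return upgrade steps in the order they should be executed."""
    steps = ["backup all databases and verify the restore"]
    for service in services:
        steps += [
            f"{service}: set RPC pin (if the service has one)",
            f"{service}: expand schema",
            f"{service}: deploy new code",
            f"{service}: run online data migrations",
        ]
    # Computes roll only after the whole control plane is on new code.
    steps += [f"compute {host}: disable, upgrade, enable" for host in compute_hosts]
    # Compatibility scaffolding comes down last, once every node is new.
    steps += [f"{service}: remove RPC pin" for service in services]
    steps += [f"{service}: contract schema" for service in services]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(build_runbook(["cmp01", "cmp02"]), start=1):
        print(f"{number:2d}. {step}")
```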

Step 4: The compute-node dance

Compute nodes are where “no downtime” is won or lost. The control plane is upgraded and pinned to the old RPC version, so it can talk to both old and new nova-compute. Now roll the computes one (or one availability zone) at a time:

openstack compute service set --disable <compute> nova-compute   # stop new placements first
openstack server list --host <compute> --all-projects   # know what's running
# live-migrate (or cold-migrate) workloads off if you want zero instance impact;
# evacuate is only for a host whose nova-compute is already down
# upgrade the node, restart nova-compute
openstack compute service set --enable <compute> nova-compute

Disabling the service first stops the scheduler from placing new instances on a node you’re about to bounce.
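If you script the roll, the order is the part worth encoding. Below is a minimal sketch: roll_compute is a hypothetical helper, and run, drain, and upgrade are callables you supply (for example a subprocess wrapper, your live-migration logic, and your package or container upgrade). Only the two openstack commands come from the steps above.

```python
def roll_compute(host, run, drain, upgrade):
    """Upgrade one compute node without letting new instances land on it.

    run(cmd)      executes an openstack CLI command given as a list
    drain(host)   moves workloads off the host (may be a no-op)
    upgrade(host) upgrades packages and restarts nova-compute
    """
    # Disable first so the scheduler stops placing instances here.
    run(["openstack", "compute", "service", "set", "--disable",
         host, "nova-compute"])
    drain(host)
    upgrade(host)
    # Reached only if drain and upgrade raised nothing, so a failed
    # node stays disabled until someone has looked at it.
    run(["openstack", "compute", "service", "set", "--enable",
         host, "nova-compute"])


if __name__ == "__main__":
    def show(cmd):
        print(" ".join(cmd))

    roll_compute(
        "cmp01",
        run=show,
        drain=lambda h: print(f"# drain {h}"),
        upgrade=lambda h: print(f"# upgrade {h}"),
    )
```

Leaving the node disabled on failure is a deliberate choice: a half-upgraded compute should not receive new instances.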

Step 5: Verify, then contract

After every node is new, don’t immediately rip out the compatibility scaffolding. Verify first:

openstack compute service list      # every service enabled and up; confirm the installed version on each host
nova-manage db online_data_migrations   # must report 0 remaining

Only then remove the RPC pins and run the --contract schema migrations that drop old columns. Contracting too early — while an old service still expects the old column — is a classic self-inflicted outage.
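That gate can be expressed as a small check. This sketch assumes you capture `openstack compute service list -f json` and that the JSON rows carry "Binary", "Host", "Status", and "State" keys matching the table columns. Confirm this against your client version. ready_to_contract is a hypothetical helper, and the migrations exit code is whatever your last online_data_migrations run returned.

```python
import json


def ready_to_contract(service_list_json, migrations_exit_code):
    """Return a list of blockers; an empty list means it is safe to proceed."""
    blockers = []
    for svc in json.loads(service_list_json):
        if svc["Status"] != "enabled" or svc["State"] != "up":
            blockers.append(
                f"{svc['Binary']} on {svc['Host']} is "
                f"{svc['Status']}/{svc['State']}"
            )
    if migrations_exit_code != 0:
        blockers.append("online data migrations have not finished")
    return blockers


if __name__ == "__main__":
    sample = json.dumps([
        {"Binary": "nova-compute", "Host": "cmp01",
         "Status": "enabled", "State": "up"},
        {"Binary": "nova-compute", "Host": "cmp02",
         "Status": "disabled", "State": "up"},
    ])
    for blocker in ready_to_contract(sample, migrations_exit_code=0):
        print("BLOCKED:", blocker)
```

Note that this only checks service state. Confirming that each node runs the new release still means checking the installed version on the host.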

Using AI to de-risk the plan

Upgrade planning is reading-heavy and detail-heavy, which is exactly where an LLM earns its keep as a planning assistant. I paste the relevant release notes and ask:

“Here are the Nova and Neutron upgrade release notes between release A and release B. Produce an ordered checklist of breaking config changes, required db_sync steps, RPC pinning needed, and online migrations. Flag anything that requires action before I deploy new code. Do not invent steps not supported by these notes.”

That last constraint matters — left loose, models confidently hallucinate migration commands. Grounded in the actual notes, it’s excellent at turning prose into an ordered runbook. I keep these upgrade-planning prompts with my other OpenStack prompts.
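To keep the grounding constraint from getting dropped when I'm in a hurry, I assemble the prompt from a template. A minimal sketch, where build_upgrade_prompt is a hypothetical helper and the notes are whatever release-note text you paste in:

```python
def build_upgrade_prompt(notes_by_service, from_release, to_release):
    """Assemble a planning prompt grounded in pasted release notes.

    notes_by_service maps a service name to its upgrade-notes text.
    """
    services = ", ".join(notes_by_service)
    sections = "\n\n".join(
        f"=== {service} upgrade notes ===\n{notes.strip()}"
        for service, notes in notes_by_service.items()
    )
    return (
        f"Here are the {services} upgrade release notes between "
        f"{from_release} and {to_release}.\n\n"
        f"{sections}\n\n"
        "Produce an ordered checklist of breaking config changes, required "
        "db_sync steps, RPC pinning needed, and online migrations. Flag "
        "anything that requires action before I deploy new code. "
        "Do not invent steps not supported by these notes."
    )


if __name__ == "__main__":
    print(build_upgrade_prompt(
        {"Nova": "(paste Nova notes here)", "Neutron": "(paste Neutron notes here)"},
        "release A",
        "release B",
    ))
```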

Rehearse it, every time

The single highest-leverage practice: rehearse the upgrade in a staging cloud that mirrors production, with a snapshot of production’s database. Most upgrade failures are data-shaped — a migration that chokes on a row your test data never had. A dump-and-restore rehearsal surfaces those before they touch real workloads.
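For the dump side of that rehearsal, I generate the mysqldump invocation so the same database list is used every time. This is a sketch: dump_command is a hypothetical helper, and the database names are common defaults that may not match your deployment.

```python
# Common default schema names; an assumption, adjust to your deployment.
DEFAULT_DATABASES = [
    "keystone", "glance", "nova", "nova_api", "nova_cell0",
    "placement", "neutron", "cinder",
]


def dump_command(out_file, databases=DEFAULT_DATABASES):
    """Build a mysqldump argument list for a consistent InnoDB snapshot."""
    return [
        "mysqldump",
        "--single-transaction",        # consistent snapshot without locking tables
        "--routines",                  # include stored procedures and functions
        f"--result-file={out_file}",
        "--databases", *databases,     # emits CREATE DATABASE / USE statements
    ]


if __name__ == "__main__":
    print(" ".join(dump_command("/backup/openstack-rehearsal.sql")))
```

Restoring that file into the staging database server and then running the full upgrade against it is what exercises the migrations on production-shaped rows.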

OpenStack upgrades are not scary once you internalize the contract: sequential releases, expand-deploy-migrate-contract, Keystone first and computes last, and nothing torn down until everything’s verified. Back up, rehearse, and order the steps deliberately. For more upgrade and operations prompts, browse our prompt library.

AI-generated upgrade checklists are assistive, not authoritative. Validate every step against the official release notes and rehearse in staging first.