How to Build a Production-Ready OpenStack Cloud (2026 Guide)
Build a production-ready OpenStack cloud: HA control plane, Kolla-Ansible as code, TLS, networking, storage, backups, monitoring, and a tested upgrade path.
- #openstack
- #kolla-ansible
- #private-cloud
- #production
- #devops
A production-ready OpenStack cloud is not OpenStack running once on a single node. It is a highly available control plane fronted by a VIP, deployed entirely as code with Kolla-Ansible, with TLS everywhere, a deliberate network model, working backups of your databases and Fernet keys, real monitoring, and an upgrade path you have actually rehearsed. Anything short of that is a lab, and the gap between “it boots a VM” and “it survives a node failure at 3am without me” is exactly where most private-cloud projects quietly die.
I’ve built and operated OpenStack clouds for years, both the hand-rolled package way and the Kolla-Ansible way. This guide is the build I’d hand to a team standing up a cloud they intend to keep. It’s opinionated, it uses real commands and real filenames, and it tells you where the bodies are buried. If you want the focused deployment walkthrough first, read Deploying OpenStack with Kolla-Ansible — this piece is the wider production picture around it.
What “production-ready” actually means
Before any hardware, agree on the definition with your team, because it drives every decision below:
- No single point of failure in the control plane. Three controllers, a virtual IP, clustered databases and message queues.
- The cloud is described in version control. If a controller dies, you redeploy it byte-for-byte from globals.yml, an inventory, and passwords.yml, not from memory.
- TLS on the public (and ideally internal) API endpoints. Keystone tokens and credentials do not cross the wire in plaintext.
- A network model you chose on purpose — provider networks for “real” IPs, tenant networks for self-service, with the physical NICs and VLANs mapped explicitly.
- Storage with a backend you can lose a disk from without losing a volume.
- Backups that you have restored at least once. Untested backups are decoration.
- Monitoring and central logs, so you find out about RabbitMQ partitions before your users do.
- A written, rehearsed upgrade procedure. OpenStack ships every six months; a cloud you can’t upgrade is a cloud with an expiry date.
Reference architecture and node roles
Start with three node roles. You can collapse them on small clouds, but understand what you’re trading away.
Control nodes (3). They run the API services (Keystone, Nova, Neutron, Glance, Cinder, Placement, Horizon), the Galera/MariaDB cluster, RabbitMQ, memcached, and the HAProxy + keepalived pair that owns the VIP. Three is the magic number: Galera and RabbitMQ both need an odd quorum to avoid split-brain. Two controllers is worse than one — it can deadlock. Size them generously: 16+ cores, 64–128 GB RAM, fast NVMe for the database. The control plane is memory- and IOPS-hungry, not CPU-bound.
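The quorum arithmetic behind that claim is worth seeing once. This is a minimal sketch, assuming simple majority voting with equal weights per member; the function names are mine and not part of any OpenStack tool.

```shell
# Majority quorum for an N-member cluster: floor(N/2) + 1 votes.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while the survivors still hold a majority.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Two members tolerate zero failures, the same as one, while doubling the number of nodes whose loss stops writes. Three is the first size that survives a failure.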
Compute nodes (N). They run nova-compute, the Neutron OVN controller agent, and the hypervisor (KVM). Scale these horizontally — this is where your tenant workloads live. Size to your workload; overcommit CPU modestly (4:1 is sane), be careful with RAM overcommit (1:1 in production unless you enjoy the OOM killer).
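To make those ratios concrete: schedulable capacity is roughly (physical minus reserved) times the allocation ratio. The sketch below uses a hypothetical 64-core, 512 GB node. The ratios correspond to Nova's cpu_allocation_ratio and ram_allocation_ratio, and the 16 GB host reservation is an assumed figure, not a recommendation.

```shell
# schedulable = (physical - reserved) * allocation ratio
# Integer ratios only, to stay within plain shell arithmetic.
schedulable() {
  echo $(( ($1 - $2) * $3 ))
}

cores=64        # hypothetical physical cores on one compute node
ram_gb=512      # hypothetical physical RAM
reserved_gb=16  # assumed headroom kept back for the host OS and agents

echo "vCPUs:  $(schedulable "$cores" 0 4)"
echo "RAM GB: $(schedulable "$ram_gb" "$reserved_gb" 1)"
```

At 4:1 this node offers 256 vCPUs but only 496 GB of RAM to instances, so memory, not CPU, is usually what fills first.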
Storage nodes. If you run Ceph (you probably should — see below), these are dedicated OSD hosts. Plan for at least three for replication, more for real capacity. Keep Ceph off your hypervisors at scale; co-located “hyperconverged” setups are fine for small clouds but couple two failure domains.
A minimum credible production footprint is 3 control + 3 compute + 3 storage = 9 nodes, plus a small deploy/bastion host that runs Kolla-Ansible and holds your /etc/kolla config. You can shrink to 3 controllers doubling as storage for a pilot, but write down that you did it.
Networking-wise, give each node at least two physical networks: a management/API network (control plane traffic, internal API, database, RabbitMQ) and a tenant/provider network (VM traffic). A third NIC for storage replication keeps Ceph backfill from starving your API plane. Bond them for redundancy.
Choosing a deployment tool: Kolla-Ansible
Do not hand-install OpenStack in production in 2026. Use a deployment framework so your cloud is reproducible. The two serious open-source choices are Kolla-Ansible (containers + Ansible) and OpenStack-Ansible (LXC/bare packages + Ansible). I default to Kolla-Ansible: every service is a container image, configs are rendered from your variables, and “rebuild this controller” is a single command with a predictable result.
Your entire cloud lives in three files under /etc/kolla:
- globals.yml: the deployment-wide configuration.
- passwords.yml: generated secrets (kolla-genpwd), kept in a vault.
- multinode: the Ansible inventory mapping hosts to roles.
A production globals.yml has a handful of decisions that matter far more than the rest:
# /etc/kolla/globals.yml
kolla_base_distro: "ubuntu"
openstack_release: "2025.1" # pin it; never "master" in prod
# The control-plane VIP — must be a free IP on the API network
kolla_internal_vip_address: "10.10.0.10"
kolla_external_vip_address: "203.0.113.10"
# TLS on the endpoints
kolla_enable_tls_internal: "yes"
kolla_enable_tls_external: "yes"
kolla_copy_ca_into_containers: "yes"
# HA building blocks (these are the defaults you must NOT disable)
enable_haproxy: "yes"
enable_mariadb: "yes"
enable_rabbitmq: "yes"
om_enable_rabbitmq_quorum_queues: "yes" # durable queues, see HA section
# Networking
neutron_plugin_agent: "ovn"
network_interface: "bond0" # management/API
neutron_external_interface: "bond1" # provider/tenant uplink
# Storage backend (Ceph external cluster shown)
glance_backend_ceph: "yes"
cinder_backend_ceph: "yes"
nova_backend_ceph: "yes"
The single most common mistake here is setting kolla_internal_vip_address to an address keepalived cannot actually claim: it must be unused and sit in the same subnet as network_interface. The second is not pinning openstack_release. Pin it.
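The pinning rule is easy to enforce mechanically before anything touches the cloud. This is a small sketch of such a guard, assuming the simple one-line `openstack_release: "..."` form shown above. release_is_pinned is an illustrative name, not a Kolla-Ansible feature.

```shell
# Succeed only if FILE sets openstack_release to something other than "master".
release_is_pinned() {
  # First uncommented openstack_release line: drop the key, any trailing
  # comment, and the surrounding quotes.
  value=$(sed -n 's/^openstack_release:[[:space:]]*//p' "$1" | head -n 1 \
          | sed 's/[[:space:]]*#.*$//' | tr -d "\"'")
  [ -n "$value" ] && [ "$value" != "master" ]
}
```

A CI job or a wrapper script could call `release_is_pinned /etc/kolla/globals.yml || exit 1` before running any playbook.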
Deploy is then the familiar Kolla-Ansible cadence from your deploy host:
kolla-ansible bootstrap-servers -i multinode
kolla-ansible prechecks -i multinode
kolla-ansible deploy -i multinode
kolla-ansible post-deploy -i multinode # writes /etc/kolla/admin-openrc.sh
prechecks is your friend — it catches the VIP, the interfaces, and quorum problems before they become a half-deployed cloud.
Networking: provider vs tenant, OVN, VLAN/VXLAN
Networking is where OpenStack newcomers lose the most time, so decide deliberately. In 2026, use OVN as your Neutron backend: it replaces the old agent zoo (no more neutron-l3-agent and neutron-dhcp-agent sprawl), implements routing and DHCP natively in the OVN/OVS datapath, and is what new clouds should run. If you have an old cloud on the legacy ML2/OVS stack, see Migrating Neutron to OVN.
Two network concepts you must hold separately:
- Provider networks map directly onto a real L2 segment in your datacenter (a VLAN, or a flat network). VMs get “real” routable IPs. Use these for anything that needs to be reachable like a normal server.
- Tenant networks are self-service overlays (VXLAN/Geneve) that projects create themselves, with NAT to the outside via floating IPs. This is the “cloud” experience — users carve out their own subnets.
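The physical mapping lives in bridge_mappings and your provider config. Kolla builds the mapping from neutron_external_interface, which by default is attached to the br-ex bridge and exposed as physnet1. A typical setup:
# ML2 settings, e.g. an override in /etc/kolla/config/neutron/ml2_conf.ini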
[ml2_type_vlan]
network_vlan_ranges = physnet1:100:200
[ovn]
ovn_l3_scheduler = leastloaded
Then you create the provider network so projects can attach to it:
openstack network create --provider-network-type vlan \
--provider-physical-network physnet1 --provider-segment 100 \
--external provider-net
openstack subnet create --network provider-net \
--subnet-range 203.0.113.0/24 --gateway 203.0.113.1 \
--allocation-pool start=203.0.113.50,end=203.0.113.200 provider-subnet
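A cheap sanity check for commands like these is to confirm that the gateway and both ends of the allocation pool really sit inside the subnet range. Here is a small IPv4-only sketch in plain shell arithmetic. ip_to_int and cidr_contains are illustrative helper names, and addresses with leading zeros are not handled.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if ADDRESS ($2) falls inside CIDR ($1), e.g. 203.0.113.0/24.
cidr_contains() {
  bits=${1#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  net=$(ip_to_int "${1%/*}")
  addr=$(ip_to_int "$2")
  [ $(( net & mask )) -eq $(( addr & mask )) ]
}

for ip in 203.0.113.1 203.0.113.50 203.0.113.200; do
  cidr_contains 203.0.113.0/24 "$ip" && echo "$ip is inside the subnet"
done
```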
If floating IPs or East-West traffic misbehave later, the playbooks I keep open are Debugging Neutron Networking and the floating-IP/NAT specifics. Get the physical network and VLAN trunking right on your switches first — most “Neutron bugs” are switchport bugs.
Storage: Cinder backends, Ceph vs LVM, Glance store
Storage is a one-way door, so choose carefully.
Ceph is the production answer for a general-purpose cloud. One Ceph cluster backs Glance (images), Cinder (volumes), and Nova (ephemeral disks via RBD). The payoff is huge: copy-on-write clones make booting from an image near-instant, live migration works cleanly because the disk is shared, and you can lose a disk (or a node) without losing data. The cost is operational complexity — Ceph is a distributed storage system you now also operate.
LVM (the cinder-volume LVM/iSCSI driver) is fine for a small cloud or a pilot, but it pins each volume to one node. Lose that node, lose access to those volumes; live migration of volume-backed instances gets awkward. I use LVM only when the cloud is small enough that “restore from backup” is an acceptable failure mode.
For Glance, store images in Ceph (glance_backend_ceph: "yes") so they’re shared and CoW-clonable; the local file store works but doesn’t scale past one controller. For Cinder DR, set up volume backups to an off-cluster target — losing the storage cluster shouldn’t lose the backups. The full procedure is in Cinder volume backups and disaster recovery.
A common Ceph integration check after deploy:
openstack volume create --size 10 smoke-test
openstack volume show smoke-test -c status -c volume_type
rbd -p volumes ls # the volume should appear in the Ceph pool
High availability: VIP, Galera, RabbitMQ
This is the section that separates a cloud from a demo.
The VIP (HAProxy + keepalived). All API traffic hits a single virtual IP. keepalived floats that IP across the three controllers; HAProxy load-balances each service behind it. Kolla-Ansible sets this up when enable_haproxy: "yes" and you’ve given it a real kolla_internal_vip_address. Verify the VIP actually fails over: stop keepalived on the active controller and watch the IP migrate.
# on each controller
ip -4 addr show | grep -F 'inet 10.10.0.10/' # exactly one controller should own it; the trailing slash keeps 10.10.0.100 from matching
docker exec haproxy hatop -s /var/lib/kolla/haproxy/haproxy.sock # backend health
Galera (MariaDB). Your three controllers run a synchronous multi-master Galera cluster. The risk is split-brain: if the cluster loses quorum (a network partition or two nodes down), it stops accepting writes to protect data. That’s correct behaviour, but it means you must understand recovery. Check cluster health regularly:
docker exec mariadb mysql -u root -p"$DBPW" \
-e "SHOW STATUS LIKE 'wsrep_cluster_size';" # should equal 3
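Cluster size alone can mislead: a node can still report a size while sitting in a non-Primary component. This is a hedged sketch of a slightly stricter check that also reads wsrep_cluster_status and wsrep_local_state_comment. galera_healthy is a hypothetical helper, and it expects tab-separated name/value rows on stdin.

```shell
# Reads "name<TAB>value" rows on stdin. Succeeds only if the cluster has the
# expected size ($1), is the Primary component, and this node is Synced.
galera_healthy() {
  awk -F '\t' -v want="$1" '
    $1 == "wsrep_cluster_size"        { size = $2 }
    $1 == "wsrep_cluster_status"      { status = $2 }
    $1 == "wsrep_local_state_comment" { state = $2 }
    END { exit !(size == want && status == "Primary" && state == "Synced") }
  '
}
```

One way to feed it is `docker exec mariadb mysql -u root -p"$DBPW" -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%';" | galera_healthy 3`, run on each controller in turn.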
If the cluster goes down hard, recover it with kolla-ansible mariadb_recovery — do not improvise with --wsrep-new-cluster unless you know which node had the latest commit.
RabbitMQ. OpenStack services talk over RabbitMQ, and a wedged broker takes the cloud with it. Enable quorum queues (om_enable_rabbitmq_quorum_queues: "yes") — they replicate durably across the cluster and survive node loss far better than the old mirrored/classic queues. Watch for queue buildup, which signals a stuck consumer (a hung agent) long before users complain; I keep Diagnosing RabbitMQ queue buildup close. The cardinal sin is running a single RabbitMQ node “to keep it simple” — when it restarts, every API call hangs.
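Queue buildup is simple to spot from the broker's own listing. This is a minimal sketch that filters the name/messages table printed by `rabbitmqctl list_queues name messages`. deep_queues and the threshold are illustrative choices, not an established convention.

```shell
# Reads "queue<TAB>messages" rows on stdin and prints "messages<TAB>queue"
# for every queue deeper than the threshold ($1), deepest first.
# Header or banner lines are skipped because their second column is not numeric.
deep_queues() {
  awk -F '\t' -v limit="$1" \
    '$2 ~ /^[0-9]+$/ && $2 + 0 > limit + 0 { print $2 "\t" $1 }' \
    | sort -rn
}
```

For example, `docker exec rabbitmq rabbitmqctl list_queues --quiet name messages | deep_queues 100` lists the queues worth looking at first. A healthy cloud normally prints nothing.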
Security: TLS, Fernet rotation, secrets
TLS. Terminate TLS at HAProxy for the external endpoints at minimum (kolla_enable_tls_external: "yes"), and internal too if your management network isn’t fully trusted. Use a real CA-signed cert for the external VIP so users’ openstack clients don’t need --insecure. Kolla will copy your CA into the containers with kolla_copy_ca_into_containers: "yes".
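Keystone Fernet keys. Keystone encrypts and signs tokens with Fernet keys that must be rotated and must be identical across all three controllers. Kolla handles both rotation and distribution from the keystone_fernet container, but you still own the schedule and should confirm it is actually running:
# rotation cadence comes from the fernet_* variables in globals.yml (for example fernet_token_expiry)
docker exec keystone_fernet ls -l /etc/keystone/fernet-keys # same key set, with recent mtimes, on every controller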
Two failure modes to internalize: if the keys drift between controllers, tokens validate on one node and fail on another (intermittent 401s). And if you ever restore Keystone from backup, you must restore the Fernet keys with it, or every existing token instantly becomes invalid.
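Drift is straightforward to detect if each controller can produce one fingerprint for its key repository. This is a sketch under a few assumptions: sha256sum is available (Linux coreutils), the repository is a flat directory of key files, and fernet_repo_digest is a name I made up. It prints only a hash, never key material.

```shell
# Prints a single digest for a key repository: hash each key file, then hash
# the sorted "name digest" list so directory order does not matter.
fernet_repo_digest() {
  ( cd "$1" && for key in *; do
      printf '%s %s\n' "$key" "$(sha256sum < "$key" | cut -d ' ' -f 1)"
    done | sort | sha256sum | cut -d ' ' -f 1 )
}
```

Run it against each controller's copy of the repository and compare the three outputs. Any mismatch outside a rotation window is the intermittent-401 problem waiting to happen.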
Secrets. passwords.yml is generated by kolla-genpwd and contains every service credential in your cloud. Treat it like the crown jewels: keep it in a vault (Ansible Vault, SOPS, or HashiCorp Vault), never in plain git. For per-tenant secret storage (TLS certs, keys handed to instances), deploy Barbican — see Securing secrets with Barbican.
A word of caution: do not turn on strict internal mTLS between every service before you’ve run the cloud and audited what talks to what. I’ve watched teams brick a fresh deploy by hardening internal TLS on day one and then spending a week chasing which service couldn’t reach which endpoint. Harden after you have a working, monitored baseline.
Monitoring and logging
You cannot operate what you can’t see. Two pillars:
Metrics. Run Prometheus with the OpenStack exporters and node-exporter on every host. Kolla can deploy a monitoring stack, or point an existing Prometheus at the cloud. The metrics that catch real incidents: RabbitMQ queue depth, Galera wsrep_cluster_size and flow control, HAProxy backend up/down, Nova scheduler failures, and per-host disk/RAM. The full setup is in Monitoring OpenStack with Prometheus.
Logs. Centralize them. Kolla ships a Fluentd → OpenSearch (“central logging”) path; turn it on. When Nova fails to schedule an instance, you want to grep one place for the request ID across nova-api, nova-scheduler, and nova-compute — chasing it across 9 hosts by hand at 3am is how outages get long.
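Until central logging is in place, the same correlation can be approximated per host. This is a minimal sketch that assumes each log line starts with a sortable timestamp, as oslo.log's default format does. request_trail is an illustrative helper and the request ID in the usage note is made up.

```shell
# Prints every line mentioning REQUEST_ID ($1) from the given log files,
# ordered by the timestamp at the start of each line.
request_trail() {
  req=$1
  shift
  grep -h -- "$req" "$@" | sort
}
```

On a Kolla host that might look like `request_trail req-1234 /var/log/kolla/nova/*.log`, which interleaves the API, scheduler, and compute entries for that one request in time order.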
Set alerts on the leading indicators (queue buildup, cluster size < 3, VIP flapping), not just “API is down” — by the time the API is down, you’re already late.
Backups and DR
Here is the uncomfortable truth: most OpenStack outages that become disasters are not hardware failures — they’re a recoverable failure plus missing backups. Back up three things, on a schedule, to off-cluster storage:
- The databases. A consistent dump of the Galera cluster (mariadb-dump/mysqldump against one node, or a Galera-aware backup). This holds your entire cloud state: projects, instances, networks, volume metadata.
- The deploy config. All of /etc/kolla: globals.yml, passwords.yml, the inventory, custom configs. This is how you rebuild controllers.
- The Keystone Fernet keys. Tiny, easy to forget, catastrophic to lose (see above).
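# nightly, from the deploy host or a controller
docker exec mariadb mariadb-dump -u root -p"$DBPW" \
--all-databases --single-transaction > /backup/db-$(date +%F).sql
tar czf /backup/etc-kolla-$(date +%F).tgz /etc/kolla
# Fernet keys: archive the key repository out of the keystone_fernet container
docker exec keystone_fernet tar czf - -C /etc/keystone fernet-keys > /backup/fernet-keys-$(date +%F).tgz
# ship all of it to object storage / off-box target, then verify the copy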
Then — and this is the part everyone skips — restore into a staging cloud at least once. A backup you haven’t restored is a hypothesis. Plan your DR around the question “a controller’s disks are gone, go”: you should be able to redeploy that controller from /etc/kolla and have the cluster heal.
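A full restore rehearsal is the real test, but a cheap nightly check catches truncated or empty artifacts in between. This is a sketch with two hypothetical helpers. It relies on mariadb-dump/mysqldump writing a "-- Dump completed" comment at the end of a successful dump, which will be absent if comments were disabled.

```shell
# Succeeds if the dump is non-empty and its tail carries the completion marker
# that mariadb-dump/mysqldump write last (missing if the dump was cut short).
dump_is_complete() {
  [ -s "$1" ] && tail -n 5 "$1" | grep -q '^-- Dump completed'
}

# Succeeds if the config archive is readable and contains a globals.yml.
archive_has_globals() {
  tar tzf "$1" 2>/dev/null | grep -q 'globals\.yml$'
}
```

Passing both checks only proves the files are plausible, not that they restore. Treat it as a tripwire, not a substitute for the staging restore.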
The upgrade path
OpenStack releases every six months. A cloud you can’t upgrade is technical debt with a countdown timer, because security fixes and support follow the release train.
The rules I operate by:
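- Never skip-release upgrade. Kolla-Ansible and the OpenStack services are tested for upgrading one release at a time (e.g., 2024.1 → 2024.2 → 2025.1). An unsupported jump skips the intermediate database migrations and is how you end up with a cloud that won’t start. The only sanctioned shortcut is a SLURP upgrade between consecutive YYYY.1 releases, and only when the release notes for your target Kolla-Ansible version confirm it. If you’re behind and in any doubt, do it serially.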
- Back up before you touch anything: full DB dump and /etc/kolla, verified, immediately before the upgrade.
- Upgrade the kolla-ansible package on the deploy host to the target release, bump openstack_release in globals.yml, pull the new images, then run the upgrade playbook, which does the rolling, ordered service upgrade and DB migrations for you:
# edit globals.yml: openstack_release: "2025.2"
kolla-ansible pull -i multinode
kolla-ansible upgrade -i multinode
- Rehearse in staging first. Upgrade a clone of your config and a copy of the DB before you touch production.
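If you are several releases behind, it helps to write the serial path down before you start. This is a small sketch that enumerates the hops, assuming the YYYY.1 / YYYY.2 naming OpenStack has used since 2023.1. next_release and upgrade_path are illustrative names, and the 20-hop cap is only a guard against a mistyped target.

```shell
# Prints the release that follows YYYY.N in the six-month cadence:
# YYYY.1 -> YYYY.2, YYYY.2 -> (YYYY+1).1
next_release() {
  year=${1%.*}
  half=${1#*.}
  if [ "$half" -eq 1 ]; then
    echo "$year.2"
  else
    echo "$(( year + 1 )).1"
  fi
}

# Prints every hop from CURRENT ($1, exclusive) to TARGET ($2, inclusive).
upgrade_path() {
  current=$1
  hops=0
  while [ "$current" != "$2" ] && [ "$hops" -lt 20 ]; do
    current=$(next_release "$current")
    hops=$(( hops + 1 ))
    echo "$current"
  done
}

upgrade_path 2024.1 2025.2
```

Each printed release is one full cycle of backup, pull, upgrade, and verification, so the length of the list is a fair first estimate of the work involved.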
I wrote up the full procedure, including the pre-flight checklist and how to roll back, in Planning OpenStack upgrades safely. Read it before your first production upgrade, not after your first failed one.
Day-2 operations
Standing the cloud up is week one. Operating it is forever. The Day-2 muscles to build:
- Capacity and quotas. Track headroom and set per-project quotas before someone’s runaway Terraform consumes the cluster. See Managing quotas and capacity planning.
- Runbooks. Write down the recovery procedures (Galera recovery, RabbitMQ restart order, VIP failover) before the incident, not during it.
- Patching and config drift. Because the cloud is code, config changes go through globals.yml and kolla-ansible reconfigure. Never edit a container’s config by hand and expect it to survive.
- On-call. Someone owns the pager. A cloud without an owner is an outage waiting for a date.
Common pitfalls
These are the ones I see kill production clouds:
- Skipping HA “for now.” Single controller, single RabbitMQ, two-node Galera. It works in the demo and falls over the first time a node reboots. Build three controllers from day one or accept that you have a lab.
- No backups of the control plane. Teams back up tenant volumes and forget the Galera DB and Fernet keys — the two things that are the cloud. Back up the control plane first.
- Strict mTLS before you’ve audited. Hardening internal TLS on a fresh deploy hides which service can’t reach which endpoint. Get a monitored baseline working, then harden.
- Skip-release upgrades. Jumping multiple OpenStack releases skips DB migrations. One release at a time, always.
- Understaffing. OpenStack is a real distributed system: databases, message queues, storage, networking. It rewards a team that owns it and punishes a side-project. Budget the people, not just the hardware.
- VIP on the wrong interface. keepalived needs a genuinely free, routable IP. Half-deployed clouds usually trace back to this.
AI and prompts for OpenStack operations
OpenStack throws dense, cross-service errors — a Nova scheduling failure that’s really a Placement inventory problem that’s really a Neutron port-binding problem. This is exactly where an LLM earns its keep: paste the request ID’s log trail from nova-api, nova-scheduler, and nova-compute and ask it to correlate the failure across services and propose the next diagnostic command. I lean on Claude for this because it’s strong at reasoning over long, multi-file log context.
I keep a library of OpenStack operator prompts — for reading Placement inventories, decoding Neutron port-binding errors, and drafting diagnostic runbooks — in the general prompts library, and a curated, deployment-focused set in the OpenStack Prompt Pack if you want them packaged and ready to drop into your on-call workflow. For the always-on version, the incident-response dashboard turns an alert plus its log context into a first-draft remediation while you’re still reaching for coffee.
If you’d rather have a second set of expert eyes on the actual build, I do fixed-price OpenStack and Kolla-Ansible audits — control-plane HA, networking, backups, and upgrade-readiness reviewed against this exact checklist. Details are on work with me.
FAQ
Is OpenStack still relevant in 2026? Yes — arguably more than five years ago. With repatriation from public cloud, sovereignty and data-residency requirements, and the cost of AI/GPU workloads at scale, organizations want a private cloud they control. OpenStack is the mature, open-source standard for it, and Kolla-Ansible has made deployment dramatically more approachable. It is not legacy; it’s the default for serious on-prem IaaS.
Kolla-Ansible vs OpenStack-Ansible — which should I use? Both are excellent and production-proven. Kolla-Ansible runs every service in a container and is what I default to: clean upgrades (pull new images), reproducible deploys, easy rollback. OpenStack-Ansible runs services in LXC/bare and gives you more granular host-level control. If you have no strong existing preference, pick Kolla-Ansible — the container model makes the build-as-code and upgrade story simpler.
How many nodes do I really need? For HA, three control nodes is the floor — that’s a hard requirement for Galera and RabbitMQ quorum, not a nice-to-have. Add compute nodes to taste (start with two or three) and, if you run Ceph, three storage nodes for replication. A credible production minimum is around 9 nodes plus a deploy host. You can run a 3-node pilot with collapsed roles, but document that it’s a pilot.
Is Ceph required? No, but it’s strongly recommended for any general-purpose cloud. Ceph gives shared storage across Glance, Cinder, and Nova, which unlocks fast image-based boots, clean live migration, and survival of disk/node failures. LVM is acceptable for small clouds or pilots where a volume pinned to one node and “restore from backup” is an acceptable failure mode. Choose Ceph if you intend to scale or care about uptime.
Can I add HA to an existing single-controller cloud later? You can, but it’s a real migration, not a flag flip — you’re introducing a VIP, growing Galera and RabbitMQ to a cluster, and reconfiguring endpoints. It’s far cheaper to build three controllers up front. If you’re already single-node in production, plan the HA cutover as a dedicated, rehearsed project.
Conclusion
A production-ready OpenStack cloud is the sum of unglamorous decisions made early: three controllers behind a VIP, the whole cloud in globals.yml, TLS on the endpoints, a network model you chose on purpose, Ceph under your volumes, tested backups of the database and Fernet keys, real monitoring, and an upgrade you’ve rehearsed. Get those right and OpenStack is a boringly reliable private cloud you can run for years. Skip them and you have a demo that fails the first time a node reboots.
Build it as code, back up the control plane, never skip a release on upgrade, and staff it like the distributed system it is. If you want the deployment mechanics next, start with Deploying OpenStack with Kolla-Ansible, and browse the rest of the OpenStack library for the per-service deep dives.