Scaling Nova with Cells v2 in OpenStack

Every Nova deployment is a cells v2 deployment — even your tiny dev cloud has exactly one cell. Most operators never think about it because the single-cell default just works up to a few hundred compute nodes. But the moment you’re pushing past that, the Nova database and the RabbitMQ cluster become the bottleneck for the whole cloud, and cells v2 is the answer: shard them.

I’ve planned and operated multi-cell clouds, and the architecture is elegant once it clicks. It’s also the kind of thing you want to understand before you’re forced into it. Here’s the practical version.

What cells v2 actually shards

The key insight: cells v2 splits the parts of Nova that don’t scale (the cell database and the cell message queue) while keeping a single global API.

The API layer has its own database (nova_api) holding global data: flavors, quotas, the instance-to-cell mapping. There’s one of these for the whole cloud.
Each cell has its own nova database and its own RabbitMQ, plus nova-conductor and the compute nodes assigned to it.
cell0 is special — a graveyard database for instances that failed to schedule. It never has compute nodes.

So a request hits the global API, the scheduler picks a host, and the instance lives in the database of that host’s cell and talks over that cell’s queue. Add a cell, add capacity, without touching the existing cells’ databases or queues. That’s the whole point.
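To make the split concrete, here is a toy model of that routing. It is a sketch, not Nova code: the Cell and ApiDatabase classes and the boot and lookup helpers are names I made up for illustration, and a plain dict stands in for each database. The only thing it tries to show is that the API side holds mappings while each cell holds the actual instance records.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Cell:
    """One cell's private state; the dict stands in for that cell's nova database."""
    name: str
    instances: Dict[str, Optional[str]] = field(default_factory=dict)  # instance id -> host


@dataclass
class ApiDatabase:
    """Global state only: which cell owns each host and each instance."""
    host_to_cell: Dict[str, str] = field(default_factory=dict)
    instance_to_cell: Dict[str, str] = field(default_factory=dict)


def boot(api: ApiDatabase, cells: Dict[str, Cell], instance_id: str,
         chosen_host: Optional[str]) -> str:
    """Record a new instance; chosen_host=None models a failed scheduling attempt."""
    cell_name = "cell0" if chosen_host is None else api.host_to_cell[chosen_host]
    cells[cell_name].instances[instance_id] = chosen_host
    api.instance_to_cell[instance_id] = cell_name
    return cell_name


def lookup(api: ApiDatabase, cells: Dict[str, Cell],
           instance_id: str) -> Tuple[str, Optional[str]]:
    """Find an instance the way the API must: mapping first, then one cell's database."""
    cell = cells[api.instance_to_cell[instance_id]]
    return cell.name, cell.instances[instance_id]


cells = {name: Cell(name) for name in ("cell0", "cell1", "cell2")}
api = ApiDatabase(host_to_cell={"compute-a": "cell1", "compute-b": "cell2"})

boot(api, cells, "vm-1", "compute-b")  # scheduled onto a cell2 host
boot(api, cells, "vm-2", None)         # no valid host, so it is recorded in cell0

print(lookup(api, cells, "vm-1"))  # ('cell2', 'compute-b')
print(lookup(api, cells, "vm-2"))  # ('cell0', None)
```

Notice that cell1 is never touched by either boot. That is the property you are buying when you add a cell.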

Inspecting your current topology

Even on a single-cell cloud, look at what exists:

# List cells (you'll see cell0 and cell1 by default)
nova-manage cell_v2 list_cells --verbose

# See which hosts belong to which cell
nova-manage cell_v2 list_hosts

cell1 is your first real cell (the name is a convention from the install guide and most deployment tools, so yours may differ). cell0 holds failed builds. If you’ve never added a cell, every compute host lives in cell1.

Adding a second cell

The flow for a new cell: create its database and queue, register it with the API, then point new computes at it.

# 1. Create the cell, giving it its own DB and transport URL
nova-manage cell_v2 create_cell \
  --name cell2 \
  --database_connection 'mysql+pymysql://nova:PW@db-cell2/nova' \
  --transport-url 'rabbit://nova:PW@mq-cell2:5672/' \
  --verbose

# 2. Bring up nova-compute on hosts configured for cell2's queue/db,
#    then discover them into the cell (the UUID is printed by step 1's
#    --verbose, or shown by list_cells)
nova-manage cell_v2 discover_hosts --cell_uuid <cell2-uuid> --verbose

The critical operational habit: discover_hosts must run whenever you add compute nodes. New computes register in their cell’s database, but the scheduler can’t place instances on them until discovery creates their host mappings in the API database. Forgetting this is the single most common “why won’t the scheduler use my new nodes” ticket. Automate it: either run the command on a timer, or set [scheduler] discover_hosts_in_cells_interval (in seconds; the default of -1 disables it) so nova-scheduler discovers new hosts periodically:

[scheduler]
discover_hosts_in_cells_interval = 300
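If the discovery gap feels abstract, this small sketch shows the bookkeeping involved. It is an illustration under my own assumptions, not how nova-manage is implemented: find_unmapped and discover are hypothetical helpers, cell_hosts stands in for the compute hosts registered in each cell database, and host_mappings stands in for the host-to-cell mappings held in the API database.

```python
from typing import Dict, Set


def find_unmapped(cell_hosts: Dict[str, Set[str]],
                  host_mappings: Dict[str, str]) -> Dict[str, Set[str]]:
    """Return, per cell, the hosts that registered but have no API-level mapping yet."""
    mapped = set(host_mappings)
    gaps = {}
    for cell, hosts in cell_hosts.items():
        missing = set(hosts) - mapped
        if missing:
            gaps[cell] = missing
    return gaps


def discover(cell_hosts: Dict[str, Set[str]],
             host_mappings: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the mappings with every gap filled, like one discovery pass."""
    updated = dict(host_mappings)
    for cell, missing in find_unmapped(cell_hosts, host_mappings).items():
        for host in missing:
            updated[host] = cell
    return updated


# Two computes were just racked (one per cell) and have not been discovered.
cell_hosts = {"cell1": {"compute-a", "compute-c"}, "cell2": {"compute-b"}}
host_mappings = {"compute-a": "cell1"}

print(find_unmapped(cell_hosts, host_mappings))
host_mappings = discover(cell_hosts, host_mappings)
print(find_unmapped(cell_hosts, host_mappings))  # {}
```

Until the second step runs, compute-b and compute-c are invisible to scheduling even though their services are up, which is exactly the symptom in those tickets.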

Why you’d shard: the real triggers

Don’t add cells for fun — each one is more databases and queues to operate. The triggers that actually justify it:

RabbitMQ saturation. Above a few hundred computes, a single RabbitMQ cluster becomes the cloud’s heartbeat and its weakest link. Per-cell queues isolate the blast radius — one cell’s queue storm doesn’t take down the others.
Nova DB contention. The nova database under thousands of instances gets hot. Sharding spreads the load.
Failure isolation. A cell is a blast-radius boundary. Lose a cell’s database and you lose that cell, not the cloud.

That isolation property is, honestly, the reason I’d reach for cells even before raw scale forces it on a large cloud.

The operational gotchas

Running multi-cell surfaces real edges:

Cross-cell operations are limited. Cross-cell resize and cold migration exist (added in Ussuri, and disabled by default in policy) but are newer and fussier than same-cell; live migration cannot cross cells at all. Design placement so instances rarely need to cross cells.
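Listing instances fans out. A global openstack server list queries every cell, and nova-api has to merge the results, so it gets slower as cells are added. Filter by project where you can; the [api] instance_list_per_project_cells option limits a project-scoped listing to the cells where that project actually has instances.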
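cell0 fills with failed builds. Delete errored instances once you’ve triaged them, and periodically archive the soft-deleted rows with nova-manage db archive_deleted_rows. A bloated cell0 makes API queries crawl.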
Up-calls. Computes in a cell generally shouldn’t need to reach the API database. Features that require “up-calls” (some affinity, some quota checks) behave differently across cells — read the release notes for your version.
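
To see why the listing fan-out costs what it does, here is a rough sketch of the pattern: query every cell in parallel, then merge the already-sorted per-cell results. This is a toy model under my own assumptions, not Nova's implementation. The list_servers function, the row dictionaries, and the integer created_at values are all made up for illustration.

```python
import heapq
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional


def list_servers(cells: Dict[str, List[dict]], limit: int,
                 project_id: Optional[str] = None) -> List[dict]:
    """Fan out to every cell, then merge the newest-first results up to limit."""

    def query(rows: List[dict]) -> List[dict]:
        # Stand-in for one cell database query: filter, sort newest first, cap at limit.
        matching = [r for r in rows
                    if project_id is None or r["project_id"] == project_id]
        return sorted(matching, key=lambda r: r["created_at"], reverse=True)[:limit]

    # One query per cell; the call returns only when the slowest cell has answered.
    with ThreadPoolExecutor() as pool:
        per_cell = list(pool.map(query, cells.values()))

    merged = heapq.merge(*per_cell, key=lambda r: r["created_at"], reverse=True)
    return list(itertools.islice(merged, limit))


cells = {
    "cell1": [{"id": "a", "project_id": "p1", "created_at": 3},
              {"id": "b", "project_id": "p2", "created_at": 1}],
    "cell2": [{"id": "c", "project_id": "p1", "created_at": 2}],
}

print([s["id"] for s in list_servers(cells, limit=2)])                   # ['a', 'c']
print([s["id"] for s in list_servers(cells, limit=5, project_id="p2")])  # ['b']
```

Every cell has to produce up to limit rows even though most are thrown away after the merge, and the response waits on the slowest cell. Narrowing the query shrinks the work per cell, and touching fewer cells shrinks the fan-out itself.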

I keep an AI prompt that takes nova-manage cell_v2 list_cells output plus a scheduler log snippet and tells me whether a “no valid host” failure is a discovery gap, a cell mapping issue, or genuine capacity — it separates those three faster than I can grep. A few of these are in our prompt library.

My scaling roadmap

How I think about the journey:

< 200 computes: one cell, don’t overthink it. Tune RabbitMQ and the Nova DB first.
200–500: start planning cell2; get your deployment tooling cell-aware before you need it.
500+: multiple cells, per-cell queues and databases, automated discover_hosts, and cell-as-failure-domain baked into your placement strategy.

Crossing from one cell to two is the hard step because your deployment automation has to learn the topology. Do it while you have headroom, not in a capacity crisis.

Where to go next

Cells v2 is how Nova scales to the thousands of nodes the big public clouds run. The architecture is clean — global API, sharded cell databases and queues — but the operations require discipline: always discover new hosts, archive cell0, and design to keep instances inside their cell. Plan the second cell before scale forces it. For the RabbitMQ tuning and Placement service that underpin a multi-cell cloud, see the OpenStack category.

Multi-cell topology changes are high-impact. Validate cell creation and host discovery in a staging cloud before sharding production Nova.