Managing Quotas and Capacity Planning in OpenStack

Two of the most common OpenStack tickets are opposites of each other: “I have plenty of quota but can’t launch an instance” and “the cluster has free hardware but everyone’s hitting limits.” Both come down to the same thing — most operators set quotas once, never reconcile them against real capacity, and don’t understand the overcommit math the scheduler is actually doing. After years of running OpenStack at capacity, I’ve learned that quotas and capacity planning are the same problem viewed from two ends. Here’s how to keep both honest.

Two different ceilings: quota vs. capacity

There are two independent limits on launching an instance. Quota is a per-project accounting limit that Nova enforces before scheduling (with the limits stored in Nova itself or, if you use unified limits, in Keystone): “this tenant may use 50 vCPUs.” Capacity is whether a host can actually fit the instance, decided by the scheduler using the Placement service’s record of real resource inventory and allocations. You can pass quota and fail capacity (“No valid host was found”), or have capacity and fail quota (“Quota exceeded”). Diagnosing the wrong one wastes hours, so always identify which ceiling you hit first.
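As a first triage step, the two error strings quoted above are enough to separate the ceilings. The following is a minimal sketch, not part of any OpenStack library; the function name classify_ceiling is hypothetical, and it assumes the failure text contains one of the two phrases mentioned in this section.

```python
def classify_ceiling(error_text: str) -> str:
    """Return 'quota', 'capacity', or 'unknown' for a launch failure message.

    Assumption: the message contains one of the two phrases quoted in the
    article. Anything else is reported as 'unknown' rather than guessed.
    """
    text = error_text.lower()
    if "quota exceeded" in text:
        return "quota"
    if "no valid host" in text:
        return "capacity"
    return "unknown"


if __name__ == "__main__":
    print(classify_ceiling("No valid host was found."))
    print(classify_ceiling("Quota exceeded for cores"))
```

Anything that comes back as unknown deserves a look at the raw API or service logs before you touch quotas or ratios.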

Step 1: Audit quotas and find drift

Check a project’s quota and its actual usage together:

openstack quota show <project-id>
openstack quota list --detail --project <project-id>
openstack limits show --absolute --project <project-id>

limits show --absolute is the one that matters: it shows the limit and the current in-use count side by side, so you immediately see whether the project is actually near its ceiling. Quota drift is real: a failed delete or an orphaned resource can leave the counted usage higher than the resources that actually exist, and the project hits “quota exceeded” with phantom usage. To check for counted-but-missing usage, compare the live resources with the count, then preview a reconciliation before changing anything:

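# Confirm real resources vs. counted usage (needs admin credentials):
openstack server list --project <project-id> --all-projects
# If drift is confirmed, preview the Placement reconciliation.
# --dry-run only reports; nothing is changed until you rerun without it.
nova-manage placement heal_allocations --dry-run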

Modern Nova counts usage from live resources rather than a separate counter, which reduces drift, but allocation records in Placement can still go stale — heal_allocations reconciles them.
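The comparison itself is simple arithmetic: sum what the live servers consume and subtract it from what the project is charged for. Here is a minimal sketch under stated assumptions: the Server dataclass, the usage_drift function, and the keys instances, cores, and ram are hypothetical names chosen for illustration, and you would fill them from limits show and server list yourself. Which server states count toward usage depends on your release and configuration, so treat a nonzero result as a prompt to investigate, not as proof.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Resources of one live server, taken from its flavor (hypothetical type)."""
    vcpus: int
    ram_mb: int


def usage_drift(counted: dict, servers: list) -> dict:
    """Counted usage minus usage summed from live servers, per resource.

    A positive value means the project is charged for more than actually
    exists (phantom usage); zero means the two views agree.
    """
    live = {
        "instances": len(servers),
        "cores": sum(s.vcpus for s in servers),
        "ram": sum(s.ram_mb for s in servers),
    }
    return {key: counted.get(key, 0) - live[key] for key in live}


if __name__ == "__main__":
    counted = {"instances": 3, "cores": 10, "ram": 20480}
    servers = [Server(vcpus=4, ram_mb=8192), Server(vcpus=4, ram_mb=8192)]
    print(usage_drift(counted, servers))
```

In this made-up example the project is charged for one instance, 2 cores, and 4096 MB that no live server accounts for.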

Step 2: Understand the overcommit math

When a tenant has quota but gets “No valid host was found,” it’s a Placement/scheduler capacity decision. The trap is overcommit. Nova advertises more CPU and RAM than physically exist, controlled by allocation ratios:

# nova.conf (or per-host / per-resource-provider in Placement)
[DEFAULT]
cpu_allocation_ratio = 4.0
ram_allocation_ratio = 1.5
disk_allocation_ratio = 1.0

A cpu_allocation_ratio of 4.0 means a 32-core host advertises 128 vCPUs. RAM at 1.5 is aggressive — RAM doesn’t compress like CPU time, so over-committing it gets you OOM kills under load. I run CPU hot (4–8x for general workloads), RAM conservative (1.0–1.2 in production), and disk at 1.0 unless I’m on thin-provisioned storage. The point is to choose these consciously, because the defaults rarely match your risk tolerance.
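To make the risk concrete, it helps to compute not just what a host advertises but how much of that is not backed by hardware. A minimal sketch: overcommit_exposure is a hypothetical helper, the 256 GiB host is an invented example, and reserved resources are ignored here for simplicity.

```python
def overcommit_exposure(physical: float, ratio: float) -> tuple:
    """Return (advertised, unbacked) for one resource on one host.

    advertised = physical * ratio
    unbacked   = the part of the advertised amount with no hardware behind it
    """
    advertised = physical * ratio
    unbacked = advertised - physical
    return advertised, unbacked


if __name__ == "__main__":
    # The 32-core host from the text at cpu_allocation_ratio = 4.0
    print(overcommit_exposure(32, 4.0))
    # A hypothetical 256 GiB host at ram_allocation_ratio = 1.5
    print(overcommit_exposure(256, 1.5))
```

For CPU the unbacked share only means contention. For RAM, the 128 GiB of unbacked memory in the second case is what turns into OOM kills if tenants actually touch everything they were promised.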

Step 3: Read real capacity from Placement

Placement is the source of truth for what’s actually allocatable. Query it directly instead of guessing from hypervisor-list (these commands come from the osc-placement client plugin, so install it if they are missing):

openstack resource provider list
openstack resource provider inventory list <provider-uuid>
openstack resource provider usage show <provider-uuid>

inventory list shows total, reserved, and allocation ratio per resource class (VCPU, MEMORY_MB, DISK_GB); usage show shows what’s allocated. Allocatable capacity per class is (total - reserved) × allocation ratio; subtract the usage and what remains is your real headroom after overcommit. This is the number that determines whether the next instance schedules, and it’s the number capacity planning should track over time.
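The headroom calculation can be scripted against the JSON form of those two commands. This is a minimal sketch, assuming you ran them with -f json and that the output uses the keys resource_class, total, reserved, allocation_ratio, and usage; check the key names against your client version. The headroom function and the sample data are hypothetical, and truncating capacity to an integer is an assumption about how Placement rounds.

```python
import json


def headroom(inventories: list, usages: list) -> dict:
    """Allocatable headroom per resource class for one resource provider.

    capacity = (total - reserved) * allocation_ratio, truncated to an int
    headroom = capacity - usage
    """
    used = {u["resource_class"]: u["usage"] for u in usages}
    result = {}
    for inv in inventories:
        rc = inv["resource_class"]
        capacity = int((inv["total"] - inv["reserved"]) * inv["allocation_ratio"])
        result[rc] = capacity - used.get(rc, 0)
    return result


# Invented sample data shaped like the assumed -f json output.
INVENTORY_JSON = """[
  {"resource_class": "VCPU", "total": 32, "reserved": 0, "allocation_ratio": 4.0},
  {"resource_class": "MEMORY_MB", "total": 262144, "reserved": 4096, "allocation_ratio": 1.0}
]"""
USAGE_JSON = """[
  {"resource_class": "VCPU", "usage": 100},
  {"resource_class": "MEMORY_MB", "usage": 200000}
]"""

if __name__ == "__main__":
    print(headroom(json.loads(INVENTORY_JSON), json.loads(USAGE_JSON)))
```

Note that headroom on a single provider still has to fit the flavor in one piece: 28 free vCPUs spread across the number is fine, but the same total split over several hosts may not fit a large flavor on any one of them.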

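The aggregate view from Nova is still useful for a quick read on deployments that expose it (the underlying statistics API is deprecated and is not available at newer compute API microversions):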

openstack hypervisor stats show

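But remember it reports physical totals with no allocation ratio applied, and vcpus_used counts allocated vCPUs, so vcpus_used can legitimately exceed vcpus. A reading like “vcpus_used 200 / vcpus 512” says nothing about schedulable headroom until you apply your ratio and reservations, which is exactly what Placement already does for you.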

Step 4: Diagnose “No valid host was found”

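This generic error means the scheduler ended up with no candidate host. Read its reasoning from the scheduler log (the unit name varies by distribution and deployment tool, and per-host filter detail appears only with debug logging enabled):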

journalctl -u nova-scheduler | grep -i 'filter\|no valid host'

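The scheduler log tells you which filter eliminated the last hosts: ComputeCapabilitiesFilter, AggregateInstanceExtraSpecsFilter (flavor extra-specs not matching any aggregate), NUMATopologyFilter, PciPassthroughFilter, ServerGroupAntiAffinityFilter, and so on. Nine times out of ten it’s one of three things: genuinely out of a resource after overcommit, a flavor with extra-specs (like a specific host aggregate or NUMA topology) that no host satisfies, or anti-affinity that can’t be honored. On current releases the first case usually names no filter at all: the old RamFilter, CoreFilter, and DiskFilter have been removed, Placement does the resource check, and the scheduler logs that it got no allocation candidates from Placement. For the other two, the filter name points straight at it.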
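When the log is long, a small tally of which filter emptied the host list saves scrolling. This is a minimal sketch that assumes the scheduler writes lines containing "Filter <Name> returned 0 hosts" and, for resource exhaustion, a message mentioning "no allocation candidates"; both wordings are assumptions about your Nova version's log text, so adjust the patterns to what your logs actually say. The function name summarize_scheduler_log is hypothetical.

```python
import re
from collections import Counter

# Assumed log wording; verify against your own nova-scheduler output.
ZERO_HOSTS = re.compile(r"Filter (\w+) returned 0 hosts")
NO_CANDIDATES = "no allocation candidates"


def summarize_scheduler_log(lines) -> Counter:
    """Count which filter (or Placement) left the scheduler with no hosts."""
    counts = Counter()
    for line in lines:
        match = ZERO_HOSTS.search(line)
        if match:
            counts[match.group(1)] += 1
        elif NO_CANDIDATES in line.lower():
            counts["placement:no_allocation_candidates"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "INFO nova.filters Filter ComputeCapabilitiesFilter returned 0 hosts",
        "INFO nova.scheduler.manager Got no allocation candidates from the Placement API.",
        "DEBUG nova.filters Filter ImagePropertiesFilter returned 12 host(s)",
    ]
    for name, count in summarize_scheduler_log(sample).most_common():
        print(name, count)
```

To run it on a real log, feed it an open file or sys.stdin, for example by piping the journalctl output into a script that calls summarize_scheduler_log(sys.stdin).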

Step 5: Capacity planning that actually prevents outages

Trend the Placement usage numbers, not the raw hardware. Track allocatable-after-overcommit headroom per resource class per aggregate over time, and set a refill threshold (I order hardware when sustained headroom drops below 20%). Watch for the imbalance case too: one aggregate full while another sits idle because flavors or availability zones pin workloads. Rebalancing flavors/aggregates often buys more runway than buying hardware.
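The refill rule is easy to encode once you have a time series of capacity and usage per aggregate. A minimal sketch, assuming you already collect those samples somewhere: headroom_fraction and sustained_below are hypothetical helpers, the 20% threshold is the one mentioned above, and the three-sample window used to define "sustained" is an arbitrary choice for illustration.

```python
def headroom_fraction(capacity: float, used: float) -> float:
    """Fraction of allocatable (post-overcommit) capacity still free."""
    if capacity <= 0:
        return 0.0
    return (capacity - used) / capacity


def sustained_below(fractions: list, threshold: float = 0.20, window: int = 3) -> bool:
    """True if the last `window` samples are all below the threshold.

    Requiring several consecutive samples avoids ordering hardware because
    of one short-lived spike.
    """
    recent = fractions[-window:]
    return len(recent) == window and all(f < threshold for f in recent)


if __name__ == "__main__":
    # Invented weekly VCPU samples for one aggregate: (capacity, used)
    samples = [(1024, 700), (1024, 830), (1024, 850), (1024, 900)]
    history = [headroom_fraction(c, u) for c, u in samples]
    print(history)
    print(sustained_below(history))
```

Running the same check per aggregate also surfaces the imbalance case: one aggregate trips the threshold while another still reports most of its capacity free.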

Where AI helps

Capacity questions involve reconciling several numbers that all mean slightly different things — quota limits, counted usage, real resources, overcommitted totals, and Placement inventory. I’ll paste limits show, the Placement inventory/usage, and the scheduler filter log into a model and ask it to answer one question: is this a quota ceiling, a real-capacity ceiling, an overcommit/flavor mismatch, or drift? Getting that classification right in ten seconds instead of an hour is the whole game.

Keep a saved capacity triage prompt and read our other OpenStack guides — the Nova troubleshooting one pairs naturally with this. The model reconciles the numbers and reasons about which ceiling you hit; you run every command and you make the overcommit and hardware decisions yourself.

Generated commands and configs are assistive, not authoritative. Always verify against your own cluster before changing allocation ratios or quotas in production.